200th post

8 August, 2017

Yeaaah, 100 more badly written articles and barely working guides! Well, it seems I enjoyed making them, and somehow something likes to read them. Some statistics to show off!

Like last time, users/bots seem to drop off during the weekend; compared to last time, the effect is even easier to see. I assume part of the traffic comes from people searching for help during work hours. Long live weekends.

Overall traffic has been increasing. I’m not really sure why there was a drop last month, nor do I really care. Obviously the last data point is still low, since we are only 8 days into the month.

I was not aware of it, but clearly Thursday is my blog day. Last time, Friday (2nd) and Wednesday (3rd) were close follow-ups. Tuesday, which came in second to last, is now my 4th most active day, and Saturday keeps being last in the row.

On to the next 100 articles 🙂

I have wanted to redo/rework the Passbolt install on CentOS for a while. It seems like a horribly long and complex process, but in fact it’s not. With the recently released Passbolt 1.6 and my wish to play with asciinema for a while, I thought: why not combine both 🙂 Considering this is my first attempt, don’t shoot me! If you prefer a readable version, feel free to use my text guide version.


It’s not the first time I have received this error. It is the error nginx gives (HTTP 413, Request Entity Too Large) when the file (or large input in general) you tried to upload is too large and the server is declining it. Here is how you can fix it:

Open your server-specific config file for nginx (the one containing the relevant server {} block):

nano /etc/nginx/conf.d/svennd.conf

and add the following (10M is ~10 MB):

client_max_body_size        10M;

Note that PHP also has an upload limit (see the docs).
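
For reference, the matching php.ini directives look roughly like this (values are illustrative, not taken from this post):

upload_max_filesize = 10M   ; maximum size of a single uploaded file
post_max_size = 12M         ; limit on the whole POST body, keep it >= upload_max_filesize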

Reload Nginx to activate:

service nginx reload

or 

systemctl reload nginx

Note that client_max_body_size can be set in the http, server, and location contexts; I prefer server as it is more specific.
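
For reference, a minimal sketch of how the directive nests in the different contexts (values made up):

http {
    client_max_body_size 1M;            # default for every server below
    server {
        client_max_body_size 10M;       # overrides http for this server
        location /upload {
            client_max_body_size 100M;  # overrides server for this location
        }
    }
}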

I have some NFS servers (read: their main function is to store data and share it over NFS) to maintain, and mostly they simply work (like everything in Linux). Most variables have been battle-tested for … well, forever, so rarely do you need to check on the best trade-off. Well, enter the NFS thread count.
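
For a quick look at the current situation, a sketch for the kernel NFS server (the config location is the CentOS one):

# current number of nfsd threads
cat /proc/fs/nfsd/threads

# the "th" line gives thread usage statistics
grep ^th /proc/net/rpc/nfsd

# on CentOS, the thread count is set via RPCNFSDCOUNT in /etc/sysconfig/nfs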


I recently learned about tcp_sack. While I certainly don’t understand every detail of this feature, it’s clear it should be a huge help in cases where packets (in the TCP protocol) are dropped and latency is relatively high. From my basic understanding: every TCP packet has a sequence number, and when tcp_sack is enabled on both client and server, the client can tell the server exactly which ranges it received, and hence which range has been dropped. When tcp_sack is not enabled, the client will only acknowledge the “last” sequentially received packet, and everything after that packet has to be resent.

E.g.: packets 1-10 are received, packets 11-14 are lost, packets 15-35 are received. With tcp_sack, the client reports that it received up to packet 10 as well as packets 15-35, so the server only has to resend packets 11-14. Without tcp_sack, the client can only report that it received up to packet 10, so the server has to resend packets 11-35.

On all the distros I could get my hands on (CentOS, Debian, Ubuntu, …), it was on by default! The question, however, is how many packets are commonly dropped, and does the communication even have “high” latency? At what cost does tcp_sack come? I found little data about resource consumption by this feature, but since it’s on by default I assume it’s trivial. I did, however, find this article that claimed that for a ~4 MB file over an emulated connection with 1% packet loss, tcp_sack actually made the transfer slower (above 2 minutes with it versus below 2 minutes without). That seems to defeat the purpose of tcp_sack altogether. I am not that interested in these situations though: my environment is local servers talking to each other, and I don’t care much whether they go faster or slower under packet loss, as it is a red flag if latency or packet loss happens at all.
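
You can check and toggle the feature through sysctl; the runtime change below does not persist across reboots:

# check whether selective acknowledgements are enabled (1 = on)
sysctl net.ipv4.tcp_sack

# disable at runtime; add net.ipv4.tcp_sack = 0 to /etc/sysctl.conf to persist
sysctl -w net.ipv4.tcp_sack=0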

I copied over a random payload to check if the parameter has any influence on the time spent transferring.
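
A rough sketch of that kind of test (payload.bin and the remote host are placeholders, not the actual setup):

# time the same copy with tcp_sack on and off
sysctl -w net.ipv4.tcp_sack=1
time scp payload.bin user@remote:/tmp/

sysctl -w net.ipv4.tcp_sack=0
time scp payload.bin user@remote:/tmp/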


Woohoo, Proxmox VE 5.0 has been released! This version is based on Debian 9 (Linux kernel 4.10). It includes a lot of new features, but sadly apt is still pointing to the enterprise repository for updates. This results in an ugly error message when trying apt-get update, such as:

W: The repository 'https://enterprise.proxmox.com/debian/pve stretch Release' does not have a Release file.
N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
N: See apt-secure(8) manpage for repository creation and user configuration details.
E: Failed to fetch https://enterprise.proxmox.com/debian/pve/dists/stretch/pve-enterprise/binary-amd64/Packages  401  Unauthorized
E: Some index files failed to download. They have been ignored, or old ones used instead.

To fix this, it’s similar to v3.x and v4.x. We need to add one repository: nano /etc/apt/sources.list

add:

deb http://download.proxmox.com/debian stretch pve-no-subscription

Then disable or remove the enterprise repository:

rm -f /etc/apt/sources.list.d/pve-enterprise.list

(to disable it instead, add a # in front of the first line starting with deb)
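
The whole fix as a non-interactive sketch (the same steps as above, just scripted):

# add the no-subscription repo and drop the enterprise one
echo "deb http://download.proxmox.com/debian stretch pve-no-subscription" >> /etc/apt/sources.list
rm -f /etc/apt/sources.list.d/pve-enterprise.list
apt-get update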

Now you can run those updates 🙂 Take note: only apt-get update and apt-get dist-upgrade are supported by Proxmox!

Now, I dislike the way Proxmox pushes people to take a subscription, and their pricing method, but it is an amazing piece of free software! Well done, Proxmox devs!

After reading cron.weekly a few weeks ago, I was intrigued by binsnitch.py, a tool that creates a baseline file with the md5/sha256/… hash of every file you wish to monitor. In case you think you have a virus, malware, or a cryptovirus, you can easily verify which files have been changed. This is kinda fun; the sad part is that it uses Python, and requires Python >= 3, which restricts its use on CentOS (Python 2 is the default). I dislike an unneeded dependency like that on my servers, so I wrote a quick and dirty alternative to it. The only requirements are bash and md5sum (or, if you wish, some other checksum tool such as sha256sum), which I believe are common on every Linux server.

You can download & adapt it here.
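
The core idea fits in two commands; a minimal sketch (not the actual script, and the monitored paths are just examples):

# build a baseline of hashes for the files you want to monitor
md5sum /usr/bin/* /usr/sbin/* > baseline.md5

# later on: list every file whose hash no longer matches the baseline
md5sum -c baseline.md5 2>/dev/null | grep -v ': OK$'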

I have not found an easy explanation of the error states in Rocks / Grid Engine, so I am collecting them here as I find them.

Show states

First, let’s get an overview of the nodes; this can be done using qstat -f:

qstat -f

The result should be something like:

# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
----------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/24         0.00     linux-x64 
----------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/0/24         0.00     linux-x64
----------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/0/24         0.02     linux-x64
----------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/0/24         0.00     linux-x64 
----------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/0/24         0.00     linux-x64
----------------------------------------------------------------
all.q@compute-0-5.local        BIP   0/0/24         0.00     linux-x64    
----------------------------------------------------------------
Error state: E

E stands for (hard) error, which means something bad is going on. As a result, the headnode decides not to use this node anymore until manual intervention, to make sure a job sinkhole is not created (a broken node silently failing every job scheduled on it).

Example:

qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/24         0.01     linux-x64     E
---------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/0/24         0.02     linux-x64
---------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/0/24         0.00     linux-x64     E
---------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/0/24         0.00     linux-x64
---------------------------------------------------------------
all.q@compute-0-5.local        BIP   0/0/24         0.00     linux-x64     E
---------------------------------------------------------------

The error states here were probably due to a full root disk on these nodes. To find out which jobs failed, and hence what was happening at the time, qstat can explain the error state (-explain E):

qstat -f -explain E
queuename                      qtype resv/used/tot. load_avg arch          states
-------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/24         0.00     linux-x64     E
        queue all.q marked QERROR as result of job 542032's failure at host compute-0-0.local
        queue all.q marked QERROR as result of job 542033's failure at host compute-0-0.local
        queue all.q marked QERROR as result of job 542036's failure at host compute-0-0.local

What is more important is the fact that the error state will survive a reboot, so we should clean it up once the underlying issue has been resolved (this will clear all errors):

qmod -c '*'
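
If only one node was affected, you can pass its queue instance instead of the wildcard (hostnames as in the examples above):

qmod -c all.q@compute-0-0.local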

source

Disabled state: d

d means the node has been disabled; this normally should not happen automatically. We can disable a node so it does not get any more jobs, while the jobs already running on it continue to run.

You can disable a node for further jobs using the qmod command:

qmod -d all.q@compute-0-5.local

You can re-enable a node again using:

qmod -e all.q@compute-0-5.local

Example:

[root@server ~]# qmod -d all.q@compute-0-5.local
root@server.local changed state of "all.q@compute-0-5.local" (disabled)
[root@server ~]# qmod -e all.q@compute-0-5.local
root@server.local changed state of "all.q@compute-0-5.local" (enabled)


au: Alarm, Unreachable

In the state au, u means unreachable; this happens when sge_execd on the node does not respond to sge_qmaster on the headnode within a configured timeout window. The a state is alarm; this happens when the node does not report its load, in which case a load of 99.99 is assumed, and the scheduler will not assign more work to the node. The au state can happen when an NFS server is being hammered and the complete node is waiting for the “slow” network disk (with a hard-mounted NFS share). This state can resolve itself if the underlying problem gets resolved.

source

[root@server ~]# qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@compute-0-0.local        BIP   0/0/24         0.00     linux-x64     d
---------------------------------------------------------------------------------
all.q@compute-0-1.local        BIP   0/3/24         3.25     linux-x64     d
---------------------------------------------------------------------------------
all.q@compute-0-2.local        BIP   0/6/24         6.32     linux-x64     d
---------------------------------------------------------------------------------
all.q@compute-0-3.local        BIP   0/0/24         0.00     linux-x64     dE
---------------------------------------------------------------------------------
all.q@compute-0-4.local        BIP   0/5/24         5.28     linux-x64     d
---------------------------------------------------------------------------------
all.q@compute-0-5.local        BIP   0/0/24         0.00     linux-x64     dE
---------------------------------------------------------------------------------
all.q@compute-0-7.local        BIP   0/0/24         -NA-     linux-x64     adu
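
To check from the headnode whether sge_execd on a node still responds, qping can query it directly. A sketch: 6445 is the common sge_execd port, adjust to your installation.

qping -info compute-0-7.local 6445 execd 1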


In the previous article on Bareos, we set up a quick and dirty backup job to run every night. This was pretty easy, but it has some flaws. (1) The first flaw: after a full backup, only incremental backups are created, forever. This makes it difficult to get a restore going down the line, as a restore needs the initial full backup plus every incremental made since. (2) A second flaw: we ran backups but never checked if we can restore them. We need to take into account Schrodinger’s backup: “The condition of any backup is unknown until a restore is attempted”. Perhaps a bit out of this scope, but (3) we did not look at where the backups are stored, or how they are being stored. There are plenty of options in Bareos, so let’s take a look and fine-tune the backup setup.
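
For flaw (1), the usual approach is a schedule that mixes backup levels; roughly like this in the director configuration (names and times are made up for illustration):

Schedule {
  Name = "WeeklyCycle"
  Run = Full 1st sun at 23:05             # a fresh full backup every first Sunday
  Run = Differential 2nd-5th sun at 23:05 # differentials the other Sundays
  Run = Incremental mon-sat at 23:05      # incrementals during the week
}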


I hit upon this error:

Cannot open your terminal '/dev/pts/0' - please check.

after being logged in as root and using su to switch to a user profile:

su user

and then trying to resume a screen session:

# screen -r copy.pid
Cannot open your terminal '/dev/pts/3' - please check.

The pseudo-terminal is still owned by root after the su, so screen running as the new user cannot open it; script spawns a fresh pseudo-terminal owned by the current user. The issue is resolved using:

script /dev/null

I previously had similar issues, but that was using lxc-attach; see this previous post.