Understanding ZFS : Checksum

6 February, 2019

Ever wondered what kind of checksum ZFS uses to check for bit rot ? Probably not, but it turns out you can change the algorithm it uses. However, like most settings, the defaults were chosen by smart people, so changing it might not be doing you any favors.

As it turns out, the default checksum is Fletcher’s checksum (fletcher4 in ZFS terms). This algorithm is comparable to CRC for error detection, but is reported to be nearly 20 times faster per byte (source). So it looks as if this is really speedy.

This algorithm can actually use some of the more modern Intel CPU optimizations (source). To check if your CPU has these optimizations, check your CPU flags; on most distros this can be done using :

cat /proc/cpuinfo | grep flags

or 

lscpu | grep -i flags

Flags of interest are sse2, ssse3, avx2 and avx512f. I’m well outside my scope of knowledge here, but it seems the gain from these optimizations varies between CPUs, so it’s not always certain that the newest one is the best. That is why ZFS runs a mini-benchmark just after its kernel module is loaded to determine which implementation is fastest. Even nicer, the results are stored in /proc, so you can check what your CPU’s scores are in : /proc/spl/kstat/zfs/fletcher_4_bench

On a recent server (Intel Xeon Silver 4110), avx512f is picked for native and avx2 for byteswap :

cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 1054412431662536 2941001241716525
implementation   native         byteswap
scalar           3455422459     2778248102
superscalar      4626459244     3440503186
superscalar4     4008342143     3352064854
sse2             7888619803     4445292423
ssse3            7891663628     7030719031
avx2             12054042156    10904840790
avx512f          19645275791    6985259129
fastest          avx512f        avx2

On an ancient test machine (Intel Pentium Dual CPU E2220), however, superscalar is picked over “newer” flags such as sse2 :

cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 3970297751 15299241313174195
implementation   native         byteswap
scalar           3690080571     2332985088
superscalar      4357932898     2616994380
superscalar4     3849236054     2487645215
sse2             2809499310     2262775111
ssse3            2809413022     2308094121
fastest          superscalar    superscalar

Pretty nice work, all done behind the scenes in a transparent way.
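
If you only care about the winner, a small sketch like this pulls the “fastest” row out of that file. The function name fletcher4_fastest is my own invention, and it assumes the layout shown above (a row starting with “fastest”, followed by the native and byteswap picks) :

```shell
# Print which fletcher4 implementations were picked.
# $1: path to a fletcher_4_bench style file (defaults to the kstat path above).
fletcher4_fastest() {
    awk '$1 == "fastest" { print "native=" $2 " byteswap=" $3 }' \
        "${1:-/proc/spl/kstat/zfs/fletcher_4_bench}"
}
```

With the Xeon output above as input, this prints native=avx512f byteswap=avx2.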

Back to the stuff we can actually play with : changing the algorithm. It turns out that one can change this to SHA256, which is what deduplication uses by default, and that’s about it. According to the wiki, the other alternatives (fletcher2, SHA512, skein and edon-r) are either deprecated or not implemented in zfsonlinux; releases from 0.7 onwards do ship sha512, skein and edonr, so check the zfs man page for your version. You can also disable the checksum for a certain dataset or for an entire pool, but then the question is why even choose ZFS.

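Changing the checksum can be done like most values (as per docs). Note that the new setting only applies to data written after the change; existing blocks keep the checksum they were written with :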

zfs set checksum=sha256 pool_name/dataset_name
zfs set checksum=fletcher4 pool_name/dataset_name

So while it’s really interesting to know what is going on behind the scenes, I doubt many people should play with this unless they know what they are doing. In which case this article is not aimed at you 😉

As always, relevant information, and basically the source of this post can be found on the Github wiki of zfsonlinux, here.

I installed PHP 7.1 through webtatic, with Apache instead of Nginx, so without the -fpm package. Installing the PHP stats package is easy if you know how, so follow along.

First we need to provide our system with a compiler and the pear package, which contains pecl (yum whatprovides pecl) :

yum install gcc php71w-pear

After that you can run pecl to install it :

pecl install stats

This will however fail and tell you that you need a PHP 5.* version … however, if you specify the latest version explicitly it will install just fine :

pecl install stats-2.0.3

And bam, installed. Now we need to activate it in the PHP config (a new file, or else /etc/php.ini) :

nano /etc/php.d/stats.ini

add :

extension=stats.so

After that, just restart apache and you are good to go :

systemctl restart httpd
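
To double check that the extension is really picked up, you can look for it in the module list that php -m prints. A minimal sketch; the helper name has_php_module is made up for this example and it simply reads a module list on stdin :

```shell
# Succeeds if module $1 appears (case-insensitive, whole line) in a
# `php -m` style list read from stdin.
has_php_module() {
    grep -qixF -- "$1"
}

# Usage, on a box with the PHP CLI installed:
#   php -m | has_php_module stats && echo "stats is loaded"
```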

 

 

I installed a new kernel (kernel-ml, mainline 4.18). Since the machine is headless and I won’t see grub when it boots, this is what I had to do to make sure the new kernel gets selected on the next boot :

Install kernel mainline :

yum install kernel-ml

Find the currently installed kernels :

cat /boot/grub2/grub.cfg | grep ^menuentry | cut -c1-50 | nl -v 0

for me :

0  menuentry 'CentOS Linux (4.18.16-1.el7.elrepo.x86_
1  menuentry 'CentOS Linux (3.10.0-693.5.2.el7.x86_64
2  menuentry 'CentOS Linux (0-rescue-daa04fd470a24a8f
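
The cut -c1-50 above chops the titles off. If you prefer the full entry names next to their index, an awk variant could look like this. It is only a sketch : list_menuentries is a hypothetical helper name, and it assumes the titles are single-quoted as in the CentOS 7 grub.cfg above.

```shell
# Print "index  title" for every top-level menuentry in a grub.cfg.
# $1: path to grub.cfg (defaults to the location used above).
list_menuentries() {
    awk -F"'" '/^menuentry/ { printf "%d  %s\n", n++, $2 }' \
        "${1:-/boot/grub2/grub.cfg}"
}
```

The number in the first column is the value to use for GRUB_DEFAULT.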

So I want the first kernel (counting from 0) as default. This can be set in /etc/default/grub :

GRUB_DEFAULT=0

Then regenerate the grub config :

grub2-mkconfig -o /boot/grub2/grub.cfg

Reboot and check that the new kernel is running :

# uname -a
Linux svennd.local 4.18.16-1.el7.elrepo.x86_64 #1 SMP Sat Oct 20 12:52:50 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux

 

 

Ouch, I got pretty much the same error as ma.ttias.be did two years ago. I have no idea why it failed, but a solution similar to what he suggested seemed to work.

[Mon Dec 03 15:07:55.871866 2018] [auth_digest:notice] [pid 3352] AH01757: generating secret for digest authentication ...
[Mon Dec 03 15:07:55.871892 2018] [auth_digest:error] [pid 3352] (2)No such file or directory: AH01762: Failed to create shared memory segment on file /run/httpd/authdigest_shm.3352
[Mon Dec 03 15:07:55.871897 2018] [auth_digest:error] [pid 3352] (2)No such file or directory: AH01760: failed to initialize shm - all nonce-count checking, one-time nonces, and MD5-sess algorithm disabled
[Mon Dec 03 15:07:55.871901 2018] [:emerg] [pid 3352] AH00020: Configuration Failed, exiting

The directory /run/httpd is gone. So recreate it and give it the correct permissions. In my case (on CentOS 7.x) the group was apache instead of httpd :

mkdir /run/httpd
chown root:apache /run/httpd
chmod 0710 /run/httpd

Another problem solved; sadly the mystery of why the directory vanished is not.

Ordinarily I use “static” references in /etc/fstab to mount NFS shares on a server. Doing this on a Rocks Cluster however is a bit “hacky”. Adapting every node can be automated using rocks run host, but there is an alternative way (not sure if it’s better 😉) : Rocks uses Autofs to load NFS mounts when they are required. So I felt brave and wanted to learn something new. This is the documentation of that journey.


On a CentOS 7.5 base install, no NFS utilities are installed by default. So when I tried to mount an NFS share I got this error :

mount: wrong fs type, bad option, bad superblock on svennd.be:/data,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount.<type> helper program)

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

This tells us that mount is missing its helper program for NFS (/sbin/mount.nfs). On CentOS this can be fixed by installing nfs-utils :

yum install nfs-utils

For Debian variants (Mint, Ubuntu, …) this would be :

apt install nfs-common
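
If you want a script to catch this before mount complains, a check along these lines might do. It is a sketch only : have_mount_helper is a made-up name, and it assumes helpers live in /sbin as the error message suggests.

```shell
# Succeeds if a mount helper for filesystem type $1 exists and is executable.
# $2: directory to look in (defaults to /sbin, as named in the error above).
have_mount_helper() {
    [ -x "${2:-/sbin}/mount.$1" ]
}

# Usage:
#   have_mount_helper nfs || echo "install nfs-utils (or nfs-common) first"
```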

After that, I see happy shares 🙂

mount.nfs: timeout set for Wed Nov 14 12:57:50 2018
mount.nfs: trying text-based options 'soft,intr,retrans=2,rsize=32768,wsize=32768,nfsvers=3,tcp,addr=svennd.be'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying svennd.be prog 100003 vers 3 prot TCP port 2049
mount.nfs: prog 100005, trying vers=3, prot=6
mount.nfs: trying svennd.be prog 100005 vers 3 prot TCP port 20048
/data    : successfully mounted

 

Rename a volume group on LVM

8 November, 2018

While recovering a system I found myself in the annoying pickle of having two disks with a similar LVM layout and, more precisely, with an identically named volume group. This can cause issues; in my case it just stopped me from mounting the original device. To make matters worse, the volumes were nearly identical in size. Not really being an LVM user, it was time for Google to save me.

Activating both identically named volume groups failed :

# vgchange -ay
  device-mapper: create ioctl on pve-swapLVM-Do7QVk4UeQXTTqVtB5b0pISJQy1YL1YJ4RB7lZXsAvYVGcYs9e7TZcFuHfkUBZJz failed: Device or resource busy
  device-mapper: create ioctl on pve-rootLVM-Do7QVk4UeQXTTqVtB5b0pISJQy1YL1YJgnTssDJJQj59LISgspFSLEH2pzqQLYw9 failed: Device or resource busy
  1 logical volume(s) in volume group "pve" now active
  device-mapper: create ioctl on pve-dataLVM-46zXzJmc3xbdOtJdsVXrSlb5siaAacfRmhJ3Ac2ET7201zotB4DEX2kVu9gSafDC-tpool failed: Device or resource busy
  2 logical volume(s) in volume group "pve" now active

So the first thing I found was lvmdiskscan, which searches for disks that are formatted with LVM. Not surprisingly I found two disks (the new and the old one) and a third data disk (RAID).

# lvmdiskscan
  /dev/sda2 [     256.00 MiB]
  /dev/sda3 [     465.51 GiB] LVM physical volume
  /dev/sdb2 [     510.00 MiB]
  /dev/sdb3 [     465.15 GiB] LVM physical volume
  /dev/sdc  [      29.10 TiB]
  1 disk
  2 partitions
  0 LVM physical volume whole disks
  2 LVM physical volumes

note : partition /dev/sda3 and /dev/sdb3 are dangerously similar !

Now we know that the partitions are very similar, but perhaps I would be lucky and the logical volumes were different. Let’s check with lvscan :

# lvscan
  inactive          '/dev/pve/swap' [58.12 GiB] inherit
  inactive          '/dev/pve/root' [96.00 GiB] inherit
  ACTIVE            '/dev/pve/data' [295.03 GiB] inherit
  ACTIVE            '/dev/pve/swap' [8.00 GiB] inherit
  ACTIVE            '/dev/pve/root' [96.00 GiB] inherit
  inactive          '/dev/pve/data' [338.60 GiB] inherit

note : there is a slight difference in the sizes of the volumes, which is good information. However the names are equal, and that is what causes the issue.

If we can change the volume group name (pve) of one of them, this should be enough to mount it. However I don’t want to rename the active one, as that might crash my running system (?). So let’s check using vgdisplay :

# vgdisplay
  --- Volume group ---
  VG Name               pve
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  64
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               465.15 GiB
  PE Size               4.00 MiB
  Total PE              119078
  Alloc PE / Size       114983 / 449.15 GiB
  Free  PE / Size       4095 / 16.00 GiB
  VG UUID               Do7QVk-4UeQ-XTTq-VtB5-b0pI-SJQy-1YL1YJ

  --- Volume group ---
  VG Name               pve
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  7
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               465.51 GiB
  PE Size               4.00 MiB
  Total PE              119170
  Alloc PE / Size       115075 / 449.51 GiB
  Free  PE / Size       4095 / 16.00 GiB
  VG UUID               46zXzJ-mc3x-bdOt-JdsV-XrSl-b5si-aAacfR

There is a slight difference in size (449.15 GiB vs 449.51 GiB allocated, 465.15 GiB vs 465.51 GiB in total), and I know the disk currently in use is my new system on /dev/sda (check using df -h). So I need /dev/sdb3, the one of 465.15 GiB. This is the first group in the list, the one with no open logical volumes (Open LV 0) and UUID Do7QVk-4UeQ-XTTq-VtB5-b0pI-SJQy-1YL1YJ.
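
With two groups called pve, it helps to see name, UUID and size side by side. This sketch boils the vgdisplay output down to one line per group; vg_summary is a name I made up, and it assumes the field labels shown above :

```shell
# Read `vgdisplay` output on stdin and print "name uuid size" per volume group.
vg_summary() {
    awk '
        /VG Name/ { name = $3 }
        /VG Size/ { size = $3 " " $4 }
        /VG UUID/ { print name, $3, size }
    '
}

# Usage (as root):
#   vgdisplay | vg_summary
```

The UUID in the second column is the one to hand to vgrename, since the name alone is ambiguous here.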

Finally we know what we want to rename, it seems almost too simple using vgrename :

# vgrename Do7QVk-4UeQ-XTTq-VtB5-b0pI-SJQy-1YL1YJ pveb
  Processing VG pve because of matching UUID Do7QVk-4UeQ-XTTq-VtB5-b0pI-SJQy-1YL1YJ
  Volume group "Do7QVk-4UeQ-XTTq-VtB5-b0pI-SJQy-1YL1YJ" successfully renamed to "pveb"

After this we can activate the volume group using vgchange :

vgchange -ay

And voila, one can mount the old device :

mkdir -p /data/oldroot
mkdir -p /data/data

mount /dev/pveb/root /data/oldroot
mount /dev/pveb/data /data/data

Tadaaa ! The data is available once more 🙂

We bought our first 10GBase-T switch. Pretty nice, but setting it up proved a bit more complex than our SuperMicro switches. Here I documented my search to get servers to talk to each other over a VLAN, with the switch assigning IP addresses using DHCP.

Login & Connect

Logging in proved rather easy and was similar to the SuperMicro’s : connect “through the back” using a cable that looks like RJ45 but is actually a serial to USB cable. (I don’t think this was included, but there are alternatives.) Once connected, I could find the “COM” port Windows assigned by checking Bluetooth & other devices :

[screenshot: Windows screen of Bluetooth & other devices]

One could also open the device manager and find it under COM ports. I used putty as my VT100 terminal emulator; on any Linux distro there are plenty of options, but the easiest would be screen. The settings for this switch are : 115200 baud, 8 data bits, 1 stop bit, no parity, no flow control.

Connect, press enter twice and voila, you can see the CLI. The default login is “admin” with no password.

To get the assigned IP address :

# set ip
ip management vlan 1 8.8.8.8

# get ip
show ip management

# or 
show ip vlan

After you get the IP address (or guess it), you can move to the simpler web interface for configuration. In my case the IP was 169.254.100.100, where I was greeted with a login page.

Make VLAN

First I needed to make a VLAN : Switching -> VLAN -> Basic -> VLAN Configuration. Give it a VLAN ID and a name (disable static). If everything works it should look like :

[screenshot: VLAN configuration on the M4300]

After this, you need to define the VLAN membership, which can be done in Switching -> VLAN -> Advanced -> VLAN Membership. Go to the VLAN ID you just made (in my case 400) and under group operation select tag all, then untag all, so that all ports show as untagged (U).

[screenshot: untagged VLAN 400 members]

From my basic understanding of VLANs and tagged packets, I figured that a port can only be an untagged member of one VLAN. I’m probably wrong, but this “worked”. So I assigned all the ports that I intend to use for data to this new VLAN under Switching -> VLAN -> Advanced -> Port PVID Configuration. I changed VLAN Member for all ports except the management port I’m using (1/0/1). Do this by entering the VLAN ID into VLAN Member and accepting the changes. It should look like :

[screenshot: port VLAN PVID configuration]

Now for some reason, ports are still not able to communicate. Time to add a routing interface.

Routing Interface VLAN

This is totally new to me, so bear with me : Routing -> VLAN -> VLAN Routing Configuration. We need to add a routing configuration : from the dropdown select our PVID (400) and create an IP address for the routing interface (ending in 1). I decided on 10.0.0.1 for the IP and 255.255.255.0 for the subnet mask, so the network covers 10.0.0.[0-255], of which 10.0.0.1 to 10.0.0.254 are usable for hosts (.0 is the network address and .255 the broadcast address).

It’s important to remember this IP address, as we will re-use it once we set up the DHCP.
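
For other IP and mask combinations, the network, usable range and broadcast address can be worked out with a bit of shell arithmetic. This is a rough sketch and has nothing to do with the switch itself; the function names are my own and it does no input validation :

```shell
# Convert a dotted quad (e.g. 10.0.0.1) to a 32-bit integer.
ip_to_int() {
    local IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Convert a 32-bit integer back to a dotted quad.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# $1: an IP inside the subnet, $2: the subnet mask.
subnet_range() {
    local ip mask net bcast
    ip=$(ip_to_int "$1")
    mask=$(ip_to_int "$2")
    net=$(( ip & mask ))                       # all host bits cleared
    bcast=$(( net | (~mask & 4294967295) ))    # all host bits set
    echo "network=$(int_to_ip "$net") first=$(int_to_ip $((net + 1))) last=$(int_to_ip $((bcast - 1))) broadcast=$(int_to_ip "$bcast")"
}
```

For the values used here, subnet_range 10.0.0.1 255.255.255.0 prints network=10.0.0.0 first=10.0.0.1 last=10.0.0.254 broadcast=10.0.0.255.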

Setup DHCP service

Seamless next step, would you not say ? We need to set up the DHCP, which can be found under : System -> Services -> DHCP Server -> DHCP Server Configuration.

We need to set the Admin Mode to enabled and exclude the address of our routing interface, so that it is not handed out by the DHCP.

After this we are ready to create the DHCP pool, do this under System -> Services -> DHCP Server -> DHCP Pool Configuration.

Pool Name : (random), Type of Binding : Dynamic. Network Address should be the network of the routing interface we picked and excluded in the previous step, but ending in 0 instead of 1. In our case Network Address is 10.0.0.0 and the netmask is 255.255.255.0, or length 24. I set the lease time to 7 days.

After this the only thing remaining is testing & saving the configuration. If you are happy with how it works save your configuration.  Maintenance -> Save Configuration

 

 

While updating our network on the Rocks Cluster, the nodes had to be reinstalled (this is the default procedure). This time however the nodes got stuck during the PXE install (automatic installation over the network) at the language selection.

[screenshot: install language Centos]

The nodes showed a screen like this (source).

This is annoying, as it would mean connecting a screen and a keyboard to every node to install it. It is however an indication that something is wrong altogether; finding out what proved a little bit tricky, which is why I share it here.

To find if you face the same issue, execute :

sudo -u apache /opt/rocks/bin/rocks list host profile compute-0-0

This will show the configuration the node pulls from the head-node. In my case it looked like :

Traceback (most recent call last):
  File "/opt/rocks/bin/rocks", line 301, in <module>
    command.runWrapper(name, args[i:])
  File "/opt/rocks/lib/python2.6/site-packages/rocks/commands/__init__.py", line 2194, in runWrapper
    self.run(self._params, self._args)
  File "/opt/rocks/lib/python2.6/site-packages/rocks/commands/list/host/profile/__init__.py", line 301, in run
    for host in self.getHostnames(args):
  File "/opt/rocks/lib/python2.6/site-packages/rocks/commands/__init__.py", line 773, in getHostnames
    min,max = self.db.fetchone()
TypeError: 'NoneType' object is not iterable

Instead, this should be a large XML-like structure. After a lot of extensive googling (and this 2013 topic) I found out that a simple MySQL update had dropped the root password from the global configuration, which can be found in :

/opt/rocks/etc/my.cnf

An update generally saves the old version here :

/opt/rocks/etc/my.cnf.rpmsave

The configuration should look like this :

[mysqld]
user            = rocksdb
port            = 40000
socket          = /var/opt/rocks/mysql/mysql.sock
datadir         = /var/opt/rocks/mysql

[client]
user            = rocksdb
port            = 40000
socket          = /var/opt/rocks/mysql/mysql.sock
password        = <password>

You don’t have to restart MySQL or any other service, just let the node reboot and it will install properly 🙂
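
To spot this quickly after a future update, one could check whether the [client] section still carries a password line. A minimal sketch, assuming the my.cnf layout shown above; client_has_password is a hypothetical helper name :

```shell
# Succeeds if the [client] section of the my.cnf style file $1
# contains a "password" setting.
client_has_password() {
    awk '
        /^\[/ { in_client = ($0 == "[client]") }
        in_client && $1 == "password" { found = 1 }
        END { exit !found }
    ' "$1"
}

# Usage:
#   client_has_password /opt/rocks/etc/my.cnf || echo "password missing, check my.cnf.rpmsave"
```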

Good luck

Rocks is a cluster distribution. It comes with SNMP configured out of the box, polled by Ganglia. That works nicely, but I like to have all SNMP data in my favorite monitoring system, LibreNMS. Changing the SNMP configuration so that LibreNMS can poll it should be a rather straightforward process; however the nodes have no connection to the public network. They have a private VLAN to talk to the head-node and a private VLAN to communicate with the storage array. So to get the SNMP data to LibreNMS on the public net we will have to get crafty with Iptables.

[diagram: forward snmp from 161 to 3161]

Configuration

First let’s check the /etc/snmp/snmpd.conf file from the Rocks installation :

com2sec        notConfigUser   default         public
group  notConfigGroup  v1              notConfigUser
group  notConfigGroup  v2c             notConfigUser
view    all             included        .1             80

access  notConfigGroup  "" any noauth exact all all all

This config is a bit complex and I figured I wouldn’t go back, so I commented it out. I decided not to remove it completely, since I don’t want to lose the possibility of going back to Ganglia should it turn out to be an important system in Rocks (note : it’s not). I added :

# this creates a SNMPv1/SNMPv2c community named "my_servers"
# and restricts access to LAN addresses, e.g. 192.168.0.0/16 (the last two octets are a range)
rocommunity my_servers <our_public_net>/16
rocommunity my_servers 10.1.0.0/16

# setup info
syscontact  "svennD"

# open up
agentAddress  udp:161

# run as
agentuser  root

# dont log connection from UDP:
dontLogTCPWrappersConnects yes

Important here is that I added two IP ranges. I’m not sure if the private VLAN (10.1.0.0/16) is even required, but since traffic is going over those devices I just added it.

Next thing is setting up Iptables on the head-node. Since Rocks is already pretty protective (good !) I had to add an extra rule to even allow SNMP polling of the head-node :

-A INPUT -p udp -m udp --dport 161 -j ACCEPT

This allows the head-node to be polled by LibreNMS by accepting incoming UDP packets on port 161.

To reach the node, LibreNMS polls the head-node on port 3161 and the head-node forwards those packets to port 161 on the node (using a high port circumvents the well-known ports and the REJECT rule Rocks has for ports 1-1023). This can be done with a prerouting rule in the nat table :

-A PREROUTING -i eth1 -p udp -m udp --dport 3161 -j DNAT --to-destination 10.1.255.244:161
-A POSTROUTING -o eth1 -j MASQUERADE

Note : 10.1.255.244 is the private IP of the node.

From now on packets should come in. This can be checked using tcpdump, which came in handy while debugging this project (run it on the node) :

tcpdump 'port 161'

For snmpd to be able to answer, the head-node needs to forward these packets, which can be done with a forward rule :

-A FORWARD -d 10.1.255.244/32 -p udp -m udp --dport 161 -m state --state NEW,RELATED,ESTABLISHED -j

note : again 10.1.255.244 is the ip of the node.

Surprisingly the node was still unable to answer the incoming requests. This was because the default route (route -n) was pointing towards one of the storage servers. We can add a default gateway using route :

route add default gw 10.1.1.1 eth0

note : 10.1.1.1 is the private VLAN ip of the head-node.

Conclusion

Bam! LibreNMS can talk to the node using the public IP of the head-node on port 3161, and to the head-node itself on port 161. One issue that remains unsolved is that this setup is lost on reboot, since Rocks by default reinstalls nodes that reboot. This can be resolved by adapting the configuration in Rocks and rebuilding the distribution (rocks create distro). However this is rather advanced and (IMHO) difficult to debug, so I did not use that system for this project. Another problem is that it’s rather work-intensive to add all the configuration to all the nodes (this is only for a single node). That is most easily solved with scripts and rocks run host to execute bits on all the nodes. I decided that I only want one node to be polled as a sample. I already track opengridscheduler using an extend on the head-node, so this is mostly for debugging. Good luck !