Create ZFS Raidz2 pool

20 March, 2019

This is a quick howto : I made a raidz2 pool on ZFS, and it’s very similar to creating a mirror.

Find out which disks you are giving to the pool :

[email protected]:~# fdisk -l /dev/sd* | grep Disk
Disk /dev/sda: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
Disk /dev/sdb: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
Disk /dev/sdc: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
Disk /dev/sdd: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
Disk /dev/sde: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
Disk /dev/sdf: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
Disk /dev/sdg: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
Disk /dev/sdh: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
Disk /dev/sdi: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
Disk /dev/sdj: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
Disk /dev/sdk: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors
Disk /dev/sdl: 9.1 TiB, 10000831348736 bytes, 19532873728 sectors

I was lucky that the main system is on an NVMe drive, which does not show up as /dev/sd*, so all the disks listed can go into the pool.

Create the pool : the name is panda and the “raid” level is raidz2, meaning two disks can fail before data loss occurs, similar to RAID6.

zpool create panda raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl

After that the pool is made :

[email protected]:~# zfs list
panda   768K  80.4T   219K  /panda
[email protected]:~# zpool status
  pool: panda
 state: ONLINE
  scan: none requested

        NAME        STATE     READ WRITE CKSUM
        panda       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
            sdi     ONLINE       0     0     0
            sdj     ONLINE       0     0     0
            sdk     ONLINE       0     0     0
            sdl     ONLINE       0     0     0

errors: No known data errors
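
As a rough sanity check on the sizes above : raidz2 spends two disks’ worth of space on parity, so 10 of the 12 disks hold data. The sketch below is only arithmetic on the byte counts from the fdisk output and comes out at roughly 91 TiB; the gap to the 80.4T that zfs list reports is presumably allocation overhead and reserved space, which this does not model.

```shell
#!/bin/sh
# Rough raidz2 capacity estimate from the fdisk numbers above.
DISKS=12                      # disks in the vdev
PARITY=2                      # raidz2 keeps two disks' worth of parity
DISK_BYTES=10000831348736     # size of one disk as reported by fdisk

raidz_data_bytes() {
    # $1 = number of disks, $2 = parity disks, $3 = bytes per disk
    echo $(( ($1 - $2) * $3 ))
}

DATA_BYTES=$(raidz_data_bytes "$DISKS" "$PARITY" "$DISK_BYTES")
echo "data capacity before overhead: $DATA_BYTES bytes"

# Convert to TiB (2^40 bytes); awk handles the fractional part.
echo "$DATA_BYTES" | awk '{ printf "that is about %.1f TiB\n", $1 / (1024 ^ 4) }'
```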

Note : you can refer to the disks by persistent names (for example the entries under /dev/disk/by-id) to be sure they stay consistent across hardware changes, but the plain /dev/sd* names never gave me any issues personally …
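
One common way to get names that survive hardware changes is to use the symlinks under /dev/disk/by-id instead of /dev/sd*. A minimal sketch, assuming made-up disk names (the ata-EXAMPLE_* entries are placeholders, list your own with ls -l /dev/disk/by-id). It only prints the command, and the -n flag of zpool create makes even the printed command a dry run :

```shell
#!/bin/sh
# Build (and only print) a zpool create command that uses persistent names.
# The pool name and the ata-EXAMPLE_* names are hypothetical placeholders.
POOL="panda"

build_create_cmd() {
    # $1 = pool name, remaining arguments = names found in /dev/disk/by-id
    pool="$1"
    shift
    cmd="zpool create -n $pool raidz2"
    for disk in "$@"; do
        cmd="$cmd /dev/disk/by-id/$disk"
    done
    printf '%s\n' "$cmd"
}

build_create_cmd "$POOL" ata-EXAMPLE_DISK_1 ata-EXAMPLE_DISK_2 \
    ata-EXAMPLE_DISK_3 ata-EXAMPLE_DISK_4
```

Drop the -n once the printed layout looks right.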

Don’t forget to change the default settings to something you prefer :

zfs set xattr=sa panda
zfs set acltype=posixacl panda
zfs set compression=lz4 panda
zfs set atime=off panda
zfs set relatime=off panda

See the basic tuning tips for more info.

CellProfiler is not an easy tool to install; or perhaps I was clumsy on the first attempt (building from source), but I could not get it to work properly on a Linux machine. After another attempt using Miniconda, I managed to get it running. This is just documentation of how I got it working, in case I have to do it again. By no means am I an expert on the subject.


I don’t know if these are all the dependencies but at some point I had to install them.

yum install java-1.8.0-openjdk java-1.8.0-openjdk-devel bzip2 mariadb-devel libstdc++-devel gcc-c++ gtk2 ImageMagick ImageMagick-devel

Another issue is that the installation requires is which can not be found, however jbigkit-libs provides; which can be soft-linked and it will then work;

cd /usr/lib64/
ln -s

Install Miniconda

Get the latest version for Linux 64-bit (or other) and make it executable. On CentOS, Python 2.7 is the default, and I believe it is required for CellProfiler.

chmod 755

Install CellProfiler

Finally we are ready to try the installation. Create a directory “cellprofiler” (the name is not important) :

mkdir cellprofiler
cd cellprofiler
touch environment.yml

environment.yml should contain the following (edit it using nano, vi, vim, emacs, …) :

# run: conda env create -f environment.yml
# run: conda env update -f environment.yml
# run: conda env remove -n cellprofiler
name: cellprofiler
# in order of priority: highest (top) to lowest (bottom)
channels:
  - anaconda
  - goodman # mysql-python for mac
  - bioconda
  - cyclus # java-jdk for windows
  - conda-forge # libxml2 for windows
  - BjornFJohansson # wxpython for linux
dependencies:
  - appdirs
  - boto3
  - cython
  - h5py
  - ipywidgets
  - java-jdk
  - joblib
  - jupyter
  - libtiff
  - libxml2
  - libxslt
  - lxml
  - packaging
  - pillow
  - pip
  - python=2
  - pyzmq=15.3.0
  - mahotas
  - matplotlib!=2.1.0,>2.0.0
  - mysqlclient
  - numpy
  - raven
  - requests
  - scikit-image>=0.13
  - scikit-learn
  - scipy
  - sphinx
  - tifffile
  - wxpython=
  - pip:
    - cellh5
    - centrosome
    - inflect
    - prokaryote==2.4.0
    - javabridge==1.0.15
    - python-bioformats==1.4.0
    - git+[email protected]


After this, create the environment (this will take a while) :

conda env create -f environment.yml

While debugging you can also update an existing environment :

conda env update -f environment.yml

This kinda works but generates two warnings that don’t seem to impact the tool (but perhaps I haven’t used the specific functions that depend on these) :

cellprofiler 3.1.8 has requirement prokaryote==2.4.1, but you'll have prokaryote 2.4.0 which is incompatible.
cellprofiler 3.1.8 has requirement python-bioformats==1.5.2, but you'll have python-bioformats 1.4.0 which is incompatible.

Once that is finished, we can activate & run.

conda activate cellprofiler

Since this is on a server, I also needed to allow X11 forwarding over ssh;

So I was happily using sanoid when someone made me aware of pyznap (thanks !). Ever since that comment, it was on my to-do list to check out pyznap, but if a system works, why change it ? Especially for an important cog such as automated snapshots.

The new version of sanoid breaks compatibility with older versions (at least at config level) and is not well documented at the moment; one has to look into the pull requests to actually understand what is required to get it running. I know open-source projects sometimes go through large changes, and it all runs on love & joy, but it’s a sorry state for a project with ~39 contributors. I also find it useful if a project or tool is simple to set up and to understand for people who don’t look at it daily. I might sound critical, and I still think sanoid is a wonderful tool, but personally I just need to get it up & running in 10 minutes and then move on. The feature list of sanoid and its companion syncoid seems ever growing, and with it the complexity of finding out what is going wrong, so it was time to take pyznap for a run.

And pyznap is actually a shining gem : I got it up & running in 10 minutes. Nothing fancy, it just works out of the box. So here’s how I set it up :

First I needed to install Python 3.5+ on CentOS 7. I won’t go into detail, because it’s basically part of every fresh install these days.

yum install yum-utils
yum groupinstall development
yum install
yum install python36u python36u-pip

After that, install pyznap using pip. We need to call the versioned pip, otherwise the Python 2.7 one is used (still the default on CentOS 7) :

pip3.6 install pyznap

This installs all dependencies. Optionally you can install pv and mbuffer to visualize the transfer speed when sending snapshots to backup locations.

yum install pv mbuffer

Now on to the setup/configuration. The tool will generate a default config and directory when you run its setup :

[[email protected] ~]# pyznap setup -p /etc/pyznap
Feb 21 16:06:30 INFO: Starting pyznap...
Feb 21 16:06:30 INFO: Initial setup...
Feb 21 16:06:30 INFO: Creating directory /etc/pyznap...
Feb 21 16:06:30 INFO: Creating sample config /etc/pyznap/pyznap.conf...
Feb 21 16:06:30 INFO: Finished successfully...

After the setup it’s time for configuration; all items are clearly documented. One remark : don’t put a # (hash) comment behind a setting on the same line. This will generate errors, as only lines starting with # are ignored :

# default settings parent
  # every $cron runtime (~15 minutes)
  frequent = 2
  hourly = 6
  daily = 3
  weekly = 1
  monthly = 0
  yearly = 0
  # take snapshots ?
  snap = no
  # clean snapshots ?
  clean = no
  snap = yes
  clean = yes
  dest = ssh:22:[email protected]:backup/brick1
  dest_key = /root/.ssh/id_rsa_backup
  daily = 24
  snap = yes
  clean = yes

To give some context : I have one pool split up into multiple child file systems (data is the parent and data/brick* are the actual locations of the data). This means I can set up defaults on data, but I don’t really want snapshots of it, as no data resides there, only in the bricks. So I override the defaults (snap/clean) for the bricks.
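
To make the parent/child idea concrete, here is a minimal sketch of how such a config could be laid out. pyznap reads an INI-style file with one section per file system, written as [pool/dataset]. The names data and data/brick1 are placeholders based on the description above, not a copy of my real config, and the file is written to a temporary location so nothing on the system is touched :

```shell
#!/bin/sh
# Hypothetical layout: "data" is the parent, "data/brick1" holds the files.
CONF=$(mktemp)

cat > "$CONF" <<'EOF'
# parent: defaults only, no snapshots are taken here
[data]
frequent = 2
hourly = 6
daily = 3
snap = no
clean = no

# child: inherits the counts above and overrides snap/clean
[data/brick1]
snap = yes
clean = yes
EOF

# List the sections that were written
grep '^\[' "$CONF"
```

Once it looks right, the content would go into /etc/pyznap/pyznap.conf.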

A special case is dest = ; that’s a built-in backup system. Just set up a password-less ssh login to the backup server and you can leverage this feature. One thing to note is that the destination needs to be given as a ZFS file system name and not as the actual mount path (a bit of trickery, because the location is /backup/brick1 but one needs to remove the first / for ZFS). If you use this feature, it is perhaps worth downgrading the cryptography package, because the latest version generates an insane amount of warnings about the upcoming deprecation of some function calls. See the issue.

pip3.6 install cryptography==2.4.2

pyznap also provides a way to clean up the backup location; one can set up a remote cleanup job for the brick by adding :

# cleanup on backup
[ssh:22:[email protected]:backup/brick1]
        frequent = 2
        hourly = 6
        daily = 3
        weekly = 1
        key = /root/.ssh/id_rsa
        clean = yes

The only thing left to do is to create a cron job, which is the easiest :

*/15 * * * *   root    /usr/bin/pyznap snap | logger
0 * * * *   root    /usr/bin/pyznap send | logger

This takes snapshots every 15 minutes (frequent) and syncs to the backup location (dest =) once per hour; the other snapshot types are taken when they are due. (Note : | logger sends the output to /var/log/messages via rsyslog; you could also log to a static file, as explained in the docs.)

Alternatively, on systemd systems like CentOS 7, you can leverage systemd timers instead of cron, but who wants to go into that mess ?
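
For the brave, a minimal sketch of what the timer variant of the 15 minute snapshot job could look like. I did not run it this way myself, and the unit names pyznap-snap.service and pyznap-snap.timer are made up for the example. The script only writes the two unit files into a temporary directory so they can be inspected first :

```shell
#!/bin/sh
# Write hypothetical systemd units for "pyznap snap" into a scratch directory.
UNITDIR=$(mktemp -d)

cat > "$UNITDIR/pyznap-snap.service" <<'EOF'
[Unit]
Description=pyznap snapshot run

[Service]
Type=oneshot
ExecStart=/usr/bin/pyznap snap
EOF

cat > "$UNITDIR/pyznap-snap.timer" <<'EOF'
[Unit]
Description=Run pyznap snap every 15 minutes

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target
EOF

ls "$UNITDIR"
# To use them: copy both files to /etc/systemd/system, then run
#   systemctl daemon-reload
#   systemctl enable pyznap-snap.timer
#   systemctl start pyznap-snap.timer
```

The output ends up in the journal instead of /var/log/messages, so the | logger part is not needed here.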

And that’s it, automated snapshots <3. Thanks yboetz for a small but useful tool 🙂

Understanding ZFS : Checksum

6 February, 2019

Ever wondered what kind of checksum ZFS uses to check for bit rot ? Probably not, but it turns out you can change the used algorithm. However like most settings, the defaults are chosen by smart people. So changing it might not be doing you any favors.

As it turns out, the default checksum is Fletcher’s checksum. This algorithm is comparable to CRC error detection, but outperforms it by nearly 20 times per byte (source). So it looks as if this is really speedy.

This algorithm can actually use some of the more modern Intel CPU optimizations (source). To check whether your CPU has these optimizations, check your CPU flags; on most distros this can be done using :

cat /proc/cpuinfo | grep flags

or

lscpu | grep -i flags

Flags of interest are sse2, ssse3, avx2 and avx512f. I’m well outside my scope of knowledge here, but it seems the benefit of those optimizations varies between CPUs, so the latest optimization is not always the best. ZFS therefore runs a mini-benchmark just after its kernel module is loaded to determine which implementation is fastest. Even nicer, the results are exposed under /proc, so you can check your CPU’s scores in : /proc/spl/kstat/zfs/fletcher_4_bench

On a recent server (Intel Xeon Silver 4110), avx512f/avx2 is picked :

cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 1054412431662536 2941001241716525
implementation   native         byteswap
scalar           3455422459     2778248102
superscalar      4626459244     3440503186
superscalar4     4008342143     3352064854
sse2             7888619803     4445292423
ssse3            7891663628     7030719031
avx2             12054042156    10904840790
avx512f          19645275791    6985259129
fastest          avx512f        avx2
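
If you only care about the verdict and not the individual scores, the “fastest” row can be pulled out with awk. A small sketch; the helper name fastest_fletcher is my own, and it simply prints a notice when the kstat file is not there (for example when the ZFS module is not loaded) :

```shell
#!/bin/sh
BENCH=/proc/spl/kstat/zfs/fletcher_4_bench

fastest_fletcher() {
    # $1 = path to a fletcher_4_bench style file
    # Prints the winning implementation for native and byteswapped data.
    awk '$1 == "fastest" { print "native=" $2 " byteswap=" $3 }' "$1"
}

if [ -r "$BENCH" ]; then
    fastest_fletcher "$BENCH"
else
    echo "no $BENCH here (ZFS module not loaded ?)"
fi
```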

While on an ancient test machine (Intel Pentium Dual CPU E2220), superscalar is picked over “newer” flags such as sse2 :

cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 3970297751 15299241313174195
implementation   native         byteswap
scalar           3690080571     2332985088
superscalar      4357932898     2616994380
superscalar4     3849236054     2487645215
sse2             2809499310     2262775111
ssse3            2809413022     2308094121
fastest          superscalar    superscalar

Pretty nice work, all done behind the scenes in a transparent way.

Back to the stuff we can actually play with : changing the algorithm. It turns out that one can change this to SHA256, which is required to run deduplication, and that’s about it. The other alternatives have been deprecated or are not implemented in zfsonlinux; these are fletcher2, SHA512, skein and edon-r. You can also disable the checksum for a certain dataset or for an entire pool, but then the question is why even choose ZFS.

Changing the checksum can be done like most values (as per docs) :

zfs set checksum=sha256 pool_name/dataset_name
zfs set checksum=fletcher4 pool_name/dataset_name

So while it’s really interesting to know what is going on behind the scenes, I doubt many people should play with this unless they know what they are doing. In which case this article is not aimed at you 😉

As always, relevant information, and basically the source of this post can be found on the Github wiki of zfsonlinux, here.

I installed PHP 7.1 through webtatic, with Apache instead of Nginx, so without the -fpm package. Installing the PHP stats package is easy if you know how, so follow along.

First we need to provide our system with a compiler and the pear package, which contains pecl (yum whatprovides pecl) :

yum install gcc php71w-pear

After that you can run pecl to install it :

pecl install stats

This will however fail and tell you that you need a PHP 5.* version … however, if you specify the latest version it will install just fine :

pecl install stats-2.0.3

And bam, installed. Now we need to activate it in the PHP config :  (new file, or else /etc/php.ini)

nano /etc/php.d/stats.ini

add :

After that, just restart apache and you are good to go :

systemctl restart httpd



I installed a new kernel (kernel-ml, mainline 4.18). To make sure it got selected on the next boot, this is what I had to do (since the machine is headless and I won’t see grub when it boots) :

Install kernel mainline :

yum install kernel-ml

Find the currently installed kernels :

cat /boot/grub2/grub.cfg | grep ^menuentry | cut -c1-50 | nl -v 0

for me :

0  menuentry 'CentOS Linux (4.18.16-1.el7.elrepo.x86_
1  menuentry 'CentOS Linux (3.10.0-693.5.2.el7.x86_64
2  menuentry 'CentOS Linux (0-rescue-daa04fd470a24a8f

So I want the first kernel (counting from 0) as default. This can be set in /etc/default/grub :
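
A minimal sketch, assuming the setting in question is GRUB_DEFAULT (the standard key for picking a menu entry by index). The helper name set_grub_default is made up for this example, and it is tried on a scratch file first rather than on the real /etc/default/grub :

```shell
#!/bin/sh
set_grub_default() {
    # $1 = menu entry index, $2 = path to the grub defaults file
    if grep -q '^GRUB_DEFAULT=' "$2"; then
        # Key already present: replace its value in place
        sed -i "s/^GRUB_DEFAULT=.*/GRUB_DEFAULT=$1/" "$2"
    else
        # Key missing: append it
        echo "GRUB_DEFAULT=$1" >> "$2"
    fi
}

# Try it on a scratch copy; the real file is /etc/default/grub.
SCRATCH=$(mktemp)
printf 'GRUB_TIMEOUT=5\nGRUB_DEFAULT=saved\n' > "$SCRATCH"
set_grub_default 0 "$SCRATCH"
cat "$SCRATCH"
```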


Then remake grub config :

grub2-mkconfig -o /boot/grub2/grub.cfg

Reboot and the new kernel is :

[[email protected] ~]# uname -a
Linux svennd.local 4.18.16-1.el7.elrepo.x86_64 #1 SMP Sat Oct 20 12:52:50 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux



Ouch, I got pretty much the same error as two years ago. I have no idea why it failed, however a solution similar to what he suggested seemed to work.

[Mon Dec 03 15:07:55.871866 2018] [auth_digest:notice] [pid 3352] AH01757: generating secret for digest authentication ...
[Mon Dec 03 15:07:55.871892 2018] [auth_digest:error] [pid 3352] (2)No such file or directory: AH01762: Failed to create shared memory segment on file /run/httpd/authdigest_shm.3352
[Mon Dec 03 15:07:55.871897 2018] [auth_digest:error] [pid 3352] (2)No such file or directory: AH01760: failed to initialize shm - all nonce-count checking, one-time nonces, and MD5-sess algorithm disabled
[Mon Dec 03 15:07:55.871901 2018] [:emerg] [pid 3352] AH00020: Configuration Failed, exiting

The directory /run/httpd is gone. So recreate it and give it the correct permissions. In my case (on CentOS 7.x) the group was apache instead of httpd :

mkdir /run/httpd
chown root:apache /run/httpd
chmod 0710 /run/httpd

Another fix applied, but sadly the mystery is not solved.

Ordinarily I use “static” references in /etc/fstab to mount NFS shares on a server. Doing this on a Rocks Cluster however is a bit “hacky”; adapting every node can be automated using rocks run host, but there is an alternative way (not sure if it’s better 😉) : Rocks uses Autofs to mount NFS shares when they are required. So I felt brave and wanted to learn something new. This is the documentation of that journey.


On a CentOS 7.5 base install, no NFS utilities are installed by default. So when I tried to mount an NFS share I got this error :

mount: wrong fs type, bad option, bad superblock on,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount.<type> helper program)

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

This tells us the mount helper for NFS (/sbin/mount.nfs) is missing; this can be fixed by installing nfs-utils on CentOS :

yum install nfs-utils

For Debian variants (Mint, Ubuntu, …) this would be :

apt install nfs-common
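
To check up front whether the helper from the error message is present, something like this can be used. A small sketch; the function name check_nfs_helper is my own, and it looks in /sbin explicitly because that directory is not always in the PATH of a normal user :

```shell
#!/bin/sh
check_nfs_helper() {
    # mount(8) hands NFS mounts to the mount.nfs helper; see if it exists.
    if command -v mount.nfs >/dev/null 2>&1 || [ -x /sbin/mount.nfs ]; then
        echo "mount.nfs helper found"
    else
        echo "mount.nfs helper missing : install nfs-utils (CentOS) or nfs-common (Debian)"
    fi
}

check_nfs_helper
```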

After that, I see happy shares 🙂

mount.nfs: timeout set for Wed Nov 14 12:57:50 2018
mount.nfs: trying text-based options 'soft,intr,retrans=2,rsize=32768,wsize=32768,nfsvers=3,tcp,'
mount.nfs: prog 100003, trying vers=3, prot=6
mount.nfs: trying prog 100003 vers 3 prot TCP port 2049
mount.nfs: prog 100005, trying vers=3, prot=6
mount.nfs: trying prog 100005 vers 3 prot TCP port 20048
/data    : successfully mounted


Rename a volume group on LVM

8 November, 2018

While recovering a system I found myself in the annoying pickle of having two disks with a similar LVM layout and, more precisely, with identically named volume groups. This can cause issues; in my case it just stopped me from mounting the original device. To make matters worse, the sizes of the volumes were nearly identical. Not really being an LVM user, it was time for Google to save me.

Activating both identically named volume groups failed :

[email protected]:~# vgchange -ay
  device-mapper: create ioctl on pve-swapLVM-Do7QVk4UeQXTTqVtB5b0pISJQy1YL1YJ4RB7lZXsAvYVGcYs9e7TZcFuHfkUBZJz failed: Device or resource busy
  device-mapper: create ioctl on pve-rootLVM-Do7QVk4UeQXTTqVtB5b0pISJQy1YL1YJgnTssDJJQj59LISgspFSLEH2pzqQLYw9 failed: Device or resource busy
  1 logical volume(s) in volume group "pve" now active
  device-mapper: create ioctl on pve-dataLVM-46zXzJmc3xbdOtJdsVXrSlb5siaAacfRmhJ3Ac2ET7201zotB4DEX2kVu9gSafDC-tpool failed: Device or resource busy
  2 logical volume(s) in volume group "pve" now active

So the first thing I found was lvmdiskscan, which searches for disks that are formatted with LVM. Not surprisingly I found two disks (the new and the old one) and a third data disk (RAID).

[email protected]:~# lvmdiskscan
  /dev/sda2 [     256.00 MiB]
  /dev/sda3 [     465.51 GiB] LVM physical volume
  /dev/sdb2 [     510.00 MiB]
  /dev/sdb3 [     465.15 GiB] LVM physical volume
  /dev/sdc  [      29.10 TiB]
  1 disk
  2 partitions
  0 LVM physical volume whole disks
  2 LVM physical volumes

Note : partitions /dev/sda3 and /dev/sdb3 are dangerously similar !

Now we know that the partitions are very similar; however, I might be lucky and the logical volumes might be different. Check using lvscan :

[email protected]:~# lvscan
  inactive          '/dev/pve/swap' [58.12 GiB] inherit
  inactive          '/dev/pve/root' [96.00 GiB] inherit
  ACTIVE            '/dev/pve/data' [295.03 GiB] inherit
  ACTIVE            '/dev/pve/swap' [8.00 GiB] inherit
  ACTIVE            '/dev/pve/root' [96.00 GiB] inherit
  inactive          '/dev/pve/data' [338.60 GiB] inherit

Note : there is a slight difference in the sizes of the volumes, which is good information. However, the names are equal. This is causing the issue.

If we can change the volume group name (pve), this should be enough to mount it. However, I don’t want to rename the active one, as that might crash my current system (?). So let’s check using vgdisplay :

[email protected]:~# vgdisplay
  --- Volume group ---
  VG Name               pve
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  64
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               465.15 GiB
  PE Size               4.00 MiB
  Total PE              119078
  Alloc PE / Size       114983 / 449.15 GiB
  Free  PE / Size       4095 / 16.00 GiB
  VG UUID               Do7QVk-4UeQ-XTTq-VtB5-b0pI-SJQy-1YL1YJ

  --- Volume group ---
  VG Name               pve
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  7
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                3
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               465.51 GiB
  PE Size               4.00 MiB
  Total PE              119170
  Alloc PE / Size       115075 / 449.51 GiB
  Free  PE / Size       4095 / 16.00 GiB
  VG UUID               46zXzJ-mc3x-bdOt-JdsV-XrSl-b5si-aAacfR

There is a slight difference in allocated size (449.51 vs 449.15 GiB) and I know the disk currently in use is my new system (/dev/sda, checked using df -h).

So I know I need /dev/sdb3 with 465.15 GiB in size; this is the first in the list, with UUID Do7QVk-4UeQ-XTTq-VtB5-b0pI-SJQy-1YL1YJ.
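
Instead of matching sizes by eye, the UUID can also be looked up per physical volume, since pvs can print the volume group fields next to the device name. A minimal sketch; the helper name uuid_for_pv is made up, and the sample lines fed to it at the end are just the values from this post typed in by hand :

```shell
#!/bin/sh
uuid_for_pv() {
    # $1 = physical volume, e.g. /dev/sdb3
    # stdin = output of: pvs --noheadings -o pv_name,vg_name,vg_uuid
    awk -v dev="$1" '$1 == dev { print $3 }'
}

# On the real system this would be :
#   pvs --noheadings -o pv_name,vg_name,vg_uuid | uuid_for_pv /dev/sdb3
# Here the same lookup runs against hand-typed sample lines.
printf '  /dev/sda3 pve 46zXzJ-mc3x-bdOt-JdsV-XrSl-b5si-aAacfR\n  /dev/sdb3 pve Do7QVk-4UeQ-XTTq-VtB5-b0pI-SJQy-1YL1YJ\n' |
    uuid_for_pv /dev/sdb3
```

The UUID it prints is the one to hand to vgrename.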

Finally we know what we want to rename; it seems almost too simple using vgrename :

[email protected]:~# vgrename Do7QVk-4UeQ-XTTq-VtB5-b0pI-SJQy-1YL1YJ pveb
  Processing VG pve because of matching UUID Do7QVk-4UeQ-XTTq-VtB5-b0pI-SJQy-1YL1YJ
  Volume group "Do7QVk-4UeQ-XTTq-VtB5-b0pI-SJQy-1YL1YJ" successfully renamed to "pveb"

After this we can activate the volume group using vgchange :

vgchange -ay

And voilà, one can mount the volumes of the renamed group :

mkdir -p /data/oldroot
mkdir -p /data/data

mount /dev/pveb/root /data/oldroot
mount /dev/pveb/data /data/data

Tadaaa ! The data is available once more 🙂