Tuning of ZFS module

The difficult part of ZOL is that there are plenty of tunable kernel module parameters, because ZFS can be used in many kinds of systems for many different reasons (laptops, file servers, database servers, file clusters). However, some of the parameters have poor defaults for file-serving systems. Here are some of the options that might help you tune ZFS on Linux.


All these parameters are set in:

/etc/modprobe.d/zfs.conf

zfs_arc_min/zfs_arc_max

The ARC (Adaptive Replacement Cache) is a cache that’s used to speed up reads (async writes may also be grouped here before being flushed to disk, though I am not sure about that). By default zfs_arc_max is (all physical memory – 1 GB) if the total physical memory is larger than 4GB, at least if I can believe this document. However, I think ZOL uses 75% by default (the zfs-module-parameters man page says half of system RAM, so check your version). On a storage-only device this can be increased. On anything else the memory will still be used for ARC, but the ARC shrinks when applications need memory. The only disadvantage: some applications require a certain amount of memory to start, or request memory faster than the ARC can shrink, resulting in a crash or swapping. zfs_arc_min is the smallest size the ARC will shrink to under memory pressure. You could lower it on, for example, laptops/desktops that only use ZFS for part of the system (backup storage).

The options are expressed in bytes…

# I did not change zfs_arc_min
# options zfs zfs_arc_min=

# I increased zfs_arc_max as this is a 128GB memory machine only serving files
options zfs zfs_arc_max=100000000000
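
Since the value has to be given in bytes, a small helper avoids counting zeros by hand. The sketch below is only an illustration: the function names are made up for this example, and the 75% fraction is an arbitrary choice, not a recommendation. It takes the text of /proc/meminfo, reads the MemTotal line (reported in kB) and prints the matching modprobe line.

```python
import re


def memtotal_bytes(meminfo_text):
    """Return MemTotal in bytes from the text of /proc/meminfo (value is in kB)."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) * 1024


def arc_max_option(total_bytes, fraction):
    """Build the modprobe line that caps the ARC at `fraction` of total memory."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return "options zfs zfs_arc_max=%d" % int(total_bytes * fraction)


# Hypothetical 128 GiB machine; on a real system pass in the contents of
# /proc/meminfo instead of this sample string.
sample = "MemTotal:       134217728 kB\nMemFree:          1000000 kB\n"
print(arc_max_option(memtotal_bytes(sample), 0.75))
```

The output can be pasted into /etc/modprobe.d/zfs.conf; it only takes effect the next time the zfs module is loaded.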

You can check how the ARC is doing with arcstat:

# /opt/zfs/arcstat.py -f "time,read,hit%,hits,miss%,miss,arcsz,c" 1
    time  read  hit%  hits  miss%  miss  arcsz     c
16:21:42   52K    98   52K      1   716    48G   48G
16:21:43   69K    98   68K      1   937    48G   48G
16:21:44   38K    98   38K      1   459    48G   48G
16:21:45   41K    98   40K      1   623    48G   48G
16:21:46   47K    98   46K      1   483    48G   48G

As you can see, ZFS adapts the ARC dynamically. My server is not working very hard at the moment, so the ARC size is only ~48GB. The useful stat is hit%: a low value means the ARC is not serving many requests from cache, so increasing zfs_arc_max might help if you expect the reads to be cacheable.
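
If arcstat is not at hand, the same counters can be read from /proc/spl/kstat/zfs/arcstats, where ZoL exposes them as "name type data" rows. The following is a rough sketch with made-up function names and a made-up sample. Note that these counters are cumulative since the module was loaded, so the percentage is a long-term average and not the per-second figure arcstat prints.

```python
def parse_arcstats(text):
    """Parse 'name type data' rows of arcstats text into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three fields and a numeric last field;
        # this skips the header lines at the top of the file.
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def hit_percent(stats):
    """Cumulative ARC hit percentage, or 0.0 if nothing was read yet."""
    total = stats["hits"] + stats["misses"]
    return 100.0 * stats["hits"] / total if total else 0.0


# Made-up sample in the arcstats layout; on a real system read the file
# /proc/spl/kstat/zfs/arcstats instead.
sample = """\
name                            type data
hits                            4    9800
misses                          4    200
size                            4    51539607552
c                               4    51539607552
"""
stats = parse_arcstats(sample)
print("hit%%: %.1f  arc size: %.1f GiB" % (hit_percent(stats), stats["size"] / 2**30))
```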

It is also possible to disable the ARC per dataset, or to reduce what it is used for. Of course this slows down file access hugely, and you should only do it when you really don’t care about speed for that particular dataset. There is a primarycache (RAM) and a secondarycache (cache devices, such as SSDs); by default both are used for both metadata and data. The options are:

  • all : both metadata and data are cached
  • metadata : only metadata is cached (file names, sizes, attributes, …)
  • none : no cache is used

# zfs get all jbod1 | grep cache
jbod1  primarycache          all                    default
jbod1  secondarycache        all                    default
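
When there are many datasets, it can help to check these properties in a script. The sketch below is an assumption-laden illustration: the function names are invented, and it simply splits the whitespace-separated NAME / PROPERTY / VALUE / SOURCE columns shown above, which works for these two properties because their values contain no spaces.

```python
def parse_zfs_get(output):
    """Map property name -> (value, source) from 'zfs get' output rows."""
    props = {}
    for line in output.splitlines():
        fields = line.split()
        # Expected columns: NAME PROPERTY VALUE SOURCE (header row is skipped).
        if len(fields) >= 4 and fields[0] != "NAME":
            props[fields[1]] = (fields[2], " ".join(fields[3:]))
    return props


def not_fully_cached(props):
    """Return the cache properties that are set to something other than 'all'."""
    return sorted(name for name, (value, _source) in props.items()
                  if name.endswith("cache") and value != "all")


# Sample copied from the output above.
sample = """\
jbod1  primarycache          all                    default
jbod1  secondarycache        all                    default
"""
props = parse_zfs_get(sample)
for name, (value, source) in sorted(props.items()):
    print("%s=%s (%s)" % (name, value, source))
print("restricted:", not_fully_cached(props))
```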

I left it as is (the default is all), and would advise you (based on this mailing list) to leave it at the default too. A much better option is to lower zfs_arc_max, since ZFS will then still be able to cache some of the data/metadata.

zfs_vdev_[async|sync]_[read|write]_[min|max]_active

ZFS has five I/O queues for each of the different I/O types :

  • sync reads : normal read operations by applications
  • async reads : reads by the ZFS prefetcher, which tries to “guess” what data an application will need next, requests it and caches it
  • sync writes : writes that require fsync(), so they must reach “long term storage”: basically disk, or the ZIL (intent log) device when a separate one is available
  • async writes : ‘normal’ writes (called bulk writes of dirty data)
  • scrub : scrub and resilver operations; scrub checks for wrong checksums, resilver tries to repair (resilver: I would say, RAID rebuild)

This is straight from the ZFS on Linux source code:

The ratio of the queues’ max_actives determines the balance of performance between reads, writes, and scrubs. E.g., increasing zfs_vdev_scrub_max_active will cause the scrub or resilver to complete more quickly, but reads and writes to have higher latency and lower throughput

These values, however, are low for modern hardware and can be increased. I don’t completely understand what is at play here, so feel free to research this, but higher values seem to increase throughput at the cost of latency. For example, scrubbing is now several times quicker: from a few days down to 18h for ~100TB (which means I can run it from cron; more on scrubbing frequently).

Each zfs_vdev_*_*_max_active limits the number of I/Os issued to a single vdev. The right values can be found by monitoring I/O throughput and latency under load, and increasing the value until throughput no longer increases while latency is still acceptable. These are the values I use:

# increased so scrub/resilver completes more quickly, at the cost of other work
options zfs zfs_vdev_scrub_min_active=24
options zfs zfs_vdev_scrub_max_active=64

# sync write
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32

# sync reads (normal)
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32

# async reads : prefetcher
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32

# async write : bulk writes
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32
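
With ten values to keep consistent, a quick sanity check of the file can be worthwhile. The sketch below uses invented function names and assumes one `options zfs name=value` pair per line, as in the listing above (modprobe itself also accepts several pairs on one line, which this does not handle). It reports any queue whose min_active is larger than its max_active, and only looks at pairs where both values are present in the file.

```python
import re

# One "options zfs name=value" pair per line; comment lines do not match.
OPTION_RE = re.compile(r"^options\s+zfs\s+(\w+)=(\d+)\s*$")


def parse_zfs_options(conf_text):
    """Collect 'options zfs name=value' lines into a dict of ints."""
    options = {}
    for line in conf_text.splitlines():
        match = OPTION_RE.match(line.strip())
        if match:
            options[match.group(1)] = int(match.group(2))
    return options


def inverted_queues(options):
    """Return queue prefixes whose min_active is larger than max_active."""
    bad = []
    for name, low in options.items():
        if name.endswith("_min_active"):
            prefix = name[: -len("_min_active")]
            high = options.get(prefix + "_max_active")
            if high is not None and low > high:
                bad.append(prefix)
    return sorted(bad)


conf = """\
# increased so scrub/resilver completes more quickly
options zfs zfs_vdev_scrub_min_active=24
options zfs zfs_vdev_scrub_max_active=64
# sync write
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
"""
print(inverted_queues(parse_zfs_options(conf)))
```

The values actually in effect can be compared against the file by reading /sys/module/zfs/parameters/<name> on a running system.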

At the time of writing, these are the default values in the ZoL repo:

zfs_vdev_sync_read_min_active = 10;
zfs_vdev_sync_read_max_active = 10;
zfs_vdev_sync_write_min_active = 10;
zfs_vdev_sync_write_max_active = 10;
zfs_vdev_async_read_min_active = 1;
zfs_vdev_async_read_max_active = 3;
zfs_vdev_async_write_min_active = 1;
zfs_vdev_async_write_max_active = 10;
zfs_vdev_scrub_min_active = 1;
zfs_vdev_scrub_max_active = 2;

zfs_dirty_data_max_percent

As far as I understand, dirty data are non-sync writes: the application tells the file system to write something down, but doesn’t wait for or expect a reply. Since I have an L2ARC and ARC, this should be cached first, and only then written to the “slow” disks. This value limits the amount of dirty data that can be outstanding in the pool at any given time, as a percentage of physical memory (the cap is 25% of physmem).

The default out of the box is 10%.
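
To see how the percentage and the cap interact, here is a small arithmetic sketch. It assumes the limit is simply "percent of RAM, clamped to 25% of RAM", which is how I read the cap mentioned above; the function name and the 128 GiB machine are made up, and the exact cap may differ between ZFS versions.

```python
GIB = 2**30


def dirty_data_max(physmem_bytes, percent, cap_percent=25):
    """Dirty-data limit in bytes: percent of RAM, clamped to cap_percent of RAM."""
    requested = physmem_bytes * percent // 100
    cap = physmem_bytes * cap_percent // 100
    return min(requested, cap)


ram = 128 * GIB  # hypothetical machine size
for pct in (10, 25, 40):
    print("%2d%% -> %.1f GiB" % (pct, dirty_data_max(ram, pct) / GIB))
```

Under that assumption, any setting above 25 gives the same result as 25.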

options zfs zfs_dirty_data_max_percent=40

zfs_top_maxinflight

The maximum number of scrub I/Os in flight per top-level vdev, 32 by default. Increasing it speeds up zfs scrub (at what cost, I have no idea).

options zfs zfs_top_maxinflight=320

zfs_txg_timeout

Async writes are held for a while before they are written to disk, which lets ZFS write larger chunks at once. The old default was 30 seconds, but because of fluctuating write performance on some machines it was lowered to 5 seconds. Decent servers should be able to hold a bit of data, so I increased it to 15 seconds; more might be fine, depending on the workload.

options zfs zfs_txg_timeout=15

zfs_vdev_scheduler

I frequently see this setting changed online. I contacted an expert, and noop and deadline are both fine. You can read more about the noop scheduler and the deadline scheduler, but I don’t understand enough to recommend a value. I did get the feeling that deadline is more focused on database access patterns, but I could be wrong. I did not change it.

# default : noop
options zfs zfs_vdev_scheduler=deadline

zfs_prefetch_disable

ZFS can predict what data will be requested next and cache it until the request comes; this is easier and faster for spinning disks. On my system it was disabled, which surprised me (the module parameter documentation lists 0, meaning prefetch enabled, as the default). Setting this value to zero turns prefetching on.

# use the prefetch method
options zfs zfs_prefetch_disable=0

l2arc_write_max

This only applies if you have a cache device such as an SSD. When ZFS was created, SSDs were new and could only be written to a limited number of times, so ZFS has some prehistoric limits to spare the SSD the hard labor. l2arc_write_max is such a value: by default only 8MB/s can be written to the SSD. Clearly you can increase this (at the cost of more SSD wear).
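
To get a feeling for what this limit means in practice, the sketch below estimates how long it takes to fill a cache device at a given write rate. It assumes the limit behaves as a plain bytes-per-second rate, as described above; the function name and the 400 GB device are made up for the example.

```python
MIB = 1024 * 1024


def fill_time_hours(device_bytes, write_max_bytes):
    """Hours needed to write device_bytes at write_max_bytes per second."""
    return device_bytes / write_max_bytes / 3600.0


ssd = 400 * 10**9  # hypothetical 400 GB cache device
for rate in (8 * MIB, 500 * MIB):
    print("%3d MiB/s (l2arc_write_max=%d) -> %.2f h"
          % (rate // MIB, rate, fill_time_hours(ssd, rate)))
```

At the default rate a freshly added cache device of that size needs many hours of sustained feeding before it is warm, which is why raising the limit can make sense on a busy file server.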

# max write speed to l2arc
# tradeoff between write/read and durability of ssd (?)
# default : 8 * 1024 * 1024
# setting here : 500 * 1024 * 1024
options zfs l2arc_write_max=524288000

l2arc_headroom

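How far ahead ZFS looks for data to put on the cache device, expressed as a number of max device writes (a multiplier of l2arc_write_max). It can be increased.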
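
Because the headroom is a multiplier, its effect depends on l2arc_write_max. A tiny arithmetic sketch, assuming the amount examined is simply the product of the two values (the function name is invented for the example):

```python
MIB = 1024 * 1024


def l2arc_scan_bytes(write_max, headroom):
    """Bytes of cacheable data looked at ahead of time: headroom * write_max."""
    return write_max * headroom


# Defaults versus the values used in this article.
for write_max, headroom in ((8 * MIB, 2), (500 * MIB, 12)):
    scanned = l2arc_scan_bytes(write_max, headroom)
    print("write_max=%d MiB, headroom=%d -> %d MiB"
          % (write_max // MIB, headroom, scanned // MIB))
```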

# number of max device writes to precache
# default : 2
options zfs l2arc_headroom=12

zfs_immediate_write_sz || zil_slog_limit

zfs_log_write() handles TX_WRITE transactions. The specified callback is called as soon as the write is on stable storage (be it via a DMU sync or a ZIL commit).

There is a useful article on this tunable parameter. zil_slog_limit, on the other hand, is the maximum commit size, in bytes, to the separate log device; in short, another attempt not to overuse the slog device (SSD).

# default : 32768
options zfs zfs_immediate_write_sz=131072 

# default : 1024*1024 = 1mb
options zfs zil_slog_limit=536870912

These once-tunable values have been removed:

  • zfs_resilver_delay
  • zfs_scrub_delay

Important note: I received / found these values floating around on the internet. If you did not realize it yet, I am not an expert at this; this is my “best” attempt to understand these values and settings, and this information might just be plain wrong (or dated by the time you read it). Do your research before changing values in production. You can’t blame me if your server turns out to be the next Skynet.

New source :