ZFS has remained one of the most technically advanced and feature-complete filesystems since it appeared in October 2005. Code for Sun's original Zettabyte File System was released under the CDDL open source license, and it has since become a standard component of FreeBSD and slowly migrated to various BSD brethren, while maintaining a strong hold over the descendants of OpenSolaris, including OpenIndiana and SmartOS.
Oracle is the owner and custodian of ZFS, and it is in a peculiar position with respect to Linux filesystems. BtrFS (the main challenger to ZFS) began development at Oracle, where it remains a core component of Oracle Linux despite stability issues. RedHat's recent decision to deprecate BtrFS likely introduces compatibility and support challenges for Oracle's Linux road map. Oracle obviously has deep familiarity with the Linux filesystem landscape (having recently released "dedup" patches for XFS).
ZFS has been (mostly) kept out of Linux due to the CDDL's incompatibility with Linux's GPL license. It is the clear hope of the Linux community that Oracle will relicense ZFS in a form that can be included in Linux, and we should all gently cajole Oracle to do so. A relicense of ZFS would have an obvious impact on BtrFS and the rest of Linux, and we should all work to understand Oracle's position as the holder of these tools. However, Oracle has a history of gifting large software projects to independent leadership (examples of this largesse include OpenOffice and, recently, Java Enterprise Edition), so it is not inconceivable that Oracle's generosity may at some point extend to ZFS.
To further this conversation, we shall investigate the various versions of ZFS for Linux. Starting in an RPM-centric environment, we first install the minimally invasive FUSE implementation, then proceed with a native install of the ZFS modules from source. Finally, leaving RPM behind, we proceed to the Antergos distribution, which offers native ZFS as a supported installation option.
ZFS is similar to other storage management approaches, but in some ways it is radically different. ZFS does not normally use the Linux Logical Volume Manager (LVM) or disk partitions, and it is usually convenient to delete partitions and LVM structures prior to preparing media for a zpool.
The zpool is the analog of the LVM. A zpool spans one or more storage devices, and members of a zpool may be of several types. The basic storage elements are single devices, mirrors, and raidz. All of these storage elements are called "vdevs."
Mirrored vdevs in a zpool present storage which is the size of the smallest member drive. A mirrored vdev can be upgraded (i.e., increased in size) by attaching larger drives to the mirrorset and "resilvering" (synchronizing the mirrors), then detaching the smaller drives from the set. Resilvering a mirror will only involve copying used blocks to the target device - unused blocks are not touched, which can make resilvering much faster than hardware-maintained disk mirroring (which copies unused storage).
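As a sketch of this upgrade procedure, assume a pool named vault mirrored across sdb and sdc, with larger replacement drives at sdd and sde (all device names here are hypothetical):

# zpool set autoexpand=on vault          # let the vdev grow once every member is larger
# zpool attach vault /dev/sdb /dev/sdd   # resilver the first new drive into the mirror
# zpool attach vault /dev/sdb /dev/sde   # and the second
# zpool status vault                     # wait here until resilvering completes
# zpool detach vault /dev/sdb            # then remove the smaller drives
# zpool detach vault /dev/sdc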
ZFS can also maintain RAID devices, and unlike most storage controllers, it can do so without battery-backed cache (as long as the physical drives honor "write barriers"). ZFS can create a raidz vdev with multiple levels of redundancy, allowing the failure of up to three physical drives while maintaining array availability. Resilvering a raidz likewise involves only used blocks, and can be much faster than a storage controller that copies all disk blocks during a RAID rebuild. A raidz vdev should normally comprise 8-12 drives (larger raidz vdevs are not recommended). Note that the number of drives in a raidz cannot be expanded.
ZFS greatly prefers to manage raw disks. RAID controllers should be configured to present the raw devices, never a hardware RAID array. ZFS is able to enforce storage integrity far better than any RAID controller, as it has intimate knowledge of the structure of the filesystem. All controllers should be configured to present "Just a Bunch Of Disks" (JBOD) for best results in ZFS.
Data safety is an important design feature of ZFS.
All blocks written in a zpool are aggressively checksummed to ensure the consistency and correctness of the data. The checksum algorithm can be selected from sha256, fletcher2, or fletcher4. The checksum on user data can also be disabled, which is never recommended (the setting might seem useful on a scratch/tmp filesystem where speed is critical while consistency and recovery are irrelevant; however, sync=disabled is the recommended setting for temporary filesystems). The checksum algorithm can be changed at any time, and new blocks will use the updated algorithm. A checksum is stored separately from the data block, with the parent block, in the hope that localized block damage can be detected. If a block is found to disagree with the parent's checksum, an alternate copy of the block is retrieved from either a mirror or raidz device, rewritten over the bad block, and the I/O is completed without incident. ZFS filesystems use these techniques to "self-heal" and protect themselves from "bitrot" - data changes on hard drive platters that are caused by controller errors, power losses/fluctuations in the read/write heads, and even the bombardment of cosmic rays.
ZFS can implement "deduplication" by maintaining a searchable index of block checksums and their locations. If a new block to be written matches an existing block within the index, then the existing block is used instead, and space is saved. In this way, multiple files may share content by maintaining single copies of common blocks, from which they will diverge if any of their content changes. The documentation states that a "dedup-capable checksum" must be set before dedup can be enabled, and sha256 is offered as an example - the checksum must be "collision-resistant" to uniquely identify a block to assure the safety of dedup. Be warned that memory requirements for ZFS expand drastically when deduplication is enabled, which can quickly overwhelm a system lacking sufficient resources.
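A minimal sketch of enabling dedup, assuming a hypothetical pool named tank (the reported ratio below is illustrative only):

# zfs set checksum=sha256 tank    # a collision-resistant, dedup-capable checksum
# zfs set dedup=on tank           # deduplicate all new writes

# zpool get dedupratio tank       # report the space savings
NAME  PROPERTY    VALUE  SOURCE
tank  dedupratio  1.00x  -

Note that only blocks written after dedup is enabled participate; existing data is not retroactively deduplicated.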
The zpool can hold datasets, snapshots, clones, and volumes. A "dataset" is a standard ZFS filesystem that has a mountpoint and can be modified. A "snapshot" is a point-in-time copy of a filesystem, and as the parent dataset is changed, the snapshot will collect the original blocks to maintain a consistent past image. A "clone" can be built upon a snapshot, and allows a different set of changes to be applied to the past image, effectively allowing a filesystem to branch - the clone and the original dataset will continue to share unchanged blocks, but will otherwise diverge. A "volume" is similar to a block device, and can be loopback-mounted with a filesystem of any type, or perhaps presented as an iSCSI target. Checksums are enforced on volumes. Note that, unlike partitions or logical volumes, elements in a zpool can be intermingled. ZFS knows that the outside edge of a disk is faster than the interior, and it may decide to mix blocks from multiple objects in a zpool at these locations to increase performance. Due to this commingling of filesystems, forensic analysis of zpools is difficult and expensive:
But, no matter how much searching you do, there is [sic] no ZFS recovery tools out there. You are welcome to call companies like Ontrack for data recovery. I know one person that did, and they spent $3k just to find out if their data was recoverable. Then they spent another $15k to get just 200GB of data back.
There are no fsck or defrag tools for ZFS datasets. The boot process will never be delayed because a dataset was not cleanly unmounted. There is a "scrub" tool which will walk a dataset and verify the checksum of every used block on all vdevs, but the scrub takes place on mounted and active datasets. ZFS can recover very well from power losses or otherwise dirty dismounts.
Fragmentation in ZFS is a larger question, and it appears to be related more to remaining storage capacity than to rapid file growth and reduction. Performance of a heavily used dataset will begin to degrade when it is 50% full, and it will drop dramatically over 80% usage, when ZFS begins to use "best-fit" rather than "first-fit" to store new blocks. Regaining performance after dropping below 50% usage can involve dropping and resilvering physical disks in the containing vdev until all of the dataset's blocks have migrated. Otherwise, the dataset should be completely unloaded and erased, then reloaded with content that does not exceed 50% usage (the zfs send and receive utilities are useful for this purpose). It is important to provide ample free disk space to datasets that will see heavy use.
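A reload of that kind might look like the following sketch (the pool and dataset names are hypothetical):

# zfs snapshot vault/data@migrate                      # freeze a consistent image
# zfs send vault/data@migrate | zfs receive tank/data  # stream it to a roomier pool
# zfs destroy -r vault/data                            # retire the fragmented original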
It is strongly encouraged to use ECC memory with ZFS. Error-correcting memory is critical for the correct processing of the checksums that maintain zpool consistency. Memory can be altered by system errors and cosmic rays - ECC memory can correct single-bit errors, and panic/halt the system when multi-bit errors are detected. ECC memory is normally found in servers, but is somewhat rare in desktops and laptops. Some warn of the "scrub of death" and describe actual lost data from non-ECC RAM. However, one of the creators of ZFS says that all filesystems are vulnerable when non-ECC memory is in use, that ZFS is actually more graceful in failure than most, and he further describes undocumented settings that force ZFS to repeatedly recompute checksums in memory, minimizing the dangers of non-ECC RAM. There is a lengthy configuration guide that addresses ZFS safety in a non-ECC environment with these undocumented settings, but the guide does not appear to cover the FUSE implementation.
The Linux implementation of FUSE received a ZFS port in 2006. FUSE is an interface that allows a filesystem to be implemented by a process that runs in user space. Fedora has maintained zfs-fuse as an RPM package for some time, but this package does not appear in any of the RedHat-based distributions, including Oracle Linux. RedHat appears to have intentionally omitted any relevant RPM for ZFS support.
The FUSE implementation is likely the only way to (currently) use ZFS on Linux in a manner that is fully compliant with both the CDDL and the GPL.
The FUSE port is relatively slow compared to a kernel ZFS implementation. FUSE is not generally installed in a manner that is compatible with NFS, so a zfs-fuse filesystem cannot be exported over the network without preparing a FUSE version with NFS support (NFSv4 might be available if an fsid= is supplied). The zfs-fuse implementation is likely reasonable for local, archival, and potentially compressed datasets. Some have used BtrFS for ad-hoc compressed filesystems, and zfs-fuse is certainly an option for similar activity.
The last version of zfs-fuse that will work on Oracle Linux 7.4 is the RPM from Fedora 25 (a newer ZFS release appears in Fedora 26, but it fails to install on Oracle Linux 7.4 due to an OpenSSL dependency - RedHat's OpenSSL is now too old). Below we install the zfs-fuse RPM:
# rpm -Uvh zfs-fuse-0.7.0-23.fc24.x86_64.rpm
Preparing...                          ################################# [100%]
Updating / installing...
   1:zfs-fuse-0.7.0-23.fc24          ################################# [100%]

# cat /etc/redhat-release /etc/oracle-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)
Oracle Linux Server release 7.4
The zfs-fuse userspace agent must be executed before any zpools can be manipulated (note a systemd unit is included for this purpose):
# zfs-fuse
#
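Alternatively, the bundled systemd unit can start the agent at boot (the unit name here is assumed from the package):

# systemctl start zfs-fuse
# systemctl enable zfs-fuse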
For an easy example, we will retask a small hard drive containing a Windows 7 installation:
# fdisk -l /dev/sdb

Disk /dev/sdb: 160.0 GB, 160000000000 bytes, 312500000 sectors
Disk label type: dos
Disk identifier: 0x8d206763

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *        2048      206847      102400    7  HPFS/NTFS/exFAT
/dev/sdb2          206848   312496127   156144640    7  HPFS/NTFS/exFAT
It is usually most convenient to dedicate an entire disk to a zpool, so we delete all the existing partitions:
# fdisk /dev/sdb
Welcome to fdisk (util-linux 2.23.2).

Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Command (m for help): d
Partition number (1,2, default 2): 2
Partition 2 is deleted

Command (m for help): d
Selected partition 1
Partition 1 is deleted

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
Now a zpool can be added on the drive (note that creating a pool adds a dataset of the same name, which we see here is automatically mounted):
# zpool create vault /dev/sdb

# df | awk 'NR==1||/vault/'
Filesystem     1K-blocks  Used Available Use% Mounted on
vault          153796557    21 153796536   1% /vault

# mount | grep vault
vault on /vault type fuse.zfs
Creating a zpool on non-redundant devices is informally known as "hating your data" and should only be contemplated for demonstration purposes. However, zpools on non-redundant media (i.e., flash drives) do have data-consistency and compression advantages over VFAT, and the copies parameter can be adjusted on such a dataset to force all blocks to be recorded on the media multiple times (up to three) to increase data safety.
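For instance, a sketch with a hypothetical dataset intended for such media:

# zfs create vault/stick
# zfs set copies=2 vault/stick

# zfs get copies vault/stick
NAME         PROPERTY  VALUE  SOURCE
vault/stick  copies    2      local

Note that copies guards against localized block damage only; it is no substitute for a mirror, as the loss of the whole device still loses the data.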
Mirrored drives might be created with "zpool create vault mirror /dev/sdb /dev/sdc". Additional drives can be added as mirrors of an existing drive with "zpool attach". A simple RAID set might be created with "zpool create vault raidz /dev/sdb /dev/sdc /dev/sdd".
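For example, the single-drive pool created above might be converted into a mirror with a sketch like this (assuming a blank second disk at /dev/sdc):

# zpool attach vault /dev/sdb /dev/sdc   # resilvering begins immediately
# zpool status vault                     # observe the resilver progress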
The standard umount command should (normally) not be used to unmount ZFS datasets - use the zpool/zfs tools instead (note the "unmount" rather than "umount" spelling):
# zfs unmount vault

# df | awk 'NR==1||/vault/'
Filesystem     1K-blocks  Used Available Use% Mounted on

# zfs mount vault

# df | awk 'NR==1||/vault/'
Filesystem     1K-blocks  Used Available Use% Mounted on
vault          153796557    21 153796536   1% /vault
A ZFS dataset can be mounted in a new location by altering the "mountpoint":
# zfs unmount vault
# mkdir /root/vault
# zfs set mountpoint=/root/vault vault
# zfs mount vault

# df | awk 'NR==1||/vault/'
Filesystem     1K-blocks  Used Available Use% Mounted on
vault          153796547    21 153796526   1% /root/vault

# zfs unmount vault
# zfs set mountpoint=/vault vault
# zfs mount vault

# df | awk 'NR==1||/vault/'
Filesystem     1K-blocks  Used Available Use% Mounted on
vault          153796547    21 153796526   1% /vault
The mountpoint is retained, and is persistent across reboots.
Creating an additional dataset (and mounting it) is as easy as creating a directory (note this command can take some time):
# zfs create vault/tmpdir

# df | awk 'NR==1||/(vault|tmpdir)/'
Filesystem     1K-blocks  Used Available Use% Mounted on
vault          153796496   800 153795696   1% /vault
vault/tmpdir   153795717    21 153795696   1% /vault/tmpdir

# cp /etc/yum.conf /vault/tmpdir/

# ls -l /vault/tmpdir/
-rw-r--r--. 1 root root 813 Sep 23 16:47 yum.conf
ZFS supports several types of compression in a dataset. Gzip of varying degrees, zle, and lzjb can all be present in a single mountpoint. The checksum algorithm can also be adjusted on the fly.
# zfs get compress vault/tmpdir
NAME          PROPERTY     VALUE  SOURCE
vault/tmpdir  compression  off    local

# zfs get checksum vault/tmpdir
NAME          PROPERTY  VALUE  SOURCE
vault/tmpdir  checksum  on     default

# zfs set compression=gzip vault/tmpdir
# zfs set checksum=fletcher2 vault/tmpdir
# cp /etc/redhat-release /vault/tmpdir

# zfs set compression=zle vault/tmpdir
# zfs set checksum=fletcher4 vault/tmpdir
# cp /etc/oracle-release /vault/tmpdir

# zfs set compression=lzjb vault/tmpdir
# zfs set checksum=sha256 vault/tmpdir
# cp /etc/os-release /vault/tmpdir
Note that the GZIP compression factor can be adjusted (the default is 6, just as in the GNU GZIP utility). This will directly impact the speed and responsiveness of a dataset.
# zfs set compression=gzip-1 vault/tmpdir
# cp /etc/profile /vault/tmpdir

# zfs set compression=gzip-9 vault/tmpdir
# cp /etc/grub2.cfg /vault/tmpdir

# ls -l /vault/tmpdir
-rw-r--r--. 1 root root 6308 Sep 23 17:06 grub2.cfg
-rw-r--r--. 1 root root   32 Sep 23 17:00 oracle-release
-rw-r--r--. 1 root root  398 Sep 23 17:00 os-release
-rw-r--r--. 1 root root 1795 Sep 23 17:05 profile
-rw-r--r--. 1 root root   52 Sep 23 16:59 redhat-release
-rw-r--r--. 1 root root  813 Sep 23 16:58 yum.conf
Should the dataset no longer be needed, it can be dropped:
# zfs destroy vault/tmpdir

# df | awk 'NR==1||/(vault|tmpdir)/'
Filesystem     1K-blocks  Used Available Use% Mounted on
vault          153796523   800 153795723   1% /vault
We can demonstrate a recovery in ZFS by copying a few files and creating a snapshot:
# cp /etc/passwd /etc/group /etc/shadow /vault

# ls -l /vault
-rw-r--r--. 1 root root  965 Sep 23 14:41 group
-rw-r--r--. 1 root root 2269 Sep 23 14:41 passwd
----------. 1 root root 1255 Sep 23 14:41 shadow

# zfs snapshot vault@goodver

# zfs list -t snapshot
NAME            USED  AVAIL  REFER  MOUNTPOINT
vault@goodver      0      -    27K  -
Then we can simulate more file manipulations that involve the loss of a critical file:
# rm /vault/shadow
rm: remove regular file '/vault/shadow'? y

# cp /etc/resolv.conf /etc/nsswitch.conf /etc/services /vault/

# ls -l /vault
-rw-r--r--. 1 root root    965 Sep 23 14:41 group
-rw-r--r--. 1 root root   1760 Sep 23 16:14 nsswitch.conf
-rw-r--r--. 1 root root   2269 Sep 23 14:41 passwd
-rw-r--r--. 1 root root     98 Sep 23 16:14 resolv.conf
-rw-r--r--. 1 root root 670311 Sep 23 16:14 services
Normally, snapshots are visible in the .zfs directory of the dataset. However, this functionality does not exist within the zfs-fuse implementation, so we are forced to create a clone to retrieve our lost file:
# zfs clone vault@goodver vault/history

# ls -l /vault/history
-rw-r--r--. 1 root root  965 Sep 23 14:41 group
-rw-r--r--. 1 root root 2269 Sep 23 14:41 passwd
----------. 1 root root 1255 Sep 23 14:41 shadow
It should be noted that the clone is not read-only, and we can modify it. The two mountpoints will maintain a common set of blocks, but are otherwise independent.
# cp /etc/fstab /vault/history

# ls -l /vault/history
-rw-r--r--. 1 root root  541 Sep 23 16:23 fstab
-rw-r--r--. 1 root root  965 Sep 23 14:41 group
-rw-r--r--. 1 root root 2269 Sep 23 14:41 passwd
----------. 1 root root 1255 Sep 23 14:41 shadow
Assuming that we have completed our recovery activity, we can destroy the clone and snapshot. It might be wise to scrub the parent dataset at that point to verify its integrity; then we can list our zpool history to see evidence of our session.
# zfs destroy vault/history
# zfs destroy vault@goodver
# zpool scrub vault

# zpool status vault
  pool: vault
 state: ONLINE
 scrub: scrub in progress for 0h1m, 30.93% done, 0h3m to go
config:

        NAME        STATE     READ WRITE CKSUM
        vault       ONLINE       0     0     0
          sdb       ONLINE       0     0     0

errors: No known data errors

# zpool history vault
For our final words on zfs-fuse, we list the software version history for zpool and zfs. Please note - it is critical that you create your zpools with the lowest ZFS version that you wish to use, which in this case is zpool version 23, and zfs version 4:
# zpool upgrade -v
This system is currently running ZFS pool version 23.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history
 5   Compression using the gzip algorithm
 6   bootfs pool property
 7   Separate intent log devices
 8   Delegated administration
 9   refquota and refreservation properties
 10  Cache devices
 11  Improved scrub performance
 12  Snapshot properties
 13  snapused property
 14  passthrough-x aclinherit
 15  user/group space accounting
 16  stmf property support
 17  Triple-parity RAID-Z
 18  Snapshot user holds
 19  Log device removal
 20  Compression using zle (zero-length encoding)
 21  Deduplication
 22  Received properties
 23  Slim ZIL

# zfs upgrade -v
The following filesystem versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS filesystem version
 2   Enhanced directory entries
 3   Case insensitive and File system unique identifier (FUID)
 4   userquota, groupquota properties
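If a pool must remain importable by this zfs-fuse release, its version can be pinned at creation time (a sketch; verify the result with a trial import under zfs-fuse before trusting it):

# zpool create -o version=23 vault /dev/sdb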
The zfs.ko kernel module can be obtained from the ZFS on Linux site and loaded into Linux, providing high-performance ZFS with full functionality. In order to install this package, the FUSE version of ZFS must be removed (assuming it was installed as in the previous section):
# rpm -e zfs-fuse
Removing files since we removed the last package
After the FUSE removal, a new yum repository must be installed on the target system. ZFS on a RedHat-derivative will likely require network access to the ZFS repository (stand-alone installations will be more difficult, and are not covered here).
# yum install \
  http://download.zfsonlinux.org/epel/zfs-release.el7_4.noarch.rpm
...
=========================================================================
 Package          Repository                                        Size
=========================================================================
Installing:
 zfs-release      /zfs-release.el7_4.noarch                        2.9 k
=========================================================================

Install  1 Package

Total size: 2.9 k
Installed size: 2.9 k
Is this ok [y/d/N]: y
...
Installed:
  zfs-release.noarch 0:1-5.el7_4

Complete!
After the repository is configured, the GPG key must be loaded:
# gpg --quiet --with-fingerprint /etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux
pub  2048R/F14AB620 2013-03-21 ZFS on Linux
      Key fingerprint = C93A FFFD 9F3F 7B03 C310  CEB6 A9D5 A1C0 F14A B620
sub  2048R/99685629 2013-03-21
At this point, we are ready to proceed with a native ZFS installation.
The test system used here, Oracle Linux 7.4, normally can boot from one of two kernels. There is a "RedHat-Compatible Kernel," and also an "Unbreakable Enterprise Kernel" (UEK). While the FUSE version is completely functional under both kernels, the native ZFS installer does not work with the UEK (meaning further that Oracle Ksplice is precluded with the standard ZFS installation). If you are running Oracle Linux, you must be booted on the RHCK when manipulating a native ZFS configuration, and this includes the initial install. Do not attempt installation or any other native ZFS activity while running the UEK.
# rpm -qa | grep ^kernel | sort
kernel-3.10.0-693.2.2
kernel-devel-3.10.0-693.2.2
kernel-headers-3.10.0-693.2.2
kernel-tools-3.10.0-693.2.2
kernel-tools-libs-3.10.0-693.2.2
kernel-uek-4.1.12-188.8.131.52
kernel-uek-firmware-4.1.12-184.108.40.206
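The running kernel can be confirmed with uname - a "uek" string in the release indicates that the Unbreakable kernel is active:

# uname -r
3.10.0-693.2.2.el7.x86_64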
The ZFS installation will actually use yum to compile C source code in the default configuration (DKMS), then prepare an initrd with dracut (use top to monitor this during the install). This installation will take some time. There are notes on using a pre-compiled zfs.ko collection in an alternate installation configuration (kABI), but the test platform used here is Oracle Linux, and the RedHat-Compatible Kernel may not be fully interoperable with the precompiled zfs.ko collection (not tested while preparing this document), so the default DKMS build was retained. An example installation session is below:
# yum install kernel-devel zfs
...
=========================================================================
 Package                 Repository             Size
=========================================================================
Installing:
 zfs                     zfs                   405 k
Installing for dependencies:
 dkms                    epel                   78 k
 libnvpair1              zfs                    29 k
 libuutil1               zfs                    35 k
 libzfs2                 zfs                   129 k
 libzpool2               zfs                   587 k
 spl                     zfs                    29 k
 spl-dkms                zfs                   454 k
 zfs-dkms                zfs                   4.9 M
=========================================================================

Install  1 Package (+8 Dependent packages)

Total download size: 6.6 M
Installed size: 29 M
Is this ok [y/d/N]: y
...
  - Installing to /lib/modules/3.10.0-693.2.2.el7.x86_64/extra/
spl:
splat.ko:
zavl:
znvpair.ko:
zunicode.ko:
zcommon.ko:
zfs.ko:
zpios.ko:
icp.ko:

Installed:
  zfs.x86_64 0:0.7.1-1.el7_4

Complete!
After the yum session concludes, the native zfs.ko can be loaded into the RHCK Linux kernel, which will pull in a number of dependent modules:
# modprobe zfs

# lsmod | awk 'NR==1||/zfs/'
Module                  Size  Used by
zfs                  3517672  0
zunicode              331170  1 zfs
zavl                   15236  1 zfs
icp                   266091  1 zfs
zcommon                73440  1 zfs
znvpair                93227  2 zfs,zcommon
spl                   102592  4 icp,zfs,zcommon,znvpair
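If the module should load automatically at boot, a systemd modules-load entry is one option (a sketch - the zfs packages may already arrange this on some systems):

# echo zfs > /etc/modules-load.d/zfs.conf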
At this point, the pool created by FUSE can be imported back into the system (note the error):
# /sbin/zpool import vault
cannot import 'vault': pool was previously in use from another system.
Last accessed at Sun Sep 24 2017
The pool can be imported, use 'zpool import -f' to import the pool.

# /sbin/zpool import vault -f
The import will automatically mount the dataset:
# ls -l /vault
-rw-r--r--. 1 root root    965 Sep 23 14:41 group
-rw-r--r--. 1 root root   1760 Sep 23 16:14 nsswitch.conf
-rw-r--r--. 1 root root   2269 Sep 23 14:41 passwd
-rw-r--r--. 1 root root     98 Sep 23 16:14 resolv.conf
-rw-r--r--. 1 root root 670311 Sep 23 16:14 services
We can create a snapshot, then delete another critical file:
# /sbin/zfs snapshot vault@goodver

# rm /vault/group
rm: remove regular file '/vault/group'? y
At this point, we can search the /vault/.zfs directory for the missing file (note that .zfs does not appear with ls -a, but it is nevertheless present):
# ls -la /vault
drwxr-xr-x.  2 root root      6 Sep 25 17:47 .
dr-xr-xr-x. 19 root root   4096 Sep 25 17:17 ..
-rw-r--r--.  1 root root   1760 Sep 23 16:14 nsswitch.conf
-rw-r--r--.  1 root root   2269 Sep 23 14:41 passwd
-rw-r--r--.  1 root root     98 Sep 23 16:14 resolv.conf
-rw-r--r--.  1 root root 670311 Sep 23 16:14 services

# ls -l /vault/.zfs
dr-xr-xr-x. 2 root root 2 Sep 23 13:54 shares
drwxrwxrwx. 2 root root 2 Sep 25 17:47 snapshot

# ls -l /vault/.zfs/snapshot/
drwxr-xr-x. 2 root root 7 Sep 24 18:58 goodver

# ls -l /vault/.zfs/snapshot/goodver
-rw-r--r--. 1 root root    965 Sep 23 14:41 group
-rw-r--r--. 1 root root   1760 Sep 23 16:14 nsswitch.conf
-rw-r--r--. 1 root root   2269 Sep 23 14:41 passwd
-rw-r--r--. 1 root root     98 Sep 23 16:14 resolv.conf
-rw-r--r--. 1 root root 670311 Sep 23 16:14 services
Native ZFS implements newer software versions of zpool and zfs - remember, it is critical that you create your zpools with the lowest ZFS version that you ever intend to use, which in this case is zpool version 28, and zfs version 5. The FUSE version is far simpler to install on a fresh RedHat OS for recovery purposes, so consider carefully before upgrading to the native ZFS versions.
# /sbin/zpool upgrade -v
...
 23  Slim ZIL
 24  System attributes
 25  Improved scrub stats
 26  Improved snapshot deletion performance
 27  Improved snapshot creation performance
 28  Multiple vdev replacements

# /sbin/zfs upgrade -v
...
 4   userquota, groupquota properties
 5   System attributes
Strong words of warning should accompany the use of native ZFS on a RedHat-derivative. Kernel upgrades are a cause for concern - if the ZFS family of modules is not installed correctly, then no pools can be brought online. For this reason, it is imperative to retain known working kernels when upgraded kernels are installed. As we have previously noted, Oracle's UEK is not ZFS-capable when using the default native installation.
OS release upgrades also introduce even more rigorous warnings. Before attempting an upgrade, remove all of the ZFS software. Upon upgrade completion, repeat the ZFS software installation using a yum repository that is specific for the new OS release. The ZFS on Linux site currently lists repositories for RedHat releases 6, 7.3 and 7.4. It is wise to stay current on patches and releases, and strongly consider upgrading a 7.0 - 7.2 RedHat-derivative where native ZFS installation is contemplated or desired.
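A sketch of that upgrade sequence, using the package names from the installation session above (the exact removal list may vary by release, and the zfs-release RPM must match the new OS):

# yum remove zfs zfs-dkms spl spl-dkms zfs-release
    ...perform the OS release upgrade...
# yum install http://download.zfsonlinux.org/epel/zfs-release.<new-release>.noarch.rpm
# yum install kernel-devel zfs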
Note also that Solaris ZFS has encryption and Windows SMB capability - these are not functional in the Linux port.
Perhaps someday Oracle will permit the RedHat family to bundle native ZFS by relaxing the license terms. That will be a very good day.
Definite legal ambiguity remains with ZFS. While Ubuntu recently announced support for the zfs.ko module for their container subsystem, their legal analysis remains murky. Unsurprisingly, none of the major enterprise Linux distributions have been willing to bundle ZFS as a first-class supported filesystem.
Into this void comes Antergos, a descendant of Arch Linux. The Antergos installer will download and compile ZFS source code into the installation kernel in a manner similar to the previous section. While the example installation detailed here did not proceed without incident, it did leave a working, mirrored zpool for the root filesystem, running the same release version as the native RPM installs.
What Antergos did not do was install the Linux kernel itself to both drives. A separate ext4 partition was configured for /boot on only one drive, as GRUB (as configured by the installer) does not support ZFS, and there appears to be a current lack of alternatives for booting Linux from a ZFS dataset.
I had expected to see an installation similar to MirrorDisk/UX for HP-UX, where the firmware is configured with primary and alternate boot paths, and the OS is intelligent enough to manage identical copies of the boot and root filesystems on multiple drives. What I actually found was the root filesystem mirrored by ZFS, but the kernel in /boot is not, nor is the system bootable if that partition fails. A fault-tolerant Antergos installation will require RAID hardware - ZFS is not sufficient.
The Antergos Live ISO was downloaded and written as a bootable image to a flash drive with the command:
# dd bs=4M if=antergos-17.9-x86_64.iso of=/dev/sdc
Please note that the Antergos Minimal ISO does not support ZFS - it's only in the Live ISO. Internet access is required while the installer is running - the latest packages will be downloaded in the installer session, and very little is pulled from the ISO media.
After booting your system on the live ISO, ensure that you are connected to the internet and activate the installer dialog. Note the warnings of beta software status - whether this refers to ZFS, BtrFS, or other Linux RAID configurations is an open question.
Select your territory or locale, time zone, keyboard layout (I suggest the "euro on 5"), and choose your desktop environment. After I chose GNOME, I also added Firefox and the SSH Service. Finally, a ZFS option is presented - enable it:
Below, two SATA drives have been configured in a zpool mirror. I named the pool "root," which may have caused an error at first boot (described below). Note also the 4k block size toggle - this is a performance-related setting that might be advisable for some configurations and usage patterns.
The next pages prompt for the final confirmation before the selected drives are wiped, after which you will be prompted to create a default user.
While the installer is running, we can examine the zpool. After opening a terminal and running sudo sh, I found the following information about the ZFS configuration:
sh-4.4# zpool history
History for 'root':
2017-09-30 16:10:28 zpool create -f -m /install root mirror /dev/sda2 /dev/sdb
zpool set bootfs=root root
zpool set cachefile=/etc/zfs/zpool.cache root
zfs create -V 2G root/swap
zfs set com.sun:auto-snapshot=false root/swap
zfs set sync=always root/swap
zpool export -f root
zpool import -f -d /dev/disk/by-id -R /install 13754361671922204858
/dev/sda2 has been mirrored to /dev/sdb, showing that Antergos has installed a zpool on an MBR partition. More importantly, these drives are not configured identically - this is not a true redundant mirror with the ability to boot from either drive.
After the installation packages have been fetched and installed, the installer compiles zfs.ko - you can see the calls to gcc if you run the top command in a terminal window.
My installation session completed normally, and the system rebooted. GRUB presented me with the Antergos boot splash, but after booting I was thrown into single user mode:
starting version 234
ERROR: resume: no device specified for hibernation
ZFS: Unable to import pool root.
cannot import 'root': pool was previously in use from another system.
Last accessed by <unknown> (hostid=0) at Tue Oct  3 00:06:34 2017
The pool can be imported, use 'zpool import -f' to import the pool.
ERROR: Failed to mount the real root device.
Bailing out, you are on your own. Good luck.

sh: can't access tty; job control turned off

[rootfs ]# zpool import -f root
cannot mount '/': directory is not empty

[rootfs ]# zfs create root/hold

[rootfs ]# cat /dev/vcs > /hold/vcs.txt
The zpool import error above was also encountered when the FUSE pool was imported by the native driver. I ran the force import (zpool import -f root), which succeeded, then created a new dataset and copied the terminal contents to it so you can see the session here. After pressing CTRL-ALT-DELETE, the system booted normally. Naming the zpool "root" in the installer may have caused this problem.
My test system does not have ECC memory, so I attempted to adjust the undocumented kernel parameter below, followed by a reboot:
echo options zfs zfs_flags=0x10 >> /etc/modprobe.d/zfs.conf
After the test system came up, I checked the flags and found that the ECC memory feature had not been set. I set it manually, then ran a scrub:
# cat /sys/module/zfs/parameters/zfs_flags
0

# echo 0x10 > /sys/module/zfs/parameters/zfs_flags

# cat /sys/module/zfs/parameters/zfs_flags
16

# zpool scrub root

# zpool status root
  pool: root
 state: ONLINE
  scan: scrub in progress since Sun Oct  1 12:08:50 2017
        251M scanned out of 5.19G at 25.1M/s, 0h3m to go
        0B repaired, 4.72% done
config:

        NAME                              STATE     READ WRITE CKSUM
        root                              ONLINE       0     0     0
          mirror-0                        ONLINE       0     0     0
            wwn-0x5000cca20cda462e-part2  ONLINE       0     0     0
            wwn-0x5000c5001a0d9823        ONLINE       0     0     0

errors: No known data errors
I also found that the kernel and initrd do not incorporate version numbers in their filenames, indicating that an upgrade may overwrite them. It will likely be wise to copy them to alternate locations within /boot to ensure that a fallback kernel is available (this would need extra menu entries in GRUB).
# ls -l /boot
-rw-r--r-- 1 root root 26729353 Sep 30 17:25 initramfs-linux-fallback.img
-rw-r--r-- 1 root root  9225042 Sep 30 17:24 initramfs-linux.img
-rw-r--r-- 1 root root  5474064 Sep 21 13:34 vmlinuz-linux
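A sketch of such a fallback copy (the GRUB menu entries pointing at these names would still need to be added by hand):

# cp /boot/vmlinuz-linux /boot/vmlinuz-linux.bak
# cp /boot/initramfs-linux.img /boot/initramfs-linux.img.bak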
We continue our investigation into the Antergos zpool mirror by probing the drives with fdisk:
sh-4.4# fdisk -l /dev/sda
Disk /dev/sda: 232.9 GiB, 250059350016 bytes, 488397168 sectors
Disklabel type: dos

Device     Boot   Start       End   Sectors   Size Id Type
/dev/sda1  *       2048   1048575   1046528   511M 83 Linux
/dev/sda2       1048576 488397167 487348592 232.4G 83 Linux

sh-4.4# fdisk -l /dev/sdb
Disk /dev/sdb: 149 GiB, 160000000000 bytes, 312500000 sectors
Disklabel type: gpt

Device         Start       End   Sectors  Size Type
/dev/sdb1       2048 312481791 312479744  149G Solaris /usr & Apple ZFS
/dev/sdb9  312481792 312498175     16384    8M Solaris reserved 1
Antergos appears to be playing fast and loose with the partition types. We also learn that the /boot partition is a non-redundant ext4:
# grep -v ^# /etc/fstab
UUID=f9fc... /boot ext4 defaults,relatime,data=ordered 0 0
/dev/zvol/root/swap swap swap defaults 0 0

# df | awk 'NR==1||/boot/'
Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/sda1         498514 70732    418454  15% /boot
Antergos is not configuring a completely fault-tolerant drive mirror, and this is a known problem. The ext4 partition holding the kernel is a single point of failure, apparently required for GRUB. In the event of the loss of that partition, the Live ISO could be used to access the zpool, but restoring full system availability would require much more effort. The same will likely apply to raidz.
ZFS is the filesystem that is "often imitated, never duplicated."
The main contenders for ZFS functionality appear to be BtrFS, Apple APFS, and Microsoft's ReFS. After many years of BtrFS development, it still lacks performance and maturity ("we are still refusing to support 'Automatic Defragmentation', 'In-band Deduplication' and higher RAID levels, because the quality of these options is not where it ought to be"). Apple very nearly bundled ZFS into OSX, but backed out and produced APFS instead. Microsoft is also trying to create a next-generation filesystem named ReFS, but in doing so they are once again proving Henry Spencer's famous quote, "Those who do not understand Unix are condemned to reinvent it, poorly." ReFS will lack compression, deduplication, and copy-on-write snapshots.
All of us have critical data that we do not wish to lose. ZFS is the only filesystem option that is stable, protects our data, is proven to survive in most hostile environments, and has a lengthy usage history with well-understood strengths and weaknesses. While ZFS will likely be loaded by many Linux administrators who need its features, the installation and maintenance tools have obvious shortcomings that can trap the unwary.
It is time once again to rely on Oracle's largesse, and to ask them to fully open the ZFS filesystem to Linux for the benefit of the community. This will solve many problems, including Oracle's, and will engender goodwill in the Linux community that, at least from a filesystem perspective, is sorely lacking.