Wednesday, October 7, 2009
ZFS: The Next Word
Abstract
ZFS is the latest in disk and hybrid storage pool technology from Sun Microsystems. Unlike competing 32-bit file systems, ZFS is a 128-bit file system, allowing for near-limitless storage boundaries. ZFS is not a stagnant architecture but a dynamic one, with changes landing often in the open source code base.
What's Next in ZFS?
Jeff Bonwick and Bill Moore gave a presentation at the Kernel Conference Australia 2009 on what is happening next in ZFS. A lot of the features were driven by the Fishworks team as well as by the Lustre clustering file system team.
[Embedded video: "ZFS: The Next Word" presentation, Kernel Conference Australia 2009 (slx.sun.com)]
What are the new enhancements in functionality?
- Enhanced Performance
  Enhancements all over the system.
- Quotas on a Per-User Basis
  ZFS has always had quotas on a per-filesystem basis; the original assumption was that each user would get a filesystem, but that does not scale well to thousands of users with many existing management tools.
  Works with industry-standard POSIX UIDs and names.
  Works with Microsoft SMB SIDs and names.
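As a rough illustration of the per-user quota interface in current OpenSolaris builds (the pool, filesystem, and user names below are placeholders):

```shell
# Set a 10 GB quota for POSIX user "alice" on one filesystem.
zfs set userquota@alice=10G tank/home

# SMB identities can be used as well, in user@domain form.
zfs set userquota@alice@example.com=10G tank/home

# Report per-user space consumption and quotas on the filesystem.
zfs userspace tank/home
```

Group quotas work the same way via the `groupquota@` property.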
- Pool Recovery
  Disk drives often outright lie to the operating system when they re-order the writing of blocks.
  Drives also lie when they receive a "write barrier", indicating that the write completed when it did not.
  If there is a power outage in the middle of a write, even after a write barrier was issued, the drive will often silently drop the "write commit", making the OS think the writes were safe when they were not, resulting in pool corruption.
  The simplification in this area: during a scrub, go back to an earlier uberblock and correct the pool, and never overwrite a recently changed transaction group when committing a new transaction.
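Recent OpenSolaris builds expose this rollback-based recovery through `zpool import`; a sketch, with the pool name as a placeholder:

```shell
# Attempt a normal import first.
zpool import tank

# If the pool is damaged, ask ZFS to rewind to an earlier,
# consistent uberblock, discarding the last few transactions.
zpool import -F tank

# -n reports what -F would do without actually doing it.
zpool import -Fn tank
```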
- Triple-Parity RAID-Z
  Double-parity RAID-Z has been around from the beginning (i.e. lose 2 out of 7 drives).
  Triple-parity RAID-Z (i.e. lose 3 out of 10 drives) allows for bigger, faster, higher-BER drives.
  Quadruple parity is on the way.
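Triple parity surfaced as a new `raidz3` vdev type; for example (disk names are placeholders):

```shell
# Create a triple-parity pool: any three of the ten disks
# can fail without data loss.
zpool create tank raidz3 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 \
                         c0t5d0 c0t6d0 c0t7d0 c0t8d0 c0t9d0

# Compare: raidz2 survives two failures, raidz (single parity) one.
zpool status tank
```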
- De-duplication
  A very nice capacity enhancement with application, desktop, and server virtualization.
- Encryption
- Shadow Migration (aka Brain Slug?)
  Pull out that old file server and replace it with a ZFS [NFS] server without any downtime.
- BP Rewrite & Device Removal
- Dynamic LUN Expansion
  Before, if a larger drive was inserted, the default behavior was to resize the LUN.
  During a hot-plug, the system administrator is told that the LUN has been resized.
  A property was added to make LUN expansion automatic or manual.
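The automatic/manual choice surfaced as the pool-level `autoexpand` property; a sketch (pool and device names are placeholders):

```shell
# Let the pool grow automatically when larger LUNs appear.
zpool set autoexpand=on tank

# Or leave it manual, and expand a device explicitly after it
# has been replaced with a larger one.
zpool online -e tank c0t2d0
```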
- Snapshot Hold Property
  Enter an arbitrary string for a tag and hold the snapshot; a destroy issued while the hold exists is deferred, and when an "unhold" is done, the destroy completes.
  Makes ZFS look sort of like a relational database with transactions.
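The hold/unhold cycle looks like this in current builds (dataset, snapshot, and tag names are examples):

```shell
# Tag the snapshot with an arbitrary hold name.
zfs hold mybackup tank/home@tuesday

# A destroy now only marks the snapshot for deferred destruction.
zfs destroy -d tank/home@tuesday

# List outstanding holds; releasing the last one completes
# the deferred destroy.
zfs holds tank/home@tuesday
zfs release mybackup tank/home@tuesday
```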
- Multi-Home Protection
  If a pool is shared between two hosts, this works great as long as the clustering software is flawless.
  The Lustre team prototyped an on-disk heartbeat protocol to make multi-home protection inherent in ZFS.
- Offline and Remove a Separate ZFS Log Device
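Log device removal is exposed through the ordinary `zpool` commands; a sketch (device name is a placeholder):

```shell
# Take a separate intent-log device out of service...
zpool offline tank c0t5d0

# ...and then remove it from the pool entirely.
zpool remove tank c0t5d0
```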
- Extend Underlying SCSI Framework for Additional SCSI Commands
  The SCSI "Trim" command allows ZFS to direct less wear leveling at unused flash areas, increasing the life and performance of flash.
- De-Duplication in a ZFS Send-Receive Stream
  This is in the works, to make backups and restores more efficient.
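Neither on-disk de-duplication nor de-duplicated send streams had integrated at the time of the talk; the interfaces under discussion look roughly like the following (property name, flag, and dataset names are the proposed/assumed forms):

```shell
# Enable block-level de-duplication on a dataset.
zfs set dedup=on tank/vm-images

# Build a de-duplicated send stream, so repeated blocks are
# transmitted only once in the backup.
zfs snapshot tank/vm-images@backup
zfs send -D tank/vm-images@backup > /backup/vm-images.zstream
```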
- Hybrid Storage Pools
  Makes everything go (a lot) faster with a little cache (lower cost) and slower drives (lower cost):
  - Expensive (fast, reliable) mirrored enterprise SSDs as a write cache for the ZFS Intent Log
  - Inexpensive consumer-grade SSDs as a block-level read cache in the ZFS Level 2 ARC
  - Inexpensive consumer-grade drives with massive disk storage potential and 5x lower energy consumption
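All three tiers can be declared in one pool-creation command; a sketch with placeholder device names:

```shell
# Large, cheap SATA drives for capacity, a mirrored enterprise
# SSD pair as the intent log, and a consumer SSD as L2ARC cache.
zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
    log mirror c2t0d0 c2t1d0 \
    cache c3t0d0
```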
- New Block Allocator
  The original allocator was an extremely simple 80-line code segment that works well on empty pools; it was finally re-engineered for performance when the pool gets full. ZFS will now use both algorithms.
- Raw Scrub
  Increases performance by running through the pool and metadata to validate checksums without uncompressing the data in each block.
- Parallel Device Open
- Zero-Copy I/O
  The folks in the Lustre cluster storage group requested and implemented this feature.
- Scrub Prefetch
  A scrub will now prefetch blocks to increase disk utilization and decrease scrub time.
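Scrubbing itself is unchanged from an administrator's point of view; with prefetch the disks simply stay busier and the pass completes sooner (pool name is a placeholder):

```shell
# Kick off a background scrub of the whole pool.
zpool scrub tank

# Watch progress and any checksum repairs.
zpool status -v tank
```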
- Native iSCSI
  This is part of the COMSTAR enhancements. Yes, this is there today under OpenSolaris, and it offers tremendous performance improvements and simplified management.
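A minimal COMSTAR sketch, assuming the storage-server packages are installed (volume name, size, and the GUID placeholder are examples):

```shell
# Create a ZFS volume to export as a SCSI logical unit.
zfs create -V 50G tank/lun0

# Register it with the COMSTAR framework as a logical unit.
sbdadm create-lu /dev/zvol/rdsk/tank/lun0

# Make the LU visible to initiators and create an iSCSI target.
stmfadm add-view <GUID-printed-by-sbdadm>
itadm create-target
```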
- Sync Mode
  NFS benchmarking in Solaris is shown to be slower than Linux because Linux does not guarantee that a write to NFS actually makes it to disk (which violates the NFS protocol specification). This feature allows Solaris to use a "Linux" mode, where writes are not guaranteed, to increase performance at the expense of that guarantee.
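This knob eventually surfaced as a per-dataset property; assuming the `sync` property as later implemented (dataset name is a placeholder):

```shell
# "Linux-like" behavior: acknowledge synchronous writes
# immediately and let them reach disk with the next
# transaction group.
zfs set sync=disabled tank/nfs-export

# Back to honoring the protocol's write guarantee.
zfs set sync=standard tank/nfs-export
```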
- Just-In-Time Decompression
  Prefetch hides the latency of I/O but burns CPU. This allows prefetch to get the data without decompressing it until it is needed, saving CPU time and also conserving kernel memory.
- Disk Drives with Higher Capacity and Less Reliability
  Formatting options reduce error recovery on a sector-by-sector basis.
  30-40% improved capacity and performance.
  ZFS error-recovery counts are increased to compensate.
- Mind-the-Gap Reading & Writing Consolidation
  Read gaps are consolidated so that a single aggregate read can be used, reading the data between adjacent sectors and throwing away the intermediate data, since fewer I/Os allow for streaming data from drives more efficiently.
  Write gaps are consolidated so that a single aggregate write can be used, even if adjacent regions have a blank-sector gap between them, streaming data to drives more efficiently.
- ZFS Send and Receive
  Performance has been improved using the same Scrub Prefetch code.
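The send/receive workflow these improvements speed up looks like this (pool, dataset, snapshot, and host names are placeholders):

```shell
# Full replication of a snapshot to another host.
zfs snapshot tank/home@monday
zfs send tank/home@monday | ssh backuphost zfs receive pool/home

# Incremental: only blocks changed since @monday are sent.
zfs snapshot tank/home@tuesday
zfs send -i tank/home@monday tank/home@tuesday | \
    ssh backuphost zfs receive pool/home
```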
The ZFS implementation in the Solaris 10 10/09 release actually has some of the ZFS features detailed in the most recent conferences.