DRAFT DRAFT DRAFT DRAFT DRAFT
Close-To-Open Cache Consistency
in the Linux NFS Client
Chuck Lever,
Network Appliance, Inc.
cel@netapp.com
DRAFT DRAFT DRAFT DRAFT DRAFT
Abstract
To support close-to-open cache consistency,
the Linux NFS client aggressively times out its DNLC entries.
This doesn't guarantee total consistency, however, and
results in unpredictable behavior.
In this report,
we describe the current Linux DNLC entry revalidation mechanism,
compare the network behavior of the Linux NFS client implementation
with other client implementations,
and discuss possible improvements.
We describe a solution that allows the Linux NFS client
to guarantee close-to-open cache consistency.
Finally, we show how to shut off consistency checking for
the sake of performance in specific environments.
Introduction
UNIX(tm) was one of the first operating systems to
support
hierarchical file systems
[3].
A hierarchical file system allows users to organize files
with a common purpose into directories.
It also allows directories to contain other directories,
creating a hierarchy of directories that often resembles the
branches of an inverted tree.
Generally, an operating system invokes a
lookup operation
to find files contained within a directory.
File system operations on files stored in a hierarchical file system
traverse the directory structure via multiple
lookup operations,
one directory at a time.
Lookup operations are so numerous on active systems
that a mechanism is required to help speed lookup
operations so that they don't become a bottleneck
for system performance.
Modern flavors of UNIX(tm) maintain a cache of results from recent
file system directory lookup operations
[9].
In this report we refer to this cache as the operating system's
directory name lookup cache, or DNLC for short.
In the Linux kernel, this cache is called the directory entry cache, or
dcache [1].
In most UNIX(tm) systems, the DNLC is only part of the pathname
resolution logic, but the dcache is integrated into
the Linux kernel's virtual file system (VFS) layer.
For file systems where data is accessed on the same system
where it is stored permanently, entries in a system's DNLC
can last as long as there is room to keep them in the cache.
In this instance, applications run on the same operating
system that controls the disk and file system metadata.
The operating system is fully aware of any change to local
filenames, so the DNLC is always kept up-to-date.
However when files are stored on remote systems, some kind
of cache coherency must be maintained for any file metadata
stored on systems where remote file data is accessed
and modified.
Clients of NFSv2 and v3 file servers, for example,
usually expire file system metadata periodically
so that it will be revalidated the next time it is accessed.
This applies to any entries in a client's DNLC, and
to file attributes cached by the client.
Network file systems such as AFS go to great lengths
to help a client maintain a coherent view of file systems
it shares with other clients
[2, 7].
On Linux, every lookup operation that results in a DNLC hit
invokes a file system dependent operation to revalidate the
cached entry before the entry is made available for use by other parts
of the operating system.
Most file systems that maintain file data locally do not need any
cache entry revalidation.
The Linux NFS client, however, takes this opportunity to revalidate
the cached entry.
If the entry is considered invalid, the NFS client requests a fresh
on-the-wire lookup to validate the file's name and parent directory,
its file handle, and any file attributes
corresponding to the cached entry.
To support certain aspects of the NFS standard, the Linux client
aggressively times out its DNLC entries under certain circumstances.
This is not enough to guarantee cache consistency, however.
In this report,
we describe the current Linux dcache entry revalidation mechanism,
compare the network behavior of the Linux NFS client with other
client implementations,
and discuss possible improvements.
Dcache Operation
Dcache entries are linked together in a tree-like structure.
Each entry points to its parent.
Each entry that refers to a directory contains a list of
its known children (i.e., children discovered via previous
lookup operations).
Each entry contains a name, and a pointer to an associated
inode (the in-memory structure that Linux uses like a vnode).
Dcache entries are also inserted into a hash table, which
constitutes the system's directory entry cache.
They are hashed via their name and the address of their
parent dcache entry.
A lookup operation in this cache starts with the parent's
dentry and the name to be looked up, and returns a dentry
that matches on the name and the parent.
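To make this concrete, the following user-space sketch models a
dcache-style lookup keyed on the parent entry's address and the
component name. The structure layout, hash function, and function
names are simplified stand-ins chosen for illustration; they are not
the kernel's actual implementation.

    /* Simplified, user-space model of a dcache-style lookup keyed on
     * (parent entry, component name).  Not the kernel's actual code. */
    #include <string.h>

    struct inode;                         /* opaque in this sketch */

    struct dentry {
        const char    *name;              /* component name */
        struct dentry *parent;            /* parent directory entry */
        struct inode  *inode;             /* NULL for a negative entry */
        struct dentry *next;              /* hash chain */
    };

    #define DCACHE_BUCKETS 1024
    static struct dentry *dcache[DCACHE_BUCKETS];

    static unsigned int dcache_hash(const struct dentry *parent, const char *name)
    {
        unsigned int hash = (unsigned int)(unsigned long)parent;

        while (*name)
            hash = hash * 31 + (unsigned char)*name++;
        return hash % DCACHE_BUCKETS;
    }

    /* Return the cached child of 'parent' named 'name', or NULL on a miss. */
    static struct dentry *dcache_lookup(struct dentry *parent, const char *name)
    {
        struct dentry *d;

        for (d = dcache[dcache_hash(parent, name)]; d != NULL; d = d->next)
            if (d->parent == parent && strcmp(d->name, name) == 0)
                return d;
        return NULL;
    }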
Dcache entry life cycle
Linux dcache entries behave in some ways like vnodes behave
on other flavors of UNIX(tm) [4].
On Linux, in-memory file metadata is split between dcache entries
and inodes.
The Linux VFS layer stores filename metadata in the dcache,
and per-file metadata in the inode cache.
In most instances, the VFS layer first looks up a file's
dcache entry, then retrieves the inode from the
d_inode
field in the dcache entry.
This makes inode cache lookups rare.
Inodes can outlive dcache entries.
If a file is renamed, its inode stays, but its dcache entry
is deleted and a new one is created for the new name.
Dcache entries can also represent negative lookup results,
if the entry's inode pointer is NULL valued.
Dcache entry operations vector
Every dcache entry has an optional operations vector
associated with it.
This vector allows file system implementors to overload
standard dentry operations.
File systems can choose to leave this operations vector
NULL, in which case the VFS layer uses default behavior.
File systems may also choose to implement only some of
the operations, in which case they leave the unimplemented
function pointers as
NULL values.
In the Linux kernel current as of this writing (2.4.2),
this vector contains six virtual functions, described below;
a sketch of the complete vector appears after the list:
- int d_revalidate(struct dentry * dentry, int flags)
When a VFS lookup request finds a result
in the dcache, this operation, if it exists, is
invoked to ensure the cached result is still valid.
The flags argument contains flags that indicate
the type of lookup operation (last in pathname, return
the parent of this dentry, require a directory as the
final component, and so on).
This function returns an integer value: one if the dcache
entry is valid or has been revalidated, and zero if the
dcache entry should be invalidated and refreshed with a
file system dependent lookup operation.
When the VFS layer encounters a zero return from
d_revalidate, it unhashes the dentry from its
parent and does a fresh real lookup to attempt to replace it.
Most filesystems leave this NULL, because all their
dentries in the dcache are always valid.
The NFS client defines this operation, using it to expire and
revalidate dcache entries.
- int d_hash(struct dentry * dentry, struct qstr * name)
If it exists,
d_hash is invoked in place of the dcache's standard hash
function.
It can be used for file systems that have special naming requirements
which make the standard hash function inefficient or unusable.
The dcache's hash function hashes on a file's name to determine
its location in the cache.
This function returns zero normally, and a negative errno-value
if some error occurs.
The NFS client leaves this operation as NULL, since
it can use POSIX naming conventions supported by the VFS layer
by default.
- int d_compare(struct dentry * dentry, struct qstr * name1, struct qstr * name2)
This function, if it exists, is
invoked to determine whether two names are equivalent.
It can be used for file systems that have special naming requirements
which make the standard name comparison function inefficient or unusable
(for example, if the file system uses case insensitive file names).
The dentry argument is the parent directory
that contains the two names.
This function returns one if the two names are considered
the same by the file system,
or zero if the two names are not equivalent.
The NFS client leaves this operation as NULL, since
it can use POSIX naming conventions supported by the VFS layer
by default.
- int d_delete(struct dentry * dentry)
This function, if it exists, is
invoked when a dcache entry's reference count drops to zero.
If this function returns one, the dcache entry is immediately
removed from the dcache.
If this function returns zero, the dcache entry is
made dormant but is left in the cache,
unless it is unreachable,
in which case it is removed from the dcache.
The NFS client defines this operation to clean up after silly
renames.
- void d_iput(struct dentry * dentry, struct inode * inode)
This function, if it exists, is
invoked just before the VFS layer unbinds
the associated inode from the dcache entry.
This happens when a file is deleted or when the kernel's
memory manager needs to reclaim dcache entries and inodes.
The NFS client defines this operation to clean up after silly
renames.
- void d_release(struct dentry * dentry)
This function, if it exists, is
invoked when a dcache entry is returned to the dentry SLAB cache
(that is, when it is deallocated).
The NFS client, like most file systems,
leaves this operation as NULL.
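Taken together, the six operations form a single table of function
pointers. The sketch below is assembled from the signatures listed
above; the comments record which slots the Linux NFS client fills in
and which it leaves NULL, as described in the preceding list. Treat
the exact member ordering as illustrative rather than a copy of the
2.4 header.

    struct dentry;
    struct inode;
    struct qstr;

    /* Sketch of the dentry operations vector, assembled from the
     * signatures above.  Comments note how the NFS client uses each slot. */
    struct dentry_operations {
        int  (*d_revalidate)(struct dentry *, int);       /* NFS: expire and revalidate entries */
        int  (*d_hash)(struct dentry *, struct qstr *);   /* NFS: NULL, default hashing */
        int  (*d_compare)(struct dentry *, struct qstr *,
                          struct qstr *);                 /* NFS: NULL, default comparison */
        int  (*d_delete)(struct dentry *);                /* NFS: silly-rename cleanup */
        void (*d_iput)(struct dentry *, struct inode *);  /* NFS: silly-rename cleanup */
        void (*d_release)(struct dentry *);               /* NFS: NULL */
    };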
Linux NFS Client Implementation
The difference between dcache entries and file attributes
The dcache is responsible for caching filename information,
while the file attribute cache retains per-file metadata
information.
This distinction is made very clear when considering
the rename operation.
During a rename, the file attributes remain the same (except for
ctime),
but the name of the file, and hence any cached directory
information, must change.
The file attribute cache and the dcache use separate time out values.
Attribute cache time out logic uses time out values
stored in the inode field nfs_i.attrtimeo.
Dcache time out logic uses time out values stored in the
dentry field d_time.
This field is
reserved specifically for use by file systems;
the VFS layer does not touch this field.
In certain special cases, of course, the time out values
can be ignored.
These time out values themselves vary
as the client discovers how often an object changes.
Ramifications of stale file handles
A stale file handle refers to a file object that has been deleted
on the server, often because a new file object has been created
with the same name as the deleted file.
What does this mean for the dcache and for file attribute
caches?
When a client discovers a file handle is stale, it should
invalidate the DNLC entry for the file and re-read the
file attributes.
For Linux, this means it must unhash the dcache entry,
free the inode, then re-instantiate a new dcache entry
and new inode.
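In dcache terms the first step is small; a hedged sketch follows.
d_drop() is the dcache call that unhashes an entry (declared in
<linux/dcache.h>); the wrapper name here is hypothetical, and the
rest of the recovery falls out of normal reference counting plus a
fresh on-the-wire lookup.

    #include <linux/dcache.h>

    /* Hypothetical helper: react to a stale file handle on a cached entry.
     * d_drop() unhashes the dentry, so the next lookup of this name misses
     * the dcache, goes over the wire, and instantiates a new dentry and a
     * new inode for whatever object now carries the name.  The old inode
     * is freed once the last reference to the unhashed dentry goes away. */
    static void nfs_entry_went_stale(struct dentry *dentry)
    {
        d_drop(dentry);
    }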
Close-to-open cache coherency
The NFS standard requires clients to maintain
close-to-open cache coherency
when multiple clients access the same files
[5, 6, 10].
This means flushing all file data and metadata changes
when a client closes a file, and immediately and unconditionally
retrieving a file's attributes when it is opened
via the
open() system call API.
In this way, changes made by one client appear as soon as a file is
opened on any other client.
The ext2 file system is the standard local file system
on Linux, and is the most frequent local file system exported
by the Linux NFS server.
It uses 32-bit wide timestamps that
count the number of seconds since January 1, 1970.
Thus ext2 on-disk inodes cannot resolve changes
that happen within the same second.
This is acceptable for local file access.
Over NFS, however, this means the Linux NFS server exports
timestamps with only one-second resolution;
changes to a file or directory that occur within the
same second are not reflected in the timestamps.
In order to detect sub-second changes to directories, the
Linux NFS client currently uses dcache entry revalidation
to achieve close-to-open cache coherency.
Each operation that involves altering a directory, such as
rmdir, create, and so on, time stamps the
parent's directory entry.
These operations store updated directory attributes
returned by server requests
into the attribute cache.
Whenever a directory's inode attributes are updated as a
result of one of these operations, its
dcache entry time stamp is updated to the current time
on the client.
When a dcache entry is revalidated, the dcache entry's time stamp
is compared with the current time on the client.
In most cases, if the difference is larger than the directory's
attribute timeout value, the dcache is revalidated
by executing an on-the-wire lookup request, and comparing the
result to information cached on the client.
Normally this information doesn't change, so the dentry may
be used as-is.
If the information has changed (for example, if the file
has been renamed) the dentry is invalidated, and another
on-the-wire lookup is requested by the VFS layer to acquire
the new information.
The last component of a pathname lookup is a special case, however.
If the last component's parent directory has changed recently,
the time out value is set to zero, causing the dcache
entries of files in active directories to be revalidated
immediately.
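The decision just described can be reduced to a few lines. The
following self-contained model captures it; the structure, field,
and helper names are illustrative and do not match the kernel's
identifiers exactly.

    /* Simplified model of the revalidation decision described above.
     * Times are in clock ticks; names are illustrative only. */
    struct cached_entry {
        unsigned long d_time;    /* set to the current time whenever the
                                  * parent directory is seen to change */
    };

    static int needs_wire_lookup(const struct cached_entry *entry,
                                 unsigned long now,
                                 unsigned long attr_timeout,
                                 int is_last_component,
                                 int parent_changed_recently)
    {
        unsigned long timeout = attr_timeout;

        /* last component of the pathname: if the parent directory has
         * changed recently, force an immediate revalidation */
        if (is_last_component && parent_changed_recently)
            timeout = 0;

        return (now - entry->d_time) > timeout;
    }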
Simple Benchmark
First we present the results of a simple benchmark that compares
the number of lookup operations required by an unmodified client
with the number required by a client that is less aggressive about
timing out dcache entries.
The Linux kernel, in both cases, is 2.4.1.
The client is an AMD K6 running at 233 MHz, with 128 MB of memory,
attached to the server via a 100 Mb/s Ethernet switch.
The client mounts the server via NFSv3,
using an rsize and wsize of 8,192 bytes.
Our workload is generated by building the Linux kernel
in an NFS-mounted file system.
The build configuration is held fixed for each build.
The kernel is built immediately after the system is booted,
providing a cold cache.
NFS statistics are gathered via nfsstat.
The modification consists of removing logic in nfs_dentry_force_reval
that shortens the attribute time-out value of the last component
of pathnames.
Operation type      Linux kernel, unmodified  Linux kernel, modified
Total packets                        108,810                  62,718
Fragments                             16,388                  16,407
Lookup requests                       54,405                  11,176
Getattr requests                         284                     384
Write requests                         7,913                   7,914
Read requests                          3,361                   3,364
Create requests                          770                     770
Readdir requests                         409                     410
While the number of read, write, create, and readdir requests remain about
equal for both runs, the modified kernel generated considerably fewer
lookup requests, resulting in a packet count that is nearly half that of
the unmodified client.
This test illustrates an artificial lower bound for client packet count.
In the case of a single client and a single server,
the client can trust that it is the only accessor of these files, and
thus can safely dispense with extra lookup operations.
Our results show how low network traffic could be without these
operations.
Our goal is to create a client that approaches this lower bound,
but effectively implements close-to-open cache coherency.
Workload Analysis
What workloads exacerbate this problem, and what workloads
and applications require the extra time-out logic we removed?
A compilation-intensive workload is especially onerous
in this case, but we can identify other common workloads
that would be impacted by excessive DNLC timeouts:
- Compilation
- Web browser caches
- MH mail folders
- MTA mail spool directories
We feel that, in fact, most common workloads don't share directories
between NFS clients, and that when they do, the applications themselves
can easily be responsible for notifying remote instances of changes.
Thus this type of excessive timeout is likely unnecessary for all
but a few unique types of workload.
The principle of least surprise, however, requires that close-to-open
cache consistency be maintained by default. System administrators
might find a mount option useful to identify file systems that don't require
strict close-to-open cache consistency.
Possible Solutions
The fundamental problem here is that many things in the
Linux VFS layer cause a dcache lookup, but only some things
require an immediate entry revalidation.
Consider these possible solutions:
- Make f_op->open revalidate the dentry
Summary:
Remove faulty timeout trashing logic from nfs_dentry_force_reval.
Add logic to nfs_open to revalidate the dcache entry before
returning to the VFS layer.
Pros:
This is the most straight-forward design.
It makes it clear that a file's attributes are refreshed
immediately and unconditionally whenever a file is opened
on a client.
Cons:
The
f_op->open method in the case of the NFS client
is
nfs_open.
This function is invoked late during open processing;
by then the dentry has already been looked up.
If
nfs_open should find that the file handle is stale,
it must dissociate the dcache entry's current inode and get a
new one; the only safe way for this to happen is for
nfs_open
to return
-ESTALE and have the VFS layer handle the problem.
Note that the Solaris VFS layer recovers from this by invalidating
the DNLC entry and dropping the cached file attributes, then reacquiring them.
If we don't expect recovering from stale file handles in open
processing to be a performance path, this might be the cleanest
solution.
- Add more nfs_revalidate calls to the VFS layer
Summary:
Remove faulty timeout trashing logic from nfs_dentry_force_reval.
Add nfs_revalidate to open_namei and open_exec.
Pros:
This takes a familiar approach.
Cons:
The VFS layer invokes
nfs_revalidate before calls such as
stat.
(Why doesn't it use this before open?)
If nfs_revalidate discovers a stale file handle,
it must dissociate the dcache entry's current inode and get a
new one.
Extra logic must be added to recover a new file handle;
see above.
Finally, nfs_revalidate uses the normal timeout
mechanism, so some indication that the timeout should be
ignored must be passed to it.
- Zero the d_time field when a file is closed
Summary:
Remove faulty timeout trashing logic from nfs_dentry_force_reval.
Zero the d_time field when closing a file to force
d_revalidate to revalidate a dcache entry immediately
if it is looked up again.
Pros:
No changes are necessary to the VFS layer.
Cons:
The first open of a file finds no dcache entry, so the entry
is looked up properly, and a close on one client causes that
client to retrieve the file's attributes again on the next lookup.
However, an open that is not preceded by a close (for example,
a second open while the file is still open) does not cause the
file's attributes to be retrieved from the server.
- Set a flag during lookups that require immediate revalidation
Summary:
Define a flag to d_revalidate that
open_namei and open_exec can use to
indicate to file system specific routines that
when looking up a dentry, it will need immediate revalidation.
Replace faulty timeout trashing logic from nfs_dentry_force_reval
with a check for the new flag.
If the new flag is present, trigger an immediate revalidation.
Pros:
This is an easy-to-implement solution, requiring few changes
to the VFS layer and NFS client.
Cons:
This solution relies on a side effect of on-the-wire lookup requests.
The lookup request revalidates cached filename information,
but also returns a fresh set of file attributes.
Note that only open and fopen need to guarantee that they
get a consistent handle to a particular file for reading and writing.
stat and friends are not required to retrieve fresh attributes,
in fact.
Thus, for the sake of close-to-open cache coherence, only
open and fopen are considered an "open event"
where fresh attributes need to be fetched immediately from the
server.
Solaris handles the case where client A has an open file, and
tries to open the same file again, but discovers that it has
been replaced by client B, thereby making the file handle cached
on client A "stale."
In this case, Solaris's VFS layer invalidates any DNLC entries
and attributes for the file, then rebuilds its state.
Other interesting behaviors
The case of
open(".") is interesting.
On Linux, when a pathname containing simply "." is resolved,
the implementation simply retrieves the dentry for the
process's current working directory from its
task
structure, and returns it to
open().
Because neither a lookup nor a revalidation is done to obtain
this dentry, it is possible for the pathname resolution logic
to return a dentry for a deleted directory.
This is the only case on Linux where an application is allowed
to open a deleted directory.
This is a problem both for local file system implementations
such as ext2 and for NFS. If a directory that is
a current working directory of some process is deleted, that
process is still allowed to open("."). If the directory
is deleted on a remote client, there is no way to tell it is
gone until something tries to use the directory.
Because no lookup is done for ".", the NFS client implementation
is also never invoked to retrieve or refresh the directory's attributes.
With the current implementation of pathname resolution on Linux,
it is impossible to guarantee close-to-open cache consistency
for current working directories.
We also note that
the extra do_revalidate code in the VFS
layer support for stat and friends is, at this time,
redundant.
Each of these system calls uses path_walk to find
the dentry for the target object, and path_walk
eventually invokes cached_lookup which will
revalidate both the DNLC and inode cache.
Following the path_walk call in each of these
system routines, there appears a do_revalidate
which invokes the inode's i_op->revalidate
method.
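The pattern looks roughly like the sketch below, an approximation of
the 2.4 stat path rather than a verbatim copy. By the time
do_revalidate runs, the user_path_walk call has already revalidated
the dentry through cached_lookup and d_revalidate.

    /* Approximate shape of the 2.4 stat path (simplified; details and
     * error handling abbreviated).  user_path_walk() resolves the path,
     * revalidating the dentry along the way, and then do_revalidate()
     * calls i_op->revalidate a second time. */
    asmlinkage long sys_newstat(char *filename, struct stat *statbuf)
    {
        struct nameidata nd;
        int error;

        error = user_path_walk(filename, &nd);   /* revalidates via the dcache */
        if (!error) {
            error = do_revalidate(nd.dentry);     /* second, redundant check */
            if (!error)
                error = cp_new_stat(nd.dentry->d_inode, statbuf);
            path_release(&nd);
        }
        return error;
    }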
Conclusions
We implemented the last solution (passing a flag to the
dentry revalidation operation), and found that, while
NFS close-to-open behavior improved, performance was slightly worse
due to an increase in the number of on-the-wire lookups.
We mitigated the performance problem by implementing support
for a pre-existing client mount option called "nocto" (which
stands for "no close-to-open").
For certain workloads where we know there will be little or
no data sharing, we can dispense with extra lookup operations
to verify file attributes during open() processing,
and rely simply on attribute and dcache timeouts.
Using this mount option, we obtain very close to optimal
on-the-wire lookup counts.
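A minimal sketch of the flag-based approach appears below. The flag
name and the helper functions are hypothetical stand-ins used only to
illustrate the idea, not the actual patch: open_namei and open_exec
mark the lookup as an open event, and the NFS d_revalidate method
forces an on-the-wire lookup when it sees the mark, unless the file
system was mounted with "nocto".

    /* Sketch of the flag-based approach.  LOOKUP_OPEN_EVENT and the
     * helper functions are hypothetical names for illustration; the
     * real patch differs in detail. */
    #define LOOKUP_OPEN_EVENT  0x1000      /* set by open_namei and open_exec */

    static int nfs_lookup_revalidate(struct dentry *dentry, int flags)
    {
        /* an open event forces revalidation unless "nocto" was given */
        if ((flags & LOOKUP_OPEN_EVENT) && !nfs_mounted_nocto(dentry))
            goto out_force;

        if (!nfs_dentry_timeout_expired(dentry))
            return 1;                      /* cached entry is still valid */

    out_force:
        /* on-the-wire LOOKUP: verify the name and file handle, and refresh
         * the cached attributes; returns 1 if the entry still matches */
        return nfs_do_wire_lookup(dentry);
    }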
Next we tried implementing the first solution from above;
namely, adding logic to nfs_open to retrieve
attributes from the server during open processing.
This solution was easier to implement than we had estimated,
and provided three benefits over our first attempt.
First, open(".") is correctly supported.
Second, we are closer to removing nfs_lookup_revalidate
entirely.
Finally,
instead of on-the-wire lookups, this client implementation
uses on-the-wire GETATTR requests, which results
in a measurable performance improvement for lookup-intensive
workloads.
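The shape of that change is sketched below. The helper that fetches
attributes is a hypothetical stand-in for an unconditional
over-the-wire GETATTR; the actual patch differs in detail, but the
essential point is that the NFS f_op->open method, nfs_open,
refreshes the file's attributes before the open completes.

    /* Sketch of attribute revalidation at open time.  The helper
     * nfs_getattr_from_server() is a hypothetical stand-in for an
     * unconditional GETATTR request. */
    static int nfs_open(struct inode *inode, struct file *filp)
    {
        int error;

        /* every open() is an "open event": fetch fresh attributes now,
         * regardless of any cached attribute timeout */
        error = nfs_getattr_from_server(inode);
        if (error == -ESTALE) {
            /* the cached file handle no longer names this object; return
             * the error and leave recovery (unhashing the dentry and
             * repeating the lookup) to the VFS layer, as discussed above */
            return -ESTALE;
        }
        return error;
    }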
Future Work
Dentry revalidation is a performance path for lookup-intensive
workloads like compilation.
Special attention to the efficiency of this part of the NFS
client could have as much payoff as making the read and write
paths fast.
Entirely eliminating the need to revalidate lookup results could
improve NFS performance.
Allowing servers and clients to form a special agreement about
directories such that clients can have exclusive access to them
might help tremendously.
Clients would no longer be burdened by checking back with the
server to see if directories have changed, reducing the number
of on-the-wire lookup requests significantly.
References
2.
Kazar, Michael Leon.
"Synchronization and Caching Issues in the Andrew File System."
USENIX Conference Proceedings, pp. 27-36.
Winter 1988.
3.
Ritchie, Dennis M., and Thompson, Ken.
"The Unix time-sharing system."
Communications of the ACM, 17(7):365-375.
July 1974.
4.
Kleiman, S. R.
"Vnodes: An Architecture for Multiple File System Types in Sun Unix."
USENIX Conference Proceedings.
Atlanta 1986.
5.
Sun Microsystems, Inc.
"RFC 1094 - NFS: Network File System Protocol specification."
IETF Network Working Group.
March 1989.
6.
Sun Microsystems, Inc.
"RFC 1813 - NFS: Network File System Version 3 Protocol Specification."
IETF Network Working Group.
June 1995.
7.
Howard, John H.; Kazar, Michael L.; Menees, Sherri G.; Nichols, David A.;
Satyanarayanan, M.; Sidebotham, Robert N.; West, Michael J.
"Scale and Performance in a Distributed File System."
ACM Transactions on Computer Systems, 6(1).
February 1988.
8.
McKusick, Marshall Kirk; Joy, William N.; Leffler, Samuel J.; Fabry, Robert S.
"A Fast File System for UNIX."
ACM Transactions on Computer Systems, 2(3):181-197.
August 1984.
9.
Leffler, Samuel J.; McKusick, Marshall Kirk;
Karels, Michael J.; Quarterman, John S.
The Design and Implementation of the 4.3BSD UNIX
Operating System.
Addison-Wesley Publishing Company, 1990.
10.
Callaghan, Brent.
NFS Illustrated,
Addison-Wesley Longman, Inc., 2000.