Distributed Directory Service in the Farsite File System

The directory service must automatically adapt to the arrival and departure of machines and to changes in machine availability, and it must be able to autonomically repartition its data and metadata as necessary to balance load and alleviate hotspots.
We describe the history of the project, including its multiple versions of major system components, the unique programming style and software-engineering environment we created to facilitate development, our distributed debugging framework, and our experiences with formal system specification.
We also report on the lessons we learned during this development.
Operating on a flattened name space is slightly more involved because, for example, one can create a file only if its parent exists, so one could create a new name only if a constrained prefix of the name already exists.

Some prior distributed file systems have divided the name space into user-visible partitions and disallowed renames across partitions; examples include volumes in AFS [ 19 ], shares in Dfs [ 25 ], and domains in Sprite [ 27 ].
Instead, we designed our metadata service to present a semantically seamless name space. Given a fully functional rename operation, name-space consistency is necessary to avoid permanently disconnecting parts of the file system, as a simplified example illustrates. Client X renames file C to be a child of file G while, concurrently, client Y renames file G to be a child of file C. No single server is directly involved in both rename operations, and each independent rename is legal.
Yet, if both renames were allowed to proceed, the result would be an orphaned loop: C and G would become each other's parents, reachable from no root. Once we had a mechanism for scalable name-space consistency, it seemed reasonable to use this mechanism to provide a consistent name space for all path-based operations, not merely for renames.
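To make the hazard concrete, here is a toy Python sketch (not Farsite code; the parent map and file names are invented) showing that each rename passes its own ancestry check, yet applying both orphans a loop:

```python
# Toy illustration of the orphaned-loop hazard; not Farsite code.
def is_ancestor(parent, a, b):
    """Return True if file a is a proper ancestor of file b."""
    while b is not None:
        b = parent.get(b)
        if b == a:
            return True
    return False

# Two disjoint subtrees: C lives under root A, G lives under root B.
parent = {"A": None, "B": None, "C": "A", "G": "B"}

# Each rename is legal in isolation: the moved file is not an
# ancestor of its new parent at the time its own server checks.
assert not is_ancestor(parent, "C", "G")   # client X: C under G is OK
assert not is_ancestor(parent, "G", "C")   # client Y: G under C is OK

parent["C"] = "G"   # client X's rename
parent["G"] = "C"   # client Y's rename

# Walking up from C now cycles C -> G -> C and never reaches a root:
# the two files form a loop orphaned from the rest of the name space.
```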
Although Farsite is not intended to gracefully support long-term client disconnections, we still need to decide what to do when such disconnections occur. Since we employ a lease-based framework, we attach expiration times — typically a few hours — to all leases issued by servers. When a lease expires, the server reclaims its authority over the leased metadata. If a client had performed operations under the authority of this lease, these operations are effectively undone when the lease expires.
Farsite thus permits lease expiration to cause metadata loss, not merely file content loss as in previous distributed file systems. However, these situations are not as radically different as they may first appear, because the user-visible semantics depend on how applications use files. For example, Microsoft's email program Outlook [ 28 ] stores all of its data, including application-level metadata, in a single file, whereas a maildir-style mail store keeps each message in its own file and represents mail folders as directories. Under Outlook, content-lease expiration can cause email folder metadata updates to get lost; under maildir, expiring a file-system metadata lease can cause the same behavior.
Our paper [ 1 ] envisioned a distributed directory service and presented several ideas for it that seemed reasonable in the abstract but proved problematic as we developed a detailed design. Our intent had been to partition file metadata among servers according to file path names. Each client would maintain a cache of mappings from path names to their managing servers, similar to a Sprite prefix table [ 43 ].
The client could verify the authority of the server over the path name by evaluating a chain of delegation certificates extending back to the root server. To diffuse metadata hotspots, servers would issue stale snapshots instead of leases when the client load got too high, and servers would lazily propagate the result of rename operations throughout the name space. Partitioning by path name complicates renames across partitions, as detailed in the next section.
In the absence of path-name delegation, name-based prefix tables are inappropriate. Similarly, if partitioning is not based on names, consistently resolving a path name requires access to metadata from all files along a path, so delegation certificates are unhelpful for scalability. Stale snapshots and lazy rename propagation allow the name space to become inconsistent, which can cause orphaned loops, as described above in Section 3.
For these reasons, we abandoned these design ideas.

In building a distributed metadata service, the most fundamental decision is how to partition the metadata among servers. One approach is to partition by file path name, as in Sprite [ 27 ] and the precursor to AFS [ 32 ], wherein each server manages files in a designated region of the name space. A rename that crosses partitions must then either migrate the renamed files' metadata to the server that manages the destination name or delegate the source server's authority over the files along with the new name. The former approach — migration without delegation — is insufficient, because if a large subtree is being renamed, it may not be manageable by a single server.
We avoid these problems by partitioning according to file identifiers, which, unlike path names, are immutable. Partitioning by file identifier is not original to Farsite; it is also the approach taken by AFS [ 19 ] and xFS [ 2 ].
All three systems need to efficiently store and retrieve information on which server manages each identifier. AFS addresses this problem with volumes [ 34 ], and xFS addresses it with a similar unnamed abstraction. All files in a volume reside on the same server. Volumes can be dynamically relocated among servers, but files cannot be dynamically repartitioned among volumes without reassigning file identifiers.
Specifically, we considered four issues, referred to by number below. Abstractly, a file identifier is a sequence of positive integers, wherein the null sequence identifies the file-system root. Each server manages identifiers beginning with a specified prefix, except for those it has explicitly delegated away to other servers. The file identifier space is thus a tree, and servers manage subtrees of this space. At any moment, some portion of each file identifier determines which server manages the file; however, the size of this portion is not fixed over time, unlike AFS file identifiers.
For example, in one possible partitioning, a server might manage all identifiers beginning with ⟨1⟩ except those beginning with ⟨1, 4⟩, which it has delegated to another server. This arbitrary partitioning granularity addresses issue 1 above. Because file identifiers have unbounded sequence length, each server manages an infinite set of file identifiers. This addresses issue 2. Variable-length identifiers may be unusual, but they are not complicated in practice. Our code includes a file-identifier class that stores small identifiers in an immediate variable and spills large identifiers onto the heap.
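As a rough sketch of this abstraction (ignoring the immediate-versus-heap storage optimization, and with invented names rather than Farsite's actual interfaces), tree-structured identifiers and prefix-based management can be expressed as:

```python
class FileId:
    """A file identifier: a sequence of positive integers.
    The empty sequence identifies the file-system root."""
    def __init__(self, *parts):
        assert all(isinstance(p, int) and p > 0 for p in parts)
        self.parts = tuple(parts)

    def is_prefix_of(self, other):
        return self.parts == other.parts[:len(self.parts)]

def manages(server_prefix, delegated, fid):
    """A server manages fid iff fid falls under the server's prefix
    and under none of the prefixes the server has delegated away."""
    return (server_prefix.is_prefix_of(fid) and
            not any(d.is_prefix_of(fid) for d in delegated))

# The example partitioning above: this server holds <1> but has
# delegated <1,4> and everything beneath it to another server.
assert manages(FileId(1), [FileId(1, 4)], FileId(1, 3))
assert not manages(FileId(1), [FileId(1, 4)], FileId(1, 4, 9))
```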
In our implementation, class encapsulation hides the length variability from all other parts of the system. To store information about which servers manage which regions of file-identifier space, clients use a file map, which is similar to a Sprite prefix table [ 43 ], except that it operates on prefixes of file identifiers rather than of path names. The file map is stored in an index structure adapted from Lampson et al.
With respect to the count of mapped prefixes, the storage cost is linear, and the lookup time is logarithmic, thereby addressing issue 4. When a file is created, it is assigned an identifier that extends the identifier of its parent directory. This rule tends to keep files that are close in the name space also close in the identifier space, so partitioning the latter produces few cuts in the former. This minimizes the work of path resolution, which is proportional to the number of server regions along a path.
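The file map's behavior can be sketched as follows. The real structure, adapted from Lampson et al., achieves logarithmic lookup; this simplified Python version probes successively shorter prefixes (linear in identifier length) but exhibits the same longest-matching-prefix semantics:

```python
class FileMap:
    """Maps file-identifier prefixes to the servers managing them."""
    def __init__(self):
        self.by_prefix = {}                # tuple prefix -> server

    def set(self, prefix, server):
        self.by_prefix[tuple(prefix)] = server

    def lookup(self, fid):
        # The longest mapped prefix wins, because delegations carve
        # sub-regions out of enclosing regions.
        parts = tuple(fid)
        for n in range(len(parts), -1, -1):
            server = self.by_prefix.get(parts[:n])
            if server is not None:
                return server
        raise KeyError("no mapping, not even for the root region")

fmap = FileMap()
fmap.set((), "S0")         # root server manages everything by default
fmap.set((1,), "S1")       # ...except identifiers beginning with <1>
fmap.set((1, 4), "S2")     # ...of which <1,4> is delegated to S2

assert fmap.lookup((2, 7)) == "S0"
assert fmap.lookup((1, 3)) == "S1"
assert fmap.lookup((1, 4, 9)) == "S2"
```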
Renames can disrupt the alignment of the name and identifier spaces. For example, a file renamed into a distant directory retains its identifier, which no longer shares a prefix with the identifiers of its new siblings. Unless renames occur excessively, there will still be enough alignment to keep the path-resolution process efficient.
The Windows rename operation atomically updates metadata for three files: the source directory, the destination directory, and the file being renamed. Our protocol does not support POSIX-style rename, in which the rename can additionally overwrite a fourth file, although the protocol could plausibly be extended to handle this case.
These files may be managed by three separate servers, so the client must obtain leases from all three servers before performing the rename. Before the client returns its leases, thus transferring authority over the metadata back to the managing servers, its metadata update must be validated by all three servers with logical atomicity.
The servers achieve this atomicity with two-phase locking: the server that manages the destination directory acts as the leader, and the other two servers act as followers. Each follower validates its part of the rename, locks the relevant metadata fields, and notifies the leader.
The leader decides whether the update is valid and tells the followers to abort or commit their updates, either of which unlocks the field. While a field is locked, the server will not issue a lease on the field.
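The exchange can be sketched as follows in Python; this single-process rendition elides the actual validation, messaging, and failure handling, and its class and field names are invented rather than Farsite's interfaces:

```python
class FollowerServer:
    """Manages the source directory or the file being renamed."""
    def __init__(self):
        self.locked = set()   # no leases are issued on locked fields

    def prepare(self, fields):
        # Phase 1: validate the local part of the rename (elided),
        # lock the relevant metadata fields, and vote to commit.
        self.locked |= set(fields)
        return True

    def finish(self, commit):
        # Phase 2: apply or discard the update as the leader decided;
        # either outcome unlocks the fields.
        self.locked.clear()

def rename_as_leader(followers, leader_part_valid):
    """Runs on the server managing the destination directory."""
    votes = [f.prepare(fields) for f, fields in followers]
    commit = leader_part_valid and all(votes)
    for f, _ in followers:
        f.finish(commit)
    return commit

src, moved = FollowerServer(), FollowerServer()
done = rename_as_leader([(src, {"child-name field"}),
                         (moved, {"parent-id field"})],
                        leader_part_valid=True)
assert done and not src.locked and not moved.locked
```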
Since a follower that starts a multi-server operation is obligated to commit if the leader commits, a follower cannot unlock a field on its own, even to timeout a spurious update from a faulty client.
Instead, the leader centrally handles timeouts by setting a timer for each notification it receives from a follower. The leader, which manages the destination directory, can afford to check the one non-local condition, namely that the file being moved is not an ancestor of the destination.
This check is facilitated by means of a path lease, as described in the following section.

As described in Section 3, name-space consistency is required to keep concurrent renames from disconnecting parts of the file system. More generally, Farsite provides name-space consistency for all path-based operations. Naively, this would make the root server responsible for providing leases to all interested parties in the system. Our solution to this problem is the mechanism of recursive path leases. A path lease over a file certifies the current chain of files from the root down to that file. Path leases are recursive, in that they are issued to other files, specifically to the children of the file whose path is being leased; a path lease on a file can be issued only when the file holds a path lease from its parent.
A path lease is accessible to the server that manages the file holding the lease, and if the file is delegated to another server, its path lease migrates with it. The recursive nature of path leases makes them scalable; in particular, the root server need only deal with its immediate child servers, not with every interested party in the system. When a rename operation causes a server to change the sequence of files along a path, the server must recall any relevant path leases before making the change, which in turn recalls dependent path leases, and so on down the entire subtree.
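The recall cascade might be sketched like this in Python, with the children map and lease bookkeeping as stand-ins for Farsite's real structures:

```python
def recall_path_lease(file, children, has_path_lease):
    """Recall the path lease held by file, first recalling every
    dependent path lease issued beneath it."""
    for child in children.get(file, ()):
        if has_path_lease.get(child):
            recall_path_lease(child, children, has_path_lease)
    has_path_lease[file] = False   # authority returns to the parent

# Renaming D forces recall of path leases across its whole subtree.
children = {"D": ["E", "F"], "E": ["H"]}
has_path_lease = {"D": True, "E": True, "F": True, "H": True}
recall_path_lease("D", children, has_path_lease)
assert not any(has_path_lease.values())
```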
The time to perform a rename thus correlates with the size of the subtree whose root is being renamed. This implies that renames near the root of a large file system may take considerable time to execute. Such renames have considerable semantic impact, so this slowness appears unavoidable. Nonetheless, to plausibly argue that our system can reach the scale Farsite targets, we believe it necessary to address the problem of workload hotspotting.
However, even non-commutative sharing can result in hotspotting if the metadata structure induces false sharing. We avoid false sharing by means of file-field leases and disjunctive leases. Rather than a single lease over all metadata of a file, Farsite has a lease over each metadata field, including one field per child name of a directory. For the child-name fields, since there are infinitely many potential child names, we have a shorthand representation of lease permission over all names other than an explicitly excluded set; we call this the infinite child lease.
File-field leases are beneficial when, for example, two clients edit two separate files in the same directory using GNU Emacs [ 36 ], which repeatedly creates, deletes, and renames files using a primary name and a backup name.
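A minimal sketch of these child-name leases, including the exclusion-set representation of the infinite child lease, follows; the class and its methods are invented for illustration:

```python
class ChildNameLeases:
    """Leases over one directory's child-name fields. The infinite
    child lease covers all names except an explicit excluded set."""
    def __init__(self):
        self.named = {}        # name -> client holding a per-name lease
        self.infinite = None   # (client, excluded_names) or None

    def holder(self, name):
        if name in self.named:
            return self.named[name]
        if self.infinite and name not in self.infinite[1]:
            return self.infinite[0]
        return None

    def grant_name(self, client, name):
        assert self.holder(name) in (None, client), "recall first"
        self.named[name] = client

    def grant_infinite(self, client):
        # Exclusions are explicit, never implicit: the lease names
        # exactly which child names it does not cover.
        excluded = {n for n, c in self.named.items() if c != client}
        self.infinite = (client, excluded)

leases = ChildNameLeases()
leases.grant_name("X", "paper.txt")    # X edits one file...
leases.grant_name("X", "paper.txt~")   # ...and its Emacs backup name
leases.grant_name("Y", "notes.txt")    # Y works on a different file
leases.grant_infinite("Z")             # Z gets every remaining name
assert leases.holder("anything.else") == "Z"
assert leases.holder("notes.txt") == "Y"
```

Because each client's leases cover only its own names, the two Emacs users above never contend for a whole-directory lease.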
For some fields, even a lease per field can admit false sharing. Consider the fields that record which clients have a file open in each access and sharing mode: to process an open correctly, a client must determine whether another client has the file open for a particular mode. For example, if some client X has a file open for read access, no other client Y can open the file with a mode that excludes others from reading. In Farsite, this false sharing is avoided by applying disjunctive leases [ 12 ] to each of these mode fields. For a disjunctively leased field, each client has a Boolean self value that it can write and a Boolean other value that it can read.
The other value for each client X is defined as the logical OR of the self values of all other clients:

other_X = ∨_{Y ≠ X} self_Y

When client Z opens the file for read access, it sets its self value to TRUE, but this does not change the other value that Y sees, since X's self value already made it TRUE. The server manages these leases to allow a client to access its values when this does not conflict with the accesses of other clients.
For example, if client X does not hold a self-write lease, it is not allowed to change its self value; if this value is currently TRUE, then the server can simultaneously grant other-read access to client Y and self-write access to client Z. Disjunctive leases are beneficial when, for example, many clients open a popular Microsoft Word [ 18 ] document. By default, Word attempts to open a file for exclusive access, and if it fails, it then opens the file with shared access.
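In Python, one disjunctively leased mode field might look like the following sketch (names invented; lease checks elided):

```python
class DisjunctiveField:
    """One mode field, e.g. 'open for read access'."""
    def __init__(self):
        self.self_bits = {}   # client -> its Boolean self value

    def set_self(self, client, value):
        self.self_bits[client] = value      # needs a self-write lease

    def other(self, client):
        # The other value: OR of every other client's self value.
        # Reading it needs an other-read lease.
        return any(v for c, v in self.self_bits.items() if c != client)

field = DisjunctiveField()
field.set_self("X", True)            # X opens the file for read access
assert field.other("Y") is True      # Y sees that someone has it open
field.set_self("Z", True)            # Z also opens for read access...
assert field.other("Y") is True      # ...which changes nothing for Y
```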
A single field has a single meaning. An early design used one field to indicate both that a client had a handle open for write access and that it held a write lease on the file content, as though the two necessarily coincide. However, this does not follow: it is semantically permissible for two clients to have write-access handles open concurrently, but it is not logistically permissible for them to hold write leases on the file content concurrently. Authority is rigorously defined. In a distributed system wherein servers delegate authority to other servers and lease authority to clients, it is crucial to define which data is authoritative.
When we were careless about this, it led to circularities. In our final design, the lease that grants authority over all-but-a-few child names now explicitly identifies the names it excludes, rather than implicitly excluding names that are used. The lease mechanism is not overloaded with semantics.
In one of our designs, if a client tried to open a file that was already open for exclusive access by another client, the server would refuse to issue the client a lease, which the client would interpret as an indication of a mode conflict.
Not only does this introduce a special case into the lease logic, but it also forfeits a key benefit of leases: if the client repeatedly tries to open the file, it will repeatedly ask the server for a lease, and the server will repeatedly refuse until the file is closed by the other client. Our final design instead leases the mode fields themselves; with such a lease, clients can locally determine whether the open operation should succeed. Operations are performed on clients and replayed on servers. This arrangement is common for file-content writes in prior distributed file systems, and we found it just as applicable to metadata updates.
In testing design alternatives, we found that this uniform approach was simpler even for multi-server rename, because it provides a simple operational model: the client obtains leases from all relevant servers, performs the operation locally in a single logical step, and lazily sends the update to the servers.
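Under simplifying assumptions (one key-value field per update, no conflicts or failures, invented class names), the model looks like this:

```python
class Server:
    def __init__(self):
        self.metadata, self.leased_to = {}, {}

    def grant_lease(self, client, key):
        # Transfer authority over key to the client.
        self.leased_to[key] = client
        return self.metadata.get(key)

    def replay(self, client, key, value):
        # Validate and replay an update the client performed locally.
        assert self.leased_to.get(key) is client, "not under lease"
        self.metadata[key] = value

class Client:
    def __init__(self):
        self.cache, self.pending = {}, []

    def update(self, server, key, value):
        self.cache[key] = server.grant_lease(self, key)  # get lease
        self.cache[key] = value         # perform locally, one step
        self.pending.append((server, key, value))

    def flush(self):                    # lazy replay on the server
        for server, key, value in self.pending:
            server.replay(self, key, value)
        self.pending.clear()

s, c = Server(), Client()
c.update(s, ("dirA", "childB"), "created")
assert s.metadata == {}                 # server not yet updated
c.flush()
assert s.metadata[("dirA", "childB")] == "created"
```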
Message recipients need not draw inferences. In one of our design variants, when a client released a lease, it did not indicate to the server whether it was releasing a read lease or a write lease. Because these leases are mutually exclusive, the server could infer which type of lease the client was releasing. However, although this inference was always logically possible, it added some subtlety in a few corner cases, which led to design bugs.
The lease-release message now explicitly identifies the lease, because we found that such changes vastly simplify program logic with only a small additional expense in message length. Leases do not indicate values; they indicate knowledge of values. Before our design was explicit about the fact that leases convey knowledge, we had a function called NameIsUsed that returned TRUE when the client held a lease over a name field with a non-null value. This led us to carelessly write !NameIsUsed in contexts that required knowing a name was unused, although the expression is also TRUE when the client holds no lease and thus knows nothing about the name.
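The pitfall is easy to reproduce in a sketch; the function names here are hypothetical, not Farsite's:

```python
# A client's view of a name field: (holds_lease, value); with no
# lease, the client knows nothing about the field.
def know_name_is_used(view, name):
    lease, value = view.get(name, (False, None))
    return lease and value is not None

def know_name_is_unused(view, name):
    lease, value = view.get(name, (False, None))
    return lease and value is None

view = {}   # client holds no lease on "foo" at all
# The bug: "not know_name_is_used" is TRUE here, yet the client does
# not know that "foo" is unused; it knows nothing about it.
assert not know_name_is_used(view, "foo")
assert not know_name_is_unused(view, "foo")
```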
The directory-service code is split into two main pieces: application logic and an atomic-action substrate. Within the application logic, all aspects of mobility are handled in a bottom layer.
As a consequence of the lack of Byzantine fault tolerance (BFT), a machine failure in the current implementation can damage the name space, which is clearly not acceptable for a real deployment. Our emphasis has been on mechanisms that support a seamlessly distributed and dynamically partitioned file system. We have not yet developed optimal policies for managing leases or determining when servers should delegate metadata.
For our experimental evaluation, we have developed some provisional policies that appear to work reasonably well. If a lease request arrives when no conflicting lease is outstanding and no conflicting request is ahead in the queue, the request is satisfied immediately. A server could issue broader leases than requested, for instance satisfying a request for a single child name with an infinite child lease, but our current policy never does so; infinite child leases are issued only for operations that need to know that the file has no children, namely delete and close.
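The queueing policy can be sketched as follows; the conflict predicate and FIFO queue are illustrative simplifications:

```python
from collections import deque

class LeasePolicy:
    def __init__(self, conflicts):
        self.conflicts = conflicts   # (held_kind, asked_kind) -> bool
        self.held = []               # outstanding leases
        self.queue = deque()         # waiting requests, in FIFO order

    def request(self, client, kind):
        # Grant immediately iff nothing conflicting is outstanding and
        # no conflicting request is ahead in the queue.
        blocked = (any(self.conflicts(h, kind) for _, h in self.held) or
                   any(self.conflicts(q, kind) for _, q in self.queue))
        if blocked:
            self.queue.append((client, kind))  # granted later, in order
            return False
        self.held.append((client, kind))
        return True

reads_share = lambda a, b: "write" in (a, b)   # only writes conflict
policy = LeasePolicy(reads_share)
assert policy.request("X", "read")        # granted immediately
assert policy.request("Y", "read")        # reads do not conflict
assert not policy.request("Z", "write")   # queued behind the reads
```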