An Advanced File System for Linux

Demanded by enterprises and beneficial to everyone

As Linux made its way further into the enterprise, one key feature it lacked was a journaling file system. That was true in 1999, but today four journaling file systems can meet enterprise server requirements. This article focuses on one of them: JFS.

The file system is one of the most important parts of an operating system. It stores and manages user data on disk drives and ensures that what's read from storage is identical to what was originally written. In addition to storing user data in files, the file system also creates and manages information about files and about itself. Besides guaranteeing the integrity of all that data, file systems are also expected to be extremely reliable and have excellent performance.

Before the year 2000, Ext2 was the de facto file system for most Linux machines; it was robust, reliable, and suitable for most deployments. However, as Linux displaced Unix and other operating systems in more and more large server and computing environments, Ext2 was pushed to its limits. In fact, many now-common requirements - large hard-disk volumes, quick recovery from crashes, high-performance I/O, and the need to store millions of files representing terabytes of data - exceed the capabilities of Ext2.

Fortunately, a number of other Linux file systems pick up where Ext2 leaves off. Indeed, Linux now offers four alternatives to Ext2: Ext3, JFS, ReiserFS, and XFS. In addition to meeting some or all of the previously mentioned requirements, each of these alternative file systems also supports journaling, a feature certainly demanded by enterprises but beneficial to anyone running Linux. Depending on its design, a journaling file system can simplify restarts, reduce fragmentation, and accelerate I/O. Better yet, journaling file systems make lengthy fsck runs after a crash largely a thing of the past.

To better appreciate the benefits of journaling file systems, it helps to know the vocabulary of file systems.

  • Logical block (or a file system's block size): The smallest unit of storage that can be allocated by the file system. A logical block is measured in bytes, and it may take several blocks to store a single file.
  • Logical volume: One or more physical disks or some subset of the physical disk space.
  • Block allocation: A method of allocating blocks in which the file system allocates one block at a time. With this method, a pointer to every block in a file is maintained and recorded. Ext2 uses block allocation.
  • Extent: A large number of contiguous blocks. Each extent is described by a triple, consisting of file offset, starting block number, and length. File offset is the offset of the extent's first block from the beginning of the file; starting block number is the first block in the extent; and length is the number of blocks in the extent. Extents are allocated and tracked as a single unit, meaning that a single pointer tracks a group of blocks. For large files, extent allocation is a much more efficient technique than block allocation. Figure 1 shows how extents are used.
  • File system metadata: The file system's internal data structures - everything concerning a file except the actual data inside the file. Metadata includes date and time stamps, ownership information, file access permissions, other security information such as access control lists (if they exist), the file's size, and the storage location or locations on disk.
  • Inode: Stores all the information about a file except the data itself. You can think of an inode as a "bookkeeping" file for a file (indeed, an inode is a structure that consumes blocks, too). An inode contains file permissions, file types, and the number of links to the file. Every inode has a unique inode number that distinguishes it from every other inode.
An extent is described by its block offset in the file, the location of the first block in the extent, and the length of the extent. If file sample.txt requires 18 blocks, and the file system is able to allocate one extent of length 8, a second extent of length 5, and a third extent of length 5, the file system would look something like Figure 1. The first extent has offset 0 (block A in the file), location 10, and length 8. The second extent has offset 8 (block I), location 20, and length 5. The last extent has offset 13 (block N), location 35, and length 5.
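
The sample.txt example can be sketched in a few lines of Python. This is only an illustration of the (file offset, starting block, length) triple, not JFS's on-disk format; the names Extent, SAMPLE_TXT, and physical_block are invented for the sketch.

```python
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    file_offset: int  # offset of the extent's first block within the file
    start_block: int  # first block of the extent on disk
    length: int       # number of contiguous blocks in the extent


# The three extents of sample.txt from the example above (18 blocks in total).
SAMPLE_TXT = [Extent(0, 10, 8), Extent(8, 20, 5), Extent(13, 35, 5)]


def physical_block(extents: List[Extent], logical_block: int) -> Optional[int]:
    """Translate a block offset within the file to a block number on disk."""
    for ext in extents:
        if ext.file_offset <= logical_block < ext.file_offset + ext.length:
            return ext.start_block + (logical_block - ext.file_offset)
    return None  # the offset lies beyond the end of the file
```

Three triples are enough to locate any of the 18 blocks, where block allocation would need 18 separate pointers.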

How File Systems Go Bad

With these concepts in mind, here's what happens when a three-block file is modified and grows to be a five-block file:
  1. Two new blocks are allocated to hold the new data.
  2. The file's inode is updated to record the new size of the file.
  3. The actual data is written into the blocks.
As you can see, while writing data to a file appears to be a single atomic operation, the actual process involves a number of steps (even more steps than shown here if you consider all of the accounting required to remove the two blocks from the free list of blocks and other metadata changes).
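
To make the three steps concrete, here is a toy model in Python. It is a deliberately simplified sketch: the ToyFS class, its fields, and the crash_after parameter are hypothetical, and a real file system tracks far more state.

```python
class Crash(Exception):
    """Stands in for a power failure partway through an update."""


class ToyFS:
    def __init__(self, total_blocks=16):
        self.total = total_blocks
        self.free = list(range(total_blocks))   # free list of block numbers
        self.disk = {}                           # block number -> data written
        self.inode = {"size": 0, "blocks": []}   # inode of a single file

    def append(self, chunks, crash_after=None):
        # Step 1: allocate new blocks from the free list.
        new_blocks = [self.free.pop(0) for _ in chunks]
        if crash_after == 1:
            raise Crash
        # Step 2: update the inode to record the new size.
        self.inode["blocks"] += new_blocks
        self.inode["size"] = len(self.inode["blocks"])
        if crash_after == 2:
            raise Crash
        # Step 3: write the actual data into the blocks.
        for block, chunk in zip(new_blocks, chunks):
            self.disk[block] = chunk

    def is_consistent(self):
        owned = set(self.inode["blocks"])
        every_block_written = all(b in self.disk for b in owned)
        no_lost_blocks = owned | set(self.free) == set(range(self.total))
        return every_block_written and no_lost_blocks
```

Growing a three-block file by two blocks with crash_after=2 leaves an inode that claims five blocks while only three hold data; with crash_after=1 the two new blocks are neither free nor owned by any file.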

If all the steps to write a file are completed correctly (and this happens most of the time), the file is saved successfully. However, if the process is interrupted at any time (perhaps due to power failure or other system failure), a non-journaling file system can end up in an inconsistent state. Corruption occurs because the logical operation of writing (or updating) a file is actually a sequence of I/O operations, and the entire operation may not be fully reflected on the media at any given point in time. A journaling file system uses transactions to keep track of metadata changes. Each transaction is recorded in the log, and after a crash the log is replayed: changes up to the last commit point are kept, incomplete transactions are dropped, and the file system is returned to a consistent state.
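
The idea of a log with commit points can be sketched in Python as follows. The record layout ("begin", "set", "commit") and the replay function are assumptions made for illustration; they do not reflect the actual JFS log format.

```python
def replay(journal, metadata):
    """Rebuild metadata from the log, honoring only committed transactions.

    `journal` is a list of tuples: ("begin", txid), ("set", txid, key, value),
    or ("commit", txid). A transaction without a commit record was interrupted
    and is ignored, which leaves the state as of the last commit point.
    """
    committed = {record[1] for record in journal if record[0] == "commit"}
    state = dict(metadata)
    for record in journal:
        if record[0] == "set" and record[1] in committed:
            _, _, key, value = record
            state[key] = value
    return state


# Transaction 1 committed; transaction 2 was cut short by a crash.
journal = [
    ("begin", 1),
    ("set", 1, "size", 5),
    ("set", 1, "free_blocks", 11),
    ("commit", 1),
    ("begin", 2),
    ("set", 2, "size", 9),
]
recovered = replay(journal, {"size": 3, "free_blocks": 13})
```

Here recovered reflects transaction 1 in full and nothing of transaction 2, so the size and the free-block count always agree with each other.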

Features of JFS

JFS for Linux is a file system based on IBM's JFS file system for OS/2 Warp Server for e-business. Released as open source in early 2000 with a GPL license and ported to Linux soon after, JFS is well suited for enterprise environments. JFS uses many advanced techniques to boost performance, provide for very large file systems, and, of course, journal changes to the file system. Some of the features of JFS include:
  • Extent-based addressing structures: JFS uses extent-based addressing structures, along with aggressive block allocation policies to produce compact, efficient, and scalable structures for mapping logical offsets within files to physical addresses on disk. This feature yields excellent performance.
  • Dynamic inode allocation: JFS dynamically allocates space for disk inodes as required, freeing the space when it is no longer required. This is a radical improvement over Ext2, which reserves a fixed amount of space for disk inodes at file system creation time. With dynamic inode allocation, users do not have to estimate the maximum number of files and directories that a file system will contain. Additionally, this feature decouples disk inodes from fixed disk locations.
  • Directory organization: Two different directory organizations are provided: one is used for small directories and the other for large directories. The contents of a small directory (up to eight entries) are stored within the directory's inode. This eliminates the need for separate directory block I/O and the need to allocate separate storage. The contents of larger directories are organized in a B+ tree keyed on name. B+ trees provide faster directory lookup, insertion, and deletion capabilities when compared to traditional unsorted directory organizations.
  • Online resizing: Allows the file system to grow while it is mounted. This feature is used with a volume manager.
  • Online snapshot: Enables backing up an active file system. It provides an online backup mechanism by creating a point-in-time image of the file system, which eliminates the need to take the system offline to obtain a consistent backup. This feature is used with a volume manager.
  • No integrity mount option: Allows the file system to not journal file system metadata changes. This feature can be used by a restore program to decrease the restore time.
  • 64-bit: JFS is a full 64-bit file system. All of the appropriate file system structure fields are 64 bits in size. This allows JFS to support large files and volumes.
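
The two directory organizations described above can be illustrated with a small Python model. It is a sketch under stated assumptions: ToyDirectory is a hypothetical class, the dictionary stands in for entries kept inside the inode, and a sorted list searched with bisect stands in for the B+ tree keyed on name.

```python
import bisect

INLINE_LIMIT = 8  # small directories: up to eight entries live in the inode


class ToyDirectory:
    def __init__(self):
        self.inline = {}    # name -> inode number, kept "inside the inode"
        self.names = None   # sorted names once the directory grows large
        self.inodes = None  # inode numbers, parallel to self.names

    @property
    def organization(self):
        return "inline" if self.names is None else "sorted index"

    def add(self, name, inode_no):
        if self.names is None:
            self.inline[name] = inode_no
            if len(self.inline) > INLINE_LIMIT:
                self._convert()
            return
        i = bisect.bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            self.inodes[i] = inode_no       # replace an existing entry
        else:
            self.names.insert(i, name)      # keep the names sorted
            self.inodes.insert(i, inode_no)

    def _convert(self):
        # Move the inline entries into the name-keyed sorted structure.
        items = sorted(self.inline.items())
        self.names = [name for name, _ in items]
        self.inodes = [inode_no for _, inode_no in items]
        self.inline = {}

    def lookup(self, name):
        if self.names is None:
            return self.inline.get(name)
        i = bisect.bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            return self.inodes[i]
        return None
```

The point of the second form is that a lookup by name is a search in ordered data rather than a scan of every entry.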
There are other advanced features in JFS, such as allocation groups (which speed file access by maximizing locality). Two additional features are extended attributes and Access Control Lists. Because Access Control Lists give a user finer control over file permissions, understanding them first requires a look at Linux's standard file permissions.

If you've spent even a little time with a Linux system, you're probably quite familiar with Linux's file permission scheme. In a nutshell, you may read, write, or execute a file (or in the case of a directory, search the directory) only if you have the proper permission. Furthermore, the traditional Linux read, write, and execute permissions are distinct, and each of those rights can be granted separately to the owner (a user) of the file, to the group that owns the file, and to other, which represents users other than the owner and users in the named group. Linux commands like chmod, chown, and chgrp affect the permissions and change the owners of files.

In general, Linux's simple permission scheme works well and is especially effective when access rights align with the users and groups on the system. But if you want to grant access rights to lists of users that do not belong to an existing group, the scheme breaks down. For example, if you want to share one of your personal files, phones.txt, with every member of your group, say, staff, you can grant that access with two commands: chgrp staff phones.txt, and chmod g+r phones.txt. However, if you want to give read access to friends.txt to Debbie and Bo, and read access to colleagues.txt to Bo and Abby, you'd have to create two different groups with Bo in each one. (Or, perhaps it's more accurate to say that your system administrator would have to create the groups.)
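
For readers who script such changes, the chmod g+r step can be approximated from Python with the standard os and stat modules. The helper name grant_group_read is made up for this sketch; changing the owning group would be a separate step (for example with shutil.chown(path, group="staff")) and requires that you belong to that group.

```python
import os
import stat


def grant_group_read(path):
    """Rough equivalent of `chmod g+r path`; returns the resulting mode bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)  # current permission bits
    os.chmod(path, mode | stat.S_IRGRP)         # add group read, keep the rest
    return stat.S_IMODE(os.stat(path).st_mode)
```

Note that the only grantees this scheme can express are the owner, one group, and everyone else, which is exactly the limitation described above.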

More Flexibility with Fine-Grained Control

As you can see, managing permissions through "special interest groups" is terribly inconvenient, and worse, it doesn't scale. A more flexible scheme is Access Control Lists, or ACLs. Instead of capturing permissions in just a few flags, ACLs record permissions in an individual and extensible list of access rights that are attached to each file or directory. Access control rights can be assigned to a specific user, a specific group, or to multiple users or groups in any combination. In a sense, ACLs are like the "Will Call" list at the hottest restaurant in town: if you're not on the access control list, you don't get in.

Reusing the example above, if you want to give access to friends.txt to Debbie and Bo, you simply grant read access to both users. No (administrative) group is needed. Need to grant access to a third user? Simply give that user the appropriate access rights. In a sense, ACLs enhance security because ACLs can implement an access policy directly, even if the policy is different for every file on the system.

ACLs also make it possible to build advanced system applications such as Samba, which, like the Windows systems it interoperates with, relies on ACLs. (For more information on how Samba uses ACLs, see the sidebar "ACL Support in Samba.") First, let's see how Extended Attributes work and how they can be used.

File Access Control Lists and Extended Attributes (EAs) are currently supported by the Ext2, Ext3, JFS, ReiserFS, and XFS file systems. You've already seen what an ACL is for; EAs are simply the underlying mechanism used to record ACLs.

An EA consists of a name/value pair, and associates arbitrary pieces of file metadata, or data about data, with a file or directory. EAs are not a part of the file's data. Instead, EAs are maintained separately and automatically managed by the file system.

More than one EA can be attached to a specific file or directory, and an EA can store system objects (such as access control lists or the capabilities of an executable) and user objects (such as the MIME type or character set of a file). Applications can define and associate extended attributes with a file object (remember, a directory is just a special file) through file system function calls.

Extended attributes can be used to store almost anything. You can maintain a file's history; categorize the contents of the file (such as text, icons, bitmaps); record the version of the file; append additional data; or do all of the above. For example, Figure 2 shows five extended attributes (Version, File Type, Additional data, Install, and History) of fileA.
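
On Linux, Python exposes extended attributes through os.setxattr, os.getxattr, and os.listxattr. The following is a minimal sketch, assuming a file system mounted with user extended attribute support; the helper tag_file and the attribute names passed to it are illustrative only.

```python
import os


def tag_file(path, attrs):
    """Attach name/value pairs to `path` as user EAs and read them back.

    Returns the attributes read back as a dict, or None when the platform or
    the underlying file system does not support user extended attributes.
    """
    if not hasattr(os, "setxattr"):  # these os functions exist only on Linux
        return None
    try:
        for name, value in attrs.items():
            # Unprivileged attributes live in the "user." namespace.
            os.setxattr(path, "user." + name, value.encode("utf-8"))
        return {
            key[len("user."):]: os.getxattr(path, key).decode("utf-8")
            for key in os.listxattr(path)
            if key.startswith("user.")
        }
    except OSError:
        return None
```

Calling tag_file("fileA", {"version": "1.2", "file_type": "text"}) on a suitable file system stores both pairs alongside the file without touching the file's data.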

With EAs in place, ACLs are relatively easy to implement. An Access Control Entry, or ACE, is an individual entry in an ACL. Each ACE is a triple defined by an entry type, either group or user; a group name, username, numeric UID, or numeric GID, depending on the value of the first field; and the access permission or right (read, write, execute) associated with the ACE. So, in the abstract, giving Debbie permission to read friends.txt means that the ACL attached to friends.txt contains an ACE (user, Debbie, read).
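
The ACE triple lends itself to a short Python sketch. This is a simplified model of the idea only: the ACE type and is_allowed function are hypothetical, and real POSIX ACL evaluation also involves the file owner, the owning group, a mask entry, and "other" permissions.

```python
from typing import NamedTuple


class ACE(NamedTuple):
    entry_type: str  # "user" or "group"
    name: str        # username or group name, depending on entry_type
    right: str       # "read", "write", or "execute"


def is_allowed(acl, user, groups, right):
    """Return True if any entry grants `right` to the user or one of their groups."""
    for ace in acl:
        if ace.right != right:
            continue
        if ace.entry_type == "user" and ace.name == user:
            return True
        if ace.entry_type == "group" and ace.name in groups:
            return True
    return False


# The article's example: two files, two overlapping sets of readers, no new groups.
friends_acl = [ACE("user", "Debbie", "read"), ACE("user", "Bo", "read")]
colleagues_acl = [ACE("user", "Bo", "read"), ACE("user", "Abby", "read")]
```

Granting a third user access to friends.txt is a matter of appending one more ACE to friends_acl.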

Currently, ACLs are the only Linux feature dependent on EAs. Other operating systems have had EAs for several years, and uses of EAs on those operating systems are broader.

ACL Support in Samba

To make Samba as portable as possible, the designers of Samba decided against a custom implementation of ACLs. Instead, each Samba server converts NT ACL specifications (sent via MS-RPC) into a POSIX ACL, and then converts that neutral ACL into an ACL that's platform-specific. A conceptual illustration of Samba's ACL subsystem is shown below.

If the Samba server's underlying file system supports ACLs, and the POSIX ACL can be converted to a native ACL, Windows users can manipulate server-side ACLs on the Samba server using the common Windows NT commands.

Samba 2.2 included support for ACLs, but up until now, Samba has had no way to store ACLs directly on the file system since there was no ACL support available for Linux. That's no longer an issue, and Samba will preserve NTFS ACLs rather than mapping ACL permissions to the less-flexible, standard Unix permissions. (Windows NT and Windows 2000 use ACLs to set permissions on files and directories. That scheme offers a much finer-grained control over permissions than the traditional "one user, one group" solution that most Unix systems use.)

Native ACL support, in combination with winbind, allows a Linux-based system to "assimilate" Windows NT users, groups, and ACL permissions. Quite an impressive solution!

Resources

  • Extended Attributes and Access Control Lists: http://acl.bestbits.at
  • JFS for Linux: http://oss.software.ibm.com/jfs
  • ReiserFS: www.namesys.com
  • XFS: http://oss.sgi.com/projects/xfs
  • Samba: http://us1.samba.org/samba/samba.html

About the Author

Steve Best is a Senior Software Engineer in the Linux Technology Center of IBM in Austin, Texas. He is currently working on the Journaled File System (JFS) for Linux project. Steve has done extensive work in operating system development, with a focus in the areas of file systems, internationalization, and security. He can be reached at [email protected]
