Does Fsync() Ensure Data Persistency When Disk Cache Is Enabled?

This question confused me for a long time. For years the answer was “no”, but nowadays it is “yes”. Take Ubuntu as an example: before 12.10 the answer was “no”; from 12.10 on, it is “yes”. We’ll track the changes to fsync’s man page to see the details.

Changes to fsync()

First, let’s look at The Open Group Base Specifications Issue 6, IEEE Std 1003.1, 2004 Edition. This is POSIX.

DESCRIPTION

The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected.

If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion.


RATIONALE

The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of “buffer cache”, “system crash”, “physical write”, and “non-volatile storage” are not defined here, the wording has to be more abstract.

If _POSIX_SYNCHRONIZED_IO is not defined, the wording relies heavily on the conformance document to tell the user what can be expected from the system. It is explicitly intended that a null implementation is permitted. This could be valid in the case where the system cannot assure non-volatile storage under any circumstances or when the system is highly fault-tolerant and the functionality is not required. In the middle ground between these extremes, fsync() might or might not actually cause data to be written where it is safe from a power failure. The conformance document should identify at least that one configuration exists (and how to obtain that configuration) where this can be assured for at least some files that the user can select to use for critical data. It is not intended that an exhaustive list is required, but rather sufficient information is provided so that if critical data needs to be saved, the user can determine how the system is to be configured to allow the data to be written to non-volatile storage.

It is reasonable to assert that the key aspects of fsync() are unreasonable to test in a test suite. That does not make the function any less valuable, just more difficult to test. A formal conformance test should probably force a system crash (power shutdown) during the test for this condition, but it needs to be done in such a way that automated testing does not require this to be done except when a formal record of the results is being made. It would also not be unreasonable to omit testing for fsync(), allowing it to be treated as a quality-of-implementation issue.

Second, let’s look at the fsync man page from Ubuntu 8.04 (Hardy):

DESCRIPTION

fsync() transfers (“flushes”) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) where that file resides. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.


NOTES

Applications that access databases or log files often write a tiny data fragment (e.g., one line in a log file) and then call fsync() immediately in order to ensure that the written data is physically stored on the harddisk. Unfortunately, fsync() will always initiate two write operations: one for the newly written data and another one in order to update the modification time stored in the inode. If the modification time is not a part of the transaction concept fdatasync() can be used to avoid unnecessary inode disk write operations.

If the underlying hard disk has write caching enabled, then the data may not really be on permanent storage when fsync() / fdatasync() return.

When an ext2 file system is mounted with the sync option, directory entries are also implicitly synced by fsync().

On kernels before 2.4, fsync() on big files can be inefficient. An alternative might be to use the O_SYNC flag to open(2).

In Linux 2.2 and earlier, fdatasync() is equivalent to fsync(), and so has no performance advantage.

Third, here is the fsync man page from Ubuntu 14.04:

DESCRIPTION

fsync() transfers (“flushes”) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.


NOTES

On some UNIX systems (but not Linux), fd must be a writable file descriptor.

In Linux 2.2 and earlier, fdatasync() is equivalent to fsync(), and so has no performance advantage.

The fsync() implementations in older kernels and lesser used filesystems does not know how to flush disk caches. In these cases disk caches need to be disabled using hdparm(8) or sdparm(8) to guarantee safe operation.

Differences between fsync/fdatasync/sync_file_range/aio_fsync

Sync is a file system operation that flushes data to disk for persistence (durability). There are many variants of sync(), including fsync, fdatasync, aio_fsync, sync_file_range, etc. These calls are similar but differ in subtle ways.

  • fsync() is easy to understand. Given a fd, it flushes all of the file’s dirty data to disk, along with the metadata in its inode. However, it does not flush the file’s directory entry. So if you create a file (in some directory, of course), write some data, fsync() and close it, and the machine then crashes, you may not find the file in the directory after reboot, because the directory’s own data was never flushed. It is therefore recommended to fsync() a directory after creating or deleting a file in it. A directory can only be opened in read-only mode, so can it be flushed at all? The answer is yes: a directory can be flushed even though its file descriptor is read-only. More details can be found in this article: Everything You Always Wanted to Know About Fsync().

  • fdatasync() is similar to fsync(). The only difference is that it skips metadata that is not needed to read the data back correctly. Which metadata is necessary? The file size, for example. Which is not? mtime, atime, etc. Using fdatasync typically improves performance because it can avoid at least one disk write to the inode, which usually incurs a long seek. IMHO, fdatasync can replace fsync in most cases.

  • sync_file_range() can be used to flush a range of a file instead of the whole file. It was introduced in Linux 2.6.17 and is a non-standard, Linux-specific API. It has three flags: SYNC_FILE_RANGE_WAIT_BEFORE, SYNC_FILE_RANGE_WRITE and SYNC_FILE_RANGE_WAIT_AFTER. None of these operations writes out the file’s metadata, whereas fdatasync only skips the metadata it does not need. Note that the man page says: “This system call is extremely dangerous and should not be used in portable programs.” There’s also little information about it on Google. I guess it is not widely used yet.

  • aio_fsync() is part of the POSIX AIO interface. It just puts a sync request on the queue and returns; it does not wait for I/O completion. Its arguments include an aiocb struct, whose aio_sigevent field indicates the desired type of asynchronous notification at completion. I guess it can be used to decouple ordering from durability: the call would only need to establish ordering, without flushing the disk right away, so the flushes could be batched for performance.
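
The first two bullets can be sketched in C. This is only a minimal sketch, assuming Linux or another POSIX.1-2008 system. The helper names durable_create, durable_append and write_all are made up for illustration and are not part of any library. The first helper follows the "fsync the file, then fsync its directory" advice; the second appends to an existing file and uses fdatasync(), which still has to persist the new file size.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: create `name` inside directory `dir`, write `data`,
 * then flush the file and the directory entry. Returns 0 or -1.
 */
int durable_create(const char *dir, const char *name, const char *data)
{
    assert(dir != NULL && name != NULL && data != NULL);

    /* A directory can only be opened read-only; fsync() still works on it. */
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        close(dfd);
        return -1;
    }

    int rc = 0;
    if (write_all(fd, data, strlen(data)) < 0)
        rc = -1;
    if (rc == 0 && fsync(fd) < 0)      /* file data + inode metadata */
        rc = -1;
    if (close(fd) < 0)
        rc = -1;
    if (rc == 0 && fsync(dfd) < 0)     /* the new directory entry */
        rc = -1;
    if (close(dfd) < 0)
        rc = -1;
    return rc;
}

/*
 * Hypothetical helper: append `data` to an existing file and flush it with
 * fdatasync(), which may skip mtime/atime but must persist the new size.
 */
int durable_append(const char *path, const char *data)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0)
        return -1;

    int rc = 0;
    if (write_all(fd, data, strlen(data)) < 0)
        rc = -1;
    if (rc == 0 && fdatasync(fd) < 0)
        rc = -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

A caller might use durable_create("/var/lib/myapp", "state.txt", "v1\n") once, followed by durable_append("/var/lib/myapp/state.txt", "v2\n") for later records (the paths here are placeholders). Whether the data actually survives a power failure still depends on the kernel, the filesystem and the disk cache behaviour described in the man pages quoted above.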