Why Would Someone Suggest Turning Off the Disk Write Buffer?
TL;DR
A disk write buffer can violate the user’s assumptions about durability (e.g., an ATM transaction lost on power failure) and about write order (e.g., a journal COMMIT reaching the platter before the transaction it commits).
What is a disk write buffer
A disk write buffer is a small amount of memory embedded in a disk drive. When the disk receives a write operation, it can signal “write done” to the invoker as soon as the data is in the buffer, before it is on the platter. Thus, there is a time window between the “claimed done” and the “actual done”. This lets the invoker move on instead of waiting, since writing to the platter is much slower. Another benefit is that the disk can merge and re-order writes within the buffer to reduce unnecessary disk head movement and improve performance. Experiments show that in some scenarios the write buffer can improve disk write performance by an order of magnitude.
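To make the mechanism concrete, here is a minimal toy model in Python. It is only a sketch: the class `ToyDisk` and its methods are hypothetical names invented for illustration, not a real driver or firmware interface, and draining the buffer in sector order is an assumed stand-in for whatever scheduling a real drive uses.

```python
class ToyDisk:
    """Toy model of a drive with a volatile write buffer (illustration only)."""

    def __init__(self):
        self.buffer = {}         # sector -> data; lost on power failure
        self.platter = {}        # sector -> data; survives power failure
        self.platter_writes = 0  # how many physical writes were needed

    def write(self, sector, data):
        # Acknowledge as soon as the data is in the buffer.
        # A later write to the same sector replaces the earlier one (merge).
        self.buffer[sector] = data
        return "write done"

    def flush(self):
        # Drain in sector order, not arrival order, to mimic
        # re-ordering that reduces head movement.
        for sector in sorted(self.buffer):
            self.platter[sector] = self.buffer[sector]
            self.platter_writes += 1
        self.buffer.clear()

    def power_failure(self):
        # Everything still in the buffer is gone.
        self.buffer.clear()


disk = ToyDisk()
disk.write(90, "A")
disk.write(10, "B")
disk.write(90, "C")         # overwrites "A" while still in the buffer
disk.flush()
print(disk.platter)         # {10: 'B', 90: 'C'}
print(disk.platter_writes)  # 2
```

Three acknowledged writes became two platter writes, and sector 10 was written before sector 90 even though it was issued later. Calling `power_failure()` before `flush()` would have left the platter empty despite three “write done” replies.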
Two “features” from the invoker’s perspective
From the invoker’s perspective, enabling the write buffer brings two “features”, if not bugs.
- First, the invoker can never know when a write operation is actually done. In other words, the window can in theory be arbitrarily long.
- Second, the order of write operations can be arbitrary. That is, if an invoker writes “A”, “B” and “C” to a disk in that order, the disk may write “C” to the platter first, then “B”, and finally “A”.
Two problems caused by these features
Here comes the problem: the two features violate two assumptions that an invoker makes.
- Durability assumption: my data is safe once a write operation returns.
- Order assumption: multiple writes reach the platter in the order in which they were issued.
First, durability is not ensured. If the power fails within the window, the data is lost even though the user believes it is safe. Consider this case: a user withdraws some money from an ATM. The ATM writes the transaction to disk and then prints a receipt for the user. At that moment, the ATM loses power. When it reboots, it finds that the transaction is not on the disk. From the user’s perspective, however, the transaction is done.
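An alternative to disabling the buffer is for the application to ask explicitly for data to be pushed to stable storage before it acts on the result. The sketch below uses Python’s `os.fsync`; the function name `durable_write` is a hypothetical helper. One important caveat: whether `fsync` also forces the drive’s own write buffer to be emptied depends on the operating system, the file system and the drive. On macOS, for example, the documented way to request that is `fcntl` with `F_FULLFSYNC`. Treat this as an illustration of the idea, not as a guarantee.

```python
import os


def durable_write(path, payload: bytes):
    """Write payload to path and return only after asking the OS to persist it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the OS to flush its caches for this file to the device.
        # Assumption: the platform propagates this to the drive's buffer.
        os.fsync(fd)
    finally:
        os.close(fd)
```

In the ATM scenario, the receipt would be printed only after `durable_write` has returned, which narrows the window in which an acknowledged transaction can be lost.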
Second, the order is not ensured. If an invoker writes “A”, “B” and “C” in order, then after a power failure the disk may hold only “C” and have lost “A” and “B”. This can be very harmful in the following case: in a journaling file system, a COMMIT is written to disk if and only if a write transaction has completed, which means the COMMIT must reach the platter after the data and metadata have reached the platter. If that order is not guaranteed, a crash can leave a COMMIT on disk without the data it vouches for, and the file system recovery procedure cannot work properly.
That’s why someone might suggest disabling the disk write buffer.
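The ordering requirement can be sketched at the application level as a tiny append-only journal. This is a simplified illustration, not how any particular file system implements its journal: `journal_transaction`, `committed_transactions` and the `COMMIT` line format are hypothetical, and the sketch assumes that `os.fsync` acts as an ordering point all the way down to the platter, which depends on the platform and the drive.

```python
import os

COMMIT = b"COMMIT\n"  # hypothetical marker format for this sketch


def _write_all(fd, data: bytes):
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]


def journal_transaction(journal_path, records):
    """Append records, then COMMIT, with a flush between the two steps."""
    fd = os.open(journal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for record in records:
            _write_all(fd, record + b"\n")
        os.fsync(fd)            # records should be stable before COMMIT is issued
        _write_all(fd, COMMIT)
        os.fsync(fd)            # COMMIT should be stable before we report success
    finally:
        os.close(fd)


def committed_transactions(journal_path):
    """Recovery: keep only groups of records that are followed by COMMIT."""
    done, pending = [], []
    with open(journal_path, "rb") as f:
        for line in f:
            if line == COMMIT:
                done.append(pending)
                pending = []
            else:
                pending.append(line.rstrip(b"\n"))
    return done  # trailing records without a COMMIT are discarded
```

Recovery trusts the COMMIT marker completely. If the drive were free to persist COMMIT ahead of the records, `committed_transactions` would report a transaction whose contents are missing or stale, which is exactly the failure described above.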