How could a WAL (write-ahead log) have better performance than writing directly to disk?


Performance.

  • Step two in your list is optional. For busy records, the value might not make it out of the cache and onto the disk before it is updated again. Those intermediate writes never need to be performed; only the log writes are needed, for possible recovery.

  • Log writes can be batched into larger, sequential writes. For busy workloads, delaying a log write and then performing a single write can significantly improve throughput.

This was much more important when spinning disks were the standard technology, because seek times and rotational latency were a big issue. Both come from the physical process of getting the right part of the disk under the read/write head. With SSDs those considerations are not so important, but avoiding some writes and making the remaining writes large and sequential still helps.
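
To put rough numbers on the spinning-disk case, here is a back-of-the-envelope calculation. The drive figures (7200 RPM, 8 ms average seek, 150 MB/s sequential transfer, 4 KiB records) are assumed for illustration and do not describe any particular drive; the function names are made up for this sketch. It also ignores the transfer time of each random write, which is small next to the seek.

```python
def hdd_random_writes_per_second(rpm, avg_seek_ms):
    """Each random write pays one seek plus, on average, half a rotation."""
    rotational_latency_ms = 0.5 * 60_000 / rpm  # half a revolution, in ms
    return 1000 / (avg_seek_ms + rotational_latency_ms)


def sequential_records_per_second(transfer_mb_per_s, record_bytes):
    """Sequential appends are limited by transfer rate, not head movement."""
    return transfer_mb_per_s * 1_000_000 / record_bytes


if __name__ == "__main__":
    # Assumed, illustrative drive parameters.
    random_rate = hdd_random_writes_per_second(rpm=7200, avg_seek_ms=8.0)
    sequential_rate = sequential_records_per_second(150, 4096)
    print(f"random:     about {random_rate:.0f} writes/s")       # about 82
    print(f"sequential: about {sequential_rate:.0f} records/s")  # about 36621
```

Under these assumptions the sequential path handles a few hundred times more records per second, which is the gap a log exploits.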

Update:

SSDs also have better performance with large sequential writes, but for different reasons. It is not as simple as saying "no seek time or rotational latency, therefore just write randomly". For example, writing large blocks into space the SSD knows is "free" (e.g. via the TRIM command to the drive) is better than read-modify-write, where the drive also needs to manage wear levelling and potentially map updates onto different internal block sizes.
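
Coming back to the two points at the top of this answer (data writes that can be skipped, and log writes that can be batched), here is a minimal sketch. `BatchedLog` and `run_demo` are hypothetical names invented for this example, not part of any database. A real system would also delay acknowledging a commit until the batch containing it has been flushed; that part is left out here.

```python
import os
import tempfile


class BatchedLog:
    """Append-only log that buffers records and writes them in one go."""

    def __init__(self, path, batch_size=4):
        self.f = open(path, "ab")
        self.batch_size = batch_size
        self.pending = []   # records accepted but not yet on disk
        self.flushes = 0    # number of physical write + fsync rounds

    def append(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One larger sequential write instead of one write per record.
        self.f.write(b"".join(r + b"\n" for r in self.pending))
        self.f.flush()
        os.fsync(self.f.fileno())
        self.pending.clear()
        self.flushes += 1

    def close(self):
        self.flush()
        self.f.close()


def run_demo(directory):
    """Update one busy record 8 times; return (log flushes, data writes)."""
    log = BatchedLog(os.path.join(directory, "wal.log"), batch_size=4)
    cache = {}  # in-memory copy of the records
    for i in range(8):
        log.append(f"counter={i}".encode())
        cache["counter"] = i  # overwritten in memory, not written to disk
    log.close()

    # Checkpoint: each cached record is written to the data file once.
    data_writes = 0
    with open(os.path.join(directory, "data.db"), "w") as f:
        for key, value in cache.items():
            f.write(f"{key}={value}\n")
            data_writes += 1
    return log.flushes, data_writes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(run_demo(d))  # (2, 1)
```

Eight updates cost two log flushes and a single data write, instead of eight random writes to the data file.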


Here is my guess.

Writing every update directly to disk means no recovery is needed after a power failure. The performance question, though, needs to be considered in two situations.

Situation 1:

All your storage devices are spinning disks. The WAL approach will perform better, because writing to the WAL is a sequential write, while writing the data to its place on disk is a random write. On spinning disks, random writes are far slower than sequential writes.

Situation 2:

All your devices are SSDs. Then the performance may not differ much, because the gap between sequential and random write performance is much smaller on SSDs.


As you note, a key contribution of a WAL is durability. Once a mutation has been committed to the WAL, you can return to the caller, because even if the system crashes the mutation is not lost.
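
A minimal sketch of that durability contract, assuming a line-per-mutation JSON log format made up for this example (`wal_append` and `wal_replay` are hypothetical helpers, not the format or API of any real database). The append only returns after `os.fsync`, and recovery simply re-applies the log in order.

```python
import json
import os
import tempfile


def wal_append(path, key, value):
    """Append one mutation and force it to stable storage before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())
    # Only now is it safe to acknowledge the write to the caller.


def wal_replay(path):
    """Rebuild the in-memory state by re-applying every logged mutation."""
    state = {}
    if not os.path.exists(path):
        return state
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                break  # torn final record from a crash mid-write: stop here
            state[entry["key"]] = entry["value"]
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        wal = os.path.join(d, "wal.log")
        wal_append(wal, "a", 1)
        wal_append(wal, "b", 2)
        wal_append(wal, "a", 3)
        # Simulated restart: the state comes back from the log alone.
        print(wal_replay(wal))  # {'a': 3, 'b': 2}
```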

If you write the update directly to disk, there are two options:

  1. write all records to the end of some file
  2. the files are somehow structured

If you go with 1), needless to say, the cost of a read is O(mutations), so pretty much every system uses 2). RocksDB uses an LSM tree, whose files are internally sorted by key. For that reason, "directly writing to disk" means that you possibly have to rewrite every record that comes after the current key. That's too expensive, so instead you

  1. write to the WAL for persistence
  2. update the memtables (in RAM)

Because the memtables and the files on disk are sorted, read accesses are still reasonably fast. Updating the sorted structure in memory is easy because that's just a balanced tree. When you flush the memtable to disk and/or run a compaction, you rewrite your file structures once to reflect the result of many writes, which amortizes the rewrite cost and makes each individual write substantially cheaper.
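
The write path described above can be sketched as a toy store. This is not how RocksDB is implemented: `TinyLSM` is a hypothetical class, the file names and tab-separated format are invented, a sorted list kept with `bisect` stands in for the balanced tree, and compaction and WAL recovery are omitted. Keys and values are assumed to be strings without tabs or newlines.

```python
import bisect
import os
import tempfile


class TinyLSM:
    def __init__(self, directory, memtable_limit=3):
        self.dir = directory
        self.wal_path = os.path.join(directory, "wal.log")
        self.wal = open(self.wal_path, "a", encoding="utf-8")
        self.keys = []    # memtable keys, kept sorted
        self.values = {}  # memtable key -> latest value
        self.limit = memtable_limit
        self.sstables = []  # flushed sorted files, oldest first

    def put(self, key, value):
        # 1. Sequential append to the WAL for persistence.
        self.wal.write(f"{key}\t{value}\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2. Update the sorted memtable in RAM.
        if key not in self.values:
            bisect.insort(self.keys, key)
        self.values[key] = value
        if len(self.keys) >= self.limit:
            self.flush_memtable()

    def flush_memtable(self):
        # One sequential write of a sorted file covers many earlier puts.
        path = os.path.join(self.dir, f"sst-{len(self.sstables):04d}.txt")
        with open(path, "w", encoding="utf-8") as f:
            for k in self.keys:
                f.write(f"{k}\t{self.values[k]}\n")
            f.flush()
            os.fsync(f.fileno())
        self.sstables.append(path)
        self.keys.clear()
        self.values.clear()
        # The flushed mutations no longer need the log: start a fresh WAL.
        self.wal.close()
        self.wal = open(self.wal_path, "w", encoding="utf-8")

    def get(self, key):
        if key in self.values:
            return self.values[key]
        for path in reversed(self.sstables):  # newest file wins
            with open(path, encoding="utf-8") as f:
                rows = [line.rstrip("\n").split("\t", 1) for line in f]
            ks = [r[0] for r in rows]
            i = bisect.bisect_left(ks, key)  # binary search: file is sorted
            if i < len(ks) and ks[i] == key:
                return rows[i][1]
        return None

    def close(self):
        self.wal.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        db = TinyLSM(d, memtable_limit=3)
        for k, v in [("b", "1"), ("a", "2"), ("b", "3"), ("c", "4")]:
            db.put(k, v)
        print(db.get("b"), len(db.sstables))  # 3 1
        db.close()
```

Four puts, two of them to the same key, result in one sorted file of three records. The overwritten value of `b` never reaches a data file at all.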