Writing programs to cope with I/O errors causing lost writes on Linux

`fsync()` returns `-EIO` if the kernel lost a write

(Note: early part references older kernels; updated below to reflect modern kernels)

It looks like a failure during async buffer write-out in end_buffer_async_write(...) sets an -EIO flag on the failed dirty buffer page for the file:

    set_bit(AS_EIO, &page->mapping->flags);
    set_buffer_write_io_error(bh);
    clear_buffer_uptodate(bh);
    SetPageError(page);

which is then detected by wait_on_page_writeback_range(...) as called by do_sync_mapping_range(...) as called by sys_sync_file_range(...) as called by sys_sync_file_range2(...) to implement the C library call fsync().

But only once!

This comment on sys_sync_file_range

     * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
     * I/O errors or ENOSPC conditions and will return those to the caller, after
     * clearing the EIO and ENOSPC flags in the address_space.

suggests that when fsync() returns -EIO or (undocumented in the manpage) -ENOSPC, it will clear the error state so a subsequent fsync() will report success even though the pages never got written.

Sure enough wait_on_page_writeback_range(...) clears the error bits when it tests them:

    /* Check for outstanding write errors */
    if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
            ret = -ENOSPC;
    if (test_and_clear_bit(AS_EIO, &mapping->flags))
            ret = -EIO;

So if the application expects it can re-try fsync() until it succeeds and trust that the data is on-disk, it is terribly wrong.
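To make the failure mode concrete, here is a toy userspace model of the report-once behaviour. It is only a sketch: `toy_mapping`, `toy_writeback_failed`, `toy_fsync` and `toy_retry_until_success` are made-up names that imitate the test-and-clear logic quoted above, and none of it is kernel code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the per-file error state. */
struct toy_mapping {
    bool eio;           /* models the AS_EIO bit */
    bool data_on_disk;  /* whether the dirty page actually reached storage */
};

/* Models a failed asynchronous write-out: the data is lost, the flag is set. */
void toy_writeback_failed(struct toy_mapping *m)
{
    m->data_on_disk = false;
    m->eio = true;
}

/* Models the check made on fsync(): report the error once, then clear it. */
int toy_fsync(struct toy_mapping *m)
{
    if (m->eio) {
        m->eio = false;
        return -EIO;
    }
    return 0;
}

/* The broken pattern: retry until fsync() "succeeds". Returns the call count. */
int toy_retry_until_success(struct toy_mapping *m)
{
    int calls = 1;

    while (toy_fsync(m) != 0)
        calls++;
    return calls;
}
```

In this model, after one failed write-out the retry loop stops on its second call because the flag has already been cleared, while `data_on_disk` is still false. Nothing was rewritten; the error was simply forgotten.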

I'm pretty sure this is the source of the data corruption I found in the DBMS. It retries fsync() and thinks all will be well when it succeeds.

Is this allowed?

The POSIX/SuS docs on fsync() don't really specify this either way:

If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed.

Linux's man-page for fsync() just doesn't say anything about what happens on failure.

So it seems that the meaning of fsync() errors is "I don't know what happened to your writes, might've worked or not, better try again to be sure".

Newer kernels

On 4.9 end_buffer_async_write sets -EIO on the page, just via mapping_set_error.

    buffer_io_error(bh, ", lost async page write");
    mapping_set_error(page->mapping, -EIO);
    set_buffer_write_io_error(bh);
    clear_buffer_uptodate(bh);
    SetPageError(page);

On the sync side I think it's similar, though the structure is now pretty complex to follow. filemap_check_errors in mm/filemap.c now does:

    if (test_bit(AS_EIO, &mapping->flags) &&
        test_and_clear_bit(AS_EIO, &mapping->flags))
            ret = -EIO;

which has much the same effect. Error checks seem to all go through filemap_check_errors which does a test-and-clear:

    if (test_bit(AS_EIO, &mapping->flags) &&
        test_and_clear_bit(AS_EIO, &mapping->flags))
            ret = -EIO;
    return ret;

I'm using btrfs on my laptop, but when I create an ext4 loopback for testing on /mnt/tmp and set up a perf probe on it:

    sudo dd if=/dev/zero of=/tmp/ext bs=1M count=100
    sudo mke2fs -j -T ext4 /tmp/ext
    sudo mount -o loop /tmp/ext /mnt/tmp
    sudo perf probe filemap_check_errors
    sudo perf record -g -e probe:end_buffer_async_write -e probe:filemap_check_errors dd if=/dev/zero of=/mnt/tmp/test bs=4k count=1 conv=fsync

I find the following call stack in perf report -T:

    ---__GI___libc_fsync
       entry_SYSCALL_64_fastpath
       sys_fsync
       do_fsync
       vfs_fsync_range
       ext4_sync_file
       filemap_write_and_wait_range
       filemap_check_errors

A read-through suggests that yeah, modern kernels behave the same.

This seems to mean that if fsync() (or presumably write() or close()) returns -EIO, the file is in some undefined state between its contents as of your last successful fsync() or close() and its most recently written contents.
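One way an application might act on this is to call fsync() exactly once, check close() as well, and treat any error as final instead of retrying. The helper below is a minimal sketch of that idea. `sync_and_close` is a hypothetical name, and what the caller does on failure (rewrite from its own copy, enter recovery, abort) is deliberately left open.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Flush and close fd. Returns 0 on success or a negative errno value.
 * fsync() is called exactly once: a later "successful" fsync() would
 * not prove that the earlier dirty pages reached the disk.
 */
int sync_and_close(int fd)
{
    int err = 0;

    if (fsync(fd) != 0)
        err = -errno;

    /* close() may report an error too; keep the first error seen. */
    if (close(fd) != 0 && err == 0)
        err = -errno;

    return err;
}
```

A non-zero return here should be read as "everything written since the last successful sync is suspect", not as an invitation to call fsync() again.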

Test

I've implemented a test case to demonstrate this behaviour.

Implications

A DBMS can cope with this by entering crash recovery. How on earth is a normal user application supposed to cope with this? The fsync() man page gives no warning that it means "fsync-if-you-feel-like-it" and I expect a lot of apps won't cope well with this behaviour.
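For an ordinary application that still holds its data in memory, one common defensive pattern is to write a fresh temporary file, fsync() it once, and rename() it over the old file only if every step succeeded. This is a sketch under stated assumptions, not something taken from the kernel sources above: `replace_file` and the `.tmp` suffix are hypothetical, the buffer size is arbitrary, and a fuller version would also fsync() the containing directory after the rename().

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace `path` with `len` bytes from `buf` via a temporary file.
 * Returns 0 on success or a negative errno value. On failure the old
 * `path` is left untouched and the temporary file is removed, so the
 * caller can redo the whole write from its own copy of the data.
 */
int replace_file(const char *path, const void *buf, size_t len)
{
    char tmp[4096];  /* arbitrary limit for this sketch */
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -ENAMETOOLONG;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    int err = 0;
    const char *p = buf;
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;  /* nothing was written; retrying write() is safe */
            err = -errno;
            break;
        }
        p += w;
        len -= (size_t)w;
    }

    /* One fsync() only; a failure is final for this temporary file. */
    if (err == 0 && fsync(fd) != 0)
        err = -errno;
    if (close(fd) != 0 && err == 0)
        err = -errno;

    /* Publish the new contents only if every earlier step succeeded. */
    if (err == 0 && rename(tmp, path) != 0)
        err = -errno;
    if (err != 0)
        unlink(tmp);
    return err;
}
```

The point of the design is that a failed fsync() is answered by rewriting the data into a new file, never by calling fsync() again on the same descriptor and hoping.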
