How to make use of the filesystem cache in Java or Python?


The file-system cache is an implementation detail of the operating system and is transparent to the end user. It isn't something you need to adjust or use explicitly. Lucene already makes use of the file-system cache when it manages the index segments. Every time something is indexed into Lucene (via Elasticsearch), those documents are written to segments, which first land in the file-system cache and are then, after some time (for example, when the translog, a way of keeping track of documents being indexed, is full), written to an actual file on disk. While the newly indexed documents are still only in the file-system cache, they can already be accessed.
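To make the "written to the cache first, to disk later" behaviour concrete, here is a minimal Java sketch. It is not Lucene or Elasticsearch code; the class name `PageCacheSketch` and the method `writeThenRead` are made up for illustration. It assumes a typical OS where an ordinary write lands in the file-system cache, and uses `FileChannel.force` to ask the OS to push the data to the storage device.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PageCacheSketch {

    // Hypothetical helper: writes text, reads it back before any sync,
    // then asks the OS to flush the file to the storage device.
    static String writeThenRead(Path file, String text) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {

            // After write() returns, the bytes have been handed to the OS.
            // On a typical OS they sit in the file-system cache at this point.
            ByteBuffer out = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
            while (out.hasRemaining()) {
                ch.write(out);
            }

            // The data is already readable, even though nothing forced it to disk yet.
            ByteBuffer in = ByteBuffer.allocate((int) ch.size());
            ch.position(0);
            while (in.hasRemaining() && ch.read(in) != -1) {
                // keep reading until the buffer is full or end of file
            }
            in.flip();
            String seen = StandardCharsets.UTF_8.decode(in).toString();

            // Explicitly request that content and metadata reach the device.
            ch.force(true);
            return seen;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("page-cache-sketch", ".txt");
        file.toFile().deleteOnExit();
        System.out.println(writeThenRead(file, "indexed document"));
    }
}
```

The point of the sketch is only that the read succeeds before `force` is called: the application never addresses the cache directly, the OS serves the read from it.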

This improvement in the doc values implementation describes the feature as being able to use the file-system cache: doc values are read from disk, placed in the cache, and accessed from there instead of taking up heap space.

How this file-system cache is being accessed is described in this excellent blog post:

In our previous approaches, we were relying on using a syscall to copy the data between the file system cache and our local Java heap. How about directly accessing the file system cache? This is what mmap does!

Basically mmap does the same like handling the Lucene index as a swap file. The mmap() syscall tells the O/S kernel to virtually map our whole index files into the previously described virtual address space, and make them look like RAM available to our Lucene process. We can then access our index file on disk just like it would be a large byte[] array (in Java this is encapsulated by a ByteBuffer interface to make it safe for use by Java code). If we access this virtual address space from the Lucene code we don’t need to do any syscalls, the processor’s MMU and TLB handles all the mapping for us. If the data is only on disk, the MMU will cause an interrupt and the O/S kernel will load the data into file system cache. If it is already in cache, MMU/TLB map it directly to the physical memory in file system cache.

As for actually using mmap in a Java program, I think this is the class and method to do so.
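The link target is not reproduced here. One standard way to memory-map a file in Java is `FileChannel.map`, which returns a `MappedByteBuffer`; the sketch below assumes that is the kind of call meant. The class name `MmapSketch` and its helper methods are hypothetical, chosen only for this example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {

    // Maps the whole file read-only and decodes it as UTF-8 text.
    static String readAll(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // map() sets up the mapping; pages are loaded on first access
            // and served from the file-system cache afterwards.
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return StandardCharsets.UTF_8.decode(buffer).toString();
        }
    }

    // Random access: reads a single byte at the given offset through the mapping.
    static byte byteAt(Path file, int offset) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return buffer.get(offset);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mmap-sketch", ".txt");
        file.toFile().deleteOnExit();
        Files.write(file, "hello mmap".getBytes(StandardCharsets.UTF_8));

        System.out.println(readAll(file));
        System.out.println((char) byteAt(file, 6));
    }
}
```

Reads through the `MappedByteBuffer` look like plain memory access in the program, which matches the "large byte[] array" description in the quoted post. For Python, the standard library's `mmap` module plays a similar role.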