
Determine a file's path(s) relative to a directory, including symlinks


This, like many things, is more complex than it might appear on the surface.

Each entity in the file system points at an inode, which describes the content of the file. Entities are the things you see - files, directories, sockets, block devices, character devices, etc...

The content of a single "file" can be accessed via one or more paths - each of these paths is called a "hard link". Hard links can only point at files on the same filesystem, they cannot cross the boundary of a filesystem.

It is also possible for a path to address a "symbolic link", which can point at another path - that path doesn't have to exist, it can be another symbolic link, it can be on another filesystem, or it can point back at the original path producing an infinite loop.

It is impossible to locate all links (symbolic or hard) that point at a particular entity without scanning the entire tree.
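To make the distinction concrete, here is a minimal sketch (POSIX only; the file names are arbitrary) showing that a hard link shares its target's inode, while a symlink is a separate entity with its own inode:

```python
import os
import tempfile

# build a scratch directory with one file, one hard link and one symlink
d = tempfile.mkdtemp()
original = os.path.join(d, 'original')
open(original, 'w').close()

hard = os.path.join(d, 'hard')
os.link(original, hard)        # a second hard link to the same inode
soft = os.path.join(d, 'soft')
os.symlink(original, soft)     # a symlink: its own inode, pointing at a path

print(os.stat(original).st_ino == os.stat(hard).st_ino)   # True - same inode
print(os.lstat(soft).st_ino == os.stat(original).st_ino)  # False - lstat() sees the link itself
print(os.stat(soft).st_ino == os.stat(original).st_ino)   # True - stat() follows the link
```

This is also why the scan below has to compare `os.stat()` against `os.lstat()` to tell the two kinds of link apart.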


Before we get into this... some comments:

  1. See the end for some benchmarks. I'm not convinced that this is a significant issue, though admittedly this filesystem is on a 6-disk ZFS array, on an i7, so a lower-spec system will take longer...
  2. Given that this is impossible without calling stat() on every file at some point, you're going to struggle to come up with a better solution that isn't significantly more complex (such as maintaining an index database, with all the issues that introduces)

As mentioned, we must scan (index) the whole tree. I know it's not what you want to do, but it's impossible without doing this...

To do this, you need to collect inodes, not filenames, and review them after the fact... there may be some optimisation here, but I've tried to keep it simple to prioritise understanding.

The following function will produce this structure for us:

def get_map(scan_root):
    # this dict will have device IDs at the first level (major / minor) ...
    # ... and inode IDs at the second level
    # each inode will have the following keys:
    #   - 'type'     the entity's type - i.e: dir, file, socket, etc...
    #   - 'links'    a list of all found hard links to the inode
    #   - 'symlinks' a list of all found symlinks to the inode
    # e.g: entities[2049][4756]['links'][0]     path to a hard link for inode 4756
    #      entities[2049][4756]['symlinks'][0]  path to a symlink that points at an entity with inode 4756
    entity_map = {}
    for root, dirs, files in os.walk(scan_root):
        root = '.' + root[len(scan_root):]
        for path in [ os.path.join(root, _) for _ in files ]:
            try:
                p_stat = os.stat(path)
            except OSError as e:
                if e.errno == 2:
                    print('Broken symlink [%s]... skipping' % ( path ))
                    continue
                if e.errno == 40:
                    print('Too many levels of symbolic links [%s]... skipping' % ( path ))
                    continue
                raise
            p_dev = p_stat.st_dev
            p_ino = p_stat.st_ino
            if p_dev not in entity_map:
                entity_map[p_dev] = {}
            e_dev = entity_map[p_dev]
            if p_ino not in e_dev:
                e_dev[p_ino] = {
                    'type': get_type(p_stat.st_mode),
                    'links': [],
                    'symlinks': [],
                }
            e_ino = e_dev[p_ino]
            if os.lstat(path).st_ino == p_ino:
                e_ino['links'].append(path)
            else:
                e_ino['symlinks'].append(path)
    return entity_map

I've produced an example tree that looks like this:

$ tree --inodes
.
├── [  67687]  4 -> 5
├── [  67676]  5 -> 4
├── [  67675]  6 -> dead
├── [  67676]  a
│   └── [  67679]  1
├── [  67677]  b
│   └── [  67679]  2 -> ../a/1
├── [  67678]  c
│   └── [  67679]  3
└── [  67687]  d
    └── [  67688]  4

4 directories, 7 files

The output of this function is:

$ places
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{201: {67679: {'links': ['./a/1', './c/3'],
               'symlinks': ['./b/2'],
               'type': 'file'},
       67688: {'links': ['./d/4'], 'symlinks': [], 'type': 'file'}}}

If we are interested in ./c/3, then you can see that just looking at symlinks (and ignoring hard links) would cause us to miss ./a/1...

By subsequently searching for the path we are interested in, we can find all other references within this tree:

def filter_map(entity_map, filename):
    for dev, inodes in entity_map.items():
        for inode, info in inodes.items():
            if filename in info['links'] or filename in info['symlinks']:
                return info
$ places ./a/1
Broken symlink [./6]... skipping
Too many levels of symbolic links [./5]... skipping
Too many levels of symbolic links [./4]... skipping
{'links': ['./a/1', './c/3'], 'symlinks': ['./b/2'], 'type': 'file'}

The full source for this demo is below. Note that I've used relative paths to keep things simple, but it would be sensible to update this to use absolute paths. Additionally, any symlink that points outside the tree will not currently have a corresponding link... that's an exercise for the reader.

It might also be an idea to collect the data while you're filling the tree (if that's something that would work with your process)... you can use inotify to deal with this nicely - there's even a python module.
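As a sketch of that incremental approach: on each file-creation event you could record just the one new path into the same structure, rather than rescanning. `add_path()` below is a hypothetical helper that mirrors the inner loop of `get_map()` above (the 'type' handling is simplified here):

```python
import os
import stat

def add_path(entity_map, path):
    # record a single new path in the {device: {inode: info}} structure,
    # mirroring what get_map() does during a full walk
    try:
        p_stat = os.stat(path)
    except OSError:
        return  # broken symlink etc... nothing to record
    e_dev = entity_map.setdefault(p_stat.st_dev, {})
    e_ino = e_dev.setdefault(p_stat.st_ino, {
        'type': 'file' if stat.S_ISREG(p_stat.st_mode) else 'other',
        'links': [],
        'symlinks': [],
    })
    # lstat() does not follow symlinks, so a differing inode number
    # means the path itself is a symlink to the entity
    if os.lstat(path).st_ino == p_stat.st_ino:
        e_ino['links'].append(path)
    else:
        e_ino['symlinks'].append(path)
```

Feeding this from inotify CREATE events would keep the map current without repeating the full walk.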

#!/usr/bin/env python3

import os, sys, stat
from pprint import pprint


def get_type(mode):
    if stat.S_ISDIR(mode):
        return 'directory'
    if stat.S_ISCHR(mode):
        return 'character'
    if stat.S_ISBLK(mode):
        return 'block'
    if stat.S_ISREG(mode):
        return 'file'
    if stat.S_ISFIFO(mode):
        return 'fifo'
    if stat.S_ISLNK(mode):
        return 'symlink'
    if stat.S_ISSOCK(mode):
        return 'socket'
    return 'unknown'


def get_map(scan_root):
    # this dict will have device IDs at the first level (major / minor) ...
    # ... and inode IDs at the second level
    # each inode will have the following keys:
    #   - 'type'     the entity's type - i.e: dir, file, socket, etc...
    #   - 'links'    a list of all found hard links to the inode
    #   - 'symlinks' a list of all found symlinks to the inode
    # e.g: entities[2049][4756]['links'][0]     path to a hard link for inode 4756
    #      entities[2049][4756]['symlinks'][0]  path to a symlink that points at an entity with inode 4756
    entity_map = {}
    for root, dirs, files in os.walk(scan_root):
        root = '.' + root[len(scan_root):]
        for path in [ os.path.join(root, _) for _ in files ]:
            try:
                p_stat = os.stat(path)
            except OSError as e:
                if e.errno == 2:
                    print('Broken symlink [%s]... skipping' % ( path ))
                    continue
                if e.errno == 40:
                    print('Too many levels of symbolic links [%s]... skipping' % ( path ))
                    continue
                raise
            p_dev = p_stat.st_dev
            p_ino = p_stat.st_ino
            if p_dev not in entity_map:
                entity_map[p_dev] = {}
            e_dev = entity_map[p_dev]
            if p_ino not in e_dev:
                e_dev[p_ino] = {
                    'type': get_type(p_stat.st_mode),
                    'links': [],
                    'symlinks': [],
                }
            e_ino = e_dev[p_ino]
            if os.lstat(path).st_ino == p_ino:
                e_ino['links'].append(path)
            else:
                e_ino['symlinks'].append(path)
    return entity_map


def filter_map(entity_map, filename):
    for dev, inodes in entity_map.items():
        for inode, info in inodes.items():
            if filename in info['links'] or filename in info['symlinks']:
                return info


entity_map = get_map(os.getcwd())
if len(sys.argv) == 2:
    entity_info = filter_map(entity_map, sys.argv[1])
    pprint(entity_info)
else:
    pprint(entity_map)

I've run this on my system out of curiosity. It's a 6x disk ZFS RAID-Z2 pool on an i7-7700K with plenty of data to play with. Admittedly this will run somewhat slower on lower-spec systems...

Some benchmarks to consider:

  • A dataset of ~3.1k files and links in ~850 directories. This runs in less than 3.5 seconds, ~80ms on subsequent runs
  • A dataset of ~30k files and links in ~2.2k directories. This runs in less than 30 seconds, ~300ms on subsequent runs
  • A dataset of ~73.5k files and links in ~8k directories. This runs in approx 60 seconds, ~800ms on subsequent runs

Using simple maths, that's about 1140 stat() calls per second with an empty cache, or ~90k stat() calls per second once the cache has been filled - I don't think that stat() is as slow as you think it is!


Symlinks do not admit of shortcuts. You have to know about all relevant FS entries that might point at the file of interest. That corresponds either to creating an empty directory and then listening for all file creation events under it, or to scanning all files currently under it. Run the following.

#! /usr/bin/env python

from pathlib import Path
import collections
import os
import pprint
import stat


class LinkFinder:

    def __init__(self):
        self.target_to_orig = collections.defaultdict(set)

    def scan(self, folder='/tmp'):
        for fspec, target in self._get_links(folder):
            self.target_to_orig[target].add(fspec)

    def _get_links(self, folder):
        for root, dirs, files in os.walk(Path(folder).resolve()):
            for file in files:
                fspec = os.path.join(root, file)
                if stat.S_ISLNK(os.lstat(fspec).st_mode):
                    # resolve relative link targets against the symlink's
                    # own directory, not the current working directory
                    target = os.path.abspath(os.path.join(root, os.readlink(fspec)))
                    yield fspec, target


if __name__ == '__main__':
    lf = LinkFinder()
    for folder in '/base /realbase'.split():
        lf.scan(folder)
    pprint.pprint(lf.target_to_orig)

You wind up with a mapping from all symlinked filespecs to a set of aliases by which that filespec may be accessed.

A symlink target may be a file or a directory, so to properly use the mapping on a given filespec you must repeatedly truncate it, asking whether the parent directory or an ancestor directory appears in the mapping.
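That truncation loop might look like the following sketch (`lookup_with_ancestors` is a hypothetical name; `mapping` stands in for the `target_to_orig` dict built above, and the example paths are made up):

```python
import os

def lookup_with_ancestors(mapping, fspec):
    # walk up from fspec toward the root, returning the first path
    # (fspec itself or an ancestor directory) present in the mapping
    path = fspec
    while True:
        if path in mapping:
            return path, mapping[path]
        parent = os.path.dirname(path)
        if parent == path:      # reached the filesystem root
            return None, None
        path = parent

# e.g. if /base/dir is a known symlink target, a file beneath it
# resolves to that directory entry:
mapping = {'/base/dir': {'/realbase/dir'}}
print(lookup_with_ancestors(mapping, '/base/dir/sub/file.txt'))
# -> ('/base/dir', {'/realbase/dir'})
```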

Dangling symlinks are not handled specially, they are simply allowed to dangle.

You might choose to serialize the mapping, probably in sorted order. If you repeatedly re-scan a large directory, there is an opportunity to remember directory mod times across runs, and avoid re-scanning files in that directory. Unfortunately, you would still have to recurse down into its descendant directories in case any of them had recent changes. Your subtrees may exhibit enough structure to let you avoid recursing more than K levels deep, or avoid descending into a directory whose name matches some regex.
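The mod-time idea could be sketched as follows (`changed_dirs` is a hypothetical helper; note that the walk itself is still unavoidable, this only saves the per-file work in unchanged directories, and a directory's mtime reflects entries being added, removed or renamed, which is exactly what matters for link scanning):

```python
import os

def changed_dirs(root, prev_mtimes):
    # walk every directory (unavoidable), but report only those whose
    # mtime differs from the cached value; files need re-examining
    # only in the reported directories
    changed, new_mtimes = [], {}
    for dirpath, dirs, files in os.walk(root):
        mtime = os.stat(dirpath).st_mtime
        new_mtimes[dirpath] = mtime
        if prev_mtimes.get(dirpath) != mtime:
            changed.append(dirpath)
    return changed, new_mtimes
```

A run with an empty cache reports every directory; subsequent runs report only directories whose contents changed since the cached snapshot.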

If most FS changes are produced by a handful of programs, such as package managers or a build system, then getting those programs to log their actions could yield a performance win. That is, if you do a full scan each midnight, and then you run make in only two out of a thousand directories, you could choose to re-scan just that pair of subtrees.


My first instinct is to have the OS or some service inform you when the file-system tree has changed, instead of looking for the changes yourself. Essentially: don't reinvent the wheel.

Maybe:

Windows specific: 5 tools to monitor folder changes