python - Efficiently parsing flattened strings into nested dictionaries python - Efficiently parsing flattened strings into nested dictionaries json json

python - Efficiently parsing flattened strings into nested dictionaries


You’ve thrown away a lot of information in building these strings. For example, when you see MIPK\/DORAS, there’s no way to know whether it’s a file (which should just be a string in the parent directory’s list), a leaf directory (which should be a list in the parent directory’s dict), or an intermediate directory (which should be a dict in the parent directory’s dict). The best you can do (without a quadratic nested search) is guess, and then change it later if you guessed wrong.

Also, sometimes your intermediate directories don’t even show up on their own, but only appear as components of later paths, but other times they do appear. So sometimes the “guess” that you have to fix will be the directory not existing at all.

And finally, your desired format is ambiguous if, e.g., any directly can contain both files and directories (should it be a dict or a list), or nothing at all (should it be an empty dict, an empty list, or just a string?).

However, things seem to be in sorted order, and the ambiguous cases don’t seem to actually show up. If we can rely on both of these, we can tell whether something is a directory by just seeing if it’s a prefix or the next entry. Then we can just read the leaves and put them into a collection of a strings inside a list inside 0 or more nested dicts.

So, first, we want to iterate adjacent pairs of paths, to make it easier to do that “is it a prefix of the next value?” check:

output = {}it1, it2 = itertools.tee(paths)next(it2)pairs = itertools.zip_longest(it1, it2, fillvalue='')for path, nextpath in pairs:    if nextpath.startswith(path):        continue

Now, we know path is a leaf, so we need to find the list to append it to, creating one if needed, which may mean recursively creating dicts along the way:

components = path.split(r'\/')d = outputfor component in components[:-2]:    d = d.setdefault(component, {})d.setdefault(components[-2], []).append(components[-1])

I haven’t tested this, but it should do the right thing for non-ambiguous input, but raise some kind of exception if any directory includes both files and subdirs, or any top-level directory includes files, or any of the other ambiguous cases (except for empty directories, which will be treated the same as files).

Sure, this is kind of ugly, but that’s inherent in parsing an ugly format that relies on lots of special-case rules to handle what would otherwise be ambiguous.