
How to implement multithreaded access to file-based queue in bash script


First of all, do you need to implement your own HTML/HTTP spidering? Why not let wget or curl recurse through a site for you?
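
For example, wget already knows how to crawl a site recursively; something along these lines (the depth and rate limits here are just placeholders you would tune):

wget --recursive --level=3 --no-parent --wait=1 https://example.com/

If all you need afterwards is the set of URLs or text extracted from the pages, post-processing wget's mirror on disk is usually simpler than writing your own queueing.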

You could abuse the filesystem as a database, and have your queues be directories of one-line files (or empty files where the filename itself is the data). That gives you producer-consumer locking: producers touch a file into the incoming directory, and consumers move it from incoming to the processing/done directory.

The beauty of this is that multiple threads touching the same file Just Works. The desired result is the url appearing once in the "incoming" list, and that's what you get when multiple threads create files with the same name. Since you want de-duplication, you don't need locking when writing to the incoming list.
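
A quick way to convince yourself of that property (purely illustrative; the incoming/ directory and the URL value are made up):

mkdir -p incoming
url='example.com%2Fpage'
: > "incoming/$url" &
: > "incoming/$url" &
wait
ls incoming/    # exactly one entry, no matter which writer "won"

Both writers create/truncate the same empty file, so the worst case is redundant work, not a corrupted queue.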

1) Downloads a page

2) Parses out all links and adds them to queue

For each link found, do something like:

grep -ql "$url" already_checked || : > "incoming/$url"

or

[[ -e "done/$url" ]] || : > "incoming/$url"

3) Does some unimportant processing

4) Pops a URL from the queue and starts from 1)

# mostly untested, you might have to debug / tweak this
inc=( incoming/* )   # use "local inc" if this lives inside a function
# edit: this can make threads exit sooner than desired.
# See the comments for some ideas on how to make threads wait for new work
while [[ $inc != "incoming/*" ]]; do
    # $inc is shorthand for "${inc[0]}", the first array entry
    # if mv fails, another thread got that link, or the url already exists in done/;
    # re-glob before continuing so we don't spin on the same stale entry forever
    mv "$inc" done/ 2>/dev/null || { rm -f "$inc"; inc=( incoming/* ); continue; }
    url=${inc#incoming/}
    download "$url"
    for newurl in $(link_scan "$url"); do
        [[ -e "done/$newurl" ]] || : > "incoming/$newurl"
    done
    process "$url"
    inc=( incoming/* )
done
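
The loop above is the per-worker logic; to get actual parallelism you would wrap it in a function and start several background jobs. A minimal sketch, assuming the loop lives in a function named worker (the function name and the worker count are my choices, not part of the original answer):

mkdir -p incoming done

worker() {
    # ... the while loop from above goes here ...
    :
}

for _ in 1 2 3 4; do
    worker &    # each "thread" is really a background process
done
wait            # block until every worker runs out of work and exits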

edit: encoding URLs into strings that don't contain / is left as an exercise for the reader. Although probably urlencoding / to %2F would work well enough.
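
A minimal sketch of that encoding with bash parameter expansion (naive: it collides if a URL literally contains "%2F", and it ignores other characters that might need escaping):

fname=${url//\//%2F}       # encode: every "/" becomes "%2F"
orig=${fname//%2F/\/}      # decode: turn "%2F" back into "/"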

I was thinking of moving URLs to a "processing" list per thread, but actually if you don't need to be able to resume from interruption, your "done" list can instead be a "queued & done" list. I don't think it's actually ever useful to mv "$url" "threadqueue.$$/" or something.

The "done/" directory will get pretty big, and start to slow down with maybe 10k files, depending on what filesystem you use. It's probably more efficient to maintain the "done" list as a file of one url per line, or a database if there's a database CLI interface that's fast for single commands.

Maintaining the done list as a file isn't bad, because you never need to remove entries. You can probably get away without locking it, even for multiple processes appending it. (I'm not sure what happens if thread B writes data at EOF between thread A opening the file and thread A doing a write. Thread A's file position might be the old EOF, in which case it would overwrite thread B's entry, or worse, overwrite only part of it. If you do need locking, maybe flock(1) would be useful. It gets a lock, then executes the commands you pass as args.)
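
A hedged sketch of what that could look like, with the done list kept as a flat file (done.list and done.list.lock are names I made up for this example):

# append one URL to the shared done list under an exclusive lock
mark_done() {
    flock done.list.lock tee -a done.list <<< "$1" > /dev/null
}

# membership test; a lock-free read is fine if a slightly stale view is acceptable
seen() {
    grep -qxF -e "$1" done.list 2>/dev/null
}

flock(1) creates done.list.lock if it doesn't exist, holds an exclusive lock while tee appends, and releases it when tee exits, so concurrent appends can't interleave mid-line.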

If broken entries from the lack of write locking don't show up in practice, then you might not need write locking. The occasional duplicate entry in the "done" list will be a tiny slowdown compared to having to lock for every check/append.

If you need strictly-correct avoidance of downloading the same URL multiple times, you need readers to wait for the writer to finish. If you can just sort -u the list of emails at the end, it's not a disaster for a reader to read an old copy while the list is being appended. Then writers only need to lock each other out, and readers can just read the file. If they hit EOF before a writer manages to write a new entry, then so be it.
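
And if a rare duplicate does sneak into the flat file, a cleanup pass at the end is cheap (file name assumed from the sketch above):

sort -u done.list > done.sorted && mv done.sorted done.list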

I'm not sure whether it matters if a thread adds entries to the "done" list before or after removing them from the incoming list, as long as they're added to "done" before processing. I was thinking that one order or the other might make races more likely to produce duplicate "done" entries and less likely to produce duplicate downloads / processing, but I'm not sure.