
Storing URLs while Spidering


I've written a lot of spiders. To me, a bigger problem than running out of memory is the risk of losing all the URLs you've already spidered if the code or machine crashes, or if you decide you need to tweak the code. If you run out of RAM, most machines and OSes these days will page, so you'll slow down but still function. Having to rebuild a set of URLs gathered over hours and hours of run-time because it's no longer available can be a real blow to productivity.

Keeping information in RAM that you do NOT want to lose is bad. Obviously a database is the way to go at that point, because you need fast random access to see if you've already found a URL. Of course in-memory lookups are faster, but the trade-off of figuring out WHICH URLs to keep in memory adds overhead. Rather than try writing code to determine which URLs I need or don't need, I keep them in the database and concentrate on making my code clean and maintainable and my SQL queries and schemas sensible. Make your URL field a unique index and the DBMS will be able to find them in no time while automatically avoiding redundant links.
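
As a minimal sketch of that idea with SQLite (the table and column names are just illustrative, not anything prescribed above):

    import sqlite3

    # Minimal sketch: a unique index on the URL column does the de-duplication.
    conn = sqlite3.connect("spider.db")
    conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT NOT NULL UNIQUE)")

    def add_url(url):
        # INSERT OR IGNORE silently skips URLs that are already stored.
        conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
        conn.commit()

    def seen(url):
        # The unique index makes this existence check an index lookup.
        cur = conn.execute("SELECT 1 FROM urls WHERE url = ?", (url,))
        return cur.fetchone() is not None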

Your connection to the internet and the sites you're accessing will probably be a lot slower than your connection to a database on a machine on your internal network. A SQLite database on the same machine might be the fastest, though the engine itself isn't as sophisticated as Postgres, which is my favorite. I found putting the database on another machine on the same switch as my spidering machine to be extremely fast. Making one machine handle the spidering, parsing, and then the database reads/writes is pretty intensive, so if you have an old box, throw Linux on it, install Postgres, and go to town. Throw some extra RAM in the box if you need more speed. Having that separate box for database use can be very nice.


These seem to be the important aspects to me:

  1. You can't keep all the URLs in memory, because memory use will grow too high
  2. You need fast existence lookups, at least O(log n)
  3. You need fast insertions

There are many ways to do this and it depends on how big your database will get. I think an SQL database can provide a good model for your problem.

Probably all you need is an SQLite database. Typically, string lookups for existence checks are slow. To speed this up, you can compute a CRC hash of each URL and store both the CRC and the URL in your database, with an index on the CRC field; a sketch of this scheme follows the list below.

  • When you insert: you insert the URL and the hash
  • When you want to do an existence lookup: you take the CRC of the potentially new URL and check whether it is already in your database
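
A rough sketch of that scheme, assuming SQLite and Python's zlib.crc32 (the table layout and function names are mine, not part of the answer):

    import sqlite3
    import zlib

    conn = sqlite3.connect("spider.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS urls (
        crc INTEGER NOT NULL,   -- CRC32 of the URL, indexed for fast lookups
        url TEXT NOT NULL)""")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_urls_crc ON urls (crc)")

    def crc_of(url):
        return zlib.crc32(url.encode("utf-8"))

    def insert(url):
        conn.execute("INSERT INTO urls (crc, url) VALUES (?, ?)",
                     (crc_of(url), url))
        conn.commit()

    def maybe_seen(url):
        # Compares integers on the indexed crc column instead of long strings.
        # A hit may be a collision, i.e. a different URL with the same CRC.
        cur = conn.execute("SELECT 1 FROM urls WHERE crc = ? LIMIT 1",
                           (crc_of(url),))
        return cur.fetchone() is not None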

There is of course a chance of collision on the URL hashes, but if 100% coverage is not important to you, you can take the hit of skipping a genuinely new URL when there is a collision.

You could also reduce collisions in several ways. For example, you can increase the size of your CRC (a 64-bit CRC instead of a 32-bit one) or use a hashing algorithm with a larger digest. Or store the URL length alongside the CRC.


It depends on the scale of the spidering you're going to be doing, and the kind of machine you're doing it on. Suppose a typical URL is a string of 60 bytes or so; an in-memory set will then take a bit more than 100 bytes per URL (sets and dicts in Python are never allowed to grow beyond about 60% full, for speed reasons). If you have a 64-bit machine (and Python distro) with about 16 GB of RAM available, you could surely devote over 10 GB to the crucial set in question, letting you easily spider about 100 million URLs or so; but at the other extreme, if you have a 32-bit machine with 3 GB of RAM, you clearly can't devote much more than a GB to the crucial set, limiting you to about 10 million URLs. SQLite would help in roughly the same range of sizes where a 32-bit machine couldn't make it but a generously endowed 64-bit one could -- say 100 or 200 million URLs.
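
To get a rough feel for the per-URL figure on your own machine, a quick measurement along these lines works (the 60-byte synthetic URLs are just an assumption; exact numbers vary by Python version and 32- vs 64-bit build):

    import random
    import string
    import sys

    # Rough estimate of the per-URL memory cost of an in-memory set of URL strings.
    def random_url(length=60):
        path = "".join(random.choices(string.ascii_lowercase, k=length - 19))
        return "http://example.com/" + path   # 19-character prefix + random path

    urls = {random_url() for _ in range(100_000)}
    total = sys.getsizeof(urls) + sum(sys.getsizeof(u) for u in urls)
    print(f"~{total / len(urls):.0f} bytes per URL (set slots + string objects)")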

Beyond those, I'd recommend PostgreSQL, which also has the advantage of being able to run on a different machine (on a fast LAN) with basically no problems, letting you devote your main machine to spidering. I guess MySQL &c would be OK for that too, but I love PostgreSQL's standard compliance and robustness;-). This would allow, say, a few billion URLs with no problems (just a fast disk, or better a RAID arrangement, and as much RAM as you can afford to speed things up, of course).
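
If you go that route, a minimal sketch with psycopg2 might look like the following; the host name, credentials, and table layout are placeholders, and ON CONFLICT needs PostgreSQL 9.5 or later:

    import psycopg2

    # The database lives on another box on the LAN; the spider only sends
    # inserts and existence checks over the wire.
    conn = psycopg2.connect(host="db-box.local", dbname="spider",
                            user="spider", password="secret")
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")

    def add_url(url):
        with conn, conn.cursor() as cur:
            # ON CONFLICT DO NOTHING skips URLs that are already stored.
            cur.execute(
                "INSERT INTO urls (url) VALUES (%s) ON CONFLICT (url) DO NOTHING",
                (url,))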

Trying to save memory/storage by using a fixed-length hash in lieu of URLs that might be quite long is fine if you're OK with an occasional false positive that will stop you from crawling what's actually a new URL. Such "collisions" need not be at all likely: even if you only use 8 bytes for the hash, you should only have a substantial risk of some collision when you're looking at billions of URLs (the "square root heuristic" for this well-known problem).

With 8-byte strings to represent the URLs, the in-memory set architecture should easily support a billion URLs or more on a well-endowed machine, as outlined above.
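
A sketch of that in-memory variant, using an 8-byte BLAKE2 digest in place of the full URL (hashlib.blake2b needs Python 3.6+; any 8-byte hash would do):

    import hashlib

    seen = set()   # holds fixed 8-byte digests instead of full URL strings

    def digest(url):
        return hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()

    def mark_seen(url):
        seen.add(digest(url))

    def already_seen(url):
        # A rare collision makes a genuinely new URL look already-seen,
        # which is exactly the trade-off described above.
        return digest(url) in seen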

So, roughly how many URLs do you want to spider, and how much RAM can you spare?-)