How should my scraping "stack" handle 404 errors?

ruby-on-rails


TL;DR

Use out-of-band error handling and a different conceptual scraping model to speed up operations.

Exceptions Are Not for Common Conditions

A number of other answers address how to handle exceptions for your use case. I'm taking a different tack: handling exceptions is fundamentally the wrong approach here, for two reasons.

  1. In his book Exceptional Ruby, Avdi Grimm provides benchmarks showing exception handling to be roughly 156% slower than alternative coding techniques such as early returns.

  2. In The Pragmatic Programmer: From Journeyman to Master, the authors state "[E]xceptions should be reserved for unexpected events." In your case, 404 errors are undesirable, but are not at all unexpected--in fact, handling 404 errors is a core consideration!

In short, you need a different approach. Preferably, the alternative approach should provide out-of-band error handling and prevent your process from blocking on retries.

One Alternative: A Faster, More Atomic Process

You have a lot of options here, but the one I'm going to recommend is to handle 404 status codes as a normal result. This allows you to "fail fast," but also allows you to retry pages or remove URLs from your queue at a later time.

Consider this example schema:

    ActiveRecord::Schema.define(:version => 20120718124422) do
      create_table "webcrawls", :force => true do |t|
        t.text     "raw_html"
        t.integer  "retries"
        t.integer  "status_code"
        t.text     "parsed_data"
        t.datetime "created_at",  :null => false
        t.datetime "updated_at",  :null => false
      end
    end

The idea here is that you would simply treat the entire scrape as an atomic process (see the sketch after this list). For example:

  • Did you get the page?

    Great, store the raw page and the successful status code. You can even parse the raw HTML later, in order to complete your scrapes as fast as possible.

  • Did you get a 404?

    Fine, store the error page and the status code. Move on quickly!
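In code, the fetch step might look like this minimal sketch. It assumes a Webcrawl ActiveRecord model backed by the schema above, plus a hypothetical url column (not shown in the schema) for tying each row back to its source:

    require 'net/http'

    # Every fetch, successful or not, is stored as a normal row -- nothing
    # is raised for a 404, so the crawl loop never blocks on error handling.
    def crawl(url)
      response = Net::HTTP.get_response(URI(url))

      Webcrawl.create!(
        :url         => url,                 # hypothetical column
        :raw_html    => response.body,       # parse out of band, later
        :status_code => response.code.to_i,
        :retries     => 0
      )
    end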

When your process is done crawling URLs, you can then use an ActiveRecord lookup to find all the URLs that recently returned a 404 status so that you can take appropriate action. Perhaps you want to retry the page, log a message, or simply remove the URL from your list of URLs to scrape--"appropriate action" is up to you.
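For instance, the lookup might be as simple as this sketch (column names come from the schema above; the time window is an assumption):

    # Find everything that recently returned a 404 and take "appropriate action".
    Webcrawl.where(:status_code => 404)
            .where('updated_at > ?', 1.day.ago)
            .find_each do |crawl|
      # retry the page, log a message, or drop the URL from your queue
    end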

By keeping track of your retry counts, you could even differentiate between transient errors and more permanent errors. This allows you to set thresholds for different actions, depending on the frequency of scraping failures for a given URL.
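As a sketch, with a purely hypothetical MAX_RETRIES threshold:

    MAX_RETRIES = 3  # hypothetical; tune to your tolerance for flaky pages

    Webcrawl.where(:status_code => 404).find_each do |crawl|
      if crawl.retries < MAX_RETRIES
        crawl.increment!(:retries)  # looks transient so far -- try again later
      else
        crawl.destroy               # persistent 404 -- stop scraping this URL
      end
    end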

This approach also has the added benefit of leveraging the database to manage concurrent writes and share results between processes. This would allow you to parcel out work (perhaps with a message queue or chunked data files) among multiple systems or processes.
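As one crude, illustrative example of the chunked-data-files approach (batch size and file naming are assumptions):

    # Split pending URLs into chunk files that separate worker processes
    # or machines can each claim and crawl independently.
    urls.each_slice(500).with_index do |batch, i|
      File.write("urls_chunk_#{i}.txt", batch.join("\n"))
    end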

Final Thoughts: Scaling Up and Out

Spending less time on retries or error handling during the initial scrape should speed up your process significantly. However, some tasks are just too big for a single-machine or single-process approach. If your process speedup is still insufficient for your needs, you may want to consider a less linear approach using one or more of the following:

  • Forking background processes (see the sketch after this list).
  • Using dRuby to split work among multiple processes or machines.
  • Maximizing core usage by spawning multiple external processes using GNU parallel.
  • Something else that isn't a monolithic, sequential process.
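As a sketch of the first option, a simple pre-fork split of the URL list, reusing the crawl method from earlier (the worker count is an assumption, and fork is Unix-only):

    WORKERS = 4  # hypothetical; size this to your core count

    # Fork one child per slice of the URL list; each child crawls its share
    # and writes rows to the shared database.
    # NB: with ActiveRecord, each child should establish its own DB connection.
    urls.each_slice((urls.size / WORKERS.to_f).ceil) do |slice|
      fork { slice.each { |url| crawl(url) } }
    end

    Process.waitall  # block until every child has finished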

Optimizing the application logic should suffice for the common case; if it doesn't, consider scaling up to more processes or out to more servers. Scaling out will certainly be more work, but will also expand the processing options available to you.


Curb has an easier way of doing this, and can be a better (and faster) option than open-uri.

Errors Curb reports (and that you can rescue from and handle):

http://curb.rubyforge.org/classes/Curl/Err.html

Curb gem: https://github.com/taf2/curb

Sample code:

    require 'curb'

    def browse(url)
      c = Curl::Easy.new(url)
      begin
        c.connect_timeout = 3   # fail fast if the host is unreachable
        c.perform
        return c.body_str
      rescue Curl::Err::NotFoundError
        handle_not_found_error(url)
      end
    end

    def handle_not_found_error(url)
      puts "This is a 404!"
    end


You could just raise the 404s:

    require 'open-uri'

    begin
      page = open(url)                # raises OpenURI::HTTPError on a 404
    rescue Exception => ex
      raise ex if ex.message['404']   # the error message starts with "404 Not Found"
      retry                           # retry for non-404s
    end