nutch 1.10 input path does not exist /linkdb/current


Ok, it seems as though I have run into a version of this problem:

https://issues.apache.org/jira/browse/NUTCH-2041

This is a result of the crawl script not being aware of changes to ignore_external_links in my nutch-site.xml file.

I am trying to crawl several sites and was hoping to keep my life simple by ignoring external links and leaving regex-urlfilter.txt alone (just using the catch-all +. rule).

Now it looks like I'll have to change ignore_external_links back to false and add a regex filter for each of my URLs. Hopefully I can get a Nutch 1.11 release soon, since it looks like this is fixed there.
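
For reference, a rough sketch of what the per-site filters in regex-urlfilter.txt might look like. The host names example.com and example.org are placeholders, not my real sites, and I'm assuming the setting I called ignore_external_links is the db.ignore.external.links property, which would be set back to false in nutch-site.xml. Rules are checked top to bottom and the first match wins, so the final catch-all has to change from accept to reject:

```text
# regex-urlfilter.txt (sketch; host names are hypothetical placeholders)

# accept URLs on each site I want to crawl, including subdomains
+^https?://([a-z0-9-]+\.)*example\.com/
+^https?://([a-z0-9-]+\.)*example\.org/

# reject everything else (this replaces the default "+." catch-all)
-.
```

Any other skip rules already in the file (file suffixes, query characters and so on) should stay above the accept lines so they still apply.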