MapReduce and downloading files from external source

This sounds similar to what Nutch does (although I'm not too familiar with Nutch beyond that statement).

Some points to consider:

If you have several URLs hosted by the same server, you may actually benefit from partitioning by hostname and then doing the pulls in the Reducer (this depends on the number of URLs you are pulling).
If the content is cacheable and you will be pulling from the same URLs over and over, you may benefit from putting a cache / proxy server between your Hadoop cluster and the internet (your company and ISP may / should already be doing this). If you are hitting unique URLs or the content is dynamic, though, this will actually hinder you, because the cache / proxy server becomes a single bottleneck.
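To make the hostname partitioning idea concrete, here is a minimal sketch in plain Java (no Hadoop dependency). The class and method names (HostGrouping, hostKey, groupByHost) are hypothetical. In a real job the value returned by hostKey would presumably be emitted as the map output key, so that the shuffle delivers all URLs of one host to the same reduce call; groupByHost only imitates that grouping in memory.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: groups URLs by host so that each
// group could be handled by a single reducer call.
class HostGrouping {

    // Returns the lower-cased host of a URL, or "unknown" if it cannot be parsed.
    static String hostKey(String url) {
        try {
            String host = URI.create(url.trim()).getHost();
            return host == null ? "unknown" : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return "unknown";
        }
    }

    // Imitates the shuffle: every URL ends up in the bucket of its host.
    static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> byHost = new TreeMap<>();
        for (String url : urls) {
            byHost.computeIfAbsent(hostKey(url), k -> new ArrayList<>()).add(url);
        }
        return byHost;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://example.com/a.zip",
            "ftp://files.example.org/b.tar",
            "http://EXAMPLE.com/c.zip");
        groupByHost(urls).forEach((host, group) -> System.out.println(host + " -> " + group));
    }
}
```

Grouping by host would let one reducer pull from a given server sequentially (and reuse connections or throttle politely), at the cost of skew if a single host owns most of the URLs.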

I think you should take a look at Storm. It's a scalable framework that's very useful for collecting data from many different sources, which is really what you're trying to do. Processing can still be done using MapReduce, but for the actual collection you should use a framework like Storm.

java http ftp hadoop mapreduce

I think your internet connection will easily become a bottleneck in this case, but I'm sure it can be done.

I haven't done this exact thing, but I have had to make a web service call from my Mapper to obtain some metadata from a third-party API for further processing. The third-party web service quickly became a bottleneck and slowed everything down.
Yes, since there's nothing to reduce in this case (I'm assuming you just want to save the downloaded files somewhere).
I'd save the FTP/HTTP URLs in HDFS and have your Mapper read in the URLs from your HDFS.
I highly doubt MapReduce is the best method for this type of thing. Like I said already, I think your internet connection will easily become a bottleneck and you won't be able to scale out your MR program very much. Once downloaded (and saved in HDFS), if you want to process the data using MapReduce, that would be a different story. Yes, in this case I'd say you're abusing MR.
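
For reference, the per-URL work described above (read a list of URLs, fetch each one, save it somewhere) can be sketched in plain Java. This is only an illustration under stated assumptions: the class and method names (UrlListDownloader, fetch, fileNameFor) are hypothetical, the URL list is read from a local text file, and files are written to a local directory. In a Hadoop job the same fetch logic would presumably sit inside map() and write to an HDFS output stream instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-alone downloader; not a Hadoop Mapper.
class UrlListDownloader {

    // Derives a local file name from the last path segment of the URL.
    static String fileNameFor(URI uri) {
        String path = uri.getPath();
        String name = (path == null) ? "" : path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? "index" : name;
    }

    // Streams one URL into targetDir and returns the written file.
    static Path fetch(String url, Path targetDir) throws IOException {
        URI uri = URI.create(url.trim());
        Path target = targetDir.resolve(fileNameFor(uri));
        try (InputStream in = uri.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }

    // args[0]: text file with one URL per line; args[1]: output directory.
    public static void main(String[] args) throws IOException {
        Path out = Files.createDirectories(Path.of(args[1]));
        for (String line : Files.readAllLines(Path.of(args[0]))) {
            if (!line.isBlank()) {
                System.out.println("saved " + fetch(line, out));
            }
        }
    }
}
```

Note that two URLs ending in the same file name would overwrite each other here, and there are no timeouts or retries; a real job would need all of that, and none of it changes the point that the network link, not the cluster, is the likely limit.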
