
Suggested Hadoop-based Design / Component for Ingestion of Periodic REST API Calls


As stated on the Apache Flume website:

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.

As you can see, data gathering is one of the features attributed to Flume. Data sources that push or emit data are easy to integrate thanks to HttpSource, AvroSource, ThriftSource, etc. In your case, where the data must be actively pulled from an HTTP-based service, the integration is not so obvious, but it can be done; for instance, by using ExecSource, which runs a script that fetches the data and pushes it to the Flume agent.
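As a minimal sketch of that idea (the agent name, the script path, the HDFS path and the roll settings are placeholders, and the script is assumed to call the REST API periodically and print one record per line to stdout), the agent configuration could look like this:

```
# Hypothetical agent "a1": exec source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# ExecSource runs a command whose stdout lines become Flume events.
# The script is assumed to poll the REST API (e.g. curl in a sleep loop).
a1.sources.r1.type = exec
a1.sources.r1.command = /opt/scripts/poll_rest_api.sh
a1.sources.r1.channels = c1

# Memory channel buffers events between source and sink (absorbs load peaks).
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# HDFS sink persists the events; path and roll interval are examples only.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/flume/rest-data/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

The agent is then started with the usual flume-ng agent -n a1 -f <properties file> command; the polling period lives inside the script itself.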

If you use custom code in charge of pulling the data and writing it into HDFS, such a design will work, but you will be missing some interesting built-in Flume features (which you would probably have to implement yourself):

  • Reliability. Flume has mechanisms to ensure the data is really persisted in the final storage, retrying until it is effectively written. This is achieved through the transaction concept and an internal channel that buffers data both at the input (absorbing peaks of load) and at the output (retaining data until it is effectively persisted).
  • Performance. The usage of transactions and the possibility to configure multiple parallel sinks (data processors) will make your deployment able to deal with really large amounts of data generated per second.
  • Usability. By using Flume you don't need to deal with the storage details (e.g. the HDFS API). Moreover, if some day you decide to change the final storage, you only have to reconfigure the Flume agent to use the new related sink, as sketched below.
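To illustrate that last point, keeping the same hypothetical agent layout as above, switching the final storage would only mean replacing the sink definition in the configuration, for instance with a local roll-file sink (the directory is again a placeholder):

```
# Same source and channel as before; only the sink changes.
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /var/flume/rest-data
a1.sinks.k1.sink.rollInterval = 300
a1.sinks.k1.channel = c1
```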