How to insert JSON in HDFS using Flume correctly
If I've understood correctly, you want to serialize both the data and the headers. In that case, you do not have to modify the data source; you can use some standard Flume elements and create a custom serializer for HDFS.
The first step is to get Flume to create the desired JSON structure, i.e. headers+body. Flume can do this for you: just use JSONHandler on your HTTPSource, like this:
a1.sources = r1
a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
In fact, it is not necessary to configure the JSON handler since it is the default one for HTTPSource.
Then use both the Timestamp Interceptor and the Host Interceptor to add the desired headers. The only catch is that the Flume agent must run on the same machine as the sender process, so that the intercepted host is the sender's host:
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.hostHeader = hostname
At this point you will have the desired event. However, the standard serializers for HDFS save only the body, not the headers. So create a custom serializer that implements org.apache.flume.serialization.EventSerializer. It is configured as:
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.serializer = my_custom_serializer
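To make the serializer idea concrete, here is a minimal sketch of the formatting step only, written against the plain JDK so it runs without Flume on the classpath. The class and method names (HeaderAndBodyJsonWriter, toJsonLine, writeEvent) are hypothetical. In a real serializer this logic would presumably sit inside write(Event) of your EventSerializer implementation, fed by event.getHeaders() and event.getBody(), and the class would be exposed to the sink through an EventSerializer.Builder.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper: the formatting a custom serializer's write(Event) could delegate to.
final class HeaderAndBodyJsonWriter {

    // Builds one line of the form {"headers":{...},"body":"..."} followed by a newline.
    static String toJsonLine(Map<String, String> headers, byte[] body) {
        StringBuilder sb = new StringBuilder("{\"headers\":{");
        boolean first = true;
        for (Map.Entry<String, String> h : headers.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append(quote(h.getKey())).append(':').append(quote(h.getValue()));
        }
        // Assumption: the body is UTF-8 text; it is stored as an escaped JSON string.
        sb.append("},\"body\":")
          .append(quote(new String(body, StandardCharsets.UTF_8)))
          .append("}\n");
        return sb.toString();
    }

    // Writes the line to the stream the HDFS sink would hand to the serializer.
    static void writeEvent(Map<String, String> headers, byte[] body, OutputStream out)
            throws IOException {
        out.write(toJsonLine(headers, body).getBytes(StandardCharsets.UTF_8));
    }

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }
}
```

With the two interceptors above, the headers map would carry a timestamp entry and a hostname entry, so each line written to HDFS keeps both next to the body.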
HTH
The answer posted by @frb was correct; the only point missing is that the JSON generator must send the body part (I must admit/complain that the docs are not clear on that point), so the correct way of posting the JSON is
[{"body": "{'username':'xyz','password':'123'}"}]
Please note that the JSON of the data is now a string.
With this change, the JSON is now visible in HDFS.
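As an illustration of "the JSON is now a string", the sketch below wraps an application-level JSON document into the one-element event array before it is posted. It uses only the JDK; the class name FlumeJsonPayload and its methods are hypothetical, and the resulting text would still have to be sent to the agent's HTTPSource with whatever HTTP client you already use.

```java
// Hypothetical client-side helper: wraps application JSON as a list of Flume events.
final class FlumeJsonPayload {

    // Returns [{"body":"<escaped application JSON>"}].
    // The application JSON is escaped so that it travels as a plain string.
    static String wrap(String applicationJson) {
        return "[{\"body\":" + quote(applicationJson) + "}]";
    }

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }
}
```

Escaping the inner double quotes this way lets the data keep standard JSON quoting instead of switching to single quotes.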
The Flume HTTPSource using the default JSONHandler expects a list of fully-formed Flume events in JSON representation, [{ headers: ..., body: ... }], to be submitted to the endpoint. To create an agent endpoint which can accept a bare application-level structure like {"username":"xyz", "password":"123"}, you can override the handler with an alternative class which implements HTTPSourceHandler; see the JSONHandler source - there's not a lot to it.
public List<Event> getEvents(HttpServletRequest request) throws ...
In a custom JSONHandler you could also add headers to the event based on the HTTP request, such as the source IP, User-Agent etc (an Interceptor won't have the context for this). You may want to validate the application-supplied JSON at this point (though the default handler doesn't).
Although, as you've found, you can pass just the [{body: ...}] part, such a custom handler could also be useful if you want to prevent a generator from injecting headers for the event.
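A rough sketch of the core of such a handler follows, again in plain JDK code so it can be run without the Flume and servlet jars. SketchEvent, BareJsonHandlerSketch, toEvent and the header names remote_addr and user_agent are all hypothetical. In a real getEvents implementation the inputs would presumably come from request.getReader(), request.getRemoteAddr() and request.getHeader("User-Agent"), and the result would be built with EventBuilder.withBody(body, headers).

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a Flume event: headers plus raw body bytes.
final class SketchEvent {
    final Map<String, String> headers;
    final byte[] body;

    SketchEvent(Map<String, String> headers, byte[] body) {
        this.headers = headers;
        this.body = body;
    }
}

// Hypothetical core of a handler that accepts a bare application-level JSON object.
final class BareJsonHandlerSketch {

    static SketchEvent toEvent(String rawBody, String remoteAddr, String userAgent) {
        String trimmed = rawBody.trim();
        // Cheap sanity check only; a real handler might parse the text with a JSON
        // library and signal a bad request to the source instead.
        if (!(trimmed.startsWith("{") && trimmed.endsWith("}"))) {
            throw new IllegalArgumentException("expected a JSON object");
        }
        // Headers are set only here, from the HTTP request, so the sender
        // cannot inject headers of its own.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("remote_addr", remoteAddr);
        if (userAgent != null) {
            headers.put("user_agent", userAgent);
        }
        return new SketchEvent(headers, trimmed.getBytes(StandardCharsets.UTF_8));
    }
}
```

Because the whole request body becomes the event body, the sender can post {"username":"xyz", "password":"123"} directly, without the [{body: ...}] wrapper.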