How to extract all the collected tweets in a single file


You can configure the HDFS sink to roll over to a new file by time, event count, or size. So, if you want to keep writing messages to the same file until a 120 MB limit is reached, set:

# Time-based rolling, in seconds; 0 disables it
hdfs.rollInterval = 0
# Size-based rolling, in bytes; 125829120 bytes = 120 MB
hdfs.rollSize = 125829120
# Event-count-based rolling (one event per tweet in your case); 0 disables it
hdfs.rollCount = 0
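The size limit is given in bytes, so it helps to compute it rather than type it by hand. A minimal sketch of the arithmetic, where `roll_size` is just an illustrative variable name:

```shell
# hdfs.rollSize is expressed in bytes; 120 MB here means 120 * 1024 * 1024.
roll_size=$((120 * 1024 * 1024))
echo "hdfs.rollSize = $roll_size"
```

Change the `120` to get the value for a different file size.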


If the files have already been written, you can copy them to the local filesystem and use the following command to concatenate them into a single file:

find . -type f -name 'FlumeData*' -exec cat {} + >> output.file
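`find` does not guarantee any particular order, so if the order of the tweets matters, the file names can be sorted before concatenating. The following is only a sketch: the directory and file contents are made up for the demonstration, and it assumes a `sort` that supports `-z` (GNU coreutils does).

```shell
#!/bin/sh
# Hypothetical demo data; real Flume files would first be copied out of HDFS.
workdir=$(mktemp -d)
printf 'tweet-1\n' > "$workdir/FlumeData.1001"
printf 'tweet-2\n' > "$workdir/FlumeData.1002"
printf 'ignored\n' > "$workdir/other.txt"

# Keep the merged file outside the searched directory so it is never read as input.
merged=$(mktemp)

# NUL-separated names survive spaces in paths; sort -z orders them by name.
find "$workdir" -type f -name 'FlumeData*' -print0 | sort -z | xargs -0 cat > "$merged"

cat "$merged"
```

If the files are still in HDFS, `hdfs dfs -getmerge <hdfs-dir> <local-file>` is another way to fetch and concatenate them in one step.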

Alternatively, if you want to keep the data in Hive for later analysis, create an external table over the HDFS directory that Flume writes to and query it from Hive.
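A rough sketch of the external-table approach follows. Everything specific in it is an assumption: the table name `tweets_raw`, the single `json` column, and the path `/user/flume/tweets` are placeholders, and it assumes the sink writes plain text (`hdfs.fileType = DataStream`) rather than the default SequenceFile format.

```shell
#!/bin/sh
# Write the DDL to a file; table name, column, and LOCATION are hypothetical.
ddl_file=$(mktemp)
cat > "$ddl_file" <<'EOF'
CREATE EXTERNAL TABLE IF NOT EXISTS tweets_raw (
  json STRING
)
LOCATION '/user/flume/tweets';
EOF

# Run it only when the hive CLI is actually installed on this machine.
if command -v hive >/dev/null 2>&1; then
  hive -f "$ddl_file"
else
  echo "hive not found; DDL saved to $ddl_file"
fi
```

Because the table is external, Hive reads the files where they are, and dropping the table leaves the files in place.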