
How to make R tm corpus of 100 million tweets?


Wouldn't it be easier and more reasonable to build one huge HDFS file with the 100 million tweets and then process it with R's standard tm package?

This approach seems more natural to me, since HDFS is designed for big files and distributed environments, while R is a great analytical tool but offers little or no built-in parallelism. Your approach looks like using tools for something they were not designed for...


I would strongly recommend checking this URL: http://www.quora.com/How-can-R-and-Hadoop-be-used-together. It will give you the necessary insights into your problem.


The tm package basically works on a term-and-document model. It creates a term-document matrix (TDM) or document-term matrix (DTM), which records each term (word) and its frequency in each document. Since you want to analyse Twitter data, you should treat each tweet as a document and then build the TDM or DTM. From there you can perform various analyses, such as finding associations, computing term frequencies, clustering, or calculating the TF-IDF measure.
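A minimal sketch of that workflow in R (the three sample tweets and the frequency/correlation thresholds below are made-up placeholders):

    library(tm)

    # A handful of made-up tweets standing in for the real data
    tweets <- c("Big data with #rstats and Hadoop",
                "Hadoop HDFS stores big files",
                "Text mining tweets with the tm package in R")

    # Each tweet becomes one document in the corpus
    corpus <- VCorpus(VectorSource(tweets))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # Document-term matrix: one row per tweet, one column per term
    dtm <- DocumentTermMatrix(corpus)

    findFreqTerms(dtm, lowfreq = 2)            # terms appearing at least twice
    findAssocs(dtm, "hadoop", corlimit = 0.5)  # terms correlated with "hadoop"

    # The same matrix with TF-IDF weighting instead of raw counts
    dtm_tfidf <- DocumentTermMatrix(corpus,
                                    control = list(weighting = weightTfIdf))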

You need to build the corpus from a directory source, so you need a base directory containing the individual documents, each of which is one tweet.
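That part might look like this, assuming the tweet files end up in a local directory such as /data/tweets (a placeholder path):

    library(tm)

    # One plain-text file per tweet under the (hypothetical) base directory
    corpus <- VCorpus(DirSource("/data/tweets", encoding = "UTF-8"),
                      readerControl = list(language = "en"))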

Depending on the OS you are using: on Windows I would create a .bat file, or simple JavaScript or Java code, to read the tweets from the MySQL rows, write them to files, and FTP them to a directory on the local file system of the Hadoop box.
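If you would rather stay in R than write a .bat file or Java code, a rough sketch of the export step with DBI/RMySQL could look like this (the database credentials, the tweets table with id and text columns, and the /data/tweets output directory are all assumptions):

    library(DBI)
    library(RMySQL)

    con <- dbConnect(RMySQL::MySQL(), dbname = "twitter_db", host = "localhost",
                     user = "user", password = "password")

    # Stream the tweets out of MySQL in chunks and write one .txt file per tweet
    res <- dbSendQuery(con, "SELECT id, text FROM tweets")
    while (!dbHasCompleted(res)) {
      chunk <- dbFetch(res, n = 10000)
      for (i in seq_len(nrow(chunk))) {
        writeLines(chunk$text[i],
                   file.path("/data/tweets", paste0(chunk$id[i], ".txt")))
      }
    }
    dbClearResult(res)
    dbDisconnect(con)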

Once the files have been FTP'd, we can copy the directory to HDFS using Hadoop's copyFromLocal command.
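For example (the local and HDFS paths are placeholders):

    hadoop fs -copyFromLocal /data/tweets /user/analyst/tweets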