
How to make R tm corpus of 100 million tweets?


Wouldn't it be easier and more reasonable to build one huge HDFS file with the 100 million tweets and then process it with R's standard tm package?

This approach seems more natural to me, since HDFS is designed for big files and distributed environments, while R is a great analytical tool but offers little or no built-in parallelism. Your approach looks like using tools for something they were not designed for...


I would strongly recommend checking this URL: http://www.quora.com/How-can-R-and-Hadoop-be-used-together. It will give you the necessary insights into your problem.


The tm package basically works on a term-and-document model. It creates a term-document matrix (TDM) or document-term matrix (DTM), which records each term (word) and its frequency in each document. Since you want to analyse Twitter data, you should treat each tweet as a document and then build the TDM or DTM. From there you can perform various analyses, such as finding associations, computing term frequencies, clustering, or calculating the TF-IDF measure.
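A minimal sketch of that workflow in R (the three sample tweets and the frequency/correlation thresholds below are made-up placeholders):

    library(tm)

    # A handful of made-up tweets standing in for the real data
    tweets <- c("Big data with #rstats and Hadoop",
                "Hadoop HDFS stores big files",
                "Text mining tweets with the tm package in R")

    # Each tweet becomes one document in the corpus
    corpus <- VCorpus(VectorSource(tweets))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # Document-term matrix: one row per tweet, one column per term
    dtm <- DocumentTermMatrix(corpus)

    findFreqTerms(dtm, lowfreq = 2)            # terms appearing at least twice
    findAssocs(dtm, "hadoop", corlimit = 0.5)  # terms correlated with "hadoop"

    # The same matrix with TF-IDF weighting instead of raw counts
    dtm_tfidf <- DocumentTermMatrix(corpus,
                                    control = list(weighting = weightTfIdf))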

You need to build the corpus from a directory source, so you need a base directory containing the individual documents, each of which is one tweet.
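That part might look like this, assuming the tweet files end up in a local directory such as /data/tweets (a placeholder path):

    library(tm)

    # One plain-text file per tweet under the (hypothetical) base directory
    corpus <- VCorpus(DirSource("/data/tweets", encoding = "UTF-8"),
                      readerControl = list(language = "en"))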

Depending on the OS you are using: on Windows I would create a .bat file, or simple JavaScript or Java code, to read the tweets from the MySQL rows, write them to files, and FTP them to a directory on the local file system of the Hadoop box.
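If you would rather stay in R than write a .bat file or Java code, a rough sketch of the export step with DBI/RMySQL could look like this (the database credentials, the tweets table with id and text columns, and the /data/tweets output directory are all assumptions):

    library(DBI)
    library(RMySQL)

    con <- dbConnect(RMySQL::MySQL(), dbname = "twitter_db", host = "localhost",
                     user = "user", password = "password")

    # Stream the tweets out of MySQL in chunks and write one .txt file per tweet
    res <- dbSendQuery(con, "SELECT id, text FROM tweets")
    while (!dbHasCompleted(res)) {
      chunk <- dbFetch(res, n = 10000)
      for (i in seq_len(nrow(chunk))) {
        writeLines(chunk$text[i],
                   file.path("/data/tweets", paste0(chunk$id[i], ".txt")))
      }
    }
    dbClearResult(res)
    dbDisconnect(con)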

Once the files have been FTP'd, we can copy the directory to HDFS using Hadoop's copyFromLocal command.
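For example (the local and HDFS paths are placeholders):

    hadoop fs -copyFromLocal /data/tweets /user/analyst/tweets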