
Read large MongoDB data


Your problem lies in the asList() call.

This forces the driver to iterate through the entire cursor (80,000 docs, a few gigabytes), keeping all of it in memory.

batchSize(someLimit) and Cursor.batch() won't help here, as you still traverse the whole cursor no matter what the batch size is.
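
For reference, the call that triggers the problem presumably looks something like the Morphia query below; the Order entity and dao names are just placeholders:

    // Hypothetical reconstruction of the failing call: asList() walks the whole
    // cursor and materializes every document into a single in-memory list.
    List<Order> everything = dao.find().asList();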

Instead you can:

1) Iterate the cursor instead of materializing a list: DBCursor cursor = datasource.getCollection("mycollection").find()

2) Read the documents one at a time and feed them into a buffer (say, a list)

3) Every 1000 documents (say), call the Hadoop API, clear the buffer, then start again (see the sketch after this list).
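
A minimal sketch of that loop, assuming the legacy com.mongodb driver classes and a placeholder flushToHadoop(...) method standing in for whatever Hadoop write call you already use:

    import com.mongodb.DB;
    import com.mongodb.DBCursor;
    import com.mongodb.DBObject;

    import java.util.ArrayList;
    import java.util.List;

    public class CursorToHadoop {

        private static final int BATCH_SIZE = 1000;

        public static void stream(DB db) {
            DBCursor cursor = db.getCollection("mycollection").find();
            List<DBObject> buffer = new ArrayList<>(BATCH_SIZE);
            try {
                while (cursor.hasNext()) {
                    buffer.add(cursor.next());      // read one document at a time
                    if (buffer.size() == BATCH_SIZE) {
                        flushToHadoop(buffer);      // hand the batch over to Hadoop
                        buffer.clear();             // release it before the next batch
                    }
                }
                if (!buffer.isEmpty()) {
                    flushToHadoop(buffer);          // last, partial batch
                }
            } finally {
                cursor.close();
            }
        }

        private static void flushToHadoop(List<DBObject> batch) {
            // placeholder: write the batch with your Hadoop API of choice
        }
    }

This way at most one batch of documents is in memory at any time, regardless of the collection size.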


The asList() call will try to load the whole MongoDB collection into memory, building an in-memory list object bigger than 3 GB.

Iterating the collection with a cursor will fix this problem. You can do this with the Datasource class, but I prefer the type-safe abstractions that Morphia offers with its DAO classes:

    class Dao extends BasicDAO<Order, String> {
        Dao(Datastore ds) {
            super(Order.class, ds);
        }
    }

    Datastore ds = morphia.createDatastore(mongoClient, DB_NAME);
    Dao dao = new Dao(ds);

    // fetch() streams results through the cursor instead of loading them all
    Iterator<Order> iterator = dao.find().fetch();
    while (iterator.hasNext()) {
        Order order = iterator.next();
        hadoopStrategy.add(order);
    }
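
If you also need to hand the documents to Hadoop in batches rather than one at a time, the same buffer-and-flush approach from the answer above works inside this while loop as well.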