How to use Elasticsearch with MongoDB?


This answer should be enough to get you set up to follow this tutorial on Building a functional search component with MongoDB, Elasticsearch, and AngularJS.

If you're looking to use faceted search with data from an API then Matthiasn's BirdWatch Repo is something you might want to look at.

So here's how you can set up a single-node Elasticsearch "cluster" to index MongoDB for use in a NodeJS/Express app on a fresh EC2 Ubuntu 14.04 instance.

Make sure everything is up to date.

sudo apt-get update

Install NodeJS.

sudo apt-get install nodejs
sudo apt-get install npm
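
A quick sanity check that both installed (note that Ubuntu's apt package installs the binary as nodejs rather than node):

nodejs --version
npm --version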

Install MongoDB - These steps are straight from the MongoDB docs. Choose whatever version you're comfortable with; I'm sticking with v2.4.9 because it seems to be the most recent version MongoDB-River supports without issues.

Import the MongoDB public GPG Key.

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10

Update your sources list.

echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | sudo tee /etc/apt/sources.list.d/mongodb.list

Get the 10gen package.

sudo apt-get install mongodb-10gen

Then pick your version if you don't want the most recent. If you are setting your environment up on a windows 7 or 8 machine stay away from v2.6 until they work some bugs out with running it as a service.

sudo apt-get install mongodb-10gen=2.4.9

Prevent the version of your MongoDB installation being bumped up when you update.

echo "mongodb-10gen hold" | sudo dpkg --set-selections

Start the MongoDB service.

sudo service mongodb start

Your database files default to /var/lib/mongo and your log files to /var/log/mongo.

Create a database through the mongo shell and push some dummy data into it.

mongo YOUR_DATABASE_NAME
db.createCollection("YOUR_COLLECTION_NAME")
for (var i = 1; i <= 25; i++) db.YOUR_COLLECTION_NAME.insert( { x : i } )
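
To sanity-check that the inserts landed, count the documents and look at one:

db.YOUR_COLLECTION_NAME.count()   // should return 25
db.YOUR_COLLECTION_NAME.findOne()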

Now convert the standalone MongoDB instance into a replica set.

First, shut down the mongod process.

mongo YOUR_DATABASE_NAME
use admin
db.shutdownServer()

Since we're running MongoDB as a service, we don't pass the "--replSet rs0" option as a command-line argument when we restart the mongod process. Instead, we put it in the mongod.conf file.

sudo vi /etc/mongod.conf

Add these lines, subbing for your db and log paths.

replSet=rs0
dbpath=YOUR_PATH_TO_DATA/DB
logpath=YOUR_PATH_TO_LOG/MONGO.LOG

Now start the MongoDB service back up and open the mongo shell again to initialize the replica set.

mongo DATABASE_NAME
config = { "_id" : "rs0", "members" : [ { "_id" : 0, "host" : "127.0.0.1:27017" } ] }
rs.initiate(config)
rs.slaveOk() // allows read operations to run on secondary members.
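
To confirm the replica set came up, you can check its status; after a few seconds the lone member should report "stateStr" : "PRIMARY":

rs.status()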

Now install Elasticsearch. I'm just following this helpful Gist.

Make sure Java is installed.

sudo apt-get install openjdk-7-jre-headless -y

Stick with v1.1.x for now until the Mongo-River plugin bug gets fixed in v1.2.1.

wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.1.1.deb
sudo dpkg -i elasticsearch-1.1.1.deb
curl -L http://github.com/elasticsearch/elasticsearch-servicewrapper/tarball/master | tar -xz
sudo mv *servicewrapper*/service /usr/local/share/elasticsearch/bin/
sudo rm -Rf *servicewrapper*
sudo /usr/local/share/elasticsearch/bin/service/elasticsearch install
sudo ln -s `readlink -f /usr/local/share/elasticsearch/bin/service/elasticsearch` /usr/local/bin/rcelasticsearch

Make sure /etc/elasticsearch/elasticsearch.yml has the following config options enabled if you're only developing on a single node for now:

cluster.name: "MY_CLUSTER_NAME"
node.local: true

Start the Elasticsearch service.

sudo service elasticsearch start

Verify it's working.

curl http://localhost:9200

If you see something like this then you're good.

{  "status" : 200,  "name" : "Chi Demon",  "version" : {    "number" : "1.1.2",    "build_hash" : "e511f7b28b77c4d99175905fac65bffbf4c80cf7",    "build_timestamp" : "2014-05-22T12:27:39Z",    "build_snapshot" : false,    "lucene_version" : "4.7"  },  "tagline" : "You Know, for Search"}

Now install the Elasticsearch plugins so it can play with MongoDB.

bin/plugin --install com.github.richardwilly98.elasticsearch/elasticsearch-river-mongodb/1.6.0
bin/plugin --install elasticsearch/elasticsearch-mapper-attachments/1.6.0

These two plugins aren't necessary but they're good for testing queries and visualizing changes to your indexes.

bin/plugin --install mobz/elasticsearch-head
bin/plugin --install lukas-vlcek/bigdesk

Restart Elasticsearch.

sudo service elasticsearch restart

Finally, index a collection from MongoDB by creating a river definition.

curl -XPUT localhost:9200/_river/DATABASE_NAME/_meta -d '{
  "type": "mongodb",
  "mongodb": {
    "servers": [
      { "host": "127.0.0.1", "port": 27017 }
    ],
    "db": "DATABASE_NAME",
    "collection": "ACTUAL_COLLECTION_NAME",
    "options": { "secondary_read_preference": true },
    "gridfs": false
  },
  "index": {
    "name": "ARBITRARY INDEX NAME",
    "type": "ARBITRARY TYPE NAME"
  }
}'
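
The river definition is stored as an ordinary Elasticsearch document, so a quick way to confirm it was saved is to read it back:

curl -XGET 'http://localhost:9200/_river/DATABASE_NAME/_meta?pretty'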

Check that your index is in Elasticsearch.

curl -XGET http://localhost:9200/_aliases
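
You can also run a match-all search against the new index (substituting whatever index name you chose in the river definition) to confirm the MongoDB documents actually came across:

curl -XGET 'http://localhost:9200/YOUR_INDEX_NAME/_search?q=*:*&pretty'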

Check your cluster health.

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

It's probably yellow with some unassigned shards: on a single node there is nowhere to allocate replica shards, so we have to tell Elasticsearch we don't want any replicas.

curl -XPUT 'localhost:9200/_settings' -d '{ "index" : { "number_of_replicas" : 0 } }'

Check cluster health again. It should be green now.

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'

Go play.


Using river can present issues when your operation scales up. River uses a ton of memory under heavy load. I recommend implementing your own Elasticsearch models or, if you're using mongoose, building your Elasticsearch models right into it, or using mongoosastic, which essentially does this for you (see the sketch below).
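
As a rough sketch of the mongoosastic approach (the schema fields and connection details here are hypothetical; check the mongoosastic README for the options your version supports):

var mongoose = require('mongoose');
var mongoosastic = require('mongoosastic');

// Hypothetical schema -- swap in your own fields
var PageSchema = new mongoose.Schema({
  title: String,
  categories: Array
});

// Mirror every save into Elasticsearch
PageSchema.plugin(mongoosastic, { host: 'localhost', port: 9200 });

var Page = mongoose.model('Page', PageSchema);

// Documents are indexed on save; query them back like so:
Page.search({ query_string: { query: 'mongodb' } }, function(err, results) {
  if (err) throw err;
  console.log(results.hits.hits);
});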

Another disadvantage of MongoDB River is that you'll be stuck on the MongoDB 2.4.x branch and ElasticSearch 0.90.x. You'll start to find that you're missing out on a lot of really nice features, and the mongodb-river project just doesn't produce a usable product fast enough to keep stable. In any case, MongoDB River is definitely not something I'd go into production with. It has posed more problems than it's worth: it will randomly drop writes under heavy load, it will consume lots of memory, and there's no setting to cap that. Additionally, river doesn't update in real time; it reads oplogs from MongoDB, and in my experience this can delay updates for as long as five minutes.
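
For context, the oplog that river tails is just a capped collection in MongoDB's local database; you can inspect the latest entry yourself from the mongo shell to see what river is working through:

use local
db.oplog.rs.find().sort({ $natural: -1 }).limit(1).pretty()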

We recently had to rewrite a large portion of our project because it's a weekly occurrence that something goes wrong with ElasticSearch. We had even gone as far as hiring a DevOps consultant, who also agrees that it's best to move away from River.

UPDATE: Elasticsearch-mongodb-river now supports ES v1.4.0 and MongoDB v2.6.x. However, you'll still likely run into performance problems on heavy insert/update operations, as this plugin tries to read MongoDB's oplogs to sync. If a lot of operations have accumulated since the lock (or latch, rather) last released, you'll notice extremely high memory usage on your Elasticsearch server. If you plan on having a large operation, river is not a good option. The developers of ElasticSearch still recommend managing your own indexes by communicating directly with their API using the client library for your language, rather than using river; that isn't really river's purpose. Twitter-river is a great example of how river should be used: it's essentially a great way to source data from outside sources, but not very reliable for high traffic or internal use.

Also consider that mongodb-river falls behind in versions, as it's not maintained by the ElasticSearch organization but by a third party. Development was stuck on the v0.90 branch for a long time after the release of v1.0, and when a version for v1.0 was released it wasn't stable until ElasticSearch released v1.3.0. MongoDB versions also fall behind. You may find yourself in a tight spot when you want to move to a later version of each, especially with ElasticSearch under such heavy development and many highly anticipated features on the way. Staying current with the latest ElasticSearch has been very important for us, as we rely heavily on constantly improving our search functionality; it's a core part of our product.

All in all, you'll likely get a better product if you do it yourself. It's not that difficult: it's just another database to manage in your code, and it can easily be dropped into your existing models without major refactoring.
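
As a minimal sketch of the do-it-yourself approach using the official elasticsearch Node.js client (the index/type names and document shape here are hypothetical placeholders):

var elasticsearch = require('elasticsearch');

var client = new elasticsearch.Client({ host: 'localhost:9200' });

// Call this right after a successful MongoDB write so the index stays in sync.
// Reusing the MongoDB _id means re-indexing a doc overwrites the old version.
function indexPage(doc, callback) {
  client.index({
    index: 'pages',
    type: 'page',
    id: doc._id.toString(),
    body: {
      title: doc.title,
      categories: doc.categories
    }
  }, callback);
}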


River is a good solution if you want near-real-time synchronization as part of a general solution.

If you already have data in MongoDB and want to ship it to Elasticsearch in one shot, you can try my package for Node.js: https://github.com/itemsapi/elasticbulk.

It uses Node.js streams, so you can import data from anything that supports streams (e.g. MongoDB, PostgreSQL, MySQL, JSON files, etc.).

Example for MongoDB to Elasticsearch:

Install packages:

npm install elasticbulk
npm install mongoose
npm install bluebird

Create a script, e.g. script.js:

const elasticbulk = require('elasticbulk');
const mongoose = require('mongoose');
const Promise = require('bluebird');

mongoose.connect('mongodb://localhost/your_database_name', {
  useMongoClient: true
});
mongoose.Promise = Promise;

var Page = mongoose.model('Page', new mongoose.Schema({
  title: String,
  categories: Array
}), 'your_collection_name');

// stream query
var stream = Page.find({}, { title: 1, _id: 0, categories: 1 })
  .limit(1500000)
  .skip(0)
  .batchSize(500)
  .stream();

elasticbulk.import(stream, {
  index: 'my_index_name',
  type: 'my_type_name',
  host: 'localhost:9200',
})
.then(function(res) {
  console.log('Importing finished');
});

Ship your data:

node script.js

It's not extremely fast, but it works for millions of records (thanks to streams).