
Scrapy MongoDB and Elasticsearch Synchronization


Rivers in Elasticsearch are deprecated.

Instead, you can use Transporter to sync data between MongoDB and Elasticsearch.

How To Sync Transformed Data from MongoDB to Elasticsearch with Transporter

Installing Go

In order to install Compose's Transporter, we first need to install the Go language:

sudo apt-get install golang

Create a folder for Go from your $HOME directory:

mkdir ~/go; echo "export GOPATH=$HOME/go" >> ~/.bashrc

Update your path:

echo "export PATH=$PATH:$HOME/go/bin:/usr/local/go/bin" >> ~/.bashrc

Now go to the $GOPATH directory and create the subdirectories src, pkg and bin. These directories constitute a workspace for Go.

cd $GOPATH
mkdir src pkg bin

Installing Transporter

Now create and move into a new directory for Transporter. Since the utility was developed by Compose, we'll call the directory compose.

mkdir -p $GOPATH/src/github.com/compose
cd $GOPATH/src/github.com/compose

This is where compose/transporter will be installed.

Clone the Transporter GitHub repository:

git clone https://github.com/compose/transporter.git

Move into the new directory:

cd transporter

Take ownership of the /usr/lib/go directory:

sudo chown -R $USER /usr/lib/go

Make sure build-essential is installed for GCC:

sudo apt-get install build-essential

Run the go get command to get all the dependencies:

go get -a ./cmd/...

This step might take a while, so be patient. Once it's done you can build Transporter.

go build -a ./cmd/...

If all goes well, it will complete without any errors or warnings. Check that Transporter is installed correctly by running this command:

transporter

If the command runs, the installation is complete.

Create some sample data in MongoDB. Then we have to configure Transporter. Transporter requires a config file (config.yaml), a transform file (in our case, addFullName.js), and an application file (application.js) to migrate our data from MongoDB to Elasticsearch.
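For the sample data, a minimal sketch from the mongo shell (the foo database and bar collection match the namespace used below; the names and fields are just examples):

use foo
db.bar.insert({ firstName: "Sammy", lastName: "Shark" });
db.bar.insert({ firstName: "Gilly", lastName: "Glowfish" });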

Move to the transporter directory:

cd ~/go/src/github.com/compose/transporter

Config File

You can take a look at the example config.yaml file if you like. We're going to back up the original and then replace it with our own contents.

mv test/config.yaml test/config.yaml.00

The new file is similar but updates some of the URIs and a few of the other settings to match what's on our server. Let's copy the contents below into the new config.yaml file, using the nano editor again.

nano test/config.yaml

Copy the contents below into the file. Once done, save the file as described earlier.

# api:
#   interval: 60s
#   uri: "http://requestb.in/13gerls1"
#   key: "48593282-b38d-4bf5-af58-f7327271e73d"
#   pid: "something-static"
nodes:
  localmongo:
    type: mongo
    uri: mongodb://localhost/foo
    tail: true
  es:
    type: elasticsearch
    uri: http://localhost:9200/
  timeseries:
    type: influx
    uri: influxdb://root:root@localhost:8086/compose
  debug:
    type: file
    uri: stdout://
  foofile:
    type: file
    uri: file:///tmp/foo

Application File

Now, open the application.js file in the test directory.

nano test/application.js

Replace the sample contents of the file with the contents shown below:

Source({name:"localmongo", namespace:"foo.bar"}).transform({filename: "transformers/addFullName.js", namespace: "foo.bar"}).save({name:"es", namespace:"foo.bar"});

Transformation File

Let's say we want the documents being stored in Elasticsearch to have another field called fullName. For that, we need to create a new transform file, test/transformers/addFullName.js.

nano test/transformers/addFullName.js

Paste the contents below into the file. Save and exit as described earlier.

module.exports = function(doc) {
  console.log(JSON.stringify(doc)); // If you are curious you can listen in on what's changed and being copied.
  doc._id = doc.data._id['$oid'];
  doc["fullName"] = doc["firstName"] + " " + doc["lastName"];
  return doc;
}

The doc._id line is necessary to handle the way Transporter hands over MongoDB's ObjectId() field. The fullName line tells Transporter to concatenate the MongoDB firstName and lastName fields to form the fullName field in Elasticsearch.

This is a simple transformation for the example, but with a little JavaScript you can do more complex data manipulation as you prepare your data for searching.
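For instance, a transformer along these lines could also trim and normalize fields before indexing (a sketch under the same assumptions as above; searchName is a hypothetical extra field):

module.exports = function(doc) {
  doc._id = doc.data._id['$oid'];
  doc["fullName"] = (doc["firstName"] + " " + doc["lastName"]).trim();
  doc["searchName"] = doc["fullName"].toLowerCase(); // extra field for case-insensitive matching
  delete doc["firstName"]; // drop the source fields if only fullName is needed
  delete doc["lastName"];
  return doc;
}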

Executing Transporter

If you have a simple standalone instance of MongoDB, it isn't replicated, there is no oplog, and Transporter won't be able to detect changes. To convert a standalone MongoDB instance into a single-node replica set, start the server with --replSet rs0 (rs0 is just a name for the set) and, once it's running, log in with the Mongo shell and run rs.initiate() to get the server to configure itself.
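Roughly like this (a sketch; the dbpath is just an example for your system):

mongod --replSet rs0 --dbpath /var/lib/mongodb
mongo --eval "rs.initiate()"

Then make sure you are in the transporter directory: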

cd ~/go/src/github.com/compose/transporter

Execute the following command to sync the data:

transporter run --config ./test/config.yaml ./test/application.js
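Once it finishes, you can verify the result by querying Elasticsearch directly (a sketch; it assumes Transporter mapped the foo.bar namespace to the foo index and bar type):

curl -XGET 'http://localhost:9200/foo/bar/_search?pretty'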


The synchronizing mechanism that you are looking for is called rivers in Elasticsearch (though, as noted above, rivers have since been deprecated). In this specific case, you should sync the specific MongoDB collection that you are using to save your Scrapy data with the Elasticsearch index.

For details about how to proceed you should check out the following links:

  1. setting up the mongodb river
  2. mongodb river plugin

Also, I recommend checking out the answers under the Elasticsearch tag here on Stack Overflow. I have found detailed answers for most of the common problems regarding implementation details.