Scrapy MongoDB and Elasticsearch Synchronization
Rivers in Elasticsearch are deprecated.
Instead, you can use Transporter to sync data between MongoDB and Elasticsearch.
How To Sync Transformed Data from MongoDB to Elasticsearch with Transporter
Installing Go
To install the Compose Transporter, we first need to install the Go language.
sudo apt-get install golang
Create a folder for Go from your $HOME directory:
mkdir ~/go; echo "export GOPATH=$HOME/go" >> ~/.bashrc
Update your path:
echo "export PATH=$PATH:$HOME/go/bin:/usr/local/go/bin" >> ~/.bashrc
Now go to the $GOPATH directory and create the subdirectories src, pkg and bin. These directories constitute a workspace for Go.
cd $GOPATH
mkdir src pkg bin
Installing Transporter
Now create and move into a new directory for Transporter. Since the utility was developed by Compose, we'll call the directory compose.
mkdir -p $GOPATH/src/github.com/compose
cd $GOPATH/src/github.com/compose
This is where compose/transporter will be installed.
Clone the Transporter GitHub repository:
git clone https://github.com/compose/transporter.git
Move into the new directory:
cd transporter
Take ownership of the /usr/lib/go directory:
sudo chown -R $USER /usr/lib/go
Make sure build-essential is installed for GCC:
sudo apt-get install build-essential
Run the go get command to get all the dependencies:
go get -a ./cmd/...
This step might take a while, so be patient. Once it's done you can build Transporter.
go build -a ./cmd/...
If all goes well, it will complete without any errors or warnings. Check that Transporter is installed correctly by running this command:
transporter
The installation is now complete.
Create some sample data in MongoDB. Then we need to configure Transporter. Transporter requires a config file (config.yaml), a transform file (test/transformers/addFullName.js), and an application file (application.js) to migrate our data from MongoDB to Elasticsearch.
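For example, assuming a database foo with a collection bar (matching the namespace used in the files below), you could insert a couple of documents from the Mongo shell. The document fields here are only placeholders:

```shell
# Insert two sample documents into foo.bar (requires a running mongod)
mongo localhost/foo --eval '
  db.bar.insert([
    { firstName: "Ada",   lastName: "Lovelace" },
    { firstName: "Grace", lastName: "Hopper" }
  ]);
'
```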
Move to the transporter directory:
cd ~/go/src/github.com/compose/transporter
Config File
You can take a look at the example config.yaml file if you like. We're going to back up the original and then replace it with our own contents.
mv test/config.yaml test/config.yaml.00
The new file is similar but updates some of the URIs and a few of the other settings to match what's on our server. Copy the contents from here and paste them into the new config.yaml file, using the nano editor again.
nano test/config.yaml
Copy the contents below into the file. Once done, save the file as described earlier.
# api:
#   interval: 60s
#   uri: "http://requestb.in/13gerls1"
#   key: "48593282-b38d-4bf5-af58-f7327271e73d"
#   pid: "something-static"
nodes:
  localmongo:
    type: mongo
    uri: mongodb://localhost/foo
    tail: true
  es:
    type: elasticsearch
    uri: http://localhost:9200/
  timeseries:
    type: influx
    uri: influxdb://root:root@localhost:8086/compose
  debug:
    type: file
    uri: stdout://
  foofile:
    type: file
    uri: file:///tmp/foo
Application File
Now, open the application.js file in the test directory.
nano test/application.js
Replace the sample contents of the file with the contents shown below:
Source({name: "localmongo", namespace: "foo.bar"})
  .transform({filename: "transformers/addFullName.js", namespace: "foo.bar"})
  .save({name: "es", namespace: "foo.bar"});
Transformation File
Let's say we want the documents being stored in Elasticsearch to have another field called fullName. For that, we need to create a new transform file, test/transformers/addFullName.js.
nano test/transformers/addFullName.js
Paste the contents below into the file. Save and exit as described earlier.
module.exports = function(doc) {
  // If you are curious, you can listen in on what's changed and being copied.
  console.log(JSON.stringify(doc));
  doc._id = doc.data._id['$oid'];
  doc["fullName"] = doc["firstName"] + " " + doc["lastName"];
  return doc;
}
The _id assignment is necessary to handle the way Transporter treats MongoDB's ObjectId() field. The fullName line tells Transporter to concatenate MongoDB's firstName and lastName fields to form the fullName field in Elasticsearch.
This is a simple transformation for the example, but with a little JavaScript you can do more complex data manipulation as you prepare your data for searching.
Executing Transporter
If you have a simple standalone instance of MongoDB, it won't be replicated, there will be no oplog, and Transporter won't be able to detect changes. To convert a standalone MongoDB instance into a single-node replica set, start the server with --replSet rs0 (rs0 is just a name for the set), then log in with the Mongo shell and run rs.initiate() so the server configures itself. Make sure you are in the transporter directory:
cd ~/go/src/github.com/compose/transporter
Execute the following command to sync the data:
transporter run --config ./test/config.yaml ./test/application.js
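If you still need the standalone-to-replica-set conversion described above, the commands look roughly like this. This is a sketch; the dbpath is an assumption and should match your own MongoDB setup:

```shell
# Restart mongod as a single-node replica set named rs0
mongod --replSet rs0 --dbpath /var/lib/mongodb &

# Once it is up, initiate the replica set from the Mongo shell
mongo --eval 'rs.initiate()'
```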
The synchronizing mechanism you may have read about was called "rivers" in Elasticsearch, but rivers are deprecated. In this specific case, you should sync the MongoDB collection you are using to save your Scrapy data with the Elasticsearch index.
For details about how to proceed you should check out the following links:
I also recommend checking out the answers under the elasticsearch tag here on Stack Overflow; I have found detailed answers for most of the common problems regarding implementation details.