
Loading a Wikipedia Dump into Elasticsearch


Two years ago, Wikimedia made dumps of its production Elasticsearch indices available.

The indices are exported every week, and for each wiki there are two exports:

  • the content index, called content, which contains only article pages;
  • the general index, called general, which contains all pages, including talk pages, templates, etc.

You can find them here: http://dumps.wikimedia.org/other/cirrussearch/current/
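For instance, to fetch the English Wikipedia general dump (the file name below is illustrative; the actual names in the directory listing carry the dump date):

    # illustrative download; replace the file name with the one listed for the latest dump
    wget http://dumps.wikimedia.org/other/cirrussearch/current/enwiki-current-cirrussearch-general.json.gz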

  • Create a mapping according to your needs, for example:

    {
      "mappings": {
        "page": {
          "properties": {
            "auxiliary_text": { "type": "text" },
            "category": { "type": "text" },
            "coordinates": {
              "properties": {
                "coord": {
                  "properties": {
                    "lat": { "type": "double" },
                    "lon": { "type": "double" }
                  }
                },
                "country": { "type": "text" },
                "dim": { "type": "long" },
                "globe": { "type": "text" },
                "name": { "type": "text" },
                "primary": { "type": "boolean" },
                "region": { "type": "text" },
                "type": { "type": "text" }
              }
            },
            "defaultsort": { "type": "boolean" },
            "external_link": { "type": "text" },
            "heading": { "type": "text" },
            "incoming_links": { "type": "long" },
            "language": { "type": "text" },
            "namespace": { "type": "long" },
            "namespace_text": { "type": "text" },
            "opening_text": { "type": "text" },
            "outgoing_link": { "type": "text" },
            "popularity_score": { "type": "double" },
            "redirect": {
              "properties": {
                "namespace": { "type": "long" },
                "title": { "type": "text" }
              }
            },
            "score": { "type": "double" },
            "source_text": { "type": "text" },
            "template": { "type": "text" },
            "text": { "type": "text" },
            "text_bytes": { "type": "long" },
            "timestamp": {
              "type": "date",
              "format": "strict_date_optional_time||epoch_millis"
            },
            "title": { "type": "text" },
            "version": { "type": "long" },
            "version_type": { "type": "text" },
            "wiki": { "type": "text" },
            "wikibase_item": { "type": "text" }
          }
        }
      }
    }
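Assuming Elasticsearch is listening on localhost:9200 and the mapping above is saved as mapping.json (a file name chosen here for illustration), the index can be created with a request along these lines. Note that the page mapping type implies Elasticsearch 5.x/6.x; mapping types were removed in 7.x, so the mapping would need to be adapted there.

    # sketch: create the enwiki index with the mapping above (adjust the index name to your wiki)
    curl -s -XPUT 'http://localhost:9200/enwiki' \
         -H 'Content-Type: application/json' \
         --data-binary @mapping.json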

Once you have created the index, just run:

zcat enwiki-current-cirrussearch-general.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/enwiki/_bulk --data-binary @- > /dev/null'
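The dump is already in _bulk format, where each document occupies two lines (an action line followed by the document source); -L 2 therefore keeps those pairs together, -N 2000 sends 2000 records per request, and -j3 runs three curl jobs in parallel. On Elasticsearch 6.x and later, the _bulk endpoint also requires an explicit content type, so a variant along these lines may be needed:

    # sketch of the same pipeline with the NDJSON content type required by newer Elasticsearch versions
    zcat enwiki-current-cirrussearch-general.json.gz | parallel --pipe -L 2 -N 2000 -j3 \
        'curl -s -H "Content-Type: application/x-ndjson" http://localhost:9200/enwiki/_bulk --data-binary @- > /dev/null'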

Enjoy!


I tried many ways to import the Wikipedia dump. I found two that work, including using Logstash and writing Python code directly.