
Indexing Wikipedia dump to elasticsearch gets XML document structures must start and end within the same entity error


I'm not sure how to make the XML imports work, but there is another option: Wikimedia has recently made dumps of its production Elasticsearch indices available.

The indices are exported every week, and each wiki gets two exports: one of its content index and one of its general index.

The dumps are formatted for the Elasticsearch bulk API. Because that format is line-delimited JSON, the dumps are also usable outside Elasticsearch.
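As a rough illustration of using a dump without Elasticsearch: the bulk format alternates an action line with a document line, so keeping every second line leaves only the page documents. The function name `bulk_documents` and the two sample records are made up for this sketch; a real dump would be piped in through `zcat`.

```shell
# Keep only the document lines (every second line) of a bulk-format stream.
# bulk_documents is a hypothetical helper name, not part of any tool.
bulk_documents() {
  awk 'NR % 2 == 0'
}

# Made-up sample records standing in for a real dump.
printf '%s\n' \
  '{"index":{"_type":"page","_id":"1"}}' \
  '{"title":"Example A"}' \
  '{"index":{"_type":"page","_id":"2"}}' \
  '{"title":"Example B"}' | bulk_documents

# With a real file, something like:
#   zcat enwiki-20151116-cirrussearch-general.json.gz | bulk_documents | head -n 1
```

Each remaining line is a standalone JSON object, so it can be handed to `jq` or any other JSON parser one line at a time.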

Importing them is not documented yet, but I do roughly the following:

  1. Fetch the current mapping (the URL must be quoted, otherwise the shell treats the & as "run in background"): curl 'https://en.wikipedia.org/w/api.php?action=cirrus-mapping-dump&format=json' > mapping.json
  2. Feed that mapping into elasticsearch: jq .content < mapping.json | curl -XPUT localhost:9200/enwiki_content --data @-
  3. Load the dump: zcat enwiki-20151116-cirrussearch-general.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/enwiki_content/_bulk --data-binary @- > /dev/null'
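The point of `-L 2 -N 2000` in step 3 is that every request carries whole action/document pairs, 2000 of them at a time. If GNU parallel is not installed, the same batching can be sketched sequentially with GNU `split --filter`. This is only a sketch under assumptions: it requires GNU coreutils, it sends one batch at a time rather than three in parallel, and `send_in_batches` is a hypothetical helper name.

```shell
# Feed stdin to a command in batches of whole action/document pairs.
#   $1: number of pairs per batch
#   $2: shell command that reads one batch on its stdin
# send_in_batches is a hypothetical helper; --filter needs GNU split.
send_in_batches() {
  pairs=$1
  cmd=$2
  # A pair is two lines, so an even line count per batch never
  # separates an action line from its document line.
  split -l "$((pairs * 2))" --filter="$cmd" -
}

# Possible use with the dump from step 3 (not run here):
#   zcat enwiki-20151116-cirrussearch-general.json.gz |
#     send_in_batches 2000 'curl -s http://localhost:9200/enwiki_content/_bulk --data-binary @- > /dev/null'
```

Dropping the redirect to /dev/null while testing makes it possible to inspect the bulk responses for rejected documents.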