MapReduceIndexerTool - Best way to index HDFS files in Solr?

hadoop



On Cloudera I think that you have these options:

About MapReduceIndexerTool, here is a quick guide:

Index a CSV to Solr using MapReduceIndexerTool

This guide shows you how to index/upload a .csv file to Solr using MapReduceIndexerTool. The procedure reads the CSV from HDFS and writes the index directly to HDFS.

See also https://www.cloudera.com/documentation/enterprise/latest/topics/search_mapreduceindexertool.html .

Assuming that you have:

  • a valid Cloudera installation (see THIS_IS_YOUR_CLOUDERA_HOST; if using the Docker Quickstart it should be quickstart.cloudera)
  • a csv file stored in HDFS (see THIS_IS_YOUR_INPUT_CSV_FILE, like /your-hdfs-dir/your-csv.csv)
  • a valid destination Solr collection with the expected fields already configured (see THIS_IS_YOUR_DESTINATION_COLLECTION); a sketch of one way to create such a collection is shown after this list
    • the output directory will be the Solr configured instanceDir (see THIS_IS_YOUR_CORE_INSTANCEDIR) and should be an HDFS path
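
If the destination collection does not exist yet, the following is a minimal sketch of one way to create it on a CDH/Cloudera Search setup with solrctl. The configuration directory, shard count and collection name are placeholders, and the exact solrctl options can differ between Cloudera releases, so treat this as an outline rather than a recipe:

# Generate a template instance directory, add the fields listed further below to
# its conf/schema.xml, then register it and create the collection (names are placeholders).
solrctl instancedir --generate $HOME/solr_configs
# ... edit $HOME/solr_configs/conf/schema.xml and add the uid/firstName/lastName fields ...
solrctl instancedir --create THIS_IS_YOUR_DESTINATION_COLLECTION $HOME/solr_configs
solrctl collection --create THIS_IS_YOUR_DESTINATION_COLLECTION -s 1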

For this example we will process a TAB-separated file with uid, lastName and firstName columns, expected in exactly that order, since the order must match the readCSV columns list in the morphline below (a hypothetical sample file is shown after the field definitions). The first row contains the headers; the Morphline configuration will skip the first line, so the actual column names don't matter. On Solr we should configure the fields with something similar:

<field name="_version_" type="long" indexed="true" stored="true" /><field name="uid" type="string" indexed="true" stored="true" required="true" /><field name="firstName" type="text_general" indexed="true" stored="true" /><field name="lastName" type="text_general" indexed="true" stored="true" /><field name="text" type="text_general" indexed="true" multiValued="true" />

Then you should create a Morphlines configuration file (csv-to-solr-morphline.conf) with the following code:

# Specify server locations in a SOLR_LOCATOR variable; used later in
# variable substitutions:
SOLR_LOCATOR : {
  # Name of solr collection
  collection : THIS_IS_YOUR_DESTINATION_COLLECTION

  # ZooKeeper ensemble
  zkHost : "THIS_IS_YOUR_CLOUDERA_HOST:2181/solr"
}

# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more potentially
# nested commands. A morphline is a way to consume records such as Flume events,
# HDFS files or blocks, turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on the way to
# a target application such as Solr.
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]

    commands : [
      {
        readCSV {
          separator : "\t"
          # These columns should map to the ones configured in Solr and are
          # expected in this position inside the CSV
          columns : [uid,lastName,firstName]
          ignoreFirstLine : true
          quoteChar : ""
          commentPrefix : ""
          trim : true
          charset : UTF-8
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # This command deletes record fields that are unknown to Solr
      # schema.xml.
      #
      # Recall that Solr throws an exception on any attempt to load a document
      # that contains a field that is not specified in schema.xml.
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # log the record at DEBUG level to SLF4J
      { logDebug { format : "output record: {}", args : ["@{}"] } }

      # load the record into a Solr server or MapReduce Reducer
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]

To import, run the following command inside the cluster:

hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --output-dir hdfs://quickstart.cloudera/THIS_IS_YOUR_CORE_INSTANCEDIR/ \
  --morphline-file ./csv-to-solr-morphline.conf \
  --zk-host quickstart.cloudera:2181/solr \
  --solr-home-dir /THIS_IS_YOUR_CORE_INSTANCEDIR \
  --collection THIS_IS_YOUR_DESTINATION_COLLECTION \
  --go-live \
  hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_INPUT_CSV_FILE
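
Before launching the full job it can help to validate the morphline locally. The sketch below assumes your search-mr build supports the --dry-run option, which runs the morphline in the client process and prints the resulting documents to stdout instead of loading them into Solr; the output directory here is just a scratch placeholder and the other names are the same placeholders as above:

hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  --output-dir hdfs://quickstart.cloudera/tmp/dry-run-output/ \
  --morphline-file ./csv-to-solr-morphline.conf \
  --zk-host quickstart.cloudera:2181/solr \
  --solr-home-dir /THIS_IS_YOUR_CORE_INSTANCEDIR \
  --collection THIS_IS_YOUR_DESTINATION_COLLECTION \
  --dry-run \
  hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_INPUT_CSV_FILE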

Some considerations:

  • You may need to run the above command with sudo -u hdfs, because your user probably does not have permission to write to the HDFS output directory.
  • By default the Cloudera QuickStart VM has very small memory and heap settings. If you receive an out-of-memory or heap-space exception, I suggest increasing them via Cloudera Manager -> YARN -> Configuration (http://THIS_IS_YOUR_CLOUDERA_HOST:7180/cmf/services/11/config#filterdisplayGroup=Resource+Management). I used 1 GB of memory and 500 MB of heap for both the map and reduce tasks. Consider also changing yarn.app.mapreduce.am.command-opts, mapreduce.map.java.opts, mapreduce.map.memory.mb and mapreduce.reduce.memory.mb inside /etc/hadoop/conf/mapred-site.xml. A sketch of setting these values for a single job is shown after this list.
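
As a per-job alternative to changing the cluster configuration, these memory settings can usually be passed as Hadoop generic options. This is only a sketch, assuming the tool is started through ToolRunner so that -D properties are picked up; the values are the ones mentioned above and the rest of the command mirrors the earlier example:

sudo -u hdfs hadoop jar /usr/lib/solr/contrib/mr/search-mr-*-job.jar \
  org.apache.solr.hadoop.MapReduceIndexerTool \
  -D mapreduce.map.memory.mb=1024 \
  -D mapreduce.map.java.opts=-Xmx500m \
  -D mapreduce.reduce.memory.mb=1024 \
  -D mapreduce.reduce.java.opts=-Xmx500m \
  --output-dir hdfs://quickstart.cloudera/THIS_IS_YOUR_CORE_INSTANCEDIR/ \
  --morphline-file ./csv-to-solr-morphline.conf \
  --zk-host quickstart.cloudera:2181/solr \
  --solr-home-dir /THIS_IS_YOUR_CORE_INSTANCEDIR \
  --collection THIS_IS_YOUR_DESTINATION_COLLECTION \
  --go-live \
  hdfs://THIS_IS_YOUR_CLOUDERA_HOST/THIS_IS_YOUR_INPUT_CSV_FILE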

Other resources:


But I cannot work with this because it has certain limitations (the main one being that you cannot specify the filetypes to be considered).

With https://github.com/lucidworks/hadoop-solr, the input is a path.

So you can filter by file name, for example:

-i /path/*.pdf

Edit:

You can add the add.subdirectories argument, but the *.pdf filter is not applied recursively (see the project's GitHub source):

-Dadd.subdirectories=true
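
For completeness, a full hadoop-solr invocation might look roughly like the sketch below. This is an assumption pieced together from my reading of the project's README rather than a tested command: the jar name, the IngestJob and DirectoryIngestMapper class names, and the -cls/-c/-i/-of/-zk options may differ between versions, so double-check them against the repository before running anything.

# Hypothetical sketch only -- verify class names and flags against the hadoop-solr docs.
hadoop jar solr-hadoop-job-*.jar com.lucidworks.hadoop.ingest.IngestJob \
  -Dlww.commit.on.close=true \
  -Dadd.subdirectories=true \
  -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper \
  -c THIS_IS_YOUR_DESTINATION_COLLECTION \
  -i /path/*.pdf \
  -of com.lucidworks.hadoop.io.LWMapWritable \
  -zk THIS_IS_YOUR_CLOUDERA_HOST:2181/solr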