Is there any way to add delimiters to a string in hadoop?



This can be achieved with a MapReduce or Spark job, using substring():

MapReduce (Java): you only need a mapper in this case; just map each input line to a delimited line:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Delimeters {

    public static class DelimetersMapper extends Mapper<Object, Text, Text, Text> {

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // input line: 001012489FEB01856400004400
            String line = value.toString();
            String lineWithDelimeter = line.substring(0, 4) + "|" + line.substring(4, 9)
                    + "|" + line.substring(9, 12) + "|" + line.substring(12, 18)
                    + "|" + line.substring(18, 22) + "|" + line.substring(22, 24)
                    + "|" + line.substring(24, 26);
            System.out.println(lineWithDelimeter); // 0010|12489|FEB|018564|0000|44|00
            context.write(new Text(lineWithDelimeter), new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Add-Delimeters-to-flat-file");
        job.setJarByClass(Delimeters.class);
        job.setMapperClass(DelimetersMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // delete the output dir if it already exists so the job can be rerun
        Path dstFilePath = new Path(args[1]);
        try {
            FileSystem fs = dstFilePath.getFileSystem(conf);
            if (fs.exists(dstFilePath))
                fs.delete(dstFilePath, true);
        } catch (IOException e1) {
            e1.printStackTrace();
        }
        job.waitForCompletion(true);
    }
}

Spark (Scala):

import org.apache.spark.{SparkConf, SparkContext}

object delimeter {
  def main(args: Array[String]) {
    val inputFile = args(0)
    val conf = new SparkConf().setAppName("Add-Delimeters-to-flat-file").setMaster("local")
    val sc = new SparkContext(conf)
    val txtFileLines = sc.textFile(inputFile)
    val fields = txtFileLines.map(line => line.substring(0, 4) + "|" + line.substring(4, 9)
      + "|" + line.substring(9, 12) + "|" + line.substring(12, 18)
      + "|" + line.substring(18, 22) + "|" + line.substring(22, 24)
      + "|" + line.substring(24, 26))
    fields.foreach(x => println(x))
    fields.saveAsTextFile(args(1))
  }
}
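
The Spark job could then be submitted along these lines; the jar name and paths below are placeholders, not from the original answer (the code above hard-codes a local master, so no --master flag is needed):

    spark-submit --class delimeter delimeter-app.jar file:///path/to/local/input.txt /path/to/hdfs/output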

Update:

  1. You can use a file:/// URI to let Hadoop know to read from the local file system as the source (the same rule applies to Spark):

    hadoop jar <app.jar> <package.classname> <file:///path/to/local/dir> </path/to/hdfs/>

Example:

    [cloudera@quickstart Desktop]$ hadoop jar hadoop-stack.jar so.Delimeters file:///home/cloudera/Desktop/test.txt /user/cloudera/delim
    [cloudera@quickstart Desktop]$ hadoop fs -cat /user/cloudera/delim/*
    0010|12489|FEB|018564|0000|44|00
    0010|12489|MAR|018564|0000|44|00
  2. You can keep the source files in HDFS and have the application itself delete them after successful processing:

    int exitcode = job.waitForCompletion(true) ? 0 : -1;
    if (exitcode == 0) {
        try {
            Path sourcePath = new Path(args[0]);
            FileSystem fs = sourcePath.getFileSystem(conf);
            if (fs.exists(sourcePath))
                fs.delete(sourcePath, true);
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }
  3. Create an Oozie workflow that runs the application to write the delimited output to the target destination, followed by a shell script that deletes the source file/dir at the end (a minimal sketch of such a cleanup script follows).
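
For the shell-script part of that workflow, a minimal sketch of a cleanup script an Oozie shell action could run after the job succeeds; the path is a placeholder:

    #!/bin/bash
    # delete the already-processed source dir from HDFS
    hadoop fs -rm -r /path/to/source/dir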


Another alternative is to use Hive (which takes care of the programming part):

1) Create an external Hive tmp table pointing to your HDFS raw data file location:

CREATE EXTERNAL TABLE tmp (raw String) LOCATION '<hdfs_path>';

2) Create a formatted_data table with the pipe delimiter as separator:

CREATE TABLE formatted_data(col1 string,col2 string,col3 string,col4 string, col5 string,col6 string,col7 string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

3) Insert the data from the tmp table into the formatted_data table. Note that Hive's substr(str, start, len) takes a 1-based start position and a length, unlike Java's substring(start, end), so the offsets differ from the Java and Scala versions above:

INSERT INTO TABLE formatted_data SELECT substr(raw,1,4), substr(raw,5,5), substr(raw,10,3), substr(raw,13,6), substr(raw,19,4), substr(raw,23,2), substr(raw,25,2) FROM tmp;

4) Verify the HDFS file behind the formatted_data table:

hadoop fs -cat /hive/warehouse/formatted_data/000000_0
0010|12489|FEB|018564|0000|44|00
0010|12489|MAR|018564|0000|44|00
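
As a sanity check, the same rows can also be read back through Hive itself; a minimal sketch, assuming the hive CLI is available:

    hive -e 'SELECT * FROM formatted_data LIMIT 2;'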