Java - empty orc file


Looks like an in-depth review of the API doc was what I needed. What I was missing:

  • Call initBuffer() on each BytesColumnVector in the initialization phase
  • Assign the value of each column by calling setVal(). The same can be done with setRef(), which is documented as the faster of the two, but I don't know whether it fits my specific case. I will try it.
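To make the setVal()/setRef() trade-off concrete, here is a minimal sketch of copy versus reference semantics. BytesColumnModel is a hypothetical stand-in written only for illustration; it is not the ORC BytesColumnVector class, and it keeps one array per row instead of a shared internal buffer. The assumption it illustrates is that setVal() copies the bytes while setRef() only keeps a pointer to the caller's array, so a referenced array must not be modified before the batch is written.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Simplified stand-in for a bytes column (hypothetical, NOT the ORC class). */
final class BytesColumnModel {
    private final byte[][] vector;
    private final int[] start;
    private final int[] length;

    BytesColumnModel(int rows) {
        vector = new byte[rows][];
        start = new int[rows];
        length = new int[rows];
    }

    /** Copy semantics: the column keeps its own copy of the bytes. */
    void setVal(int row, byte[] src) {
        vector[row] = Arrays.copyOf(src, src.length);
        start[row] = 0;
        length[row] = src.length;
    }

    /** Reference semantics: the column only remembers the caller's array. */
    void setRef(int row, byte[] src, int off, int len) {
        vector[row] = src;
        start[row] = off;
        length[row] = len;
    }

    String get(int row) {
        return new String(vector[row], start[row], length[row], StandardCharsets.UTF_8);
    }
}

final class SetValVsSetRefDemo {
    public static void main(String[] args) {
        byte[] scratch = "first".getBytes(StandardCharsets.UTF_8);
        BytesColumnModel col = new BytesColumnModel(2);
        col.setVal(0, scratch);                    // row 0 copies the bytes
        col.setRef(1, scratch, 0, scratch.length); // row 1 points at scratch

        // Simulate reusing the same array for the next record.
        byte[] next = "OTHER".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(next, 0, scratch, 0, next.length);

        System.out.println(col.get(0)); // prints "first"
        System.out.println(col.get(1)); // prints "OTHER"
    }
}
```

If each value comes from a fresh array that is never touched again (as String.getBytes() returns), referencing it is probably safe; if a buffer is reused between records, the copying variant is the safer choice.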

This is the updated code:

    //  File schema
    String outputFormat = "struct<";
    for (int i = 0; i < outputSchema.length; i++) {
        outputFormat += outputSchema[i] + ":string,";
    }
    outputFormat += "lastRecordHash:string,currentHash:string>";
    TypeDescription orcSchema = TypeDescription.fromString(outputFormat);

    //  Initializes buffers
    VectorizedRowBatch batch = orcSchema.createRowBatch();
    ArrayList<BytesColumnVector> orcBuffers = new ArrayList<>(numFields + 2);
    for (int i = 0; i < numFields + 2; i++) {
        BytesColumnVector bcv = (BytesColumnVector) batch.cols[i];
        bcv.initBuffer();
        orcBuffers.add(i, bcv);
    }

    ...

    //  Initializes writer
    Writer writer = null;
    try {
        writer = OrcFile.createWriter(new Path(hdfsUri + outputPath), OrcFile.writerOptions(conf).setSchema(orcSchema));
        partitionCounter++;
    }
    catch (IOException e) {
        log.error("Cannot open hdfs file. Reason: " + e.getMessage());
        session.transfer(flowfile, hdfsFailure);
        return;
    }

    //  Writes content
    String[] records = ...
    for (int i = 0; i < records.length; i++) {
        fields = records[i].split(fieldSeparator);
        int row = batch.size++;
        //  Filling the orc buffers
        for (int j = 0; j < numFields; j++) {
            orcBuffers.get(j).setVal(row, fields[j].getBytes());
            hashDigest.append(fields[j]);
        }
        if (batch.size == batch.getMaxSize()) {
            try {
                writer.addRowBatch(batch);
                batch.reset();
            }
            catch (IOException e) {
                log.error("Cannot write to hdfs. Reason: " + e.getMessage());
                return;
            }
        }
    }
    if (batch.size != 0) {
        try {
            writer.addRowBatch(batch);
            batch.reset();
        }
        catch (IOException e) {
            log.error("Cannot write to hdfs. Reason: " + e.getMessage());
            return;
        }
    }
    writer.close();
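For reference, the schema string that the first loop hands to TypeDescription.fromString() can be checked on its own, without any ORC dependency. The sketch below only rebuilds that string; the class name OrcSchemaString and the field names "id" and "name" are made up for the example, while the two trailing hash columns are the ones from the code above.

```java
/** Rebuilds the "struct<...>" schema string used above (illustrative helper). */
final class OrcSchemaString {
    static String build(String[] fieldNames) {
        StringBuilder sb = new StringBuilder("struct<");
        for (String name : fieldNames) {
            // Every input field is declared as a string column.
            sb.append(name).append(":string,");
        }
        // The two extra columns appended after the input fields.
        sb.append("lastRecordHash:string,currentHash:string>");
        return sb.toString();
    }

    public static void main(String[] args) {
        // "id" and "name" are hypothetical field names.
        System.out.println(build(new String[] {"id", "name"}));
        // prints: struct<id:string,name:string,lastRecordHash:string,currentHash:string>
    }
}
```

The resulting struct has numFields + 2 columns, which is why the buffer initialization loop runs up to numFields + 2.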