Java - empty orc file
Looks like an in-depth review of the API doc was what I needed. What I was missing:
- Call initBuffer() on each BytesColumnVector in the initialization phase.
- Assign the column values by calling setVal(). This can also be accomplished using setRef(). It is documented to be the faster of the two, but I don't know if it fits my specific case; I will try it.
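To make the setVal()/setRef() trade-off concrete: setVal() copies the bytes into the vector's own buffer, while setRef() only stores a reference to the caller's array, which is why it is faster but unsafe if that array is reused before the batch is written. The sketch below illustrates only that difference. `TinyBytesColumn` is a hypothetical stand-in written for this example, not the real BytesColumnVector, and its method signatures are simplified.

```java
// Hypothetical stand-in for illustration only; this is NOT the real BytesColumnVector.
final class TinyBytesColumn {
    private static final java.nio.charset.Charset UTF8 =
            java.nio.charset.StandardCharsets.UTF_8;

    private final byte[][] values;

    TinyBytesColumn(int rows) {
        values = new byte[rows][];
    }

    // Copy semantics, in the spirit of setVal(): the column keeps its own copy.
    void setVal(int row, byte[] source) {
        values[row] = source.clone();
    }

    // Reference semantics, in the spirit of setRef(): no copy, the caller's array is shared.
    void setRef(int row, byte[] source) {
        values[row] = source;
    }

    String get(int row) {
        return new String(values[row], UTF8);
    }

    public static void main(String[] args) {
        byte[] reused = "abc".getBytes(UTF8);
        TinyBytesColumn col = new TinyBytesColumn(2);
        col.setVal(0, reused);
        col.setRef(1, reused);
        reused[0] = 'x'; // simulate reusing the input buffer for the next record
        System.out.println(col.get(0)); // prints abc (the copy is unaffected)
        System.out.println(col.get(1)); // prints xbc (the reference sees the change)
    }
}
```

If every value comes from a fresh array that is never modified afterwards, sharing the reference should be safe; if input buffers are recycled between records, the copying call is the safer choice.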
This is the updated code:
// File schema
String outputFormat = "struct<";
for(int i=0;i<outputSchema.length;i++){
    outputFormat+=outputSchema[i]+":string,";
}
outputFormat+="lastRecordHash:string,currentHash:string>";
TypeDescription orcSchema = TypeDescription.fromString(outputFormat);

// Initializes buffers
VectorizedRowBatch batch = orcSchema.createRowBatch();
ArrayList<BytesColumnVector> orcBuffers = new ArrayList<>(numFields+2);
for(int i=0;i<numFields+2;i++){
    BytesColumnVector bcv = (BytesColumnVector) batch.cols[i];
    bcv.initBuffer();
    orcBuffers.add(i, bcv);
}
...
// Initializes writer
Writer writer=null;
try{
    writer = OrcFile.createWriter(new Path(hdfsUri+outputPath), OrcFile.writerOptions(conf).setSchema(orcSchema));
    partitionCounter++;
}
catch(IOException e){
    log.error("Cannot open hdfs file. Reason: "+e.getMessage());
    session.transfer(flowfile, hdfsFailure);
    return;
}

// Writes content
String[] records = ...
for(int i=0;i<records.length;i++){
    fields = records[i].split(fieldSeparator);
    int row=batch.size++;
    // Filling the orc buffers
    for(int j=0;j<numFields;j++){
        orcBuffers.get(j).setVal(row, fields[j].getBytes());
        hashDigest.append(fields[j]);
    }
    if (batch.size == batch.getMaxSize()) {
        try{
            writer.addRowBatch(batch);
            batch.reset();
        }
        catch(IOException e){
            log.error("Cannot write to hdfs. Reason: "+e.getMessage());
            return;
        }
    }
}
if (batch.size != 0) {
    try{
        writer.addRowBatch(batch);
        batch.reset();
    }
    catch(IOException e){
        log.error("Cannot write to hdfs. Reason: "+e.getMessage());
        return;
    }
}
writer.close();
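As a side note, the schema string at the top of the code can be checked on its own before it is handed to TypeDescription.fromString(). The helper below is a minimal sketch of the same concatenation using a StringBuilder. The class name `OrcSchemaString` and the sample field names are made up for the example; the two trailing hash columns are taken from the code above.

```java
// Hypothetical helper for illustration; it mirrors the schema-building loop above.
final class OrcSchemaString {
    // Builds "struct<f1:string,...,lastRecordHash:string,currentHash:string>".
    static String build(String[] fieldNames) {
        StringBuilder sb = new StringBuilder("struct<");
        for (String name : fieldNames) {
            sb.append(name).append(":string,");
        }
        return sb.append("lastRecordHash:string,currentHash:string>").toString();
    }

    public static void main(String[] args) {
        // Sample field names are assumptions, not taken from the original post.
        System.out.println(build(new String[] {"id", "name"}));
        // prints struct<id:string,name:string,lastRecordHash:string,currentHash:string>
    }
}
```

Printing the string once makes it easy to confirm that the number of columns matches numFields+2, which is what the buffer initialization loop relies on.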