How to extract data from Hadoop sequence file?


I finally solved this strange problem, and I have to share it. First, I will show you the wrong way to get the bytes out of a sequence file.

    // Reader is org.apache.hadoop.io.SequenceFile.Reader
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path(inPath);
    Reader reader = new SequenceFile.Reader(conf, Reader.file(input));
    Text key = new Text();
    BytesWritable val = new BytesWritable();
    while (reader.next(key, val)) {
        String fileName = key.toString();
        byte[] data = val.getBytes(); // don't think you have got the data!
    }
    reader.close();

The reason is that the array returned by getBytes() is not exactly the size of your original data. I put the data in using:

    // IOUtils here is presumably Apache Commons IO's org.apache.commons.io.IOUtils
    FSDataInputStream in = fs.open(input);
    byte[] buffer = IOUtils.toByteArray(in);
    in.close();
    Writer writer = SequenceFile.createWriter(conf,
            Writer.file(output),
            Writer.keyClass(Text.class),
            Writer.valueClass(BytesWritable.class));
    writer.append(new Text(inPath), new BytesWritable(buffer));
    writer.close();

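I checked the size of the output sequence file: it is the original size plus the header, so the data was written correctly. I am not sure exactly why getBytes() gives me more bytes than the original; most likely it returns BytesWritable's internal buffer, which can be larger than the data it holds. But let's see how to get the data correctly.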

Option #1: copy only as many bytes as you need.

    byte[] rawdata = val.getBytes();
    int length = val.getLength(); // exact size of the original data
    byte[] data = Arrays.copyOfRange(rawdata, 0, length); // this is correct

Option #2: let BytesWritable make the trimmed copy for you.

byte[] data = val.copyBytes();

This is sweeter. :) Finally got it right.
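For anyone wondering where the extra bytes come from, here is a minimal plain-Java sketch of the idea. SimpleBytes is a hypothetical stand-in I wrote for illustration, not Hadoop's BytesWritable, and the 1.5x growth factor is an assumption about how such a buffer might over-allocate. It only models the three calls discussed above: getBytes() hands back the whole backing array, getLength() reports how many bytes are valid, and copyBytes() returns a trimmed copy.

```java
import java.util.Arrays;

public class BackingBufferDemo {

    /** Hypothetical, simplified stand-in for BytesWritable (not Hadoop's code). */
    static class SimpleBytes {
        private byte[] buffer = new byte[0]; // backing array, may be larger than the data
        private int size = 0;                // number of valid bytes in the backing array

        /** Stores data, over-allocating the backing array when it has to grow. */
        void set(byte[] data) {
            if (data.length > buffer.length) {
                // Assumed growth policy: allocate 1.5x the requested size.
                buffer = new byte[data.length * 3 / 2];
            }
            System.arraycopy(data, 0, buffer, 0, data.length);
            size = data.length;
        }

        /** Returns the whole backing array, padding included. */
        byte[] getBytes() { return buffer; }

        /** Returns the number of valid bytes. */
        int getLength() { return size; }

        /** Returns a copy that is exactly getLength() bytes long. */
        byte[] copyBytes() { return Arrays.copyOf(buffer, size); }
    }

    public static void main(String[] args) {
        byte[] original = new byte[100];
        Arrays.fill(original, (byte) 7);

        SimpleBytes val = new SimpleBytes();
        val.set(original);

        System.out.println(val.getBytes().length);                    // 150 in this model
        System.out.println(val.getLength());                          // 100
        System.out.println(Arrays.equals(original, val.copyBytes())); // true
    }
}
```

In this model the array from getBytes() is 150 bytes long even though only 100 were stored, which is the same symptom I saw with the real class. Trimming to getLength(), either by hand or via copyBytes(), gives back the original 100 bytes.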