Trying to write a custom loader for Pig that processes records spanning multiple lines: how to make sure splits don't happen in the middle of records?

I'm trying to write a custom loader for Pig that processes records spanning multiple lines. How do I make sure splits don't happen in the middle of a record?


CSVExcelStorage assumes that the data contains no embedded newline characters, so it has no code to handle them.

You're right that the RecordReader is the culprit here. You'll need to write a new record reader class that understands your data, and therefore knows which newline characters are candidate split locations and which are simply part of the data. Once you have written the new record reader class, you'll also need a new InputFormat class that uses it, and your loader should return that InputFormat from getInputFormat().
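
To make the "which newlines are boundaries" part concrete, here is a minimal sketch of the decision logic such a record reader would need. It uses plain Java only, with no Hadoop or Pig classes. It assumes CSV-style data where a field containing a newline is wrapped in double quotes and an embedded quote is written as "". Your format may differ, and the class and method names (MultiLineRecords, readRecord, readAll) are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups physical lines into logical records. */
class MultiLineRecords {

    /**
     * Reads one logical record, or returns null at end of input.
     * Assumptions: '\n' line endings, fields quoted with '"',
     * and an embedded quote escaped by doubling it ("").
     */
    static String readRecord(Reader in) {
        StringBuilder record = new StringBuilder();
        boolean inQuotes = false;
        boolean sawAny = false;
        try {
            int c;
            while ((c = in.read()) != -1) {
                sawAny = true;
                if (c == '"') {
                    // A doubled quote toggles twice, leaving the state unchanged.
                    inQuotes = !inQuotes;
                }
                if (c == '\n' && !inQuotes) {
                    // Newline outside quotes: a real record boundary.
                    return record.toString();
                }
                // Anything else, including a quoted newline, is data.
                record.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Last record without a trailing newline, or end of input.
        return sawAny ? record.toString() : null;
    }

    /** Reads every logical record from the input. */
    static List<String> readAll(Reader in) {
        List<String> records = new ArrayList<>();
        String record;
        while ((record = readRecord(in)) != null) {
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "1,\"first line\nsecond line\",x\n2,plain,y\n";
        for (String r : readAll(new StringReader(data))) {
            // Show embedded newlines as \n so each record prints on one line.
            System.out.println("[" + r.replace("\n", "\\n") + "]");
        }
    }
}
```

In a real RecordReader this logic would sit behind nextKeyValue(), and it only works if the reader starts at a true record boundary. With quoted newlines you generally can't tell from an arbitrary byte offset whether you are inside a quoted field, so one simple (if less parallel) option is to have your InputFormat extend FileInputFormat and override isSplitable(JobContext, Path) to return false, so each file is read start to finish by a single reader.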