Possible to use HCatalog with XML? -- Doing ETL on Cloudera VM
XML uses a fairly standardized structure, so I would be interested in seeing your data format and what delimiter isn't working.
Without knowing more about the data and its structure, this is probably what I would do:
- Decide on a schema and create the HCatalog table manually (or with a script, whichever is easiest).
- Load the data via Pig, using the piggybank XMLLoader.
- Parse the data with regular expressions into the schema chosen for the HCatalog table.
- Store it using HCatStorer.
-- Example code
REGISTER piggybank.jar;

-- XMLLoader emits each <item>...</item> element as one chararray record.
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);

-- Pull each field out of the raw item text with a regex.
data = FOREACH items GENERATE
    REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
    REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
    REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
    REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;

-- The rss_items table must already exist in HCatalog (first step above).
STORE data INTO 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();

-- Read the table back to confirm the rows landed.
validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
DUMP validate;
-- Results
(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)
-- rss.txt data file
<rss version="2.0">
  <channel>
    <title>News</title>
    <link>http://www.hannonhill.com</link>
    <description>Hannon Hill News</description>
    <language>en-us</language>
    <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
    <generator>Cascade Server</generator>
    <webMaster>webmaster@hannonhill.com</webMaster>
    <item>
      <title>News Item 1</title>
      <link>http://www.hannonhill.com/news/item1.html</link>
      <description>Description of news item 1 here.</description>
      <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
      <guid>http://www.hannonhill.com/news/item1.html</guid>
    </item>
    <item>
      <title>News Item 2</title>
      <link>http://www.hannonhill.com/news/item2.html</link>
      <description>Description of news item 2 here.</description>
      <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
      <guid>http://www.hannonhill.com/news/item2.html</guid>
    </item>
    <item>
      <title>News Item 3</title>
      <link>http://www.hannonhill.com/news/item3.html</link>
      <description>Description of news item 3 here.</description>
      <pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
      <guid>http://www.hannonhill.com/news/item3.html</guid>
    </item>
  </channel>
</rss>
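
-- Checking the regexes outside Pig
If you want to sanity-check the patterns before running the Pig job, something like the following Python sketch may help. It applies the same four expressions from the Pig script to one item from rss.txt. The names ITEM, PATTERNS and parse_item are made up for this illustration, and re.search is only a stand-in for REGEX_EXTRACT: it assumes find-style matching anywhere in the string, which is what the results above suggest.

```python
import re

# One <item> element from rss.txt, as a single string (the shape XMLLoader
# hands to Pig in the script above).
ITEM = (
    "<item> <title>News Item 1</title> "
    "<link>http://www.hannonhill.com/news/item1.html</link> "
    "<description>Description of news item 1 here.</description> "
    "<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate> "
    "<guid>http://www.hannonhill.com/news/item1.html</guid> </item>"
)

# Same patterns as the Pig script. Pig string literals need '\\d';
# Python raw strings need only '\d'.
PATTERNS = {
    "link": r"<link>(.*)</link>",
    "title": r"<title>(.*)</title>",
    "description": r"<description>(.*)</description>",
    "pubdate": r"<pubDate>.*(\d{2}\s[a-zA-Z]{3}\s\d{4}\s\d{2}:\d{2}:\d{2}).*</pubDate>",
}


def parse_item(item):
    """Return one row as a dict; a field is None when its pattern does not match."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, item)
        row[name] = match.group(1) if match else None
    return row


if __name__ == "__main__":
    for field, value in parse_item(ITEM).items():
        print(field, "=", value)
```

Note that the pubdate pattern keeps only the "dd Mon yyyy hh:mm:ss" part, so the weekday and the GMT suffix are dropped, which matches the dumped results.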