Possible to use HCatalog with XML? -- Doing ETL on Cloudera VM


XML has a fairly standardized structure, so I would be interested in seeing your data format and which delimiter isn't working.

Without knowing more about the data and its structure, this is probably what I would do:

  1. Decide on my schema and create the HCatalog table manually (or with a script, whichever is easiest).
  2. Load the data via Pig, using the piggybank XMLLoader.
  3. Parse the data with regexes into the schema that I decided on for the HCatalog table.
  4. Store it using HCatStorer.



-- Example code

REGISTER piggybank.jar;

items = LOAD 'rss.txt'
    USING org.apache.pig.piggybank.storage.XMLLoader('item')
    AS (item:chararray);

data = FOREACH items GENERATE
    REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
    REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
    REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
    REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;

STORE data into 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();

validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
dump validate;



-- Results

(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)



-- rss.txt data file

<rss version="2.0">
   <channel>
      <title>News</title>
      <link>http://www.hannonhill.com</link>
      <description>Hannon Hill News</description>
      <language>en-us</language>
      <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
      <generator>Cascade Server</generator>
      <webMaster>webmaster@hannonhill.com</webMaster>
      <item>
         <title>News Item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>Description of news item 1 here.</description>
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item1.html</guid>
      </item>
      <item>
         <title>News Item 2</title>
         <link>http://www.hannonhill.com/news/item2.html</link>
         <description>Description of news item 2 here.</description>
         <pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item2.html</guid>
      </item>
      <item>
         <title>News Item 3</title>
         <link>http://www.hannonhill.com/news/item3.html</link>
         <description>Description of news item 3 here.</description>
         <pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item3.html</guid>
      </item>
   </channel>
</rss>
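
If you want to sanity-check the regexes before running the Pig job, a small Python sketch like the one below can help. It is not part of the Pig/HCatalog pipeline. It assumes that XMLLoader hands each <item> element to Pig as a single string with its line breaks intact, and that Python's re.search behaves like Pig's REGEX_EXTRACT for these simple patterns. The names ITEM, PATTERNS and extract_fields are made up for illustration.

```python
import re

# One <item> element copied from rss.txt. Assumption: XMLLoader('item')
# passes the element to Pig as one string, line breaks included.
ITEM = """<item>
         <title>News Item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>Description of news item 1 here.</description>
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
         <guid>http://www.hannonhill.com/news/item1.html</guid>
      </item>"""

# Same patterns as the Pig script. Pig's '\\d' inside a quoted string
# corresponds to '\d' in a Python raw string.
PATTERNS = {
    "link": r"<link>(.*)</link>",
    "title": r"<title>(.*)</title>",
    "description": r"<description>(.*)</description>",
    "pubdate": r"<pubDate>.*(\d{2}\s[a-zA-Z]{3}\s\d{4}\s\d{2}:\d{2}:\d{2}).*</pubDate>",
}


def extract_fields(item):
    """Return capture group 1 of each pattern, or None when a pattern does not match."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, item)
        fields[name] = match.group(1) if match else None
    return fields


if __name__ == "__main__":
    for name, value in extract_fields(ITEM).items():
        print(name, "=", value)
```

For the first item, the pubdate pattern keeps only "03 Jun 2003 09:39:21", dropping the weekday and the GMT suffix, which lines up with the first row of the results. A field whose tag is missing comes back as None here, so it is worth checking how your real data behaves before relying on every column being populated.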