Import XML files to PostgreSQL Import XML files to PostgreSQL xml xml

Import XML files to PostgreSQL


Necromancing:For those that need a working example:

DO $$   DECLARE myxml xml;BEGINmyxml := XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('MyData.xml'), 'UTF8'));DROP TABLE IF EXISTS mytable;CREATE TEMP TABLE mytable AS SELECT      (xpath('//ID/text()', x))[1]::text AS id    ,(xpath('//Name/text()', x))[1]::text AS Name     ,(xpath('//RFC/text()', x))[1]::text AS RFC    ,(xpath('//Text/text()', x))[1]::text AS Text    ,(xpath('//Desc/text()', x))[1]::text AS DescFROM unnest(xpath('//record', myxml)) x;END$$;SELECT * FROM mytable;

Or with less noise

SELECT      (xpath('//ID/text()', myTempTable.myXmlColumn))[1]::text AS id    ,(xpath('//Name/text()', myTempTable.myXmlColumn))[1]::text AS Name     ,(xpath('//RFC/text()', myTempTable.myXmlColumn))[1]::text AS RFC    ,(xpath('//Text/text()', myTempTable.myXmlColumn))[1]::text AS Text    ,(xpath('//Desc/text()', myTempTable.myXmlColumn))[1]::text AS Desc    ,myTempTable.myXmlColumn as myXmlElementFROM unnest(    xpath    (    '//record'        ,XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('MyData.xml'), 'UTF8'))    )) AS myTempTable(myXmlColumn);

With this example XML file (MyData.xml):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><data-set>    <record>        <ID>1</ID>        <Name>A</Name>        <RFC>RFC 1035[1]</RFC>        <Text>Address record</Text>        <Desc>Returns a 32-bit IPv4 address, most commonly used to map hostnames to an IP address of the host, but it is also used for DNSBLs, storing subnet masks in RFC 1101, etc.</Desc>    </record>    <record>        <ID>2</ID>        <Name>NS</Name>        <RFC>RFC 1035[1]</RFC>        <Text>Name server record</Text>        <Desc>Delegates a DNS zone to use the given authoritative name servers</Desc>    </record></data-set>

Note:
MyData.xml needs to be in the PG_Data directory (the parent-directory of the pg_stat directory).
e.g. /var/lib/postgresql/9.3/main/MyData.xml
This requires PostGreSQL 9.1+

Overall, you can achive it fileless, like this:

SELECT      (xpath('//ID/text()', myTempTable.myXmlColumn))[1]::text AS id    ,(xpath('//Name/text()', myTempTable.myXmlColumn))[1]::text AS Name     ,(xpath('//RFC/text()', myTempTable.myXmlColumn))[1]::text AS RFC    ,(xpath('//Text/text()', myTempTable.myXmlColumn))[1]::text AS Text    ,(xpath('//Desc/text()', myTempTable.myXmlColumn))[1]::text AS Desc    ,myTempTable.myXmlColumn as myXmlElement     -- Source: https://en.wikipedia.org/wiki/List_of_DNS_record_typesFROM unnest(xpath('//record',  CAST('<?xml version="1.0" encoding="UTF-8" standalone="yes"?><data-set>    <record>        <ID>1</ID>        <Name>A</Name>        <RFC>RFC 1035[1]</RFC>        <Text>Address record</Text>        <Desc>Returns a 32-bit IPv4 address, most commonly used to map hostnames to an IP address of the host, but it is also used for DNSBLs, storing subnet masks in RFC 1101, etc.</Desc>    </record>    <record>        <ID>2</ID>        <Name>NS</Name>        <RFC>RFC 1035[1]</RFC>        <Text>Name server record</Text>        <Desc>Delegates a DNS zone to use the given authoritative name servers</Desc>    </record></data-set>' AS xml)   )) AS myTempTable(myXmlColumn);

Note that unlike in MS-SQL, xpath text() returns NULL on a NULL value, and not an empty string.
If for whatever reason you need to explicitly check for the existence of NULL, you can use [not(@xsi:nil="true")], to which you need to pass an array of namespaces, because otherwise, you get an error (however, you can omit all namespaces but xsi).

SELECT      (xpath('//xmlEncodeTest[1]/text()', myTempTable.myXmlColumn))[1]::text AS c1    ,(    xpath('//xmlEncodeTest[1][not(@xsi:nil="true")]/text()', myTempTable.myXmlColumn    ,    ARRAY[        -- ARRAY['xmlns','http://www.w3.org/1999/xhtml'], -- defaultns        ARRAY['xsi','http://www.w3.org/2001/XMLSchema-instance'],        ARRAY['xsd','http://www.w3.org/2001/XMLSchema'],                ARRAY['svg','http://www.w3.org/2000/svg'],        ARRAY['xsl','http://www.w3.org/1999/XSL/Transform']    ]    )    )[1]::text AS c22    ,(xpath('//nixda[1]/text()', myTempTable.myXmlColumn))[1]::text AS c2     --,myTempTable.myXmlColumn as myXmlElement    ,xmlexists('//xmlEncodeTest[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1e    ,xmlexists('//nixda[1]' PASSING BY REF myTempTable.myXmlColumn) AS c2e    ,xmlexists('//xmlEncodeTestAbc[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1eaFROM unnest(xpath('//row',      CAST('<?xml version="1.0" encoding="utf-8"?>    <table xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">      <row>        <xmlEncodeTest xsi:nil="true" />        <nixda>noob</nixda>      </row>    </table>    ' AS xml)       )) AS myTempTable(myXmlColumn);

You can also check if a field is contained in an XML-text, by doing

 ,xmlexists('//xmlEncodeTest[1]' PASSING BY REF myTempTable.myXmlColumn) AS c1e

for example when you pass an XML-value to a stored-procedure/function for CRUD. (see above)

Also, note that the correct way to pass a null-value in XML is <elementName xsi:nil="true" /> and not <elementName /> or nothing. There is no correct way to pass NULL in attributes (you can only omit the attribute, but then it gets difficult/slow to infer the number of columns and their names in a large dataset).

e.g.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><table>    <row column1="a" column2="3" />    <row column1="b" column2="4" column3="true" /></table>

(is more compact, but very bad if you need to import it, especially if from XML-files with multiple GB of data - see a wonderful example of that in the stackoverflow data dump)

SELECT      myTempTable.myXmlColumn    ,(xpath('//@column1', myTempTable.myXmlColumn))[1]::text AS c1    ,(xpath('//@column2', myTempTable.myXmlColumn))[1]::text AS c2    ,(xpath('//@column3', myTempTable.myXmlColumn))[1]::text AS c3    ,xmlexists('//@column3' PASSING BY REF myTempTable.myXmlColumn) AS c3e    ,case when (xpath('//@column3', myTempTable.myXmlColumn))[1]::text is null then 1 else 0 end AS is_null FROM unnest(xpath('//row', '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><table>    <row column1="a" column2="3" />    <row column1="b" column2="4" column3="true" /></table>'))  AS myTempTable(myXmlColumn) 


I would try a different approach: read the XML file directly into variable inside a plpgsql function and proceed from there. Should be a lot faster and a lot more robust.

CREATE OR REPLACE FUNCTION f_sync_from_xml()  RETURNS boolean AS$BODY$DECLARE    myxml    xml;    datafile text := 'path/to/my_file.xml';BEGIN   myxml := pg_read_file(datafile, 0, 100000000);  -- arbitrary 100 MB max.   CREATE TEMP TABLE tmp AS   SELECT (xpath('//some_id/text()', x))[1]::text AS id   FROM   unnest(xpath('/xml/path/to/datum', myxml)) x;   ...

You need superuser privileges, and file must be local to the DB server, in an accessible directory.
Complete code example with more explanation and links:


Extending @stefan-steiger's excellent answer, here is an example that extracts XML elements from child nodes that contain multiple siblings (e.g., multiple <synonym> elements, for a particular <synomyms> parent node).

I encountered this issue with my data and searched quite a bit for a solution; his answer was the most helpful, to me.

Example data file, hmdb_metabolites_test.xml:

<?xml version="1.0" encoding="UTF-8"?><hmdb><metabolite>  <accession>HMDB0000001</accession>  <name>1-Methylhistidine</name>  <synonyms>    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>    <synonym>1-Methylhistidine</synonym>    <synonym>Pi-methylhistidine</synonym>    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate</synonym>  </synonyms></metabolite><metabolite>  <accession>HMDB0000002</accession>  <name>1,3-Diaminopropane</name>  <synonyms>    <synonym>1,3-Propanediamine</synonym>    <synonym>1,3-Propylenediamine</synonym>    <synonym>Propane-1,3-diamine</synonym>    <synonym>1,3-diamino-N-Propane</synonym>  </synonyms></metabolite><metabolite>  <accession>HMDB0000005</accession>  <name>2-Ketobutyric acid</name>  <synonyms>    <synonym>2-Ketobutanoic acid</synonym>    <synonym>2-Oxobutyric acid</synonym>    <synonym>3-Methyl pyruvic acid</synonym>    <synonym>alpha-Ketobutyrate</synonym>  </synonyms></metabolite></hmdb>

Aside: the original XML file had a URL in the Document Element

<hmdb xmlns="http://www.hmdb.ca">

that prevented xpath from parsing the data. It will run (without error messages), but the relation/table is empty:

[hmdb_test]# \i /mnt/Vancouver/Programming/data/hmdb/sql/hmdb_test.sqlDO accession | name | synonym -----------+------+---------

Since the source file is 3.4GB, I decided to edit that line using sed:

sed -i '2s/.*hmdb xmlns.*/<hmdb>/' hmdb_metabolites.xml

[Adding the 2 (instructs sed to edit "line 2") also -- coincidentally, in this instance -- doubling the sed command execution speed.]


My postgres data folder (PSQL: SHOW data_directory;) is

/mnt/Vancouver/Programming/RDB/postgres/postgres/data

so, as sudo, I needed to copy my XML data file there and chown it for use in PostgreSQL:

sudo chown postgres:postgres /mnt/Vancouver/Programming/RDB/postgres/postgres/data/hmdb_metabolites_test.xml

Script (hmdb_test.sql):

DO $$DECLARE myxml xml;BEGINmyxml := XMLPARSE(DOCUMENT convert_from(pg_read_binary_file('hmdb_metabolites_test.xml'), 'UTF8'));DROP TABLE IF EXISTS mytable;-- CREATE TEMP TABLE mytable AS CREATE TABLE mytable AS SELECT     (xpath('//accession/text()', x))[1]::text AS accession    ,(xpath('//name/text()', x))[1]::text AS name     -- The "synonym" child/subnode has many sibling elements, so we need to    -- "unnest" them,otherwise we only retrieve the first synonym per record:    ,unnest(xpath('//synonym/text()', x))::text AS synonymFROM unnest(xpath('//metabolite', myxml)) x;END$$;-- select * from mytable limit 5;SELECT * FROM mytable;

Execution, output (in PSQL):

[hmdb_test]# \i /mnt/Vancouver/Programming/data/hmdb/hmdb_test.sqlaccession  |        name        |                         synonym                          -------------+--------------------+----------------------------------------------------------HMDB0000001 | 1-Methylhistidine  | (2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acidHMDB0000001 | 1-Methylhistidine  | 1-MethylhistidineHMDB0000001 | 1-Methylhistidine  | Pi-methylhistidineHMDB0000001 | 1-Methylhistidine  | (2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoateHMDB0000002 | 1,3-Diaminopropane | 1,3-PropanediamineHMDB0000002 | 1,3-Diaminopropane | 1,3-PropylenediamineHMDB0000002 | 1,3-Diaminopropane | Propane-1,3-diamineHMDB0000002 | 1,3-Diaminopropane | 1,3-diamino-N-PropaneHMDB0000005 | 2-Ketobutyric acid | 2-Ketobutanoic acidHMDB0000005 | 2-Ketobutyric acid | 2-Oxobutyric acidHMDB0000005 | 2-Ketobutyric acid | 3-Methyl pyruvic acidHMDB0000005 | 2-Ketobutyric acid | alpha-Ketobutyrate[hmdb_test]#