Parse SGML with Open Arbitrary Tags in Python 3


If you can find an SGML DTD for the documents that you work with, a solution could be to use the osx SGML to XML converter from the OpenSP SGML toolkit to turn the documents into XML.

Here is a simple example. Let's say that we have the following SGML document (company.sgml; with a root element):

<!DOCTYPE ROOT SYSTEM "company.dtd">
<ROOT>
<COMPANY>Awesome Corp
<FORM> 24-7
<ADDRESS>
<STREET>101 PARSNIP LN
<ZIP>31337
</ADDRESS>

The DTD (company.dtd) looks like this:

<!ELEMENT ROOT       -  o (COMPANY, FORM, ADDRESS) >
<!ELEMENT COMPANY    -  o (#PCDATA) >
<!ELEMENT FORM       -  o (#PCDATA) >
<!ELEMENT ADDRESS    -  - (STREET, ZIP) >
<!ELEMENT STREET     -  o (#PCDATA) >
<!ELEMENT ZIP        -  o (#PCDATA) >

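The two characters after each element name control tag omission: the first applies to the start tag and the second to the end tag, where - means the tag is required and o (the letter O) means it can be omitted. So the - o bit means that the end tag can be omitted, while ADDRESS, which is declared with - -, needs both its start tag and its end tag.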

The SGML document can be parsed with osx, and the output can be formatted with xmllint, as follows:

osx company.sgml | xmllint --format -
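Since the question is about Python 3, the conversion step can also be driven from a script. The following is only a minimal sketch: the helper names find_converter and sgml_to_xml are made up for illustration, and the alternative executable name osgml2xml is an assumption (some distributions reportedly install osx under that name), so adjust the candidates to whatever your system provides.

```python
import shutil
import subprocess


def find_converter(candidates=("osx", "osgml2xml")):
    """Return the path of the first candidate executable found on PATH, or None.

    The candidate names are assumptions; check what your OpenSP package installs.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def sgml_to_xml(sgml_path, converter=None):
    """Run the SGML-to-XML converter on sgml_path and return its stdout as bytes."""
    converter = converter or find_converter()
    if converter is None:
        raise FileNotFoundError("no OpenSP SGML-to-XML converter found on PATH")
    # The exit status is deliberately not enforced here; add check=True
    # if you want any error reported by the converter to raise an exception.
    result = subprocess.run([converter, sgml_path], stdout=subprocess.PIPE)
    return result.stdout
```

Calling sgml_to_xml("company.sgml") would then return the same XML that the shell command produces before xmllint formats it. Diagnostics from the converter still go to stderr.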

Output from the above command:

<?xml version="1.0"?>
<ROOT>
  <COMPANY>Awesome Corp</COMPANY>
  <FORM> 24-7</FORM>
  <ADDRESS>
    <STREET>101 PARSNIP LN</STREET>
    <ZIP>31337</ZIP>
  </ADDRESS>
</ROOT>

Now we have well-formed XML that can be processed with lxml or other XML tools.
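As a rough illustration of that last step, here is a sketch that reads the converted output with the standard library's xml.etree.ElementTree (lxml.etree offers the same fromstring and findtext calls). The function name company_record and the dictionary keys are hypothetical, and the XML is hard-coded here so the example runs without osx installed.

```python
import xml.etree.ElementTree as ET

# The XML produced by the osx | xmllint pipeline for company.sgml.
XML_OUTPUT = """<?xml version="1.0"?>
<ROOT>
  <COMPANY>Awesome Corp</COMPANY>
  <FORM> 24-7</FORM>
  <ADDRESS>
    <STREET>101 PARSNIP LN</STREET>
    <ZIP>31337</ZIP>
  </ADDRESS>
</ROOT>
"""


def company_record(xml_text):
    """Extract the fields of the example document into a plain dict."""
    root = ET.fromstring(xml_text)
    return {
        # findtext returns the element's text, or the default if it is missing.
        "company": root.findtext("COMPANY", default="").strip(),
        "form": root.findtext("FORM", default="").strip(),
        "street": root.findtext("ADDRESS/STREET", default="").strip(),
        "zip": root.findtext("ADDRESS/ZIP", default="").strip(),
    }


print(company_record(XML_OUTPUT))
```

The strip() calls remove the stray whitespace that the SGML source leaves around some values, such as the leading space in FORM.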

I don't know if there is a complete DTD for the document that you link to. The following PDF file contains related information about EDGAR, including a DTD that might be useful: http://www.sec.gov/info/edgar/pdsdissemspec910.pdf (I found it via this answer). But the linked SGML document contains elements (SEC-HEADER, for example) that are not mentioned in the PDF file.