Extracting text from XML using Python
Python's standard library already includes an XML parser, xml.etree.ElementTree. For example:
>>> # cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
>>> from xml.etree import ElementTree as ET
>>> xmlstr = """
... <root>
... <page>
...   <title>Chapter 1</title>
...   <content>Welcome to Chapter 1</content>
... </page>
... <page>
...   <title>Chapter 2</title>
...   <content>Welcome to Chapter 2</content>
... </page>
... </root>
... """
>>> root = ET.fromstring(xmlstr)
>>> for page in list(root):
...     title = page.find('title').text
...     content = page.find('content').text
...     print('title: %s; content: %s' % (title, content))
...
title: Chapter 1; content: Welcome to Chapter 1
title: Chapter 2; content: Welcome to Chapter 2
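One caveat with the loop above: page.find('title') returns None when a page has no such child, so the .text access raises AttributeError. A minimal sketch of a more tolerant version, assuming an empty string is an acceptable placeholder for a missing field (the helper name extract_pages and the sample with a missing <content> are made up for illustration):

```python
from xml.etree import ElementTree as ET

# Sample input: the second page deliberately has no <content> element.
XML = """<root>
  <page><title>Chapter 1</title><content>Welcome to Chapter 1</content></page>
  <page><title>Chapter 2</title></page>
</root>"""


def extract_pages(xml_text):
    """Return a list of (title, content) tuples, one per <page> element."""
    root = ET.fromstring(xml_text)
    return [
        # findtext returns the default instead of raising when the child is absent
        (page.findtext("title", default=""), page.findtext("content", default=""))
        for page in root.iter("page")
    ]


pages = extract_pages(XML)
for title, content in pages:
    print("title: %s; content: %s" % (title, content))
```

findtext is part of the standard Element API, so this stays within the standard library.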
You can also extract the text with BeautifulSoup:
from bs4 import BeautifulSoup

data = """<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>"""

soup = BeautifulSoup(data, "html.parser")

########### Title #############
required0 = soup.find_all("title")
title = []
for i in required0:
    title.append(i.get_text())

########### Content #############
required0 = soup.find_all("content")
content = []
for i in required0:
    content.append(i.get_text())

# Pair each title with the content at the same position
doc1 = list(zip(title, content))
for i in doc1:
    print(i)
Output:
('Chapter 1', 'Welcome to Chapter 1')
('Chapter 2', 'Welcome to Chapter 2')
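Note that the data in this answer is a fragment with two top-level <page> elements and no single root, which a lenient HTML parser tolerates but a strict XML parser rejects. If you would rather stay in the standard library, one possible workaround is to wrap the fragment in a synthetic root before parsing. This is only a sketch: the wrapper tag name "root" is arbitrary, and it assumes the fragment carries no XML declaration of its own.

```python
from xml.etree import ElementTree as ET

# Same rootless fragment as in the BeautifulSoup example
fragment = """<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>"""

# Wrap the fragment so the parser sees exactly one root element
root = ET.fromstring("<root>" + fragment + "</root>")

# Reading title and content from the same <page> keeps each pair together
doc = [(page.findtext("title"), page.findtext("content")) for page in root.findall("page")]
for item in doc:
    print(item)
```

Pairing per page also avoids a pitfall of zipping two separate lists: if one page lacks a title or content, the zipped pairs silently shift out of alignment.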
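To read the XML from a file instead (here test.xml, assumed to contain the <root> document with its <page> children):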
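# cElementTree was removed in Python 3.9; ElementTree uses the C accelerator automatically
from xml.etree import ElementTree as ET

tree = ET.parse("test.xml")
root = tree.getroot()

# findall('page') matches only direct children of the root element
for page in root.findall('page'):
    print("Title:", page.find('title').text)
    print("Content:", page.find('content').text)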
Output:
Title: Chapter 1
Content: Welcome to Chapter 1
Title: Chapter 2
Content: Welcome to Chapter 2
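ET.parse loads the whole file into memory, which is fine for small documents. For a large file, a streaming variant with ET.iterparse might be preferable. The sketch below is one way to do it, not the only one: the helper name iter_pages is hypothetical, and the temporary test.xml is created only so the example runs on its own.

```python
import os
import tempfile
from xml.etree import ElementTree as ET

SAMPLE = """<root>
  <page><title>Chapter 1</title><content>Welcome to Chapter 1</content></page>
  <page><title>Chapter 2</title><content>Welcome to Chapter 2</content></page>
</root>"""


def iter_pages(path):
    """Yield (title, content) for each <page>, parsing the file incrementally."""
    for _event, elem in ET.iterparse(path, events=("end",)):
        # An "end" event fires once an element and all its children are complete
        if elem.tag == "page":
            yield elem.findtext("title"), elem.findtext("content")
            elem.clear()  # drop the processed page's children to save memory


# Write a throwaway test.xml so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    pages = list(iter_pages(path))

for title, content in pages:
    print("Title:", title)
    print("Content:", content)
```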