How do I parse XML in Python?
Question
I have many rows in a database that contain XML, and I'm trying to write a Python script to count instances of a particular node attribute.
My tree looks like:
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
How can I access the attribute values "1" and "2" in the XML using Python?
Accepted Answer
I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself; but, in this context, what they chiefly add is even more speed -- the ease-of-programming part depends on the API, which ElementTree defines.
First build an Element instance root from the XML, e.g. with the XML function, or by parsing a file with something like:
import xml.etree.ElementTree as ET
root = ET.parse('thefile.xml').getroot()
Or any of the many other ways shown at ElementTree. Then do something like:
for type_tag in root.findall('bar/type'):
    value = type_tag.get('foobar')
    print(value)
And similar, usually pretty simple, code patterns.
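Since the original goal is to count instances of a particular attribute, here is a minimal sketch along the same lines (the collections.Counter tally and the 'thefile.xml' name are my assumptions, not part of the answer above):

import xml.etree.ElementTree as ET
from collections import Counter

root = ET.parse('thefile.xml').getroot()
# Tally how often each foobar value appears on bar/type elements
counts = Counter(type_tag.get('foobar') for type_tag in root.findall('bar/type'))
print(counts)  # for the sample above: Counter({'1': 1, '2': 1})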
minidom is the quickest and pretty straightforward.
XML:
<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>
Python:
from xml.dom import minidom

xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
print(itemlist[0].attributes['name'].value)
for s in itemlist:
    print(s.attributes['name'].value)
Output:
4
item1
item1
item2
item3
item4
You can use BeautifulSoup:
from bs4 import BeautifulSoup

x = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

y = BeautifulSoup(x, "html.parser")
>>> y.foo.bar.type["foobar"]
'1'
>>> y.foo.bar.findAll("type")
[<type foobar="1"></type>, <type foobar="2"></type>]
>>> y.foo.bar.findAll("type")[0]["foobar"]
'1'
>>> y.foo.bar.findAll("type")[1]["foobar"]
'2'
There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines.
The relevant metrics can be found in the table below, copied from the cElementTree website:
library                          time      space
xml.dom.minidom (Python 2.1)     6.3 s     80000 K
gnosis.objectify                 2.0 s     22000 K
xml.dom.minidom (Python 2.4)     1.4 s     53000 K
ElementTree 1.2                  1.6 s     14500 K
ElementTree 1.2.4/1.3            1.1 s     14500 K
cDomlette (C extension)          0.540 s   20500 K
PyRXPU (C extension)             0.175 s   10850 K
libxml2 (C extension)            0.098 s   16000 K
readlines (read as utf-8)        0.093 s    8850 K
cElementTree (C extension)   --> 0.047 s    4900 K <--
readlines (read as ascii)        0.032 s    5050 K
As pointed out by @jfs, cElementTree comes bundled with Python:
- Python 2: from xml.etree import cElementTree as ElementTree
- Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically).
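If memory is the real constraint, ElementTree's iterparse streams the document instead of loading it whole. This is only a sketch of that pattern (the 'thefile.xml' name and the foobar tally are assumptions carried over from the question):

import xml.etree.ElementTree as ET
from collections import Counter

counts = Counter()
# iterparse yields each element once its end tag has been read,
# so the whole document never has to sit in memory at once
for event, elem in ET.iterparse('thefile.xml', events=('end',)):
    if elem.tag == 'type':
        counts[elem.get('foobar')] += 1
    elem.clear()  # release the element we no longer need

print(counts)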
I suggest xmltodict for simplicity. It parses your XML into an OrderedDict:
>>> e = """<foo>
...    <bar>
...       <type foobar="1"/>
...       <type foobar="2"/>
...    </bar>
... </foo>"""
>>> import xmltodict
>>> result = xmltodict.parse(e)
>>> result
OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])
>>> result['foo']
OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])
>>> result['foo']['bar']
OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])
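To get from that nested structure back to the count in the question, a small follow-up sketch (my addition; it assumes <type> repeats, so xmltodict returns it as a list) might be:

from collections import Counter

types = result['foo']['bar']['type']   # a list here, because <type> occurs twice
counts = Counter(t['@foobar'] for t in types)
print(dict(counts))  # -> {'1': 1, '2': 1}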
lxml.objectify is really simple.
Taking your sample text:
from lxml import objectify
from collections import defaultdict

text = """<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>"""

count = defaultdict(int)
root = objectify.fromstring(text)
for item in root.bar.type:
    count[item.attrib.get("foobar")] += 1

print(dict(count))
Output:
{'1': 1, '2': 1}