Advertisement
Advertisement


How do I parse XML in Python?


Question

I have many rows in a database that contains XML and I'm trying to write a Python script to count instances of a particular node attribute.

My tree looks like:

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

How can I access the attributes "1" and "2" in the XML using Python?

2019/12/12
1
1031
12/12/2019 11:17:18 PM

Accepted Answer

I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself; but, in this context, what they chiefly add is even more speed -- the ease of programming part depends on the API, which ElementTree defines.

First build an Element instance root from the XML, e.g. with the XML function, or by parsing a file with something like:

import xml.etree.ElementTree as ET
root = ET.parse('thefile.xml').getroot()

Or any of the many other ways shown at ElementTree. Then do something like:

for type_tag in root.findall('bar/type'):
    value = type_tag.get('foobar')
    print(value)

And similar, usually pretty simple, code patterns.

2019/12/13
807
12/13/2019 12:39:06 AM

minidom is the quickest and pretty straight forward.

XML:

<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>

Python:

from xml.dom import minidom
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
print(itemlist[0].attributes['name'].value)
for s in itemlist:
    print(s.attributes['name'].value)

Output:

4
item1
item1
item2
item3
item4
2019/12/12

You can use BeautifulSoup:

from bs4 import BeautifulSoup

x="""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

y=BeautifulSoup(x)
>>> y.foo.bar.type["foobar"]
u'1'

>>> y.foo.bar.findAll("type")
[<type foobar="1"></type>, <type foobar="2"></type>]

>>> y.foo.bar.findAll("type")[0]["foobar"]
u'1'
>>> y.foo.bar.findAll("type")[1]["foobar"]
u'2'
2019/12/12

There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines.

The relevant metrics can be found in the table below, copied from the cElementTree website:

library                         time    space
xml.dom.minidom (Python 2.1)    6.3 s   80000K
gnosis.objectify                2.0 s   22000k
xml.dom.minidom (Python 2.4)    1.4 s   53000k
ElementTree 1.2                 1.6 s   14500k  
ElementTree 1.2.4/1.3           1.1 s   14500k  
cDomlette (C extension)         0.540 s 20500k
PyRXPU (C extension)            0.175 s 10850k
libxml2 (C extension)           0.098 s 16000k
readlines (read as utf-8)       0.093 s 8850k
cElementTree (C extension)  --> 0.047 s 4900K <--
readlines (read as ascii)       0.032 s 5050k   

As pointed out by @jfs, cElementTree comes bundled with Python:

  • Python 2: from xml.etree import cElementTree as ElementTree.
  • Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically).
2018/01/31

I suggest xmltodict for simplicity.

It parses your XML to an OrderedDict;

>>> e = '<foo>
             <bar>
                 <type foobar="1"/>
                 <type foobar="2"/>
             </bar>
        </foo> '

>>> import xmltodict
>>> result = xmltodict.parse(e)
>>> result

OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])

>>> result['foo']

OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])

>>> result['foo']['bar']

OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])
2019/12/12

lxml.objectify is really simple.

Taking your sample text:

from lxml import objectify
from collections import defaultdict

count = defaultdict(int)

root = objectify.fromstring(text)

for item in root.bar.type:
    count[item.attrib.get("foobar")] += 1

print dict(count)

Output:

{'1': 1, '2': 1}
2013/06/07

Source: https://stackoverflow.com/questions/1912434
Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Email: [email protected]