On 8/14/08, Marcus Furlong <furlongm at hotmail.com> wrote:
> Hi,
>
> I'm having trouble using sed to do replacements on some badly tagged
> xml. I have a large number of files that are tagged as follows:

Personally I'd use something like the Python BeautifulSoup module to
do all the work. It's really forgiving when it comes to dealing with
poorly formed XML or HTML and makes it easy to pull information out of
pretty much any HTML/XML document.

Docs are here: http://www.crummy.com/software/BeautifulSoup/documentation.html

>>> import BeautifulSoup
>>> xml = """
... <first id="34">
... blah blah
... <second id="56" name="xyz1">hello hello</second>
... <second name="xyz4">hello hello</second>
... <second id="16" name="xyz5">hello hello</second>
... <first id="3">
... blah blah blah
... <second>hello hello</second>
... <second id="12" name="xyz5">hello hello</second>
... """
>>> soup = BeautifulSoup.BeautifulStoneSoup(xml)
>>> print soup.prettify()
<first id="34">
blah blah
<second id="56" name="xyz1">
hello hello
</second>
<second name="xyz4">
hello hello
</second>
<second id="16" name="xyz5">
hello hello
</second>
</first>
<first id="3">
blah blah blah
<second>
hello hello
</second>
<second id="12" name="xyz5">
hello hello
</second>
</first>
>>> soup.findAll('second')
[<second id="56" name="xyz1">hello hello</second>, <second
name="xyz4">hello hello</second>, <second id="16" name="xyz5">hello
hello</second>, <second>hello hello</second>, <second id="12"
name="xyz5">hello hello</second>]
>>> for item in soup.findAll('second'):
...     print item
...
<second id="56" name="xyz1">hello hello</second>
<second name="xyz4">hello hello</second>
<second id="16" name="xyz5">hello hello</second>
<second>hello hello</second>
<second id="12" name="xyz5">hello hello</second>
- Niall