XML::LibXML and XML Namespaces (processing OpenOffice documents)

tomhukins has asked for the wisdom of the Perl Monks concerning the following question:

I want to use XML::LibXML to process OpenOffice (sxw) files. OpenOffice stores its information in a Zip archive containing several XML files. I have extracted the content.xml file from such an archive.

I have used XML::LibXML before, but never with XML documents that use namespaces. OpenOffice files use various namespaces for the different types of content.

Normally, I would write something like:

use XML::LibXML;

my $parser = XML::LibXML->new() or die $!;
my $tree = $parser->parse_file('content.xml') or die $!;
my @nodes = $tree->findnodes('//p');

to retrieve all the nodes of type p. OpenOffice stores paragraphs in p elements within the text namespace. So, I replaced the findnodes line above with:

my @nodes = $tree->findnodes('//text:p');

but this returns the error "XPath error Undefined namespace prefix in //text:p xmlXPathCompiledEval: evaluation failed".

I fixed this problem with:

my @nodes = $tree->documentElement->findnodes('//text:p');

but I don't understand why one way works and the other way doesn't. Both contexts (tree and documentElement) work with another XML document that does not use namespaces.

Can anyone here enlighten me?


Re: XML::LibXML and XML Namespaces (processing OpenOffice documents) by dakkar (Hermit) on Mar 11, 2003 at 15:06 UTC
And now a small lesson on XML namespaces

(no, seriously)

In namespace-aware XML documents, an element name is a qualified name (qname), composed of a prefix and a local name separated by a colon (`:`). The prefix is bound to a URI via a namespace declaration. Example:

<first>
  <ns:second>
    <my:third xmlns:my="someURI">
      <fourth xmlns="otherURI">
        <fifth/>
      </fourth>
    </my:third>
  </ns:second>
</first>

Let's read that. The element named `first` has a local name of `first` and belongs to no namespace. The element named `ns:second` is wrong, since namespace-aware parsers require the prefix to be declared, and `ns` is not. `my:third` belongs to the `someURI` namespace, which is locally bound to the `my` prefix. `fourth` belongs to the (locally default) namespace `otherURI`, as does `fifth`. Hope this is clear enough...

Your problem

XPath has some problems with namespaces, namely that an XPath expression is interpreted in the context of an element (which in your program is the invocant of `findnodes`). So the prefixes are resolved using the namespace declarations visible from that node. This forces you to know the prefixes used in the document instead of the URIs, which is a problem because prefixes are not unique, while URIs are.

Anyway, your problem is much easier: `$tree` is an `XML::LibXML::Document`, which has no knowledge of namespaces, since they are declared (at the earliest) on the document element. This is why the second form works.

BTW, in the previous example (disregarding the `ns:second` element), if you did:

$docElem->findnodes('//my:third');

it wouldn't work, since the `my` prefix is not declared on the document element...

-- dakkar - Mobilis in mobile
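To make the reading of that example concrete, here is a minimal sketch that parses it and prints how each qualified name breaks down. It leaves out the undeclared `ns:second` element, and it swaps `someURI` and `otherURI` for two made-up `urn:` URIs (an assumption of this sketch, chosen only so the parser has absolute URIs to work with). It relies on `load_xml`, which needs a reasonably recent XML::LibXML.

```perl
use strict;
use warnings;
use XML::LibXML;

# dakkar's example without ns:second; the urn: URIs are hypothetical stand-ins
# for "someURI" and "otherURI".
my $xml = <<'XML';
<first>
  <my:third xmlns:my="urn:x-example:some">
    <fourth xmlns="urn:x-example:other">
      <fifth/>
    </fourth>
  </my:third>
</first>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Walk every element and record which namespace URI its name resolves to.
my %ns_of;
for my $el ($doc->findnodes('//*')) {
    my $uri = $el->namespaceURI;    # undef when the element is in no namespace
    $ns_of{ $el->localname } = $uri;
    printf "%-10s local=%-6s ns=%s\n",
        $el->nodeName, $el->localname, defined $uri ? $uri : '(none)';
}
```

Note that `fifth` carries no prefix and no declaration of its own, yet it reports the same namespace as `fourth`, because the default namespace declared on `fourth` is in scope for its children.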
Re: Re: XML::LibXML and XML Namespaces (processing OpenOffice documents) by bart (Canon) on Mar 11, 2003 at 19:34 UTC
Please go on. What is the meaning of the URIs? What should they point to?
Re: Re: Re: XML::LibXML and XML Namespaces (processing OpenOffice documents) by IlyaM (Parson) on Mar 11, 2003 at 21:48 UTC
Namespace URIs don't really point anywhere. They are simply a means of creating non-conflicting namespaces for tags and attributes in XML documents. If you happen to own the domain mydomain.com and you design your own DTD, you may for example use the URI `http://mydomain.com/mydtd` for the tags and attributes you use. It is unlikely to conflict with DTDs created by somebody else, as they will have URIs with a different domain part, and you can always choose non-conflicting URIs in your own domain for your other DTDs. Read Namespaces in XML for more info.

Update: Your post reminded me to publish my patches for XML::LibXML, which make XPath queries on documents with namespaces easier. I've just posted them to the perl-xml mailing list; you may find them useful.

-- Ilya Martynov, ilya@iponweb.net
CTO IPonWEB (UK) Ltd
Quality Perl Programming and Unix Support
UK managed @ offshore prices - http://www.iponweb.net
Personal website - http://martynov.org
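Since the URI is only an identifier, one way to stop depending on the document's prefixes is to bind your own prefix to the URI before querying. The sketch below uses `XML::LibXML::XPathContext` and its `registerNs` method for that; this is not necessarily what the patches mentioned above do. The inline document is a made-up stand-in for content.xml, and the URI is what I believe OpenOffice 1.x binds to `text`, so treat it as an assumption and check the `xmlns:text` declaration in your own file.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Assumed URI: verify it against the xmlns:text declaration in your content.xml.
my $TEXT_NS = 'http://openoffice.org/2000/text';

# Hypothetical stand-in document that binds the same URI to a different prefix, "t".
my $doc = XML::LibXML->load_xml(string => <<"XML");
<doc xmlns:t="$TEXT_NS">
  <t:p>first paragraph</t:p>
  <t:p>second paragraph</t:p>
  <p>not in the text namespace</p>
</doc>
XML

# The prefix registered here belongs to the query, not to the document.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(text => $TEXT_NS);

my @paras = map { $_->textContent } $xpc->findnodes('//text:p');
print "$_\n" for @paras;
```

The query says `text:p` while the document says `t:p`, and the two still match because both prefixes resolve to the same URI. The unprefixed `p` is skipped because it is in no namespace.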
Re: XML::LibXML and XML Namespaces (processing OpenOffice documents) by bronto (Priest) on Mar 11, 2003 at 14:31 UTC
Matts released an OpenOffice Provider for AxKit. Maybe you would like to take a look at the code...

Ciao!
--bronto

The very nature of Perl to be like natural language--inconsistant and full of dwim and special cases--makes it impossible to know it all without simply memorizing the documentation (which is not complete or totally correct anyway).
--John M. Dlugosz


