Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm pretty sure this is not possible, but I thought I'd ask the experts to make sure.

I'm working with a large, very deep set of XML data, and mostly my Perl script needs to read in the XML, parse the data, do a few modifications/alterations/aggregations, then dump to a DB.

The source XML has an xmlns attribute defined on the root element, which is completely irrelevant to what I am doing with it. As we all know, LibXML becomes a serious pain to work with when a namespace is declared like this, and since I am traveling down 8 or 9 levels of child nodes multiple times and doing operations on each, I fear for my sanity having to redefine and declare the XPathContext.

So my question is simple: is there any way (barring a sed on the source to remove it before parsing...) to remove a namespace from the LibXML parsed object? There seem to be plenty of ways to define new ones, and I haven't seen a definitive answer anywhere on this one yet.

Your thoughts are appreciated.

Replies are listed 'Best First'.
Re: LibXML - Removing Namespace?
by derby (Abbot) on Apr 11, 2008 at 17:33 UTC

    Why don't you just set the namespace:

    $root->setNamespace( 'http://a9.com/-/spec/opensearch/1.1/', 'openSearch' );

    Or use it in the XPath:

    my $start = $root->findvalue( 'openSearch:startIndex/text()' );

    -derby
      First of all, thanks, those are great suggestions. I didn't know about the setNamespace trick for defining context. However, I still have to redefine it for every level as I progress through the XML, right? Perhaps I'm not being clear. Incoming example:

      <root xmlns='urn:foo'>
        <first_sub name='foo'>
          <second_sub id='1'>
            <third_sub>Foo</third_sub>
          </second_sub>
          <second_sub id='2'>
          </second_sub>
        </first_sub>
      </root>

      Now just imagine that there are multiple first_sub, second_sub, and third_sub elements nested in the example above. In order to get at all the values I want, as far as I can tell I have to do something like this (assuming $xml is set to the above):

      my $parser = XML::LibXML->new();
      my $data = $parser->parse_string ( $xml );
      $data->setNamespace ( 'urn:foo', 'x' );
      for my $first_sub ( $data->findnodes ('/x:root/x:first_sub')) {
          my $name = $first_sub->getAttribute('name');
          $first_sub->setNamespace ('urn:foo','x');
          for my $second_sub ( $first_sub->findnodes ('./x:second_sub')) {
              my $id = $second_sub->getAttribute('id');
              $second_sub->setNamespace ('urn:foo','x');
              for my $third_sub ( $second_sub->findnodes('./x:third_sub')) {
                  # do something with the values
              }
          }
      }

      So I'm pretty sure it's confirmed that I can't remove the namespace to avoid all this setting/defining and the extra work in the XPath expressions, which was my original question. This is fine; I'll just stick to something like the above. Thanks again.
        Whoops, that code is missing a getDocumentElement... I should really register so I can edit my posts ;)

        pretend there is:
        $data->getDocumentElement;

        under the parse_string.
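
For what it's worth, the per-level setNamespace calls in the loop above may not be needed at all. Below is a minimal sketch, not tested against the poster's real data, that registers the prefix once on a single XML::LibXML::XPathContext and reuses it at every depth by passing the current node as the context argument. It assumes the root attribute is meant to be xmlns='urn:foo' (a default namespace declaration); the prefix x, the sample document, and the @found array are illustrative only.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample input; assumes the root carries a default namespace declaration.
my $xml = <<'XML';
<root xmlns='urn:foo'>
  <first_sub name='foo'>
    <second_sub id='1'>
      <third_sub>Foo</third_sub>
    </second_sub>
    <second_sub id='2'>
    </second_sub>
  </first_sub>
</root>
XML

my $doc = XML::LibXML->new->parse_string($xml);

# Register the prefix once; 'x' is an arbitrary choice for this sketch.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( x => 'urn:foo' );

my @found;
for my $first_sub ( $xpc->findnodes('/x:root/x:first_sub') ) {
    my $name = $first_sub->getAttribute('name');

    # The second argument is the context node for the relative expression.
    for my $second_sub ( $xpc->findnodes( 'x:second_sub', $first_sub ) ) {
        my $id = $second_sub->getAttribute('id');

        for my $third_sub ( $xpc->findnodes( 'x:third_sub', $second_sub ) ) {
            push @found, join ':', $name, $id, $third_sub->textContent;
        }
    }
}

print "$_\n" for @found;
```

The prefix still has to appear in each XPath step, but the registration happens in exactly one place, however deep the nesting goes.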

Re: LibXML - Removing Namespace?
by Your Mother (Archbishop) on Apr 11, 2008 at 18:06 UTC

    Misery loves company. Found this snippet in LJ::Feed. (edit, fixed link)

    # Strip namespace from child tags. Set default namespace, let
    # child tags inherit from it. So ghetto that we even have to do this
    # and LibXML can't on its own.
    my $normalize_ns = sub {
        my $str = shift;
        $str =~ s/(<\w+)\s+xmlns="\Q$ns\E"/$1/og;
        $str =~ s/<feed\b/<feed xmlns="$ns"/;
        $str =~ s/<entry>/<entry xmlns="$ns">/ if $opts->{'single_entry'};
        return $str;
    };
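
The same string-level idea can be applied before parsing, which is roughly what the "sed on the source" in the original question amounts to. The following is a rough sketch only: it assumes the declaration is written literally as xmlns='urn:foo' or xmlns="urn:foo", that no prefixed namespaces matter, and that the text never appears inside CDATA or comments. The URI and sample document are hypothetical, and a regex over XML is fragile by nature.

```perl
use strict;
use warnings;
use XML::LibXML;

my $ns  = 'urn:foo';    # hypothetical namespace URI for this sketch
my $xml = q{<root xmlns='urn:foo'><first_sub name='foo'><second_sub id='1'><third_sub>Foo</third_sub></second_sub></first_sub></root>};

# Remove every default-namespace declaration for $ns (either quote style).
( my $stripped = $xml ) =~ s/\s+xmlns=(["'])\Q$ns\E\1//g;

my $doc = XML::LibXML->new->parse_string($stripped);

# With no namespace left, unprefixed XPath steps match directly.
my @values = map { $_->textContent }
    $doc->findnodes('/root/first_sub/second_sub/third_sub');

print "@values\n";
```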

    I've resorted to what derby suggests for namespaced XHTML too and it works fine (this snippet is old and un-re-tested).

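    my $root = $doc->documentElement;
    # The context node was $html in the original post, which is not defined
    # in this snippet; $root is assumed here.
    my $xpc = XML::LibXML::XPathContext->new($root);
    $xpc->registerNs('x', 'http://www.w3.org/1999/xhtml');
    my $htmls = $xpc->find('/x:html', $doc);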
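
If registering a prefix is still unwanted, one more option is to match on local-name() in the XPath expression, which compares element names without their namespace. This is only a sketch with a hypothetical sample document. It ignores namespaces entirely, so same-named elements from different namespaces would be conflated, and the expressions get verbose quickly.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample with a default namespace on the root.
my $xml = q{<root xmlns='urn:foo'><first_sub name='foo'><second_sub id='1'><third_sub>Foo</third_sub></second_sub></first_sub></root>};
my $doc = XML::LibXML->new->parse_string($xml);

# local-name() is the element name without its namespace part,
# so no prefix needs to be registered for these queries.
my @ids = map { $_->getAttribute('id') }
    $doc->findnodes('//*[local-name()="second_sub"]');
my $text = $doc->findvalue('//*[local-name()="third_sub"]');

print "ids: @ids\n";
print "text: $text\n";
```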