bangor has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I'm parsing a huge number of XML files and have run into a problem. Most of the files have this at the start:
<EARTHSTATS>
But the odd file has this:
<?xml version="1.0" encoding="utf-8"?> <EARTHSTATS xmlns="http://www.earthstats.org/XFDL/Custom">
and it's messing up my parse. I'm sure there is a way of grepping through the files from the command line and getting rid of the offenders but I'm useless at that. Can anyone start me off with the type of command to use? Thanks!

Update:
This is what's messing me up:

my @nodes = $doc->findnodes('EARTHSTATS');
It doesn't work on the file that has the xmlns - maybe I could change that?

Re: Search and replace for large number of files
by choroba (Cardinal) on May 23, 2014 at 08:59 UTC
    I assume you are using XML::LibXML (good for you!). To work with namespaces, you have to register them and use a prefix:
    my $xpc = 'XML::LibXML::XPathContext'->new;
    $xpc->registerNs('e', 'http://www.earthstats.org/XFDL/Custom');
    # Use the document as the context node, as in your $doc->findnodes call.
    my @nodes = $xpc->findnodes('e:EARTHSTATS', $doc);
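
For anyone who wants to try this in isolation, here is a minimal sketch. The sample XML and the ITEM child element are made up for illustration; only the namespace URI comes from the question. It shows that the unprefixed name finds nothing in the namespaced file, while the registered prefix does.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample mirroring the namespaced header from the question.
# The ITEM element is an assumption, not taken from the real files.
my $xml_text = <<'XML';
<?xml version="1.0" encoding="utf-8"?>
<EARTHSTATS xmlns="http://www.earthstats.org/XFDL/Custom">
  <ITEM>1</ITEM>
</EARTHSTATS>
XML

my $doc = XML::LibXML->load_xml(string => $xml_text);

# An unprefixed name in XPath 1.0 only matches elements in no namespace,
# so this comes back empty for the namespaced file.
my @plain = $doc->findnodes('EARTHSTATS');

# Bind a prefix of our own choosing ('e') to the namespace URI.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('e', 'http://www.earthstats.org/XFDL/Custom');

# Every step in the path needs the prefix, not just the first one.
my @prefixed = $xpc->findnodes('/e:EARTHSTATS');
my @items    = $xpc->findnodes('/e:EARTHSTATS/e:ITEM');

printf "plain: %d, prefixed: %d, items: %d\n",
    scalar @plain, scalar @prefixed, scalar @items;
```

The prefix is local to the XPath context; it does not have to match anything in the file, since the file uses a default namespace with no prefix at all.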
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      To catch both cases in one go, you could use an alternation inside the XPath expression:

      my @nodes = $xpc->findnodes('e:EARTHSTATS | EARTHSTATS', $doc);

      An alternative that does not require registering the namespace is to use the local-name XPath function:

      my @nodes = $doc->findnodes('*[local-name()="EARTHSTATS"]');
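
A rough sketch comparing the two approaches on both kinds of input. The two sample documents and the ITEM element are invented for the test; only the namespace URI is from the question.

```perl
use strict;
use warnings;
use XML::LibXML;

my $NS = 'http://www.earthstats.org/XFDL/Custom';

# Two hypothetical inputs: one bare, one with the default namespace.
my %sample = (
    bare       => '<EARTHSTATS><ITEM>1</ITEM></EARTHSTATS>',
    namespaced => qq{<EARTHSTATS xmlns="$NS"><ITEM>1</ITEM></EARTHSTATS>},
);

my %count;
for my $name (sort keys %sample) {
    my $doc = XML::LibXML->load_xml(string => $sample{$name});

    # Approach 1: union of the prefixed and the unprefixed name.
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs('e', $NS);
    my @by_union = $xpc->findnodes('e:EARTHSTATS | EARTHSTATS');

    # Approach 2: compare only the local part of the element name.
    my @by_local = $doc->findnodes('*[local-name()="EARTHSTATS"]');

    $count{$name} = [ scalar @by_union, scalar @by_local ];
    printf "%s: union=%d local-name=%d\n", $name, @{ $count{$name} };
}
```

Both queries should find the root element in either file. Note that local-name() ignores the namespace entirely, so it would also match an EARTHSTATS element from some unrelated namespace.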

      Of course neither of those solutions is very pretty, but unfortunately that's just how XPath 1.0 works when dealing with inconsistently namespaced input.
      Things are better with XPath 2, but XML::LibXML doesn't have support for that (nor does any other Perl module that I'm aware of).

        Thanks for that, I have learnt something new. Looks like I have to change all my other XPath expressions too, though.
      Yes, I really like XML::LibXML. It flies through these files, some of which are really big. It's the people who produce the XML files in an inconsistent fashion that I want to invite outside!
        I would guess they are using more than one application to generate the XML files, so blame the people who wrote those applications for being inconsistent with the XML specifications.
Re: Search and replace for large number of files
by taint (Chaplain) on May 23, 2014 at 23:06 UTC
    Hello, bangor.

    I don't know if you are able to modify the files you're working with. But if that is an option, it occurs to me that it might be just as easy to run a search-and-replace against the offending <EARTHSTATS> tags. Then you could plow through the whole lot without concern.
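
Something along these lines might do it. This is only a sketch: the helper name strip_earthstats_ns is made up, and the pattern assumes the opening tag looks exactly like the one in the question (one xmlns attribute, double quotes). Pass the file names on the command line; each changed file gets a .bak copy first.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the default-namespace declaration from the
# opening <EARTHSTATS ...> tag and leave everything else untouched.
sub strip_earthstats_ns {
    my ($text) = @_;
    $text =~ s{<EARTHSTATS\s+xmlns="http://www\.earthstats\.org/XFDL/Custom"\s*>}{<EARTHSTATS>};
    return $text;
}

# Rewrite each file named on the command line, keeping a .bak copy.
for my $file (@ARGV) {
    open my $in, '<', $file or die "Cannot read $file: $!";
    my $text = do { local $/; <$in> };    # slurp the whole file
    close $in;

    my $fixed = strip_earthstats_ns($text);
    next if $fixed eq $text;              # already in the plain form

    rename $file, "$file.bak" or die "Cannot back up $file: $!";
    open my $out, '>', $file or die "Cannot write $file: $!";
    print {$out} $fixed;
    close $out or die "Cannot close $file: $!";
}
```

A regex on XML is fragile, so try it on a copy of a few files first. If the header varies at all (extra attributes, single quotes), handling the namespace in the XPath is the safer route.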

    Best wishes.

    --Chris

    ¡λɐp ʇɑəɹ⅁ ɐ əʌɐɥ puɐ ʻꜱdləɥ ꜱᴉɥʇ ədoH