Eythil has asked for the wisdom of the Perl Monks concerning the following question:

I have XML similar to this:
<result>
  <target type="aim">
    <tag1>123</tag1>
    <tag2>456</tag2>
    ...
  </target>
</result>

I want to match the 'tag' elements with a regex or a nice XPath expression, but I have not managed to (maybe because I am a Perl beginner).

So far I have tried regular expressions like:
my $twig = new XML::Twig(
    twig_handlers => {
        "/result/target/tag[1-3]" => \&my_parsing_function,
    },
);
and XPath expressions like:
my $twig = new XML::Twig(
    twig_handlers => {
        "/result/target/tag[self::1 or self::2 or self::3]" => \&my_parsing_function,
    },
);
Unfortunately neither works.
The regex version just ignores the tags; the XPath one fails with 'unrecognised expression in handler'.

The Twig documentation mentions that it supports 'xpath-like' expressions, but it seems I am using them the wrong way.
Is there a way to do the matching with twig or is it better to match the parent tag and do the matching afterwards?

Thank you!

(I hope my English is understandable enough to give an impression of what I want to do)

Re: XML::Twig and handles on regex/xpath
by Corion (Patriarch) on Apr 28, 2011 at 08:05 UTC

    When debugging XPath expressions, I usually work my way up from the "deepest" tag I want to match. In your case, I'd first try to match things with a tagName matching /^tag\d+$/:

    //*[starts-with(name(), "tag")]

    Then, I'd slowly work my way upwards, adding more specific tags or rules in front of it:

    //target/*[starts-with(name(), "tag")]
    //result/target/*[starts-with(name(), "tag")]

    Ideally, at the end, I can then remove the floating specifier:

    /result/target/*[starts-with(name(), "tag")]

    Maybe you need to change your expression, or maybe XML::Twig doesn't support the starts-with() function; in that case I'd try the direct regular expression approach. I do wonder, though, whether specifying a regular expression where an XPath expression is expected is correct, as I would imagine that XML::Twig interprets your first expression as an XPath expression as well. I'm not sure how self::1 is supposed to work in conjunction with tag, as I don't think that tag[EXPR] will ever match anything that doesn't have an explicit tagName of "tag".

      Thanks.
      Sadly, this also gives an 'unrecognised expression' error.
      I assume Twig does not support this kind of xpath expression.

        What gives an "unrecognised expression" error?

        I posted three different XPath expressions. You have shown neither the Perl code you use, nor the input data, nor the XPath expression, nor the error message you get. Please help us to help you better by providing the relevant information.

        Looking at the XML::Twig documentation, it recommends looking at XML::Twig::XPath for more XPath support.
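
For what it's worth, here is a minimal sketch of what that could look like, assuming XML::Twig::XPath is available (it needs XML::XPathEngine or XML::XPath installed). The sample data is modelled on the question, and the <other> element is made up to show what the expression should skip. Note that this evaluates the XPath after parsing with findnodes; it is not a twig_handlers trigger.

```perl
use strict;
use warnings;
use XML::Twig::XPath;    # requires XML::XPathEngine or XML::XPath

# Sample data modelled on the question; <other> is a made-up element
# that the expression is expected to skip.
my $xml = <<'XML';
<result>
  <target type="aim">
    <tag1>123</tag1>
    <tag2>456</tag2>
    <other>789</other>
  </target>
</result>
XML

my $twig = XML::Twig::XPath->new;
$twig->parse($xml);

# Full XPath is evaluated on the parsed tree, not during parsing.
my @tags = map { $_->tag }
    $twig->findnodes('/result/target/*[starts-with(name(), "tag")]');

print "$_\n" for @tags;
```

The trade-off is that the whole document is in memory before anything is matched, so this suits small files better than stream-style processing with handlers.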

Re: XML::Twig and handles on regex/xpath
by wind (Priest) on Apr 28, 2011 at 08:06 UTC
    Using '//target[@type="aim"]/*' as your xpath:
    use XML::Twig;
    use strict;
    use warnings;

    my $data = do { local $/; <DATA> };

    my $twig = new XML::Twig(
        twig_handlers => {
            '//target[@type="aim"]/*' => sub {
                print $_->tag, ' ', $_->text, "\n";
            },
        },
    );
    $twig->parse($data);

    __DATA__
    <result>
      <target type="aim">
        <tag1>123</tag1>
        <tag2>456</tag2>
        <tag3>234</tag3>
      </target>
    </result>

    If that pulls too many nodes, you can always filter within the sub.

      That looks fine, but from what I understand it matches everything within target, doesn't it?

      But what if I want to match tag1, tag2 and tag3 but not tag4?
      I guess I should have made a better example.

        Then add the following filter within the sub

        return if $_->tag !~ /^tag[123]$/;
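
Put together, the filtered handler might look like the sketch below. It reuses the handler expression from the reply above. The <tag4> element, the @matched array and the heredoc are additions for illustration only.

```perl
use strict;
use warnings;
use XML::Twig;

# Sample data from the thread, plus a made-up <tag4> that should be skipped.
my $xml = <<'XML';
<result>
  <target type="aim">
    <tag1>123</tag1>
    <tag2>456</tag2>
    <tag3>234</tag3>
    <tag4>999</tag4>
  </target>
</result>
XML

my @matched;    # "tag text" strings for the elements we keep

my $twig = XML::Twig->new(
    twig_handlers => {
        '//target[@type="aim"]/*' => sub {
            my ($t, $elt) = @_;
            # Skip every child of <target> that is not tag1, tag2 or tag3.
            return if $elt->tag !~ /^tag[123]$/;
            push @matched, $elt->tag . ' ' . $elt->text;
        },
    },
);
$twig->parse($xml);

print "$_\n" for @matched;
```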
Re: XML::Twig and handles on regex/xpath
by mirod (Canon) on Apr 28, 2011 at 09:13 UTC

    If you want to use a regexp in the condition, the only way to do it is to have the condition be a regexp, that will be applied to the tag itself (ie you can't apply it to the whole path):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    XML::Twig->new(
        twig_handlers => {
            qr/^tag[1-3]/ => sub { print $_->tag, ": ", $_->text, "\n"; },
        },
    )->parse( \*DATA);

    __DATA__
    <result>
      <target type="aim">
        <tag1>123</tag1>
        <tag2>456</tag2>
        <nottag>789</nottag>
      </target>
    </result>

    It would be nice to at least be able to apply the regexp to the path, so you could write qr{/result/target/tag[1-3]}. I'll look into it. Beyond that, I don't think XML::Twig can do better, at least as it is currently implemented. Allowing the "xpath-like" interpreter to deal with XPath string functions (starts-with and the like) would be a bit difficult, and Perl's regexp syntax and the XPath syntax collide ([...] is a character class for Perl and a predicate for XPath), so not much hope there.
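
Until something like that exists, one possible workaround is to keep the regexp trigger on the tag and check the element's path inside the handler. This is only a sketch: it assumes the path method of XML::Twig::Elt returns a string like /result/target/tag1, and the <other> branch and the @seen array are made up for the example.

```perl
use strict;
use warnings;
use XML::Twig;

# Sample data; the <other> branch is made up to show a same-named
# element in a different place that should be ignored.
my $xml = <<'XML';
<result>
  <target type="aim">
    <tag1>123</tag1>
    <tag2>456</tag2>
  </target>
  <other>
    <tag1>nope</tag1>
  </other>
</result>
XML

my @seen;    # "tag:text" strings for elements on the wanted path

XML::Twig->new(
    twig_handlers => {
        # The regexp trigger only sees the tag name...
        qr/^tag[1-3]$/ => sub {
            my ($t, $elt) = @_;
            # ...so the location is checked here, against the element's path.
            return unless $elt->path =~ m{^/result/target/tag[1-3]$};
            push @seen, $elt->tag . ':' . $elt->text;
        },
    },
)->parse($xml);

print "$_\n" for @seen;
```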

Re: XML::Twig and handles on regex/xpath
by dHarry (Abbot) on Apr 28, 2011 at 11:14 UTC

    As a general note on working with XPath: it pays off to use a decent XML editor. They let you construct XPath expressions interactively and validate them on the fly, so you continuously see the results returned by the expression. In my experience this saves a lot of debugging time :) By the time you use the expression in your script you have already validated that it's correct. For your example it reports syntax errors, e.g. 'XPath syntax error at char 25 in ... Unexpected token "<numeric literal>" after axis name'. While you could still argue that the message could be improved, it's better than the "unrecognised expression in handler" message. I recommend you take a look at oXygen or XMLSpy. They can ease your "XML life" considerably (try debugging an XSLT!). (There are many free/open source XML editors available, but IMHO none of them even comes close to the commercial ones.)

    Cheers

    Harry

      So far I've been using vim to look at my xml files.
      But I have to admit that it might not be the best tool to look at these files.

      I haven't looked at any plugins for it (if there are any) that would report syntax errors, but that might be a good idea.