cryion has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I'm having some issues with regex and I'm not sure it is even possible to solve this using regex. But I have no way around using regex at the moment.

Imagine the following piece of xml:

<xml>
  <info>
    <file>file:/path/to/some/file.mxf</file>
  </info>
  <info>
    <file>file:/path/to/some/file.xml</file>
  </info>
</xml>

(The actual XML is linearized and does not contain newlines.)

What I need to retrieve is the whole path to the xml file. The regex

file:(.*?\.xml)

does not work, because it matches the very first occurrence of the file: string. Is there any way to make the regex ignore all the file: strings that are not part of the tag that actually contains the path to the xml file?

This is driving me nuts. Thank you!

Update: Corrected regex and added additional information for clarity.

Re: Regex match: Ignoring first occurrences
by Corion (Patriarch) on Aug 10, 2015 at 12:55 UTC

    Maybe you simply want to avoid (opening) angle brackets to just match the tag values?

    file:([^<>]*?)\.xml

    Update: Made the dot (.) in the regex more specific, thanks to Laurent_R.
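
A minimal sketch of this suggestion, run against a linearized copy of the sample XML. The helper name xml_path_no_brackets and the someA/someB paths are assumptions made for illustration, and the closing parenthesis has been moved after \.xml so that the capture holds the whole path, which is a small variation on the regex above.

```perl
use strict;
use warnings;

# Linearized sample input, modelled on the XML in the question.
my $xml = '<xml> <info> <file>file:/path/to/someA/file.mxf</file> </info> '
        . '<info> <file>file:/path/to/someB/file.xml</file> </info> </xml>';

# Hypothetical helper: [^<>] cannot cross a tag boundary, so an attempt that
# starts at the "file:" of the .mxf element fails before reaching ".xml",
# and the engine moves on to the next "file:".
sub xml_path_no_brackets {
    my ($text) = @_;
    return $text =~ /file:([^<>]*?\.xml)/ ? $1 : undef;
}

print xml_path_no_brackets($xml), "\n";    # prints /path/to/someB/file.xml
```

Because the character class excludes both angle brackets, the match has to start and end inside a single element's text, which is what keeps the .mxf entry out of the result.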

      Thank you very much, this seems to work fine. I had that idea as well but obviously no idea how to write that expression.
Re: Regex match: Ignoring first occurrences
by kcott (Archbishop) on Aug 10, 2015 at 13:29 UTC

    G'day cryion,

    Welcome to the Monastery.

    "But I have no way around using regex at the moment."

    Using a regex to parse XML is generally a poor choice. Why do you have no way around this?

    On the basis that you must use a regex, there is a distinct disconnect between the code and data you've posted and the regex you say doesn't work.

    Parsing your XML code, line by line, with the regex you've shown (i.e. 'file:(.*?).xml'), captures one piece of data:

    /path/to/some/file

    Had you used different paths, such that you could see which path was being matched, you'd know that 'file:/path/to/some/file.mxf' ("the very first occurrence of the file: string") was not matched at all. Consider this test:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    my $re = qr{file:(.*?).xml};

    while (<DATA>) {
        print $1 if /$re/;
    }

    __DATA__
    <xml>
      <info>
        <file>file:/path/to/someA/file.mxf</file>
      </info>
      <info>
        <file>file:/path/to/someB/file.xml</file>
      </info>
    </xml>

    Output:

    /path/to/someB/file

    So, you're matching the right path, but not capturing all of it.

    A '.' in a regex matches any character (except newline), so you really need '\.xml', not '.xml'. The closing parenthesis needs to be after '\.xml' to capture the whole pathname.

    Making those changes:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    my $re = qr{file:(.*?\.xml)};

    while (<DATA>) {
        print $1 if /$re/;
    }

    __DATA__
    <xml>
      <info>
        <file>file:/path/to/someA/file.mxf</file>
      </info>
      <info>
        <file>file:/path/to/someB/file.xml</file>
      </info>
    </xml>

    Gives this output:

    /path/to/someB/file.xml

    Which is what you state you wanted: "the whole path to the xml file".

    — Ken

      Thank you very much. I translated from my original code to a 'dumbed down' version to get the issue across more easily. I guess I didn't do a good enough job.

      For one, I should have added that the xml string has no line breaks in it, and without those your regex doesn't seem to work anymore. (I tried it only in Notepad++'s regex plugin, though.)

      I was also actually using the '\.'. I'm sorry for not putting it in here.

      It's kind of complicated to explain why I can only use a regex. It has to do with a piece of software I have to use that only takes a regex as an input param to retrieve information out of a file. I have heard about regex being a terrible idea for parsing xml and I try to avoid it as often as possible. However, I'm not entirely sure about the reasons. Do you happen to have a good resource to read up on this?

      Thank you very much!
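
A short sketch of why the missing line breaks matter, assuming the same sample data joined onto one line (the someA/someB paths are illustrative). A lazy quantifier only shortens the match from wherever it starts; the regex engine still starts at the leftmost "file:", so on linearized input the capture begins in the .mxf element.

```perl
use strict;
use warnings;

# The sample XML with every line break removed (assumed shape of the real input).
my $xml = '<xml> <info> <file>file:/path/to/someA/file.mxf</file> </info> '
        . '<info> <file>file:/path/to/someB/file.xml</file> </info> </xml>';

# The match starts at the first "file:" and expands lazily until the first
# ".xml", which only occurs in the second element.
my ($lazy) = $xml =~ /file:(.*?\.xml)/;

print "$lazy\n";
```

The printed capture starts with /path/to/someA/file.mxf and runs across the intervening tags up to the end of file.xml, which matches the behaviour described in the question.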

        The repetitious and hierarchical nature of XML often makes use of regexes difficult. There are several useful XML modules which make dealing with it easier. Well, at least less error prone! Especially if the precise structure of the XML may change over time. Popular modules include XML::Twig, XML::LibXML, XML::Rules. Avoid XML::Simple. A few of these have good tutorial pages available. You can find examples of use with Super Search here.

        Dum Spiro Spero
Re: Regex match: Ignoring first occurrences
by Athanasius (Archbishop) on Aug 10, 2015 at 13:46 UTC

    Hello cryion, and welcome to the Monastery!

    With the XML data shown, the regex you say doesn’t work actually does, as long as there is no /s modifier, because in the absence of that modifier . won’t match a newline character — and therefore the matching file: has to be on the same line as the \.xml.

    But I’m guessing that your real data doesn’t always contain newlines as in the example. In that case, you can use a technique which I learned here at PerlMonks:

    #! perl
    use strict;
    use warnings;

    my $xml = '<xml><info><file>file:/path1/to/some/file.mxf</file></info>' .
              '<info><file>file:/path2/to/some/file.xml</file></info></xml>';
    my $lmx = reverse $xml;
    my ($htap) = $lmx =~ /lmx\.(.*?):elif/s;

    if (defined $htap)
    {
        my $path = reverse $htap;
        print "Path: $path\n";
    }

    Output:

    23:32 >perl 1335_SoPW.pl
    Path: /path2/to/some/file

    23:32 >

    See reverse.

    Update (Aug 11, 2015):

    1. Fixed logic to prevent uninitialized warning when attempting to reverse $htap if $htap is undef.
    2. This technique will still give a false positive if, e.g., the file: immediately preceding .xml is missing or misspelled. Prefer the other solutions given above.
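
The false positive mentioned in point 2 can be reproduced with a small sketch. The helper name last_xml_path_reversed and both input strings are assumptions made for illustration; the regex is the one from the reply.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the reverse, match, reverse steps.
sub last_xml_path_reversed {
    my ($xml) = @_;
    my $lmx = reverse $xml;                      # scalar context: reversed string
    my ($htap) = $lmx =~ /lmx\.(.*?):elif/s;
    return defined $htap ? scalar reverse($htap) : undef;
}

# Well-formed input: the capture is the path in front of ".xml".
my $good = '<file>file:/a/one.mxf</file><file>file:/b/two.xml</file>';

# "file:" misspelled as "fiel:" in the second element: the lazy match in the
# reversed string runs on until it reaches the "file:" of the first element.
my $bad  = '<file>file:/a/one.mxf</file><file>fiel:/b/two.xml</file>';

print last_xml_path_reversed($good), "\n";   # prints /b/two
print last_xml_path_reversed($bad),  "\n";   # prints /a/one.mxf</file><file>fiel:/b/two
```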

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Regex match: Ignoring first occurrences
by Anonymous Monk on Aug 10, 2015 at 13:03 UTC
    #!/usr/bin/perl -l

    # http://perlmonks.org/?node_id=1138025

    use strict;
    use warnings;

    $_ = <<END;
    <xml>
      <info>
        <file>file:/path/to/some/file.mxf</file>
      </info>
      <info>
        <file>file:/path/to/some/file.xml</file>
      </info>
    </xml>
    END

    print m{.*<file>file:(.*?\.xml)</file>}s ? $1 : "xml not found";
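
The regex in this reply relies on a greedy .* prefix: with /s it first consumes the whole string, then backtracks until the rest of the pattern fits, so the engine tries the last <file>file: first. A hedged sketch of that behaviour follows, with the helper name last_xml_path and the sample strings being assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper around the regex from the reply above.
sub last_xml_path {
    my ($text) = @_;
    return $text =~ m{.*<file>file:(.*?\.xml)</file>}s ? $1 : undef;
}

# Linearized input with the .xml entry last, as in the question.
my $xml_last  = '<file>file:/a/one.mxf</file><file>file:/b/two.xml</file>';

# Same entries in the opposite order: the attempt at the last <file>file:
# fails (no ".xml</file>" follows it), so backtracking reaches the first one.
my $xml_first = '<file>file:/b/two.xml</file><file>file:/a/one.mxf</file>';

print last_xml_path($xml_last),  "\n";   # prints /b/two.xml
print last_xml_path($xml_first), "\n";   # prints /b/two.xml
```

Anchoring on the literal <file> and </file> tags also means this pattern depends on the exact element name, which is part of why the replies recommend an XML parser when one is available.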