in reply to Regex match: Ignoring first occurences

G'day cryion,

Welcome to the Monastery.

"But I have no way around using regex at the moment."

Using a regex to parse XML is generally a poor choice. Why do you have no way around this?

On the basis that you must use a regex, there is a distinct disconnect between the code and data you've posted and the regex you say doesn't work.

Parsing your XML code, line by line, with the regex you've shown (i.e. 'file:(.*?).xml'), captures one piece of data:

/path/to/some/file

Had you used different paths, such that you could see which path was being matched, you'd know that 'file:/path/to/some/file.mxf' ("the very first occurence of the file: string") was not matched at all. Consider this test:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;

    my $re = qr{file:(.*?).xml};

    while (<DATA>) {
        print $1 if /$re/;
    }

    __DATA__
    <xml>
    <info>
    <file>file:/path/to/someA/file.mxf</file>
    </info>
    <info>
    <file>file:/path/to/someB/file.xml</file>
    </info>
    </xml>

Output:

/path/to/someB/file

So, you're matching the right path, but not capturing all of it.

A '.' in a regex matches any character (except newline), so you really need '\.xml', not '.xml'. The closing parenthesis needs to be after '\.xml' to capture the whole pathname.
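
To see what the unescaped dot lets through, here is a minimal sketch. The three sample strings and the helper name matching_patterns are made up for illustration; they are not from your post.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample strings, chosen only to show the difference.
my @samples = ('file:/tmp/a.xml', 'file:/tmp/a_xml', 'file:/tmp/axml');

# Return the names of the patterns that match the given string.
# (matching_patterns is an illustrative helper, not an existing API.)
sub matching_patterns {
    my ($str) = @_;
    my @hits;
    push @hits, 'unescaped' if $str =~ /file:(.*?).xml/;    # '.' is a wildcard
    push @hits, 'escaped'   if $str =~ /file:(.*?)\.xml/;   # '\.' is a literal dot
    return @hits;
}

for my $s (@samples) {
    my @hits = matching_patterns($s);
    printf "%-16s => %s\n", $s, @hits ? join(', ', @hits) : 'no match';
}
```

The unescaped pattern accepts all three strings, because the '.' is happy to consume '_' or 'a' in place of a literal dot. The escaped pattern accepts only the first.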

Making those changes:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;

    my $re = qr{file:(.*?\.xml)};

    while (<DATA>) {
        print $1 if /$re/;
    }

    __DATA__
    <xml>
    <info>
    <file>file:/path/to/someA/file.mxf</file>
    </info>
    <info>
    <file>file:/path/to/someB/file.xml</file>
    </info>
    </xml>

Gives this output:

/path/to/someB/file.xml

Which is what you state you wanted: "the whole path to the xml file".
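
One caveat: the test above feeds the regex one line at a time. If the XML reaches the regex as a single string with no line breaks (an assumption on my part; your posted data had them), the lazy '.*?' still starts at the first 'file:' and simply runs on until it meets the first '.xml'. A possible workaround, sketched here with the same sample data joined onto one line, is a negated character class that cannot cross a tag boundary:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumption: the same sample data as above, but with no line breaks.
my $xml = '<xml><info><file>file:/path/to/someA/file.mxf</file></info>'
        . '<info><file>file:/path/to/someB/file.xml</file></info></xml>';

# Lazy dot: begins at the first "file:" and extends to the first ".xml",
# so the capture swallows the .mxf entry and the tags in between.
my ($lazy) = $xml =~ m{file:(.*?\.xml)};

# Negated class: [^<] cannot match the "<" of a closing tag, so the match
# must begin at the "file:" inside the same element as ".xml".
my ($bounded) = $xml =~ m{file:([^<]*\.xml)};

print "lazy:    $lazy\n";
print "bounded: $bounded\n";
```

This relies on the path itself never containing a '<', which holds for the data shown but is still an assumption about your real input. I have only reasoned about this in Perl; whether your tool's regex engine behaves identically is something you would need to check.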

— Ken

Re^2: Regex match: Ignoring first occurences
by cryion (Initiate) on Aug 10, 2015 at 14:15 UTC
    Thank you very much. I translated from my original code to a 'dumbed down' version to get the issue across more easily. I guess I didn't do a good enough job.

    For one, I should have added that the XML string has no line breaks in it, and without those your regex doesn't seem to work anymore (tried only with Notepad++'s regex plugin, though).

    I was also actually using the '\.'. I'm sorry for not putting it in here.

    It's kind of complicated to explain why I can only use a regex. It has to do with a piece of software I have to use that only takes a regex as an input parameter to retrieve information from a file. I have heard that regexes are a terrible idea for parsing XML, and I try to avoid them as often as possible. However, I'm not entirely sure about the reasons. Do you happen to have a good resource to read up on this?

    Thank you very much!

      The repetitious and hierarchical nature of XML often makes the use of regexes difficult. There are several useful XML modules which make dealing with it easier. Well, at least less error prone! Especially if the precise structure of the XML may change over time. Popular modules include XML::Twig, XML::LibXML and XML::Rules. Avoid XML::Simple. A few of these have good tutorial pages available. You can find examples of their use with Super Search here.

      Dum Spiro Spero