Regex match: Ignoring first occurrences

cryion has asked for the wisdom of the Perl Monks concerning the following question:


Re: Regex match: Ignoring first occurrences
by Corion (Patriarch) on Aug 10, 2015 at 12:55 UTC

Maybe you simply want to avoid (opening) angle brackets to just match the tag values?

file:([^<>]*?)\.xml

Update: Made the dot (.) in the regex more specific by escaping it as \., thanks to Laurent_R.
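A minimal sketch of how that negated character class behaves when the XML sits on a single line. The sample string and the helper name xml_path are illustrative assumptions, not the original poster's data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical single-line sample, modelled on the paths used elsewhere in this thread.
my $xml = '<xml><info><file>file:/path/to/someA/file.mxf</file></info>'
        . '<info><file>file:/path/to/someB/file.xml</file></info></xml>';

# [^<>] cannot cross a tag boundary, so an attempt that starts at the first
# "file:" fails when it reaches "</file>", and the engine retries at the next "file:".
sub xml_path {
    my ($text) = @_;
    return $text =~ m{file:([^<>]*?)\.xml} ? $1 : undef;
}

my $path = xml_path($xml);
print defined $path ? "$path\n" : "no match\n";
```

With this sample the script prints /path/to/someB/file. To capture the extension as well, the closing parenthesis would move after \.xml.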


Re^2: Regex match: Ignoring first occurrences

by cryion (Initiate) on Aug 10, 2015 at 14:17 UTC

Thank you very much, this seems to work fine. I had that idea as well but obviously no idea how to write that expression.


Re: Regex match: Ignoring first occurrences
by kcott (Archbishop) on Aug 10, 2015 at 13:29 UTC

G'day cryion,

Welcome to the Monastery.

"But I have no way around using regex at the moment."

Using a regex to parse XML is generally a poor choice. Why do you have no way around this?

On the basis that you must use a regex, there is a distinct disconnect between the code and data you've posted and the regex you say doesn't work.

Parsing your XML code, line by line, with the regex you've shown (i.e. 'file:(.*?).xml'), captures one piece of data:

/path/to/some/file

Had you used different paths, such that you could see which path was being matched, you'd know that 'file:/path/to/some/file.mxf' ("the very first occurence of the file: string") was not matched at all. Consider this test:

#!/usr/bin/env perl -l

use strict;
use warnings;

my $re = qr{file:(.*?).xml};

while (<DATA>) {
    print $1 if /$re/;
}

__DATA__
<xml>
<info>
<file>file:/path/to/someA/file.mxf</file>
</info>
<info>
<file>file:/path/to/someB/file.xml</file>
</info>
</xml>

Output:

/path/to/someB/file

So, you're matching the right path, but not capturing all of it.

A '.' in a regex matches any character (except newline), so you really need '\.xml', not '.xml'. The closing parenthesis needs to come after '\.xml' to capture the whole pathname.

Making those changes:

#!/usr/bin/env perl -l

use strict;
use warnings;

my $re = qr{file:(.*?\.xml)};

while (<DATA>) {
    print $1 if /$re/;
}

__DATA__
<xml>
<info>
<file>file:/path/to/someA/file.mxf</file>
</info>
<info>
<file>file:/path/to/someB/file.xml</file>
</info>
</xml>

Gives this output:

/path/to/someB/file.xml

Which is what you state you wanted: "the whole path to the xml file".

— Ken


Re^2: Regex match: Ignoring first occurrences

by cryion (Initiate) on Aug 10, 2015 at 14:15 UTC

For one, I should have added that the XML string has no line breaks in it, and without those your regex doesn't seem to work anymore. (I have only tried it in Notepad++'s regex plugin, though.)

I was also actually using the '\.'. I'm sorry for not putting it in here.

It's kind of complicated to explain why I can only use a regex. It has to do with a piece of software I have to use that only takes a regex as an input parameter to retrieve information from a file. I have heard that regexes are a terrible idea for parsing XML, and I try to avoid them as often as possible. However, I'm not entirely sure about the reasons. Do you happen to have a good resource to read up on this?


Re^3: Regex match: Ignoring first occurrences

by GotToBTru (Prior) on Aug 10, 2015 at 15:26 UTC

The repetitious and hierarchical nature of XML often makes use of regexes difficult. There are several useful XML modules which make dealing with it easier. Well, at least less error prone! Especially if the precise structure of the XML may change over time. Popular modules include XML::Twig, XML::LibXML, XML::Rules. Avoid XML::Simple. A few of these have good tutorial pages available. You can find examples of use with Super Search here.

Dum Spiro Spero


Re: Regex match: Ignoring first occurrences
by Athanasius (Archbishop) on Aug 10, 2015 at 13:46 UTC

Hello cryion, and welcome to the Monastery!

With the XML data shown, the regex you say doesn’t work actually does, as long as there is no /s modifier, because in the absence of that modifier . won’t match a newline character — and therefore the matching file: has to be on the same line as the \.xml.
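A small sketch of that newline dependence, under the assumption that the data looks like the paths used in this thread. The helper name first_capture and the two-element sample are illustrative only. It runs the same regex against the data laid out once with a newline between the elements and once as a single line:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $re = qr{file:(.*?)\.xml};   # no /s, so "." stops at a newline

# Hypothetical sample data in two layouts: one element per line, and one long line.
my @elements = (
    '<file>file:/path/to/someA/file.mxf</file>',
    '<file>file:/path/to/someB/file.xml</file>',
);
my $multi_line  = join "\n", @elements;
my $single_line = join '',   @elements;

# Return the first capture of $re in the given text, or undef if nothing matches.
sub first_capture {
    my ($text) = @_;
    return $text =~ $re ? $1 : undef;
}

print first_capture($multi_line),  "\n";
print first_capture($single_line), "\n";
```

The first line printed is /path/to/someB/file, because the newline stops the attempt that began at the first file:. The second line printed is /path/to/someA/file.mxf</file><file>file:/path/to/someB/file, because on a single line nothing prevents the lazy .*? from running on from the first file: to the first .xml.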

But I’m guessing that your real data doesn’t always contain newlines as in the example. In that case, you can use a technique which I learned here at PerlMonks:

#! perl
use strict;
use warnings;

my  $xml   = '<xml><info><file>file:/path1/to/some/file.mxf</file></info>' .
             '<info><file>file:/path2/to/some/file.xml</file></info></xml>';
my  $lmx   = reverse $xml;
my ($htap) = $lmx =~ /lmx\.(.*?):elif/s;

if (defined $htap)
{
    my $path = reverse $htap;
    print "Path: $path\n";
}

Output:

23:32 >perl 1335_SoPW.pl
Path: /path2/to/some/file

23:32 >

See reverse.

Update (Aug 11, 2015):

Fixed logic to prevent uninitialized warning when attempting to reverse $htap if $htap is undef.
This technique will still give a false positive if, e.g., the file: immediately preceding .xml is missing or misspelled. Prefer the other solutions given above.
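The false positive mentioned in the update can be reproduced with a short sketch. The sub name last_xml_path and the deliberately misspelled sample input are assumptions made for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The reversed-string match from this post, wrapped in a sub so it can be reused.
sub last_xml_path {
    my ($xml) = @_;
    my $lmx    = reverse $xml;
    my ($htap) = $lmx =~ /lmx\.(.*?):elif/s;
    return undef unless defined $htap;
    return scalar reverse $htap;
}

# Hypothetical input in which the second "file:" prefix is misspelled as "fil:".
my $bad = '<file>file:/path1/to/some/file.mxf</file>'
        . '<file>fil:/path2/to/some/file.xml</file>';

print last_xml_path($bad), "\n";
```

This prints /path1/to/some/file.mxf</file><file>fil:/path2/to/some/file, since the lazy match in the reversed string runs on until it finds the :elif that belongs to the earlier element.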

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re: Regex match: Ignoring first occurrences
by Anonymous Monk on Aug 10, 2015 at 13:03 UTC

#!/usr/bin/perl -l

# http://perlmonks.org/?node_id=1138025

use strict;
use warnings;

$_ = <<END;
<xml>
<info>
<file>file:/path/to/some/file.mxf</file>
</info>
<info>
<file>file:/path/to/some/file.xml</file>
</info>
</xml>
END

print m{.*<file>file:(.*?\.xml)</file>}s ? $1 : "xml not found";
