PerlJedi has asked for the wisdom of the Perl Monks concerning the following question:

Here is a snippet of my input file "Input.xml" (it's a huge XML file of nearly 30 MB with 400000 lines):

<Reasons id="ABC" name="PJ1:PJ">
  <modDate>2010-03-18T09:20:05.793-03:00</modDate>
</Reasons>
<Reasons id="DEF" name="PJ2:PJ">
  <modDate>2010-03-18T09:19:56.997-03:00</modDate>
</Reasons>
<Reasons id="GHI" name="ADD">
  <modDate>2010-03-18T11:12:00.147-03:00</modDate>
</Reasons>
<Reasons id="JKL" name="ADD">
  <modDate>2010-03-18T11:15:16.597-03:00</modDate>
</Reasons>

I am trying to extract all the "Reasons" by reading the file into an array, joining it into a scalar, and then keeping only the wanted parts with a substitution. My Perl code is as follows:

#! /bin/env perl
use warnings;

if (! open R21WORKINGCONFIG, "<", "Input.xml") {
    die "Cant open Input.xml for input";
}
@workingConfig = <R21WORKINGCONFIG>;
chomp @workingConfig;
$domainProperty = "Reasons";
$scalarWorkingConfig = join "XXXXX", @workingConfig;
$scalarWorkingConfig =~ s/.*(XXXXX[ | ]*<$domainProperty .*$domainProperty>).*/$1/g;
print join ("\n", split (/XXXXX/, $scalarWorkingConfig))."\n";
close R21WORKINGCONFIG;

Unfortunately, the substitution does not match the entire set of "Reasons", only the last one. Here is the output that I get:

<Reasons id="JKL" name="ADD">
  <modDate>2010-03-18T11:15:16.597-03:00</modDate>
</Reasons>

The question is: why is the substitute command not matching the entire set of Reasons? I think the regex is pretty straightforward. Where am I going wrong?

Replies are listed 'Best First'.
Re: How to make substitute greedy like sed's substitute
by Ratazong (Monsignor) on Mar 30, 2010 at 08:55 UTC

    The issue is not that you are not greedy enough - you are too greedy.

    $scalarWorkingConfig =~ s/.*(XXXXX[ | ]*<$domainProperty .*$domainProperty>).*/$1/g;

    The first .* is greedy; it tries to match as much as possible, which includes all of your <Reasons ...> parts except the last one.

    Try replacing it with .*?
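
    The difference can be seen on a small stand-in for the joined input. This is only a sketch: the three-chunk string below is made up for illustration, and the original's character class is simplified to \s*.

```perl
use strict;
use warnings;

# Hypothetical miniature of the joined file: chunks separated by XXXXX,
# two of them <Reasons> elements.
my $joined = 'header'
           . 'XXXXX<Reasons id="ABC">a</Reasons>'
           . 'XXXXX<Reasons id="DEF">b</Reasons>'
           . 'XXXXXfooter';

my $tag = 'Reasons';

# Greedy leading .* : grabs everything, then backtracks only as far as
# the LAST "XXXXX<Reasons ", so just the final element is captured.
(my $greedy = $joined) =~ s/.*(XXXXX\s*<$tag .*$tag>).*/$1/;

# Lazy leading .*? : stops at the FIRST "XXXXX<Reasons ". The inner .*
# is still greedy, so the capture runs on to the last </Reasons>.
(my $lazy = $joined) =~ s/.*?(XXXXX\s*<$tag .*$tag>).*/$1/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

    With the lazy quantifier the capture spans both elements; with the greedy one only the id="DEF" element survives.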

      Thanks Ratazong ... It works fine as required with the .*?

Re: How to make substitute greedy like sed's substitute
by Jenda (Abbot) on Mar 30, 2010 at 15:14 UTC
    1. 30MB is not huge.
    2. Do not, repeat, do not, repeat, DO NOT attempt to parse XML with regexps!

    Do yourself a favor and use an XML parser. The task at hand sounds perfect for XML::Rules, but XML::Twig would also work great.
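
    As a rough illustration of the XML::Twig route, something along these lines should do. It is a minimal sketch, not a drop-in solution: the <root> wrapper is an assumption (the snippet in the question shows no enclosing element), and only two of the sample elements are included.

```perl
use strict;
use warnings;
use XML::Twig;

# Sample data from the question, wrapped in a hypothetical <root>
# element so that it forms a well-formed document.
my $xml = <<'XML';
<root>
<Reasons id="ABC" name="PJ1:PJ"> <modDate>2010-03-18T09:20:05.793-03:00</modDate> </Reasons>
<Reasons id="JKL" name="ADD"> <modDate>2010-03-18T11:15:16.597-03:00</modDate> </Reasons>
</root>
XML

my @reasons;    # serialized <Reasons> elements, in document order

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once for each completely parsed <Reasons> element.
        Reasons => sub {
            my ($t, $elt) = @_;
            push @reasons, $elt->sprint;    # the element as an XML string
            $t->purge;                      # free what has been parsed so far
        },
    },
);

$twig->parse($xml);    # for the real file: $twig->parsefile('Input.xml')

print "$_\n" for @reasons;
```

    Because the handler purges the twig after each element, memory use should stay flat even for inputs far larger than 30 MB.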

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: How to make substitute greedy like sed's substitute
by rovf (Priest) on Mar 30, 2010 at 11:15 UTC
    It's a huge XML file of nearly 30 MB with 400000 lines ...
    @workingConfig = <R21WORKINGCONFIG>; $scalarWorkingConfig = join "XXXXX", @workingConfig; $scalarWorkingConfig =~ s/.*(XXXXX[ | ]*<$domainProperty .*$domainProperty>).*/$1/g;
    ...

    I wonder whether Larry could ever have guessed all the things people would do with his Perl....

    -- 
    Ronald Fischer <ynnor@mm.st>

Re: How to make substitute greedy like sed's substitute
by AnomalousMonk (Archbishop) on Mar 31, 2010 at 00:42 UTC

    I would endorse Jenda's advice to avoid regexes for parsing (X|HT)ML.

    However,
        $scalarWorkingConfig =~ s/.*(XXXXX[ |   ]*<$domainProperty .*$domainProperty>).*/$1/g;
    suggests a misunderstanding of the syntax of character classes (i.e., the [chars] construct; see perlre and Using character classes in perlretut and also perlrequick). The '|' (pipe) character is not special (i.e., is not a metacharacter) in a character class, nor has the repetition of characters any significance. The character class [ |   ] is the equivalent of [ |] and consists of a blank-space character and a pipe character.
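
    A quick check of that equivalence might look like the following. The [ \t] class at the end is only a guess at what was intended (space or tab); the question does not say.

```perl
use strict;
use warnings;

# Repeated characters add nothing to a class: both of these match
# strings made up solely of spaces and pipes.
my $original = qr/^[ |   ]+$/;
my $minimal  = qr/^[ |]+$/;

# Assumption: the asker wanted "space or tab", which is written like this.
my $space_or_tab = qr/^[ \t]+$/;

# Helper (hypothetical name): 1 if the string matches the regex, else 0.
sub matches {
    my ($re, $string) = @_;
    return $string =~ $re ? 1 : 0;
}

my %samples = (
    'space'        => ' ',
    'pipe'         => '|',
    'space + pipe' => ' | ',
    'tab'          => "\t",
);

for my $label (sort keys %samples) {
    my $s = $samples{$label};
    printf "%-12s original=%d minimal=%d space_or_tab=%d\n",
        $label, matches($original, $s), matches($minimal, $s),
        matches($space_or_tab, $s);
}
```

    The pipe is matched literally by the first two classes, and none of the "alternation" the asker may have expected takes place.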

Re: How to make substitute greedy like sed's substitute
by kyz (Acolyte) on Mar 31, 2010 at 12:40 UTC

    I also agree with Jenda. I regularly work with scanning for needles in 20 gigabyte XML dumps, where loading the data entirely into memory would be impossibly slow.

    I recommend using the right tool for the job. Would you write a Perl script to search through text files for lines matching a static pattern, or would you just use grep? The same is true for XML: would you write a script to search through XML when you could just use the XML equivalent of grep?

    I put your XML snippet into a file called test.xml, and wrapped it with <root> </root>. Then I ran this command:

    xml sel -t -c '*/Reasons' test.xml

    This printed just the <Reasons> tags and whatever was inside them, i.e. what your code does. The xml command is a piece of software called XMLStarlet and it's designed to be the XML equivalent of grep/awk/sed. I'd recommend using it.

Re: How to make substitute greedy like sed's substitute
by Anonymous Monk on Mar 31, 2010 at 07:30 UTC

    Try this substitute command:

    $scalarWorkingConfig =~ s/.*XXXXX\s(<$domainProperty +.*$domainProperty>).*/$1/g;