Regexp Problem with greedy

by Isanchez (Acolyte)
on Sep 10, 2003 at 23:14 UTC ( [id://290551]=perlquestion )

Isanchez has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a problem with the following script, which needs to grab every airplane name that sits inside tags whose closing tag ends in a number reference. I know the number reference should have been in the opening tag, but can anyone please check whether there is a way to make this code work?

A million thanks, Monks.

$in = "captionOutTagged.xml";
open (IN, $in) or die "can't open the infile $in \n";
#####
while (not eof (IN)){
    $line = <IN>;
    chomp $line;
    #print "$line\n\n";
    # airplanes models: have a digit at the end of second tag
    if ( @terms = $line =~ /\<M\>(.*?)\<\/M\d+?\>/gix ) {
        # print "=**$1**=\n";
    }
    # no number then avionics general terms
    elsif (@avionics = $line =~ /\<M\>(.+?)<\/'M'\>/gix) {
        #print " LINE: $line\n";
        #print "$1\n";
    }
    ##########
    foreach $term (@terms){
        print "$term\n";
    }
    foreach $avionic (@avionics){
        print "$avionic\n";
    }
    #########
} # end

America's first <M>swept-wing</M>, <M>multiengine jet</M> <M>bomber</M> was the <M>B-47 Stratojet</M200>, and the first <M>swept-wing fighter</M> was the <M>F-86 Sabre Jet</M201>. Both used new swept-wing data found in Germany after <M>World War II</M> and sent back to the United States by American scientists. This photograph, from <D>1951</D>, was taken the first time the two flew together over <PL>Kansas</PL>.

Current output:

swept-wing</M>, <M>multiengine jet</M> <M>bomber</M> was the <M>B-47 Stratojet

swept-wing fighter</M> was the <M>F-86 Sabre Jet

Desired output:

B-47 Stratojet

F-86 Sabre Jet

Replies are listed 'Best First'.
Re: Regexp Problem with greedy
by Enlil (Parson) on Sep 10, 2003 at 23:54 UTC
    I believe this will get what you want:
    use strict;
    use warnings;

    while ( <DATA> ) {
        chomp;
        my @planes = m'<M>((?:[^<]+|<(?!/M(?:\d+)?>)))</M\d+>'gi;
        my @terms  = m'<M>((?:[^<]+|<(?!/M>)))</M>'gi;
        print "@planes\n";
        print "@terms\n";
    }
    __DATA__
    America's first <M>swept-wing</M>, <M>multiengine jet</M> <M>bomber</M> was the <M>B-47 Stratojet</M200>, and the first <M>swept-wing fighter</M> was the <M>F-86 Sabre Jet</M201>. Both used new swept-wing data found in Germany after <M>World War II</M> and sent back to the United States by American scientists. This photograph, from <D>1951</D>, was taken the first time the two flew together over <PL>Kansas</PL>.

    -enlil

Re: Regexp Problem with greedy
by asarih (Hermit) on Sep 11, 2003 at 00:08 UTC
    A fundamental problem with this script is that it processes the file one line at a time, splitting at each \n. If tagged text spans more than one line, you won't catch it. Do you want to catch the <M> tags only if their closing tag has digits?

    This will do it. (I'll leave it up to you to improve it so that it can detect tagged text that spans over line breaks.)

    #!/usr/local/bin/perl -w
    while (<DATA>) {
        for (m{<M>([^<]+)(?=</M\d+>)}ig) { print $_,"\n" };
    }
    __DATA__
    America's first <M>swept-wing</M>, <M>multiengine jet</M> <M>bomber</M> was the <M>B-47 Stratojet</M200>, and the first <M>swept-wing fighter</M> was the <M>F-86 Sabre Jet</M201>. Both used new swept-wing data found in Germany after <M>World War II</M> and sent back to the United States by American scientists. This photograph, from <D>1951</D>, was taken the first time the two flew together over <PL>Kansas</PL>.
    <M>Bogus</M2> <M>airplane</M3>
    But you really should read perlre and if you get a chance, by all means read Jeff Friedl's excellent Mastering Regular Expressions (ISBN 0596002890). And of course, be wary of Dot star! (Death to Dot Star!).
      thank you all for your great advice. Best, Ivo
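The line-break case left as an exercise above can be sketched as follows. This is only a minimal illustration, assuming the whole file fits in memory and that tagged text never contains < or >; the sub name plane_names_multiline and the sample string are made up for the example.

```perl
use strict;
use warnings;

# Extract plane names from a whole document held in one string.
# A negated class such as [^<>] also matches "\n", so tagged text
# may wrap across lines without needing the /s modifier.
sub plane_names_multiline {
    my ($text) = @_;
    my @names = $text =~ m{<M>([^<>]+)</M\d+>}gi;
    s/\s+/ /g for @names;    # fold line breaks and runs of spaces
    return @names;
}

# Hypothetical input in which one name is split across two lines.
# For a real file, the string could be read in one go with
#   my $doc = do { local $/; <$fh> };
my $doc = "the <M>B-47\nStratojet</M200> and the <M>swept-wing fighter</M>\n"
        . "was the <M>F-86 Sabre Jet</M201>.\n";

print "$_\n" for plane_names_multiline($doc);
```

Reading the input as a single string removes the dependence on where the line breaks happen to fall, at the cost of holding the whole file in memory.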
Re: Regexp Problem with greedy
by pzbagel (Chaplain) on Sep 10, 2003 at 23:44 UTC

    As long as you are fairly sure your plane names will not contain < or > then you can simply replace:

    (.*?)     #with
    ([^<>]*?)
    #and
    (.+?)     #with
    ([^<>]+?)

    in your regexes to get the desired results.

    HTH

    P.S. Mastering Regular Expressions is a must-read!
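The substitution suggested above can be tried with a short sketch like the one below. It assumes, as the reply does, that names never contain < or >; the sub names plane_names and general_terms and the shortened sample line are invented for the illustration.

```perl
use strict;
use warnings;

# Return the text of every <M>...</Mnnn> element (closing tag carries digits).
# [^<>]+ cannot cross a tag boundary, so a match can never start at an
# earlier <M> and run through an intervening </M>.
sub plane_names {
    my ($text) = @_;
    my @names = $text =~ m{<M>([^<>]+)</M\d+>}gi;
    return @names;
}

# Return the text of every <M>...</M> element (no digits in the closing tag).
sub general_terms {
    my ($text) = @_;
    my @terms = $text =~ m{<M>([^<>]+)</M>}gi;
    return @terms;
}

my $line = 'first <M>swept-wing</M> <M>bomber</M> was the '
         . '<M>B-47 Stratojet</M200>, then the <M>F-86 Sabre Jet</M201>.';

print "$_\n" for plane_names($line);
print "$_\n" for general_terms($line);
```

With a lazy (.*?) the engine is free to keep extending the match from the first <M> until it finds a numbered closing tag; the negated class takes that freedom away, which is why the quantifier no longer needs to be lazy at all.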

Re: Regexp Problem with greedy
by davido (Cardinal) on Sep 10, 2003 at 23:52 UTC
    The biggest problem you're facing is what to do if text that should match spans multiple lines. If it does, your version will fail.

    Second, you are single-quoting 'M' in the second regexp. That means the regexp is going to be looking for </'M'> when it should be looking for </M>.

    Third, (probably not a problem) < and > aren't metacharacters in regular expressions. You don't have to escape them.
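The second and third points can be checked with a small sketch; the variable names are arbitrary and the test string is just the closing tag on its own.

```perl
use strict;
use warnings;

my $tag = '</M>';

# Quotes are ordinary characters inside a regex, so this looks for </'M'>.
my $quoted  = $tag =~ m{</'M'>}  ? 1 : 0;

# Escaped and unescaped angle brackets match the same text.
my $escaped = $tag =~ m{\<\/M\>} ? 1 : 0;
my $plain   = $tag =~ m{</M>}    ? 1 : 0;

print "quoted=$quoted escaped=$escaped plain=$plain\n";
# prints: quoted=0 escaped=1 plain=1
```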

    Here is my own untested rewrite that might prove to work better.

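    {
        local $/ = "";    # paragraph mode
        while ( <IN> ) {
            # s/// returns a count, not the string, so map over a modified copy
            push @terms,    map { my $t = $_; $t =~ s/\n/ /g; $t } m|<M>(.+?)</M\d+>|gis;
            push @avionics, map { my $t = $_; $t =~ s/\n/ /g; $t } m|<M>(.+?)</M>|gis;
        }
    }
    {
        local $, = "\n";    # separator between printed elements
        local $\ = "\n";    # newline after each print, so the two lists do not run together
        print @terms;
        print @avionics;
    }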

    The theory here is: Setting $/ to "" sets "paragraph mode", where you read in a chunk at a time. This helps to alleviate the problem of having text span multiple lines. We didn't allow for newlines within the tag itself, though. Next, the regexes are evaluated in list context, and any matches captured in () are pushed onto @terms and @avionics. The /s modifier causes '.' to also match newline characters. The map function modifies the returned list by (in this case) substituting \n with a space. Finally, I set $, to "\n" so that your printout is one element per line. Modifications to $/ and $, were done "locally" so that they revert to their original values when the blocks end.

    I haven't tested the code yet, but it ought to do the trick.

    Dave

    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

Approved by VSarkiss