REGEX issue

by siva kumar (Pilgrim)
on May 28, 2007 at 10:07 UTC ( [id://617817]=perlquestion )

siva kumar has asked for the wisdom of the Perl Monks concerning the following question:

$str = "<story> <page> <description> desc </description> <image> notExpected.jpg </image> <headline> head </headline> </page> </story> <image>correctImage.jpg</image> ";
$str =~ m/[^<story>].*?<image>(.*?)<\/image>.*?[^<\/story>]/s;
print $1;

I want to extract the contents of the <image> tag, but only when that tag is not inside the <story> tag.
I.e., I want the output "correctImage.jpg" but I am getting "notExpected.jpg". Thanks in advance.

Replies are listed 'Best First'.
Re: REGEX issue
by FunkyMonk (Chancellor) on May 28, 2007 at 10:20 UTC
    It is very difficult to parse HTML using regular expressions. You will be better off using a module that understands HTML better than regexen. I usually use HTML::TreeBuilder.

    It is very difficult to parse XML using regular expressions. You will be better off using a module that understands XML better than regexen. I usually use XML::Simple.

    use XML::Simple;
    my $str = "<story> <page> <description> desc </description> <image> notExpected.jpg </image> <headline> head </headline> </page> </story> <image>correctImage.jpg</image>";
    my $xml = XMLin( "<wrapper>$str</wrapper>" );
    my $image = $xml->{image};
    print "$_\n" for ref( $image ) ? @{$image} : $image;

    Notice that an XML document must be contained within a single root element, hence the addition of <wrapper>...</wrapper>.

    You're not using [...] character classes correctly.

    • [ABC] matches any one of the letters A, B or C.
    • [^ABC] matches one character that isn't an A, B or C.
    See perlre, perlretut for more details.
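
    As a minimal sketch of the point above (the sample strings are made up for the demonstration), the following shows that [^<story>] consumes exactly one character that is not '<', 's', 't', 'o', 'r', 'y' or '>', rather than meaning "not the tag <story>":

```perl
use strict;
use warnings;

# [^<story>] is a negated character class: it matches ONE character
# that is not '<', 's', 't', 'o', 'r', 'y' or '>'.
my @samples = ('a', 's', '<', 'xyz');   # made-up sample strings
my %matched;
for my $s (@samples) {
    # Anchor the class so the whole string must be that single character.
    $matched{$s} = $s =~ /\A[^<story>]\z/ ? 1 : 0;
    print "'$s': ", ($matched{$s} ? "matches" : "no match"), "\n";
}

# Unanchored, the class simply picks up the first character outside the set.
my $str = "<story> x </story>";
my ($first) = $str =~ /([^<story>])/;
print "first match: '$first'\n";
```

    Only 'a' matches in the loop, and in the unanchored case the class skips over every character of "<story>" and settles on the space that follows it.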

    edit: I guess I wasn't properly awake. I didn't notice it was XML rather than HTML.
    edit^2: Added code

Re: REGEX issue
by blazar (Canon) on May 28, 2007 at 10:36 UTC
    $str = "<story> <page> <description> desc </description> <image> notExpected.jpg </image>

    Hmmm, HTML? XML?

    I want to parse the image value and the image tag should not be inside the story tag.

    Hmmm, parse! Super Search for HTML parse or XML parse and use the right tools.

Re: REGEX issue
by Jenda (Abbot) on May 28, 2007 at 14:26 UTC
Re: REGEX issue
by citromatik (Curate) on May 28, 2007 at 10:19 UTC

    That is because the "?" in ".*?" makes the ".*" pattern non-greedy, so it stops at the first <image> it can reach.

    For example, compare:

    $str = "taaaaaaaa";
    $str =~ /(ta+)/;  print "$1\n";   # greedy: prints "taaaaaaaa"
    $str =~ /(ta+?)/; print "$1\n";   # non-greedy: prints "ta"

    In your case...

    $str =~ m/[^<story>].*<image>(.*)<\/image>.*?[^<\/story>]/s; ## <- w/o the "?" should work

    citromatik

    UPDATE: I apologize for my answer: although my previous comment is correct and my regexp does the work, the concept of the regexp is totally incorrect. It should be expressed as:

    $str =~ s/\n//g;
    $str =~ m%<story>.*</story><image>(.*)</image>%;
    Nevertheless I agree with FunkyMonk: using XML::Simple would be the best solution.

Re: REGEX issue
by bart (Canon) on May 28, 2007 at 20:05 UTC
    $str =~ m/[^<story>]/
    That doesn't do what you think it does. What it does, is match "any character that isn't one of '<', 's', 't', 'o', 'r', 'y', '>'". What behaves more like what you're after, is (?!<story>) or (?<!<story>), i.e. negative lookahead or lookbehind. But even that wouldn't work, as it can still match a string containing "<story>", only not at that exact spot.
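
    A small sketch of that last point, using a shortened version of the original string (the shortening is only for illustration): the lookahead fails at position 0, so the engine just retries one character later and still captures the image inside the story.

```perl
use strict;
use warnings;

my $str = "<story> <image> notExpected.jpg </image> </story> <image>correctImage.jpg</image>";

# The negative lookahead only forbids "<story>" at the exact position where
# the match starts; the regex engine simply starts one character later.
my ($image) = $str =~ m/(?!<story>).*?<image>(.*?)<\/image>/s;
print "captured: '$image'\n";
```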

    What I would do, if using a full-blown XML/SGML/HTML parser is deemed unnecessary, is match both the substrings between "story" tags, and other substrings between "image" tags, in that order of matching precedence. And only keep the latter.

    Try this on for size:

    @images = grep defined, $str =~ m#<story>.*?</story>|<image>(.*?)</image>#gs;
    That'll match every correct image. If you're sure you want just one, you still need the /g modifier and the list context, but only keep the first proper match:
    my($image) = grep defined, $str =~ m#<story>.*?</story>|<image>(.*?)</image>#gs;
Re: REGEX issue
by Anonymous Monk on May 28, 2007 at 15:49 UTC
    You can probably use an alternation to handle this. Try the following:
    my $str = <<END_STRING;
    <story>
    <page>
    <description> desc </description>
    <image> notExpected.jpg </image>
    <headline> head </headline>
    </page>
    </story>
    <image>correctImage.jpg</image>
    END_STRING

    my $ptn = qr{<story>.*?</story>|<image>(.*?)</image>}s;
    my @out = ($str =~ /$ptn/g);
    print qq("$_"\n) for grep $_, @out;
    You will need to do more testing yourself, though.

    Regards,
    Xicheng
Re: REGEX issue
by sanPerl (Friar) on May 29, 2007 at 07:35 UTC
    I can suggest a different method.
    This will ensure that <image> is NOT embedded in any <story> tag.
    use strict;
    my $str = "<story> <page> <description> desc </description> <story>aaaaaaaaa</story> <image> notExpected.jpg </image> <headline> head </headline> </page> </story> <story>aaaaaaaaaaaa <image> notExpected.jpg </image> aa</story> <image>correctImage.jpg</image> ";
    while ($str =~ m/<image>(.*?)<\/image>/gs) {
        my ($image_content) = ($1);
        my $story_st = 0;
        my $story_en = 0;
        my $prelines = $`;
        while ($prelines =~ m/<story>/gs)   { $story_st++ }
        while ($prelines =~ m/<\/story>/gs) { $story_en++ }
        print $image_content, "\n" if ($story_st == $story_en);
    }
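
    The same counting idea can also be sketched as a single pass with a running depth counter, which avoids rescanning the prefix for every image and avoids $` (which, on older perls, slows down every regex in the program). This is only an illustration, not a replacement for a real parser; the shortened test string and the names $depth and @top_level_images are invented for the example.

```perl
use strict;
use warnings;

# Shortened, made-up variant of the test string, with a nested <story>.
my $str = "<story> <page> <story>aaa</story> <image> notExpected.jpg </image> </page> </story> "
        . "<story>aaa <image> notExpected.jpg </image> aa</story> "
        . "<image>correctImage.jpg</image> ";

my $depth = 0;          # number of currently open <story> elements
my @top_level_images;   # images seen while no <story> is open

# Walk the string once, looking only at the tags of interest.
while ($str =~ m{(<story>)|(</story>)|<image>(.*?)</image>}gs) {
    if    (defined $1)  { $depth++ }
    elsif (defined $2)  { $depth-- }
    elsif ($depth == 0) { push @top_level_images, $3 }
}

print "$_\n" for @top_level_images;
```

    Like the version above, this assumes the <story> tags are balanced and carry no attributes; anything messier is a job for an XML module.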
