REGEX issue

siva kumar has asked for the wisdom of the Perl Monks concerning the following question:


Re: REGEX issue
by FunkyMonk (Chancellor) on May 28, 2007 at 10:20 UTC

~~It is very difficult to parse HTML using regular expressions. You will be better off using a module that understands HTML better than regexen. I usually use HTML::TreeBuilder.~~

It is very difficult to parse XML using regular expressions. You will be better off using a module that understands XML better than regexen. I usually use XML::Simple.

use XML::Simple;

my $str = "<story>
                <page>
                        <description> desc </description>
                        <image> notExpected.jpg </image>
                        <headline> head </headline>
                </page>
        </story>
        <image>correctImage.jpg</image>";

my $xml = XMLin( "<wrapper>$str</wrapper>" );

my $image = $xml->{image};
print "$_\n" for ref( $image ) ? @{$image} : $image;

Notice that XML must be contained within a single root tag. Hence the addition of <wrapper>...</wrapper>.

You're not using [...] character classes correctly.

[ABC] matches any one of the letters A, B or C.
[^ABC] matches one character that isn't an A, B or C.
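
To make the point about character classes concrete, here is a minimal sketch (the sample strings are made up for illustration, not taken from the original question). It shows that `[^<story>]` only asks for one character that is not `<`, `s`, `t`, `o`, `r`, `y` or `>`, and says nothing about the string `<story>` as a whole.

```perl
use strict;
use warnings;

# [^<story>] is a negated character class: it matches ONE character
# that is not '<', 's', 't', 'o', 'r', 'y' or '>'. It does not mean
# "not the string <story>".
my @samples = ('s', 'x', '<story>', '<storyx>');   # hypothetical test strings
my %matched;

for my $sample (@samples) {
    # Record 1 if any single character falls outside the class, else 0.
    $matched{$sample} = $sample =~ /[^<story>]/ ? 1 : 0;
    printf "%-10s %s\n", $sample, $matched{$sample} ? "matches" : "no match";
}
```

Only the strings containing a character outside the class (here the `x`) match, which is why the class cannot be used to exclude a whole tag.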

perlre

perlretut

edit: I guess I wasn't properly awake. I didn't notice it was XML rather than HTML.
edit^2: Added code


Re: REGEX issue
by blazar (Canon) on May 28, 2007 at 10:36 UTC

$str = "<story> <page> <description> desc </description> <image> notExpected.jpg </image>

Hmmm, HTML? XML?

I want to parse the image value and the image tag should not be inside the story tag.

Hmmm, parse! Super Search for HTML parse or XML parse and use the right™ tools.


Re: REGEX issue
by Jenda (Abbot) on May 28, 2007 at 14:26 UTC

use XML::Rules;

my $parser = XML::Rules->new(
 rules => [
  story => 'skip',
  image => sub {print $_[1]->{_content},"\n"},
 ]
);

$parser->parse( $the_xml);

Jenda
Support Denmark!
Defend the free world!


Re: REGEX issue
by citromatik (Curate) on May 28, 2007 at 10:19 UTC

That is because the "?" in ".*?" makes the ".*" pattern non-greedy, so it matches as few characters as possible.

For example, compare:

$str="taaaaaaaa";
$str=~/(ta+)/; print "$1\n";    # greedy: prints "taaaaaaaa"
$str=~/(ta+?)/; print "$1\n";   # non-greedy: prints "ta"

In your case...

$str =~ m/[^<story>].*<image>(.*)<\/image>.*?[^<\/story>]/s; ## <- w/o the "?" should work

citromatik

UPDATE: I apologize for my answer. Although my previous comment is correct and my regexp does the job, the concept behind the regexp is totally incorrect. It should be expressed as:

$str=~s/\n//g; 
$str =~ m%<story>.*</story><image>(.*)</image>%;


Re: REGEX issue
by bart (Canon) on May 28, 2007 at 20:05 UTC

$str =~ m/[^<story>]/

(?!<story>)

(?<!<story>)

What I would do, if using a full-blown XML/SGML/HTML parser is deemed unnecessary, is match both the substrings between "story" tags, and other substrings between "image" tags, in that order of matching precedence. And only keep the latter.

Try this on for size:

@images = grep defined, $str =~ m#<story>.*?</story>|<image>(.*?)</image>#gs;

my($image) = grep defined, $str =~ m#<story>.*?</story>|<image>(.*?)</image>#gs;


Re: REGEX issue
by Anonymous Monk on May 28, 2007 at 15:49 UTC

my $str = <<END_STRING;
<story>
  <page>
    <description> desc </description>
    <image> notExpected.jpg </image>
    <headline> head </headline>
  </page>
</story>
<image>correctImage.jpg</image>
END_STRING

my $ptn = qr{<story>.*?</story>|<image>(.*?)</image>}s;
my @out = ($str =~ /$ptn/g);
print qq("$_"\n) for grep $_, @out;


Re: REGEX issue
by sanPerl (Friar) on May 29, 2007 at 07:35 UTC

use strict;
my $str = "<story>
                <page>
                        <description> desc </description>
                        <story>aaaaaaaaa</story>
                        <image> notExpected.jpg </image>
                        <headline> head </headline>
                </page>
        </story>
        <story>aaaaaaaaaaaa

<image> notExpected.jpg </image>

aa</story>
        <image>correctImage.jpg</image> ";

while ($str =~ m/<image>(.*?)<\/image>/gs)
{
        my ($image_content) = ($1);
        my $story_st=0;
        my $story_en=0;
        my $prelines = $`;
        while ($prelines =~ m/<story>/gs){$story_st++}   
        while ($prelines =~ m/<\/story>/gs){$story_en++}
        print $image_content, "\n" if ($story_st == $story_en);       
}


