It is very difficult to parse HTML using regular expressions. You will be better off using a module that understands HTML better than regexen. I usually use HTML::TreeBuilder.
It is very difficult to parse XML using regular expressions. You will be better off using a module that understands XML better than regexen. I usually use XML::Simple.
use XML::Simple;
my $str = "<story>
<page>
<description> desc </description>
<image> notExpected.jpg </image>
<headline> head </headline>
</page>
</story>
<image>correctImage.jpg</image>";
my $xml = XMLin( "<wrapper>$str</wrapper>" );
my $image = $xml->{image};
print "$_\n" for ref( $image ) ? @{$image} : $image;
Notice that XML must be contained within a single root element, hence the addition of <wrapper>...</wrapper>.
You're not using [...] character classes correctly.
- [ABC] matches any one of the letters A, B or C.
- [^ABC] matches one character that isn't an A, B or C.
See perlre, perlretut for more details.
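A minimal sketch of what such a class really matches, assuming a made-up test string (the variable names are illustrative and not from the original post):

```perl
use strict;
use warnings;

# [^<story>] is a character class: it matches ONE character that is not
# '<', 's', 't', 'o', 'r', 'y' or '>'. It does not mean "not the tag <story>".
my $sample = "<story>x</story>";    # hypothetical input

# The first character outside that set is the 'x' after the opening tag.
my ($char) = $sample =~ /([^<story>])/;
print "first match: $char\n";

# The same class with its characters reordered behaves identically.
my ($same) = $sample =~ /([^>yrots<])/;
print "reordered: $same\n";
```

Both patterns capture the single character x, which shows that the order of the characters inside the brackets carries no meaning.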
edit: I guess I wasn't properly awake. I didn't notice it was XML rather than HTML.
edit^2: Added code
$str = "<story>
<page>
<description> desc </description>
<image> notExpected.jpg </image>
Hmmm, HTML? XML?
I want to parse the image value and the image tag should not be inside the story tag.
Hmmm, parse! Super Search for HTML parse or XML parse and use the right™ tools.
use XML::Rules;
my $parser = XML::Rules->new(
rules => [
story => 'skip',
image => sub {print $_[1]->{_content},"\n"},
]
);
$parser->parse( $the_xml);
$str="taaaaaaaa";
$str=~/(ta+)/; print "$1\n";   # greedy: prints the whole string
$str=~/(ta+?)/; print "$1\n";  # non-greedy: prints "ta"
In your case...
$str =~ m/[^<story>].*<image>(.*)<\/image>.*?[^<\/story>]/s;
## <- w/o the "?" should work
citromatik
UPDATE: I apologize for my answer: although my previous comment is correct and my regexp does the work, the concept behind the regexp is totally incorrect. It should be expressed as:
$str=~s/\n//g;
$str =~ m%<story>.*</story><image>(.*)</image>%;
Nevertheless I agree with FunkeyMonk, using XML::Simple would be the best solution.
$str =~ m/[^<story>]/
That doesn't do what you think it does. What it does, is match "any character that isn't one of '<', 's', 't', 'o', 'r', 'y', '>'". What behaves more like what you're after, is (?!<story>) or (?<!<story>), i.e. negative lookahead or lookbehind. But even that wouldn't work, as it can still match a string containing "<story>", only not at that exact spot.
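To illustrate why the lookahead alone is not enough, here is a small sketch on a made-up one-line document (the file names inner.jpg and outer.jpg are assumptions for the demo):

```perl
use strict;
use warnings;

# Hypothetical input: one image inside <story>, one outside.
my $doc = "<story><image>inner.jpg</image></story><image>outer.jpg</image>";

# (?!<story>) only asserts that "<story>" does not begin at the current
# position. The engine simply retries one character further on, where the
# assertion succeeds, and then scans forward to the first <image>.
my ($hit) = $doc =~ m{(?!<story>).*?<image>(.*?)</image>}s;
print "lookahead attempt captured: $hit\n";
```

The capture is the image nested inside the story, which is exactly the one that should have been excluded.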
What I would do, if using a full-blown XML/SGML/HTML parser is deemed unnecessary, is match both the substrings between "story" tags, and other substrings between "image" tags, in that order of matching precedence. And only keep the latter.
Try this on for size:
@images = grep defined, $str =~ m#<story>.*?</story>|<image>(.*?)</image>#gs;
That'll match every correct image. If you're sure you want just one, you still need the /g modifier and the list context, but only keep the first proper match:
my($image) = grep defined, $str =~ m#<story>.*?</story>|<image>(.*?)</image>#gs;
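For anyone wondering why the grep defined is needed, this sketch prints the raw list the match returns before filtering. The input string and file names are made up for the demo:

```perl
use strict;
use warnings;

# Hypothetical input: one image inside <story>, one outside.
my $doc = "<story><image>inner.jpg</image></story><image>outer.jpg</image>";

# In list context with /g, every match contributes its capture group:
# undef when the <story> branch matched, the file name when <image> did.
my @raw = $doc =~ m#<story>.*?</story>|<image>(.*?)</image>#gs;
print join(", ", map { defined $_ ? "'$_'" : "undef" } @raw), "\n";

# Dropping the undef entries leaves only images outside any <story>.
my @images = grep defined, @raw;
print "kept: @images\n";
```

The story branch consumes the nested image without capturing anything, so only the outer image survives the filter.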
You can probably use an alternation to handle this. Try the following:
my $str = <<END_STRING;
<story>
<page>
<description> desc </description>
<image> notExpected.jpg </image>
<headline> head </headline>
</page>
</story>
<image>correctImage.jpg</image>
END_STRING
my $ptn = qr{<story>.*?</story>|<image>(.*?)</image>}s;
my @out = ($str =~ /$ptn/g);
print qq("$_"\n) for grep defined, @out;   # keep only matches from the <image> branch
You will need to do more testing yourself, though.
Regards,
Xicheng
I can suggest a different method. This will ensure that <image> is NOT embedded in any <story> tag.
use strict;
my $str = "<story>
<page>
<description> desc </description>
<story>aaaaaaaaa</story>
<image> notExpected.jpg </image>
<headline> head </headline>
</page>
</story>
<story>aaaaaaaaaaaa
<image> notExpected.jpg </image>
aa</story>
<image>correctImage.jpg</image> ";
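# Visit every <image> element, wherever it occurs.
while ($str =~ m/<image>(.*?)<\/image>/gs)
{
my ($image_content) = ($1);
my $story_st=0;
my $story_en=0;
# Copy the text before this match now; later matches overwrite $`.
my $prelines = $`;
# Count the <story> tags opened and closed before this image.
while ($prelines =~ m/<story>/gs){$story_st++}
while ($prelines =~ m/<\/story>/gs){$story_en++}
# Equal counts mean every story is closed, so the image is outside them all.
print $image_content, "\n" if ($story_st == $story_en);
}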