Re: Why doesn't non-greediness work?
by rinceWind (Monsignor) on May 10, 2003 at 12:34 UTC
To explain this requires an understanding of the regex engine and what it is attempting to do.
Basically, what you are doing can be simplified to the following:
my $body = 'img wink img smiley';
$body =~ s/img.+?smiley/:)/;
# $body now contains just :)
Non-greediness does not work backwards, only forwards. The substitution is looking for img, then something non-greedy, then smiley. The non-greedy ? merely means that, from a given starting position, the regex engine tries the shortest possible match first rather than the longest. Without the ?, the longest match is tried first.
Thus, starting from the first img, it DOES find a match, so it has no reason to try a later starting point, and it gobbles the whole string. What you want is something like this:
$body =~ s/<img([^>]+?)="Smiley">/:)/g;
$body =~ s/<img([^>]+?)="Wink">/;)/g;
This only allows non-'>' characters in the intervening text, so a match cannot run from one tag into the next.
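A minimal sketch of the difference, assuming a made-up two-tag string modelled on the original question (the sample markup and the variable names are mine, not the OP's):

```perl
use strict;
use warnings;

# Sample input is an assumption: a Wink tag first, then a Smiley tag.
my $html = '<img src="wink.gif" alt="Wink"> <img src="smiley.gif" alt="Smiley">';

# Non-greedy dot: the match still starts at the leftmost <img and grows
# until ="Smiley"> is found, so it swallows the Wink tag as well.
(my $dot = $html) =~ s/<img(.+?)="Smiley">/:)/g;

# Negated class: the intervening text cannot cross a '>', so each match
# stays inside a single tag.
(my $class = $html) =~ s/<img([^>]+?)="Smiley">/:)/g;

print "dot:   $dot\n";      # the whole string collapses to the smiley
print "class: $class\n";    # the Wink tag is left intact
```

The second form is still a pattern match on markup, not a parse, so it assumes no '>' appears inside an attribute value.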
Re: Why doesn't non-greediness work?
by perlplexer (Hermit) on May 10, 2003 at 12:26 UTC
What makes you think it doesn't work?
Think about it for a sec. /<img(.+?)="Smiley">/ first matches <img at the beginning of the line. Then it slurps everything until "Smiley">
You may ask, well, why doesn't it stop at "Wink"> ? Why should it? You specifically told it to look for "Smiley"> ;)
If you reverse the order in which you apply the s///-es, it'll work. In this particular case that is.
--perlplexer
(jeffa) Re: Why doesn't non-greediness work?
by jeffa (Bishop) on May 10, 2003 at 15:41 UTC
I love questions like this! (in a sadistic sort of way!)
They allow me to investigate more alternatives to using
regexes to parse *ML ... like my new favorite
XML::Twig. You really need to invest quite a bit
of time into these kinds of solutions, but the time is well
invested as it simply improves your overall programming
skills. Here is my take on the problem:
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(
twig_handlers => {
'img[@alt="Smiley"]' => sub {
XML::Twig::Elt->new('#PCDATA',':)')->replace($_)
},
'img[@alt="Wink"]' => sub {
XML::Twig::Elt->new('#PCDATA',';)')->replace($_)
},
},
pretty_print => 'indented',
);
$twig->parse(\*DATA);
$twig->flush;
__DATA__
<body>
<a href="wink.html">
<img border="0" src="/images/wink.gif" alt="Wink"/>
</a>
<a href="smile.html">
<img border="0" src="/images/smiley.gif" alt="Smiley"/>
</a>
</body>
It works, but I had to 'XML-ize' the image tags first. I
wrapped the img tags inside a tags simply to show that other
tags are output 'as-is'. Also, a big ++ to broquaint
for helping get this right. I was trying to create a new
XML::Twig::Elt object with 'CDATA' as the first
arg. This created a <CDATA> tag pair - broquaint
changed that to '#CDATA', which led me to the correct
argument ... '#PCDATA'. Confusing? Start studying!
;)
jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
|
|
++jeffa for that one, and not only for requiring correct xml-syntax ;-) , although
IMNSHO every new bit of html put to the net should be xhtml 1.\(0|1\)
regards,
tomte
Hlade's Law:
If you have a difficult task, give it to a lazy person --
they will find an easier way to do it.
|
|
And now of course I have to add my own take to it!
All you want to do is change some img tags, while leaving the rest of the file unchanged. This looks like a good opportunity to use twig_roots, which only builds the twig for the elements that have handlers, and the awfully named twig_print_outside_roots, which prints everything else in the document:
#!/usr/bin/perl -w
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(
twig_print_outside_roots => 1,
twig_roots => {
'img[@alt="Smiley"]' => sub { print q{:)} },
'img[@alt="Wink"]' => sub { print q{;)} },
},
);
$twig->parse(\*DATA);
__DATA__
<body>
<a href="wink.html"><img border="0" src="/images/wink.gif" alt="Wink"/></a>
<a href="wink.html"><img border="0" src="/images/wink.gif" alt="NotWink"/></a>
<a href="smile.html"><img border="0" src="/images/smiley.gif" alt="Smiley"/></a>
</body>
|
|
For the parser types among us, there are likely more suitable options to be had though.
use strict;
use warnings;
use HTML::TokeParser::Simple;
my %xlat = (
Smiley => ':)',
Wink => ';)',
);
my $p = HTML::TokeParser::Simple->new( \*DATA );
while ( my $t = $p->get_token ) {
if(
$t->is_start_tag('img')
and my $r = $xlat{$t->return_attr->{alt}}
) {
print $r;
}
else {
print $t->as_is;
}
}
__END__
<body>
<a href="wink.html">
<img border="0" src="/images/wink.gif" alt="Wink">
</a>
<a href="smile.html">
<img border="0" src="/images/smiley.gif" alt="Smiley">
</a>
</body>
Note this doesn't require XHTML.
Makeshifts last the longest.
Re: Why doesn't non-greediness work?
by halley (Prior) on May 10, 2003 at 12:55 UTC
All of the other folks have pointed out the right way to match just one HTML tag. I would recommend that you search "defensively" to avoid breaking if the HTML changes slightly, to help future programmers see what you're trying to do, and to make it easier to add obvious new extensions.
One, show that you're expecting the magic "Smiley" string in the ALT parameter.
Two, allow for other things to follow the magic string.
Three, search with the /i case insensitivity modifier.
Four, optionally, make the pattern a little more readable with the /x modifier and some whitespace.
Five, optionally, make a hash of possible magic strings and your desired emoticon replacements, to make new extensions very easy.
Six, optionally, comment on the intent of complex patterns with a very brief example.
my %emoticons = (
smiley => ':)',
wink => ';)',
);
# example: <img alt="smiley"> becomes :)
foreach my $e (keys %emoticons)
{
$body =~ s{
\< img
[^>]*?
alt = "$e"
[^>]*?
\>
}
{$emoticons{$e}}igex;
}
-- [ e d @ h a l l e y . c c ]
|
|
my %emoticons = (
smiley => ':)',
wink => ';)',
);
# example: <img alt="smiley"> becomes :)
$body =~ s{
(
\< img
[^>]*?
alt = "(.*?)"
[^>]*?
\>
)
}
{$emoticons{lc $2} || $1}igex;
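A self-contained sketch of the same single-pass idea, under a few assumptions of mine: the sample markup is invented, the alt value is matched with [^"]* instead of .*? so it cannot run past its closing quote, and the captured name is lowercased before the lookup because /i lets any case match while the hash keys are lowercase.

```perl
use strict;
use warnings;

my %emoticons = (
    smiley => ':)',
    wink   => ';)',
);

# Sample markup is an assumption; note the mixed-case alt values
# and the unrelated Logo image that must survive untouched.
my $body = '<img alt="Wink"> <img alt="Logo"> <img ALT="SMILEY" border="0">';

# $1 is the whole tag, $2 is the alt value. Unknown alt values fall
# back to $1, which puts the original tag back unchanged.
$body =~ s{
    ( < img \b [^>]*? \b alt = "([^"]*)" [^>]* > )
}{
    $emoticons{ lc $2 } || $1
}igex;

print "$body\n";
```

Because every img tag is visited exactly once, the order in which the emoticons appear in the text no longer matters.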
-- [ e d @ h a l l e y . c c ]
Re: Why doesn't non-greediness work?
by benn (Vicar) on May 10, 2003 at 12:26 UTC
For this specific case, you just need to swap those two substitutions round - the first one is grabbing everything from the first "<img" all the way through to "Smiley".
HTH Ben
Update: Ha! Beaten to the draw :) Notice we're both talking about this particular case though - if you want a generalised "swap my smilies anywhere", you'll need to rethink the regex. Cheers, Ben.
Re: Why doesn't non-greediness work?
by cciulla (Friar) on May 10, 2003 at 12:56 UTC
Update: Seriously beaten to the punch and the above answers are WAY better than mine. :)
Because the first sub is clobbering the second sub.
Check this out...
my $body = qq!<img border="0" src="/images/wink.gif" alt="Wink"> <img border="0" src="/images/smiley.gif" alt="Smiley">!;
print "Initial Value\t $body\n";
$body =~ s/<img(.+?)="Smiley">/:)/g;
print "First Sub\t $body\n";
$body =~ s/<img(.+?)="Wink">/;)/g;
print "Second Sub\t $body\n";
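Translating the first sub, ala Friedl:
find <img followed by one or more characters (as few as possible) followed by ="Smiley">
So, even though the quantifier is non-greedy, the match starts at the first <img and runs all the way to ="Smiley">, which is TOO much!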
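To see what the first pattern actually grabs, here is a small sketch that prints the capture and the offset where the match begins. The input is the same two-tag string as above; the variable names $captured and $start are just mine for illustration.

```perl
use strict;
use warnings;

# Same two-tag input as in the snippet above.
my $body = qq!<img border="0" src="/images/wink.gif" alt="Wink"> <img border="0" src="/images/smiley.gif" alt="Smiley">!;

my ( $captured, $start );
if ( $body =~ /<img(.+?)="Smiley">/ ) {
    # $1 is the shortest text that lets the match succeed from the
    # leftmost <img, and that text includes the whole Wink tag.
    $captured = $1;
    # $-[0] is the offset at which the overall match started.
    $start = $-[0];
    print "captured: [$captured]\n";
    print "match starts at offset ", $start, "\n";
}
```

The engine prefers the leftmost starting point over a shorter match further right, which is why the capture spans both tags.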
Cē
Re: Why doesn't non-greediness work?
by graff (Chancellor) on May 10, 2003 at 13:32 UTC
I think what you're after is something like this:
my $body = qq!<img border="0" src="/images/wink.gif" alt="Wink"> <img border="0" src="/images/smiley.gif" alt="Smiley">!;
my %replacer = ( "Smiley" => ":)",
"Wink" => ";)" );
$body =~ s/<img(?:.+?)="(Smiley|Wink)">/$replacer{$1}/g;
print "$body\n";
which prints: ";) :)" -- note that it uses "(?:.+?)" to cluster, not capture, the irrelevant characters that precede the "=".
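One caveat worth a quick sketch: with .+? the match can still run across tags when an unrelated image comes first. The input below is hypothetical (a Logo image ahead of the smiley), and the negated-class variant is borrowed from the other replies in this thread rather than from the code above.

```perl
use strict;
use warnings;

my %replacer = ( Smiley => ':)', Wink => ';)' );

# Hypothetical input: an unrelated image precedes the smiley.
my $body = '<img src="logo.gif" alt="Logo"> <img src="smiley.gif" alt="Smiley">';

# With .+? the match starts at the Logo tag and runs on to ="Smiley">.
(my $dot = $body) =~ s/<img(?:.+?)="(Smiley|Wink)">/$replacer{$1}/g;

# With a negated class the match cannot leave its own tag.
(my $class = $body) =~ s/<img[^>]+?="(Smiley|Wink)">/$replacer{$1}/g;

print "dot:   $dot\n";      # the Logo image is lost
print "class: $class\n";    # the Logo image survives
```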
Re: Why doesn't non-greediness work?
by kiat (Vicar) on May 10, 2003 at 12:40 UTC
#<img border="0" src="/images/wink.gif" alt="Wink">
# Match <img, followed by a minimum 38 and a
# maximum 50 of any characters before reaching
# the specific target word
$body =~ s/<img.{38,50}="Smiley">/:)/g;
$body =~ s/<img.{38,50}="Wink">/;)/g;
The above code gets me the desired output but I'm not sure if it's the right way to do it. It's overly dependent on the context.
Re: Why doesn't non-greediness work?
by kiat (Vicar) on May 10, 2003 at 12:30 UTC
I see, thanks :)
How do I modify the code to produce the desired output? Swapping solves the problem when Wink appears before Smiley, but what if Smiley appears before Wink? Is it possible to have code that does it whatever the order?
Re: Why doesn't non-greediness work?
by jonadab (Parson) on May 11, 2003 at 02:09 UTC
Your problem is perfectly suited to a solution involving
the application of quantum regular expression dynamics
(QRED). Your code as it stands attempts to match each
type of emoticon in turn; thus, it is jumping the gun
and testing the match before the match actually occurs,
but the test changes the match and forces it to happen
in a certain way -- which isn't what you want. Instead,
you want to write your regex so that it will match
either type of emoticon; thus, as it matches, it enters
a superposition of states wherein it is simultaneously
both a smiley and a winkie. *Then* you test which it
is, and at that point the waveform collapses to a
particle and you get your answer, which you can use
to decide which replacement text to use.
In other words, do what Ed Halley said.
{my$c;$ x=sub{++$c}}map{$ \.=$_->()}map{my$a=$_->[1];
sub{$a++ }}sort{_($a->[0 ])<=>_( $b->[0])}map{my@x=(&
$x( ),$ _) ;\ @x} split //, "rPcr t lhuJnhea eretk.as
o";print;sub _{ord(shift)*($=-++$^H)%(42-ord("\r"))};
Re: Why doesn't non-greediness work?
by gmpassos (Priest) on May 10, 2003 at 23:16 UTC
It's simple. Just change the dot "." to "[^>]":
$body =~ s/<img([^>]+?)="Smiley">/:)/g;
$body =~ s/<img([^>]+?)="Wink">/;)/g;
Graciliano M. P.
"The creativity is the expression of the liberty".
Re: Why doesn't non-greediness work?
by wolfger (Deacon) on May 13, 2003 at 01:01 UTC
$body =~ s/<img(.+?)="Wink">/;)/g;
Why doesn't the wink smiley captured?
Probably because perl is interpreting this as two commands...
$body =~ s/<img(.+?)="Wink">/;
)/g;
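For what it's worth, a quick check suggests perl does not split the statement there. This is only a sketch with a made-up one-tag input, but the ';' sits between the s/// delimiters, so it is treated as replacement text rather than as a statement terminator:

```perl
use strict;
use warnings;

# Made-up input: a single Wink tag.
my $body = '<img src="wink.gif" alt="Wink">';

# The semicolon inside the replacement part is plain text.
$body =~ s/<img(.+?)="Wink">/;)/g;

print "$body\n";
```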
The one serious conviction that a man should have is that nothing is to be taken too seriously.
-- Nicholas Butler
|
|
Yes, but notice that you escaped one of those semicolons...
In the original, the semicolon that was part of the smiley was not escaped.
The one serious conviction that a man should have is that nothing is to be taken too seriously.
-- Nicholas Butler