kiat has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I'm trying out the code below to replace smiley images with their corresponding ASCII characters. The non-greediness doesn't seem to work:
    my $body = qq!<img border="0" src="/images/wink.gif" alt="Wink"> <img border="0" src="/images/smiley.gif" alt="Smiley">!;
    $body =~ s/<img(.+?)="Smiley">/:)/g;
    $body =~ s/<img(.+?)="Wink">/;)/g;
    print "$body\n";
    # Actual output:  :)
    # Desired output: ;) :)
Why isn't the wink smiley captured?

I look forward to reading your solutions :)

Thanks in anticipation.

Replies are listed 'Best First'.
Re: Why doesn't non-greediness work?
by rinceWind (Monsignor) on May 10, 2003 at 12:34 UTC
    To explain this requires an understanding of the regex engine and what it is attempting to do. Basically, what you are doing can be simplified to the following:
        my $body = 'img wink img smiley';
        $body =~ s/img.+?smiley/:)/;
        # $body now contains just :)
    Non-greediness does not work backwards, only forwards. The substitution is looking for img (something non-greedy) smiley. The non-greedy ? merely means that, from a given starting position, the regex engine tries the shortest possible match first, not the longest. Without the ?, the longest possible match is tried first.

    Thus, starting from the first img, it DOES find a match, so it has no reason to try a later starting position, and it gobbles the whole string.

    What you want is something like this:

        $body =~ s/<img([^>]+?)="Smiley">/:)/g;
        $body =~ s/<img([^>]+?)="Wink">/;)/g;
    only looking for non '>' characters in your intervening text.
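
A minimal sketch of the difference, using the two tags from the original question. The variable names are illustrative only; the patterns are the ones quoted above.

```perl
use strict;
use warnings;

my $html = qq!<img border="0" src="/images/wink.gif" alt="Wink"> !
         . qq!<img border="0" src="/images/smiley.gif" alt="Smiley">!;

# .+? is lazy, but the match still begins at the leftmost <img, so it
# expands across the first tag until ="Smiley"> finally matches.
(my $dot = $html) =~ s/<img(.+?)="Smiley">/:)/g;

# [^>]+? cannot cross a tag boundary, so the attempt at the first <img
# fails and the engine moves on to the second one.
(my $negated = $html) =~ s/<img([^>]+?)="Smiley">/:)/g;

print "dot:     $dot\n";
print "negated: $negated\n";
```

With the dot, the whole string collapses into one replacement; with the negated class, only the Smiley tag is replaced and the Wink tag is left for the second substitution.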
Re: Why doesn't non-greediness work?
by perlplexer (Hermit) on May 10, 2003 at 12:26 UTC
    What makes you think it doesn't work?
    Think about it for a sec. /<img(.+?)="Smiley">/ first matches <img at the beginning of the line.
    Then it slurps everything until "Smiley">
    You may ask, well, why doesn't it stop at "Wink"> ? Why should it? You specifically told it to look for "Smiley"> ;)

    If you reverse the order in which you apply the s///-es, it'll work. In this particular case that is.

    --perlplexer
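
The "in this particular case" caveat can be checked with a short sketch. The helper name wink_then_smiley and the shortened tags are assumptions made for illustration; the two substitutions are the original ones, applied in reversed order.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the two original substitutions, Wink first.
sub wink_then_smiley {
    my ($text) = @_;
    $text =~ s/<img(.+?)="Wink">/;)/g;
    $text =~ s/<img(.+?)="Smiley">/:)/g;
    return $text;
}

my $wink   = '<img src="/images/wink.gif" alt="Wink">';
my $smiley = '<img src="/images/smiley.gif" alt="Smiley">';

my $wink_first   = wink_then_smiley("$wink $smiley");
my $smiley_first = wink_then_smiley("$smiley $wink");

print "wink first:   $wink_first\n";    # both tags are replaced
print "smiley first: $smiley_first\n";  # the Wink pattern swallows both tags
```

Swapping the substitutions only moves the problem: whichever pattern runs first will swallow any earlier tag it has to cross to reach its target.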
(jeffa) Re: Why doesn't non-greediness work?
by jeffa (Bishop) on May 10, 2003 at 15:41 UTC
    I love questions like this! (in a sadistic sort of way!) They allow me to investigate more alternatives to using regexes to parse *ML ... like my new favorite XML::Twig. You really need to invest quite a bit of time into these kinds of solutions, but the time is well invested as it simply improves your overall programming skills. Here is my take on the problem:
        use strict;
        use warnings;
        use XML::Twig;

        my $twig = XML::Twig->new(
            twig_handlers => {
                'img[@alt="Smiley"]' => sub {
                    XML::Twig::Elt->new('#PCDATA', ':)')->replace($_)
                },
                'img[@alt="Wink"]' => sub {
                    XML::Twig::Elt->new('#PCDATA', ';)')->replace($_)
                },
            },
            pretty_print => 'indented',
        );
        $twig->parse(\*DATA);
        $twig->flush;

        __DATA__
        <body>
          <a href="wink.html">
            <img border="0" src="/images/wink.gif" alt="Wink"/>
          </a>
          <a href="smile.html">
            <img border="0" src="/images/smiley.gif" alt="Smiley"/>
          </a>
        </body>
    It works, but I had to 'XML-ize' the image tags first. I wrapped the img tags inside a tags simply to show that other tags are output 'as-is'. Also, a big ++ to broquaint for helping get this right. I was trying to create a new XML::Twig::Elt object with 'CDATA' as the first arg. This created a <CDATA> tag pair - broquaint changed that to '#CDATA', which led me to the correct argument ... '#PCDATA'. Confusing? Start studying! ;)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    

      ++jeffa for that one, and not only for requiring correct XML syntax ;-), although IMNSHO every new bit of HTML put on the net should be XHTML 1.(0|1)

      regards,
      tomte


      Hlade's Law:

      If you have a difficult task, give it to a lazy person --
      they will find an easier way to do it.

      And now of course I have to add my own take to it!

      All you want to do is change some img tags, while leaving the rest of the file unchanged. This looks like a good opportunity to use twig_roots, which only builds the twig for the elements that have handlers, and the awfully named twig_print_outside_roots, that prints everything else in the document:

          #!/usr/bin/perl -w
          use strict;
          use warnings;
          use XML::Twig;

          my $twig = XML::Twig->new(
              twig_print_outside_roots => 1,
              twig_roots => {
                  'img[@alt="Smiley"]' => sub { print q{:)} },
                  'img[@alt="Wink"]'   => sub { print q{;)} },
              },
          );
          $twig->parse(\*DATA);

          __DATA__
          <body>
          <a href="wink.html"><img border="0" src="/images/wink.gif" alt="Wink"/></a>
          <a href="wink.html"><img border="0" src="/images/wink.gif" alt="NotWink"/></a>
          <a href="smile.html"><img border="0" src="/images/smiley.gif" alt="Smiley"/></a>
          </body>
      For the parser types among us, there are likely more suitable options to be had though.
          use strict;
          use warnings;
          use HTML::TokeParser::Simple;

          my %xlat = (
              Smiley => ':)',
              Wink   => ';)',
          );

          my $p = HTML::TokeParser::Simple->new( \*DATA );
          while ( my $t = $p->get_token ) {
              if( $t->is_start_tag('img') and my $r = $xlat{$t->return_attr->{alt}} ) {
                  print $r;
              }
              else {
                  print $t->as_is;
              }
          }

          __END__
          <body>
            <a href="wink.html">
              <img border="0" src="/images/wink.gif" alt="Wink">
            </a>
            <a href="smile.html">
              <img border="0" src="/images/smiley.gif" alt="Smiley">
            </a>
          </body>
      Note this doesn't require XHTML.

      Makeshifts last the longest.

Re: Why doesn't non-greediness work?
by halley (Prior) on May 10, 2003 at 12:55 UTC

    All of the other folks have pointed out the right way to match just one HTML tag. I would recommend that you search "defensively" to avoid breaking if the HTML changes slightly, to help future programmers see what you're trying to do, and to make it easier to add obvious new extensions.

    One, show that you're expecting the magic "Smiley" string in the ALT parameter.

    Two, allow for other things to follow the magic string.

    Three, search with the /i case insensitivity modifier.

    Four, optionally, make the pattern a little more readable with the /x modifier and some whitespace.

    Five, optionally, make a hash of possible magic strings and your desired emoticon replacements, to make new extensions very easy.

    Six, optionally, comment on the intent of complex patterns with a very brief example.

        my %emoticons = (
            smiley => ':)',
            wink   => ';)',
        );
        # example: <img alt="smiley"> becomes :)
        foreach my $e (keys %emoticons) {
            $body =~ s{ \< img [^>]*? alt = "$e" [^>]*? \> }
                      {$emoticons{$e}}igex;
        }

    --
    [ e d @ h a l l e y . c c ]

      Simpler yet.
          my %emoticons = (
              smiley => ':)',
              wink   => ';)',
          );
          # example: <img alt="smiley"> becomes :)
          # lc() is needed: /i lets alt="Smiley" match, but the hash keys are lowercase
          $body =~ s{ ( \< img [^>]*? alt = "(.*?)" [^>]*? \> ) }
                    {$emoticons{lc $2} || $1}igex;

      --
      [ e d @ h a l l e y . c c ]

Re: Why doesn't non-greediness work?
by benn (Vicar) on May 10, 2003 at 12:26 UTC
    For this specific case, you just need to swap those two substitutions round - the first one is grabbing everything from the first "<img" all the way through to "Smiley".

    HTH
    Ben

    Update Ha! - beaten to the draw :) Notice we're both talking about this particular case though - if you want a generalised "swap my smilies anywhere", you'll need to rethink the regex.

    Cheers,Ben.

Re: Why doesn't non-greediness work?
by cciulla (Friar) on May 10, 2003 at 12:56 UTC

    Update: Seriously beaten to the punch and the above answers are WAY better than mine. :)

    Because the first sub is clobbering the second sub.

    Check this out...

        my $body = qq!<img border="0" src="/images/wink.gif" alt="Wink"> <img border="0" src="/images/smiley.gif" alt="Smiley">!;
        print "Initial Value\t $body\n";
        $body =~ s/<img(.+?)="Smiley">/:)/g;
        print "First Sub\t $body\n";
        $body =~ s/<img(.+?)="Wink">/;)/g;
        print "Second Sub\t $body\n";

    Translating the first sub, à la Friedl:

    find <img, followed by one or more characters (as few as possible), followed by ="Smiley">

    So, it's being TOO greedy!

Re: Why doesn't non-greediness work?
by graff (Chancellor) on May 10, 2003 at 13:32 UTC
    I think what you're after is something like this:
        my $body = qq!<img border="0" src="/images/wink.gif" alt="Wink"> <img border="0" src="/images/smiley.gif" alt="Smiley">!;
        my %replacer = ( "Smiley" => ":)", "Wink" => ";)" );
        $body =~ s/<img(?:.+?)="(Smiley|Wink)">/$replacer{$1}/g;
        print "$body\n";
    which prints: ";) :)" -- note that it uses "(?:.+?)" to cluster, not capture, the irrelevant characters that precede the "=".
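
A small sketch of what clustering changes, under the assumption that a single tag from the question is matched in list context. The array names are made up for the example.

```perl
use strict;
use warnings;

my $tag = '<img border="0" src="/images/wink.gif" alt="Wink">';

# (...) captures: the filler text becomes $1 and the alt value $2.
my @captured = $tag =~ /<img(.+?)="(Smiley|Wink)">/;

# (?:...) only groups: the alt value is now the first capture, $1.
my @clustered = $tag =~ /<img(?:.+?)="(Smiley|Wink)">/;

print scalar(@captured),  " captures: @captured\n";
print scalar(@clustered), " capture: @clustered\n";
```

Because the filler is not captured, the replacement can look up $1 directly in the hash.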
Re: Why doesn't non-greediness work?
by kiat (Vicar) on May 10, 2003 at 12:40 UTC
    My own solution is this:
        # <img border="0" src="/images/wink.gif" alt="Wink">
        # Match <img, followed by a minimum of 38 and a
        # maximum of 50 of any character before reaching
        # the specific target word
        $body =~ s/<img.{38,50}="Smiley">/:)/g;
        $body =~ s/<img.{38,50}="Wink">/;)/g;
    The above code gets me the desired output but I'm not sure if it's the right way to do it. It's overly dependent on the context.
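
The context dependence can be made concrete with a short sketch. The tag with the shorter image path is a hypothetical variant, not something from the original page; the negated-class pattern is the one suggested elsewhere in this thread.

```perl
use strict;
use warnings;

# The tag from the question, and a hypothetical variant with a shorter path.
my $original = '<img border="0" src="/images/wink.gif" alt="Wink">';
my $shorter  = '<img border="0" src="/i/wink.gif" alt="Wink">';

# Fixed-width version: relies on 38 to 50 characters of filler.
(my $fixed_ok  = $original) =~ s/<img.{38,50}="Wink">/;)/g;
(my $fixed_bad = $shorter)  =~ s/<img.{38,50}="Wink">/;)/g;

# Negated-class version: any filler that stays inside one tag.
(my $negated = $shorter) =~ s/<img[^>]+?="Wink">/;)/g;

print "fixed width, original tag: $fixed_ok\n";
print "fixed width, shorter tag:  $fixed_bad\n";
print "negated class, shorter:    $negated\n";
```

The fixed-width pattern silently stops matching as soon as the attributes shrink below 38 characters, whereas the negated class does not depend on the length of the filler.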
Re: Why doesn't non-greediness work?
by kiat (Vicar) on May 10, 2003 at 12:30 UTC
    I see, thanks :)

    How do I modify the code to produce the desired output? Swapping solves the problem when Wink appears before Smiley, but what if Smiley appears before Wink? Is it possible to write code that works whatever the order?
Re: Why doesn't non-greediness work?
by jonadab (Parson) on May 11, 2003 at 02:09 UTC

    Your problem is perfectly suited to a solution involving the application of quantum regular expression dynamics (QRED). Your code as it stands attempts to match each type of emoticon in turn; thus, it is jumping the gun and testing the match before the match actually occurs, but the test changes the match and forces it to happen in a certain way -- which isn't what you want. Instead, you want to write your regex so that it will match either type of emoticon; thus, as it matches, it enters a superposition of states wherein it is simultaneously both a smiley and a winkie. *Then* you test which it is, and at that point the waveform collapses to a particle and you get your answer, which you can use to decide which replacement text to use.

    In other words, do what Ed Halley said.


    {my$c;$ x=sub{++$c}}map{$ \.=$_->()}map{my$a=$_->[1]; sub{$a++ }}sort{_($a->[0 ])<=>_( $b->[0])}map{my@x=(& $x( ),$ _) ;\ @x} split //, "rPcr t lhuJnhea eretk.as o";print;sub _{ord(shift)*($=-++$^H)%(42-ord("\r"))};
Re: Why doesn't non-greediness work?
by gmpassos (Priest) on May 10, 2003 at 23:16 UTC
    It's simple. Just change the dot "." to "[^>]":
        $body =~ s/<img([^>]+?)="Smiley">/:)/g;
        $body =~ s/<img([^>]+?)="Wink">/;)/g;

    Graciliano M. P.
    "The creativity is the expression of the liberty".

Re: Why doesn't non-greediness work?
by wolfger (Deacon) on May 13, 2003 at 01:01 UTC
    $body =~ s/<img(.+?)="Wink">/;)/g;

    Why doesn't the wink smiley captured?

    Probably because perl is interpreting this as two commands...
    $body =~ s/<img(.+?)="Wink">/; )/g;
    The one serious conviction that a man should have is that nothing is to be taken too seriously. -- Nicholas Butler
      Actually, perl is smart enough to avoid that. In fact, you can do something as disturbing as

      $body =~ s;<img(.+?)="Wink">;\;;g;

      which is why they say only perl can parse perl.

        Yes, but notice that you escaped one of those semicolons... In the original, the semicolon that was part of the smiley was not escaped.

        The one serious conviction that a man should have is that nothing is to be taken too seriously. -- Nicholas Butler
      No it isn't. )/g; is not valid Perl so you'd get an error message about that in that case. perl is very smart about inferring the correct delimiter in the overwhelming majority of cases anyway; please check your assumptions next time.

      Makeshifts last the longest.
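
A quick way to check the delimiter question is to run both forms. This is a sketch with a shortened tag and illustrative variable names; the second substitution uses ';' as the delimiter and escapes the literal semicolon in the replacement.

```perl
use strict;
use warnings;

my $tag = '<img src="/images/wink.gif" alt="Wink">';

# A semicolon inside the replacement is just text: the s/// operator is
# read up to its closing delimiter before statement boundaries matter.
(my $slashes = $tag) =~ s/<img(.+?)="Wink">/;)/g;

# With ';' itself as the delimiter, a literal semicolon must be escaped.
(my $semis = $tag) =~ s;<img(.+?)="Wink">;\;);g;

print "slash delimiters:     $slashes\n";
print "semicolon delimiters: $semis\n";
```

Both forms compile and give the same result, so the unescaped semicolon in the original substitution is not what kept the wink from being replaced.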