RE greedyness

huck has asked for the wisdom of the Perl Monks concerning the following question:

While new to PerlMonks, I'm not new to Perl, having used even bigperl on Win95. Yet I still find regular expressions hard to understand sometimes. Below is a case I would like some input on:


my               $content1='&#39;&#22 x';
                 $content1=~s/\&(\#\d*[^;\d]+)/\&amp;$1/gs; 
print '1:'.$content1."\n"; 

my               $content2='&#39;&#22 x';
                 $content2=~s/\&(\#\d*[^;]+)/\&amp;$1/gs; 
print '2:'.$content2."\n";

Result

1:&#39;&amp;#22 x
2:&amp;#39;&amp;#22 x

I first wrote the content2 code, expecting \d* to be greedy, and was surprised to find it was not greedy enough. My fix is the content1 code, and I am happy that it works and that I figured it out. But I still can't understand why the \d* in content2 was not greedy enough to capture all of the '39', instead leaving the '9' to be matched as NOT ';'. Can someone enlighten me, please?

This is perl 5, version 20, subversion 1 (v5.20.1) built for MSWin32-x86-multi-thread-64int
(with 1 registered patch, see perl -V for more detail)

Copyright 1987-2014, Larry Wall

Binary build 2000 [298557] provided by ActiveState http://www.ActiveState.com
Built Oct 15 2014 22:10:49

Now in all reality the &#22 shouldn't be there. This is from text that has been HTML-encoded by the Yahoo Groups API and returned within JSON content. The user did expect to see &#22 on the page, and for some reason Yahoo did not encode the & to &amp;. I'm just trying to work around it.

Thank you for any explanations


Replies are listed 'Best First'.

Re: RE greediness
by Athanasius (Archbishop) on Nov 03, 2016 at 09:02 UTC

Hello huck, and welcome to the Monastery!

But i still cant understand why the \d* in content2 was not greedy enough to capture all of the '39', instead taking the '9' to be NOT ';' .

Actually, that’s exactly what it did do, at first. With \d* holding 39, the next character is ;, which [^;]+ cannot match, so that attempt failed. So, as Corion says, the regex had to give up the last digit and try again: \d* matches 3 and [^;]+ matches 9, which succeeds. “Greedy” means as greedy as possible while still allowing the overall match. This is explained in detail in the Camel Book (which I don’t have to hand) in a section called “The Little Engine That /Could(n’t)?/”.
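One way to watch the backtracking happen is to capture the two pieces separately. This is only a small sketch: the variable names $text, $digits and $rest are made up for illustration, and the pattern is the content2 regex with an extra pair of parentheses added.

```perl
use strict;
use warnings;

my $text = '&#39;&#22 x';

# Same pattern as content2, but with \d* and [^;]+ captured separately
# so we can see how the engine divided "39" between them.
my ($digits, $rest) = $text =~ /\&\#(\d*)([^;]+)/;

# \d* first grabs "39", then has to hand the "9" back to [^;]+.
print "digits='$digits' rest='$rest'\n";   # digits='3' rest='9'
```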

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re: RE greedyness
by Corion (Patriarch) on Nov 03, 2016 at 08:56 UTC

The \d* is greedy, but the regex engine will try its best to make the whole regular expression match. To satisfy the [^;]+, at least one of the digits has to be given back.

While trying to fix your first approach using possessive matching, it became clear to me that I don't really know what your end goal is. Is it to fix just &#22 to &amp;#22, leaving everything else alone? Then my approach below does that, by making the \d* never give anything back:

my               $content1='&#39;&#22 x';
                 $content1=~s/\&(\#(?>\d*)[^;]+)/\&amp;$1/gs;

If you wanted something else, I have misunderstood you - please explain with some more examples of input and output what should happen.
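For comparison, the atomic group (?>\d*) used above can also be written with the possessive quantifier \d*+, which is available from Perl 5.10 onward. The sketch below runs both forms on the sample string; the variable names $atomic and $possessive are chosen only for this illustration.

```perl
use strict;
use warnings;

my $atomic     = '&#39;&#22 x';
my $possessive = '&#39;&#22 x';

# Atomic group: once \d* has matched, it never gives digits back.
$atomic     =~ s/\&(\#(?>\d*)[^;]+)/\&amp;$1/gs;

# Possessive quantifier: the same "no giving back" behaviour.
$possessive =~ s/\&(\#\d*+[^;]+)/\&amp;$1/gs;

print "atomic:     $atomic\n";       # atomic:     &#39;&amp;#22 x
print "possessive: $possessive\n";   # possessive: &#39;&amp;#22 x
```

Both leave the properly terminated &#39; alone, because \d* keeps 39 and [^;]+ then has nothing it can match before the ;.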


Re^2: RE greedyness

by huck (Prior) on Nov 03, 2016 at 09:22 UTC

Thank you, I didn't understand that part about possessive matching and giving back.

To better explain: the HTML &#dd; encoding string should always terminate with the ; (as I understand it). I wanted to change any sequences of &#dd that did not terminate with the ; to &amp;#dd. My next step was to run $content through decode_entities from HTML::Entities and then decode from JSON; it seems decode_entities was OK with &#22, and then decode gave me

 
JSON:invalid character encountered while parsing JSON string, at character offset 7279 (before "\x{22}
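The goal described above (escape the & of any &#dd that is not terminated by a ;) could also be sketched with a negative lookahead instead of [^;]+. This is an alternative, not the code from the thread, and the helper name escape_unterminated_refs is hypothetical. Unlike [^;]+, the lookahead does not need a following character, so it also catches an unterminated reference at the very end of the string.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "&#digits" into "&amp;#digits" unless the
# digits are immediately followed by ";".
sub escape_unterminated_refs {
    my ($text) = @_;
    # (?![\d;]) rejects a shortened digit run (next char is a digit)
    # as well as a properly terminated reference (next char is ";").
    $text =~ s/&(#\d+)(?![\d;])/&amp;$1/g;
    return $text;
}

for my $sample ('&#39;&#22 x', 'ends with &#22', '&#34;ok&#34;') {
    print "$sample -> ", escape_unterminated_refs($sample), "\n";
}
# &#39;&#22 x -> &#39;&amp;#22 x
# ends with &#22 -> ends with &amp;#22
# &#34;ok&#34; -> &#34;ok&#34;
```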
