Re: Did the inefficiency of /i get fixed?

The issue is chronicled in Mastering Regular Expressions (The Owls Book) by Jeoffrey Friedl. I still only have the first edition of MRE, but I dug it out to refresh my memory on the issue.

According to Friedl, /i causes a temporary copy to be made of the entire target string. This is in addition to any copies made by the dirty variables like $&. The /i copy is done prior to any match occurring, whereas $& only makes a copy after a successful match.

After the copy is made, the RE engine takes a second pass over the string, converting upper case letters to lower case, even if the original was already all lowercase.

As the RE itself is compiled, it too has all of its literals converted to lowercase.

So you have extra copies being made, target string lowercase conversion, a lowercasing of the RE itself during the compilation phase of the RE, and that all adds up to, as Friedl puts it, "one of the most gratuitous inefficiencies in all of Perl." (He is a RE guy though.)

In Friedl's testing, he found worst case scenarios that took over four orders of magnitude with regards to using the /i modifier on strings of 1192395 bytes (1.2MB). He calculated that due to the "needless copying", Perl shuffled more than 647,585MB around inside his CPU.

That was a worst case test. In what he considered more of a real-world test, he found that m/\b[wW][hH][iI][lL][eE]\b/g resulted in a testrun of fifty times faster than m/\bwhile\b/gi on a huge string.

His conclusion was, "don't use /i unless you really have to."

I've done a quick skim through the various perldelta documents and haven't seen any mention since 5.005 of an improved and more efficient /i modifier. That doesn't mean I couldn't have missed it; there's a lot to skim. Someone may correct me, but it looks like at least for now that part of the RE engine hasn't changed.

Dave

Comment on Re: Did the inefficiency of /i get fixed? Select or Download Code

Replies are listed 'Best First'.
Re: Re: Did the inefficiency of /i get fixed? by Fuzzy Frog (Sexton) on May 19, 2004 at 05:49 UTC
I do have a copy of the second edition. At the top of page 255 Friedl says Somewhat related, users of Perl that read the first edition of this book may sometimes write something like `^[Ff][Rr][Oo][Mm]:` instead of a case-insensitive use of `^from:`. Old versions of Perl were very ineffecient with their case-insensitive matching, so I recommended the use of classes like this in some cases. That recomendation has been lifted, as the case-insensitive inefficiency has been fixed for some years now. I have not read the first edition of the book, so i don't know whether it's worthwhile to upgrade. I can say that Mastering Regular Expressions is one of the most useful and readable reference manuals I have ever seen. It has my highest recomentation. -- Fuzzy Frog My "highest recomendation" may not be much, but it's all I have to give.	[reply] [d/l]
Re: Re: Did the inefficiency of /i get fixed? by diotalevi (Canon) on May 19, 2004 at 12:57 UTC
I wonder sometimes whether anyone things to go read the source to see what the answer is. /i appears to change such nodes as `EXACT` to `EXACTF` which for non-unicode is a "find the first character of the match, then call util.c:Perl_ibcmp which is a dainty little function which relies on a built-in case folding table. There's no implicit lc() and copy to a buffer going on here. If you have a utf8 regular expression then it does what Friedl's book suggests. If you merely have utf8 data then see utf8.c:Perl_ibcmp_utf8 which is long enough to deter me from wanting to read it at the moment but still doesn't appear to make a separate lc()'d copy of the data. This is all from looking at 5.8.3, btw.	[reply]