(Ovid - Regex efficiency issues) RE(3): A two-liner for Backtracking for substitutions

There are a few problems with your code. However, if I write a book, I'm going to dedicate it to:

#!/usr/bin/perl -w
use strict;

Those two little lines have saved me more trouble than you can possibly imagine and I would strongly recommend that you incorporate them. Admittedly, you just posted what you did for testing purposes, but I still have this "knee jerk" reaction regarding anything without the -w switch or use strict.
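
As a rough illustration of the kind of trouble those two lines save, here is a minimal sketch in which a misspelled variable name is caught at compile time. The variable names ($total, $totl) are made up for this example, and the faulty code sits inside a string eval only so that the surrounding script still runs and can report the error.

```perl
#!/usr/bin/perl -w
use strict;

# The typo lives in a string eval so this script can compile and show the error.
my $result = eval q{
    my $total = 10;
    return $totl + 1;    # typo: $totl was never declared
};
my $error = $@;

# Under strict, the undeclared variable is a compile-time error, not a silent zero.
print "strict caught: $error" if $error;
```

Without use strict, $totl would quietly be treated as an undefined package variable and the expression would evaluate to 1.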

Your first regex can be made a bit more efficient (and accurate) by eliminating the .*, anchoring the match to the beginning of the line with ^, and using the /m switch:

$mydata =~ s/^([\w\s]+)\s([\w]+)\s(0000)/$1,$2,$3/mg;

I haven't actually benchmarked this, but I'd bet good money that this is the case. See Death to Dot Star! for information on why .* is problematic. The accuracy issue is probably a moot point if you have relatively clean data.
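
To see what the anchored version does, here is a small sketch run against made-up sample data. The real input isn't shown in this thread, so the two-line record layout below (name, department, 0000, time, 0000) is purely an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample records; the original poster's data may look different.
my $mydata = "Alice Smith Engineering 0000 12:30 0000\n"
           . "Bob Jones Sales 0000 09:15 0000\n";

# With /m, ^ anchors at the start of every line; /g repeats the substitution per line.
(my $out = $mydata) =~ s/^([\w\s]+)\s([\w]+)\s(0000)/$1,$2,$3/mg;

print $out;
# Alice Smith,Engineering,0000 12:30 0000
# Bob Jones,Sales,0000 09:15 0000
```

One thing to keep in mind: \s also matches a newline, so in this sample it is the colon in the time field that keeps [\w\s]+ from running on into the next line. Data without such a delimiter may need a stricter character class.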

The second regex has two issues. You forgot to put parentheses around the \s0000. Those parentheses were supposed to capture this data and substitute it back using $2. I just changed it to the following:

$mydata =~ s/(:\d\d)\s0000/$1, 0000/g;
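
Here is a short sketch of the difference, again on made-up sample data. The "broken" substitution is only a guess at the shape of the original (it refers to $2 without a second set of parentheses) and is not a copy of the posted code.

```perl
use strict;
use warnings;

# Hypothetical sample data: a time followed by the 0000 field.
my $sample = "12:30 0000\n09:15 0000\n";

# Guessed shape of the problem: $2 is used, but only one group captures anything.
my $broken = $sample;
{
    no warnings 'uninitialized';    # $2 is undef here; -w would normally warn about it
    $broken =~ s/(:\d\d)\s0000/$1,$2/g;
}

# The corrected version: capture the time, write the literal 0000 back.
my $fixed = $sample;
$fixed =~ s/(:\d\d)\s0000/$1, 0000/g;

print "broken: $broken";    # the 0000 field is lost
print "fixed:  $fixed";     # 12:30, 0000 and 09:15, 0000
```

Running the guessed version with warnings enabled is a good way to spot this kind of mistake, since perl complains about the uninitialized $2.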

The other problem is really just a minor efficiency issue: \d{2} is better written as \d\d (this is from MRE, so it may be out of date for newer regex engines). Basically, when you use \d{2}, the regex engine is forced to keep track of the number of instances of \d. This slows it down just a tad (which can be significant when iterating over a large amount of data). However, when the regex engine sees \d\d, it just matches each instance of \d, which is faster.
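
Since that advice may be dated, it is worth measuring on your own perl. Below is a minimal sketch using the core Benchmark module; the test string is generated just for this example, and the labels 'braces' and 'plain' are arbitrary. No particular result is implied, as the difference may well be negligible on a modern regex engine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up test string: 200 "HH:MM 0000" fields separated by spaces.
my $line = join ' ', map { sprintf '%02d:%02d 0000', $_ % 24, $_ % 60 } 1 .. 200;

# Sanity check: both patterns should find the same number of matches.
my $count_braces = () = $line =~ /:\d{2}\s0000/g;
my $count_plain  = () = $line =~ /:\d\d\s0000/g;
print "matches: braces=$count_braces plain=$count_plain\n";

# A negative count tells Benchmark to run each sub for at least that many CPU seconds.
cmpthese(-1, {
    'braces' => sub { my $n = () = $line =~ /:\d{2}\s0000/g },
    'plain'  => sub { my $n = () = $line =~ /:\d\d\s0000/g },
});
```

cmpthese prints a table of rates and relative percentages, which makes it easy to see whether the two forms differ at all on a given version of perl.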

Hope this helps!

Cheers,
Ovid

Join the Perlmonks Setiathome Group or just go to the link and check out our stats.
