in reply to (Ovid - Common regex error) RE: A two-liner for Backtracking for substitutions
in thread Backtracking for substitutions

Thanks Ovid! I appreciate the friendly amendment. How does the rest of the code look? I'd also like to hear from the original anonymous poster. Did he get his problem solved? --Mark

(Ovid - Regex efficiency issues) RE(3): A two-liner for Backtracking for substitutions
by Ovid (Cardinal) on Oct 03, 2000 at 03:10 UTC
    There are a few problems with your code. However, if I write a book, I'm going to dedicate it to:
    #!/usr/bin/perl -w
    use strict;
    Those two little lines have saved me more trouble than you can possibly imagine and I would strongly recommend that you incorporate them. Admittedly, you just posted what you did for testing purposes, but I still have this "knee jerk" reaction regarding anything without the -w switch or use strict.
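
    As a small illustration of what use strict catches, here is a minimal sketch. The variable names are invented for the example. A misspelled variable is rejected at compile time instead of quietly becoming a new global. The typo is compiled inside a string eval only so that the script can report the error rather than die on it:

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical snippet containing a typo: $totl instead of $total.
my $snippet = q{
    use strict;
    my $total = 10;
    $totl = $total + 1;    # never declared, so strict rejects it
    1;
};

# A string eval returns undef and sets $@ when compilation fails.
my $ok    = eval $snippet;
my $error = $ok ? '' : $@;

print $error ? "strict caught it: $error" : "compiled without complaint\n";
```

    Without use strict in effect, the same assignment would compile and simply create a package variable named $totl.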

    Your first regex can be made a bit more efficient (and accurate) by eliminating the .*, anchoring the match to the beginning of the line, and using the /m switch:

    $mydata =~ s/^([\w\s]+)\s([\w]+)\s(0000)/$1,$2,$3/mg;
    I haven't actually benchmarked this, but I'd bet good money that this is the case. See Death to Dot Star! for information on why .* is problematic. The accuracy issue is probably a moot point if you have relatively clean data.
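
    To see what the anchored substitution does, here is a rough sketch. The sample records and the add_commas helper are invented, since the real data layout isn't shown in this thread. The records are fed in one at a time because \s inside [\w\s]+ also matches a newline, so on a multi-line string the first group could run across records:

```perl
#!/usr/bin/perl -w
use strict;

# Apply the anchored substitution to a single record and return the result.
sub add_commas {
    my ($record) = @_;
    $record =~ s/^([\w\s]+)\s([\w]+)\s(0000)/$1,$2,$3/mg;
    return $record;
}

# Invented sample records; the real data layout is an assumption.
my @records = (
    'Acme Tool Co Hammer 0000',
    'Big Shop Wrench 0000',
);

print add_commas($_), "\n" for @records;
```

    With these records the first group backtracks until the last two whitespace-separated fields are left over, so the first line comes out as Acme Tool Co,Hammer,0000.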

    The second regex has two issues. You forgot to put parentheses around the \s0000. Those parentheses were supposed to capture this data and substitute it back using $2. I just changed it to the following:

    $mydata =~ s/(:\d\d)\s0000/$1, 0000/g;
    The other problem is really just a minor efficiency issue: \d{2} is better written as \d\d (this is from MRE, so it may be out of date for newer regex engines). Basically, when you use \d{2}, the regex engine is forced to keep track of the number of instances of \d. This slows it down just a tad (which can be significant when iterating over a large amount of data). However, when the regex engine sees \d\d, it just matches each instance of \d, which is faster.
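
    Since the \d{2} versus \d\d claim may not hold for every regex engine, it is easy enough to measure on your own perl. The following is only a sketch: the sample line and the two helper subs are invented, and it uses the standard Benchmark module. Which version wins, and by how much, will depend on the perl version:

```perl
#!/usr/bin/perl -w
use strict;
use Benchmark qw(cmpthese);

# Invented sample line; the real data layout is an assumption.
my $sample = 'Shift start 08:30 0000';

# The corrected substitution, written with \d\d.
sub fix_literal {
    my ($line) = @_;
    $line =~ s/(:\d\d)\s0000/$1, 0000/g;
    return $line;
}

# The same substitution written with the {2} quantifier.
sub fix_counted {
    my ($line) = @_;
    $line =~ s/(:\d{2})\s0000/$1, 0000/g;
    return $line;
}

# Show what the substitution produces on the sample line.
print fix_literal($sample), "\n";

# Run each version for about one CPU second and print a comparison table.
cmpthese(-1, {
    'literal \d\d'  => sub { fix_literal($sample) },
    'counted \d{2}' => sub { fix_counted($sample) },
});
```

    Both subs turn the sample into Shift start 08:30, 0000, so any difference in the table comes down to how the engine handles the two spellings.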

    Hope this helps!

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just go to the link and check out our stats.