in reply to Re: Multiple regex matches in single string
in thread Multiple regex matches in single string

The negative look-ahead works well for matching the proper blocks of text, but it introduces an extreme slowdown in the app. Without the look-ahead it runs in less than a second; with the look-ahead it takes 10 minutes or so.

My actual data file has thousands of lines, and each line holds much more text than a single word. Even so, without the negative look-ahead the app gets through these large files in less than a second. Why would the look-ahead cause such a slowdown? Is there a way around it?

Re^3: Multiple regex matches in single string
by johngg (Canon) on Apr 26, 2008 at 22:22 UTC
    If I am reading hipowls's regex correctly, it checks the negative look-ahead for every character between 'start' and 'end'. Doing the look-ahead just once, immediately after 'start', should locate the last 'start' in a group; the .+? can then run without re-checking after every character.

    use strict;
    use warnings;

    my $string = <<'EOT';
start
start
start
go
one
end
start
start
start
go
two
end
EOT

    my $rxGroup = qr
        {(?isx)
            (
                start
                (?!\nstart)
                .+?
                end
            )
        };

    print qq{$1\n\n} while $string =~ m{$rxGroup}g;

    The output.

    start
    go
    one
    end

    start
    go
    two
    end
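    To make the difference concrete, here is a minimal sketch that puts the two styles side by side. The per-character pattern is only an assumed form of the slower regex, which is not quoted in this post, and the names $rxPerChar and $rxOnce are made up for the illustration. The sample string is built with join rather than a here-doc, purely for convenience.

```perl
use strict;
use warnings;

# Sample data, one word per line, as in the example above.
my $string = join "\n", qw(start start start go one end
                           start start start go two end), '';

# ASSUMED form of the slower pattern (not quoted in this post): the
# look-ahead is re-tested before every character consumed between
# 'start' and 'end'.
my $rxPerChar = qr{ ( start (?: (?!start) . )+? end ) }isx;

# The look-ahead is tested once, immediately after 'start'.
my $rxOnce = qr{ ( start (?!\nstart) .+? end ) }isx;

# In list context, m//g returns every capture of group 1.
my @perChar = $string =~ m{$rxPerChar}g;
my @once    = $string =~ m{$rxOnce}g;

print "per-character: ", scalar @perChar, " blocks\n";
print "single check:  ", scalar @once,    " blocks\n";
print "same result\n" if "@perChar" eq "@once";
```

    On this sample both patterns capture the same two blocks. Whether the single check is really faster on a given file is worth measuring on the real data, for instance with the core Benchmark module.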

    I hope I am correct and this slight change will speed up your code.

    Cheers,

    JohnGG

      That assumes that the starts are on consecutive lines, which may be a perfectly valid assumption. It pays to know your data.

      Another approach is to use the original regex, whose match may contain multiple starts, and then trim the match using s/^.*start/start/is.

      The loop then looks something like this:

      while ( $string =~ /(start.+?end)/gis ) {
          my $data = $1;
          $data =~ s/^.*start/start/is;
          print $data, "\n\n";
      }

      If the intent is to strip off multiple starts only on consecutive lines, then the substitution would be s/^(?:start\s*)+start/start/is, which used on the input

      start
      start
      start
      go
      one
      end
      start
      start
      data
      start
      go
      two
      end

      would produce

      start
      go
      one
      end

      start
      data
      start
      go
      two
      end

      But as I said, you really need to know your data, along with other factors such as whether you need to validate the input.
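      The difference between the two substitutions is easiest to see when both are run on the same input. The following is a rough sketch under the assumption that the data holds one word per line; the variable names are invented for the illustration.

```perl
use strict;
use warnings;

# Input from the example above, assumed to be one word per line.
my $string = join "\n", qw(start start start go one end
                           start start data start go two end), '';

my ( @trimAll, @trimConsecutive );

while ( $string =~ /(start.+?end)/gis ) {
    my $block = $1;

    # Greedy .* keeps only the text from the last 'start' onwards.
    ( my $all = $block ) =~ s/^.*start/start/is;
    push @trimAll, $all;

    # Collapses only a leading run of 'start' lines.
    ( my $consecutive = $block ) =~ s/^(?:start\s*)+start/start/is;
    push @trimConsecutive, $consecutive;
}

print "trim to last start:\n", join( "\n\n", @trimAll ), "\n\n";
print "trim consecutive starts only:\n", join( "\n\n", @trimConsecutive ), "\n";
```

      The first block comes out the same either way. The second block keeps its 'start data' lines only under the consecutive-lines substitution, so the choice depends on whether text between starts is meaningful in your data.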