regex gotcha moving from 5.8.8 to 5.30.0?

mordibity has asked for the wisdom of the Perl Monks concerning the following question:

After a long long time using a fairly ancient Perl, I'm now able to use a more modern build. But I'm running into a fairly severe regex performance regression, and not sure where to start looking...

I've reduced it to this small testcase, where the sub below is called with a large input file (118Mb) that's been slurped into a string:

sub parse_foo {
    my ($text) = @_;
    my $name;
    {
        last if $text =~ /\G \s* \Z/gcmsx;

        if     ($text =~ /\G \s* ^ \s* begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsx) { $name = $1 }
        elsif  ($text =~ /\G \s* ^ \s* endfoo            /gcmsx) { }
        elsif  ($text =~ /\G \s* ^ \s* \S+ \s+  .*? \s* ;/gcmsx) { }
        else { die "ERROR: unknown syntax\n" }

        redo;
    }
    print "LAST FOO: $name\n";
}

Using 5.8.8, it runs in about 5 seconds. Using 5.30.0, it takes about 105 seconds. (And it's the same story when I try it on the latest stable release, 5.32.1). I ran NYTProf on both 5.8.8 and 5.30.0, and it boils down to the difference in these two lines:

5.8.8

  last if $text =~ /\G \s* \Z/gcmsx;
  # spent   181ms making 866465 calls to main::CORE:match, avg 208ns/call
  if     ($text =~ /\G \s* ^ \s* begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsx) { $name = $1 }
  # spent  3.74s making 2547279 calls to main::CORE:match, avg 1µs/call

5.30.0

  last if $text =~ /\G \s* \Z/gcmsx;
  # spent  289ms making 866465 calls to main::CORE:match, avg 334ns/call
  if     ($text =~ /\G \s* ^ \s* begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsx) { $name = $1 }
  # spent   103s making 2547279 calls to main::CORE:match, avg 41µs/call

Am I unwittingly doing something in my code that has been deprecated (I think I got this parsing/nibbling regex style using a block for looping from the original Effective Perl Programming), and now there's a better way? Thanks!


Replies are listed 'Best First'.
Re: regex gotcha moving from 5.8.8 to 5.30.0? by choroba (Cardinal) on Feb 09, 2021 at 20:22 UTC
Can you please provide a sample input? I generated one using `my $str = ""; $str .= "\nbegfoo bar ( xyz ) ;\nendfoo\nqux 123 ;" while 100E+06 > length $str;` but the whole script including the data creation takes only 5 seconds. `map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
Re^2: regex gotcha moving from 5.8.8 to 5.30.0? by mordibity (Acolyte) on Feb 09, 2021 at 21:17 UTC
Hmm, that's interesting, there is some data-dependency! My first attempt to make some fake data, like yours, didn't lead to any performance difference between 5.8.8 and 5.30.0. So I made the data-faker a little smarter (in particular, multi-line begfoo declarations) and was able to get a delta to show up:

my $num = shift or die "num?\n";
for my $i (0 .. $num) {
    my @in  = map { "input$_" }  (0..int(rand(100)));
    my @out = map { "output$_" } (0..int(rand(100)));
    print "begfoo FOO_$i (\n", join(",\n", @in, @out), ");\n";
    print " input $_;\n" foreach @in;
    print " output $_;\n" foreach @out;
    print " foo inst$_ (j, k, l, m, n, o, p);\n" foreach 0 .. int(rand(100));
    print "endfoo\n\n";
}

I generated some dummy output with 50000 definitions ("make_out.pl 50000 > 50k.foo"), giving a file of about 263Mb, and that was large/real enough to show a definite 2x difference:

5.8.8  : 0.01s user 0.02s system 0% cpu 14.474 total
5.30.0 : 0.01s user 0.02s system 0% cpu 37.312 total
Re^3: regex gotcha moving from 5.8.8 to 5.30.0? by choroba (Cardinal) on Feb 09, 2021 at 22:44 UTC
I usually `use re 'debug';` when debugging regular expressions, but I'm not sure it's helpful in this case. The usual suspects are `.*` or `.*?`, because they start by matching the whole string and then backtracking to match less. Can't you replace them with `[^;]*` or similar?
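The negated-class suggestion above can be tried on a single statement. The following is only a sketch: the sample text and the helper name `end_of_first_statement` are made up for illustration, and it checks that the two patterns stop at the same place, not how fast they get there.

```perl
use strict;
use warnings;

# Hypothetical sample statement; not taken from the real input file.
my $text = "foo inst0 (j, k,\n  l, m);\nendfoo\n";

# Style used in the original post: lazy dot-star up to the next ";".
my $lazy    = qr/\G \s* ^ \s* \S+ \s+ .*? \s* ;/msx;
# Suggested alternative: a negated class that cannot run past a ";".
my $negated = qr/\G \s* ^ \s* \S+ \s+ [^;]* ;/msx;

# Hypothetical helper: offset just past the first statement, or undef.
sub end_of_first_statement {
    my ($re, $string) = @_;
    pos($string) = 0;
    return $string =~ /$re/gc ? pos($string) : undef;
}

printf "lazy ends at %d, negated ends at %d\n",
    end_of_first_statement($lazy, $text),
    end_of_first_statement($negated, $text);
```

Both patterns accept the same statement here; whether the negated class changes the timing on the real data would have to be measured.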
Re^4: regex gotcha moving from 5.8.8 to 5.30.0? by mordibity (Acolyte) on Feb 10, 2021 at 00:16 UTC
Re: regex gotcha moving from 5.8.8 to 5.30.0? by SBECK (Chaplain) on Feb 10, 2021 at 14:18 UTC
So I created a 50k.txt file and ran the function for all of the versions of perl that I have (I have one of each major version installed). For 5.8, 5.10, 5.12, 5.14, 5.16, and 5.18, it took around 11s to run. Starting at 5.20 and on, it took 30s to run. I've been looking through the perldelta doc for 5.20 at the regexp related stuff, and there are several changes, but I'm not seeing one that jumps out as related to this specifically... but it's at least a place to start. The 5.20 version is definitely where the jump happened. FWIW, I was using the last 5.20.3 release... I did not test earlier 5.20 releases, so I have not narrowed it down to the specific 5.20 release.
Re^2: regex gotcha moving from 5.8.8 to 5.30.0? by swl (Parson) on Feb 10, 2021 at 21:37 UTC
Just a guess, but the delta for 5.20 includes this entry: "Executing a regex that contains the ^ anchor (or its variant under the /m flag) has been made much faster in several situations." https://metacpan.org/pod/release/RJBS/perl-5.20.0/pod/perldelta.pod#Performance-Enhancements Maybe that enhancement has some side effects triggered by the `\s* ^ \s*` patterns in the regexps. Are you able to test what happens in pre-5.20 if these patterns are changed to `\s*`? Edit - I should have asked for what happens either side of 5.20.
Re: regex gotcha moving from 5.8.8 to 5.30.0? by swl (Parson) on Feb 10, 2021 at 07:20 UTC
There is some repetition across your regexes that can be factored out. This may relate to the underlying cause. Each regex starts with the same pattern: `\s* ^ \s*`. Checking for that before running the if conditions makes things about 250-260% faster under Strawberry perl 5.32, testing with a file of 500 begfoo sets generated using the code in 11128154. See code in sub parse_foo2. parse_foo1 is from the OP. I also converted the condition to run in a while loop, mostly for style. The addition of the `/aa` flag makes a slight difference which could just be noise. Note that I have not checked if all begfoo sets are parsed correctly... I also don't have a version 5.8 to work with.

use 5.022;
use warnings;
use Benchmark qw {:all};

open my $fh, 'x.txt' or die;
my $data = do {local $/ = undef; <$fh>};

cmpthese (
    10,
    {
        one => sub {parse_foo1($data)},
        two => sub {parse_foo2($data)},
    }
);

sub parse_foo1 {
    my ($text) = @_;
    my $name;
    {
        last if $text =~ /\G \s* \Z/gcmsx;

        if     ($text =~ /\G \s* ^ \s* begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsx) { $name = $1 }
        elsif  ($text =~ /\G \s* ^ \s* endfoo /gcmsx) { }
        elsif  ($text =~ /\G \s* ^ \s* \S+ \s+ .*? \s* ;/gcmsx) { }
        else { die "ERROR: unknown syntax\n" }

        redo;
    }
    print "LAST FOO1: $name\n";
}

sub parse_foo2 {
    my ($text) = @_;
    my $name;
    while (not $text =~ /\G \s* \Z/gcmsx) {
        $text =~ /\G \s* /gcsmx;  # march through any white space
        if     ($text =~ /\G begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsxaa) { $name = $1 }
        elsif  ($text =~ /\G endfoo /gcmsx) { }
        elsif  ($text =~ /\G \S+ \s+ .*? \s* ;/gcmsx) { }
        else { die "ERROR: unknown syntax\n" }
    }
    print "LAST FOO2: $name\n";
}

Example results:

v5.32.0
LAST FOO1: FOO_500
LAST FOO1: FOO_500
LAST FOO1: FOO_500
LAST FOO1: FOO_500
LAST FOO1: FOO_500
LAST FOO1: FOO_500
LAST FOO1: FOO_500
LAST FOO1: FOO_500
LAST FOO1: FOO_500
LAST FOO1: FOO_500
LAST FOO2: FOO_500
LAST FOO2: FOO_500
LAST FOO2: FOO_500
LAST FOO2: FOO_500
LAST FOO2: FOO_500
LAST FOO2: FOO_500
LAST FOO2: FOO_500
LAST FOO2: FOO_500
LAST FOO2: FOO_500
LAST FOO2: FOO_500
      Rate  one  two
one 2.08/s   -- -72%
two 7.53/s 261%   --
Re: regex gotcha moving from 5.8.8 to 5.30.0? by AnomalousMonk (Archbishop) on Feb 10, 2021 at 04:32 UTC
I don't have much experience using regexes on Unicode strings (and that only in the distant past), so this is a stab in the dark, but might some of the slowdown be due to the regex processing changes made to accommodate Unicode? Perl version 5.8 was more or less in the midst of those changes IIRC. If you're not, in fact, dealing with Unicode text, maybe check out the discussion in Character set modifiers (and other places) WRT the `/a` and `/aa` "ASCII-safe" modifiers. Give a man a fish: `<%-{-{-{-<`
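For readers who have not met the `/a` modifier, here is a small sketch of what it changes. It only illustrates matching semantics on two hand-picked characters and says nothing about speed; it assumes Perl 5.14 or later, where `/a` was introduced.

```perl
use 5.014;
use strict;
use warnings;

# Two characters that count as whitespace or digit only under Unicode rules.
my $em_space    = "\x{2003}";   # EM SPACE
my $arabic_zero = "\x{0660}";   # ARABIC-INDIC DIGIT ZERO

my %result = (
    unicode_space => ($em_space    =~ /\s/       ? 1 : 0),
    ascii_space   => ($em_space    =~ /\s/a      ? 1 : 0),
    unicode_digit => ($arabic_zero =~ /\d/       ? 1 : 0),
    ascii_digit   => ($arabic_zero =~ /\d/a      ? 1 : 0),
    plain_space   => (" \t\n"      =~ /\A\s+\z/a ? 1 : 0),
);

# Report each check as name: 1 (matched) or 0 (did not match).
print "$_: $result{$_}\n" for sort keys %result;
```

With `/a`, `\s` and `\d` are restricted to their ASCII meanings, so the two non-ASCII characters stop matching while ordinary spaces, tabs and newlines still do.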
Re: regex gotcha moving from 5.8.8 to 5.30.0? by kschwab (Vicar) on Feb 10, 2021 at 01:34 UTC
I suspect it's \G and all the matching groups. Do you have some sample data and a description of what you're trying to do? There may be a way to do it without all the backrefs.
Re^2: regex gotcha moving from 5.8.8 to 5.30.0? by AnomalousMonk (Archbishop) on Feb 10, 2021 at 03:48 UTC
"There may be a way to do it without all the backrefs." I don't understand. There don't seem to be any backreferences in any of the regexes presented so far. Can you please clarify?
Re^3: regex gotcha moving from 5.8.8 to 5.30.0? by kschwab (Vicar) on Feb 10, 2021 at 03:54 UTC
Bad terminology. I mean the \G having to refer back to the last match.
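For context on what `\G` refers back to: it anchors at the string's current `pos()`, and the `/c` flag keeps that position when a match fails. A minimal sketch with a made-up two-line input:

```perl
use strict;
use warnings;

# Hypothetical input, just large enough to show pos() moving.
my $text = "begfoo a;\nendfoo\n";
my @log;

# A successful /g match advances pos() to the end of the match.
$text =~ /\G \s* begfoo \s+ (\S+?) \s* ;/gcx
    and push @log, "begfoo $1 -> pos " . pos($text);

# A failed match normally resets pos(); with /c it is left where it was.
$text =~ /\G \s* nosuchkeyword/gcx
    or push @log, "miss -> pos " . pos($text);

# The next \G match therefore resumes right after the first statement.
$text =~ /\G \s* endfoo/gcx
    and push @log, "endfoo -> pos " . pos($text);

print "$_\n" for @log;
```

This is the mechanism that lets the if/elsif chain in the original post try several patterns at the same offset.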
Re^4: regex gotcha moving from 5.8.8 to 5.30.0? by AnomalousMonk (Archbishop) on Feb 10, 2021 at 04:04 UTC
Re^5: regex gotcha moving from 5.8.8 to 5.30.0? by kschwab (Vicar) on Feb 10, 2021 at 05:43 UTC
Re: regex gotcha moving from 5.8.8 to 5.30.0? by mordibity (Acolyte) on Feb 10, 2021 at 18:20 UTC
Thank you all for your thoughts! I guess being coy isn't really helpful, I was just trying to avoid distractions. This is (really) part of a parser I've written for structural Verilog netlists in an IC design environment. The whole parser does much more work and handles many more nuances of the input format, but I've been happy enough (for years) with the performance without trying to speed it up further (e.g. swl's whitespace handling tips). For example, a typical report on a typical netlist input file (200+Mb) would run in ~10 seconds. But then CAD upgraded our central Perl (from 5.8.8 to 5.30) and the same code, on the same input took 1 hour 38 minutes! (With the same output results, so it's purely a performance issue, not a correctness issue.) So then I set about cutting it down to a very small testcase with the 5 sec to 105 sec delta.

AnomalousMonk: it's just plain ASCII text; I tried throwing "aa" on it (for 5.30) and it didn't change things.

kschwab: below I've put a better fake-data generator to get a consistent 2.5x slowdown (still not as bad as the 10x on the real data, but hopefully more representative of the problem. The majority of the names would be unique strings, etc, etc). And also here's the un-foo-ified cut-down parser, too.

SBECK: thank you! I guess I'll start looking at 5.20-related deltas. But I don't know that I'll be able to infer anything on my own; suggestions welcome from all on how to proceed (file a GitHub issue, etc?)

faker & parser are used like this:

$ ./makev.pl 10000 > test.10k.v
$ time perl_5.8.8 -w testv.pl test.10k.v
LAST MODULE (Perl 5.008008): je0dhj
perl_5.8.8 -w testv.pl test.10k.v  0.01s user 0.01s system 0% cpu 11.352 total
$ time perl_5.30 -w testv.pl test.10k.v
LAST MODULE (Perl 5.030000): je0dhj
perl_5.30 -w testv.pl test.10k.v  0.01s user 0.02s system 0% cpu 26.291 total

makev.pl:

#!/usr/bin/perl -w
use Text::Wrap;
use strict;

my $num = shift or die "num?\n";
my @chars = ( "a" .. "z", 0 .. 9 );
for my $i (0 .. $num) {
    my $name = join("", @chars[ map { rand @chars } ( 1 .. 2+int(rand(8)) )]);
    my @io   = map { "p${name}${_}" } (0..int(rand(100)));
    my @hier = map { "m${name}${_}" } (0..int(rand(20)));
    my @leaf = map { "s${name}${_}" } (0..int(rand(200)));
    print "module ${name} ( ", wrap('', ' ', join(", ", @io)), "\n);\n";
    print " inout $_;\n" foreach @io;
    print " wire $_;\n" foreach @io;
    for my $leaf (@leaf) {
        my @conn = map { ".P${name}${_} (n${name}${_})" } (0..int(rand(5)));
        print "$leaf u_$leaf ( ", wrap('', ' ', join(", ", @conn)), "\n);\n";
    }
    for my $hier (@hier) {
        my @conn = map { ".p${name}${_} (n${name}${_})" } (0..int(rand(100)));
        print "$hier u_$hier ( ", wrap('', ' ', join(", ", @conn)), "\n);\n";
    }
    print "endmodule\n\n";
}

testv.pl:

use strict;
use File::Slurp;

my $file = shift or die "file?\n";
my $text = read_file($file);
parse_v($text);

sub parse_v {
    my $text = shift;
    my $name;
    {
        last if $text =~ /\G \s* \Z/gcmsx;

        if     ($text =~ /\G \s* ^ \s* module \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsx) { $name = $1 }
        elsif  ($text =~ /\G \s* ^ \s* endmodule /gcmsx) { }
        elsif  ($text =~ /\G \s* ^ \s* \S+ \s+ .*? \s* ;/gcmsx) { }
        else { die "ERROR: unknown syntax\n" }

        redo;
    }
    print "LAST MODULE (Perl $]): $name\n";
}
Re^2: regex gotcha moving from 5.8.8 to 5.30.0? by Tux (Canon) on Feb 11, 2021 at 14:37 UTC
Thanks for the test case. Having every perl ever released available threaded and unthreaded, I get the following results when I only run on the last minor of every major release (see below: the left column is sorted on elapsed time, the right column is sorted on perl version).

It shows that performance for your test-case was really bad in 5.19, 5.20, 5.21, and 5.22, but it was much better in 5.23 again. It might be a worrying indicator that 5.31 and 5.32 are slower again than 5.30 and below. The last release that runs your test comparable to 5.8.8 is 5.18.4 (or 5.16.3).

$ perl-all --time --last testv.pl test.2k.v
Running perl-all testv.pl test.2k.v
On 57 versions of perl ranging from perl5.6.2 to tperl5.33.6
:
:
rank    elapsed   cuser    csys pass perl              | perl              rank elapsed
==== ========== ======= ======= ==== ================= | ================= ==== =======
   1    1.53014   1.440   0.090 PASS base/tperl5.6.2   | /usr/bin/perl        4   1.821
   2    1.54001   1.450   0.080 PASS base/perl5.6.2    | base/perl5.6.2       2   1.540
   3    1.77906   1.690   0.100 PASS base/perl5.8.9    | base/tperl5.6.2      1   1.530
   4    1.82076   1.730   0.090 PASS /usr/bin/perl     | base/perl5.7.3      26   2.641
   5    1.87299   1.790   0.080 PASS base/perl5.9.5    | base/tperl5.7.3     27   2.852
   6    1.94761   1.850   0.090 PASS base/tperl5.8.9   | base/perl5.8.9       3   1.779
   7    1.96104   1.880   0.080 PASS base/tperl5.9.5   | base/tperl5.8.9      6   1.948
   8    2.04763   1.960   0.090 PASS base/perl5.13.11  | base/perl5.9.5       5   1.873
   9    2.07248   1.990   0.090 PASS base/perl5.12.5   | base/tperl5.9.5      7   1.961
  10    2.08193   1.990   0.100 PASS base/perl5.14.4   | base/perl5.10.1     15   2.171
  11    2.08614   1.990   0.090 PASS base/perl5.15.9   | base/tperl5.10.1    19   2.236
  12    2.09114   2.010   0.090 PASS base/tperl5.15.9  | base/perl5.11.5     21   2.266
  13    2.10513   2.020   0.080 PASS base/perl5.16.3   | base/tperl5.11.5    16   2.176
  14    2.11247   2.020   0.090 PASS base/tperl5.16.3  | base/perl5.12.5      9   2.072
  15    2.17058   2.090   0.090 PASS base/perl5.10.1   | base/tperl5.12.5    18   2.216
  16    2.17647   2.080   0.100 PASS base/tperl5.11.5  | base/perl5.13.11     8   2.048
  17    2.19196   2.100   0.090 PASS base/tperl5.14.4  | base/tperl5.13.11   20   2.245
  18    2.21637   2.130   0.090 PASS base/tperl5.12.5  | base/perl5.14.4     10   2.082
  19    2.23559   2.120   0.110 PASS base/tperl5.10.1  | base/tperl5.14.4    17   2.192
  20    2.24530   2.150   0.090 PASS base/tperl5.13.11 | base/perl5.15.9     11   2.086
  21    2.26649   2.180   0.080 PASS base/perl5.11.5   | base/tperl5.15.9    12   2.091
  22    2.44358   2.360   0.090 PASS base/tperl5.17.11 | base/perl5.16.3     13   2.105
  23    2.45182   2.370   0.090 PASS base/perl5.17.11  | base/tperl5.16.3    14   2.112
  24    2.47311   2.380   0.100 PASS base/perl5.18.4   | base/perl5.17.11    23   2.452
  25    2.56640   2.470   0.090 PASS base/tperl5.18.4  | base/tperl5.17.11   22   2.444
  26    2.64125   2.540   0.100 PASS base/perl5.7.3    | base/perl5.18.4     24   2.473
  27    2.85200   2.750   0.100 PASS base/tperl5.7.3   | base/tperl5.18.4    25   2.566
  28    4.81823   4.780   0.030 PASS base/perl5.30.3   | base/perl5.19.11    54  12.516
  29    4.84806   4.800   0.040 PASS base/perl5.28.3   | base/tperl5.19.11   55  12.527
  30    4.87266   4.830   0.040 PASS base/perl5.29.10  | base/perl5.20.3     57  12.621
  31    4.87333   4.840   0.040 PASS base/tperl5.29.10 | base/tperl5.20.3    50  12.173
  32    4.90558   4.870   0.030 PASS base/perl5.26.3   | base/perl5.21.11    53  12.495
  33    4.91229   4.880   0.040 PASS base/perl5.27.11  | base/tperl5.21.11   56  12.529
  34    4.93008   4.880   0.050 PASS base/tperl5.26.3  | base/perl5.22.4     52  12.443
  35    4.94565   4.900   0.040 PASS base/tperl5.30.3  | base/tperl5.22.4    51  12.416
  36    4.96723   4.930   0.030 PASS base/perl5.23.9   | base/perl5.23.9     36   4.967
  37    4.97168   4.940   0.040 PASS base/perl5.24.4   | base/tperl5.23.9    41   5.148
  38    5.02614   4.980   0.040 PASS base/tperl5.25.12 | base/perl5.24.4     37   4.972
  39    5.05600   5.010   0.040 PASS base/tperl5.28.3  | base/tperl5.24.4    43   5.201
  40    5.12577   5.080   0.040 PASS base/tperl5.27.11 | base/perl5.25.12    42   5.160
  41    5.14769   5.110   0.030 PASS base/tperl5.23.9  | base/tperl5.25.12   38   5.026
  42    5.16021   5.110   0.040 PASS base/perl5.25.12  | base/perl5.26.3     32   4.906
  43    5.20146   5.150   0.040 PASS base/tperl5.24.4  | base/tperl5.26.3    34   4.930
  44    8.68506   8.640   0.040 PASS base/perl5.32.1   | base/perl5.27.11    33   4.912
  45    8.84718   8.800   0.040 PASS base/perl5.31.11  | base/tperl5.27.11   40   5.126
  46    9.22812   9.200   0.030 PASS base/tperl5.31.11 | base/perl5.28.3     29   4.848
  47    9.39026   9.360   0.030 PASS base/tperl5.32.1  | base/tperl5.28.3    39   5.056
  48    9.56254   9.530   0.040 PASS base/perl5.33.6   | base/perl5.29.10    30   4.873
  49   10.05364  10.010   0.040 PASS base/tperl5.33.6  | base/tperl5.29.10   31   4.873
  50   12.17332  12.130   0.040 PASS base/tperl5.20.3  | base/perl5.30.3     28   4.818
  51   12.41640  12.380   0.030 PASS base/tperl5.22.4  | base/tperl5.30.3    35   4.946
  52   12.44274  12.400   0.030 PASS base/perl5.22.4   | base/perl5.31.11    45   8.847
  53   12.49524  12.450   0.040 PASS base/perl5.21.11  | base/tperl5.31.11   46   9.228
  54   12.51614  12.470   0.040 PASS base/perl5.19.11  | base/perl5.32.1     44   8.685
  55   12.52714  12.480   0.040 PASS base/tperl5.19.11 | base/tperl5.32.1    47   9.390
  56   12.52865  12.480   0.050 PASS base/tperl5.21.11 | base/perl5.33.6     48   9.563
  57   12.62089  12.570   0.040 PASS base/perl5.20.3   | base/tperl5.33.6    49  10.054

Enjoy, Have FUN! H.Merijn
Re^3: regex gotcha moving from 5.8.8 to 5.30.0? by mordibity (Acolyte) on Feb 11, 2021 at 21:34 UTC
That's fascinating, thank you! It looks like through blind luck I've been frozen-in-time with a version whose performance happened to complement the way I was trying to tackle the parsing problem... For my side, I suppose I can stick with 5.8.8, or jump to 5.30+ by possibly changing my regexes as swl and SBECK have suggested to avoid "\s* ^ \s*" and compromising a little on the strictness of the parser. But for the sake of Perl, however, is there a recommended escalation path to alert the developers about this regression / add it to the test suite / etc? Or is it niche enough not to be interesting?
Re^4: regex gotcha moving from 5.8.8 to 5.30.0? by Animator (Hermit) on Feb 12, 2021 at 17:16 UTC
Re^5: regex gotcha moving from 5.8.8 to 5.30.0? by demerphq (Chancellor) on Feb 12, 2021 at 21:47 UTC
Re^4: regex gotcha moving from 5.8.8 to 5.30.0? by rsFalse (Chaplain) on Feb 11, 2021 at 22:17 UTC
Re^2: regex gotcha moving from 5.8.8 to 5.30.0? by SBECK (Chaplain) on Feb 11, 2021 at 13:13 UTC
So I did some more testing to tweak the regexps to see if I could narrow it down, and I found a new set of regexps that run the same speed from 5.8 all the way to 5.30 AND they are something like 10% faster than the original (on my host, your regexps take 10-12 secs, and mine take 9-11). All of your regexps match up to (but not including) the newline. Based on the comment that '^' might be causing problems, I just added a '\s*' to the end of each regexp so basically each one of them grabs the newline and any additional whitespace. The following runs quickly for all versions:

sub parse_foo {
    my ($text) = @_;
    my $name;
    {
        last if $text =~ /\G \Z/gcmsx;

        if     ($text =~ /\G begfoo \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ; \s*/gcmsx) { $name = $1 }
        elsif  ($text =~ /\G endfoo \s* /gcmsx) { }
        elsif  ($text =~ /\G \S+ \s+ .*? \s* ; \s*/gcmsx) { }
        else { die "ERROR: unknown syntax\n" }

        redo;
    }
    print "LAST FOO: $name\n";
}
Re^3: regex gotcha moving from 5.8.8 to 5.30.0? by mordibity (Acolyte) on Feb 11, 2021 at 21:09 UTC
Very cool, thanks! It's a seductive solution, being even faster than the original regexes, but I'm having pangs about "correctness" of the format... By letting each sub-regex consume its trailing newline, I can no longer enforce that the main keywords are the first token on any given line, and input like this (all smushed together on one line) isn't flagged as illegal/unknown syntax: `begfoo a ( a, b, c); endfoo begfoo b ( d, e, f ); input d; foo inst1 (a,b,c); endfoo` In other words, way too liberal in what I accept! :-) The commercial tools would reject that instantly. But, for my reporting and analysis purposes, it's harmless, and it would let me move to 5.30 and pick up the other benefits of a more modern Perl... Hmm. I did spend some time experimenting/trying to write the sub-regexes to avoid the possibly-poisonous "\s* ^ \s*" and instead have them all begin with "\G ^", by either having each sub-regex consume its respective newline OR consuming them all in a separate sub-regex (like swl suggested in their "# march through any white space"), but I couldn't get it to work. I think it may be a catch-22 scenario: if the newline is present/next in the string, "\G ^" won't match it, since it matches after a newline. But if the newline has been consumed, "\G ^" also won't match it, since it's not there...
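One possible way around the catch-22 described above, offered only as a sketch and not something proposed in the thread: skip whitespace in a separate step, but remember whether a newline was crossed, and reject any statement that does not start a line. The helper names `skip_ws_to_line_start` and `last_foo_name` and the sample inputs are hypothetical, the statement regexes are simplified, and nothing here has been timed against the real data. It assumes Perl 5.10 or later for the `//` operator.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: skip whitespace at pos() and report whether the
# next token is the first token on its line.
sub skip_ws_to_line_start {
    my ($ref) = @_;
    my $start = pos($$ref) // 0;
    $$ref =~ /\G (\s*)/gcx;                 # always succeeds, possibly empty
    my $skipped = $1;
    return 1 if $skipped =~ /\n/;           # crossed a newline while skipping
    return 1 if $start == 0;                # began at the start of the text
    return substr($$ref, $start - 1, 1) eq "\n" ? 1 : 0;
}

# Returns the name of the last begfoo, dying on anything unexpected.
sub last_foo_name {
    my ($text) = @_;
    my $name;
    pos($text) = 0;
    while (1) {
        my $fresh_line = skip_ws_to_line_start(\$text);
        last if pos($text) == length $text;
        die "ERROR: unknown syntax\n" unless $fresh_line;

        if    ($text =~ /\G begfoo \s+ (\S+?) \s* \( .*? \) \s* ;/gcsx) { $name = $1 }
        elsif ($text =~ /\G endfoo \b/gcx)                              { }
        elsif ($text =~ /\G \S+ \s+ .*? ;/gcsx)                         { }
        else  { die "ERROR: unknown syntax\n" }
    }
    return $name;
}

my $good = "begfoo a (\n  x,\n  y);\n  input x;\nendfoo\n";
my $bad  = "begfoo a ( x, y); endfoo\n";   # two statements on one line

print "good: ", last_foo_name($good), "\n";
print "bad:  ", (eval { last_foo_name($bad); 1 } ? "accepted" : "rejected"), "\n";
```

The statement patterns themselves no longer contain `^`, yet input with two statements on one line is still rejected, because the line-start check happens in the whitespace-skipping step.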
Re^4: regex gotcha moving from 5.8.8 to 5.30.0? by tybalt89 (Monsignor) on Feb 12, 2021 at 02:56 UTC
Re^4: regex gotcha moving from 5.8.8 to 5.30.0? by rsFalse (Chaplain) on Feb 11, 2021 at 22:40 UTC
Re^2: regex gotcha moving from 5.8.8 to 5.30.0? by choroba (Cardinal) on Feb 10, 2021 at 20:05 UTC
What about reading the file in chunks? "endmodule" seems like a good candidate for the separator.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

local $/ = 'endmodule';
my $name;
while (<>) {
    if (/^ \s* module \s+ (\S+?) \s* \( \s* .*? \s* \) \s* ;/xsm) {
        $name = $1;
        say $name;
    }
}
say "LAST MODULE (Perl $]): $name";
Re^2: regex gotcha moving from 5.8.8 to 5.30.0? by haukex (Archbishop) on Feb 12, 2021 at 08:09 UTC
"below I've put a better fake-data generator to get a consistent 2.5x slowdown (still not as bad as the 10x on the real data, but hopefully more representative of the problem."

IMHO, it would be best if you could figure out what's causing the 10x slowdown* and generate a fake file with that data. Perhaps anonymizing the original data would be an option?

* Update: By which I mean: which part of the original data is causing the slowdown.
Re: regex gotcha moving from 5.8.8 to 5.30.0? by kschwab (Vicar) on Feb 10, 2021 at 16:40 UTC
I'm assuming the test data and sample code reproduce the slowness, but don't really show all the requirements. Is that right? Because looking at the generated test data, it seems like there should be some much faster way of parsing it. Assuming, for example, the file has important bits already newline separated, it feels like a simpler line-by-line read watching for /^begfoo/ and /^endfoo/ and keeping track of state would be faster. Can you provide a more real-world view of what's being done, or is it just too much test data and code to present here? Edit: Not dismissing the importance of the performance regression...that's pretty bad.
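A minimal sketch of the line-by-line idea above, under the assumption (which the real format may not guarantee) that `begfoo NAME` and `endfoo` always begin their own lines. The sub name `last_foo_name` and the sample data are hypothetical, and an in-memory filehandle stands in for the real file.

```perl
use strict;
use warnings;

# Hypothetical line-oriented scanner: tracks only whether we are inside a
# begfoo ... endfoo block and remembers the most recent name.
sub last_foo_name {
    my ($fh) = @_;
    my ($name, $in_foo) = (undef, 0);
    while (my $line = <$fh>) {
        if ($line =~ /^ \s* begfoo \s+ ([^\s(]+)/x) {
            die "ERROR: nested begfoo at line $.\n" if $in_foo;
            ($name, $in_foo) = ($1, 1);
        }
        elsif ($line =~ /^ \s* endfoo \b/x) {
            die "ERROR: endfoo without begfoo at line $.\n" unless $in_foo;
            $in_foo = 0;
        }
        # every other line is treated as block content and ignored
    }
    die "ERROR: missing endfoo\n" if $in_foo;
    return $name;
}

# Made-up sample in the shape of the generated test data.
my $sample = "begfoo FOO_0 (\ninput0,\noutput0);\n input input0;\nendfoo\n\n"
           . "begfoo FOO_1 (\nx);\n foo inst0 (j, k);\nendfoo\n";
open my $fh, '<', \$sample or die "open: $!";
print "LAST FOO: ", last_foo_name($fh), "\n";
close $fh;
```

This checks far less than the original parser does (statement bodies are not validated at all), so it is a trade of strictness for simplicity rather than a drop-in replacement.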
Re: regex gotcha moving from 5.8.8 to 5.30.0? by rsFalse (Chaplain) on Feb 11, 2021 at 09:34 UTC
Interestingly, only two lines differ in execution times. What is common to these two lines, and what distinguishes them from the other lines? The only difference I see is that both slowed-down conditions ('if' and 'elsif') proceed to non-empty expressions. swl guessed (Re^2: regex gotcha moving from 5.8.8 to 5.30.0?) that performance could be influenced by '^' (and its interplay with `'\s*'`). I see that the 3rd and 4th regexes also have '^', but they do not influence speed of execution(!?). My guess is that `'\s* ^ \s*'` is somehow cached after executing the 2nd regex, and then the 3rd and 4th regexes use the result of the 2nd(?). And it is very strange that the 1st regex executes slowly. Can it be because of '\Z' being a kind of relative of '^'...
Re: regex gotcha moving from 5.8.8 to 5.30.0? by rsFalse (Chaplain) on Feb 11, 2021 at 16:36 UTC
And what if `tr/^/$/`? It should ~~do the same here~~, but is it also slower? Upd. I think it would do the same except for not matching the first 'line'. But prepending one newline ("\n") to the very beginning of the ~~slurped~~ multiline text would make matching similar.
Re: regex gotcha moving from 5.8.8 to 5.30.0? by Anonymous Monk on Feb 10, 2021 at 17:01 UTC
That's still an astonishing and pretty-scary regression. Did both versions produce the same answer, and, was it in fact the right answer?
