in reply to Re^2: regex gotcha moving from 5.8.8 to 5.30.0?
in thread regex gotcha moving from 5.8.8 to 5.30.0?

Very cool, thanks! It's a seductive solution, being even faster than the original regexes, but I'm having pangs about "correctness" of the format... By letting each sub-regex consume its trailing newline, I can no longer enforce that the main keywords are the first token on any given line, and input like this (all smushed together on one line) isn't flagged as illegal/unknown syntax:

begfoo a ( a, b, c); endfoo begfoo b ( d, e, f ); input d; foo inst1 (a,b,c); endfoo

In other words, way too liberal in what I accept! :-) The commercial tools would reject that instantly. But, for my reporting and analysis purposes, it's harmless, and it would let me move to 5.30 and pick up the other benefits of a more modern Perl... Hmm.

I did spend some time trying to rewrite the sub-regexes so that, instead of starting with the possibly-poisonous "\s* ^ \s*", they all begin with "\G ^". I tried both having each sub-regex consume its own trailing newline and consuming the newlines in a separate sub-regex (as sw1 suggested in their "# march through any white space"), but I couldn't get either to work. I think it may be a catch-22 scenario: if the newline is still next in the string, "\G ^" won't match, since ^ only matches after a newline. But if the newline has already been consumed, "\G ^" also won't match it, since it's no longer there...
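For anyone wanting to probe the "\G ^" behaviour in isolation, here is a minimal sketch. The helper name bol_at and the sample text are made up for illustration. As far as I can tell, under /m the ^ assertion is zero-width and only checks that the previous character is a newline (or that we are at the start of the string), so it should succeed at a position immediately after a consumed newline, and fail both when pos is still sitting on the newline and when a trailing \s* has also swallowed the next line's indentation.

```perl
use strict;
use warnings;

# Hypothetical helper: does /\G ^/mx match at offset $pos of $text?
sub bol_at {
    my ($text, $pos) = @_;
    pos($text) = $pos;                  # where \G will anchor
    return $text =~ /\G ^/gcmx ? 1 : 0;
}

# Made-up sample: second statement is indented by two spaces.
my $text = "foo;\n  bar;\n";

printf "on the newline:     %d\n", bol_at($text, 4);  # newline not yet consumed -> 0
printf "just after newline: %d\n", bol_at($text, 5);  # only the newline consumed -> 1
printf "after indentation:  %d\n", bol_at($text, 7);  # "\n  " consumed by \s*   -> 0
```

If that also holds for the real grammar, the trouble may come from the trailing \s* eating the next line's indentation rather than from the newline itself, but that is only a guess from this toy case.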

Re^4: regex gotcha moving from 5.8.8 to 5.30.0?
by tybalt89 (Monsignor) on Feb 12, 2021 at 02:56 UTC

    Tried a whole bunch of things, not all worked, but currently at about 20X faster on 231MB fake file (perl v5.32.0).

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11128141
    use warnings;
    use Time::HiRes qw( time );

    my $string = do { local(@ARGV, $/) = '50k.foo'; <> };

    my $start = time;
    parse_v( $string );
    printf "seconds %.3f for length %d file\n", time - $start, length $string;

    sub parse_v
      {
      local $_ = shift;
      my $name;
      while( 1 )
        {
        if(/\G (?: (?!endmodule\b|module\b) \S+ \s [^;]* ;
            | (?<!\N) endmodule \b) \s* /gcx)
          {
          }
        elsif(/\G (?<!\N) module \s+ (\S+?) \s* \( [^)]* \) \s* ; \s* /gcx)
          {
          $name = $1
          }
        else
          {
          /\G \z/gcx ? last : die "ERROR: unknown syntax at @{[pos($_)]}\n"
          }
        }
      print "LAST MODULE (Perl $]): $name\n";
      }

    For double negative fans, (?<!\N) means "not preceded by not a newline".
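A small sketch of what that lookbehind accepts and rejects, separate from the parser above. The helper name line_start_offsets and the sample text are invented for illustration, and \N needs a reasonably modern Perl (5.12 or later, as far as I know). The idea is that (?<!\N) succeeds at the start of the string or right after a newline, so a keyword preceded by anything else on the same line is rejected.

```perl
use strict;
use warnings;

# Hypothetical helper: offsets at which "module" is the first thing on a line,
# i.e. it is not preceded by a non-newline character.
sub line_start_offsets {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /(?<!\N) module \b/gx) {
        push @offsets, $-[0];   # start offset of the current match
    }
    return @offsets;
}

# Made-up input: the "module" on the second line follows "endmodule " and
# must not count; neither should the "module" inside "endmodule".
my $text = "module a;\nendmodule module b;\nmodule c;\n";
print join(",", line_start_offsets($text)), "\n";   # prints 0,30
```

Note that this version does not allow leading indentation before the keyword; whether that is wanted depends on how strict the format is meant to be.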

Re^4: regex gotcha moving from 5.8.8 to 5.30.0?
by rsFalse (Chaplain) on Feb 11, 2021 at 22:40 UTC
    Hm. Might some other variants help? Or are they much slower?
    "\G \s*? ^ \s*" # non-greedy
    "\G (?= \s* ^ ) \s*" # look-ahead
    Upd. And do they reproduce the regression?

    Upd. Might factoring out the "\s* ^ \s*" help?
    {
        last if $text =~ /\G \s* \Z/gcmsx;
        if ($text =~ /\G \s* ^ \s*/gcmsx) {
            if ($text =~ /\G module \s+ (\S+?) \s* \( \s* (.*?) \s* \) \s* ;/gcmsx) {
                $name = $1
            }
            elsif ($text =~ /\G endmodule /gcmsx) { }
            elsif ($text =~ /\G \S+ \s+ .*? \s* ;/gcmsx) { }
            else { die "ERROR: unknown syntax\n" }
        }
        else { die "ERROR: unknown syntax\n" }
        redo;
    }