Foodeywo has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Today I re-stumbled upon an issue that I solved quick-and-dirty a while ago; I want to solve it more elegantly now that I am polishing my code.

I use Regexp::Assemble to assemble regexes that are about 15 kB to 87 kB in size. I then simply run through a large (~10 GB) file and match the regex against each line. I used to do this on the command line, in this style:

perl -ne 'print if (/MYLARGEREGEX_HERE/../END_OF_BLOCK/)' inputfile > outputfile

This was fast as hell. However, when my regexes grew in size, I could no longer copy and paste them into bash, so I started reading the regex from a file and did something like this:

#!perl
use strict;
use warnings;

open my $fh_big_file, '<', $ARGV[0] || die; # first argument must be the input file
open my $fh_regex, '<', $ARGV[1] || die;    # second argument points to the file containing the regex
my $regex = <$fh_regex>;

while(<$fh_big_file>) {
    print if (/$regex/../^END_OF_BLOCK/);
}

The funny thing is that this flavour of code costs me a factor of 20 in speed, or even more. I can reclaim the speed by not storing the regex in a variable, e.g.

while(<$fh_big_file>) { print if (/MY_HUGE_REGEX_JUST_PLAIN/../^END_OF_BLOCK/); }

So I assume this has something to do with fetching the content of the variable (from RAM to CPU?) over and over for each iteration of while(<>), whereas writing the regex directly into the code doesn't need to re-read it every time.

This approach, however, requires me to manually copy the regex into place each time I run the whole procedure of "assembling, searching, processing, assembling, searching, processing", and I would like to automate it without loss of performance. Any ideas?

thanks and cheers!

Update/Solution/Close

The suggestion to use the o modifier works. However, it needs to go after /$regex/, not after /END_OF_BLOCK/, i.e. as shmem suggested:

while(<$fh_big_file>) { print if (/$regex/o .. /^END_OF_BLOCK/); }

thanks!

Re: Performance of assambled regex
by Not_a_Number (Prior) on Jul 26, 2015 at 18:22 UTC

    Not an answer to your question, but there is another problem in your code. This line:

    open my $fh_big_file, '<', $ARGV[0] || die;

    and the following line are buggy for precedence reasons. Use either parentheses:

    open( my $fh_big_file, '<', $ARGV[0] ) || die;

    or the low-precedence or operator:

    open my $fh_big_file, '<', $ARGV[0] or die;

    Proof:

    # There's no file called 'rabbit' in my cwd
    open my $fh, '<', 'rabbit' || die;
    print "Still alive!\n";
    __END__
    Still alive!
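The same point can be sketched side by side, with the error text from $! added to the die message. The file name below is hypothetical and is assumed not to exist; eval is used only so that both variants can run in one script.

```perl
use strict;
use warnings;

my $missing = 'rabbit_does_not_exist.txt';   # hypothetical name, assumed absent

# '||' binds to the filename, so this parses as open(..., ($missing || die)).
# $missing is a true string, so die is never reached even though open fails.
my $survived_pipes = eval { open my $fh, '<', $missing || die "open failed"; 1 };

# 'or' applies to the result of open as a whole; $! carries the OS error text.
my $survived_or = eval { open my $fh, '<', $missing or die "Cannot open '$missing': $!"; 1 };
my $error = $@;

print "with ||: ", ($survived_pipes ? "still alive" : "died"), "\n";
print "with or: ", ($survived_or    ? "still alive" : "died: $error"), "\n";
```

Only the second variant dies, which is the behaviour the original script presumably wanted.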
Re: Performance of assambled regex
by BillKSmith (Monsignor) on Jul 26, 2015 at 17:28 UTC
    In my reply to your earlier question, I recommended that you store a compiled regex (using qr//) in your variable and showed you how to use it. The advantage of this over /o is that variables continue to work as expected.
    Bill
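A minimal sketch of what "variables continue to work as expected" means in practice. The patterns 'foo' and 'bar' are made up for illustration; the point is that /o keeps the first interpolated value for that match operator, while qr// picks up the current value each time the qr// statement runs.

```perl
use strict;
use warnings;

my (@with_o, @with_qr);
for my $pat ('foo', 'bar') {
    # /o freezes the first interpolated value ('foo') for this match operator,
    # so the second pass still tests 'bar' against /foo/.
    push @with_o, ('bar' =~ /$pat/o ? 1 : 0);

    # qr// compiles when this statement executes, so the new value is honoured.
    my $re = qr/$pat/;
    push @with_qr, ('bar' =~ $re ? 1 : 0);
}

print "with /o: @with_o\n";
print "with qr: @with_qr\n";
```

In the thread's use case the pattern never changes, so both approaches avoid repeated compilation; the difference only shows up once the variable is reassigned.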
Re: Performance of assambled regex
by Laurent_R (Canon) on Jul 26, 2015 at 12:52 UTC
    I would suspect that the problem is due to the fact that your code has to recompile the regex each time through the loop.

    Perhaps using the o modifier would help:

    while(<$fh_big_file>) { print if (/MY_HUGE_REGEX_JUST_PLAIN/../^END_OF_BLOCK/o); }
    Update: or rather:
    while(<$fh_big_file>) { print if (/MY_HUGE_REGEX_JUST_PLAIN/o .. /^END_OF_BLOCK/); }
    Update 2: sorry, wrong copy and paste (and thanks to shmem for pointing it out):
    while(<$fh_big_file>) { print if (/$regex/o .. /^END_OF_BLOCK/); }

      It is the $regex which gets recompiled.

      perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Performance of assambled regex
by shmem (Chancellor) on Jul 26, 2015 at 12:52 UTC

    Use

    while(<$fh_big_file>) { print if (/$regex/o .. /^END_OF_BLOCK/); }

    to avoid recompilation of the regex.

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Performance of assambled regex
by natan (Novice) on Jul 27, 2015 at 09:43 UTC

    Hi Foodeywo

    I would discourage using the /o modifier, because of this note in perlre for v5.22:

      o  - pretend to optimize your code, but actually introduce bugs

    Using qr// is a much better practice, as you can control when the compilation happens. If you want to know exactly how perl handles your regexps you can use:

    perl -Mre=debug -d -ne 'print if (/MYLARGEREGEX_HERE/../END_OF_BLOCK/)' inputfile > outputfile

    -Mre=debug gives you nice output about regex compilation and matching
    and
    -d puts you into debugger.

    Go through your program step by step with "s" debugger option.

    Happy hacking,
    natan

    ---- PS: my first post on PerlMonks. Hello everybody!
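As a rough sketch of the qr// approach applied to the original script: the pattern is read once, compiled once with qr//, and the compiled object is reused inside the loop. In-memory filehandles and the sample pattern and data are stand-ins invented so the example runs on its own; a real script would open $ARGV[0] and $ARGV[1] instead. The chomp is an added assumption, since a pattern read from a file usually carries a trailing newline.

```perl
use strict;
use warnings;

# Stand-ins for the regex file and the big input file (hypothetical content).
my $regex_text = "^START_(?:A|B)\n";
my $big_text   = join '', map { "$_\n" }
    qw(junk START_A payload END_OF_BLOCK junk START_C START_B more END_OF_BLOCK tail);

open my $fh_regex,    '<', \$regex_text or die "Cannot open regex source: $!";
open my $fh_big_file, '<', \$big_text   or die "Cannot open input: $!";

my $pattern = <$fh_regex>;
chomp $pattern;               # drop the trailing newline from the file
my $regex = qr/$pattern/;     # compile once, outside the loop

my @printed;
while (my $line = <$fh_big_file>) {
    # Flip-flop: true from a line matching $regex through the next END_OF_BLOCK line.
    push @printed, $line if $line =~ $regex .. $line =~ /^END_OF_BLOCK/;
}
print @printed;
```

With this layout the moment of compilation is explicit in the code, which is the control over timing mentioned above.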

Re: Performance of assambled regex
by anonymized user 468275 (Curate) on Jul 27, 2015 at 13:42 UTC
    The route to improving performance is often a long and winding one. Sometimes it's possible to save time by examining the boundaries. For example, are there any consistent features of the regexes that can be used to cut down the algorithm you end up with? Are the regexes being assembled from multiple regexes which are available separately (e.g. capturable as cgi parameters)? Generally, if big wodges of data are coming my way, before turning the mill, I try to cut them into much smaller and more manageable pieces, either size-wise (e.g. compare fixed-size chunks) or scope-wise (e.g. using a lexer and parser).

    One world, one people