NetWallah has asked for the wisdom of the Perl Monks concerning the following question:
I'm looking for some recommendations/best practice/elegant solutions for the following (This code already produces the right results).
I was surprised by the fact that Method#2 (My original code) is SLOWER than Method#1.
use strict;
use warnings;

my $x= <<"__X__" x 4; # Increase multiplier for benchmarking
A data for A
B data for b
C data for c
__X__

# Method #1 - works but I don't like using $1,$2 - would rather use names
while($x=~/(^A|^B)(.+)$/mg){
    print "$1 method1 $2\n"
}

# Method #2 -
open my $f,"<",\$x or die $!;
while(<$f>){
    my ($name,$data) = m/(^A|^B)(.+)$/ or next;
    print "$name method2 $data\n"
}
close $f;

# Method #3 (infinite loop)
#while(my ($name,$data)=$x=~/(^A|^B)(.+)$/mg){
#    print "$name method3 $data\n"
#}
"These opinions are my own, though for a small fee they be yours too."
Re: Extracting /regex/mg in a while loop
by haukex (Archbishop) on Oct 09, 2023 at 22:25 UTC
Your Method #3 doesn't work because a m//g regex with capture groups in list context returns a list of the substrings matched by any capturing parentheses in the regular expression, that is, ($1, $2...), repeated for each match (see my node here).
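To illustrate the point about list context, here is a minimal sketch (not code from the thread; `@captures` and `@pairs` are names made up for the example). It assumes the same sample data as the original post. Because the list-context match returns every capture at once, the condition in Method #3 rescans the string and yields a non-empty list on every pass, so the loop never ends. Collecting the flat list once and consuming it two items at a time avoids that:

```perl
use strict;
use warnings;

my $x = "A data for A\nB data for b\nC data for c\n" x 2;

# List context: one flat list holding the captures of every match,
# i.e. ($1, $2, $1, $2, ...).
my @captures = $x =~ /(^A|^B)(.+)$/mg;

# Consume the flat list two elements at a time to get (name, data) pairs.
# splice returns an empty list once @captures is exhausted, ending the loop.
my @pairs;
while (my ($name, $data) = splice @captures, 0, 2) {
    push @pairs, [$name, $data];
    print "$name list-context $data\n";
}
```

This keeps named variables, at the cost of holding all captures in memory at once.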
I would guess that in Method #2, the overhead of splitting the string into lines and then running a regex match on each of the lines (as opposed to a single regex iterating through a single string) might be what is causing the slowdown. But I would address your comment in Method #1 about using names as follows, and this is the variation I might suggest in this case: named capture groups (see also perlre).
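The code that accompanied this suggestion is not preserved in this copy of the thread. As a rough sketch only, named capture groups applied to the original pattern might look like the following; the group names `name` and `data` and the array `@seen` are assumptions chosen for the example:

```perl
use strict;
use warnings;

my $x = <<"__X__" x 4;
A data for A
B data for b
C data for c
__X__

my @seen;
# Scalar-context m//g: each pass of the loop finds the next match.
while ($x =~ /(?<name>^A|^B)(?<data>.+)$/mg) {
    # %+ holds the named captures of the last successful match.
    push @seen, "$+{name} named $+{data}";
}
print "$_\n" for @seen;
```

The loop has the same shape as Method #1, so it should keep the single-pass behaviour while replacing `$1` and `$2` with names.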
(There are of course plenty of other ways, such as m/\G.../gc parsing.)
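For readers unfamiliar with the `m/\G.../gc` style mentioned above, here is a minimal sketch of how it could be applied to the same data. It is an illustration under assumptions, not code from the thread: the line-by-line structure and the `@records` array are made up for the example.

```perl
use strict;
use warnings;

my $x = "A data for A\nB data for b\nC data for c\n" x 2;

my @records;
pos($x) = 0;
while (pos($x) < length $x) {
    # \G anchors at the current pos(); /c keeps pos() when a match fails,
    # so the next alternative can be tried from the same position.
    if ($x =~ /\G([AB])(.+)\n/gc) {
        push @records, [$1, $2];    # a line we want
    }
    elsif ($x =~ /\G.*\n/gc) {
        # any other complete line: skip it
    }
    else {
        last;                       # trailing text without a newline
    }
}
print "$_->[0] gc-parse $_->[1]\n" for @records;
```

This style tends to pay off when the input has several kinds of records, since each branch can handle one kind.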
Re: Extracting /regex/mg in a while loop
by NERDVANA (Priest) on Oct 10, 2023 at 08:10 UTC
There's always a trade-off between elegance, performance, and convenience. But I don't think you should be surprised that #1 is faster. Doing a quick back-of-the-envelope estimation of them might look like:

Normally, the performance of something like this wouldn't matter much unless you're processing really huge data, so I'd say write it whichever way is most convenient and reusable. (For example, if you often process files instead of scalars already loaded in memory, you might as well leave it in a read loop.)

But if you want to try some other things for performance (no promises, I'm lazy and just suggesting experiments), you could try: or
Re: Extracting /regex/mg in a while loop
by kcott (Archbishop) on Oct 10, 2023 at 11:03 UTC
G'day NetWallah,

Here's an actual benchmark, using everyone's code (posted to date) plus a couple of my own. I've aimed to keep as close to the originals as possible.

Update: My apologies. I made a huge mistake with the code I originally posted. I've stricken that code and placed it in the spoiler below. Here's a largely rewritten version which works correctly.
Here's the output from a sample run:
Output:
I changed $bench_mult = 1_000 to $bench_mult = 1_000_000; it made little difference to the results.
The 'use v5.38;' represents the version I'm using. You can change that to v5.36 without needing to change any other parts of the code. And, of course, you can wind it back to earlier versions: the further back you go, the more changes will be needed.

CAVEAT for MSWin users: I expect the /dev/null on line 14 will need changing; however, I don't know what would be appropriate.

Update: See my updated reply to haj: that CAVEAT is no longer valid for the updated code.

-- Ken
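The benchmark script discussed in this post is not reproduced in this copy of the thread. As a stand-in, here is a minimal sketch of the general approach using the core Benchmark module. It is not kcott's code: `$str`, `by_string()`, and `by_lines()` are hypothetical names, and only the original poster's Method #1 and Method #2 are compared. The subs count matches instead of printing, so no output sink is needed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data, modelled on the original post.
my $str = "A data for A\nB data for b\nC data for c\n" x 1_000;

# One regex iterating over the whole string (Method #1).
sub by_string {
    my $n = 0;
    $n++ while $str =~ /(^A|^B)(.+)$/mg;
    return $n;
}

# In-memory filehandle, one regex match per line (Method #2).
sub by_lines {
    my $n = 0;
    open my $fh, '<', \$str or die $!;
    while (<$fh>) {
        my ($name, $data) = m/(^A|^B)(.+)$/ or next;
        $n++;
    }
    close $fh;
    return $n;
}

# Sanity check before timing: both subs must find the same records.
die "results differ" unless by_string() == by_lines();

# A negative count means "run each sub for at least that many CPU seconds".
cmpthese(-1, { by_string => \&by_string, by_lines => \&by_lines });
```

Checking that every candidate produces the expected result before timing it helps avoid benchmarking code that silently does nothing.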
by haj (Vicar) on Oct 10, 2023 at 11:54 UTC
Nice work! About that CAVEAT for Windows: File::Spec covers that in a portable way: $devnull = File::Spec->devnull();
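A short sketch of how that call could be used to get a portable output sink. The filehandle name `$out` is chosen for the example, and this is an illustration of the tip, not the code that ended up in the benchmark:

```perl
use strict;
use warnings;
use File::Spec;

# devnull() returns the name of the platform's null device as a string.
my $devnull = File::Spec->devnull();

open my $out, '>', $devnull or die "Cannot open $devnull: $!";
print {$out} "this line is discarded\n";
close $out or die "Cannot close $devnull: $!";
```

Printing to a handle like this keeps the cost of formatting output in the measurement without flooding the terminal.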
by kcott (Archbishop) on Oct 10, 2023 at 16:20 UTC
G'day haj,

++ Thanks for the tip.

For anyone looking for portability, DRYness, and an easy way to add more tests, replace all the code prior to the start of the subroutine definitions with:

Update: I made an error with the original code I posted. I've stricken that code and replaced it with a rewrite which works correctly. That rewrite includes all of the additional features: "portability, DRYness, and an easy way to add more tests". The code I posted below is no longer valid; I've stricken it and placed it in a spoiler. The final paragraph, about adding new tests, is still valid: I left it unchanged.

I've run this a few times: the results are comparable with what I posted earlier. The multiplier for $base_str is now the constant BENCH_MULT; play around with this if you're so inclined. For more tests, add your subroutine and its name to @bench_names; the rest is done for you.

-- Ken
by Marshall (Canon) on Oct 10, 2023 at 19:18 UTC
The "bit bucket" on Windows is the pseudo-file called NUL. This is a reserved file name, and you cannot create a file with that name. On the command line, type someFile > NUL reads someFile and sends it to nowhere. You can open a filehandle to NUL and write to it.
by kcott (Archbishop) on Oct 10, 2023 at 23:04 UTC
G'day Marshall, ++haj already pointed me to a portable solution, using devnull() from File::Spec, which I incorporated into my code.
While writing my original code, I had used state variables so that the "# Increase multiplier for benchmarking" (from NetWallah's OP) would only be called once for each of the multitude of calls by Benchmark::cmpthese(). Unfortunately, subsequently adding code (prior to posting) to test each of the subroutines to be benchmarked ("say 'Test', ...") caused major problems. I didn't spot that until after I posted: I pretty much rewrote the code to fix these problems and added Updates to explain what I'd done.

In order to avoid invalidating what haj had written (as per "How do I change/delete my post?") I left "CAVEAT for MSWin users: ..." but immediately followed it with "Update: ... CAVEAT is no longer valid ..."

I appreciate this ended up all a bit messy. My apologies if it caused confusion.

-- Ken
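For context on the state variables mentioned above, here is a minimal sketch of the once-only initialisation they provide. It is not kcott's code: `bench_string()` and the `$builds` counter are hypothetical and exist only to make the behaviour visible.

```perl
use strict;
use warnings;
use feature 'state';

my $builds = 0;    # counts how often the string is actually built

sub bench_string {
    # A state variable is initialised on the first call only and keeps
    # its value across all later calls to this sub.
    state $str = do {
        $builds++;
        "A data for A\nB data for b\nC data for c\n" x 1_000;
    };
    return \$str;
}

bench_string() for 1 .. 5;
print "string built $builds time(s)\n";
```

The same persistence applies to any call made before the benchmark starts, such as a test run, so whatever a state variable holds after that call is what the timed calls will see.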
by NetWallah (Canon) on Oct 11, 2023 at 05:54 UTC
I updated the code somewhat to allow it to run on my slightly older (5.34) Perl.
My results are different than yours:
by kcott (Archbishop) on Oct 11, 2023 at 09:36 UTC
"My results are different than yours:" Well, that's hardly surprising. Surely the fact that none of the tests output anything raised a red flag. I put the code you posted in op_bench1.pl and ran it:
I copied op_bench1.pl to op_bench1_mod1.pl; added
to the end of the file and ran it:
I hit Ctrl-C at that point. I copied op_bench1_mod1.pl to op_bench1_mod2.pl. I added
after
and ran it:
I copied op_bench1_mod2.pl to op_bench1_mod3.pl. Then changed
to
The names of the subroutines identified the author (first three characters of username in lower-case) and the order of suggested solutions (a single digit). I changed
to
You'd placed kco3() between kco1() and kco2(): I initially had difficulty locating it as I expected it to be added to the end. You also had it erroneously reporting that it was kco2:
I moved it to the end and fixed those problems. The last code in the script is now:
I then ran this script:
Now the results more closely resemble mine. Here's op_bench1_mod3.pl in full:
-- Ken