in reply to RE on lines read from in-memory scalar is very slow

It could be related to your system or to the cygwin build or its general environment.

Running with Strawberry Perl 5.38 I see the in-memory version being faster. Results are similar for SP 5.36.1.

perl 11157166.pl
tempfile.txt is size 29 Mb
0.133742 read lines from disk and do RE (2047 matches).
0.079146 read lines from in-memory file and do RE (2047 matches).

Edit: I also tried MSYS2, and there the in-memory loop is slower than the disk loop.

tempfile.txt is size 29 Mb
0.057704 read lines from disk and do RE (2047 matches).
0.805546 read lines from in-memory file and do RE (2047 matches).

(End of edit)

Modified code is below. The main change is that it generates the test file if none is given on the command line (beware: it overwrites any existing tempfile.txt without checking). It also reports the number of matches.

#!/usr/bin/env perl
use warnings;
use strict;
use feature 'say';    # needed for say() below
use Time::HiRes qw( time );

my $file = shift @ARGV;
my ($fh, $time);

# Generate a test file if none was given on the command line.
if (!$file) {
    # should use Path::Tiny::tempfile
    $file = 'tempfile.txt';
    open my $ofh, '>', $file or die "Cannot open $file for writing, $!";
    srand(1234567);
    for my $i (0..200000) {
        my $string = 'some random text ' . rand();
        $string = $string x int (rand() * 10);
        if (rand() < 0.01) {
            $string = " Query${string}";
        }
        say {$ofh} $string;
    }
    $ofh->close or die "Cannot close $file, $!";
    # NB: 1028 is presumably meant to be 1024; left as-is so the
    # reported size matches the output shown in this thread.
    printf "%s is size %i Mb\n", $file, (-s $file) / (1028**2);
}

# Pass 1: read lines from disk.
open $fh, "<", $file or die "Cannot open $file for reading, $!";
$time = time;
my $match_count1 = 0;
while(<$fh>) {
    /^ ?Query/ && $match_count1++;
}
printf "%f read lines from disk and do RE ($match_count1 matches).\n", time - $time;

# Slurp the file into a scalar, then reopen it as an in-memory filehandle.
seek $fh, 0, 0;
my $s = "";
while(<$fh>) {
    $s .= $_;
}
open $fh, "<", \$s or die "Cannot open in-memory filehandle, $!";

# Pass 2: read lines from the in-memory file.
$time = time;
my $match_count2 = 0;
while(<$fh>) {
    /^ ?Query/ && $match_count2++;
}
printf "%f read lines from in-memory file and do RE ($match_count2 matches).\n", time - $time;

Re^2: RE on lines read from in-memory scalar is very slow
by Danny (Chaplain) on Jan 23, 2024 at 06:10 UTC
    I just made another interesting observation. The regex I was using matches 125,277 times (16.3% of the lines) in my example file. If I change the regex to /^ ?QueryXXX/ so that it matches nothing, I get:
    0.114286 read lines from disk and do RE; n=769114.
    0.104568 read lines from in-memory file and do RE; n=769114.
    This is probably more for the perl developers or cygwin distribution people, but still pretty quirky.
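
    To isolate the match/no-match effect from any particular file, a minimal sketch along these lines could be used. It is not the script that produced the timings above: the helper name time_in_memory_scan and the synthetic data (roughly one line in six starting with Query) are assumptions made for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw( time );

# Hypothetical helper: read every line of $$text_ref through an
# in-memory filehandle, apply $regex, and return (matches, seconds).
sub time_in_memory_scan {
    my ($text_ref, $regex) = @_;
    open my $fh, '<', $text_ref
        or die "Cannot open in-memory filehandle, $!";
    my $matches = 0;
    my $start   = time;
    local $_;    # the loop below assigns to the global $_
    while (<$fh>) {
        $matches++ if /$regex/;
    }
    my $elapsed = time - $start;
    close $fh;
    return ($matches, $elapsed);
}

# Synthetic data (an assumption): about one line in six starts with "Query".
my $text = join '', map {
    ($_ % 6 == 0 ? 'Query' : '') . "some random text $_\n"
} 1 .. 50_000;

# Same data, same loop; only the regex differs (matching vs. never matching).
for my $regex (qr/^ ?Query/, qr/^ ?QueryXXX/) {
    my ($matches, $elapsed) = time_in_memory_scan(\$text, $regex);
    printf "%f s, %d matches for %s\n", $elapsed, $matches, $regex;
}
```

    If the slowdown depends on successful matches, the first regex should be noticeably slower than the second on an affected perl, while both should take similar time on an unaffected one.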

      Modifying my code so 16.3% of the lines will match the regex (previously it was 1%) gives me these timings for Strawberry Perl 5.38:

      tempfile.txt is size 29 Mb
      0.121952 read lines from disk and do RE (32571 matches).
      0.491674 read lines from in-memory file and do RE (32571 matches).

      So consistent with your results for Strawberry Perl.

      And for MSYS2 perl 5.38.2:

      tempfile.txt is size 29 Mb
      0.064073 read lines from disk and do RE (32571 matches).
      9.538524 read lines from in-memory file and do RE (32571 matches).

      So something would appear to be awry with the regex matching under MSYS2 and Cygwin.

        I ran the code using use re 'debug' and there is no difference in the regex processing.
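
        For reference, a minimal sketch of how the regex engine's trace can be enabled for just the matching loop. This is an illustration rather than the exact code that was run; the two sample lines are made up. The trace goes to STDERR, so it can be captured separately for the disk and in-memory runs and then compared.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Made-up sample lines: one that matches and one that does not.
my @lines = (" Querysome random text 0.1\n", "some random text 0.2\n");

my $matches = 0;
{
    # Lexically scoped: only regexes compiled inside this block are traced.
    use re 'debug';
    for my $line (@lines) {
        $matches++ if $line =~ /^ ?Query/;
    }
}
print "$matches match(es)\n";
```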

        I then instrumented the code with some Devel::Peek Dumps. The in-memory strings have rapidly increasing amounts of memory allocated (the LEN field), plateauing at close to the size of the input string. This pattern is the same for both Strawberry Perl and MSYS2 Perl, which makes me wonder if the delay is related to memory management. Others are more qualified to comment on that front than I am, though.

        Edit: Just for completeness I also tested using Perl 5.36.0 on Ubuntu via WSL and the memory usage is the same.
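
        A minimal sketch of this kind of instrumentation, for anyone who wants to repeat the check on a small input. It is not the updated script itself: the synthetic buffer and the limit of three dumps are assumptions chosen to keep the output short.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Devel::Peek qw( Dump );

# Small synthetic buffer (an assumption; the thread used a ~35 Mb file).
my $text = join '', map { " Querysome random text $_\n" } 1 .. 1000;

open my $fh, '<', \$text or die "Cannot open in-memory filehandle, $!";

my $matches = 0;
while (<$fh>) {
    next unless /^ ?Query/;
    $matches++;
    # Dump writes to STDERR. Compare CUR (bytes in use) with LEN
    # (bytes allocated) for the first few matching lines only.
    Dump($_) if $matches <= 3;
}
close $fh;

print "$matches matching lines\n";
```

        Running the same loop over a disk filehandle and comparing the LEN values is the comparison described above.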

        Updated code is below, inside the readmore tags.

        tempfile.txt is size 35 Mb
        SV = PV(0x177f89db120) at 0x177f88bcca0
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x177fa280250 " Querysome random text 0.271320203145251\n"\0
          CUR = 41
          LEN = 408
          COW_REFCNT = 2
        SV = PV(0x177f89db120) at 0x177f88bcca0
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x177fa34f4d0 " Querysome random text 0.775348369818055some random text 0.775348369818055\n"\0
          CUR = 75
          LEN = 201
          COW_REFCNT = 2
        SV = PV(0x177f89db120) at 0x177f88bcca0
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x177fa3a28b0 " Querysome random text 0.785001144808529\n"\0
          CUR = 41
          LEN = 43
          COW_REFCNT = 2
        SV = PV(0x177f89db120) at 0x177f88bcca0
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x177fa297c60 " Querysome random text 0.894431999356865\n"\0
          CUR = 41
          LEN = 309
          COW_REFCNT = 2
        SV = PV(0x177f89db120) at 0x177f88bcca0
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x177f88a13b0 " Querysome random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259\n"\0
          CUR = 313
          LEN = 392
          COW_REFCNT = 2
        SV = PV(0x177f89db120) at 0x177f88bcca0
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x177fa3750d0 " Querysome random text 0.809515115277865\n"\0
          CUR = 41
          LEN = 275
          COW_REFCNT = 2
        0.005142 read lines from disk and do RE (6 matches).
        SV = PV(0x177f89db120) at 0x177f88bcca0
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x177f88b3ed0 " Querysome random text 0.271320203145251\n"\0
          CUR = 41
          LEN = 656
          COW_REFCNT = 2
        SV = PV(0x177f89db120) at 0x177f88bcca0
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x1778000b070 " Querysome random text 0.775348369818055some random text 0.775348369818055\n"\0
          CUR = 75
          LEN = 37784908
          COW_REFCNT = 2
        SV = PV(0x177f89db120) at 0x177f88bcca0
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x1778242b070 " Querysome random text 0.785001144808529\n"\0
          CUR = 41
          LEN = 37784634
          COW_REFCNT = 2
        SV = PV(0x177f89db120) at 0x177f88bcca0
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x1778000d070 " Querysome random text 0.894431999356865\n"\0
          CUR = 41
          LEN = 37784593
          COW_REFCNT = 2
        SV = PV(0x177f89db120) at 0x177f88bcca0
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x1778242b070 " Querysome random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259some random text 0.412049736815259\n"\0
          CUR = 313
          LEN = 37784176
          COW_REFCNT = 2
        SV = PV(0x177f89db120) at 0x177f88bcca0
          REFCNT = 1
          FLAGS = (POK,IsCOW,pPOK)
          PV = 0x1778000f070 " Querysome random text 0.809515115277865\n"\0
          CUR = 41
          LEN = 37783182
          COW_REFCNT = 2
        0.004073 read lines from in-memory file and do RE (6 matches).

        FWIW, Win10 & Strawberry:

        v5.32.1
        tempfile.txt is size 29 Mb
        0.115239 read lines from disk and do RE (32571 matches).
        0.676642 read lines from in-memory file and do RE (32571 matches).

        v5.38.0
        tempfile.txt is size 29 Mb
        0.156708 read lines from disk and do RE (32571 matches).
        0.628221 read lines from in-memory file and do RE (32571 matches).

        v5.26.3
        tempfile.txt is size 29 Mb
        0.122374 read lines from disk and do RE (32571 matches).
        0.671405 read lines from in-memory file and do RE (32571 matches).

        v5.16.3
        tempfile.txt is size 28 Mb
        0.119628 read lines from disk and do RE (32760 matches).
        0.057724 read lines from in-memory file and do RE (32760 matches).
Re^2: RE on lines read from in-memory scalar is very slow
by Danny (Chaplain) on Jan 23, 2024 at 05:49 UTC
    It could be related to your system or to the cygwin build or its general environment.

    Yeah, it is definitely specific to the cygwin perl. I just installed Strawberry 5.38.0, and when I run it from the same cygwin terminal as the original test, the timings look fairly normal:

    0.231036 read lines from disk and do RE; n=769114.
    1.459296 read lines from in-memory file and do RE; n=769114.
    It would be nice if someone else using cygwin could test it. I have the current cygwin version of perl which is 5.36.3.