in reply to RE on lines read from in-memory scalar is very slow

It is possible to modify the code to get an Out of Memory (OOM) error on MSYS2 Perl 5.36 and on Strawberry Perl 5.20 and 5.38. This does not occur with a perlbrewed 5.36 on Ubuntu via WSL, nor with Strawberry Perl 5.18.

Collecting the lines read from the in-memory file handle into an array is sufficient to trigger it. Commenting out the regex in the in-memory file handle loop makes the OOM go away, as does modifying the string before adding it to the array.

(I also modified the code to use a separate variable for the in-memory file handle. It has no effect but is arguably cleaner.)

#!/usr/bin/env perl
use warnings;
use strict;
use feature 'say';    # needed for the say call below
use Time::HiRes qw( time );
use Devel::Peek;

my $file = shift @ARGV;
my ($fh, $time);
my (@arr1, @arr2);
my $use_dump = 0;

if (!$file) {
    # should use Path::Tiny::tempfile
    $file = 'tempfile.txt';
    open my $ofh, '>', $file or die "Cannot open $file for writing, $!";
    srand(1234567);
    for my $i (0..200000) {
        my $string = 'some random text ' . rand();
        $string = $string x (1 + int (rand() * 10));
        if (rand() < 0.163) {
            $string = " Query${string}";
        }
        say {$ofh} $string;
    }
    $ofh->close or die "Cannot close $file, $!";
    printf "%s is size %i MB\n", $file, (-s $file) / (1024**2);
}

open $fh, "<", $file or die "Cannot open $file for reading, $!";
my $s = do {local $/ = undef; <$fh>};
seek $fh, 0, 0;
print "\n\n";

# Pass 1: read lines from the file on disk
$time = time;
my $match_count1 = 0;
my $i1 = 0;
my $xx;
while(<$fh>) {
    /^ ?Query/ && $match_count1 ++;
    push @arr1, $_;
    if ($use_dump and /^ Query/) {
        Dump $_;
        $i1 ++;
        last if $i1 > 5;
    }
}
printf "%f read lines from disk and do RE ($match_count1 matches).\n", time - $time;
$fh->close;

# Pass 2: read the same lines from an in-memory file handle
open my $mfh, "<", \$s or die "Cannot open in-memory file handle, $!";
$time = time;
my $match_count2 = 0;
my $i2 = 0;
while(<$mfh>) {
    # comment this out to avoid the OOM
    /^ ?Query/ && $match_count2++;
    #push @arr2, ($_ . "");   # avoids OOM
    push @arr2, $_;   # OOM!
    if ($use_dump and /^ Query/) {
        Dump $_;
        $i2++;
        last if $i2 > 5;
    }
}
printf "%f read lines from in-memory file and do RE ($match_count2 matches).\n", time - $time;
$mfh->close;
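
For anyone who wants numbers rather than reading Devel::Peek dumps by eye, here is a minimal sketch that reports the used length (CUR) and allocated buffer length (LEN) of each stored line via the core B module. The helper name buf_stats and the tiny five-line input are my own illustration, not part of the script above, and whether oversized buffers have anything to do with the OOM is only an assumption; nothing in this thread establishes it.

```perl
use strict;
use warnings;
use B ();

# Hypothetical helper (not from the original script): return the used
# length (CUR) and the allocated buffer length (LEN) of a string scalar.
sub buf_stats {
    my ($ref) = @_;
    my $sv = B::svref_2object($ref);
    return ($sv->CUR, $sv->LEN);
}

# Small stand-in for the slurped file contents.
my $s = join '', map { "line $_\n" } 1 .. 5;

open my $mfh, '<', \$s or die "Cannot open in-memory handle: $!";
my (@plain, @copied);
while (<$mfh>) {
    push @plain,  $_;         # the variant reported to trigger the OOM
    push @copied, $_ . "";    # the variant reported to avoid it
}
close $mfh;

# Print CUR and LEN side by side for both variants.
for my $i (0 .. $#plain) {
    printf "line %d: plain CUR=%d LEN=%d | copied CUR=%d LEN=%d\n",
        $i + 1, buf_stats(\$plain[$i]), buf_stats(\$copied[$i]);
}
```

Running this on an affected and an unaffected perl and comparing the LEN columns might show whether the two push variants store their strings differently.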

Re^2: RE on lines read from in-memory scalar is very slow (OOM variant)
by Danny (Chaplain) on Jan 24, 2024 at 02:56 UTC
    Very interesting. I was able to reproduce the OOM error with my Cygwin perl, and the two modifications you mentioned to avoid the OOM worked for me as well. Strangely, when I watch the perl process in the task manager, or just watch the total memory usage, it never varies. On my system the process reaches about 230 MB of memory usage and total memory stays at about 15.3 GB over the whole execution. It seems that it might suddenly hit a memory leak that grows so fast that the task manager doesn't detect it before the process dies. On my system the time between the start of the loop that produces the OOM and the exception is about 22 seconds.
      Another observation: when I add a "print;" before the push @arr2, I get a file that is 352602493 / 35048455 = 10.06 times larger than the input file. The first 10% seems to match the original, and then there are more lines. I'm looking at what these lines correspond to.
        Strangely, if I strip my input file of carriage returns (s/\r//g), the file with printed lines is only 1.8% of the size of the input file. The OOM again occurs about 22 seconds after the start of the push @arr2 loop. Too weird.

        EDIT: I no longer think the carriage returns had anything to do with the file size. I've rerun the script on the original file, and the number of printed lines varies from run to run and is usually smaller than the number of lines in the input file. The first time I ran it the output happened to be 10x larger than the input, but I haven't been able to reproduce this.
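
Since the printed output did not match the input in size, a cheaper check than diffing files is to count what the in-memory handle actually returns and compare it with the source scalar. This is a minimal sketch with a small generated string standing in for the slurped file; the variable names are illustrative only. On a perl that behaves correctly the two pairs of numbers should agree.

```perl
use strict;
use warnings;

# Small stand-in for the slurped file contents.
my $s = join '', map { "some random text $_\n" } 1 .. 1000;
my $expected_lines = ($s =~ tr/\n//);    # count newlines without modifying $s

open my $mfh, '<', \$s or die "Cannot open in-memory handle: $!";
my ($lines, $bytes) = (0, 0);
while (<$mfh>) {
    $lines++;
    $bytes += length $_;
}
close $mfh;

printf "read %d lines / %d bytes; source has %d lines / %d bytes\n",
    $lines, $bytes, $expected_lines, length $s;
```

Adding the push @arr2, $_ line inside the loop and rerunning on the real input would show whether the counts start to drift before the OOM hits.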