Re: RE on lines read from in-memory scalar is very slow

Note, that when I just loop through the lines without doing the RE the disk and in-memory files take about the same amount of time, so it seems it is the RE that is causing the problem.

That suggests to me that the data seen in the first and second loops are not the same. If you put a counter in the loops, do they both see the same number of lines?
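A minimal sketch of that counter check, assuming a small generated temp file stands in for the real input (the file contents and the `count_lines` helper are hypothetical, not from the original code):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical test data: a temp file stands in for the real input file.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "line $_\n" for 1 .. 1000;
close $tmp_fh or die "close: $!";

# Count the records a handle yields with the default $/ ("\n").
sub count_lines {
    my ($fh) = @_;
    my $n = 0;
    while (defined(my $line = <$fh>)) { $n++ }
    return $n;
}

open my $disk_fh, '<', $path or die "open $path: $!";

# Build the in-memory copy line by line, then rewind the disk handle.
my $s = '';
while (defined(my $line = <$disk_fh>)) { $s .= $line }
seek $disk_fh, 0, 0 or die "seek: $!";

open my $mem_fh, '<', \$s or die "open in-memory: $!";

my $disk_count = count_lines($disk_fh);
my $mem_count  = count_lines($mem_fh);
print "disk: $disk_count lines, memory: $mem_count lines\n";
```

If the two counts differ on the real data, the loops are not doing the same work and the timings are not comparable.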

In particular, if you are running this on Windows, I believe Perl may automatically convert CRLF line endings on read, in which case building up $s in this way may result in something that would read the whole in-memory file as a single line.
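One way to check for that kind of conversion, as a rough sketch: list the PerlIO layers on the handle and count how many records still end in CRLF after reading. The three-record sample file here is made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample: three CRLF-terminated records written in binary mode.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "alpha\r\n", "beta\r\n", "gamma\r\n";
close $tmp_fh or die "close: $!";

open my $fh, '<', $path or die "open $path: $!";

# A "crlf" entry in this list means CRLF is translated to LF on read.
my @layers = PerlIO::get_layers($fh);
print "layers: @layers\n";

# Count records that still carry a carriage return before the newline.
my ($total, $with_cr) = (0, 0);
while (defined(my $line = <$fh>)) {
    $total++;
    $with_cr++ if $line =~ /\r\n\z/;
}
close $fh;

print "$total records, $with_cr still ending in CRLF\n";
```

Running the same check against the in-memory handle would show whether the two handles agree on what a line is.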

You might try the following approach to set up the in-memory file, which would be more efficient and may be more certain to give the same content:

seek $fh, 0, 0;
my $s = do {
  local $/;
  <$fh>;
};

This reads the full content of the file in one go, rather than reading it line by line and then appending. (See also what's faster than .= for more detail on why building up a string piecewise can be really inefficient.)
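To confirm that both ways of filling the scalar give the same content, something like the following could be used. It is only a sketch with generated data (the temp file and variable names are assumptions), and it compares content rather than speed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical test data written in binary mode.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "record $_\n" for 1 .. 500;
close $tmp_fh or die "close: $!";

open my $fh, '<', $path or die "open $path: $!";

# Piecewise: read line by line and append.
my $appended = '';
while (defined(my $line = <$fh>)) { $appended .= $line }

# Rewind, then slurp everything in one read by undefining $/ locally.
seek $fh, 0, 0 or die "seek: $!";
my $slurped = do { local $/; <$fh> };
close $fh;

my $verdict = $appended eq $slurped ? 'identical' : 'different';
printf "appended: %d bytes, slurped: %d bytes, %s\n",
    length($appended), length($slurped), $verdict;
```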


Re^2: RE on lines read from in-memory scalar is very slow by ikegami (Patriarch) on Jan 23, 2024 at 00:48 UTC
Or use File::Slurper. Nice clean interface, and I believe the author actually looked into what's the fastest approach.
Re^2: RE on lines read from in-memory scalar is very slow by Danny (Chaplain) on Jan 23, 2024 at 01:30 UTC
This code doesn't make any difference to the relative times on the Windows 11/Cygwin system. If I count the lines in the original code, they are the same for both. The efficiency of populating a scalar from a file is important in general, but not really relevant to the problem here.

`my $s = do { local $/; <$fh>; };`

A couple of additional pieces of info: I did 100k random seeks on the disk file and the memory file, and the times were about the same. When the code is running with the regex loop, it seems to be CPU-limited.
Re^2: RE on lines read from in-memory scalar is very slow (cygwin and \n) by LanX (Saint) on Jan 23, 2024 at 13:06 UTC
If this problem is only reproducible on Windows + Cygwin, then "OS confusion" like

> CRLF line endings ...

seems to be a plausible theory. (I seem to remember a similar discrepancy discussed here not too long ago, or was it WSL?)

Now I'd turn the test around. I'd try to autogenerate the string with various forms of line endings and see what happens. Those variants could also be written to disk and tested again.

Unfortunately I have no Windows machine at my disposal right now.

My bet is that a string generated with plain `"\n"` behaves normally. If not, we would at least have taken the filesystem out of the equation. And with generated text we could test whether the time consumption is linear in the size.

Cheers Rolf
(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery)
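A rough sketch of such a generated-data test, with the filesystem left out entirely. The record content, the sizes, and the `/^Query/` pattern are stand-ins (the real regex from the original test should be substituted), and `make_data` and `scan` are hypothetical helper names.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Build $count records of filler text, each terminated by $eol.
sub make_data {
    my ($count, $eol) = @_;
    my $record = ('X' x 45) . $eol;
    return $record x $count;
}

# Read the string through an in-memory handle and apply a stand-in regex.
sub scan {
    my ($data) = @_;
    open my $fh, '<', \$data or die "open in-memory: $!";
    my ($lines, $hits) = (0, 0);
    my $start = Time::HiRes::time();
    while (defined(my $line = <$fh>)) {
        $lines++;
        $hits++ if $line =~ /^Query/;
    }
    my $elapsed = Time::HiRes::time() - $start;
    close $fh;
    return ($lines, $hits, $elapsed);
}

my %eol = (LF => "\n", CRLF => "\r\n", CR => "\r");
my %seen;    # line counts per ending and size, for cross-checking

for my $name (sort keys %eol) {
    for my $count (10000, 20000, 40000) {
        my ($lines, $hits, $secs) = scan(make_data($count, $eol{$name}));
        $seen{$name}{$count} = $lines;
        printf "%-4s %6d records -> %6d lines, %.4fs\n",
            $name, $count, $lines, $secs;
    }
}
```

Doubling the record count should roughly double the time if the cost is linear. A line count far below the record count for some ending would mean the handle is not splitting the data where expected.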
Re^3: RE on lines read from in-memory scalar is very slow (cygwin and \n) by kcott (Archbishop) on Jan 24, 2024 at 16:04 UTC
"My bet is that a string generated with plain `"\n"` behaves normal."

I went back and did an additional check on the data I used for my earlier tests. Each record in that test ends with just `LF`. For a specific test, I created two tiny files: `check_lf`, which just contains "`qwerty<LF>`"; and `check_crlf`, in which I forced a `CRLF` ending, so its contents are "`qwerty<CR><LF>`".

$ for i in test_data test_data_Q check_lf check_crlf; do head -1 $i | cat -vet; done
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX$
QueryXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX$
qwerty$
qwerty^M$

For anyone unfamiliar with "`cat -vet`", a newline is shown as a "`$`" and a carriage return is shown as "`^M`".

Edit: I changed several instances of `NL` to `LF`. This was for consistency with other parts of my post as well as `LF` being a generally recognised de facto standard (`CRLF` is more usual than `CRNL`).

- Ken
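For systems without `cat -vet`, a similar view can be produced in Perl itself. This is a small sketch covering only CR, TAB and LF, not the full set of characters that `cat -v` handles; `show_endings` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Render a few control characters roughly the way `cat -vet` does:
# CR becomes ^M, TAB becomes ^I, and each LF is preceded by a "$".
sub show_endings {
    my ($text) = @_;
    $text =~ s/\r/^M/g;
    $text =~ s/\t/^I/g;
    $text =~ s/\n/\$\n/g;
    return $text;
}

print show_endings("qwerty\n");      # LF-terminated record
print show_endings("qwerty\r\n");    # CRLF-terminated record
```

The same function could be applied to the first line read from the in-memory handle to see exactly which terminator the regex loop receives.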
Re^4: RE on lines read from in-memory scalar is very slow (cygwin and \n) by LanX (Saint) on Jan 24, 2024 at 21:40 UTC
Sorry if I'm too tired to see the conclusion of these tests. Are you saying that you reran them with different line endings, but the results were the same?

Cheers Rolf
(addicted to the Perl Programming Language :) see Wikisyntax for the Monastery)