sectokia has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have noticed that perl reads a minimum of 8KB (at least on Windows 64-bit NTFS) whenever I try to read after seeking. I would like to turn this down to a single 512-byte LBA if possible, to (hopefully) boost performance. Can this be done? Or is it OS/filesystem dependent?

Example of this situation:

open my $fh, '<', 'reallyBigBinaryFile.bin' or die $!;
binmode $fh;
my $offset = 0;
$/ = "~END";
while (1) {
    seek($fh, $offset, 0);
    $record = <$fh>;   # Windows and drive show 16x 512b LBAs read here
                       # length $record is 16 to 64
    $offset = newOffset($record);
}

Replies are listed 'Best First'.
Re: Perl reads minimum 8KB from files. Can this be lowered?
by GrandFather (Saint) on Apr 08, 2022 at 06:58 UTC

    It would be unusual for a small read size to boost performance, at least with spinning rust. Even if it were the case, changing Perl's minimum read size will almost certainly not make any difference to how the OS buffers file I/O so it's likely that you can only hurt performance by increasing the number of calls into the OS to get work done.

    Maybe there is some other reason you are getting a performance hit (assuming this isn't just "I wonder what would happen ..." territory). If you have a real performance problem to solve, perhaps you would like to tell us more about it so we may be able to offer some more substantial advice?

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Perl reads minimum 8KB from files. Can this be lowered?
by Corion (Patriarch) on Apr 08, 2022 at 06:38 UTC

    You can set the amount that Perl reads by setting the $/ variable, to (in your case) \512. See perlvar for more information on $/.
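    A minimal sketch of what that looks like (the in-memory filehandle is just to keep the demo self-contained; 512 is the OP's LBA size):

    ```perl
    use strict;
    use warnings;

    # A reference to an integer in $/ makes <$fh> return fixed-size chunks
    # of at most that many bytes, instead of newline-terminated lines.
    my $data = "x" x 2000;
    open my $fh, '<', \$data or die "open: $!";
    binmode $fh;
    local $/ = \512;
    my $chunk = <$fh>;
    print length($chunk), "\n";   # 512
    ```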

      That's not true. That will just change the amount returned.

      $ strace perl -e'$/ = \512; 1 while <STDIN>' <a 2>&1 | grep 'read(0'
      read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
      read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
      read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 3616
      read(0, "", 8192) = 0
Re: Perl reads minimum 8KB from files. Can this be lowered?
by ikegami (Patriarch) on Apr 08, 2022 at 16:16 UTC

    The size of reads for buffered I/O is configurable (since 5.014), but only when Perl is compiled.

    You could use sysread instead of buffered I/O.

    $ strace perl -e'1 while read(STDIN, $buf, 512)' <a 2>&1 | grep 'read(0'
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 3616
    read(0, "", 8192) = 0
    $ strace perl -e'1 while sysread(STDIN, $buf, 512)' <a 2>&1 | grep 'read(0'
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 512) = 512
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 512) = 512
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 512) = 512
    [36 more identical 512-byte reads]
    read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", 512) = 32
    read(0, "", 512) = 0
      This seems the best solution and fixes the problem for me.
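      A self-contained sketch of the sysread approach (the temp file, offset, and sizes here are illustrative, not from the thread — in the OP's loop you would sysseek to the computed record offset instead):

      ```perl
      use strict;
      use warnings;
      use Fcntl qw(SEEK_SET);
      use File::Temp qw(tempfile);

      # Write some demo data so the sketch runs on its own.
      my ($wfh, $fname) = tempfile(UNLINK => 1);
      binmode $wfh;
      print {$wfh} 'A' x 2048;
      close $wfh or die "close: $!";

      # sysseek/sysread bypass PerlIO buffering, so the requested size is
      # what is actually passed to the OS read call.
      open my $fh, '<', $fname or die "open: $!";
      binmode $fh;
      sysseek($fh, 1024, SEEK_SET) or die "sysseek: $!";
      my $n = sysread($fh, my $record, 512);   # one 512-byte request
      die "sysread: $!" unless defined $n;
      print "read $n bytes\n";   # read 512 bytes
      ```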
Re: Perl reads minimum 8KB from files. Can this be lowered?
by Anonymous Monk on Apr 08, 2022 at 06:43 UTC
      It seems like you get faster random I/O when it's smaller, but faster sequential I/O when it's larger.

        sectokia:

        When reading this thread, I was curious. If perl uses the same block size as the record size in the OS, I'd expect reading short records randomly might be a little faster than reading the default size blocks (due to copying less data into the perl variables). If perl uses a larger block size than the OS, I'd expect that reading smaller block sizes would be significantly faster. So I whipped up a little test case:

        #!env perl
        #
        # pm11142815.pl <FName> <recSize>
        #
        # How fast is random I/O with big (default size) vs small buffers? To find out,
        # we'll first read N blocks (randomly) from the first half of the big file using
        # the default block size. Then we'll set the block size to the record size and
        # read N blocks from the second half of the big file.
        #
        # 20220408
        use strict;
        use warnings;
        use Time::HiRes qw( gettimeofday tv_interval );

        my $FName   = shift or die "Missing FName and record size!";
        my $recSize = shift or die "Missing record size!";
        $recSize = $recSize + 0;

        open my $FH, '<', $FName or die "Can't open $FName: $!\n";
        binmode $FH;

        my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,
            $atime,$mtime,$ctime,$blksize,$blocks) = stat($FName);
        my $half_file_size = int($size/2);
        print "File size: $size, record size=$recSize\n";

        my $N = 10_000;
        read_N_blocks($N, $FH, 0, $half_file_size);

        print "Setting record size to $recSize bytes\n";
        $/ = \$recSize;
        read_N_blocks($N, $FH, $half_file_size, $size);

        sub read_N_blocks {
            my ($N, $FH, $start, $end) = @_;
            print "Read $N records from file between $start and $end\n";
            my $dS = $end - $start - $recSize;
            my $t = [gettimeofday];
            for my $i (1 .. $N) {
                my $offset = int($dS * rand) + $start;
                seek($FH, $offset, 0);
                my $record = <$FH>;
            }
            my $dT = tv_interval($t, [gettimeofday]);
            print "\ttook ${dT}s\n\n";
        }

        When I ran it, I saw this:

        $ perl ~/pm11142815.pl /Work/OmniGlyph/big_tmp_file 32
        File size: 12854692173, record size=32
        Read 10000 records from file between 0 and 6427346086
                took 52.865314s
        Setting record size to 32 bytes
        Read 10000 records from file between 6427346086 and 12854692173
                took 6.5254s

        Reading small blocks is much faster than reading default-sized blocks. But at this point, I've got to go to work and stop playing with this. I haven't done any research to find out what the default block size is on the OS (Windows 8.1 64b), or used by my perl (v5.30.3 built for x86_64-cygwin-threads-multi). So I don't know if the speed difference is due to block size differences between perl and the OS or perl carving the block up into smaller chunks or whatever. I'll leave that as an exercise for interested parties. (Of course, there's always the small chance that I'll revisit this tomorrow.)

        The differences are probably due to one or more of the following things I can think of (and probably others that aren't immediately coming to mind):

        • Perl and the OS may have different default buffer sizes. This should be a simple matter to determine.
        • By not explicitly setting a size in the first case, Perl is spending more time than I'd expect looking for a newline in the buffer.
        • The time to create the record variable is more significant than I imagined, so creating the smaller records saved a lot of time.

        It might be interesting to try different record sizes and see if there are any inflection points in the timings to get a few more clues, and to help narrow down the reasons.
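        Such a sweep might look like this (a sketch along the lines of the script above; the self-generated 1MB file and the particular sizes are illustrative — point it at a large file for meaningful numbers):

        ```perl
        use strict;
        use warnings;
        use Time::HiRes qw( gettimeofday tv_interval );
        use File::Temp qw(tempfile);

        # Generate dummy data so the sketch is self-contained.
        my ($wfh, $FName) = tempfile(UNLINK => 1);
        binmode $wfh;
        print {$wfh} 'x' x (1 << 20);   # 1 MB
        close $wfh or die "close: $!";

        open my $FH, '<', $FName or die "open $FName: $!";
        binmode $FH;
        my $size = -s $FName;

        # Time random reads at several record sizes, looking for inflection
        # points in the results.
        my %times;
        for my $recSize (32, 512, 4096, 8192) {
            local $/ = \$recSize;
            my $dS = $size - $recSize;
            my $t  = [gettimeofday];
            for (1 .. 1000) {
                seek($FH, int($dS * rand), 0);
                my $record = <$FH>;
            }
            $times{$recSize} = tv_interval($t, [gettimeofday]);
            printf "recSize=%5d: %.4fs\n", $recSize, $times{$recSize};
        }
        ```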

        Anyway, thanks for an interesting diversion this morning. My coffee kicked in, and it's time to head into the shower and go to work!

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: Perl reads minimum 8KB from files. Can this be lowered?
by harangzsolt33 (Deacon) on Apr 10, 2022 at 18:04 UTC
    I've read somewhere that DOS/Windows reads an entire cluster at a time. This size can vary anywhere between 4KB and 32KB. But this is hard-coded into the OS. So, there's not much you can do. If you're using an NTFS file system with 4KB clusters, then that means each read operation that is 4KB or less will grab 4KB. Even if you're just reading one byte from a file. If you want to read 5000 bytes, it will read 8 kilobytes. There's no way around it. In Perl, you can turn off buffering, and maybe that will help, but I'm no expert.

    The sysread() function will give you however many bytes you want to get, but that doesn't mean that at the lowest level you're forcing the OS to read smaller chunks. You can't.

      It appears that on Windows, regardless of the file system cluster size, you can fetch 512b with sysread. At least on my system this is the minimum it ends up being. Clusters really only matter to the file system for its allocation of clusters to a file, whereas the devices only care about blocks.
        Well, what I was saying is that you can fetch any number of bytes using sysread(). You can fetch just one byte or you can read the entire file with one call. But behind the scenes, Windows does a lot of buffering. So, instead of just reading 512 bytes of a file, it reads an entire cluster. That may be 4KB or 32KB...whatever the size of the cluster. Sysread() will give you 512 or 513 bytes, if that's what you requested, but the OS will read more, because that's how the system is designed.