PerlMonks  

Re^2: Perl reads minimum 8KB from files. Can this be lowered?

by sectokia (Pilgrim)
on Apr 08, 2022 at 07:24 UTC (#11142821)


in reply to Re: Perl reads minimum 8KB from files. Can this be lowered?
in thread Perl reads minimum 8KB from files. Can this be lowered?

It seems like you get faster random I/O when the read buffer is smaller, but faster sequential I/O when it's larger.

Replies are listed 'Best First'.
Re^3: Perl reads minimum 8KB from files. Can this be lowered?
by roboticus (Chancellor) on Apr 08, 2022 at 12:55 UTC

    sectokia:

    When reading this thread, I was curious. If perl uses the same block size as the record size in the OS, I'd expect reading short records randomly might be a little faster than reading the default size blocks (due to copying less data into the perl variables). If perl uses a larger block size than the OS, I'd expect that reading smaller block sizes would be significantly faster. So I whipped up a little test case:

        #!/usr/bin/env perl
        #
        # pm11142815.pl <FName> <recSize>
        #
        # How fast is random I/O with big (default size) vs small buffers?  To find out,
        # we'll first read N blocks (randomly) from the first half of the big file using
        # the default block size.  Then we'll set the block size to the record size and
        # read N blocks from the second half of the big file.
        #
        # 20220408

        use strict;
        use warnings;

        use Time::HiRes qw( gettimeofday tv_interval );

        my $FName = shift or die "Missing FName and record size!";
        my $recSize = shift or die "Missing record size!";
        $recSize = $recSize + 0;

        open my $FH, '<', $FName or die "Can't open $FName: $!\n";
        binmode $FH;

        my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,
            $atime,$mtime,$ctime,$blksize,$blocks) = stat($FName);
        my $half_file_size = int($size/2);

        print "File size: $size, record size=$recSize\n";
        my $N = 10_000;
        read_N_blocks($N, $FH, 0, $half_file_size);

        print "Setting record size to $recSize bytes\n";
        $/=\$recSize;
        read_N_blocks($N, $FH, $half_file_size, $size);

        sub read_N_blocks {
            my ($N, $FH, $start, $end) = @_;
            print "Read $N records from file between $start and $end\n";
            my $dS = $end - $start - $recSize;
            my $t = [gettimeofday];
            for my $i (1 .. $N) {
                my $offset = int($dS * rand) + $start;
                seek($FH, $offset, 0);
                my $record = <$FH>;
            }
            my $dT = tv_interval($t, [gettimeofday]);
            print "\ttook ${dT}s\n\n";
        }

    When I ran it, I saw this:

        $ perl ~/pm11142815.pl /Work/OmniGlyph/big_tmp_file 32
        File size: 12854692173, record size=32
        Read 10000 records from file between 0 and 6427346086
                took 52.865314s

        Setting record size to 32 bytes
        Read 10000 records from file between 6427346086 and 12854692173
                took 6.5254s

    Reading small blocks is much faster than reading default-sized blocks. But at this point, I've got to go to work and stop playing with this. I haven't done any research to find out what the default block size is on the OS (Windows 8.1 64b), or used by my perl (v5.30.3 built for x86_64-cygwin-threads-multi). So I don't know if the speed difference is due to block size differences between perl and the OS or perl carving the block up into smaller chunks or whatever. I'll leave that as an exercise for interested parties. (Of course, there's always the small chance that I'll revisit this tomorrow.)

    The differences are probably due to one or more of the following things I can think of (and probably others that aren't immediately coming to mind):

    • Perl and the OS may have different default buffer sizes. This should be a simple matter to determine.
    • Because no record size is set in the first case, Perl may be spending more time than I'd expect looking for a newline in the buffer.
    • The time to create the record variable may be more significant than I imagined, so creating the smaller records saved a lot of time.
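One way to start separating these explanations is to time the same random offsets through different read paths and record how many bytes each call actually returns. The following is only a sketch under stated assumptions: the scratch file (1 MiB, a newline every 1024 bytes) and the helper names `make_test_file` and `time_reads` are hypothetical, and a file this small will sit in the OS cache, so it will not reproduce the timings shown above. What it can show is whether a default-`$/` readline hands back far more data per call than a fixed-size read does.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use Time::HiRes qw( gettimeofday tv_interval );

# Hypothetical scratch file: 1 MiB, with a newline every 1024 bytes.
sub make_test_file {
    my ($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;
    my $line = ('x' x 1023) . "\n";
    print {$fh} $line x 1024;
    close $fh or die "Can't close $name: $!\n";
    return $name;
}

# Four ways to fetch a record at $offset; each returns the data it read.
my %readers = (
    'readline (newline)' => sub {
        my ($fh, $offset, $len) = @_;
        seek($fh, $offset, 0) or die "seek: $!\n";
        local $/ = "\n";        # default behaviour: read up to the next newline
        return scalar <$fh>;
    },
    'readline (fixed)' => sub {
        my ($fh, $offset, $len) = @_;
        seek($fh, $offset, 0) or die "seek: $!\n";
        local $/ = \$len;       # fixed-size records, as in the script above
        return scalar <$fh>;
    },
    'read' => sub {
        my ($fh, $offset, $len) = @_;
        seek($fh, $offset, 0) or die "seek: $!\n";
        read($fh, my $buf, $len) // die "read: $!\n";
        return $buf;
    },
    'sysread' => sub {
        my ($fh, $offset, $len) = @_;
        # sysseek/sysread bypass Perl's buffering, so they are not mixed with seek/read.
        sysseek($fh, $offset, 0) // die "sysseek: $!\n";
        sysread($fh, my $buf, $len) // die "sysread: $!\n";
        return $buf;
    },
);

# Run $n random reads through $reader; return (elapsed seconds, average bytes per read).
sub time_reads {
    my ($file, $reader, $n, $len) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!\n";
    binmode $fh;
    my $span  = (-s $fh) - $len;
    my $bytes = 0;
    my $t0    = [gettimeofday];
    for (1 .. $n) {
        my $record = $reader->($fh, int(rand($span)), $len);
        $bytes += length $record;
    }
    my $elapsed = tv_interval($t0);
    close $fh;
    return ($elapsed, $bytes / $n);
}

my $file = make_test_file();
for my $name (sort keys %readers) {
    my ($secs, $avg) = time_reads($file, $readers{$name}, 10_000, 32);
    printf "%-20s %.4fs, %.1f bytes per read\n", $name, $secs, $avg;
}
```

If the newline-delimited case returns much larger records than the other three, the newline search and record creation are plausible contributors. If `sysread` is clearly faster than `read` on a large uncached file, the buffer fill is the more likely one.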

    It might be interesting to try different record sizes and see if there are any inflection points in the timings to get a few more clues, and to help narrow down the reasons.
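A sweep over record sizes could look roughly like the sketch below. It is a minimal illustration, not a tuned benchmark: `make_scratch_file` and `time_random_reads` are hypothetical helpers, the powers-of-two range is an arbitrary choice, and the 4 MiB scratch file is only a fallback so the script runs on its own. For meaningful numbers, pass the name of a file much larger than the OS cache as the first argument.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use Time::HiRes qw( gettimeofday tv_interval );

# Hypothetical stand-in for a real data file: 4 MiB of filler.
sub make_scratch_file {
    my ($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} 'x' x (4 * 1024 * 1024);
    close $fh or die "Can't close $name: $!\n";
    return $name;
}

# Time $n random fixed-size reads of $rec_size bytes each; returns elapsed seconds.
sub time_random_reads {
    my ($file, $rec_size, $n) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!\n";
    binmode $fh;
    my $span = (-s $fh) - $rec_size;
    die "File too small for $rec_size-byte records\n" if $span <= 0;
    local $/ = \$rec_size;      # readline now returns fixed-size records
    my $t0 = [gettimeofday];
    for (1 .. $n) {
        seek($fh, int(rand($span)), 0) or die "seek: $!\n";
        my $record = <$fh>;
        die "Short read\n"
            unless defined $record && length($record) == $rec_size;
    }
    my $elapsed = tv_interval($t0);
    close $fh;
    return $elapsed;
}

my $file = shift(@ARGV) // make_scratch_file();
my $n    = 10_000;
for my $rec_size (map { 1 << $_ } 5 .. 16) {    # 32 bytes .. 64 KiB
    my $secs = time_random_reads($file, $rec_size, $n);
    printf "%6d bytes: %.4fs total, %.2f us per read\n",
        $rec_size, $secs, 1e6 * $secs / $n;
}
```

A jump in per-read time between two adjacent sizes would be the kind of inflection point worth investigating, for instance around the 8 KB figure in the thread title.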

    Anyway, thanks for an interesting diversion this morning. My coffee kicked in, and it's time to head into the shower and go to work!

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.
