PerlMonks  

Why re-reading DATA is slow

by mkmcconn (Chaplain)
on Nov 04, 2001 at 02:45 UTC ( [id://123103]=perlquestion )

mkmcconn has asked for the wisdom of the Perl Monks concerning the following question:

I had the idea that re-reading DATA after it's been exhausted would be faster than dumping it into an array for re-use. I'll post my test after the READMORE tag, in case I've made some blunder that stands out, or for your convenience.

But my question is more general: why did re-reading DATA prove to be so much slower (by more than a factor of ten in the results below) than re-iterating over the array? I thought that the "cursor" was a procedure that points to a memory address. Am I thinking of it incorrectly? Is the performance penalty restricted to this special handle, does it apply to all handles, or is it associated with tell() and seek()?

Thanks in advance for the wonderful help this place consistently provides in learning to think better in Perl.
mkmcconn
updated tests

#!/usr/bin/perl -w
use strict;
use Benchmark;

# uncomment to print sample output for each function
# my $ap = 1;
#for my $ret (
#    rewhile_data(),
#    refor_data(),
#    reinfor_data(),
#    read_array()
#    ){
#    for (my $id = 1;$id < 8; $id++){
#        show_out($ret->{"$ap.$id"});
#    }
#    $ap++;
#}

# benchmark tests
timethese(10000,{
    'REWHILE' => \&rewhile_data,
    'REFOR'   => \&refor_data,
    'READAR'  => \&read_array,
    'REINFOR' => \&reinfor_data,
});

# functions
sub rewhile_data {
    my $cursor = tell DATA;
    my %ahash;
    for my $i (1..100){
        while (my $j = <DATA>){
            next if $j =~ m/^\s*$/;
            my ($num,$fn,$ln) = $j =~ m/(\w+)/g;
            $ahash{"$i.$num"} = [ "$i.$num",$fn,$ln];
        }
        seek (DATA, $cursor, 0);
    }
    return \%ahash;
}

sub refor_data {
    my $cursor = tell DATA ;
    my %bhash;
    for my $i (1..100){
        for my $j (<DATA>){
            next if $j =~ m/^\s*$/;
            my ($num,$fn,$ln) = $j =~ m/(\w+)/g;
            $bhash{"$i.$num"} = [ "$i.$num",$fn,$ln];
        }
        seek (DATA, $cursor, 0);
    }
    return \%bhash;
}

sub reinfor_data {
    my $cursor ;
    $cursor = tell DATA ;
    my %chash;
    for my $i (1..100){
        for ( ;my $j = <DATA>; ){
            next if $j =~ m/^\s*$/;
            my ($num,$fn,$ln) = $j =~ m/(\w+)/g;
            $chash{"$i.$num"} = [ "$i.$num",$fn,$ln];
        }
        seek (DATA, $cursor, 0)
    }
    return \%chash;
}

sub read_array {
    my @data_array = <DATA>;
    my %dhash;
    for my $i (1..100){
        foreach my $j (@data_array){
            next if $j =~ m/^\s*$/;
            my ($num,$fn,$ln) = $j =~ m/(\w+)/g;
            $dhash{"$i.$num"} = ["$i.$num",$fn,$ln];
        }
    }
    return \%dhash;
}

sub show_out {
    my $ref_ = shift;
    print "$ref_->[0]:\t$ref_->[1]\t$ref_->[2]\n";
}

__DATA__
1 First _____________
2 Last _____________
3 Street _____________
4 Apt _____________
5 City _____________
6 State _____________
7 _______________________________________ _

Results:

Benchmark: timing 10000 iterations of READAR, REFOR, REINFOR, REWHILE...
    READAR:  1 wallclock secs ( 2.80 usr +  0.00 sys =  2.80 CPU) @ 3571.43/s (n=10000)
     REFOR: 43 wallclock secs (42.95 usr +  0.00 sys = 42.95 CPU) @ 232.83/s (n=10000)
   REINFOR: 35 wallclock secs (34.33 usr +  0.00 sys = 34.33 CPU) @ 291.29/s (n=10000)
   REWHILE: 35 wallclock secs (34.77 usr +  0.00 sys = 34.77 CPU) @ 287.60/s (n=10000)
mkmcconn
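
One thing worth checking in a test like the one above: read_array slurps DATA without seeking back first, so only its first call can see any lines. Later calls start at end-of-file unless something rewinds the handle. The sketch below is a minimal illustration of that effect, not the original test. The names slurp_once, slurp_rewound and $data_start are hypothetical, and the DATA section is shortened.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Record where the DATA section starts, once, before anything reads from it.
my $data_start = tell DATA;

# Slurp without rewinding: only the first call finds lines to read.
sub slurp_once {
    my @lines = <DATA>;
    return scalar @lines;
}

# Rewind to the recorded offset first, so every call sees the same lines.
sub slurp_rewound {
    seek(DATA, $data_start, 0) or die "seek failed: $!";
    my @lines = <DATA>;
    return scalar @lines;
}

my @naive   = map { slurp_once() }    1 .. 3;
my @rewound = map { slurp_rewound() } 1 .. 3;

print "without seek: @naive\n";
print "with seek:    @rewound\n";

__DATA__
1 First _____________
2 Last _____________
3 Street _____________
```

Run from a file, this should report 3 0 0 for the calls that do not rewind and 3 3 3 for the calls that do. A benchmark that compares the two approaches probably wants every timed call to start from the same offset.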

Replies are listed 'Best First'.
Re: Why re-reading DATA is slow
by clintp (Curate) on Nov 04, 2001 at 05:14 UTC
    Regardless of how pointers, buffering and whatnot are handled for DATA, there are two obvious considerations here:
    • When iterating over an array, perl is simply walking down a list of scalar structures that are already assembled and ready to be used. When reading the filehandle, each line has to be taken from the input buffers and put into an SV before it can be dealt with. That's one chunk of overhead.
    • The other chunk of overhead is that, to get to the data in DATA, perl has to pass all of it through its I/O code: reading, keeping track of filehandle meta-information, manipulating buffers. It's cheap (with buffering), but it's not that cheap.
    Others feel free to pile on.
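
The two costs described above can be measured with a rough sketch like the following. It is an illustration under stated assumptions, not the original test: a small temporary file stands in for the DATA section, and reread_handle, reuse_array and the field names are hypothetical. Each timed call does the same parsing work from the same starting point, so the difference should mostly reflect the readline and seek path versus walking an array that is already in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Hypothetical stand-in for the DATA section: seven short lines in a temp file.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "$_ field$_ value$_\n" for 1 .. 7;
close $out or die "close failed: $!";

open my $in, '<', $path or die "open failed: $!";
my @lines = <$in>;    # one trip through the I/O layer, reused afterwards

# Re-read the handle on every pass, rewinding before each one.
sub reread_handle {
    my ($passes) = @_;
    my $count = 0;
    for (1 .. $passes) {
        seek($in, 0, 0) or die "seek failed: $!";
        while (my $line = <$in>) {
            my ($num, $first, $second) = $line =~ /(\w+)/g;
            $count++;
        }
    }
    return $count;
}

# Walk the array that was filled once, doing the same parsing per line.
sub reuse_array {
    my ($passes) = @_;
    my $count = 0;
    for (1 .. $passes) {
        foreach my $line (@lines) {
            my ($num, $first, $second) = $line =~ /(\w+)/g;
            $count++;
        }
    }
    return $count;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    handle => sub { reread_handle(100) },
    array  => sub { reuse_array(100) },
});
```

The absolute rates will depend on the machine and the perl build, so treat the ratio between the two rows as the interesting number.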
(bbfu) (DATA is file IO) Re: Why re-reading DATA is slow
by bbfu (Curate) on Nov 04, 2001 at 04:34 UTC

    I'm not 100% sure on this one, but I'm pretty sure that the DATA handle is not actually buffered in memory; the handle is just automagically already opened and pointing at the correct place in the source file. Otherwise, a large DATA section would require a lot of memory up front for your program.

    bbfu
    Seasons don't fear The Reaper.
    Nor do the wind, the sun, and the rain.
    We can be like they are.
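
A quick way to check the claim that DATA is an ordinary handle on the source file is to look at where it starts and then seek back to offset 0. The sketch below assumes the script is run from a file on disk (not from -e or standard input). The variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Offset of the first byte after the __DATA__ line, within this very file.
my $data_offset = tell DATA;
my $first_data  = <DATA>;

# Rewind to the start of the handle and read whatever is there.
seek(DATA, 0, 0) or die "seek failed: $!";
my $first_line = <DATA>;

print "DATA begins at byte $data_offset of $0\n";
print "first DATA line:  $first_data";
print "line at offset 0: $first_line";

__DATA__
payload line
```

If DATA is a handle on the script itself, the offset is greater than zero and the line read at offset 0 is the script's own first line (the #! line here), not the payload.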
