smferris has asked for the wisdom of the Perl Monks concerning the following question:
Me again. (Can you tell I only just really found this site? I knew it existed, but I hadn't really browsed it.)
I'm parsing a fixed width flat file for use in loading to different destinations. Possibly back to a file, possibly into a database.
I figured unpack would be faster and cleaner, and it is, as long as you don't assign the output of unpack to an array. E.g.:
open(FH,"large.file") or die $!;
while($row=<FH>){
    unpack("a9 a40 a15 a15 a15 a2 a9 a9 a9",$row);
}
Runs in about 20 seconds on 2.1 million rows. Modify the code as such (adding the array assignment):
open(FH,"large.file") or die $!;
while($row=<FH>){
    @data=unpack("a9 a40 a15 a15 a15 a2 a9 a9 a9",$row);
}
Now the code runs in over a minute. I have to "transform" the different elements in @data. Is there any way to make this faster? I'm assuming the memory structure for @data is being reallocated on every iteration. Is that true?
As always, all help is greatly appreciated.
Shawn M Ferris
Oracle DBA
Re: Assigning data to an array is slow..
by chromatic (Archbishop) on Mar 01, 2001 at 03:06 UTC
I don't understand the question.
The first snippet doesn't do anything useful. It unpacks things, then throws away the results. You might as well not open the file at all. A program like that will run in approximately zero seconds. Much faster!
As for the second snippet, yes, there's memory allocation for each iteration. That's because unpack creates and you assign new values to @data for each row. You can't get around that if you want to do something with the data. And, depending on your data structure, unpack is probably the fastest way to get at it.
Assuming your code handles exactly 2.1 million rows and the code runs in 70 seconds, that's 30,000 iterations per second. That's pretty fast.
This falls in the category of "things you can't optimize away without breaking the program" -- you're not using regular expressions to get at the data, which would slow you down, and you're not using split, which is probably slower than unpack in this case.
You're probably as fast as you can get without removing anything useful.
I understand that not assigning the data back to an array isn't useful. But it is still parsing the row, correct? My point was that to parse 2.1 million rows is fast. But storing it slows it considerably.
I think what's taking the time is the deletion and re-creation of the memory structure on each iteration of the loop. Unnecessary, in my mind, as the successive iterations (in this case) are always going to be of identical size.
Given the above, I was hoping for..
a) That the memory used by unpack itself could be reused, rather than having to copy it into a Perl structure.
or
b) That I could predefine the size of @data and not have it destroyed with each iteration.
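In case it helps to see option (b) in code: Perl does let you presize an array via $#data and assign into existing slots with an array slice, but the scalars unpack returns still get built on every call, so this is unlikely to change the overall picture much. A sketch, using a made-up 123-byte row matching the template from the original post:

```perl
use strict;
use warnings;

my @data;
$#data = 8;    # presize: @data now holds 9 slots (indices 0..8)

# Invented sample row: nine left-padded fields matching the a9/a40/a15/... widths
my $row = sprintf "%-9s%-40s%-15s%-15s%-15s%-2s%-9s%-9s%-9s", 1 .. 9;

# Assign into the existing slots rather than rebuilding the whole array
@data[0 .. 8] = unpack("a9 a40 a15 a15 a15 a2 a9 a9 a9", $row);

print scalar(@data), "\n";   # 9
```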
Of course.. I'm not a seasoned programmer, and maybe this entire thread is just a waste of everyone's time, in which case I apologize. 8)
I just think that if unpack has to put the results into its own array (it has to, or how else would it know what to send back?), assigning them to a Perl data type shouldn't take at least six times as long. Of course, I really don't know what Perl actually does behind the scenes to store data in memory.
Regards,
Shawn M Ferris
Oracle DBA
my @list = (1, 2, 3); # list context, @list -> (1, 2, 3)
my $num = @list; # scalar context, $num -> 3
my @second_list = @list; # list context, @second_list -> (1, 2, 3)
Perl the interpreter is smart enough not to do more work than it has to (in most cases), so it usually determines the context of an operation before performing the operation to weasel out of extra work or to produce the right results for the context. You can do the same if you use wantarray().
This is important because unpack performs differently in scalar and in list context. Its perldoc page says that in scalar context, it returns just the first value. In list context, it returns all values.
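A quick illustration of both points, wantarray() and unpack's context sensitivity; the sample row and template here are invented for the demo:

```perl
use strict;
use warnings;

# wantarray() reports the context a sub was called in
sub which_context {
    return wantarray ? "list" : "scalar";
}
my @l = which_context();   # ("list")
my $s = which_context();   # "scalar"

# unpack returns all fields in list context, but only the first in scalar context
my $row    = "AAAABBCC";
my @fields = unpack("a4 a2 a2", $row);   # ("AAAA", "BB", "CC")
my $first  = unpack("a4 a2 a2", $row);   # "AAAA"

print "$l[0] $s $first @fields\n";
```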
In your first code snippet, it's evaluated in scalar context (more properly void, but we'll keep this simple). Perl can tell that you don't care about the return values, so it only has to unpack the first bit of data. It ignores the rest. (Since it's in void context, it may *completely* ignore the *entire* string, but I haven't looked at the source.)
This means the first snippet isn't doing as much work as the second, even in the unpack statement itself. Put aside the array assignment for the moment -- besides that, the two snippets aren't doing an equal amount of work!
To find out how much work the unpack would do in list context, put it in list context:
while ($row = <FH>) {
() = unpack("a9 a40 a15 a15 a15 a2 a9 a9 a9",$row);
}
This will be a more meaningful benchmark.
Besides all that, Perl handles memory internally via a reference-like mechanism. None of this tedious copying-the-contents-of-one-location-to-another jive you get in C. So the overhead is creating an array structure and populating it with the things unpack returns anyway. It's a whole lot smarter about these things than C.
In short, don't worry about memory management in Perl for now.
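To put rough numbers on this, the three variants can be compared with the core Benchmark module on an in-memory row. This is only a sketch: the row contents are invented, and the exact ratios will vary by machine and Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $tmpl = "a9 a40 a15 a15 a15 a2 a9 a9 a9";
# Invented 123-byte fixed-width row matching the template
my $row  = sprintf "%-9s%-40s%-15s%-15s%-15s%-2s%-9s%-9s%-9s", 1 .. 9;

# Negative count: run each sub repeatedly for at least 2 CPU seconds
cmpthese(-2, {
    void_ctx => sub { unpack($tmpl, $row) },          # void: Perl may skip most of the work
    list_ctx => sub { () = unpack($tmpl, $row) },     # list: all fields extracted, then discarded
    to_array => sub { my @d = unpack($tmpl, $row) },  # list plus array assignment
});
```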
Re (tilly) 1: Assigning data to an array is slow..
by tilly (Archbishop) on Mar 01, 2001 at 03:23 UTC
The reason for the slow-down is that the first time you are calling unpack in scalar (well really void) context so it is only extracting one field, while the second time you are extracting all of the fields. So Perl is doing a lot more work.
Now, I have seen a couple of signs that unpack might not be as fast as it could be. But I would need to look at it closely to figure out why. (Or even whether that is true.)
In any case, I wouldn't worry about it. This isn't running interactively, is it? If not, then wait until you are done and see if it is fast enough...
(boo) Re: Assigning data to an array is slow..
by boo_radley (Parson) on Mar 01, 2001 at 03:12 UTC
Runs in about 20 seconds on 2.1 million rows
then
Now the code runs in over a minute
You're annoyed at this speed on 2.1million rows? Really? Are you sure?
As for serious advice: maybe you could store your data in a database and access it through DBI? Then you could run your transforms through an INSERT or UPDATE statement.
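A sketch of that approach with DBI. The table name, column names, and the use of an in-memory SQLite database (DBD::SQLite) are all assumptions for illustration; a real load would connect to the target database instead.

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database, purely for demonstration
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do("CREATE TABLE staging (id TEXT, name TEXT, f1 TEXT)");

# Prepare once, execute per row: the placeholders take the unpacked fields directly
my $sth = $dbh->prepare("INSERT INTO staging (id, name, f1) VALUES (?, ?, ?)");

# One invented 64-byte fixed-width row (9 + 40 + 15 characters)
my @rows = ("000000001" . sprintf("%-40s", "example") . sprintf("%-15s", "field"));
for my $row (@rows) {
    $sth->execute(unpack("a9 a40 a15", $row));
}
$dbh->commit;

# Transforms can then run as plain SQL, e.g.:
# $dbh->do("UPDATE staging SET name = trim(name)");
my ($n) = $dbh->selectrow_array("SELECT count(*) FROM staging");
print "$n row(s) loaded\n";
```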
Re: Assigning data to an array is slow..
by Albannach (Monsignor) on Mar 01, 2001 at 03:26 UTC
Re: Assigning data to an array is slow..
by rbi (Monk) on Mar 01, 2001 at 16:45 UTC
Hi,
a few days ago I looked into the time taken by unpack compared to other ways of extracting fields from records.
I think that using substr can speed things up.
After your posting, I took the occasion to learn how to use Benchmark and tested the code below on a 400,000-record file:
#!/usr/bin/perl -w
use Benchmark;

$filename = $ARGV[0];
$count    = 1;

timethese(
    $count,
    {
        'Method One'   => \&One,
        'Method Two'   => \&Two,
        'Method Three' => \&Three,
    }
);

sub One {
    open(FILE, $filename) or die $!;
    while ($row = <FILE>) {
        @data = unpack('a4a2a2a2a2', $row);
    }
    close(FILE);
}

sub Two {
    open(FILE, $filename) or die $!;
    while ($row = <FILE>) {
        ($data[0], $data[1], $data[2], $data[3], $data[4]) =
            unpack('a4a2a2a2a2', $row);
    }
    close(FILE);
}

sub Three {
    open(FILE, $filename) or die $!;
    while ($row = <FILE>) {
        $data[0] = substr($row, 0,  4);
        $data[1] = substr($row, 4,  2);
        $data[2] = substr($row, 6,  2);
        $data[3] = substr($row, 8,  2);
        $data[4] = substr($row, 10, 2);
    }
    close(FILE);
}
and I got this.
Method One: 43 wallclock secs (40.72 usr + 1.23 sys = 41.95 CPU) @ 0.02/s (n=1)
        (warning: too few iterations for a reliable count)
Method Two: 43 wallclock secs (41.50 usr + 1.42 sys = 42.92 CPU) @ 0.02/s (n=1)
        (warning: too few iterations for a reliable count)
Method Three: 36 wallclock secs (33.76 usr + 1.42 sys = 35.18 CPU) @ 0.03/s (n=1)
        (warning: too few iterations for a reliable count)
Again, I think the sub Three approach (substr) proves faster than sub One (unpack into an array) or sub Two (unpack into individual array elements).
Hope this may help.
ciao,
Roberto
|
Hi davorg,
sure, that warning isn't very nice... :) However, I also saw a similar difference (about 15%) by running the routines separately and checking with the ps (process status) command.
As I said, this was my first time using Benchmark (there's always a first time...). However, I don't think the warning is a matter of input file size.
For my own learning, I'd appreciate it if someone could change the code into something that can be benchmarked reliably (if it is not a problem of input file size).
ciao
Roberto
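One way to get a reliable count is to benchmark against in-memory rows with a negative iteration count, so that Benchmark keeps iterating for a minimum amount of CPU time instead of running each sub only once. A sketch along those lines (the record contents are invented; timings will vary by machine):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# 1000 copies of an invented 12-byte record matching the a4a2a2a2a2 template
my @rows = ("200103010306") x 1000;

# Negative count: each sub is run repeatedly for at least 3 CPU seconds
cmpthese(-3, {
    'unpack' => sub {
        for my $r (@rows) {
            my @data = unpack('a4a2a2a2a2', $r);
        }
    },
    'substr' => sub {
        for my $r (@rows) {
            my @data = (substr($r, 0, 4), substr($r, 4, 2),
                        substr($r, 6, 2), substr($r, 8, 2),
                        substr($r, 10, 2));
        }
    },
});
```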
Re: Assigning data to an array is slow..
by jeroenes (Priest) on Mar 01, 2001 at 12:30 UTC
You should also make sure that the data actually fit in memory, or at least in physical memory.
perl -e '@a=(1)x2E6;sleep();' takes 48 MB on perl 5.6, and I recall it was nearly twice as much in perl 5.0. And that's without any real data.
So the actual allocation of the data takes time as well.
Jeroen
"We are not alone"(FZ) | [reply] [Watch: Dir/Any] [d/l] |