Hello, Perl Monks - I need to read a large amount of data (~100 MB) from a file, process it a bit, and store it in an array of references to hashes. This is what I am doing at the moment (I'm a bit of a Perl newbie, so I doubtless have many inefficiencies here; suggestions appreciated):

my @cases=();
my @FILE=<DATA>;
close(DATA);

my $num_lines=scalar(@FILE);

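$#cases=$num_lines; #pre-extend array (note: the push below appends after these slots rather than filling them)

my $dot=0; #line counter, used for the progress dots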

foreach my $line (@FILE) {
   if (($dot % 1000) == 0) {
      print STDERR ".";
   }

   next unless $line=~/^(\S*) [0-9.]* (.*)$/; #skip lines that do not match
   my ($class, $feature_vector) = ($1, $2);
   my %case;
    
   $case{'class'}=$class;

   foreach my $feature (split /\s+/, $feature_vector) {
      $case{'fv'}{$feature}=1;
   }

   push @cases, \%case;
   $dot++;
}

This is very fast for the first ~20,000 lines (out of a total of ~300,000), then suddenly slows down dramatically. Lack of memory is not the problem: at the point it slows down I still have upwards of 700 MB free. At first I thought that processing each line into the case hash, with its attendant splits and regular expressions, was the problem, but if I alter the above code to:

my @FILE=<DATA>;
close(DATA);

my $fred;
my $dot=0; #line counter, used for the progress dots

foreach my $line (@FILE) {
   if (($dot % 1000) == 0) {
      print STDERR ".";
   }

   next unless $line=~/^(\S*) [0-9.]* (.*)$/; #skip lines that do not match
   my ($class, $feature_vector) = ($1, $2);
   my %case;
    
   $case{'class'}=$class;

   foreach my $feature (split /\s+/, $feature_vector) {
      $case{'fv'}{$feature}=1;
   }

   $fred= \%case;
   $dot++;
}

then the entire file is processed on the order of 100 times more quickly. I've tried using something like $cases[$dot]=\%case, or even making cases a hash indexed by case number, but both approaches exhibit a similar slow-down. Any ideas on why this slow-down occurs?

(Perl version 5.6.1, running under Windows XP on a system with 1 GB RAM.)

Thanks, Ryan Gabbard
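
For reference, the indexed-assignment variant mentioned above can be sketched as follows. This is only a minimal sketch under stated assumptions: the helper name read_cases and the sample input lines are hypothetical, the file is read line by line instead of being slurped into @FILE, and the demo uses an in-memory filehandle, which needs Perl 5.8 or later. Whether it avoids the slow-down on 5.6.1 under Windows has not been tested here.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): parse one case per line
# from an open filehandle and return a reference to the array of cases.
sub read_cases {
    my ($fh) = @_;
    my @cases;
    my $count = 0;

    while (my $line = <$fh>) {
        chomp $line;

        # Same pattern as in the post: class, a numeric field, then features.
        next unless $line =~ /^(\S*) [0-9.]* (.*)$/;
        my ($class, $feature_vector) = ($1, $2);

        # One hash key per feature, each mapped to 1.
        my %fv = map { $_ => 1 } split /\s+/, $feature_vector;

        # Store by index, so neither pre-extension nor push is involved.
        $cases[$count++] = { class => $class, fv => \%fv };

        # Progress dot every 1000 stored cases.
        print STDERR "." if $count % 1000 == 0;
    }
    return \@cases;
}

# Demo on an in-memory filehandle; the sample lines are made up.
my $sample = "spam 0.5 f1 f2 f3\nham 1.0 f2 f4\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $cases = read_cases($fh);
close $fh;

printf "%d cases; first class: %s\n", scalar @$cases, $cases->[0]{class};
```

Reading with while (<$fh>) keeps only one input line in memory at a time, so the case structures are the only large allocation left. That makes it easier to tell whether the slow-down comes from building the structures or from holding the whole file alongside them.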

In reply to Slowness when inserting into pre-extended array by ryangabbard
