baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

hi monks ,

why is this procedure so slow:

--first_program.pl--

use strict;
#use warnings;
open (F, "<", "test.out");
while(<F>){
    print "$_";
}

--second program--

use strict;
use warnings;
my $name = '';
my @id_list;
open (STDIN,"perl first_program.pl|") || die "$!";
open (OUT, ">", "out.txt") || die "$!";
# chomp std line in
while(my $line_in =<STDIN>){
    chomp($line_in);
    my @line_array = split('\t',$line_in);
    my @subline_array = split('\|', $line_array[1]);
    @id_list = () unless ($name eq $line_array[0]);
    $name = $line_array[0] unless ($name eq $line_array[0]);
    next if (grep{$_ == $subline_array[3]}@id_list);
    push(@id_list,$subline_array[3]);
    print OUT "$line_in\n";
}

so what i'm doing is starting the second program, which starts the first one, takes its stdout, and writes the parsed results to a file. a line that the first one reads looks like:
mmenr hh|gg|kk|3445|uu|zzz 234 wwe we qw 233
without parsing, the procedure takes only 3 min (but then i'm writing the lines of test directly into the file, not to stdout), and with parsing it takes something like 2h

is there a way to speed this up?

thnx

Replies are listed 'Best First'.
Re: slow parser how to make it faster
by shmem (Chancellor) on Jun 09, 2009 at 13:14 UTC

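    First, split takes a regular expression, not a string (except sometimes ;-). A single-quoted '\t' is still compiled as the pattern /\t/, so it happens to work here, but writing /\t/ makes the intent explicit. Next, I suspect that @id_list grows fairly long over time, and you are iterating over the entire array with grep for every input line. A hash is more suitable here, since a lookup does not depend on how many ids are already stored: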

    use strict;
    use warnings;
    my $name = '';
    my %id_list;
    open (STDIN,"perl first_program.pl|") || die "$!";
    open (OUT, ">", "out.txt") || die "$!";
    while(my $line_in =<STDIN>){
        chomp($line_in);
        my @line_array = split(/\t/,$line_in);
        my @subline_array = split(/\|/, $line_array[1]);
        unless ($name eq $line_array[0]) {
            %id_list = ();
            $name = $line_array[0];
        }
        next if $id_list{$subline_array[3]}++;
        print OUT "$line_in\n";
    }
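The same "seen" hash idiom can be tried without the pipe or any files. The following is only a minimal sketch: the sample lines and the helper name dedupe_lines are made up for illustration, and the ID is assumed to be the fourth pipe-separated item of the second tab-separated field, as in the question.

```perl
use strict;
use warnings;

# Hypothetical sample input in the layout shown in the question.
my @lines = (
    "mmenr\thh|gg|kk|3445|uu|zzz\t234",
    "mmenr\thh|gg|kk|3445|uu|zzz\t999",   # same name, same ID: dropped
    "mmenr\thh|gg|kk|7777|uu|zzz\t234",   # same name, new ID: kept
    "other\thh|gg|kk|3445|uu|zzz\t234",   # new name resets the hash: kept
);

# dedupe_lines is an illustrative helper, not part of the original programs.
sub dedupe_lines {
    my @in   = @_;
    my $name = '';
    my %seen;
    my @kept;
    for my $line (@in) {
        my ($first, $second) = split /\t/, $line;
        my $id = (split /\|/, $second)[3];
        if ($name ne $first) {      # a new name starts a fresh ID set
            %seen = ();
            $name = $first;
        }
        next if $seen{$id}++;       # true from the second sighting onwards
        push @kept, $line;
    }
    return @kept;
}

print "$_\n" for dedupe_lines(@lines);
```

The post-increment returns the old count, so the first occurrence of an ID sees a false value and is kept, while every later occurrence is skipped.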

    If you experience memory shortage, tie the hash %id_list to a disk file via e.g. DB_File.
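A rough sketch of what tying to DB_File could look like, assuming DB_File is installed with your perl. The file name id_list.db and the sample IDs are made up. A disk-backed hash is likely to be noticeably slower than an in-memory one, so this only makes sense if memory really is the limit.

```perl
use strict;
use warnings;
use Fcntl;      # O_RDWR, O_CREAT
use DB_File;    # provides $DB_HASH

my $db_path = "id_list.db";   # hypothetical scratch file

my %id_list;
tie %id_list, 'DB_File', $db_path, O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot tie $db_path: $!";

%id_list = ();                # clearing works on the tied hash too

# The same $hash{$id}++ test as in the loop above, on a few sample IDs.
my @kept = grep { !$id_list{$_}++ } qw(3445 3445 7777);
print "@kept\n";

untie %id_list;
unlink $db_path;              # remove the scratch file again
```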

      And perhaps a little faster with an RE instead of splits.

      use strict;
      use warnings;
      my $name = '';
      my %id_list;
      open (STDIN,"perl first_program.pl|") || die "$!";
      open (OUT, ">", "out.txt") || die "$!";
      while( my $line_in = <STDIN> ) {
          $line_in =~ m/^([^\t]*)(?:[^\|]*\|){3}([^\|]*)/;
          unless($name eq $1) {
              %id_list = ();
              $name = $1;
          }
          next if $id_list{$2}++;
          print OUT "$line_in";
      }
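To see what the pattern captures, here is a small sketch run against the sample line from the question (the tab positions are assumed to match the layout used in the benchmark string). It also checks whether the match succeeded, because $1 and $2 keep their values from the last successful match when a later line fails to match.

```perl
use strict;
use warnings;

my $line_in = "mmenr\thh|gg|kk|3445|uu|zzz\t234\twwe\twe\tqw\t233\n";

# In list context the match returns the captures, or an empty list on failure.
my ($name, $id) = $line_in =~ m/^([^\t]*)(?:[^|]*\|){3}([^|]*)/;

if (defined $name) {
    print "name=$name id=$id\n";    # name=mmenr id=3445
} else {
    warn "line did not match: $line_in";
}
```

Inside the loop, the equivalent guard would be to skip (or report) any line for which the match returns nothing before touching the captures.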

      Although the parsing of the line is quicker with the RE, the difference may be very small compared to the I/O time. The benefit will depend very much on how often the loop prints.

      use strict;
      use warnings;
      use Benchmark;

      my $line_in = "mmenr\thh|gg|kk|3445|uu|zzz\t234\twwe\twe\tqw\t233\n";

      Benchmark::cmpthese( 1000000, {
          'split' => sub {
              chomp($line_in);
              my @line_array = split(/\t/,$line_in);
              my @subline_array = split(/\|/, $line_array[1]);
          },
          're' => sub {
              my ($name, $id) = ($line_in =~ m/^([^\t]*)(?:[^\|]*\|){3}([^\|]*)/);
          }
      });

      __END__
                Rate split    re
      split  84962/s    --  -68%
      re    266667/s  214%    --