slow parser how to make it faster

baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

Why is this procedure so slow?


--first_program.pl--


use strict;
#use warnings;

open (F, "<", "test.out");

while(<F>){
  print "$_";
}


-- second program--

use strict;
use warnings;


my $name = '';
my @id_list;
open (STDIN,"perl first_program.pl|") || die "$!";
open (OUT, ">", "out.txt") || die "$!";
# chomp std line in
while(my $line_in =<STDIN>){
chomp($line_in);
  my @line_array = split('\t',$line_in);
  my @subline_array = split('\|', $line_array[1]);
  @id_list = () unless ($name eq $line_array[0]); 
  $name = $line_array[0] unless ($name eq $line_array[0]);  
  next if (grep{$_ == $subline_array[3]}@id_list);
  push(@id_list,$subline_array[3]);        
  print OUT "$line_in\n";
}

So what I'm doing is starting the second program, which starts the first one, takes its stdout, and writes the parsed results to a file. Each tab-separated line that the first one reads looks like:

mmenr hh|gg|kk|3445|uu|zzz 234 wwe we qw 233

Without parsing, the procedure takes only 3 min (but then I'm writing the lines of test.out directly into the file, not to stdout), and with parsing it takes something like 2 h.

is there a way to speed this up?

thnx

Re: slow parser how to make it faster by shmem (Chancellor) on Jun 09, 2009 at 13:14 UTC
First, split takes a regular expression, not a string (except sometimes ;-). Splitting on `'\t'` and `'\|'` happens to work, because the string is compiled as a pattern anyway, but writing `/\t/` and `/\|/` makes that explicit. Next, I suspect that `@id_list` grows fairly long over time, and you are iterating over the entire array with grep for every input line. A hash is more suitable here:

use strict;
use warnings;

my $name = '';
my %id_list;
open (STDIN,"perl first_program.pl|") || die "$!";
open (OUT, ">", "out.txt") || die "$!";
while(my $line_in =<STDIN>){
  chomp($line_in);
  my @line_array = split(/\t/,$line_in);
  my @subline_array = split(/\|/, $line_array[1]);
  unless ($name eq $line_array[0]) {
    %id_list = ();
    $name = $line_array[0];
  }
  next if $id_list{$subline_array[3]}++;
  print OUT "$line_in\n";
}

If you experience memory shortage, tie the hash `%id_list` to a disk file via e.g. DB_File.
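A minimal, self-contained sketch of the hash-based filter from the reply above, so it can be tried without first_program.pl. The helper name `dedupe_by_id` and the sample lines are made up for illustration; it assumes, as in the thread, that lines with the same name are adjacent and that the id is the fourth `|`-separated subfield of the second tab-separated field.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): keep only the first line seen
# for each id within a run of lines sharing the same name.
sub dedupe_by_id {
    my @lines = @_;
    my $name  = '';
    my %seen;                          # ids already kept for the current name
    my @kept;
    for my $line (@lines) {
        my @fields = split /\t/, $line;
        my @sub    = split /\|/, $fields[1];
        if ($name ne $fields[0]) {     # new name: forget the previous ids
            %seen = ();
            $name = $fields[0];
        }
        next if $seen{ $sub[3] }++;    # hash lookup instead of grep over a list
        push @kept, $line;
    }
    return @kept;
}

# Made-up sample data for illustration only.
my @sample = (
    "mmenr\thh|gg|kk|3445|uu|zzz\t234",
    "mmenr\taa|bb|cc|3445|uu|zzz\t111",   # same name and id: dropped
    "mmenr\taa|bb|cc|9999|uu|zzz\t222",
    "other\taa|bb|cc|3445|uu|zzz\t333",   # new name: id 3445 is kept again
);
print "$_\n" for dedupe_by_id(@sample);
```

The grep version scans every id collected so far for each input line, so its cost grows with the number of distinct ids per name; the hash lookup does not scan the ids already seen.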
Re^2: slow parser how to make it faster by ig (Vicar) on Jun 11, 2009 at 08:39 UTC
And perhaps a little faster with an RE instead of splits.

use strict;
use warnings;

my $name = '';
my %id_list;
open (STDIN,"perl first_program.pl|") || die "$!";
open (OUT, ">", "out.txt") || die "$!";
while( my $line_in = <STDIN> ) {
  $line_in =~ m/^([^\t]*)(?:[^\|]*\|){3}([^\|]*)/;
  unless($name eq $1) {
    %id_list = ();
    $name = $1;
  }
  next if $id_list{$2}++;
  print OUT "$line_in";
}

Although the parsing of the line is quicker with the RE, the difference may be very small compared to the I/O time. The benefit will depend very much on how often the loop prints.

use strict;
use warnings;
use Benchmark;

my $line_in = "mmenr\thh|gg|kk|3445|uu|zzz\t234\twwe\twe\tqw\t233\n";

Benchmark::cmpthese( 1000000, {
  'split' => sub {
    chomp($line_in);
    my @line_array = split(/\t/,$line_in);
    my @subline_array = split(/\|/, $line_array[1]);
  },
  're' => sub {
    my ($name, $id) = ($line_in =~ m/^([^\t]*)(?:[^\|]*\|){3}([^\|]+)/);
  }
});

__END__
          Rate split    re
split  84962/s    --  -68%
re    266667/s  214%    --
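To check what the single-regex approach above extracts, here is a small sketch that applies it to the sample line from the benchmark. It assumes the pattern is meant to carry `*` and `+` quantifiers (without them it cannot match the sample line); the helper name `name_and_id` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: in list context, returns (name, id) for a line,
# or an empty list if the line does not match.
sub name_and_id {
    my ($line) = @_;
    # $1: everything before the first tab.
    # Then skip three '|'-terminated subfields (the first skip also
    # consumes the tab), and capture the fourth subfield as $2.
    return $line =~ m/^([^\t]*)(?:[^|]*\|){3}([^|]+)/;
}

my $sample = "mmenr\thh|gg|kk|3445|uu|zzz\t234\twwe\twe\tqw\t233\n";
my ($name, $id) = name_and_id($sample);
print "name=$name id=$id\n";
```

For the sample line this prints `name=mmenr id=3445`. Returning an empty list on failure lets the caller skip malformed lines with something like `my ($name, $id) = name_and_id($line) or next;` rather than reusing stale `$1` and `$2` values from an earlier match.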