cburger has asked for the wisdom of the Perl Monks concerning the following question:

I'm parsing an input file of human chromosome locations to retrieve start and end positions, then reading the corresponding chromosome score file to pull out the scores at those positions and calculate a mean score for the whole region. The process is terribly inefficient, even if I split the files; each chromosome file is already around 3 GB. Is there a better approach to this problem? Can I perhaps jump to the right locations in a speedier way? Help appreciated.

use Statistics::Descriptive;

open (F, $data) || die;
while (<F>) {
    chomp;
    if ($_ !~ /Header/g) {
        @line  = split("\t", $_);
        $table = $_;
        $chr   = $line[0];
        $start = $line[2];
        $end   = $line[3];
        # file name
        $name   = "score.txt";
        @array  = ();
        @sorted = ();
        $sum    = 0;   # reset the running sum for each region, otherwise the mean drifts
        #print "$chr\.$name\n";
        open (R, "$chr\.$name") || die ("cannot open chrom file\n");
        while (<R>) {
            chomp;
            # format : chrom ID, position, score
            # chr1    10925   0.239
            @q     = split("\t", $_);
            $pos   = $q[1];
            $score = $q[2];
            #print "$pos\n";
            if ($pos >= $start) {
                if ($pos <= $end) {
                    push (@array, $score);
                    $sum = $sum + $score;
                }
            }
        }
        close(R);
        # get mean, min and max
        $len    = $#array + 1;
        @sorted = sort {$a <=> $b} @array;
        $min    = $sorted[0];
        $max    = $sorted[$#sorted];
        $mean   = $sum / $len;
        $stat   = Statistics::Descriptive::Full->new();
        $stat->add_data(@sorted);
        $geo_mean = $stat->geometric_mean();
        #print "$table\t$geo_mean\t$min\t$max\t$len\n";
        print OUT "$table\t$mean\t$geo_mean\t$min\t$max\t$len\n";
    }
}
close(F);
close(OUT);

Re: speeding up parsing, jump to line
by dave_the_m (Monsignor) on Aug 15, 2014 at 09:28 UTC
    How many lines (approx) are there in the $data file? Are the chromosome files already sorted by pos or not? I presume that there are approx 23 chromosome files?

    Dave.

      There are 40,000 lines in the data file, and each chromosome file has about 50-200 million lines. At the moment the script takes ~20 min to print out one line.

Re: speeding up parsing, jump to line
by graff (Chancellor) on Aug 15, 2014 at 13:19 UTC
    The answers to dave_the_m's questions will determine what sort of solution to look for. E.g. if the $data file is relatively small, you can load it into memory, then read each *.score.txt file exactly once to load up relevant info for each item in $data, then do stats on the info - something like this:
    use strict;
    use warnings;

    my $data = "some/file.name";
    open( F, $data ) or die "$data: $!\n";
    my %targets;
    while (<F>) {
        next if ( /Header/ );
        chomp;
        my ( $chr, $start, $end ) = ( split( /\t/ ) )[0,2,3];
        push @{$targets{$chr}}, { table => $_, start => $start, end => $end };
    }
    for my $chr ( keys %targets ) {
        open( R, "$chr.score.txt" ) or die "$chr.score.txt: $!\n";
        while (<R>) {
            chomp;
            my ( $pos, $score ) = ( split( /\t/ ) )[1,2];
            for my $range ( @{$targets{$chr}} ) {
                if ( $pos >= $$range{start} and $pos <= $$range{end} ) {
                    push @{$$range{scores}}, $score;
                }
            }
        }
    }
    # do statistics on contents of %targets…
    If $data is too large, or requires too much data from the *.score.txt files to be held in memory for each $chr, then you may have to create intermediate output files for each $chr (or for each start-end range in $data), so that the stats can be run quickly and easily on those files.
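    To round the idea off, the statistics pass hinted at by the final comment could be a plain loop over %targets. Here is a rough sketch only: it assumes an already-opened OUT handle and the same mean/min/max/count columns as the original script, and leaves the geometric mean out for brevity.

    use List::Util qw( sum min max );

    for my $chr ( keys %targets ) {
        for my $range ( @{$targets{$chr}} ) {
            my $scores = $$range{scores} or next;   # skip ranges that matched no positions
            my $len    = @$scores;
            my $mean   = sum(@$scores) / $len;
            print OUT join( "\t", $$range{table}, $mean,
                            min(@$scores), max(@$scores), $len ), "\n";
        }
    }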
Re: speeding up parsing, jump to line
by Cristoforo (Curate) on Aug 18, 2014 at 19:45 UTC
    One way I can see to get a speedup is to drop Statistics::Descriptive (which you use only for the geometric mean) and roll your own.

    Here is a benchmark - it compares your method with one that doesn't use Statistics::Descriptive.

    Depending on the number of scores in the array, my method shows a speedup of roughly 12 times (with 1000 scores) up to 20 times (with 100 scores).

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Statistics::Descriptive;
    use Benchmark qw/ cmpthese /;
    use List::Util qw/ sum min max /;

    for my $len (100, 500, 1000) {
        my @scores = map {sprintf "%.3f", rand 29} 1 .. $len;
        print "\n\nNumber of scores: $len\n";

        my $results = cmpthese (-1, {
            stat => sub {
                my $stat = Statistics::Descriptive::Full->new();
                $stat->add_data(@scores);
                @scores = sort {$a <=> $b} @scores;
                my ($min, $max) = @scores[0, -1];
                my $len  = @scores;                # number of scores
                my $mean = sum(@scores) / @scores;
                my $geo_mean = $stat->geometric_mean();
            },
            util => sub {
                my $len  = @scores;                # number of scores
                my $min  = min @scores;
                my $max  = max @scores;
                my $mean = sum(@scores) / @scores;
                my $geo_mean = geo_mean(@scores);
            },
        },
        );
    }

    sub geo_mean {
        my $prod = 1;
        $prod *= $_ for @_;
        return $prod ** (1/ @_);
    }
    This benchmark produced the following output.
    Number of scores: 100
              Rate  stat  util
    stat    5265/s    --  -95%
    util  103936/s 1874%    --

    Number of scores: 500
             Rate  stat  util
    stat   1652/s    --  -93%
    util  24549/s 1386%    --

    Number of scores: 1000
             Rate  stat  util
    stat    890/s    --  -93%
    util  12274/s 1278%    --
    I believe further speedups could be achieved by putting your large 3GB score files into a database. The program might look like this. It assumes you have created a separate table for each of the score files.
    #!/usr/bin/perl
    use strict;
    use warnings;

    use List::Util qw/ sum min max /;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=junk.lite", "", "",
                           {PrintError => 1}) or die "Can't connect";

    my $data = "data.txt";
    open F, '<', $data or die "Unable to open '$data' for reading: $!";
    open OUT, '>', 'outfile.csv'
        or die "Unable to open 'outfile.csv' for writing: $!";

    while (my $table = <F>) {
        next if $table =~ /Header/;
        chomp $table;
        my ($chr, undef, $start, $end) = split /\t/, $table;

        my $sql = <<SQL;
    SELECT score
    FROM $chr
    WHERE pos >= ? AND pos <= ?
    SQL

        my $sth = $dbh->prepare($sql);
        $sth->execute($start, $end) or die $DBI::errstr;

        my $set    = $sth->fetchall_arrayref;
        my @scores = map $_->[0], @$set;
        next unless @scores;

        my $len  = @scores;                # number of scores
        my $min  = min @scores;
        my $max  = max @scores;
        my $mean = sum(@scores) / @scores;
        my $geo_mean = geo_mean(@scores);

        print OUT join("\t", $table, $mean, $geo_mean, $min, $max, $len), "\n";
    }

    close(F)   or die "Unable to close '$data': $!";
    close(OUT) or die "Unable to close 'outfile.csv': $!";

    sub geo_mean {
        my $prod = 1;
        $prod *= $_ for @_;
        return $prod ** (1/ @_);
    }
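    For reference, loading one score file into its own table could be done with a one-off script along these lines. This is only a sketch: the database name junk.lite and the pos/score column layout are assumptions chosen to match the query above, and the index on pos is what lets SQLite answer the range query without scanning the whole table.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $chr  = shift // 'chr1';              # table name, e.g. chr1
    my $file = "$chr.score.txt";             # chrom ID, position, score

    my $dbh = DBI->connect("dbi:SQLite:dbname=junk.lite", "", "",
                           {RaiseError => 1, AutoCommit => 0});

    $dbh->do("CREATE TABLE IF NOT EXISTS $chr (pos INTEGER, score REAL)");
    my $ins = $dbh->prepare("INSERT INTO $chr (pos, score) VALUES (?, ?)");

    open my $fh, '<', $file or die "Unable to open '$file': $!";
    while (<$fh>) {
        chomp;
        my (undef, $pos, $score) = split /\t/;   # chrom ID column is redundant here
        $ins->execute($pos, $score);
    }
    close $fh;

    # index pos so the range query in the main script doesn't do a full table scan
    $dbh->do("CREATE INDEX IF NOT EXISTS ${chr}_pos ON $chr (pos)");
    $dbh->commit;
    $dbh->disconnect;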
Re: speeding up parsing, jump to line
by Anonymous Monk on Aug 15, 2014 at 17:20 UTC

    If the score-file entries are distinguished by a unique position, you essentially have a (sparse) array $score[$pos]. You'll want to build a temporary database (mapping) for each of the score files.
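    One cheap way to get such a mapping, without a full database engine, is to convert each score file once into a flat binary file of packed doubles indexed by position; any start-end range can then be read with a single seek. A rough sketch follows - the chrN.score.bin naming and the 8-byte record size are assumptions, and it presumes a score of exactly 0 never occurs (holes in the sparse file read back as 0):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # one-time conversion: chrN.score.txt -> chrN.score.bin (8 bytes per position)
    sub build_index {
        my ($chr) = @_;
        open my $in,  '<',  "$chr.score.txt" or die "$chr.score.txt: $!";
        open my $out, '+>', "$chr.score.bin" or die "$chr.score.bin: $!";
        binmode $out;
        while (<$in>) {
            chomp;
            my (undef, $pos, $score) = split /\t/;
            seek $out, $pos * 8, 0;              # jump straight to this position's slot
            print {$out} pack('d', $score);
        }
        close $in;
        close $out;
    }

    # later: fetch all scores for a region with one seek and one read
    sub region_scores {
        my ($chr, $start, $end) = @_;
        open my $fh, '<', "$chr.score.bin" or die "$chr.score.bin: $!";
        binmode $fh;
        seek $fh, $start * 8, 0;
        read $fh, my $buf, ($end - $start + 1) * 8;
        close $fh;
        # positions with no entry unpack as 0; filter them out
        return grep { $_ != 0 } unpack 'd*', $buf;
    }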