cburger has asked for the wisdom of the Perl Monks concerning the following question:

I'm parsing an input file of human chromosome locations to retrieve start and end positions, then reading the corresponding chromosome score file to pull out the scores at those positions and calculate a mean score for the whole region. The process is terribly inefficient, even if I split the files; each chromosome file is already around 3 GB. Is there a better approach to this problem? Can I perhaps jump to the right locations in a speedier way? Help appreciated.

use Statistics::Descriptive;

open (F, $data) || die;
while (<F>) {
    chomp;
    if ($_ !~ /Header/g) {
        @line  = split("\t", $_);
        $table = $_;
        $chr   = $line[0];
        $start = $line[2];
        $end   = $line[3];
        # file name
        $name   = "score.txt";
        @array  = ();
        @sorted = ();
        $sum    = 0;   # reset the running sum for each region, otherwise the mean drifts
        #print "$chr\.$name\n";
        open (R, "$chr\.$name") || die ("cannot open chrom file\n");
        while (<R>) {
            chomp;
            # format : chrom ID, position, score
            # chr1    10925   0.239
            @q     = split("\t", $_);
            $pos   = $q[1];
            $score = $q[2];
            #print "$pos\n";
            if ($pos >= $start) {
                if ($pos <= $end) {
                    push (@array, $score);
                    $sum = $sum + $score;
                }
            }
        }
        close(R);
        # get mean, min and max
        $len    = $#array + 1;
        @sorted = sort {$a <=> $b} @array;
        $min    = $sorted[0];
        $max    = $sorted[$#sorted];
        $mean   = $sum / $len;
        $stat   = Statistics::Descriptive::Full->new();
        $stat->add_data(@sorted);
        $geo_mean = $stat->geometric_mean();
        #print "$table\t$geo_mean\t$min\t$max\t$len\n";
        print OUT "$table\t$mean\t$geo_mean\t$min\t$max\t$len\n";
    }
}
close(F);
close(OUT);

Re: speeding up parsing, jump to line
by dave_the_m (Monsignor) on Aug 15, 2014 at 09:28 UTC
    How many lines (approx) are there in the $data file? Are the chromosome files already sorted by pos or not? I presume that there are approx 23 chromosome files?

    Dave.

      There are 40,000 lines in the data file, and each chromosome file has about 50-200 million lines. At the moment the script takes ~20 min to print out one line.

Re: speeding up parsing, jump to line
by graff (Chancellor) on Aug 15, 2014 at 13:19 UTC
    The answers to dave_the_m's questions will determine what sort of solution to look for. E.g. if the $data file is relatively small, you can load it into memory, then read each *.score.txt file exactly once to load up relevant info for each item in $data, then do stats on the info - something like this:
    use strict;
    use warnings;

    my $data = "some/file.name";
    open( F, $data ) or die "$data: $!\n";
    my %targets;
    while (<F>) {
        next if ( /Header/ );
        chomp;
        my ( $chr, $start, $end ) = ( split( /\t/ ) )[0,2,3];
        push @{$targets{$chr}}, { table => $_, start => $start, end => $end };
    }
    for my $chr ( keys %targets ) {
        open( R, "$chr.score.txt" ) or die "$chr.score.txt: $!\n";
        while (<R>) {
            chomp;
            my ( $pos, $score ) = ( split( /\t/ ) )[1,2];
            for my $range ( @{$targets{$chr}} ) {
                if ( $pos >= $$range{start} and $pos <= $$range{end} ) {
                    push @{$$range{scores}}, $score;
                }
            }
        }
    }
    # do statistics on contents of %targets…
    If $data is too large, or requires too much data from the *.score.txt files to be held in memory for each $chr, then you may have to create intermediate output files for each $chr (or for each start-end range in $data), so that the stats can be run quickly and easily on those files.
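    To round the idea off, the statistics pass hinted at by the final comment could be a plain loop over %targets. Here is a rough sketch only: it assumes an already-opened OUT handle and the same mean/min/max/count columns as the original script, and leaves the geometric mean out for brevity.

    use List::Util qw( sum min max );

    for my $chr ( keys %targets ) {
        for my $range ( @{$targets{$chr}} ) {
            my $scores = $$range{scores} or next;   # skip ranges that matched no positions
            my $len    = @$scores;
            my $mean   = sum(@$scores) / $len;
            print OUT join( "\t", $$range{table}, $mean,
                            min(@$scores), max(@$scores), $len ), "\n";
        }
    }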
Re: speeding up parsing, jump to line
by Cristoforo (Curate) on Aug 18, 2014 at 19:45 UTC
    One way I can see to get a speedup is to drop Statistics::Descriptive (which you use only for the geometric mean) and roll your own.

    Here is a benchmark - it compares your method with one that doesn't use Statistics::Descriptive.

    Depending on the number of scores in the array, my method shows a speedup of roughly 12 times (with 1000 scores) up to 20 times (with 100 scores).

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Statistics::Descriptive;
    use Benchmark qw/ cmpthese /;
    use List::Util qw/ sum min max /;

    for my $len (100, 500, 1000) {
        my @scores = map {sprintf "%.3f", rand 29} 1 .. $len;
        print "\n\nNumber of scores: $len\n";

        my $results = cmpthese (-1, {
            stat => sub {
                my $stat = Statistics::Descriptive::Full->new();
                $stat->add_data(@scores);
                @scores = sort {$a <=> $b} @scores;
                my ($min, $max) = @scores[0, -1];
                my $len  = @scores;                # number of scores
                my $mean = sum(@scores) / @scores;
                my $geo_mean = $stat->geometric_mean();
            },
            util => sub {
                my $len  = @scores;                # number of scores
                my $min  = min @scores;
                my $max  = max @scores;
                my $mean = sum(@scores) / @scores;
                my $geo_mean = geo_mean(@scores);
            },
        },
        );
    }

    sub geo_mean {
        my $prod = 1;
        $prod *= $_ for @_;
        return $prod ** (1/ @_);
    }
    This benchmark produced the following output.
    Number of scores: 100
              Rate  stat  util
    stat    5265/s    --  -95%
    util  103936/s 1874%    --

    Number of scores: 500
             Rate  stat  util
    stat   1652/s    --  -93%
    util  24549/s 1386%    --

    Number of scores: 1000
             Rate  stat  util
    stat    890/s    --  -93%
    util  12274/s 1278%    --
    I believe further speedups could be achieved by putting your large 3GB score files into a database. The program might look like this. It assumes you have created a separate table for each of the score files.
    #!/usr/bin/perl
    use strict;
    use warnings;

    use List::Util qw/ sum min max /;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=junk.lite", "", "",
                           {PrintError => 1}) or die "Can't connect";

    my $data = "data.txt";
    open F, '<', $data or die "Unable to open '$data' for reading: $!";
    open OUT, '>', 'outfile.csv'
        or die "Unable to open 'outfile.csv' for writing: $!";

    while (my $table = <F>) {
        next if $table =~ /Header/;
        chomp $table;
        my ($chr, undef, $start, $end) = split /\t/, $table;

        my $sql = <<SQL;
    SELECT score
    FROM $chr
    WHERE pos >= ? AND pos <= ?
    SQL

        my $sth = $dbh->prepare($sql);
        $sth->execute($start, $end) or die $DBI::errstr;

        my $set    = $sth->fetchall_arrayref;
        my @scores = map $_->[0], @$set;
        next unless @scores;

        my $len  = @scores;                # number of scores
        my $min  = min @scores;
        my $max  = max @scores;
        my $mean = sum(@scores) / @scores;
        my $geo_mean = geo_mean(@scores);

        print OUT join("\t", $table, $mean, $geo_mean, $min, $max, $len), "\n";
    }

    close(F)   or die "Unable to close '$data': $!";
    close(OUT) or die "Unable to close 'outfile.csv': $!";

    sub geo_mean {
        my $prod = 1;
        $prod *= $_ for @_;
        return $prod ** (1/ @_);
    }
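    For reference, loading one score file into its own table could be done with a one-off script along these lines. This is only a sketch: the database name junk.lite and the pos/score column layout are assumptions chosen to match the query above, and the index on pos is what lets SQLite answer the range query without scanning the whole table.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $chr  = shift // 'chr1';              # table name, e.g. chr1
    my $file = "$chr.score.txt";             # chrom ID, position, score

    my $dbh = DBI->connect("dbi:SQLite:dbname=junk.lite", "", "",
                           {RaiseError => 1, AutoCommit => 0});

    $dbh->do("CREATE TABLE IF NOT EXISTS $chr (pos INTEGER, score REAL)");
    my $ins = $dbh->prepare("INSERT INTO $chr (pos, score) VALUES (?, ?)");

    open my $fh, '<', $file or die "Unable to open '$file': $!";
    while (<$fh>) {
        chomp;
        my (undef, $pos, $score) = split /\t/;   # chrom ID column is redundant here
        $ins->execute($pos, $score);
    }
    close $fh;

    # index pos so the range query in the main script doesn't do a full table scan
    $dbh->do("CREATE INDEX IF NOT EXISTS ${chr}_pos ON $chr (pos)");
    $dbh->commit;
    $dbh->disconnect;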
Re: speeding up parsing, jump to line
by Anonymous Monk on Aug 15, 2014 at 17:20 UTC

    If the score-file entries are distinguished by a unique position, you essentially have a (sparse) array $score[$pos]. You'll want to build a temporary database (mapping) for each of the score files.
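    One cheap way to get such a mapping, without a full database engine, is to convert each score file once into a flat binary file of packed doubles indexed by position; any start-end range can then be read with a single seek. A rough sketch follows - the chrN.score.bin naming and the 8-byte record size are assumptions, and it presumes a score of exactly 0 never occurs (holes in the sparse file read back as 0):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # one-time conversion: chrN.score.txt -> chrN.score.bin (8 bytes per position)
    sub build_index {
        my ($chr) = @_;
        open my $in,  '<',  "$chr.score.txt" or die "$chr.score.txt: $!";
        open my $out, '+>', "$chr.score.bin" or die "$chr.score.bin: $!";
        binmode $out;
        while (<$in>) {
            chomp;
            my (undef, $pos, $score) = split /\t/;
            seek $out, $pos * 8, 0;              # jump straight to this position's slot
            print {$out} pack('d', $score);
        }
        close $in;
        close $out;
    }

    # later: fetch all scores for a region with one seek and one read
    sub region_scores {
        my ($chr, $start, $end) = @_;
        open my $fh, '<', "$chr.score.bin" or die "$chr.score.bin: $!";
        binmode $fh;
        seek $fh, $start * 8, 0;
        read $fh, my $buf, ($end - $start + 1) * 8;
        close $fh;
        # positions with no entry unpack as 0; filter them out
        return grep { $_ != 0 } unpack 'd*', $buf;
    }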