PerlMonks  

Re: Faster and more efficient way to read a file vertically

by LanX (Saint)
on Nov 03, 2017 at 15:15 UTC


in reply to Faster and more efficient way to read a file vertically

> but this takes enormous amount of time

what does this mean?

Maybe it's just file access on the HD?

Please show some reference code.

> Any ideas?

You can slurp the whole file and run a regex ... something like @col10 = /^.{9}(.)/g on it (with the appropriate /s or /m modifier of course)

Corrected: my @col = ( $file =~ /^.{9}(.)/mg );
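
For illustration, here is a minimal sketch of the slurp-and-regex idea. The sample data and the in-memory filehandle are assumptions standing in for the real input file; only the regex itself comes from the post above.

```perl
use strict;
use warnings;

# Hypothetical sample input; an in-memory filehandle stands in for the real file.
my $data = "ACGTACGTACGT\nACGT\nTTTTTTTTTCTT\nGGGGGGGGGAGG\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

# Slurp: with $/ undefined, a single read returns the whole file.
my $file = do { local $/; <$fh> };
close $fh;

# /m lets ^ match at the start of every line. Without /s, "." never matches
# a newline, so lines shorter than 10 characters yield no match.
my @col = ( $file =~ /^.{9}(.)/mg );

print "@col\n";
```

The short second line ("ACGT") is simply skipped, so @col may hold fewer entries than the file has lines.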

Using unpack might be even faster, but I'm no expert here.
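
A rough sketch of what an unpack version could look like, assuming single-byte (ASCII) data. The reads and the column index $i are made up for the example, and whether this actually beats the regex would need benchmarking.

```perl
use strict;
use warnings;

my $i     = 9;                                           # zero-based column (hypothetical)
my @reads = qw(ACGTACGTACGT TTTTTTTTTCTT GGGGGGGGGAGG);  # hypothetical reads

# Template: "x$i" skips $i bytes forward, "a" then extracts one byte.
my $template = "x$i a";

my @col;
for my $read_seq (@reads) {
    next if length($read_seq) <= $i;    # guard: do not skip past the end of the string
    my ($letter) = unpack $template, $read_seq;
    push @col, $letter;
}

print "@col\n";
```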

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

Re^2: Faster and more efficient way to read a file vertically
by Anonymous Monk on Nov 03, 2017 at 15:19 UTC
    So basically I have this (brute-force attack):
    while(<>) {
        if($_=~/^(.*?)\t(.*)/) {
            $read_seq=$1;
            $read_id=$2;
            @split_read=split(//, $read_seq);
            $respective_read_letter=$split_read[$i];
            if($respective_read_letter eq 'A') {$count_A++;}
            elsif($respective_read_letter eq 'T') {$count_T++;}
            elsif($respective_read_letter eq 'C') {$count_C++;}
            elsif($respective_read_letter eq 'G') {$count_G++;}
            elsif($respective_read_letter eq '.') {$count_dot++;}
            else {print "ERROR in read: $read\t$respective_read_letter\n";}
        }
    }
    $total=$count_A+$count_T+$count_C+$count_G+$count_dot;
    $fraction_A = sprintf("%.2f", 100*($count_A/$total));
    $fraction_T = sprintf("%.2f", 100*($count_T/$total));
    $fraction_C = sprintf("%.2f", 100*($count_C/$total));
    $fraction_G = sprintf("%.2f", 100*($count_G/$total));
    $fraction_dot = sprintf("%.2f", 100*($count_dot/$total));
    print $actual_pos,"\t",$expected_letter,"\t",$fraction_A,"\t",$fraction_T,"\t",$fraction_G,"\t",$fraction_C,"\t",$fraction_dot,"\n";

      If you're really only going to be doing one column, but want it to be chosen by the variable $i, I'd suggest substr: $respective_read_letter = substr $read_seq, $i, 1;. If finding an optimum solution is important to you (i.e., if you'll use this script many times for the foreseeable future, rather than just once or twice where "fast enough" is fast enough), then I'd recommend benchmarking substr vs. unpack vs. LanX's regex (and any others that are suggested). But whatever you do, make sure to use ++LanX's hash %count.

      use warnings;
      use strict;
      use Benchmark qw/cmpthese/;
      use Test::More tests => 1;

      my @dataset = ();
      push @dataset, join('', map { (qw/A C G T/)[rand 4] } 1 .. 30 ) for 1 .. 1000;
      my $i = $ARGV[0] // 10;

      sub test {
          my $fnref = shift;
          my $count;
          for my $read_seq( @dataset ) {
              my $letter = $fnref->($read_seq, $i);
              $count->{$letter}++;
          }
          return $count;
      }

      sub rfn {
          test( sub {
              my $skip = $_[1];
              $_[0] =~ /.{$skip}(.)/;
              return $1;
          });
      };

      sub sfn {
          test( sub {
              substr $_[0], $_[1], 1;
          });
      };

      sub ufn {
          test( sub {
              ... # I'm no unpack expert
          });
      };

      cmpthese(0, {
          regex => \&rfn,
          substr => \&sfn,
          #unpack => \&ufn,
      });

      is_deeply rfn(), sfn(), 'same results';

      $i is variable in your example. Reading vertically doesn't make sense then.

      I'd suggest $count{$letter}++ with a hash %count to speed things up.
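
To make the %count suggestion concrete, here is a small sketch of how the if/elsif chain could collapse into one hash increment. The input lines, the column $i, and the validity check against [ACGT.] are assumptions for the example, not the original poster's exact logic.

```perl
use strict;
use warnings;

my $i     = 2;    # zero-based column to inspect (hypothetical)
my @lines = ("ACGT\tread1", "AAGT\tread2", "TT.T\tread3", "ACXT\tread4");  # made-up input

my %count;
for my $line (@lines) {
    my ($read_seq, $read_id) = split /\t/, $line, 2;
    my $letter = substr $read_seq, $i, 1;
    if ($letter =~ /^[ACGT.]$/) {
        $count{$letter}++;    # one hash replaces five separate counters
    }
    else {
        warn "ERROR in read: $read_id\t$letter\n";
    }
}

# Sum all counters, then report each letter as a percentage of the total.
my $total = 0;
$total += $_ for values %count;
die "no valid letters found\n" unless $total;

for my $letter (qw(A T G C .)) {
    printf "%s\t%.2f\n", $letter, 100 * ($count{$letter} // 0) / $total;
}
```

Letters that never occur have no key in the hash, which is why the report falls back to 0 with the // operator.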

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!
