Picking up Values By Group

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
I have the following dataset, where each DNA (ACGT) string has its corresponding values. The values come in groups and are TAB-separated. Given a string of length K, there will be K groups. Each group contains 4 values, which correspond to A, C, G, T respectively.

AGAC <TAB> 9 -29 -39 -37 <TAB> 27 -28 -39 -37 <TAB> 26 -27 -39 -37 <TAB> 27 -27 -39 12

What I want to do is to extract the corresponding base value of the given DNA string. Hence with the given string above the desired output is:

$VAR = [9,-39, 26, -27];

Note that tag length may be greater than four (up to 100 bp). Is there a fast way to achieve this? For there are millions of such lines.

---
neversaint and everlastingly indebted.......

Re: Picking up Values By Group by BrowserUk (Patriarch) on Feb 04, 2009 at 12:07 UTC
Mixing tab delimiters with space-delimited data is a really bad idea, and if you have any choice in the matter, you should change it.

On the basis that you don't have the choice, the following should work, but realise that the tabs I've embedded in the data will likely have been corrupted in the process of upload and download, and the wrapping etc. that PM does to code:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];

    my %data;
    while( <DATA> ) {
        ## Split the line on tabs, trimming stray whitespace from each field
        my( $str, @values ) = map{ s[^\s+|\s+$][]g; $_ } split "\t";
        ## One array of four values (A, C, G, T) per position in the string
        $data{ $str } = [ map [ split ' ' ], @values ];
    }
    pp %data;

    my @output;
    while( my( $key, $valueRef ) = each %data ) {
        my @required;
        for my $c ( 0 .. length( $key ) - 1 ) {
            ## Pick the value in group $c that corresponds to the base at position $c
            push @required, $valueRef->[ $c ][ index "ACGT", substr $key, $c, 1 ];
        }
        push @output, \@required;
    }
    pp \@output;

    __DATA__
    AGAC	9 -29 -39 -37	27 -28 -39 -37	26 -27 -39 -37	27 -27 -39 12
    ACGT	1 -2 3 -4	5 -6 7 -8	9 -10 11 -12	13 -14 15 -16

Output:

    c:\test>junk6
    (
      "AGAC",
      [
        [9, -29, -39, -37],
        [27, -28, -39, -37],
        [26, -27, -39, -37],
        [27, -27, -39, 12],
      ],
      "ACGT",
      [
        [1, -2, 3, -4],
        [5, -6, 7, -8],
        [9, -10, 11, -12],
        [13, -14, 15, -16],
      ],
    )
    [[9, -39, 26, -27], [1, -6, 11, -16]]

This is the same, but I've substituted the text '<TAB>' for the tab character, which should make it easier to try: Read more... (1202 Bytes)

Like I say, if you have any influence over the file format, change the tab delimiters to something visible that does not match "\s".

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
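Since the question mentions millions of lines, it may help to see the same lookup done one line at a time, without collecting everything in a hash first. The following is only a minimal sketch under stated assumptions: the sub name `pick_values` is hypothetical, each line is assumed to carry exactly one tab-separated group per base, and tags are assumed to contain only A, C, G and T. No speed claim is made for it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns an array ref holding,
# for each base in the tag, the value at that base's column in its group.
# Assumes one group per base and tags made only of A, C, G, T.
sub pick_values {
    my ($line) = @_;
    chomp $line;
    my ( $tag, @groups ) = split /\t/, $line;
    $tag =~ s/^\s+|\s+$//g;    # the sample data pads fields with spaces

    my @picked;
    for my $i ( 0 .. length($tag) - 1 ) {
        my @vals = split ' ', $groups[$i];              # four values: A C G T
        my $col  = index 'ACGT', substr( $tag, $i, 1 ); # 0..3 for a valid base
        push @picked, $vals[$col];
    }
    return \@picked;
}

# Build the sample line with explicit tabs so nothing depends on copy/paste.
my $line = join "\t", 'AGAC', '9 -29 -39 -37', '27 -28 -39 -37',
                      '26 -27 -39 -37', '27 -27 -39 12';
print join( ',', @{ pick_values($line) } ), "\n";    # prints "9,-39,26,-27"
```

In a real run the sub would be called inside a `while ( my $line = <$fh> )` loop, so only one line is held in memory at a time.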
Re^2: Picking up Values By Group by holli (Abbot) on Feb 04, 2009 at 16:35 UTC
I would use a lookup hash for the indices instead of calling `index` repeatedly.

holli When you're up to your ass in alligators, it's difficult to remember that your original purpose was to drain the swamp.
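The lookup-hash idea could be sketched roughly as below. The names `%col_for` and `base_columns` are made up for illustration and do not come from the thread; the only assumption carried over is the A, C, G, T ordering of each group stated in the question.

```perl
use strict;
use warnings;

# Column of each base inside a four-value group (order A, C, G, T).
my %col_for = ( A => 0, C => 1, G => 2, T => 3 );

# Hypothetical helper: map a tag to the list of column indices,
# one hash lookup per character.
sub base_columns {
    my ($tag) = @_;
    return map { $col_for{$_} } split //, $tag;
}

print join( ' ', base_columns('AGAC') ), "\n";    # prints "0 2 0 1"
```

Each returned index would then replace the `index "ACGT", ...` call when subscripting the per-position group. A character outside A, C, G, T yields `undef` here rather than the -1 that `index` returns.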
Re^3: Picking up Values By Group by BrowserUk (Patriarch) on Feb 04, 2009 at 16:47 UTC
"instead of calling index repeatedly."

The tradeoff is:

- scanning a 4 character string for a single character.
- hashing a single character to a 32-bit hash and then performing a modulo 4 operation upon it.

Which actually favours the former:

    #! perl -slw
    use strict;
    use Benchmark qw[ cmpthese ];

    ## Keys as posted; note that G and T are absent, so those lookups return undef
    our %lookup = ( A=>0, B=>1, C=>2, D=>3 );
    our $input = 'ACGT' x 1000;

    cmpthese -1, {
        index => q[
            our( %lookup, $input );
            my $n;
            $n = index "ACGT", $_ for split '', $input;
        ],
        hash => q[
            our( %lookup, $input );
            my $n;
            $n = $lookup{ $_ } for split '', $input;
        ],
    };

    __END__
    c:\test>junk5
            Rate  hash index
    hash   107/s    --  -21%
    index  135/s   27%    --
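A third way to get the column indices, not discussed or benchmarked in this thread, is to translate the whole tag in one pass with `tr` and split the result. The sub name `base_columns_tr` is hypothetical, and this is only a sketch: whether it beats either approach above would have to be measured on real data.

```perl
use strict;
use warnings;

# Hypothetical helper: turn each base into its column digit in one pass,
# then split the digit string into a list of indices.
# Characters other than A, C, G, T are left untranslated.
sub base_columns_tr {
    my ($tag) = @_;
    ( my $digits = $tag ) =~ tr/ACGT/0123/;
    return split //, $digits;
}

print join( ' ', base_columns_tr('ACGT') ), "\n";    # prints "0 1 2 3"
```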
Re: Picking up Values By Group by Anonymous Monk on Feb 04, 2009 at 16:27 UTC
How fast is "fast"? i.e. What code have you written? How fast does it run? How much faster does it need to run?

