umasuresh has asked for the wisdom of the Perl Monks concerning the following question:


Hi Monks,
In a recent data munging challenge, I had to calculate the average score for each unique key with multiple scores, and output a tab-delimited file with the unique key, the number of occurrences, and the average score. The input file is similar to the one listed below, with the unique key in column 1 and the scores in column 2.
-------inp.txt--------
1	196
1	190
1	200
2	20
3	25
3	19
3	39
4	40
4	41
4	45
Below is the code that does this. My questions are as follows:
1. Is there a simpler solution to this task? Will there be a memory problem with this code if the input file contains close to 300,000 lines?

2. Is there a one-liner that can achieve the same?
use strict;
use warnings;

my $in_file = 'inp.txt';
open(IN, $in_file) || die "couldn't open $in_file: $!\n";

my %r;
while (<IN>) {
    chomp;
    my $line = $_;
    my @f = split(/\t/, $line);
    push(@{ $r{$f[0]} }, $f[1]);
}
close IN;

foreach my $key (sort keys %r) {
    my @scores = @{ $r{$key} };
    my $count  = scalar @scores;
    my $sum    = 0;
    for my $i (0 .. $#scores) {
        $sum += $scores[$i];
    }
    my $avg = $sum / $count;
    print "$key\t$count\t$avg\n";
}
-US

Replies are listed 'Best First'.
Re: Data munging
by ikegami (Patriarch) on Jan 22, 2010 at 00:41 UTC

    The general idea is right, but you could simplify your average calculation.

    use List::Util qw( sum );

    my %data;
    while (<>) {
        chomp;
        my ($k, $v) = split /\t/;
        push @{ $data{$k} }, $v;
    }

    local $, = "\t";
    local $\ = "\n";

    for my $k (keys %data) {
        my $data = $data{$k};
        print $k, 0+@$data, sum(@$data)/@$data;
    }
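
    (A usage sketch, assuming the script above is saved as, say, avg.pl — the file name is illustrative. With the sample inp.txt from the original post it should print the following, though the line order may vary because keys %data is unsorted.)

    $ perl avg.pl inp.txt
    1	3	195.333333333333
    2	1	20
    3	3	27.6666666666667
    4	3	42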

    Memory usage shouldn't be a problem with 300,000 lines, but you could reduce it further by summing and counting the elements as you go along.

    my %data;
    while (<>) {
        chomp;
        my ($k, $v) = split /\t/;
        $data{$k}[0]++;
        $data{$k}[1] += $v;
    }

    local $, = "\t";
    local $\ = "\n";

    for my $k (keys %data) {
        my $data = $data{$k};
        print $k, $data->[0], $data->[1]/$data->[0];
    }

    If the keys are sorted (or at least grouped) in the input, you could reduce memory usage to something constant.

    my $last;
    my $sum   = 0;
    my $count = 0;

    local $, = "\t";
    local $\ = "\n";

    while (<>) {
        chomp;
        my ($k, $v) = split /\t/;
        if (defined($last) && $k ne $last) {
            print $last, $count, $sum/$count;
            ($count, $sum) = (0, 0);
        }
        $last = $k;
        $count++;
        $sum += $v;
    }

    if (defined($last)) {
        print $last, $count, $sum/$count;
    }

    As a one-liner, how about

    perl -lane'$d{$F[0]}[0]++; $d{$F[0]}[1] += $F[1]; }{ $, = "\t"; print $_, $d{$_}[0], $d{$_}[1]/$d{$_}[0] for keys %d;'

    It can be shortened, but making it any simpler would hurt readability.
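
    (An aside, in case the `}{` idiom is opaque: under -n, perl wraps the program in a while (<>) loop, so `}{` closes that loop body and opens a bare block that runs once after the last line of input. A rough sketch of what the one-liner above expands to — an illustration, not a byte-for-byte deparse:)

    $\ = "\n";                  # from -l: print appends a newline
    while (<>) {
        chomp;                  # from -l: strip the input newline
        our @F = split ' ';     # from -a: autosplit each line into @F
        $d{$F[0]}[0]++;         # per-key line count
        $d{$F[0]}[1] += $F[1];  # per-key running sum
    }
    {
        # the "}{" trick: this bare block runs once, after the loop
        $, = "\t";
        print $_, $d{$_}[0], $d{$_}[1]/$d{$_}[0] for keys %d;
    }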

    Update: I wasn't printing out the count. Fixed.


      Thanks much, ikegami. The first two options are awesome; I am still trying to wrap my brain around the third!
        It counts lines and maintains a sum of the values seen to date. When the key changes, it prints the average, then resets the line count and the sum.
        1 196  ->  count = 1  sum = 196
        1 190  ->  count = 2  sum = 196+190
        1 200  ->  count = 3  sum = 196+190+200
        key changed, so print average, and reset count and sum
        2 20   ->  count = 1  sum = 20
        key changed, so print average, and reset count and sum
        3 25   ->  count = 1  sum = 25
        3 19   ->  count = 2  sum = 25+19
        3 39   ->  count = 3  sum = 25+19+39
        key changed, so print average, and reset count and sum
        4 40   ->  count = 1  sum = 40
        4 41   ->  count = 2  sum = 40+41
        4 45   ->  count = 3  sum = 40+41+45
        eof, so print average
Re: Data munging
by toolic (Bishop) on Jan 22, 2010 at 00:49 UTC
    Accumulate the data as you read in the file:
    use warnings;
    use strict;

    my %r;
    while (<DATA>) {
        chomp;
        my ($k, $s) = split;
        $r{$k}{count}++;
        $r{$k}{sum} += $s;
    }

    for my $key (sort keys %r) {
        print "$key\t$r{$key}{count}\t", ($r{$key}{sum}/$r{$key}{count}), "\n";
    }

    __DATA__
    1 196
    1 190
    1 200
    2 20
    3 25
    3 19
    3 39
    4 40
    4 41
    4 45
Re: Data munging
by BrowserUk (Patriarch) on Jan 22, 2010 at 00:51 UTC

    Code wrapped for clarity:

    perl -anle"++$h{$F[0]}[0];$h{$F[0]}[1]+=$F[1]} {
      print qq[$_\t$h{$_}[0]\t],$h{$_}[1]/$h{$_}[0] for keys %h" munge.txt

    4	3	42
    1	3	195.333333333333
    3	3	27.6666666666667
    2	1	20

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi Monks, thanks much for all the valuable suggestions. I have another challenge at hand and would greatly appreciate any input. I have to compare a query text file against a reference text file and do the following:

      1. match the common keys, whose multiple values represent fragments of each key.
      2. for each key, compare the start and end columns of each fragment (column1 & column2).
      3. for each key, if the fragments overlap between the query and reference, keep count of the overlaps (a sketch of the overlap test follows this list).
      4. report the query sequence and insert the overlap count in column3, retaining all the other columns.
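
      (As a point of reference for step 3: two intervals overlap exactly when each one starts no later than the other ends, so the pairwise test can be a single comparison rather than four clauses. A minimal sketch, with illustrative names:)

      # Two closed intervals [$q_start, $q_end] and [$r_start, $r_end]
      # overlap if and only if each starts no later than the other ends.
      sub overlaps {
          my ($q_start, $q_end, $r_start, $r_end) = @_;
          return $q_start <= $r_end && $r_start <= $q_end;
      }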

      Note: The reference and query files are really large, running to 300,000 lines or more.
      This post is really long, so please bear with me! My questions are below:
      1. I feel that the code can be much simpler than what I have!
      2. Again, memory usage is a concern for large files!
      Here are the reference and query text files:
      ------ref.txt---------
      c1	120	134	-	AG
      c2	120	134	+	TC
      c1	130	300	-	AA
      c1	12	13000	-	AU
      c9	14000	14008	-	GN
      c9	900	1200	-	GX
      c10	10040	10050	+	GG
      c1	19992	20005	-	GG
      c1	12000	14000	+	TT
      -----query.txt--------
      c1	100	12000	+	AT
      c1	19800	20000	-	AG
      c1	20049	20800	-	GC
      c9	10078	14008	-	AG
      c11	10078	14008	-	TG
      c15	10078	14008	-	TC
      c9	1078	10008	-	TA
      c10	10080	10000	-	TT
      ------code-------------
      use strict;
      use warnings;
      use Data::Dumper;

      # if (@ARGV < 1)
      # {
      #     print "Usage: $0 inputDir \n"; #e.g ./
      #     exit;
      # }

      my $file1 = "ref.txt";
      my $file2 = "query.txt";
      # my($key1, $key2, %hash1, %hash2);

      open (IN1, '<'.$file1) || die "***can't open the file $!\n";
      my @lines1 = <IN1>;
      close IN1;

      #$i=0;
      for (@lines1) {
          chomp;
          my @a1 = split(/\t/, $_);
          my $key1 = $a1[0];
          my $rs = $a1[1];
          my $re = $a1[2];
          #push(@{ $hash1->{$key1} }, "$rs\t$re");
          push(@{ $hash1->{$key1} }, $_);
      }

      open (IN2, '<'.$file2) || die "***can't open the file $!\n";
      my @lines2 = <IN2>;
      close IN1;

      for (@lines2) {
          chomp;
          my @a2 = split(/\t/, $_);
          my $qs = $a2[1];
          my $qe = $a2[2];
          my $key2 = $a2[0];
          #push(@{ $hash2->{$key2} }, "$qs\t$qe");
          push(@{ $hash2->{$key2} }, $_);
      }
      #print Dumper(\%$hash2);

      @common_keys = grep { exists $hash1->{$_} } sort keys %$hash2;

      my %seen;
      for (sort @common_keys) {
          for my $r (0..$#{ $hash1->{$_} }) {
              for my $q (0..$#{ $hash2->{$_} }) {
                  my ($query_key, $query_start, $query_end, @qtail) = split(/\t/, $hash2->{$_}[$q]);
                  my ($ref_key, $ref_start, $ref_end, @rtail) = split(/\t/, $hash1->{$_}[$r]);
                  if (   ($query_start >= $ref_start   && $query_start <= $ref_end)
                      || ($query_end   >= $ref_start   && $query_end   <= $ref_end)
                      || ($ref_start   >= $query_start && $ref_start   <= $query_end)
                      || ($ref_end     >= $query_start && $ref_end     <= $query_end) )
                  {
                      $seen{$_}{$query_start}++;
                  }
              }
          }
      }
      #print Dumper(\%seen);

      my $overlap_count;
      for my $key (sort keys %$hash2) {
          for my $i (0..$#{ $hash2->{$key} }) {
              my @s = split(/\t/, $hash2->{$key}[$i]);
              my @head = @s[0..2];
              my @tail = @s[3..$#s];
              #print "***$tail[0]\n";
              my $start = $s[1];
              #print "***$start\n";
              if (exists $seen{$key}{$start}) {
                  $overlap_count = $seen{$key}{$start};
                  #print "$hash2->{$key}[$i]\t$overlap_count\n";
                  print map {"$_\t"} insert_field(\@head, \$overlap_count, \@tail);
                  print "\n";
              }
              else {
                  $overlap_count = 0;
                  #print "$hash2->{$key}[$i]\t$overlap_count\n";
                  print map {"$_\t"} insert_field(\@head, \$overlap_count, \@tail);
                  print "\n";
              }
              #print "\n";
          }
      }

        Why is the code you've posted so completely broken?

        • Where is the subroutine insert_field()?
        • Why are %hash1 & %hash2 declared as hashes, but used throughout as references: $hash1->{ ... }?
        • Why is @common_keys never declared?

        Did you think that you could fool us by sticking use strict; use warnings; at the top and declaring some of the variables with my, and so get us to solve the problems that you've failed to solve yourself?

        Tip: If you use strict and warnings from the get-go, you'll find it far easier to get your code right as you go, rather than painting yourself into a corner of undeclared and uninitialised variables.

        I have a 25-line solution to what I interpret your requirements to be. It produces this output from your samples:

        C:\test>819005
        c1 100 12000 11901 + AT
        c1 100 12000 11901 + AT
        c1 100 12000 12989 + AT
        c1 100 12000 13901 + AT
        c1 19800 20000 206 - AG
        c9 10078 14008 3931 - AG
        c9 1078 10008 9109 - TA

        But I cannot check whether that is correct because your code doesn't produce anything for me to compare it against :(


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.