Parsing and Finding Max Min

Perl_Crazy has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I am trying to write code that finds the minimum and maximum values in a given field. It produces the correct output for the example below, but it misbehaves on large files: the output is many times larger than the input and contains unwanted records. Please help me correct the code. For this example I get the correct output:

Books 6 159290954 159331385 + Author

Here is my full code with the example data:

#!/usr/bin/perl

open(FH, $ARGV[0]) || die("Cannot open:$!");

while (<FH>) {
   ($Query, $Score, $Start, $End, $one, $two) = split;

   if ( ! exists ( $hash{$Query} ) ) {
      $hash{$Query}[0] = $Score;
      $hash{$Query}[1] = $Start;
      $hash{$Query}[2] = $End;
      $hash{$Query}[3] = $one;
      $hash{$Query}[4] = $two;
   }
   else {
      if ( $hash{$Query}[1] > $Start ) {
           $hash{$Query}[0] = $Score;
           $hash{$Query}[1] = $Start;
           $hash{$Query}[2] = $End;
           $hash{$Query}[3] = $one;
           $hash{$Query}[4] = $two;
      }
   }
   if ( ! exists ( $hash2{$Query} ) ) {
      $hash2{$Query}[0] = $Score;
      $hash2{$Query}[1] = $Start;
      $hash2{$Query}[2] = $End;
      $hash{$Query}[3] = $one;
      $hash{$Query}[4] = $two;
   }
   else {
      if ( $hash2{$Query}[2] < $End ) {
           $hash2{$Query}[0] = $Score;
           $hash2{$Query}[1] = $Start;
           $hash2{$Query}[2] = $End;
           $hash{$Query}[3] = $one;
           $hash{$Query}[4] = $two;
      }
   }
}

foreach $Query ( keys (%hash) ) {
foreach $Query ( keys (%hash2) ) {
   print "$Query\t$hash{$Query}[0]\t$hash{$Query}[1]\t$hash2{$Query}[2]\t$hash{$Query}[3]\t$hash{$Query}[4]\n";
}
}
close(FH);
__DATA__
Books    6    159290954    159291342    +     Author
Books    6    159294558    159294653    +     Author
Books    6    159316253    159316398    +     Author
Books    6    159330999    159331385    +     Author
Books    6    159290971    159290997    +     Author
Books    6    159316253    159316398    +     Author
Books    6    159330999    159331289    +     Author
Books    6    159316268    159316398    +     Author
Books    6    159330999    159331245    +     Author

Thanks in advance


Replies are listed 'Best First'.
Re: Parsing and Finding Max Min by almut (Canon) on Dec 30, 2009 at 14:03 UTC
I think the core of your problem is the nested loops:

foreach $Query ( keys (%hash) ) {
foreach $Query ( keys (%hash2) ) {
   print "$Query\t$hash{$Query}[0]\t$hash{$Query}[1]\t$hash2{$Query}[2]\t$hash{$Query}[3]\t$hash{$Query}[4]\n";
}
}

As soon as the hashes hold more than one entry, the number of output lines multiplies: the inner loop prints one line per key of `%hash2` for every key of `%hash`, so N distinct queries produce N × N lines. Also, it's generally not a good idea to use the same loop variable (`$Query`) in both loops :)
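A minimal sketch of the single-loop alternative, assuming both hashes are filled for every query as in the original `while` loop. The helper name `format_records` and the second sample query are illustrative only and not part of the original post:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build one output line per query from the two hashes.
# $min_by_query plays the role of %hash (record with the smallest Start),
# $max_by_query plays the role of %hash2 (record with the largest End).
sub format_records {
    my ($min_by_query, $max_by_query) = @_;
    my @lines;
    # A single loop: both hashes share the same keys, so each query is visited once.
    for my $query (sort keys %$min_by_query) {
        my $min = $min_by_query->{$query};
        my $max = $max_by_query->{$query};
        push @lines, join("\t", $query, $min->[0], $min->[1], $max->[2], $min->[3], $min->[4]);
    }
    return @lines;
}

# Illustrative contents, as the hashes might look after reading the input.
my %hash = (
    Books => [6, 159290954, 159291342, '+', 'Author'],
    Coopy => [1, 123456789, 987654321, '+', 'Author'],
);
my %hash2 = (
    Books => [6, 159330999, 159331385, '+', 'Author'],
    Coopy => [1, 123456789, 987654321, '+', 'Author'],
);

print "$_\n" for format_records(\%hash, \%hash2);
```

With two queries this prints two lines, where the nested loops would print four.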
Re: Parsing and Finding Max Min by RMGir (Prior) on Dec 30, 2009 at 13:56 UTC
You should definitely be using strict and warnings, for starters.

Next, do you really intend to update

$hash{$Query}[3] = $one;
$hash{$Query}[4] = $two;

in all 4 blocks? That seems strange.

If you didn't intend that, I'd suggest simplifying your code, which would also make debugging simpler:

while (<FH>) {
   my ($Query, $Score, $Start, $End, $one, $two) = split;
   my $aref = [$Score, $Start, $End, $one, $two];

   if ( ! exists ( $hash{$Query} ) ) {
      $hash{$Query} = $aref;
   }
   else {
      if ( $hash{$Query}[1] > $Start ) {
         $hash{$Query} = $aref;
      }
   }
   if ( ! exists ( $hash2{$Query} ) ) {
      $hash2{$Query} = $aref;
   }
   else {
      if ( $hash2{$Query}[2] < $End ) {
         $hash2{$Query} = $aref;
      }
   }
}

That might make things a bit easier to debug...

Mike
Re: Parsing and Finding Max Min by Cristoforo (Curate) on Dec 31, 2009 at 02:18 UTC
As almut said, as the size of the file grows, so will the hashes. I came up with a solution that only keeps one query in a hash at a time, so your entries won't accumulate in the hash.

foreach $Query ( keys (%hash2) )

It isn't necessary to loop over the keys in hash2. Just use the first loop and, in your print string, `$hash2{$Query}[2]` will display ok.

As others have noted, it's not clear why you want to re-assign all the values in the record whenever you find a new `max` or `min`. In your sample data, the only fields that changed were the `start, end` fields.

Chris

#!/usr/bin/perl
use strict;
use warnings;

my @cols = qw/ score start stop one two /;
my %line_before;

my ($query, @data) = split /\t/, <DATA>;
chomp $data[-1]; # last @data has a newline
@{ $line_before{$query} }{ @cols } = @data;
my ($min, $max) = @data[1, 2];

while (<DATA>) {
    chomp;
    my %current_line;
    ($query, @data) = split /\t/;
    @{ $current_line{$query} }{ @cols } = @data;

    # if the prior and current queries don't match
    if (! exists $line_before{ $query }) {
        print_record(\%line_before, \@cols);
        ($min, $max) = @{ $current_line{$query} }{ qw/ start stop / };
    }
    else {
        my $start = $current_line{ $query }{ start };
        if ($start < $min) {
            $min = $start;
        }
        else {
            $current_line{ $query }{ start } = $min;
        }

        my $stop = $current_line{ $query }{ stop };
        if ($stop > $max) {
            $max = $stop;
        }
        else {
            $current_line{ $query }{ stop } = $max;
        }
    }
    %line_before = %current_line;
    print_record(\%line_before, \@cols) if eof;
}

sub print_record {
    my ($rec, $cols) = @_;
    my ($query) = keys %$rec; # There is only 1 key
    print join("\t", $query, @{ $rec->{$query} }{ @$cols }), "\n";
}
__DATA__
Books	6	159290954	159291342	+	Author
Books	6	159294558	159294653	+	Author
Books	6	159316253	159316398	+	Author
Books	6	159330999	159331385	+	Author
Books	6	159290971	159290997	+	Author
Books	6	159316253	159316398	+	Author
Books	6	159330999	159331289	+	Author
Books	6	159316268	159316398	+	Author
Books	6	159330999	159331245	+	Author
Coopy	1	123456789	987654321	+	Author

Output

C:\perlp>perl 814937.pl
Books	6	159290954	159331385	+	Author
Coopy	1	123456789	987654321	+	Author

Update: Captured 'query' in $query (and made the matching changes to the array indices). Reset min and max when the query changes.

Update 2: I was assuming the queries come in groups (each query appearing only within its own bunch of lines). Is that the case? Or can they be interspersed?
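If the queries can in fact be interspersed, as Update 2 asks, one possible approach is to keep a single small record per query and update only its start and end. This is a sketch under that assumption and not the code posted above; the helper name `min_max_by_query` and the mixed-order sample lines are made up for illustration. Memory still grows with the number of distinct queries, though not with the number of lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: fold input lines into one record per query,
# keeping the smallest start and the largest end seen for that query.
sub min_max_by_query {
    my (@lines) = @_;
    my (%rec, @order);
    for my $line (@lines) {
        my ($query, $score, $start, $end, @rest) = split ' ', $line;
        next unless defined $end;    # skip blank or short lines
        if (!exists $rec{$query}) {
            push @order, $query;     # remember first-seen order for output
            $rec{$query} = { score => $score, start => $start,
                             end   => $end,   rest  => [@rest] };
            next;
        }
        my $r = $rec{$query};
        $r->{start} = $start if $start < $r->{start};
        $r->{end}   = $end   if $end   > $r->{end};
    }
    my @out;
    for my $query (@order) {
        my $r = $rec{$query};
        push @out, join("\t", $query, @{$r}{qw(score start end)}, @{ $r->{rest} });
    }
    return @out;
}

# Illustrative input with the two queries mixed together.
my @sample = (
    "Books 6 159290954 159291342 + Author",
    "Coopy 1 123456789 987654321 + Author",
    "Books 6 159330999 159331385 + Author",
    "Books 6 159290971 159290997 + Author",
);

print "$_\n" for min_max_by_query(@sample);
```

Because the record is keyed by query, the order of the input lines does not matter; the score and trailing fields are taken from the first line seen for each query, which is an arbitrary choice here.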