Perl_Crazy has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I was trying to write code to get the minimum and maximum value in a given field. The code doesn't work properly when I run it on large files: it multiplies the file size and adds unwanted records. I get the correct output for this example, but it behaves incorrectly on large files. Please help me correct the code. The correct output for this example is:
Books    6    159290954    159331385    +    Author
Here is my full code with example:-
#!/usr/bin/perl
open(FH, $ARGV[0]) || die("Cannot open:$!");
while (<FH>) {
    ($Query, $Score, $Start, $End, $one, $two) = split;
    if ( ! exists ( $hash{$Query} ) ) {
        $hash{$Query}[0] = $Score;
        $hash{$Query}[1] = $Start;
        $hash{$Query}[2] = $End;
        $hash{$Query}[3] = $one;
        $hash{$Query}[4] = $two;
    }
    else {
        if ( $hash{$Query}[1] > $Start ) {
            $hash{$Query}[0] = $Score;
            $hash{$Query}[1] = $Start;
            $hash{$Query}[2] = $End;
            $hash{$Query}[3] = $one;
            $hash{$Query}[4] = $two;
        }
    }
    if ( ! exists ( $hash2{$Query} ) ) {
        $hash2{$Query}[0] = $Score;
        $hash2{$Query}[1] = $Start;
        $hash2{$Query}[2] = $End;
        $hash{$Query}[3] = $one;
        $hash{$Query}[4] = $two;
    }
    else {
        if ( $hash2{$Query}[2] < $End ) {
            $hash2{$Query}[0] = $Score;
            $hash2{$Query}[1] = $Start;
            $hash2{$Query}[2] = $End;
            $hash{$Query}[3] = $one;
            $hash{$Query}[4] = $two;
        }
    }
}
foreach $Query ( keys (%hash) ) {
    foreach $Query ( keys (%hash2) ) {
        print "$Query\t$hash{$Query}[0]\t$hash{$Query}[1]\t$hash2{$Query}[2]\t$hash{$Query}[3]\t$hash{$Query}[4]\n";
    }
}
close(FH);
__DATA__
Books 6 159290954 159291342 + Author
Books 6 159294558 159294653 + Author
Books 6 159316253 159316398 + Author
Books 6 159330999 159331385 + Author
Books 6 159290971 159290997 + Author
Books 6 159316253 159316398 + Author
Books 6 159330999 159331289 + Author
Books 6 159316268 159316398 + Author
Books 6 159330999 159331245 + Author
Thanks in advance

Replies are listed 'Best First'.
Re: Parsing and Finding Max Min
by almut (Canon) on Dec 30, 2009 at 14:03 UTC

    I think the core of your problems is your nested loops:

    foreach $Query ( keys (%hash) ) {
        foreach $Query ( keys (%hash2) ) {
            print "$Query\t$hash{$Query}[0]\t$hash{$Query}[1]\t$hash2{$Query}[2]\t$hash{$Query}[3]\t$hash{$Query}[4]\n";
        }
    }

    As soon as you have more than one entry in the hashes, the number of lines being output will multiply...  Also, it's generally not a good idea to use the same loop variable ($Query) in both loops :)
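A minimal sketch of a single-loop version, assuming `%hash` and `%hash2` have already been filled as in the original script. The literal values below are stand-ins: the `Books` entries come from the sample data, while `Pens` and the helper name `report_lines` are made up purely to show a second key.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in data: %hash holds the record with the smallest Start per query,
# %hash2 the record with the largest End. "Pens" is a made-up second key.
my %hash = (
    Books => [ 6, 159290954, 159291342, '+', 'Author' ],
    Pens  => [ 2, 100,       150,       '-', 'Maker'  ],
);
my %hash2 = (
    Books => [ 6, 159330999, 159331385, '+', 'Author' ],
    Pens  => [ 2, 120,       300,       '-', 'Maker'  ],
);

# Build exactly one output line per query:
# Score and min Start from the first hash, max End from the second.
sub report_lines {
    my ($min, $max) = @_;
    my @lines;
    for my $query ( sort keys %$min ) {
        push @lines, join "\t",
            $query,
            $min->{$query}[0],          # Score
            $min->{$query}[1],          # smallest Start
            $max->{$query}[2],          # largest End
            @{ $min->{$query} }[3, 4];  # remaining two fields
    }
    return @lines;
}

print "$_\n" for report_lines( \%hash, \%hash2 );
```

With two keys this prints two lines; the nested loops in the original would print four (2 x 2), which is where the extra records come from.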

Re: Parsing and Finding Max Min
by RMGir (Prior) on Dec 30, 2009 at 13:56 UTC
    You should definitely be using strict and warnings, for starters.

    Next, do you really intend to update

    $hash{$Query}[3] = $one; $hash{$Query}[4] = $two;
    in all 4 blocks? That seems strange.

    If you didn't intend that, what I'd suggest is modifying your code to simplify things - it would also make debugging simpler.

    while (<FH>) {
        my ($Query, $Score, $Start, $End, $one, $two) = split;
        my $aref = [$Score, $Start, $End, $one, $two];
        if ( ! exists ( $hash{$Query} ) ) {
            $hash{$Query} = $aref;
        }
        else {
            if ( $hash{$Query}[1] > $Start ) {
                $hash{$Query} = $aref;
            }
        }
        if ( ! exists ( $hash2{$Query} ) ) {
            $hash2{$Query} = $aref;
        }
        else {
            if ( $hash2{$Query}[2] < $End ) {
                $hash2{$Query} = $aref;
            }
        }
    }
    That might make things a bit easier to debug...

    Mike
Re: Parsing and Finding Max Min
by Cristoforo (Curate) on Dec 31, 2009 at 02:18 UTC
    As almut said, as the size of the file grows, so will the hashes. I came up with a solution that only keeps one query in a hash at a time. Then your entries won't accumulate in the hash.

    foreach $Query ( keys (%hash2) )

    It isn't necessary to loop over the keys in hash2. Just use the first loop and, in your print string, $hash2{$Query}[2] will display ok.

    As others have noted, it's not clear why you want to re-assign all the fields of a record whenever you find a new max or min.

    In your sample data, the only fields that changed were the start and end fields.

    Chris

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @cols = qw/ score start stop one two /;
    my %line_before;

    # Note: the fields in __DATA__ must be tab-separated for split /\t/ to work.
    my ($query, @data) = split /\t/, <DATA>;
    chomp $data[-1]; # last @data has a newline
    @{ $line_before{$query} }{ @cols } = @data;
    my ($min, $max) = @data[1, 2];

    while (<DATA>) {
        chomp;
        my %current_line;
        ($query, @data) = split /\t/;
        @{ $current_line{$query} }{ @cols } = @data;

        # if the prior and current queries don't match
        if (! exists $line_before{ $query }) {
            print_record(\%line_before, \@cols);
            ($min, $max) = @{ $current_line{$query} }{ qw/ start stop / };
        }
        else {
            my $start = $current_line{ $query }{ start };
            if ($start < $min) {
                $min = $start;
            }
            else {
                $current_line{ $query }{ start } = $min;
            }

            my $stop = $current_line{ $query }{ stop };
            if ($stop > $max) {
                $max = $stop;
            }
            else {
                $current_line{ $query }{ stop } = $max;
            }
        }
        %line_before = %current_line;
        print_record(\%line_before, \@cols) if eof;
    }

    sub print_record {
        my ($rec, $cols) = @_;
        my ($query) = keys %$rec; # There is only 1 key
        print join("\t", $query, @{ $rec->{$query} }{ @$cols }), "\n";
    }

    __DATA__
    Books	6	159290954	159291342	+	Author
    Books	6	159294558	159294653	+	Author
    Books	6	159316253	159316398	+	Author
    Books	6	159330999	159331385	+	Author
    Books	6	159290971	159290997	+	Author
    Books	6	159316253	159316398	+	Author
    Books	6	159330999	159331289	+	Author
    Books	6	159316268	159316398	+	Author
    Books	6	159330999	159331245	+	Author
    Coopy	1	123456789	987654321	+	Author
    Output
    C:\perlp>perl 814937.pl
    Books 6 159290954 159331385 + Author
    Coopy 1 123456789 987654321 + Author

    Update: Captured 'query' in $query (and changes to array indices). Reset min - max when a record changed.

    Update 2: I was assuming the queries come in groups (each query appearing only within its own bunch of lines). Is that the case? Or can they be interspersed?
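If the queries can be interspersed, one possible approach is to keep a single record per distinct query and update its start and end in place. This is only a sketch under assumptions: the sub name `min_max_by_query` is hypothetical, the fields are taken to be whitespace-separated as in the original post, the sample lines are a re-ordered excerpt of the data above, and `//=` needs Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read records from a filehandle and track, per query, the smallest start
# and the largest end, regardless of the order the lines arrive in.
# Returns a hash ref: query => [ score, min_start, max_end, one, two ].
sub min_max_by_query {
    my ($fh) = @_;
    my %range;
    while ( my $line = <$fh> ) {
        my ($query, $score, $start, $end, $one, $two) = split ' ', $line;
        next unless defined $two;    # skip blank or short lines

        # First sighting of a query stores the whole record; later lines
        # only widen the start/end range.
        my $rec = $range{$query} //= [ $score, $start, $end, $one, $two ];
        $rec->[1] = $start if $start < $rec->[1];
        $rec->[2] = $end   if $end   > $rec->[2];
    }
    return \%range;
}

# Re-ordered excerpt of the sample data, with the queries interleaved.
my $sample = <<'END';
Books 6 159290954 159291342 + Author
Coopy 1 123456789 987654321 + Author
Books 6 159330999 159331385 + Author
Books 6 159290971 159290997 + Author
END

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my $range = min_max_by_query($fh);
close $fh;

print join( "\t", $_, @{ $range->{$_} } ), "\n" for sort keys %$range;
```

The hash here grows with the number of distinct queries rather than with the number of lines, so the trade-off against the one-query-at-a-time version depends on how many different queries the file contains.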