in reply to for each unique value in a column find the max value..need perl script

Use List::Util. Row by row, split each line into fields on whitespace. Use the 'order' column as a hash key, and put the rest of the columns into an anonymous array that you push onto your hash for that key (use a HoAoA so that you may have multiple entries per key).

Next, iterate over the keys. For each hash entry, pull the list of mtimes out of the AoA portion of the data structure and get a max() of those values. Then replace the mtime column in each row of the AoA with the value that max() returned.

Now move your original file (rename) to filename.bak (for example). Then open a new file for output with the original file's name, and write your structure back out again in the intended format.
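A minimal sketch of that approach, assuming whitespace-separated fields, mtime as the second column, and a hypothetical filename (none of these details come from the original question):

use strict;
use warnings;
use List::Util qw(max);
use File::Copy qw(move);

my $file = 'data.txt';    # hypothetical filename
my %rows;                 # HoAoA: order value => [ [rest-of-columns], ... ]

open my $in, '<', $file or die "Can't read $file: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($order, @rest) = split ' ', $line;
    push @{ $rows{$order} }, \@rest;    # several rows may share one key
}
close $in;

# For each key, take the max of the mtimes and write it back into every row.
for my $order (keys %rows) {
    my $max = max map { $_->[0] } @{ $rows{$order} };    # mtime is $rest[0]
    $_->[0] = $max for @{ $rows{$order} };
}

move($file, "$file.bak") or die "Can't rename $file: $!";
open my $out, '>', $file or die "Can't write $file: $!";
for my $order (sort keys %rows) {
    print {$out} join(' ', $order, @$_), "\n" for @{ $rows{$order} };
}
close $out;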

This solution does hold the entire file in memory, so it wouldn't scale well to huge files. But if you were dealing with truly huge data sets you would already have a database, and updates would be as simple as an SQL statement.
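For instance, if the data lived in a SQLite table (the table and column names here are hypothetical, not from the question), the whole job reduces to one UPDATE:

use DBI;

# Hypothetical table 'records' with columns order_id and mtime.
my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
    { RaiseError => 1 });
$dbh->do(q{
    UPDATE records
    SET mtime = (SELECT MAX(mtime) FROM records AS r2
                 WHERE r2.order_id = records.order_id)
});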

If you have a question on part of the implementation, be specific as to which part eludes you, and we'll try to help.


Dave


Re^2: for each unique value in a column find the max value..need perl script
by qmenon (Initiate) on May 31, 2011 at 17:20 UTC
    Thanks all for your replies.

    I reformatted the data as below:

    order mtime no size id day date

    14098703993 154538.354300 200 1 101510
    14098703993 154539.420000 200 1 101511
    14098703994 154538.398200 487 1 100888
    14098703994 154610.720000 487 1 91588
    14098703995 154538.401200 200 1 101502
    14098703995 154539.420000 200 1 101500

    use List::Util qw(max min);

    my %id_hash;
    open (DATA, ".txt");
    while (<DATA>) {
        chomp;
        my ($order, $mtime, $size, $id, $date) = split /\t/;
        push @{ $id_hash{$order}{$id}{mtime} }, $mtime;
        push @{ $id_hash{$order}{$id}{size } }, $size;
        push @{ $id_hash{$order}{$id}{date } }, $date;
    }

    open (OUT, ">output.txt");
    for my $order (keys %id_hash) {
        for my $id (keys %{ $id_hash{$order} }) {
            my $Low = min( @{ $id_hash{$order}{$id}{mtime} } );
            print OUT "$order $Low \n";
        }
    }

    Now the problem is this does not give duplicate order entries! I think I am unable to do the following: I want the duplicate order values kept as they are, and I am unable to replace the oldest mtime into the date field.

    So now I am writing the stupid code below, which will be the loooongest code of my life...:

    open (OUT, ">output.txt"); open (IN, "input1.txt");->original file1 with all data while($line=<IN>){ chomp($line); ($Date,$MTime,$inserdate,$inserttime,$Id,$Phase,$Size,$day,$order) += split(/ /,$line); open (INL, "file2.txt");-->contains the sorted order values of order v +alues($x) from file 1 while($linel=<INL>){ chomp($linel); ($x,$y,$z,$q,$t)= split(/ /,$linel); if($x == $order) { #print OUT "$a $x $b $z $q $t \n"; print OUT $Date," ",$M_Time," ",$inserdate," ",$y," ",$Id," ",$Pha +se," ",$Size," ",$day," ",$order,"\n"; } } } close(INL); close(IN); close(OUT); print "DONE";
    Hope it makes sense to you... I am not an expert in Perl, just trial and error, guys, but now I really need your help, please!
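
    A sketch of the same two-file merge done with a hash lookup, so that file2.txt is read once instead of being rescanned for every line of input1.txt (the filenames and column layout are taken from the code above; it assumes each order value appears once in file2.txt):

    use strict;
    use warnings;

    # Read file2.txt once into a hash: order value => replacement value.
    my %repl_for;
    open my $inl, '<', 'file2.txt' or die "file2.txt: $!";
    while (my $linel = <$inl>) {
        chomp $linel;
        my ($x, $y) = split / /, $linel;
        $repl_for{$x} = $y;
    }
    close $inl;

    open my $in,  '<', 'input1.txt' or die "input1.txt: $!";
    open my $out, '>', 'output.txt' or die "output.txt: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($Date, $MTime, $inserdate, $inserttime, $Id,
            $Phase, $Size, $day, $order) = split / /, $line;
        next unless exists $repl_for{$order};   # same filter as $x == $order
        # $inserttime is replaced by the value from file2.txt, as above.
        print {$out} join(' ', $Date, $MTime, $inserdate, $repl_for{$order},
            $Id, $Phase, $Size, $day, $order), "\n";
    }
    close $in;
    close $out;
    print "DONE\n";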

      A solution: mind you, I wouldn't use this on a very big file. Not exactly the method davido was describing, but it will do.

      my %id_hash;
      my @lines = ();
      open (DATA, "test.txt");
      while ($line = <DATA>) {
          chomp $line;
          my @line = split /\t/, $line;
          push @lines, \@line;    # push the original line onto an array as an anonymous array
          if ($line[1] > $id_hash{$line[0]}) {
              $id_hash{$line[0]} = $line[1];   # track the biggest mtime for a given order id
          }
      }

      open (OUT, ">output.txt");
      foreach my $item (@lines) {
          my @line = @{$item};                 # get the original line
          $line[4] = $id_hash{$line[0]};       # replace the fifth element with the calculated maximum
          print OUT join("\t", @line), "\n";   # print the adapted line
      }

      Greetings

      Martell

        Thanks, Martell. You are right, it does look like it will take time to run on big data files, which is my case; I will give it a go anyway.

      That's a bit hard to follow. I count seven captions (headers),
      order mtime no size id day date,
      but even hypothesizing that the dot in the second field,
      14098703993 154538.354300 200 1 101510,
      denotes a field break, I can only find six fields.

      7==6 does not compute.

        Ah, now I get it. I just removed one field because I didn't need it in the output file. Sorry for the confusion.

        Hi ww

        If I understand your question correctly, I guess you have missed the last field in my input file?

        20100607 154538.354300 200 1 101510 14098703993

        Thanks