in reply to File Manipulation - Need Advise!

Whenever you want the unique members of a data-set, think about using a hash, keyed from the field you want to be unique. Once you have cycled through your input, print the keys from the hash and you're done.
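
A minimal sketch of that idea, assuming tab-separated lines with the key in the first field. The sub name unique_by_first_field and the sample data are made up for illustration. It keeps the first whole line seen for each key and preserves input order, which is one possible design choice among several:

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines whose first tab-separated
# field has not been seen before, in their original order.
sub unique_by_first_field {
    my @lines = @_;
    my %seen;    # key => number of times the key has been seen
    my @kept;
    for my $line (@lines) {
        my ($key) = split /\t/, $line;
        next if $seen{$key}++;    # already kept a line for this key
        push @kept, $line;
    }
    return @kept;
}

# Made-up sample data: the third line repeats the key "a"
my @sample = ("a\t1\n", "b\t2\n", "a\t3\n");
print unique_by_first_field(@sample);
```

If only the keys themselves are needed, printing `sort keys %seen` after the loop would do.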

----
I Go Back to Sleep, Now.

OGB

Re^2: File Manipulation - Need Advise!
by bart (Canon) on Jan 03, 2008 at 18:07 UTC
    A worked-out version of Old Gray Bear's idea:
    my %data;
    my $header = <>;    # first line
    while (<>) {
        my ($key) = split /\t/;
        $data{$key} = $_;
    }
    # output:
    print $header;
    foreach my $key (sort keys %data) {
        print $data{$key};
    }
    To use it as is, call the script with "file2.txt" as parameter on the command line, and redirect the script's STDOUT to "file1.txt".
    perl thescript.pl file2.txt >file1.txt
      file1.txt output is the following:
      COMPUTER DISTRIBUTION_ID STATUS
      30F-WKS `1781183799.xxx11' IC---
      30F-WKS `1781183799.xxxx1' IC---
      ADM34A3F9 `1781183799.41455' IC---

      I want:
      COMPUTER DISTRIBUTION_ID STATUS
      30F-WKS `1781183799.xxx11' IC---
      ADM34A3F9 `1781183799.41455' IC---
        As someone said in the Chatterbox, your data may not be separated by tabs. In that case split /\t/ returns the whole record (line) as a single field, so the whole line is treated as the id.

        Replace split /\t/ in my code with split /\s+/.

        If it still won't work, then use the following code at the end, to test what's in the hash:

        use Data::Dumper;
        print Dumper \%data;
        and see what makes it fail.
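
        One hedged suggestion for that check: setting $Data::Dumper::Useqq makes Dumper print a tab as \t, so it is easy to see whether a key is a single field or a whole line. The sketch below uses made-up records and a hypothetical helper named first_field. It also uses split ' ', which behaves like split /\s+/ except that it ignores leading whitespace:

        ```perl
        use strict;
        use warnings;
        use Data::Dumper;

        $Data::Dumper::Useqq    = 1;    # print control characters such as tab escaped
        $Data::Dumper::Sortkeys = 1;    # dump hash keys in sorted order

        # Hypothetical helper: first whitespace-separated field of a line.
        sub first_field {
            my ($line) = @_;
            my ($key) = split ' ', $line;
            return $key;
        }

        # Made-up records: one tab-separated, one space-separated
        my %data;
        for my $line ("pc1\tid1\tok\n", "pc2 id2 ok\n") {
            $data{ first_field($line) } = $line;
        }
        print Dumper \%data;
        ```

        If the dumped keys contain spaces or \t, the separator in the split is not matching the data.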
Re^2: File Manipulation - Need Advise!
by WoodyWeaver (Monk) on Jan 04, 2008 at 22:50 UTC
    > Whenever you want the unique members of a data-set, think about using a hash
    When you want the pairwise unique members of a serial set, think about a state variable.

    If you need uniqueness across an entire set, there is no question that hashes are the most useful tool. The problem, though, is that you then have to store all the keys.

    It is not uncommon to want to dedup when there are successive runs (think unix's 'uniq'). That's when this second class comes into play. Set a state variable, and read one line at a time. You may have to keep around the previous line or two to compute your state. You may have to do some work at the end to clean up stored lines.

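    my $thisKey;
    my $lastLine = <>;
    my $lastKey  = '';    # first line is header, so always print
    while (<>) {
        if (/^(.*?)\t/) {
            $thisKey = $1;
        }
        else {
            warn "bad data: $_ had no tab\n";
            next;    # skip the bad line instead of reusing a stale key
        }
        if ($thisKey ne $lastKey) {
            print $lastLine;
        }
        $lastLine = $_;
        $lastKey  = $thisKey;
    }
    print $lastLine;
    This is a big win when you have millions and millions of entries to sift through, because only one line and one key are held in memory at a time. It only removes duplicates that are adjacent, so the input must already be grouped or sorted by key.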
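
    The loop above prints the last line of each run. If the first line of each run is wanted instead (as in the output requested earlier in the thread), the same state-variable idea might be sketched as follows. The sub name print_first_of_each_run and the sample data are made up, the key is assumed to be the first tab-separated field, and an in-memory filehandle stands in for a real file:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: copy lines from $in to $out, keeping only the
    # first line of each run that shares the same first tab-separated field.
    sub print_first_of_each_run {
        my ($in, $out) = @_;
        my $lastKey;    # undef until the first line has been read
        while (my $line = <$in>) {
            my ($key) = split /\t/, $line;
            # print when this is the first line or the key has changed
            print {$out} $line if !defined $lastKey or $key ne $lastKey;
            $lastKey = $key;
        }
    }

    # Made-up sample data, read through an in-memory filehandle
    my $sample = "a\t1\na\t2\nb\t3\na\t4\n";
    open my $in, '<', \$sample or die "cannot open in-memory handle: $!";
    print_first_of_each_run($in, \*STDOUT);
    ```

    Only $lastKey survives from one line to the next, so memory use does not grow with the input. Note that the final "a" line is printed again, because it starts a new run.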