Re: File Manipulation - Need Advise!
by Old_Gray_Bear (Bishop) on Jan 03, 2008 at 17:46 UTC
Whenever you want the unique members of a data-set, think about using a hash, keyed from the field you want to be unique. Once you have cycled through your input, print the keys from the hash and you're done.
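A minimal sketch of that approach, using the sample data from this thread (I'm assuming whitespace-separated fields with the hostname in the first column):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;
my $header = <DATA>;    # keep the header line out of the hash
while (my $line = <DATA>) {
    my ($key) = split /\s+/, $line;   # first field is the unique key
    $seen{$key} = $line;              # later duplicates overwrite earlier ones
}
print $header;
print $seen{$_} for sort keys %seen;

__DATA__
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxxx1' IC---
30F-WKS `1781183799.xxx11' IC---
ADM34A3F9 `1781183799.41455' IC---
```

As written, the last record per host wins; guard the assignment with an `exists` check if you want the first instead.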
----
I Go Back to Sleep, Now.
OGB
my %data;
my $header = <>; # first line
while (<>) {
    my ($key) = split /\t/;
    $data{$key} = $_;
}
# output:
print $header;
foreach my $key (sort keys %data) {
    print $data{$key};
}
To use it as is, call the script with "file2.txt" as a parameter on the command line, and redirect the script's STDOUT to "file1.txt":
perl thescript.pl file2.txt >file1.txt
file1.txt output is the following:
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxx11' IC---
30F-WKS `1781183799.xxxx1' IC---
ADM34A3F9 `1781183799.41455' IC---
I want
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxx11' IC---
ADM34A3F9 `1781183799.41455' IC---
> Whenever you want the unique members of a data-set, think about using a hash
When you want the pairwise unique members of a serial set, think about a state variable.
If you need uniqueness across an entire set, there's no question that hashes are the most useful tool. The problem, though, is that you then have to store all the keys.
It is not uncommon to want to dedup successive runs of duplicates (think of Unix's 'uniq'). That's when this second class of solution comes into play: set a state variable and read one line at a time. You may have to keep the previous line or two around to compute your state, and you may have to do some work at the end to flush any stored lines.
my $thisKey;
my $lastLine = <>;
my $lastKey  = '';  # first line is header, so always print
while (<>) {
    if (/(.*?)\t/) {
        $thisKey = $1;
    } else {
        warn "bad data: $_ had no tab\n";
    }
    if ($thisKey ne $lastKey) {
        print $lastLine;
    }
    $lastLine = $_;
    $lastKey  = $thisKey;
}
print $lastLine;
This is a big win when you have millions and millions of entries to sift through.
Re: File Manipulation - Need Advise!
by ysth (Canon) on Jan 03, 2008 at 19:13 UTC
You do not need a hash. In fact, your original code is very close to doing what you want. Just a few tweaks:
#!/usr/bin/perl
use strict;
use warnings;
open(FILE2, ">file1.txt") || warn "Could not open\n";
open(FILE3, "file2.txt")  || warn "Could not open\n";
my $Previous = "";
my @data = <FILE3>;
my $index = 0;
foreach my $_data (@data)
{
    $index++;
    chomp($_data);
    my @Current = split(/\s+/, $_data);
    if ($index == 1)
    {
        # do nothing.
    }
    else
    {
        my @Previous = split(/\s+/, $Previous);
        if ($Current[0] ne $Previous[0])
        {
            print FILE2 $Previous, "\n";
        }
    }
    $Previous = $_data;
}
if ($Previous) {
    print FILE2 $Previous, "\n";
}
close(FILE2);
close(FILE3);
I made the following changes:
- Added strict and warnings. Corrected @foo[0] to $foo[0] (see the perldiag entry for the warning it gave before), and declared variables.
- Moved the $index check so $Previous isn't used unless it's been set.
- Changed to split on whitespace, not tabs (since the data you provided didn't have tabs).
- Added newlines to what's written out.
- Added a block after the loop to print the final line that had been saved.
This solution (like the one from the original post) may not work properly unless all hostname entries are sorted beforehand. However, by using a hash you can deal with an unsorted list of hostnames.
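A short sketch of that hash version, again assuming whitespace-separated fields with the hostname in the first column; note the duplicate hostnames here are deliberately non-adjacent:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hash-based dedup: works even when duplicate hostnames are not adjacent,
# so the input does not need to be sorted first.
my %by_host;
my $header = <DATA>;
while (my $line = <DATA>) {
    my ($host) = split /\s+/, $line;
    $by_host{$host} = $line unless exists $by_host{$host};  # keep first record per host
}
print $header;
print $by_host{$_} for sort keys %by_host;

__DATA__
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxxx1' IC---
ADM34A3F9 `1781183799.41455' IC---
30F-WKS `1781183799.xxx11' IC---
```

Drop the `exists` guard if you would rather keep the last record per host instead of the first.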
Correct. I was taking it for granted, given the OP's code, that only consecutive duplicates should be suppressed.
Re: File Manipulation - Need Advise!
by blue_cowdawg (Monsignor) on Jan 03, 2008 at 18:58 UTC
Dear Monk,
Let me, at the risk of being repetitive since you've already gotten advice on this subject, try to make things clearer for you.
Consider the following code:
#!/usr/bin/perl -w
use strict;
my %storage = ();
# untested, but should in theory work...
my $junk = <DATA>; # get rid of header
while (my $line = <DATA>) {
    chomp($line); # get rid of the newline
    my ($host, $dist_id, $status) = split(/[\s\n\t]+/, $line); # split on any whitespace
    $storage{$host} = {
        host    => $host,
        dist_id => $dist_id,
        status  => $status,
    }; # put this into a hash keyed on the host field
}
# We never removed the newline character from $junk, so...
print $junk; # we reclaim this from the trash can
foreach my $key (sort keys %storage) {
    #
    # Print the remaining record matching each key
    printf "%s\t%s\t%s\n", $storage{$key}->{host}, $storage{$key}->{dist_id}, $storage{$key}->{status};
}
exit(0);
__END__
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxxx1' IC---
30F-WKS `1781183799.xxx11' IC---
ADM34A3F9 `1781183799.41455' IC---
The way this works: each subsequent record you read in for the same host overwrites the earlier one in the hash %storage, so you get the last record for each host in your output. Since you said in the CB you don't know whether your fields are space- or tab-separated, I covered both bases by using the character class /[\s\n\t]+/ in the split call (strictly speaking, \s alone already matches tabs and newlines).
Hope this helps
Peter L. Berghold -- Unix Professional
Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re: File Manipulation - Need Advise!
by jrsimmon (Hermit) on Jan 03, 2008 at 18:13 UTC
You need a hash. Something like this should work:
use strict;
use warnings;
open(SOURCE, "test.txt") || warn "Could not open\n";
my @data = <SOURCE>; # just FYI -- slurping is dangerous on very large files
close(SOURCE);
my %filtered_data;
foreach my $line_of_data (@data) {
    my @split_data_values = split(/\s/, $line_of_data);
    my $computer_name = shift(@split_data_values);
    $filtered_data{$computer_name} = "@split_data_values";
}
open(RESULT, ">test.out") || warn "Could not open\n";
foreach my $computer (keys(%filtered_data)) {
    print RESULT "$computer $filtered_data{$computer}\n";
}
close(RESULT);
Re: File Manipulation - Need Advise!
by Codon (Friar) on Jan 03, 2008 at 18:57 UTC
You didn't mention this, but if order matters in some way, you might want two data structures: one to de-duplicate the output (a hash) and one to maintain ordering (an array). I don't know if you noticed this with the previous examples, but the header line can get mixed into the output somewhere (seemingly at random, thanks to the hashing algorithm) unless it is handled separately.
Alternatively, I provide this quick example:
#!/usr/bin/perl
use strict;
use warnings;
my @order;
my %data;
while (<DATA>) {
    my ($key, $value) = split /\t/, $_, 2;
    push @order, $key;
    $data{$key} = $value;
}
for my $key (@order) {
    printf("%s\t%s", $key, delete $data{$key}) if $data{$key};
}
__DATA__
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxxx1' IC---
30F-WKS `1781183799.xxx11' IC---
ADM34A3F9 `1781183799.41455' IC---
Ivan Heffner
Sr. Software Engineer
WhitePages.com, Inc.
Re: File Manipulation - Need Advise!
by dwm042 (Priest) on Jan 03, 2008 at 22:41 UTC
This is an easy problem to solve (and you can sort multiple ways with the same code) if you use a hash (I'll note the link shows yet another way to do this kind of operation).
#!/usr/bin/perl
use warnings;
use strict;
use Getopt::Long;
use Pod::Usage;
=head1 NAME
unique.pl -- examines data and keeps the unique ones.
=head1 SYNOPSIS
unique.pl [options]
Options:
--help Brief help message
--man Full documentation
--first Keep the first one found rather than the last.
=head1 DESCRIPTION
unique.pl -- examines data and keeps the unique ones.
Program can keep the first or the last one found.
=cut
my $help  = 0;
my $man   = 0;
my $first = 0;
GetOptions(
    'help|?' => \$help,
    man      => \$man,
    first    => \$first,
) or pod2usage(2);
pod2usage( -exitval => 0, -verbose => 1 ) if $help;
pod2usage( -exitval => 0, -verbose => 2, -noperldoc => 1 ) if $man;
my %hash = ();
while (<DATA>) {
    chomp;
    my ( $comp, $id, $status ) = split( /\s+/, $_, 3 );
    next if $comp =~ m/COMPUTER/;
    if ($first) {
        next if defined $hash{$comp};
    }
    $hash{$comp} = [ $id, $status ];
}
for ( sort keys %hash ) {
    printf "%s %s %s\n", $_, $hash{$_}->[0], $hash{$_}->[1];
}
__DATA__
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxxx1' IC---
30F-WKS `1781183799.xxx11' IC---
ADM34A3F9 `1781183799.41455' IC---
and the results are:
C:\Code>perl unique.pl --help
Usage:
unique.pl [options]
Options:
--help Brief help message
--man Full documentation
--first Keep the first one found rather than the last.
C:\Code>perl unique.pl
30F-WKS `1781183799.xxx11' IC---
ADM34A3F9 `1781183799.41455' IC---
C:\Code>perl unique.pl --first
30F-WKS `1781183799.xxxx1' IC---
ADM34A3F9 `1781183799.41455' IC---