Re: File Manipulation - Need Advise!
by Old_Gray_Bear (Bishop) on Jan 03, 2008 at 17:46 UTC
Whenever you want the unique members of a data-set, think about using a hash, keyed from the field you want to be unique. Once you have cycled through your input, print the keys from the hash and you're done.
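A minimal sketch of that approach, using the sample data from this thread (I'm assuming whitespace-separated fields with the hostname in the first column):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seen;
my $header = <DATA>;    # keep the header line out of the hash
while (my $line = <DATA>) {
    my ($key) = split /\s+/, $line;   # first field is the unique key
    $seen{$key} = $line;              # later duplicates overwrite earlier ones
}
print $header;
print $seen{$_} for sort keys %seen;

__DATA__
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxxx1' IC---
30F-WKS `1781183799.xxx11' IC---
ADM34A3F9 `1781183799.41455' IC---
```

As written, the last record per host wins; guard the assignment with an `exists` check if you want the first instead.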
----
I Go Back to Sleep, Now.
OGB
my %data;
my $header = <>; # first line
while (<>) {
    my ($key) = split /\t/;
    $data{$key} = $_;
}
# output:
print $header;
foreach my $key (sort keys %data) {
    print $data{$key};
}
To use it as is, call the script with "file2.txt" as a parameter on the command line, and redirect the script's STDOUT to "file1.txt":
perl thescript.pl file2.txt >file1.txt
file1.txt output is the following:
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxx11' IC---
30F-WKS `1781183799.xxxx1' IC---
ADM34A3F9 `1781183799.41455' IC---
I want
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxx11' IC---
ADM34A3F9 `1781183799.41455' IC---
> Whenever you want the unique members of a data-set, think about using a hash
When you want the pairwise unique members of a serial set, think about a state variable.
If you need uniqueness across an entire set, there's no question that hashes are the most useful tool. The problem, though, is that you then have to store all the keys.
It is not uncommon to want to dedup successive runs of duplicates (think of Unix's 'uniq'). That's when this second class of solution comes into play: set a state variable and read one line at a time. You may have to keep the previous line or two around to compute your state, and you may have to do some work at the end to flush any stored lines.
my $thisKey;
my $lastLine = <>;
my $lastKey  = '';  # first line is header, so always print
while (<>) {
    if (/(.*?)\t/) {
        $thisKey = $1;
    } else {
        warn "bad data: $_ had no tab\n";
    }
    if ($thisKey ne $lastKey) {
        print $lastLine;
    }
    $lastLine = $_;
    $lastKey  = $thisKey;
}
print $lastLine;
This is a big win when you have millions and millions of entries to sift through.
Re: File Manipulation - Need Advise!
by ysth (Canon) on Jan 03, 2008 at 19:13 UTC
You do not need a hash. In fact, your original code is very close to doing what you want. Just a few tweaks:
#!/usr/bin/perl
use strict;
use warnings;
open(FILE2, ">file1.txt") || warn "Could not open\n";
open(FILE3, "file2.txt")  || warn "Could not open\n";
my $Previous = "";
my @data = <FILE3>;
my $index = 0;
foreach my $_data (@data)
{
    $index++;
    chomp($_data);
    my @Current = split(/\s+/, $_data);
    if ($index == 1)
    {
        # do nothing.
    }
    else
    {
        my @Previous = split(/\s+/, $Previous);
        if ($Current[0] ne $Previous[0])
        {
            print FILE2 $Previous, "\n";
        }
    }
    $Previous = $_data;
}
if ($Previous) {
    print FILE2 $Previous, "\n";
}
close(FILE2);
close(FILE3);
I made the following changes:
- Added strict and warnings. Corrected @foo[0] to $foo[0] (see the perldiag entry for the warning it gave before), and declared variables.
- Moved the $index check so $Previous isn't used unless it's been set.
- Changed to split on whitespace, not tabs (since the data you provided didn't have tabs).
- Added newlines to what's written out.
- Added a block after the loop to print the final line that had been saved.
This solution (like the one from the original post) may not work properly unless all hostname entries are sorted beforehand. However, by using a hash you can deal with an unsorted list of hostnames.
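A short sketch of that hash version, again assuming whitespace-separated fields with the hostname in the first column; note the duplicate hostnames here are deliberately non-adjacent:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hash-based dedup: works even when duplicate hostnames are not adjacent,
# so the input does not need to be sorted first.
my %by_host;
my $header = <DATA>;
while (my $line = <DATA>) {
    my ($host) = split /\s+/, $line;
    $by_host{$host} = $line unless exists $by_host{$host};  # keep first record per host
}
print $header;
print $by_host{$_} for sort keys %by_host;

__DATA__
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxxx1' IC---
ADM34A3F9 `1781183799.41455' IC---
30F-WKS `1781183799.xxx11' IC---
```

Drop the `exists` guard if you would rather keep the last record per host instead of the first.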
Correct. I was taking it for granted, given the OP's code, that only consecutive duplicates should be suppressed.
Re: File Manipulation - Need Advise!
by blue_cowdawg (Monsignor) on Jan 03, 2008 at 18:58 UTC
Dear Monk,
Let me, at the risk of being repetitive since you've already gotten advice on this subject, try to make things clearer for you.
Consider the following code:
#!/usr/bin/perl -w
use strict;
my %storage = ();
# untested, but should in theory work...
my $junk = <DATA>; # get rid of header
while (my $line = <DATA>) {
    chomp($line); # get rid of the newline
    my ($host, $dist_id, $status) = split(/[\s\n\t]+/, $line); # split on any whitespace
    $storage{$host} = {
        host    => $host,
        dist_id => $dist_id,
        status  => $status,
    }; # put this into a hash keyed on the host field
}
# We never removed the newline character from $junk, so...
print $junk; # we reclaim this from the trash can
foreach my $key (sort keys %storage) {
    #
    # Print the remaining record matching each key
    printf "%s\t%s\t%s\n", $storage{$key}->{host}, $storage{$key}->{dist_id}, $storage{$key}->{status};
}
exit(0);
__END__
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxxx1' IC---
30F-WKS `1781183799.xxx11' IC---
ADM34A3F9 `1781183799.41455' IC---
The way this works: each subsequent record you read in for the same host overwrites the earlier one in the hash %storage, so you get the last record for each host in your output. Since you said in the CB you don't know whether your fields are space- or tab-separated, I covered both bases by using the character class /[\s\n\t]+/ in the split call (strictly speaking, \s alone already matches tabs and newlines).
Hope this helps
Peter L. Berghold -- Unix Professional
Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re: File Manipulation - Need Advise!
by jrsimmon (Hermit) on Jan 03, 2008 at 18:13 UTC
You need a hash. Something like this should work:
use strict;
use warnings;
open(SOURCE, "test.txt") || warn "Could not open\n";
my @data = <SOURCE>; # just FYI -- slurping is dangerous on very large files
close(SOURCE);
my %filtered_data;
foreach my $line_of_data (@data) {
    my @split_data_values = split(/\s/, $line_of_data);
    my $computer_name = shift(@split_data_values);
    $filtered_data{$computer_name} = "@split_data_values";
}
open(RESULT, ">test.out") || warn "Could not open\n";
foreach my $computer (keys(%filtered_data)) {
    print RESULT "$computer $filtered_data{$computer}\n";
}
close(RESULT);
Re: File Manipulation - Need Advise!
by Codon (Friar) on Jan 03, 2008 at 18:57 UTC
You didn't mention this, but if order matters in some way, you might want two data structures: one to de-duplicate the output (a hash) and one to maintain ordering (an array). I don't know if you noticed this with the previous examples, but the header line can get mixed into the output somewhere (seemingly at random, thanks to the hashing algorithm) unless it is handled separately.
Alternatively, I provide this quick example:
#!/usr/bin/perl
use strict;
use warnings;
my @order;
my %data;
while (<DATA>) {
    my ($key, $value) = split /\t/, $_, 2;
    push @order, $key;
    $data{$key} = $value;
}
for my $key (@order) {
    printf("%s\t%s", $key, delete $data{$key}) if $data{$key};
}
__DATA__
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxxx1' IC---
30F-WKS `1781183799.xxx11' IC---
ADM34A3F9 `1781183799.41455' IC---
Ivan Heffner
Sr. Software Engineer
WhitePages.com, Inc.
Re: File Manipulation - Need Advise!
by dwm042 (Priest) on Jan 03, 2008 at 22:41 UTC
This is an easy problem to solve (and you can sort multiple ways with the same code) if you use a hash (I'll note the link shows yet another way to do this kind of operation).
#!/usr/bin/perl
use warnings;
use strict;
use Getopt::Long;
use Pod::Usage;
=head1 NAME
unique.pl -- examines data and keeps the unique ones.
=head1 SYNOPSIS
unique.pl [options]
Options:
--help Brief help message
--man Full documentation
--first Keep the first one found rather than the last.
=head1 DESCRIPTION
unique.pl -- examines data and keeps the unique ones.
Program can keep the first or the last one found.
=cut
my $help  = 0;
my $man   = 0;
my $first = 0;
GetOptions(
    'help|?' => \$help,
    man      => \$man,
    first    => \$first,
) or pod2usage(2);
pod2usage( -exitval => 0, -verbose => 1 ) if $help;
pod2usage( -exitval => 0, -verbose => 2, -noperldoc => 1 ) if $man;
my %hash = ();
while (<DATA>) {
    chomp;
    my ( $comp, $id, $status ) = split( /\s+/, $_, 3 );
    next if $comp =~ m/COMPUTER/;
    if ($first) {
        next if defined $hash{$comp};
    }
    $hash{$comp} = [ $id, $status ];
}
for ( sort keys %hash ) {
    printf "%s %s %s\n", $_, $hash{$_}->[0], $hash{$_}->[1];
}
__DATA__
COMPUTER DISTRIBUTION_ID STATUS
30F-WKS `1781183799.xxxx1' IC---
30F-WKS `1781183799.xxx11' IC---
ADM34A3F9 `1781183799.41455' IC---
and the results are:
C:\Code>perl unique.pl --help
Usage:
unique.pl [options]
Options:
--help Brief help message
--man Full documentation
--first Keep the first one found rather than the last.
C:\Code>perl unique.pl
30F-WKS `1781183799.xxx11' IC---
ADM34A3F9 `1781183799.41455' IC---
C:\Code>perl unique.pl --first
30F-WKS `1781183799.xxxx1' IC---
ADM34A3F9 `1781183799.41455' IC---