oomwrtu has asked for the wisdom of the Perl Monks concerning the following question:
Alright, after trying to plow through this problem on my own, I decided I need to enlist some help. I am trying to compare all of the lines from three files and then print the result to a final file. The problem is that each line has to be identified by its id because some of the files don't have all of the lines, such as:
File 1:             File 2:
0001,aname,bname    0001,aname,bname
0002,cname,bname    0003,aname,bname
0004,dname,bname    0005,fname,bname
0005,dname,bname
If the ids are not the same, I don't want it to write anything to the final file unless one of the ids was blank, but I do want to write them if they are the same, such as:
Final:
0001,aname,bname
0002,cname,bname
0004,dname,bname
I can do this for about 900 of the approximately 1700 lines I have in each of the files before it just stops doing anything. My code is below (shortened for this post), along with snippets from two of the files I would like to compare: - - - solved using the code below, see my message about editing it, please.
use strict;
use warnings;
use CGI qw(:standard);
use CGI::Carp qw(warningsToBrowser fatalsToBrowser);
print "Cache-Control: max-age=30\n";
my %final;
my %compare1;
my %compare2;
my %compare3;
my $maxid = "2000";
open(DAT, "data/parsed-Black_Dragon.txt");
my @data = <DAT>;
close(DAT);
for(my $i = 0; $i < scalar(@data); $i++) { # dump file data into hashed array
my $id = substr($data[$i], 0, 4); # get current planet id
$compare1{$id} = $data[$i];
delete $data[$i];
}
@data = ();
open(DAT, "data/parsed-BMoom.txt");
@data = <DAT>;
close(DAT);
for(my $i = 0; $i < scalar(@data); $i++) { # dump file data into hashed array
my $id = substr($data[$i], 0, 4); # get current planet id
$compare2{$id} = $data[$i];
delete $data[$i];
}
@data = ();
open(DAT, "data/parsed-Litex.txt");
@data = <DAT>;
close(DAT);
for(my $i = 0; $i < scalar(@data); $i++) { # dump file data into hashed array
my $id = substr($data[$i], 0, 4); # get current planet id
$compare3{$id} = $data[$i];
delete $data[$i];
}
@data = ();
open(DAT,">data/parsed-all.txt"); # open appropriate parsed file and clear it
close(DAT);
for(my $i = 1; $i <= $maxid; $i++) {
my $currid = changeID($i);
my $delid = changeID($i - 1);
delete $compare1{$delid};
delete $compare2{$delid};
delete $compare3{$delid};
next if( defined $compare1{$currid} &&
defined $compare2{$currid} &&
defined $compare3{$currid} &&
$compare1{$currid} ne $compare2{$currid} &&
$compare1{$currid} ne $compare3{$currid} &&
$compare2{$currid} ne $compare3{$currid} );
open(DAT,">>data/parsed-all.txt");
if( defined $compare1{$currid} && !defined $compare2{$currid} && !defined $compare2{$currid} ) {
print DAT $compare1{$currid};
next;
}
if( defined $compare2{$currid} && !defined $compare1{$currid} && !defined $compare3{$currid} ) {
print DAT $compare2{$currid};
next;
}
if( defined $compare3{$currid} && !defined $compare1{$currid} && !defined $compare2{$currid} ) {
print DAT $compare2{$currid};
next;
}
if( defined $compare1{$currid} && defined $compare2{$currid} ) {
if( $compare1{$currid} eq $compare2{$currid} ) {
print DAT $compare1{$currid};
next;
}
}
if( defined $compare1{$currid} && defined $compare3{$currid} ) {
if( $compare1{$currid} eq $compare3{$currid} ) {
print DAT $compare1{$currid};
next;
}
}
if( defined $compare2{$currid} && defined $compare3{$currid} ) {
if( $compare2{$currid} eq $compare3{$currid} ) {
print DAT $compare2{$currid};
next;
}
}
close(DAT);
}
print "Location: planetDiscovered.cgi\n\n";
exit;
sub changeID {
return sprintf "%04d", $_[0];
}
where a parsed file follows the form: 0001,Nunki 2 1,5847,71%,0.71,4151.37,ThrevenGuard,-18,-26
UPDATE: If you can make this any shorter (or more efficient), please let me know. You can find each of the three data files at http://emino.realestateetools.com/ssprog/alpha/data/parsed-Black_Dragon.txt, http://emino.realestateetools.com/ssprog/alpha/data/parsed-BMoom.txt, http://emino.realestateetools.com/ssprog/alpha/data/parsed-Litex.txt. Thank you!
Re: Comparing lines of multiple files
by Zed_Lopez (Chaplain) on Oct 09, 2005 at 19:26 UTC
If I've understood you right, this should do it:
my %h;
# build a giant hash of all the info. Keys are ids, values
# are hashrefs whose keys are the source filename and whose
# values are the lines themselves.
while (<>) {
my @fields = split ',';
$h{$fields[0]}{$ARGV} = $_;
}
# for each id (lexically sorted)
for my $id (sort keys %h) {
my @keys = keys %{$h{$id}};
# if it was present in only one file, print it and move on
if (scalar @keys == 1) {
print $h{$id}{$keys[0]};
next;
}
# if it was present in more than one, find out whether
# all the lines are the same by building a hash with
# each line as the key, then testing whether you end
# up with more than one key.
my %cmp;
$cmp{$_} = '' for values %{$h{$id}};
print keys %cmp if scalar keys %cmp == 1;
}
Updated: Now I feel silly. This can be much simpler.
while (<>) {
my @fields = split ',';
$h{$fields[0]}{$_} = '';
}
for my $id (sort keys %h) {
print keys %{$h{$id}} if scalar keys %{$h{$id}} == 1;
}
and, if one really wanted, the for loop could even be the gratuitously uber-terse:
scalar keys %{$h{$_}} == 1 and print keys %{$h{$_}} for sort keys %h;
I love Perl.
Updated again: You know how it goes. You start thinking about how something can be terser, and next thing you know, you're golfing.
perl -F, -ane '$h{$F[0]}{$_}=0;END{keys%{$h{$_}}==1&&print keys%{$h{$_}}for sort keys%h}' f1.txt f2.txt
OK. I stop procrastinating now.
Re: Comparing lines of multiple files
by graff (Chancellor) on Oct 09, 2005 at 19:43 UTC
Your statement of the problem is a little confusing. You said:
I am trying to compare all of the lines from three files and then print the result to a final file.
But your code and data samples involve only two input files, not three. Next, you said:
If the ids are not the same, I don't want it to write anything to the final file unless one of the ids was blank, but I do want to write them if they are the same, such as:
But you show an example for "Final" output that has one line where the two inputs were identical (no diffs), followed by two lines whose index values exist only in "file 1". (And what do you mean, exactly, by "unless one of the ids was blank"?)
Maybe part of the problem is that you don't have an accurate and coherent spec for what the script is supposed to do? If there really are just two inputs, and those three lines you show under "Final:" are really the correct desired output, then it looks like the spec would be something like this:
For each line in File 1, print it to Final if: (a) the ID/Key value and data are identical to a line in File 2, or (b) the ID/Key value is not found in File 2.
For that, the following is one way to do it:
use strict;
my ( $file1, $file2 ) = @ARGV;
# (getting file names from command line is better than hard-coding them)
# read file2 first, to get the keys and data to test against
my %refdata;
open( F, $file2 ) or die "$file2: $!";
while (<F>) {
my ( $key, $data ) = split( /,/, $_, 2 ); # (in case key is not 4 digits)
$refdata{$key} = $data;
}
# now read file1, and output lines that meet the spec
open( F, $file1 ) or die "$file1: $!";
while (<F>) {
my ( $key, $data ) = split( /,/, $_, 2 );
print if ( !exists( $refdata{$key} ) or $data eq $refdata{$key} );
}
# (use the command line to redirect output to a "final" file -- e.g.:
#
# shell> perl your_script file1 file2 > final
#
# again, it's better than hard-coding another file name
|
After much head-scratching (I originally wrote a "what are you asking here?" response, too), I decided that what the OP meant was:
If an ID occurs in only one file, print the corresponding line.
If an ID occurs in multiple files, and all the corresponding lines have the exact same text, print the line.
This does correspond to the sample output. (I'm still puzzled by 'unless one of the IDs was blank.')
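The two rules above can be sketched for any number of input files. This is only an illustration, not code posted in the thread: `merge_by_id` is a hypothetical helper name, and the sample arrays simply reuse the File 1 / File 2 lines from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): takes one array ref of lines
# per input file and returns the lines that satisfy the two rules.
sub merge_by_id {
    my @files = @_;
    my %seen;    # id => { distinct line text => 1 }
    for my $lines (@files) {
        for my $line (@$lines) {
            chomp( my $text = $line );
            next unless length $text;          # skip empty lines
            my ($id) = split /,/, $text, 2;    # id is everything before the first comma
            $seen{$id}{$text} = 1;
        }
    }
    my @out;
    for my $id ( sort keys %seen ) {
        my @variants = keys %{ $seen{$id} };
        # Exactly one distinct text means the id was in a single file,
        # or every file that had it agreed on the whole line.
        push @out, $variants[0] if @variants == 1;
    }
    return @out;
}

# Sample data copied from the question.
my @file1 = ( "0001,aname,bname\n", "0002,cname,bname\n",
              "0004,dname,bname\n", "0005,dname,bname\n" );
my @file2 = ( "0001,aname,bname\n", "0003,aname,bname\n",
              "0005,fname,bname\n" );

print "$_\n" for merge_by_id( \@file1, \@file2 );
```

With the question's two sample files this prints the lines for ids 0001, 0002, 0003 and 0004: id 0005 is dropped because the two files disagree, and 0003 is kept because it occurs in only one file. A third file would just be one more array ref in the call.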
|
Thank you to everyone for your patience. I stumbled on this site and was so excited about the possibility of solving this problem that I didn't take as much time rereading what I posted as I should have (I know that's not a good thing). One thing I would like to clear up is that I am using this on a webpage, so many of the errors that you guys might be seeing aren't shown (unless I check the logs, which I should do). graff's code and Zed_Lopez's rewording had it almost entirely correct for two files. I actually have 3 files that I would like to combine, but I reduced it to 2 when I was working on it to try and simplify it.
-:-:- I deleted the rest of what I said because GrandFather posted code that I was able to use and adapt for three files. I am pretty sure it works as I want it to. It isn't nearly as efficient as graff's code, but it works. :D Again, thank you to everyone for your help. -:-:-
Re: Comparing lines of multiple files
by GrandFather (Saint) on Oct 09, 2005 at 22:53 UTC
0002,Nunki 2 2,6366,59%,0.59,3755.94,Honor,-23,-19
0005,Nunki 2 5,2615,24%,0.24,627.6,Bananiel,-44,-47
0010,Sagittarius 2 5,3414,75%,0.75,2560.5,Iridium,0,-45
0013,Rigel 2 1,6870,30%,0.3,2061,Black_Dragon,-44,95
0014,Rigel 2 2,5000,50%,0.5,2500,Black_Dragon,-35,102
0015,Rigel 2 3,2854,51%,0.51,1455.54,Bananiel,-30,96
0018,Rigel 2 6,4160,59%,0.59,2454.4,Khouri,-49,75
0019,Rigel 2 7,5801,18%,0.18,1044.18,ThrevenGuard,-69,103
0023,Fornacis 2 4,5483,52%,0.52,2851.16,unoccupied
Perl is Huffman encoded by design.
Re: Comparing lines of multiple files
by GrandFather (Saint) on Oct 09, 2005 at 22:39 UTC
It is good to see use warnings; use strict. It is very disappointing to see that when the code is run a large number of warnings are produced. Clean up the warnings first then see what problems remain, or come to us first with a small fragment of code that produces a warning and ask about the warning specifically.
That aside, this is a fairly well presented node, but you should show us how the current output is in error.
For sample code like this you could create the data files at the start and remove paths from file references to make the code easier to run by testers.
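A minimal sketch of that last suggestion, assuming made-up file names and contents (`file1.txt`, `file2.txt` and `write_samples` are hypothetical, not taken from the original node): the script writes its own small data files into the current directory, so a tester can run it without fetching anything or fixing paths.

```perl
use strict;
use warnings;

# Hypothetical sample data; names and contents are invented for illustration.
my %samples = (
    'file1.txt' => "0001,aname,bname\n0002,cname,bname\n",
    'file2.txt' => "0001,aname,bname\n0003,aname,bname\n",
);

# Write each sample into the current directory, so no paths are needed.
sub write_samples {
    my (%files) = @_;
    for my $name ( sort keys %files ) {
        open my $out, '>', $name or die "Can't write $name: $!";
        print {$out} $files{$name};
        close $out or die "Can't close $name: $!";
    }
    return sort keys %files;
}

my @names = write_samples(%samples);

# The code being asked about would go here; as a stand-in, count the lines.
for my $name (@names) {
    open my $in, '<', $name or die "Can't read $name: $!";
    my @lines = <$in>;
    close $in;
    printf "%s: %d lines\n", $name, scalar @lines;
}

unlink @names;    # clean up so the tester's directory is left as it was
```

Checking every `open` with `or die` also means a missing or unreadable file is reported immediately instead of silently producing empty data.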
Perl is Huffman encoded by design.
Re: Comparing lines of multiple files
by EvanCarroll (Chaplain) on Oct 09, 2005 at 21:52 UTC