Comparing lines of multiple files

by oomwrtu (Novice)
on Oct 09, 2005 at 17:37 UTC

oomwrtu has asked for the wisdom of the Perl Monks concerning the following question:

Alright, after trying to plow through this problem on my own, I decided I need to enlist some help. I am trying to compare all of the lines from three files and then print the result to a final file. The problem is that each line has to be identified by its id, because some of the files don't have all of the lines, such as:
File 1: File 2: 0001,aname,bname 0001,aname,bname 0002,cname,bname 0003,aname,bname 0004,dname,bname 0005,fname,bname 0005,dname,bname
If the ids are not the same, I don't want it to write anything to the final file unless one of the ids was blank, but I do want to write them if they are the same, such as:
Final:
0001,aname,bname
0002,cname,bname
0004,dname,bname
I can do this for about 900 of the approximately 1700 lines I have in each of the files before it just stops doing anything. My code is below (shortened for this post), along with snippets from two of the files I would like to compare: - - - solved using the code below, see my message about editing it, please.
use strict;
use warnings;
use CGI qw(:standard);
use CGI::Carp qw(warningsToBrowser fatalsToBrowser);

print "Cache-Control: max-age=30\n";

my %final;
my %compare1;
my %compare2;
my %compare3;
my $maxid = "2000";

open(DAT, "data/parsed-Black_Dragon.txt");
my @data = <DAT>;
close(DAT);
for(my $i = 0; $i < scalar(@data); $i++) { # dump file data into hashed array
    my $id = substr($data[$i], 0, 4); # get current planet id
    $compare1{$id} = $data[$i];
    delete $data[$i];
}
@data = ();

open(DAT, "data/parsed-BMoom.txt");
@data = <DAT>;
close(DAT);
for(my $i = 0; $i < scalar(@data); $i++) { # dump file data into hashed array
    my $id = substr($data[$i], 0, 4); # get current planet id
    $compare2{$id} = $data[$i];
    delete $data[$i];
}
@data = ();

open(DAT, "data/parsed-Litex.txt");
@data = <DAT>;
close(DAT);
for(my $i = 0; $i < scalar(@data); $i++) { # dump file data into hashed array
    my $id = substr($data[$i], 0, 4); # get current planet id
    $compare3{$id} = $data[$i];
    delete $data[$i];
}
@data = ();

open(DAT,">data/parsed-all.txt"); # open appropriate parsed file and clear it
close(DAT);

for(my $i = 1; $i <= $maxid; $i++) {
    my $currid = changeID($i);
    my $delid = changeID($i - 1);
    delete $compare1{$delid};
    delete $compare2{$delid};
    delete $compare3{$delid};
    next if( defined $compare1{$currid} && defined $compare2{$currid} && defined $compare3{$currid}
        && $compare1{$currid} ne $compare2{$currid}
        && $compare1{$currid} ne $compare3{$currid}
        && $compare2{$currid} ne $compare3{$currid} );
    open(DAT,">>data/parsed-all.txt");
    if( defined $compare1{$currid} && !defined $compare2{$currid} && !defined $compare2{$currid} ) {
        print DAT $compare1{$currid};
        next;
    }
    if( defined $compare2{$currid} && !defined $compare1{$currid} && !defined $compare3{$currid} ) {
        print DAT $compare2{$currid};
        next;
    }
    if( defined $compare3{$currid} && !defined $compare1{$currid} && !defined $compare2{$currid} ) {
        print DAT $compare2{$currid};
        next;
    }
    if( defined $compare1{$currid} && defined $compare2{$currid} ) {
        if( $compare1{$currid} eq $compare2{$currid} ) {
            print DAT $compare1{$currid};
            next;
        }
    }
    if( defined $compare1{$currid} && defined $compare3{$currid} ) {
        if( $compare1{$currid} eq $compare3{$currid} ) {
            print DAT $compare1{$currid};
            next;
        }
    }
    if( defined $compare2{$currid} && defined $compare3{$currid} ) {
        if( $compare2{$currid} eq $compare3{$currid} ) {
            print DAT $compare2{$currid};
            next;
        }
    }
    close(DAT);
}
print "Location: planetDiscovered.cgi\n\n";
exit;

sub changeID {
    return sprintf "%04d", $_[0];
}
where a parsed file follows the form:
0001,Nunki 2 1,5847,71%,0.71,4151.37,ThrevenGuard,-18,-26
UPDATE: If you can make this any shorter (or more efficient), please let me know. You can find each of the three data files at http://emino.realestateetools.com/ssprog/alpha/data/parsed-Black_Dragon.txt, http://emino.realestateetools.com/ssprog/alpha/data/parsed-BMoom.txt, http://emino.realestateetools.com/ssprog/alpha/data/parsed-Litex.txt. Thank you!

Replies are listed 'Best First'.
Re: Comparing lines of multiple files
by Zed_Lopez (Chaplain) on Oct 09, 2005 at 19:26 UTC

    If I've understood you right, this should do it:

    my %h;

    # build a giant hash of all the info. Keys are ids, values
    # are hashrefs whose keys are the source filename and whose
    # values are the lines themselves.
    while (<>) {
        my @fields = split ',';
        $h{$fields[0]}{$ARGV} = $_;
    }

    # for each id (lexically sorted)
    for my $id (sort keys %h) {
        my @keys = keys %{$h{$id}};

        # if it was present in only one file, print it and move on
        if (scalar @keys == 1) {
            print $h{$id}{$keys[0]};
            next;
        }

        # if it was present in more than one, find out whether
        # all the lines are the same by building a hash with
        # each line as the key, then testing whether you end
        # up with more than one key.
        my %cmp;
        $cmp{$_} = '' for values %{$h{$id}};
        print keys %cmp if scalar keys %cmp == 1;
    }

    Updated: Now I feel silly. This can be much simpler.

    while (<>) {
        my @fields = split ',';
        $h{$fields[0]}{$_} = '';
    }
    for my $id (sort keys %h) {
        print keys %{$h{$id}} if scalar keys %{$h{$id}} == 1;
    }

    and, if one really wanted, the for loop could even be the gratuitously uber-terse:

    scalar keys %{$h{$_}} == 1 and print keys %{$h{$_}} for sort keys %h;

    I love Perl.

    Updated again: You know how it goes. You start thinking about how something can be terser, and next thing you know, you're golfing.

    perl -F, -ane '$h{$F[0]}{$_}=0;END{keys%{$h{$_}}==1&&print keys%{$h{$_}}for sort keys%h}' f1.txt f2.txt

    OK. I stop procrastinating now.

Re: Comparing lines of multiple files
by graff (Chancellor) on Oct 09, 2005 at 19:43 UTC
    Your statement of the problem is a little confusing. You said:
    I am trying to compare all of the lines from three files and then print the result to a final file.

    But your code and data samples involve only two input files, not three. Next, you said:

    If the ids are not the same, I don't want it to write anything to the final file unless one of the ids was blank, but I do want to write them if they are the same, such as:

    But you show an example for "Final" output that has one line where the two inputs were identical (no diffs), followed by two lines whose index values exist only in "file 1". (And what do you mean, exactly, by "unless one of the ids was blank"?)

    Maybe part of the problem is that you don't have an accurate and coherent spec for what the script is supposed to do? If there really are just two inputs, and those three lines you show under "Final:" are really the correct desired output, then it looks like the spec would be something like this:

    For each line in File 1, print it to Final if: (a) the ID/Key value and data are identical to a line in File 2, or (b) the ID/Key value is not found in File 2.

    For that, the following is one way to do it:

    use strict;

    my ( $file1, $file2 ) = @ARGV;
    # (getting file names from command line is better than hard-coding them)

    # read file2 first, to get the keys and data to test against
    my %refdata;
    open( F, $file2 ) or die "$file2: $!";
    while (<F>) {
        my ( $key, $data ) = split( /,/, $_, 2 ); # (in case key is not 4 digits)
        $refdata{$key} = $data;
    }

    # now read file1, and output lines that meet the spec
    open( F, $file1 ) or die "$file1: $!";
    while (<F>) {
        my ( $key, $data ) = split( /,/, $_, 2 );
        print if ( !exists( $refdata{$key} ) or $data eq $refdata{$key} );
    }

    # (use the command line to redirect output to a "final" file -- e.g.:
    #
    #   shell> perl your_script file1 file2 > final
    #
    # again, it's better than hard-coding another file name

      After much head-scratching (I originally wrote a "what are you asking here?" response, too), I decided that what the OP meant was:

      If an ID occurs in only one file, print the corresponding line.

      If an ID occurs in multiple files, and all the corresponding lines have the exact same text, print the line.

      This does correspond to the sample output. (I'm still puzzled by 'unless one of the IDS was blank.')

        Thank you to everyone for your patience. I stumbled on this site and was so excited about the possibility of solving this problem that I didn't take as much time rereading my post as I should have. One thing I would like to clear up is that I am using this on a webpage, so many of the errors that you guys might be seeing aren't shown (unless I check the logs, which I should do). graff's code and Zed_Lopez's rewording had it almost entirely correct for two files. I actually have 3 files that I would like to combine, but I reduced it to 2 when I was working on it to try and simplify it.


        -:-:- I deleted the rest of what I said because GrandFather posted code that I was able to use and adapt for three files. I am pretty sure it works as I want it to. It isn't nearly as efficient as graff's code, but it works. :D Again, thank you to everyone for your help. -:-:-
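
        For readers following along, here is a minimal sketch of how the rule discussed above (print a line if its ID occurs in only one file, or if every file that has the ID agrees on the text) might be extended to any number of files. The sub name agreed_lines and the file names in the usage comment are hypothetical, and this is not the code GrandFather posted.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of file names and returns the lines
# that should be written to the final file, sorted by ID.
sub agreed_lines {
    my @files = @_;
    my %seen;    # id => { line text => 1 }

    for my $file (@files) {
        open( my $fh, '<', $file ) or die "Cannot open $file: $!";
        while ( my $line = <$fh> ) {
            chomp $line;
            next unless length $line;          # skip blank lines
            my ($id) = split /,/, $line, 2;    # ID is the first comma-separated field
            $seen{$id}{$line} = 1;
        }
        close $fh;
    }

    my @out;
    for my $id ( sort keys %seen ) {
        my @variants = keys %{ $seen{$id} };
        # Exactly one distinct line means the ID was either unique to one
        # file or identical everywhere it appeared.
        push @out, $variants[0] if @variants == 1;
    }
    return @out;
}

# Example usage (file names are placeholders):
# print "$_\n" for agreed_lines( 'file1.txt', 'file2.txt', 'file3.txt' );
```

        Because each ID maps to a set of distinct line texts, adding a fourth or fifth file needs no extra comparisons; only the list of file names changes.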
Re: Comparing lines of multiple files
by GrandFather (Saint) on Oct 09, 2005 at 22:53 UTC

    After reworking your code to clean up the warnings, I get the following output:

    0002,Nunki 2 2,6366,59%,0.59,3755.94,Honor,-23,-19
    0005,Nunki 2 5,2615,24%,0.24,627.6,Bananiel,-44,-47
    0010,Sagittarius 2 5,3414,75%,0.75,2560.5,Iridium,0,-45
    0013,Rigel 2 1,6870,30%,0.3,2061,Black_Dragon,-44,95
    0014,Rigel 2 2,5000,50%,0.5,2500,Black_Dragon,-35,102
    0015,Rigel 2 3,2854,51%,0.51,1455.54,Bananiel,-30,96
    0018,Rigel 2 6,4160,59%,0.59,2454.4,Khouri,-49,75
    0019,Rigel 2 7,5801,18%,0.18,1044.18,ThrevenGuard,-69,103
    0023,Fornacis 2 4,5483,52%,0.52,2851.16,unoccupied

    Perl is Huffman encoded by design.
Re: Comparing lines of multiple files
by GrandFather (Saint) on Oct 09, 2005 at 22:39 UTC

    It is good to see use warnings; use strict. It is very disappointing to see that when the code is run a large number of warnings are produced. Clean up the warnings first, then see what problems remain, or come to us first with a small fragment of code that produces a warning and ask about the warning specifically.

    That aside, this is a fairly well presented node, but you should show us how the current output is in error.

    For sample code like this you could create the data files at the start and remove paths from file references to make the code easier to run by testers.
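
    As a rough sketch of that last suggestion, a test script can write its own small data files before doing anything else, so that anyone can run it as posted. The sample rows are modelled on the example in the question; the sub names write_sample and load_by_id are made up for illustration. The loader also shows one warning-free way to fill a hash keyed by ID, using a while loop instead of deleting array elements.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write the given lines to $name inside $dir
# and return the full path.
sub write_sample {
    my ( $dir, $name, @lines ) = @_;
    my $path = "$dir/$name";
    open( my $out, '>', $path ) or die "Cannot write $path: $!";
    print {$out} "$_\n" for @lines;
    close $out or die "Cannot close $path: $!";
    return $path;
}

# Hypothetical helper: read a file into a hash keyed by the ID
# (the text before the first comma).
sub load_by_id {
    my ($path) = @_;
    my %by_id;
    open( my $in, '<', $path ) or die "Cannot open $path: $!";
    while ( my $line = <$in> ) {
        chomp $line;
        my ( $id, $rest ) = split /,/, $line, 2;
        next unless defined $rest;    # skip blank or malformed lines
        $by_id{$id} = $line;
    }
    close $in;
    return \%by_id;
}

# Create the data up front so the script has no external dependencies.
my $dir   = tempdir( CLEANUP => 1 );
my $file1 = write_sample( $dir, 'file1.txt',
    '0001,aname,bname', '0002,cname,bname', '0004,dname,bname' );
my $file2 = write_sample( $dir, 'file2.txt',
    '0001,aname,bname', '0003,aname,bname' );

my $data1 = load_by_id($file1);
my $data2 = load_by_id($file2);
printf "file1 has %d ids, file2 has %d ids\n",
    scalar keys %$data1, scalar keys %$data2;
```

    The temporary directory is removed when the script exits, so a tester does not need the data/ directory from the original post.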


Re: Comparing lines of multiple files
by EvanCarroll (Chaplain) on Oct 09, 2005 at 21:52 UTC
    I too am unsure as to what you're asking, but I felt this was worth replying to because I can tell you tried to express your problem clearly. Try cutting out the content from the scripts that you can get working, and ask a more targeted question, "Why does this script not do this ________", or a more open-ended question, "How would you do this," but make sure your objective is clearly laid out.

    Take note of the grammatical parsing required by your statement:
    If the ids are not the same, I don't want it to write anything to the final file unless one of the ids was blank, but I do want to write them if they are the same, such as:

    That writes out as
    if (id1 != id2) { !write_to_file unless id1 = '' && unless id1 == id2 } ## WTF!!!!

    Which is just as easily written as: "I want to write to output only if an id is blank or if both correlate to each other."
    if ( !defined $id1 || !defined $id2 || $id1 eq $id2 ) { write_to_file() }

    Which screams to me, dump both of them in a sql table and do a full outer join
    Lol, good luck to you...
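
    The full outer join idea does not strictly need a database. As a hedged sketch, the same pairing can be done in plain Perl by taking the union of the IDs from two hashes, leaving undef where one side has no row, and then applying the rule above. The sub names full_outer_join and rows_to_write and the sample hashes are hypothetical; with a real database the equivalent would be a FULL OUTER JOIN on the id column.

```perl
use strict;
use warnings;

# Hypothetical helper: pair up rows from two hashes keyed by ID.
# Returns a list of [ id, left_row, right_row ], with undef for a missing side.
sub full_outer_join {
    my ( $left, $right ) = @_;
    my %ids = map { $_ => 1 } ( keys %$left, keys %$right );
    return map { [ $_, $left->{$_}, $right->{$_} ] } sort keys %ids;
}

# Keep a row if one side is missing or both sides are identical.
sub rows_to_write {
    my ( $left, $right ) = @_;
    my @keep;
    for my $pair ( full_outer_join( $left, $right ) ) {
        my ( $id, $l, $r ) = @$pair;
        if    ( !defined $l ) { push @keep, $r }
        elsif ( !defined $r ) { push @keep, $l }
        elsif ( $l eq $r )    { push @keep, $l }
    }
    return @keep;
}

# Made-up sample data in the same id,name,name shape as the question.
my %file1 = (
    '0001' => '0001,aname,bname',
    '0002' => '0002,cname,bname',
    '0005' => '0005,dname,bname',
);
my %file2 = (
    '0001' => '0001,aname,bname',
    '0003' => '0003,aname,bname',
    '0005' => '0005,fname,bname',
);

print "$_\n" for rows_to_write( \%file1, \%file2 );
```

    Checking for undef before comparing with eq avoids "uninitialized value" warnings. Note that this version also keeps rows that exist only in the second hash, which a one-directional comparison would drop.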


    Evan Carroll
    www.EvanCarroll.com
