in reply to Comparing text files

I won't repeat fletch's comments; just second them.

I've got some questions (yes, this is a quiz):

  1. The files differ (you've said they're not the same size); so when you say "Compare," what do you mean? I can guess that you're looking to see if address information is consistent between two files, but no more than that.
  2. Have you considered one of the modules, such as Algorithm::Diff, which computes "intelligent differences between two files"?

On a more minor note: you probably want and rather than the & (bitwise and) operator in your if test, and you probably want ne, the string comparison operator, rather than !=, which is for numeric comparisons. You've used strict but never declared your variables, which will keep your code from compiling.
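To make the operator point concrete, here is a minimal sketch. The values are made up for illustration and are not taken from the original post:

```perl
use strict;
use warnings;

# String vs numeric comparison: != converts both operands to numbers.
# Non-numeric strings become 0 (with a warning), so they compare as equal.
my ($x, $y) = ('abc', 'abd');
my $numeric_differs = do { no warnings 'numeric'; $x != $y ? 1 : 0 };
my $string_differs  = $x ne $y ? 1 : 0;

# Bitwise & vs logical and: & combines the bits of its operands.
my ($p, $q) = (2, 1);               # both are true values
my $bitwise = ($p & $q)   ? 1 : 0;  # 2 & 1 is 0, which is false
my $logical = ($p and $q) ? 1 : 0;  # both operands are true

print "!= says differ: $numeric_differs, ne says differ: $string_differs\n";
print "& says true: $bitwise, and says true: $logical\n";
```

With identifiers that mix digits and letters, != only compares whatever leading number Perl can extract, so ne is the safer choice.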

My suggestions are:


added in update

I know fletch mentioned the &, use strict; without corresponding variable declarations, and the need for the OP to post samples and (at least syntactically valid) source code. I got into my pedantic mode, and ended up repeating fletch's advice.

emc

e^(π√−1) = −1

Re^2: Comparing text files
by lynxct (Initiate) on Jul 12, 2006 at 02:59 UTC
    #!/usr/local/bin/perl
    use strict;

    open (FILE1,"Largedatafile.txt") || die "Can't open file File1 $!\n";
    open (FILE2,"Smalldatafile.txt") || die "Can't open file File2 $!\n";
    open (FILE3,"addressfile.txt") || die "Can't open file File3 $!\n";

    while(<FILE2>) {
        my($var1, $var2, $var3,$var4,$var5)=split(/\t/, $_);
        while(<FILE1>) {
            my($id1, $id2,$id3)=split(/\t/,$_);
            while(<FILE3>) {
                my($citycode, $statecode,$zipcode)=split(/\t/,$_);
                if (($id3 ne $var3) and ($citycode ne $var5)) {
                    print "$var1\t$id1\t$zipcode\n";
                }
            }
        }
    }

    close (FILE1);
    close (FILE2);
    close (FILE3);

    input file1 "Largedatafile.txt"

    new_id    county_name1  county_id1  routing_number  city_code1
    56897987  Smith         865487      567rt9834       5879654
    65798686  Johnson       6587654     6578ce789       r6587t3
    24598702  Rock          6548365     6456t6884       z0rt345


    input file2 "Smalldatafile.txt"

    Old_id      state_code2  fip_code
    865487234   5679834      5879654
    6587654098  6578789      43658753
    6548365tr   6456884      50054345

    input file3 "addressfile.txt"

    city_code3  state_code3  zip_code3
    5879654     5679834      987656575
    r658735     6578789      657656315
    z0r3454     6456884      554865434

    The output file will have new_id from file1, Old_id from file2, and zip_code3 from file3, only if state_code2 from file2 is not equal to state_code3 from file3 and city_code3 from file3 is not equal to city_code1 from file1.

    After running the script I do not get the values of "Old_id" and "city_code3":

    new_id  Old_id  zip_code3
    new_id  Old_id  987656575
    new_id  Old_id  657656315
    new_id  Old_id  554865434
    new_id  Old_id  987656575
    new_id  Old_id  657656315
    new_id  Old_id  554865434


    thanks for all the help

    20060712 Janitored by Corion: Added code tags, as per Writeup Formatting Tips

      Okay, first: update your posts by adding <code> at the beginning of the perl script, and </code> at the end of the perl script. Just do it. Likewise for sample data.

      Second: when you do this:

      open( FILE1, ... );
      open( FILE2, ... );
      open( FILE3, ... );
      while (<FILE1>) {
          ...
          while (<FILE2>) {
              ...
              while (<FILE3>) {
                  ...
              }
          }
      }
      FILE2 and FILE3 will both reach EOF during the first iteration on (the first line read from) FILE1. So don't do that.
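A small sketch of that effect, using in-memory filehandles with made-up contents in place of the real files (two levels of nesting are enough to show it):

```perl
use strict;
use warnings;

# In-memory stand-ins for two of the files (hypothetical contents).
my $outer_data = "a\nb\nc\n";
my $inner_data = "1\n2\n";

open(my $outer, '<', \$outer_data) or die "Can't open outer data: $!";
open(my $inner, '<', \$inner_data) or die "Can't open inner data: $!";

my @pairs;
while (my $o = <$outer>) {
    chomp $o;
    # After the first outer line, $inner is already at EOF,
    # so this loop body never runs again.
    while (my $i = <$inner>) {
        chomp $i;
        push @pairs, "$o$i";
    }
}
close $outer;
close $inner;

print "@pairs\n";   # only pairs built from the first outer line
```

Lines "b" and "c" of the outer data never get paired with anything, which is the same reason the original script only ever sees one line of the outer file's fields.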

      (You could "seek( FILE3, 0, 0 );" at the end of the while loop that reads from FILE2, and also do "seek( FILE2, 0, 0 );" at the end of the loop that reads from FILE1, but this would mean that you re-read FILE2 too many times, and you re-read FILE3 way too many times. So don't do that.)

      Since FILE2 is "small" and FILE3 is probably not too big either, read them both into hash structures first to keep them in memory while you read the "large" file. When loading the hashes with data in these two files, the hash keys should be the strings you need for linking data across files, and the values should be whatever you need to keep from each file for your final output.
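Here is a rough sketch of that hash-based layout. It assumes, purely for illustration, that rows link up when city codes match and then state codes match; the real linking rule is still the open question discussed in this thread. The hash names are hypothetical, and the data are in-memory, tab-separated stand-ins based on the posted samples, with header rows left out:

```perl
use strict;
use warnings;

# Tab-separated stand-ins for the three files (header rows omitted).
my $small   = "865487234\t5679834\t5879654\n";      # Old_id, state_code2, fip_code
my $address = "5879654\t5679834\t987656575\n";      # city_code3, state_code3, zip_code3
my $large   = "56897987\tSmith\t865487\t567rt9834\t5879654\n"
            . "65798686\tJohnson\t6587654\t6578ce789\tr6587t3\n";

# Load the two smaller files into hashes keyed on the linking fields.
my (%old_id_for_state, %address_for_city);

open(my $small_fh, '<', \$small) or die "Can't open small data: $!";
while (my $line = <$small_fh>) {
    chomp $line;
    my ($old_id, $state_code) = split /\t/, $line;
    $old_id_for_state{$state_code} = $old_id;
}
close $small_fh;

open(my $addr_fh, '<', \$address) or die "Can't open address data: $!";
while (my $line = <$addr_fh>) {
    chomp $line;
    my ($city, $state, $zip) = split /\t/, $line;
    $address_for_city{$city} = { state => $state, zip => $zip };
}
close $addr_fh;

# One pass over the large file, looking everything up in memory.
my @out;
open(my $large_fh, '<', \$large) or die "Can't open large data: $!";
while (my $line = <$large_fh>) {
    chomp $line;
    my ($new_id, $city_code) = (split /\t/, $line)[0, 4];
    my $addr = $address_for_city{$city_code} or next;   # assumed rule: skip unknown cities
    my $old_id = $old_id_for_state{ $addr->{state} };
    next unless defined $old_id;                        # assumed rule: skip unknown states
    push @out, join("\t", $new_id, $old_id, $addr->{zip});
}
close $large_fh;

print "$_\n" for @out;
```

Each file is read exactly once, and changing the selection rule only means changing the two next lines in the final loop.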

      Third: you said

      output file will have new_id from file1, old_id from file2, and zip_code3 from file3 only if state_code2 from file2 not equal to state_code3 from file3 and city_code3 from file 3 not equal to city_code1 from file1
      This statement really does not make sense, unless you seriously want the "cartesian product" of all the lines in the three files. That is, supposing there are 100 lines in "largefile", 10 lines in "smallfile" and 20 lines in "addressfile", and there are some matches among the city_code and state_code values, then the condition as you phrased it would list about 99*9*19 lines of output.

      Do you mean something like this instead?

      OUTPUT file1:new_id, file2:old_id, file3:zip_code3
      IF file1:city_code1 DOES NOT MATCH ANY file3:city_code3
      OR ( file1:city_code1 MATCHES ONE file3:city_code3
           AND THIS file3:state_code3 DOES NOT MATCH ANY file2:state_code2 )
      If that's not what you mean, then you really need to explain it better. Given just the snippets of sample data that you have shown, what should the output be? (If those snippets would not really produce any outputs, because everything matches up, add a row or two that would generate the intended output, and show us what the output should be.) And remember to use "code" tags.

      In any case, it sounds like some sort of SQL problem, and it looks like your data came from a database (or could easily be put into a database). So maybe SQL would be the more prudent approach. (But proper use of hashes to store the relevant stuff from the two smaller files would do fine.)