Comparing text files

lynxct has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Comparing text files by Fletch (Bishop) on Jul 11, 2006 at 21:10 UTC
My eyes! The goggles do nothing!

Aside from:

- the lack of meaningful indentation, because there are no <code> tags
- the lack of error checking on the open calls
- the completely bizarre internesting of three different while loops
- the context-free naming of variables
- the absence of correct variable declarations despite the presence of `use strict` (which just means this code could never have run to begin with)
- the probable misuse of `!=` for `ne` (although that's just a guess; see next item)
- the complete lack of sample data or expected results

What exactly's the problem?
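For readers following along, here is a minimal sketch of the two housekeeping points in that list: declaring variables with `my` so the script compiles under `use strict`, and checking the result of `open`. The file name is hypothetical and is chosen so that the open fails and the error path is visible; nothing here comes from the original poster's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical file name, assumed not to exist, so the failure branch runs.
my $path = 'no-such-file-for-this-sketch.txt';

# eval lets the sketch report the failure instead of ending the program.
my $opened = eval {
    # Lexical filehandle, three-argument open, and an error message with $!.
    open my $in, '<', $path or die "Can't open $path: $!\n";
    close $in;
    1;
};

my $message = $opened ? "opened $path\n" : "open failed: $@";
print $message;
```

In a real script the `eval` wrapper would usually be dropped, so that a failed `open` stops the program with the message from `die`.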
Re: Comparing text files by swampyankee (Parson) on Jul 11, 2006 at 21:52 UTC
I won't repeat Fletch's comments; I'll just second them. I've got some questions (yes, this is a quiz):

- The files differ (you've said they're not the same size), so when you say "compare," what do you mean? I can guess that you're looking to see whether address information is consistent between two files, but no more than that.
- Have you considered one of the modules, such as Algorithm::Diff, which computes an "intelligent difference between two files"?

On a more minor note:

- You probably want `and` rather than `&` (the bitwise-and operator) in your `if` test.
- You probably want `ne`, the string comparison operator, rather than `!=`, which is for numeric comparisons.
- You've used strict but never declared your variables, which will keep your code from compiling.

My suggestions are:

- Add some explanation of what you're trying to do.
- Give us some data. It has to be properly formatted and sensible, i.e., validly formatted, albeit not necessarily real addresses.
- Give us a working code sample, i.e., one that will at least compile.
- Show us what you want and what you're actually getting for a test case.

Added in update: I know Fletch mentioned the &, `use strict;` without corresponding variable declarations, and the need for the OP to post samples and (at least syntactically valid) source code. I got into my pedantic mode and ended up repeating Fletch's advice.

emc

e^(π√−1) = −1
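The operator advice above can be seen in a few lines. This is only an illustrative sketch: the two strings are made up (loosely modeled on the alphanumeric codes in this thread), and the numbers 2 and 4 are chosen only because their bit patterns do not overlap.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up codes that differ only after the leading digits.
my ($left, $right) = ('6578ce789', '6578xy000');

# != converts both strings to the number 6578, so they compare as equal.
# (With warnings on, Perl would also complain that they are not numeric.)
my $numeric_differs = do { no warnings 'numeric'; $left != $right };

# ne compares the strings character by character and sees the difference.
my $string_differs = $left ne $right;

# & works on bits: 2 (binary 010) & 4 (binary 100) is 0, which is false.
my $bitwise = 2 & 4;

# "and" is the logical operator: both operands are true, so the result is true.
my $logical = (2 and 4);

printf "!= says different:  %s\n", $numeric_differs ? 'yes' : 'no';
printf "ne says different:  %s\n", $string_differs  ? 'yes' : 'no';
printf "2 & 4 is true:      %s\n", $bitwise         ? 'yes' : 'no';
printf "2 and 4 is true:    %s\n", $logical         ? 'yes' : 'no';
```

So for codes like these, `!=` silently treats different strings as equal, and `&` can turn two true conditions into a false result.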
Re^2: Comparing text files by lynxct (Initiate) on Jul 12, 2006 at 02:59 UTC
    #!/usr/local/bin/perl
    use strict;

    open (FILE1,"Largedatafile.txt") || die "Can't open file File1 $!\n";
    open (FILE2,"Smalldatafile.txt") || die "Can't open file File2 $!\n";
    open (FILE3,"addressfile.txt") || die "Can't open file File3 $!\n";

    while(<FILE2>) {
        my($var1, $var2, $var3,$var4,$var5)=split(/\t/, $_);
        while(<FILE1>) {
            my($id1, $id2,$id3)=split(/\t/,$_);
            while(<FILE3>) {
                my($citycode, $statecode,$zipcode)=split(/\t/,$_);
                if (($id3 ne $var3) and ($citycode ne $var5)) {
                    print "$var1\t$id1\t$zipcode\n";
                }
            }
        }
    }

    close (FILE1);
    close (FILE2);
    close (FILE3);

input file1 "Largedatafile.txt"

    new_id      county_name1  county_id1  routing_number  city_code1
    56897987    Smith         865487      567rt9834       5879654
    65798686    Johnson       6587654     6578ce789       r6587t3
    24598702    Rock          6548365     6456t6884       z0rt345

input file2 "Smalldatafile.txt"

    Old_id      state_code2  fip_code
    865487234   5679834      5879654
    6587654098  6578789      43658753
    6548365tr   6456884      50054345

input file3 "addressfile.txt"

    city_code3  state_code3  zip_code3
    5879654     5679834      987656575
    r658735     6578789      657656315
    z0r3454     6456884      554865434

The output file will have new_id from file1, old_id from file2, and zip_code3 from file3 only if state_code2 from file2 is not equal to state_code3 from file3 and city_code3 from file3 is not equal to city_code1 from file1.

After running the script I do not get the value of "Old_id" and "city_code3":

    new_id  Old_id  zip_code3
    new_id  Old_id  987656575
    new_id  Old_id  657656315
    new_id  Old_id  554865434
    new_id  Old_id  987656575
    new_id  Old_id  657656315
    new_id  Old_id  554865434

Thanks for all the help.

20060712 Janitored by Corion: Added code tags, as per Writeup Formatting Tips
Re^3: Comparing text files by graff (Chancellor) on Jul 12, 2006 at 06:21 UTC
Okay, first: update your posts by adding <code> at the beginning of the perl script, and </code> at the end of the perl script. Just do it. Likewise for sample data.

Second: when you do this:

    open( FILE1, ... );
    open( FILE2, ... );
    open( FILE3, ... );
    while (<FILE1>) {
        ...
        while (<FILE2>) {
            ...
            while (<FILE3>) {
                ...
            }
        }
    }

FILE2 and FILE3 will both reach EOF during the first iteration on (the first line read from) FILE1. So don't do that. (You could "seek( FILE3, 0, 0 );" at the end of the while loop that reads from FILE2, and also do "seek( FILE2, 0, 0 );" at the end of the loop that reads from FILE1, but this would mean that you re-read FILE2 too many times, and you re-read FILE3 way too many times. So don't do that.)

Since FILE2 is "small" and FILE3 is probably not too big either, read them both into hash structures first to keep them in memory while you read the "large" file. When loading the hashes with data from these two files, the hash keys should be the strings you need for linking data across files, and the values should be whatever you need to keep from each file for your final output.

Third: you said

"output file will have new_id from file1, old_id from file2, and zip_code3 from file3 only if state_code2 from file2 not equal to state_code3 from file3 and city_code3 from file 3 not equal to city_code1 from file1"

This statement really does not make sense, unless you seriously want the "cartesian product" of all the lines in the three files. That is, supposing there are 100 lines in "largefile", 10 lines in "smallfile" and 20 lines in "addressfile", and there are some matches among the city_code and state_code values, then the condition as you phrased it would list close to 20,000 lines of output (100 × 10 × 20 combinations, minus the ones excluded by matching values). Do you mean something like this instead?

    OUTPUT file1:new_id, file2:old_id, file3:zip_code3
    IF file1:city_code1 DOES NOT MATCH ANY file3:city_code3
       OR ( file1:city_code1 MATCHES ONE file3:city_code3
            AND THIS file3:state_code3 DOES NOT MATCH ANY file2:state_code2 )

If that's not what you mean, then you really need to explain it better. Given just the snippets of sample data that you have shown, what should the output be? (If those snippets would not really produce any output, because everything matches up, add a row or two that would generate the intended output, and show us what the output should be.) And remember to use "code" tags.

In any case, it sounds like some sort of SQL problem, and it looks like your data came from a database (or could easily be put into a database). So maybe SQL would be the more prudent approach. (But proper use of hashes to store the relevant stuff from the two smaller files would do fine.)
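A rough sketch of the hash-based approach described above, under loudly stated assumptions: the thread never settles the real matching rule, so this version simply links rows where the codes are equal (city_code1 to city_code3, then state_code3 to state_code2) and reports the rows that cannot be linked. The sample rows are copied from the thread with header lines omitted, and in-memory strings stand in for the three files so the sketch runs on its own. The names `read_rows`, `%old_id_for_state` and `%address_for_city` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample rows from the thread, tab-separated, headers omitted.
my $large_data   = "56897987\tSmith\t865487\t567rt9834\t5879654\n"
                 . "65798686\tJohnson\t6587654\t6578ce789\tr6587t3\n"
                 . "24598702\tRock\t6548365\t6456t6884\tz0rt345\n";
my $small_data   = "865487234\t5679834\t5879654\n"
                 . "6587654098\t6578789\t43658753\n"
                 . "6548365tr\t6456884\t50054345\n";
my $address_data = "5879654\t5679834\t987656575\n"
                 . "r658735\t6578789\t657656315\n"
                 . "z0r3454\t6456884\t554865434\n";

# Read a small tab-separated source once; return its rows as array refs.
sub read_rows {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Can't open in-memory data: $!\n";
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        push @rows, [ split /\t/, $line ];
    }
    close $fh;
    return @rows;
}

# Small file: state_code2 => Old_id.
my %old_id_for_state;
for my $row (read_rows($small_data)) {
    my ($old_id, $state_code) = @$row;
    $old_id_for_state{$state_code} = $old_id;
}

# Address file: city_code3 => { state => state_code3, zip => zip_code3 }.
my %address_for_city;
for my $row (read_rows($address_data)) {
    my ($city_code, $state_code, $zip_code) = @$row;
    $address_for_city{$city_code} = { state => $state_code, zip => $zip_code };
}

# One pass over the large file; every lookup is a hash access, not a re-read.
my @output;
open my $large, '<', \$large_data or die "Can't open in-memory data: $!\n";
while (my $line = <$large>) {
    chomp $line;
    my ($new_id, $city_code) = (split /\t/, $line)[0, 4];

    my $address = $address_for_city{$city_code};
    if (!$address) {
        push @output, "$new_id\tno address for city $city_code";
        next;
    }

    my $old_id = $old_id_for_state{ $address->{state} };
    if (!defined $old_id) {
        push @output, "$new_id\tstate $address->{state} not in small file";
        next;
    }

    push @output, join "\t", $new_id, $old_id, $address->{zip};
}
close $large;

print "$_\n" for @output;
```

With the sample rows as posted, only the first line of the large file links all the way through (new_id 56897987, Old_id 865487234, zip 987656575); the other two city codes have no exact counterpart in the address data. Swapping the "no address" and "not in small file" branches for whatever the real rule turns out to be is where the actual condition would go.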