file comparison

gasky has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: file comparison by BrowserUk (Patriarch) on Nov 23, 2002 at 20:14 UTC
This demonstrates the basic technique for your needs. You will need to read up on perlfunc:open to find out how to open your files rather than using embedded (inline) files as I am here. You'll also need to look in perlrun and perlop if you want to supply the filenames on the command line rather than hard coding them. Hope this helps. If you have problems with converting the code to your needs, post your attempt and any errors and we'll try and help you further. #! perl -sw use strict; use Inline::Files; my %second = map{ /^(\d+),.$/; $1 => undef }<SECOND>; while(<FIRST>) { /^(\d+),.$/; print if not exists $second{$1}; } exit; __FIRST__ 1524781,Anderson,Jon,srv01,jon,12-05-02. 1843552,Bruford,William,srv01,william,12-17-02. 7589243,Howe,Steven,srv01,steven,12-05-02. 0547629,Squire,Chris,srv01,chris,12-05-02. 5486135,Wakeman,Rick,srv01,rick,12-17-02. __SECOND__ 1524781,Anderson,Jon,srv01,jon,12-05-02. 1843552,Bruford,William,srv01,william,12-17-02. 0547629,Squire,Chris,srv01,chris,12-05-02. [download] Gives this output. `C:\test>215414 7589243,Howe,Steven,srv01,steven,12-05-02. 5486135,Wakeman,Rick,srv01,rick,12-17-02. C:\test>` [download] Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color. Pick up your cloud down the end and "Yes" if you get allocated a grey one they *are* a bit damp under foot, but someone has to get them. Get used to the wings fast cos its an 8 hour day...unless the Govenor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory. Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.	[reply] [d/l] [select]
Re: file comparison by hawtin (Prior) on Nov 23, 2002 at 20:33 UTC
I realise that other brothers have answered most of this question but since I posted this attached to the other version of the node I felt it may be worth attaching it here (now how do I delete that other one?) I had a similar problem with comparing files in a directory and an SQL table. The simplest approach is to load up a hash table and itterate over it. use strict; my($f1,$f2,%ids_in_f1,%ids_in_f2); my($file,$count); $f1 = "f1_name.txt"; $f2 = "f2_name.txt"; open(F1,$f1); while(<F1>) { my(@col); chop; next if(/\+/); next if(!$_); @col = split(/\s,\s/); $ids_in_f1{$col[0]} = 1; } close(F1); open(F2,$f2); while(<F2>) { my(@col); chop; next if(/\+/); next if(!$_); # Split this file on tabs @col = split(/\t/); # This time the id is in the second column $ids_in_f2{$col[1]} = 1; if(!$ids_in_f1{$col[1]}) { print " Missing ID ".$col[1]."\n"; } } close(F2); $count = 0; print "\nThings in F1 not in F2\n"; foreach $file (sort keys(%ids_in_f1)) { next if($ids_in_f2{$file}); print " $file\n"; $count++; } print "Total things in F1 not in F2 $count\n\n"; [download] Another thing that I often had was duplications in the SQL table, well worth checking for while reading the data. This is a rather simplistic way to do this task but has the benefit that it is easy to modify for your needs. You also wanted to print out all the other columns, just put the values in a table and output them in a suitable way when you need to	[reply] [d/l]
Re: file comparison by gjb (Vicar) on Nov 23, 2002 at 21:03 UTC
If you have to deal with "real" CSV files where fields can have commas in them (and that are hence enclosed in duoble quotes) the suggestions of my fellow monks won't work without taking this into account. If this is the case, you may want to have a look at some modules at CPAN: Text::CSV or Text::CSV_XS (preferably the latter). If you're familiar with SQL, the DBD::CSV module may be of interest, this allows to query a CSV data file using a subset of SQL (without any database). Hope this helps, -gjb-	[reply]
Re: file comparison by graff (Chancellor) on Nov 24, 2002 at 00:53 UTC
Since the key field (id) is initial on each line (and can almost certainly be relied on as never needing quotes to contain an embedded comma), the solution is quite simple, as the earlier replies have shown, using hashes. I wrote and posted a utility here, called cmpcol that handles this directly as a command line: `cmpcol -d ',' -x1 -l1 oldlist newlist` [download] That will treat use the first comma-delimited string on each line of each file as a hash key, and print the set of lines whose keys are unique to the first of the two files (-x1, for exclusive-or set from file1, oldlist in this case). It will also do intersection and union of the two lists. By default, it only prints the key fields that meet the stated conditions, but the "-l1" ("list full line of file1") gives you what you want. If you wanted to compare old and new data for people in both lists, you could do: `cmpcol -i -d ',' -lb oldlist newlist` [download] (produce intersection, list full lines of both files).	[reply] [d/l] [select]
Re: file comparison by waswas-fng (Curate) on Nov 23, 2002 at 20:15 UTC
You could do it with hashes easy, just use ID as the key and load both lists into seperate hashes. If key esists in oldfile's hash and not newfile's hash then it is a dropout, else if key exists in new files hash but not old file it is a add. finnally if a compair of the rest of the line on a key that exists in both hashes will tell you if it has changed or is the same. If you need more help let me know, I am kinda busy today so I did not have time to plug out an example. -Waswas	[reply]
Re: file comparison by fruiture (Curate) on Nov 23, 2002 at 20:28 UTC
First fof all, it's Perl, not PERL. The simple idea is to create a hash containg all the new IDs as keys and then iterating over the old list and all IDs that arent found in the hash are the IDs of the missing students. That is probably good enough for doing it once. More advanced would be to store all new IDs in an array, where the IDs are values. The iteration over the old IDs would then use a fast search (binary) to find missing IDs. I only call this more advanced because it doesn't invlolve hashing, although hashing is probably the most efficient method here...hmm -- http://fruiture.de	[reply]