in reply to File Handling for Duplicate Records

If you're on a Unix-ish system with sort and join (or Cygwin on Windows), you can do this with a few lines of shell:
perl -ne '
  if(!/^2/) {
    $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
    print "$k|$_"
  }' file1 | sort -t "|" -k 1,1 >file1.sorted
# This code assumes the fields are in the same place in file2
# as they are in file1, but if not, you'll have to change this.
perl -ne '
  $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
  print "$k\n"
  ' file2 | sort -t "|" -k 1,1 >file2.sorted
# I am only outputting the key here since you don't seem
# to be doing anything with the rest of 'line2'
join -t '|' file1.sorted file2.sorted | cut -d '|' -f 2 > duplicates
With the input of file1:
3 110582 SFCA 4158675309 041414041421
3 060784 NYNY 2125552368 190159204657
3 121906 RANC 9195551234 123401123620
and file2:
3 110582 SFCA 4158675309 041414041421
your program and mine both produced the output:
3 110582 SFCA 4158675309 041414041421

Notes:

For example, say you have a new file, newdata, and a file, alreadyprocessed, which corresponds to my file2.sorted above. That is, it's just the keys in sorted order. You could do this:

perl -ne '
  if(!/^2/) {
    $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
    print "$k|$_"
  }' newdata | sort -t "|" -k 1,1 >newdata.sorted
join -t '|' -v 1 newdata.sorted alreadyprocessed >needsprocessing
cut -d '|' -f 2 needsprocessing >processinput
# Then do the processing
# ...
# ...
# If everything runs okay
cut -d '|' -f 1 needsprocessing | sort -m - alreadyprocessed >mergeout
mv alreadyprocessed alreadyprocessed.bak
mv mergeout alreadyprocessed
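
The 'join -v 1' step is what does the filtering: it prints only the lines of the first file whose key has no match in the second. A small sketch with made-up keys (k1, k2, k3 and the file names in the temp directory are purely illustrative, not from the real data):

```shell
tmp=$(mktemp -d)

# key|record lines, already sorted on the key
printf 'k1|record one\nk2|record two\nk3|record three\n' >"$tmp/new.sorted"
# keys that were processed earlier, also sorted
printf 'k1\nk3\n' >"$tmp/done"

# -v 1: print only the lines of file 1 that have no match in file 2
pending=$(join -t '|' -v 1 "$tmp/new.sorted" "$tmp/done")
echo "$pending"    # k2|record two

# once processing succeeds, merge the new keys into the processed list
merged=$(echo "$pending" | cut -d '|' -f 1 | sort -m - "$tmp/done")
echo "$merged"

rm -rf "$tmp"
```

Since both inputs to 'sort -m' are already sorted, the merge is a single pass and the processed-keys file stays in the order join expects on the next run.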

Re^2: File Handling for Duplicate Records
by sgt (Deacon) on Dec 22, 2006 at 15:31 UTC

    What about comm? Or am I missing something? Of course, if file1 needs transforming, use perl or whatever filter you like.

    # comm -12 <(sort file1) <(sort file2) > dups.out

    If your shell doesn't support the 'cmd <(cmd1 ...) ...' notation, it means the two-step process 'cmd1 ... > temp1; cmd temp1', cmd1 being a "filter".

    or in other "words" ;)

    % stephan@armen (/home/stephan) %
    % cat dat1
    3 110582 SFCA 4158675309 041414041421
    3 060784 NYNY 2125552368 190159204657
    3 121906 RANC 9195551234 123401123620
    % stephan@armen (/home/stephan) %
    % cat dat2
    3 110582 SFCB 2258675309 041414041421
    3 110582 SFCA 4158675309 041414041421
    % stephan@armen (/home/stephan) %
    % sort dat1 > dat1.sorted
    % stephan@armen (/home/stephan) %
    % sort dat2 > dat2.sorted
    % stephan@armen (/home/stephan) %
    % comm -12 dat1.sorted dat2.sorted
    3 110582 SFCA 4158675309 041414041421
    hth --stephan, just another unix hacker,