
If you're on a Unix-ish system (one that has sort and join), or running Cygwin on Windows, you can do this with a few lines of shell:

perl -ne '
 if(!/^2/) {
   $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
   print "$k|$_" 
 }'  file1 | sort -t "|" -k 1,1 >file1.sorted

# This code assumes the fields are in the same place in file2
# as they are in file1, but if not, you'll have to change this.

perl -ne '
 $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
 print "$k\n" '  file2 | sort -t "|" -k 1,1 >file2.sorted

# I am only outputting the key here since you don't seem
# to be doing anything with the rest of 'line2'

join -t '|' file1.sorted file2.sorted  | cut -d '|' -f 2 > duplicates
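The "|" separator used above is only a convention. If the records might contain "|", a control character can serve as the delimiter instead. The following is a minimal sketch with made-up keys and records: k1, k2, the record text, and the scratch directory are illustrative and not taken from the real data. Setting LC_ALL=C is an assumption on my part, made so that sort and join agree on the collation order.

```shell
export LC_ALL=C                       # same byte-order collation for sort and join
tmp=$(mktemp -d)                      # scratch directory (illustrative)
d=$(printf '\001')                    # Ctrl-A as the delimiter

# "key<Ctrl-A>record" lines, sorted on the key field
printf 'k2%srecord two\nk1%srecord one\n' "$d" "$d" |
  sort -t "$d" -k 1,1 > "$tmp/left.sorted"

# keys only, already in sorted order
printf 'k2\n' > "$tmp/right.sorted"

# keep the records whose key appears in both files, then drop the key
join -t "$d" "$tmp/left.sorted" "$tmp/right.sorted" |
  cut -d "$d" -f 2 > "$tmp/matched"
cat "$tmp/matched"
```

The same variable has to be passed to sort, join, and cut so that all three split the lines identically.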

With the input of file1:

3     110582 SFCA            4158675309               041414041421
3     060784 NYNY            2125552368               190159204657
3     121906 RANC            9195551234               123401123620

and file2:

3     110582 SFCA            4158675309               041414041421

your program and mine both produced the output:

3     110582 SFCA            4158675309               041414041421

Notes:

Make sure you use a delimiter character (I used "|") that's not in the data. You're not limited to printable characters.
Strictly speaking, the output of the two programs could differ. You truncate line1 at 210 characters; I don't. If line1 matches more than one line in file2, I produce multiple lines of output; you produce only one. Our output is also in a different order.
You could save time if you know one of the files is already sorted. For example, maybe file2 doesn't change each run. You can also merge two sorted files using sort -m.
If you want the lines that are not duplicates, use join -v

For example, say you have a new file, newdata and a file, alreadyprocessed, which corresponds to my file2.sorted, above. That is, it's just the keys in sorted order. You could do this:

perl -ne '
 if(!/^2/) {
   $k = substr($_, 6, 6) . substr($_, 29, 10) . substr($_, 54, 12);
   print "$k|$_" 
 }'  newdata | sort -t "|" -k 1,1 >newdata.sorted

 join -t '|' -v 1 newdata.sorted alreadyprocessed  >needsprocessing
 cut -d '|' -f 2 needsprocessing >processinput

 # Then do the processing
 
 # ...
 # ...

 # If everything runs okay
 cut -d '|' -f 1 needsprocessing | 
         sort -m - alreadyprocessed >mergeout
 mv alreadyprocessed alreadyprocessed.bak
 mv mergeout alreadyprocessed
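The join -v and sort -m steps above can be tried on toy data first. This is a sketch only: the single-letter keys, the "rec" records, and the scratch directory are invented for illustration, and LC_ALL=C is my assumption to keep sort and join on the same collation order.

```shell
export LC_ALL=C
tmp=$(mktemp -d)                                   # scratch directory (illustrative)

printf 'a\nc\n' > "$tmp/alreadyprocessed"          # sorted keys seen so far
printf 'a|rec a\nb|rec b\nd|rec d\n' > "$tmp/newdata.sorted"

# -v 1: print only the lines of the first file that have no match in the second
join -t '|' -v 1 "$tmp/newdata.sorted" "$tmp/alreadyprocessed" > "$tmp/needsprocessing"

# merge the new keys into the existing sorted key list without a full re-sort
cut -d '|' -f 1 "$tmp/needsprocessing" |
  sort -m - "$tmp/alreadyprocessed" > "$tmp/mergeout"

cat "$tmp/needsprocessing"
cat "$tmp/mergeout"
```

Here key a is skipped because it was already processed, b and d are left for processing, and the merged key list comes out sorted, ready to serve as alreadyprocessed on the next run.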

In reply to Re: File Handling for Duplicate Records by Thelonius
in thread File Handling for Duplicate Records by sheasbys
