chinkusimon has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I am facing a problem with text file manipulation with Perl.

I have a file with over 200,000 (2 lakh) lines of data.
I need to find the duplicate strings in the file and copy
those records into another file.
Is there a function or module in Perl with which I can find the
duplicates in a file in one go and print them to another file?
The following is a more detailed description of my requirement:

The input to the code is a text file with records in the following format.

dn: cn=1148734,ou=Employees,dc=jci,dc=com
displayname: Herek, Moriah L
jdirlastfourssn: 2888

dn: cn=1148735,ou=Employees,dc=jci,dc=com
displayname: Pelletier, Michael J
jdirlastfourssn: 8719
uid: cpellem

dn: cn=1148736,ou=Employees,dc=jci,dc=com
displayname: Manimanakis, Aris N
jdirlastfourssn: 0366

dn: cn=1148738,ou=Employees,dc=jci,dc=com
displayname: Bernardini, James A
jdirlastfourssn: 8540

dn: cn=1148739,ou=Employees,dc=jci,dc=com
displayname: Steyvers, Robert L
jdirlastfourssn: 8634

dn: cn=1148740,ou=Employees,dc=jci,dc=com
displayname: Vest, Elizabeth G
jdirlastfourssn: 7487

The file will look like the above.
What I need to do is:

1. Take the first entry and get the value of the displayname attribute.
2. Check whether there is another record with the same displayname value. (There could be
multiple records.)
3. If so, extract all of the matching records and write them into
another file.
4. Delete these duplicate records from the parent file.
5. Do that for all records.

I hope you got what I meant.

Re: Help needed on text file/String manipulation
by strat (Canon) on Jun 14, 2003 at 16:26 UTC
    Perhaps something like the following might help you:
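    my $filename = "anything.ldif";
    my $outfile  = "duplicates.ldif";  # assumed name; pick your own
    unless (open (LDIF, $filename)) {
        die "Error: couldn't read from '$filename': $!\n";
    }
    else {
        # OUTFILE has to be opened before it can be printed to
        open (OUTFILE, ">$outfile")
            or die "Error: couldn't write to '$outfile': $!\n";
        my %seenObjects = ();
        local $/ = "\n\n";  # read objects, not lines
        while (<LDIF>) {
            # $_ has one object now
            chomp($_);
            s/\n //g;  # kill continuation lines
            if ($seenObjects{$_}++) {
                # object already seen... write to another filehandle
                print OUTFILE "$_\n\n";
            } # if
        } # while
        close (OUTFILE);
        close (LDIF);
    } # else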
    (Code not tested)
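
    Note that the snippet above only catches objects that are identical as a whole. For the requirement as stated in the question (same displayname, different dn), a minimal sketch could key the hash on the displayname value instead. The sub name split_duplicates and the command-line file names are made up for illustration, it slurps the whole file into memory (which should be fine for a few hundred thousand lines), and it overwrites the input file, so keep a backup:

```perl
use strict;
use warnings;

# Hypothetical helper: takes the whole LDIF text and returns two array refs,
# records whose displayname occurs once and records whose displayname
# occurs more than once. Records without a displayname count as unique.
sub split_duplicates {
    my ($text) = @_;
    my (@records, %count);
    for my $rec (split /\n{2,}/, $text) {
        $rec =~ s/\s+\z//;                   # drop trailing newline(s)
        next unless length $rec;
        (my $unfolded = $rec) =~ s/\n //g;   # unfold continuation lines
        my ($name) = $unfolded =~ /^displayname:[ \t]*(.*?)[ \t]*$/mi;
        $name = '' unless defined $name;
        push @records, [ $name, $rec ];
        $count{$name}++ if length $name;
    }
    my (@unique, @dups);
    for my $r (@records) {
        my ($name, $rec) = @$r;
        if (length $name && $count{$name} > 1) { push @dups,   $rec }
        else                                   { push @unique, $rec }
    }
    return (\@unique, \@dups);
}

# Usage (placeholder names): perl dedup.pl input.ldif duplicates.ldif
if (@ARGV == 2) {
    my ($in, $out) = @ARGV;
    open(my $fh, '<', $in) or die "Can't read '$in': $!\n";
    my $text = do { local $/; <$fh> };       # slurp the whole file
    close($fh);

    my ($unique, $dups) = split_duplicates($text);

    # Step 3: write every record that shares a displayname to the other file
    open(my $dfh, '>', $out) or die "Can't write '$out': $!\n";
    print {$dfh} map { "$_\n\n" } @$dups;
    close($dfh);

    # Step 4: rewrite the parent file without those records
    open(my $pfh, '>', $in) or die "Can't write '$in': $!\n";
    print {$pfh} map { "$_\n\n" } @$unique;
    close($pfh);
}
```

    Counting first and splitting afterwards is what lets the first occurrence of a duplicated name end up in the duplicates file too, which a single pass with a "seen" hash would miss.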

    If the LDIF file is very big, you could either save the Digest::MD5 of the object:

    use Digest::MD5 (); # no namespace pollution
    ...
    if ($seenObjects{ Digest::MD5::md5_hex($_) }++) {
    ...
    or sort the LDIF (e.g. with Tie::File) and compare each line with its predecessor.
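
    Spelled out, the digest variant might look roughly like the sketch below. The sub name find_repeats is hypothetical, and it reads in paragraph mode ($/ = "") rather than with "\n\n", so runs of blank lines between objects are tolerated. The hash then holds one 32-character hex digest per object instead of the full object text:

```perl
use strict;
use warnings;
use Digest::MD5 ();  # no namespace pollution

# Hypothetical helper: reads objects from an open filehandle and returns
# every object whose exact text has already been seen earlier in the file.
sub find_repeats {
    my ($fh) = @_;
    my (%seen, @repeats);
    local $/ = "";                    # paragraph mode: one object per read
    while (my $obj = <$fh>) {
        $obj =~ s/\n+\z//;            # strip the trailing blank line(s)
        $obj =~ s/\n //g;             # kill continuation lines
        push @repeats, $obj
            if $seen{ Digest::MD5::md5_hex($obj) }++;
    }
    return @repeats;
}

# Usage (placeholder file name):
#   open(my $ldif, '<', 'anything.ldif') or die "open failed: $!\n";
#   print "$_\n\n" for find_repeats($ldif);
```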

    Best regards,
    perl -e "s>>*F>e=>y)\*martinF)stronat)=>print,print v8.8.8.32.11.32"