sanju7 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am trying to parse a standard CSV file and generate an output file that has all the records in CSV format (without any whitespace-only or blank lines between them, etc.).

The specimen input file:

     argument_value
    
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    alay@nkk.com
    brps@nkk.com, luin@nkk.com
    sthn@nkk.com
    toen@nkk.com
    mara@nkk.com
    alay@nbkk.com
    wnrd@nkk.com, jpnd@ckk.com, Daim@nkk.com, nbic@ckk.com, nbrs@crawford.com, nbc1@Ckk.com,jodo@nkk.com, mara@nkk.com
    trrt@nkk.com
    alay@nkk.com
    alam@mkk.com, Case@nkk.com, miob@ikk.com, JTny@ikk.com, RBwn@ikk.com, jsab@ikk.com, Shli@nkk.com, Stee@nkk.com, Eron@nkk.com
There is a leading space before each email address right after the comma, with a few exceptions where the email record follows the comma directly. The input file, having been generated from a pgsql export to CSV, also has its share of blank (whitespace-only) lines between data lines elsewhere. And the lines with fewer records carry trailing blanks up to the newline.

Expected output:

    alay@nkk.com, brps@nkk.com, luin@nkk.com, sthn@nkk.com, toen@nkk.com, mara@nkk.com, wnrd@nkk.com, jpnd@ckk.com, Daim@nkk.com, nbic@ckk.com, nbrs@crawford.com, nbc1@Ckk.com, jodo@nkk.com, trrt@nkk.com, Case@nkk.com, miob@ikk.com, JTny@ikk.com, RBwn@ikk.com, jsab@ikk.com, Shli@nkk.com, Stee@nkk.com, Eron@nkk.com
A simple output: all unique records (email addresses) on one comma-separated line, in CSV format.

I am using a script that eliminates the whitespace from the input file, removes duplicate email records when they are on the same line, and prints one line at a time. As a result, the data comes out one line at a time, still with trailing blanks on some of the records as in the input file, and not as desired. I believe I need a way to pick each record from each line the script reads and then write them out one at a time to an output file; that way each input record could be matched against the records already written, eliminating the duplicates and blank lines. I need your help to find a way to accomplish this.

My script:
    # Require CPAN module for parsing CSV text files
    use strict;
    use warnings;
    use Text::CSV;
    
    # Remove duplicate elements from an array (in place),
    # keeping the first occurrence of each
    sub remove_duplicates {
        my $ar = shift;
        my %seen;
        @$ar = grep { !$seen{$_}++ } @$ar;
    }
    
    # Store our CSV file name
    my $file = '/ppp.csv';
    
    # Obtain a file handle for our CSV file, or die upon failure
    open(CSV, '<', $file) or die("Unable to open csv file $file: $!\n");
    
    # Obtain a Text::CSV object
    my $csv = Text::CSV->new();
    
    # Loop on the lines in the CSV file
    foreach my $line (<CSV>) {
        # If the line parses successfully, print;
        # otherwise, report the failure
        if ($csv->parse($line)) {
            # Extract current line's data as an array
            my @data = $csv->fields();
            remove_duplicates(\@data);
            print "@data\n";
        }
        else {
            print 'Unable to parse CSV line: ', $line, "\n";
        }
    }
    
    # Close file handle
    close(CSV);
    
    __END__

Re: duplicate records in a csv file
by roboticus (Chancellor) on Jun 30, 2010 at 22:09 UTC

    sanju7:

    You could've put your data files in code tags as well; I think it would help clarify your post a little. Also, you may want to fix your indentation so that the code is easy to read. I was going to take a look at the code, but since the indentation doesn't show the program structure, I can't really read it without downloading and reformatting it myself. Considering it's my day off, I'm not inclined to go the extra mile.

    Anyway, a couple of notes:

    1. Putting the key information in a hash will let you detect duplicate lines.
    2. You can easily open multiple files and print the records to different files to split your records out.
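    Both ideas can be sketched together; this is a minimal illustration, not the poster's script, and the sample records and the choice of the whole record as the hash key are assumptions:

    ```perl
    use strict;
    use warnings;

    # Route each record to one of two lists: first occurrences vs. duplicates.
    # Keying a hash on the record is what detects the duplicates; in a real
    # script each list could be printed to its own output file handle.
    sub split_unique {
        my @records = @_;
        my (%seen, @unique, @dupes);
        for my $rec (@records) {
            next if $rec =~ /^\s*$/;          # skip blank / whitespace-only records
            if ($seen{$rec}++) { push @dupes,  $rec }
            else               { push @unique, $rec }
        }
        return (\@unique, \@dupes);
    }

    my ($uniq, $dup) = split_unique('a@x.com', 'b@x.com', '', 'a@x.com');
    print "unique: @$uniq\n";    # unique: a@x.com b@x.com
    print "dupes:  @$dup\n";     # dupes:  a@x.com
    ```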

    If I notice that the node is easier to read, or if you have any questions about what I wrote, I'll be happy to revisit the node and see if there's anything more I can offer.

    ...roboticus

      That was my lack of proper care while writing it down. It is reformatted and indented now.
Re: duplicate records in a csv file
by Tux (Canon) on Jul 01, 2010 at 06:05 UTC

    Your $csv could use a ->new ({ binary => 1, auto_diag => 1 }) to catch errors better.
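    A minimal sketch of that constructor call (assuming Text::CSV is installed; the sample line is made up):

    ```perl
    use strict;
    use warnings;
    use Text::CSV;

    # binary => 1 accepts embedded newlines and non-ASCII bytes in fields;
    # auto_diag => 1 reports parse errors automatically instead of failing silently.
    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 })
        or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

    $csv->parse('a@x.com, b@x.com') or die "parse failed";
    my @fields = $csv->fields();
    print scalar(@fields), " fields\n";    # 2 fields
    ```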

    Uniquifying an array while keeping its order can be done much more easily. The perlfaqs give even more solutions:

    tux$ cat test.pl
    #!/pro/bin/perl
    
    use strict;
    use warnings;
    
    my @data = (1, 3, 4, 5, 3, 6, 8, 4, 9);
    $" = ", ";
    print "(@data)\n";
    {   my %seen;
        @data = grep { !$seen{$_}++ } @data;
    }
    print "(@data)\n";
    tux$ perl test.pl
    (1, 3, 4, 5, 3, 6, 8, 4, 9)
    (1, 3, 4, 5, 6, 8, 9)

    Enjoy, Have FUN! H.Merijn
Re: duplicate records in a csv file
by Marshall (Canon) on Jul 03, 2010 at 13:12 UTC
    Your input file format is not complex enough to need all the power of the CSV parsing module - when things get hairy that thing is definitely needed. But here, I would use a simple, less-than-one-inch regex with a global match for the job. The regex below just tokenizes chunks of characters that are in the character set, which has the effect of throwing away the commas, spaces, newlines, etc.

    From what I could tell from your desired output, you want unique user names (keeping the first one seen) as opposed to simply unique e-mail addresses, so I adjusted the grep{}/%seen idea below to key on just the part in front of the "@".

    In the join, change ',' to ', ' to add a space after each comma if that is what you want. CSV files usually do not have leading spaces.

    #!/usr/bin/perl -w
    use strict;
    
    my @emailaddr;
    while (<DATA>) {
        if (/@/)    # skip lines without email addresses
        {
            push(@emailaddr, $_) foreach /[\w@.]+/g;
        }
    }
    my %seen;
    @emailaddr = grep { !$seen{ (split(/[@]/, $_))[0] }++ } @emailaddr;
    print join(',', @emailaddr), "\n";
    
    # output line:
    # alay@nkk.com,brps@nkk.com,luin@nkk.com,sthn@nkk.com,toen@nkk.com,mara@nkk.com,wnrd@nkk.com,jpnd@ckk.com,Daim@nkk.com,nbic@ckk.com,nbrs@crawford.com,nbc1@Ckk.com,jodo@nkk.com,trrt@nkk.com,alam@mkk.com,Case@nkk.com,miob@ikk.com,JTny@ikk.com,RBwn@ikk.com,jsab@ikk.com,Shli@nkk.com,Stee@nkk.com,Eron@nkk.com
    
    __DATA__
     argument_value
    
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    alay@nkk.com
    brps@nkk.com, luin@nkk.com
    sthn@nkk.com
    toen@nkk.com
    mara@nkk.com
    alay@nbkk.com
    wnrd@nkk.com, jpnd@ckk.com, Daim@nkk.com, nbic@ckk.com, nbrs@crawford.com, nbc1@Ckk.com,jodo@nkk.com, mara@nkk.com
    trrt@nkk.com
    alay@nkk.com
    alam@mkk.com, Case@nkk.com, miob@ikk.com, JTny@ikk.com, RBwn@ikk.com, jsab@ikk.com, Shli@nkk.com, Stee@nkk.com, Eron@nkk.com
      This works like a charm. Thank you for your time.