sanju7 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am trying to parse a standard CSV file and generate an output file that has all the records in CSV format (without any whitespace-only or blank lines between them, etc.).

The specimen input file:

     argument_value
    
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    alay@nkk.com
    brps@nkk.com, luin@nkk.com
    sthn@nkk.com
    toen@nkk.com
    mara@nkk.com
    alay@nbkk.com
    wnrd@nkk.com, jpnd@ckk.com, Daim@nkk.com, nbic@ckk.com, nbrs@crawford.com, nbc1@Ckk.com,jodo@nkk.com, mara@nkk.com
    trrt@nkk.com
    alay@nkk.com
    alam@mkk.com, Case@nkk.com, miob@ikk.com, JTny@ikk.com, RBwn@ikk.com, jsab@ikk.com, Shli@nkk.com, Stee@nkk.com, Eron@nkk.com
There is a leading space before each email address right after the comma, with a few exceptions where the email record follows the comma directly. The input file, having been generated from a pgsql export to CSV, also has its share of blank (whitespace-only) lines between data lines elsewhere. And the lines with fewer records carry trailing blanks up to the newline.

Expected output:

    alay@nkk.com, brps@nkk.com, luin@nkk.com, sthn@nkk.com, toen@nkk.com, mara@nkk.com, wnrd@nkk.com, jpnd@ckk.com, Daim@nkk.com, nbic@ckk.com, nbrs@crawford.com, nbc1@Ckk.com, jodo@nkk.com, trrt@nkk.com, Case@nkk.com, miob@ikk.com, JTny@ikk.com, RBwn@ikk.com, jsab@ikk.com, Shli@nkk.com, Stee@nkk.com, Eron@nkk.com
A simple output: all unique records (email addresses) on one comma-separated line, in CSV format.

I am using a script that eliminates the whitespace from the input file, removes duplicate email records when they are on the same line, and prints one line at a time. As a result, the data comes out one line at a time, still with trailing blanks on some of the records as in the input file, and not as desired. I believe I need a way to pick each record from each line the script reads and then write them out one at a time to an output file; that way each input record could be matched against the records already written, eliminating the duplicates and blank lines. I need your help to find a way to accomplish this.

My script:
    # Require CPAN module for parsing CSV text files
    use strict;
    use warnings;
    use Text::CSV;
    
    # Remove duplicate elements from an array (in place),
    # keeping the first occurrence of each
    sub remove_duplicates {
        my $ar = shift;
        my %seen;
        @$ar = grep { !$seen{$_}++ } @$ar;
    }
    
    # Store our CSV file name
    my $file = '/ppp.csv';
    
    # Obtain a file handle for our CSV file, or die upon failure
    open(CSV, '<', $file) or die("Unable to open csv file $file: $!\n");
    
    # Obtain a Text::CSV object
    my $csv = Text::CSV->new();
    
    # Loop on the lines in the CSV file
    foreach my $line (<CSV>) {
        # If the line parses successfully, print;
        # otherwise, report the failure
        if ($csv->parse($line)) {
            # Extract current line's data as an array
            my @data = $csv->fields();
            remove_duplicates(\@data);
            print "@data\n";
        }
        else {
            print 'Unable to parse CSV line: ', $line, "\n";
        }
    }
    
    # Close file handle
    close(CSV);
    
    __END__

Re: duplicate records in a csv file
by roboticus (Chancellor) on Jun 30, 2010 at 22:09 UTC

    sanju7:

    You could've put your data files in code tags as well; I think it would help clarify your post a little. Also, you may want to fix your indentation so that the code is easy to read. I was going to take a look at the code, but since the indentation doesn't show the program structure, I can't really read it without downloading and reformatting it myself. Considering it's my day off, I'm not inclined to go the extra mile.

    Anyway, a couple of notes:

    1. Putting the key information in a hash will let you detect duplicate lines.
    2. You can easily open multiple files and print the records to different files to split your records out.
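    Both ideas can be sketched together; this is a minimal illustration, not the poster's script, and the sample records and the choice of the whole record as the hash key are assumptions:

    ```perl
    use strict;
    use warnings;

    # Route each record to one of two lists: first occurrences vs. duplicates.
    # Keying a hash on the record is what detects the duplicates; in a real
    # script each list could be printed to its own output file handle.
    sub split_unique {
        my @records = @_;
        my (%seen, @unique, @dupes);
        for my $rec (@records) {
            next if $rec =~ /^\s*$/;          # skip blank / whitespace-only records
            if ($seen{$rec}++) { push @dupes,  $rec }
            else               { push @unique, $rec }
        }
        return (\@unique, \@dupes);
    }

    my ($uniq, $dup) = split_unique('a@x.com', 'b@x.com', '', 'a@x.com');
    print "unique: @$uniq\n";    # unique: a@x.com b@x.com
    print "dupes:  @$dup\n";     # dupes:  a@x.com
    ```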

    If I notice that the node is easier to read, or if you have any questions about what I wrote, I'll be happy to revisit the node and see if there's anything more I can offer.

    ...roboticus

      That was my lack of proper care while writing it down. It is reformatted and indented now.
Re: duplicate records in a csv file
by Tux (Canon) on Jul 01, 2010 at 06:05 UTC

    Your $csv could use a ->new ({ binary => 1, auto_diag => 1 }) to catch errors better.
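    A minimal sketch of that constructor call (assuming Text::CSV is installed; the sample line is made up):

    ```perl
    use strict;
    use warnings;
    use Text::CSV;

    # binary => 1 accepts embedded newlines and non-ASCII bytes in fields;
    # auto_diag => 1 reports parse errors automatically instead of failing silently.
    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 })
        or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

    $csv->parse('a@x.com, b@x.com') or die "parse failed";
    my @fields = $csv->fields();
    print scalar(@fields), " fields\n";    # 2 fields
    ```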

    Uniquifying an array while keeping its order can be done much more easily. The perlfaqs give even more solutions:

    tux$ cat test.pl
    #!/pro/bin/perl
    
    use strict;
    use warnings;
    
    my @data = (1, 3, 4, 5, 3, 6, 8, 4, 9);
    $" = ", ";
    print "(@data)\n";
    {   my %seen;
        @data = grep { !$seen{$_}++ } @data;
    }
    print "(@data)\n";
    tux$ perl test.pl
    (1, 3, 4, 5, 3, 6, 8, 4, 9)
    (1, 3, 4, 5, 6, 8, 9)

    Enjoy, Have FUN! H.Merijn
Re: duplicate records in a csv file
by Marshall (Canon) on Jul 03, 2010 at 13:12 UTC
    Your input file format is not complex enough to need all the power of the CSV parsing module - when things get hairy that thing is definitely needed. But here, I would use a simple, less-than-one-inch regex with a global match for the job. The regex below just tokenizes chunks of characters that are in the character set, which has the effect of throwing away the commas, spaces, newlines, etc.

    From what I could tell from your desired output, you want unique user names (keeping the first one seen) as opposed to simply unique e-mail addresses, so I adjusted the grep{}/%seen idea below to key on just the part in front of the "@".

    In the join, change ',' to ', ' to add a space after each comma if that is what you want. CSV files usually do not have leading spaces.

    #!/usr/bin/perl -w
    use strict;
    
    my @emailaddr;
    while (<DATA>) {
        if (/@/)    # skip lines without email addresses
        {
            push(@emailaddr, $_) foreach /[\w@.]+/g;
        }
    }
    my %seen;
    @emailaddr = grep { !$seen{ (split(/[@]/, $_))[0] }++ } @emailaddr;
    print join(',', @emailaddr), "\n";
    
    # output line:
    # alay@nkk.com,brps@nkk.com,luin@nkk.com,sthn@nkk.com,toen@nkk.com,mara@nkk.com,wnrd@nkk.com,jpnd@ckk.com,Daim@nkk.com,nbic@ckk.com,nbrs@crawford.com,nbc1@Ckk.com,jodo@nkk.com,trrt@nkk.com,alam@mkk.com,Case@nkk.com,miob@ikk.com,JTny@ikk.com,RBwn@ikk.com,jsab@ikk.com,Shli@nkk.com,Stee@nkk.com,Eron@nkk.com
    
    __DATA__
     argument_value
    
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    alay@nkk.com
    brps@nkk.com, luin@nkk.com
    sthn@nkk.com
    toen@nkk.com
    mara@nkk.com
    alay@nbkk.com
    wnrd@nkk.com, jpnd@ckk.com, Daim@nkk.com, nbic@ckk.com, nbrs@crawford.com, nbc1@Ckk.com,jodo@nkk.com, mara@nkk.com
    trrt@nkk.com
    alay@nkk.com
    alam@mkk.com, Case@nkk.com, miob@ikk.com, JTny@ikk.com, RBwn@ikk.com, jsab@ikk.com, Shli@nkk.com, Stee@nkk.com, Eron@nkk.com
      This works like a charm. Thank you for your time.