
Hello Monks,

I am trying to parse a standard CSV file to generate an output file that contains all the records in CSV format (without any whitespace-only or blank lines between them).

The specimen input file:

                                                                      
+                               argument_value                        
+                                                                     
+                                                                     
+                                     
----------------------------------------------------------------------
+---------------------------------------------------------------------
+---------------------------------------------------------------------
+---------------------------------------------------------------------
+---------------------------------------------------------------------
+---------------------------------------------------------------------
+---------------------------------------------------------------------
+-----------------------------------------------
 alay@nkk.com
 brps@nkk.com, luin@nkk.com
 sthn@nkk.com

 toen@nkk.com

 mara@nkk.com

 alay@nbkk.com
 wnrd@nkk.com, jpnd@ckk.com, Daim@nkk.com, nbic@ckk.com, nbrs@crawford.com, nbc1@Ckk.com,jodo@nkk.com, mara@nkk.com

 trrt@nkk.com
 alay@nkk.com
 alam@mkk.com, Case@nkk.com, miob@ikk.com, JTny@ikk.com, RBwn@ikk.com, jsab@ikk.com, Shli@nkk.com, Stee@nkk.com, Eron@nkk.com

There is a leading space before each email address that follows a comma, with a few exceptions where the address comes directly after the comma. The input file was generated by a pgsql export to CSV, so it has its share of blank (whitespace-only) lines between the records. Lines with fewer records also carry trailing blanks up to the newline.
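The whitespace handling described above could be sketched roughly as follows. This is only a minimal sketch: it assumes the fields never contain quoted commas, so a plain `split` is enough, and `split_emails` is a hypothetical helper name, not something from Text::CSV.

```perl
use strict;
use warnings;

# Hypothetical helper: split one input line into trimmed, non-empty fields.
# Assumption: no quoted fields, so splitting on bare commas is safe.
sub split_emails {
    my ($line) = @_;
    return grep { length }                                  # drop empty fields
           map  { my $f = $_; $f =~ s/^\s+|\s+$//g; $f }    # trim both ends
           split /,/, $line;
}

# One of the irregular lines: a missing space after a comma, trailing blanks.
print join('|', split_emails(' nbc1@Ckk.com,jodo@nkk.com, mara@nkk.com   ')), "\n";
# prints: nbc1@Ckk.com|jodo@nkk.com|mara@nkk.com
```

A whitespace-only line yields an empty list here, so blank lines drop out without any special casing.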

Expected output:

alay@nkk.com, brps@nkk.com, luin@nkk.com, sthn@nkk.com, toen@nkk.com, mara@nkk.com, wnrd@nkk.com, jpnd@ckk.com, Daim@nkk.com, nbic@ckk.com, nbrs@crawford.com, nbc1@Ckk.com, jodo@nkk.com, trrt@nkk.com, Case@nkk.com, miob@ikk.com, JTny@ikk.com, RBwn@ikk.com, jsab@ikk.com, Shli@nkk.com, Stee@nkk.com, Eron@nkk.com

The output should be a single comma-separated line in CSV format containing only the unique records (email addresses).

I am using a script that strips the whitespace from the input file, removes duplicate email records when they occur on the same line, and prints one line at a time. As a result the data comes out a line at a time, with the blank lines and trailing blanks of the input file carried over, which is not what I want. I believe I need a way to pick each record out of every line read from the input file and write it to the output file one at a time, so that each incoming record can be checked against the records already written, eliminating duplicates and blank lines. I need your help finding a way to accomplish this.
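The idea in the last paragraph (check each record against everything already written) might look something like this. It is a hedged sketch, not a tested solution for the real file: it assumes no quoted fields, compares addresses case-sensitively, and `unique_emails` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the unique, trimmed, non-empty fields found
# across all given lines, in order of first appearance.
sub unique_emails {
    my @lines = @_;
    my (%seen, @unique);
    for my $line (@lines) {
        for my $field (split /,/, $line) {
            $field =~ s/^\s+|\s+$//g;       # trim leading/trailing whitespace
            next unless length $field;      # skip blanks and empty fields
            push @unique, $field unless $seen{$field}++;   # keep first sighting only
        }
    }
    return @unique;
}

# A few lines in the shape of the specimen input, including a blank line
# and a duplicate that sits on a different line.
my @sample = (
    ' alay@nkk.com',
    ' brps@nkk.com, luin@nkk.com',
    '   ',
    ' alay@nkk.com',
);
print join(', ', unique_emails(@sample)), "\n";
# prints: alay@nkk.com, brps@nkk.com, luin@nkk.com
```

Because the `%seen` hash lives outside the per-line loop, duplicates are caught across lines and not just within one. For a real file the lines could be passed straight from a file handle, e.g. `unique_emails(<$fh>)`.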

My script:


use strict;
use warnings;

# Require CPAN module for parsing CSV text files
use Text::CSV;

# Remove duplicate elements from an array in place,
# keeping the first occurrence of each value
sub remove_duplicates {
    my ($ar) = @_;
    my %seen;
    @$ar = grep { !$seen{$_}++ } @$ar;
    return;
}

# Store our CSV file name
my $file = '/ppp.csv';

# Obtain a file handle for our CSV file, or die upon failure
open(my $fh, '<', $file)
    or die "Unable to open csv file $file: $!\n";

# Obtain a Text::CSV object
my $csv = Text::CSV->new;

# Loop on the lines in the CSV file
while (my $line = <$fh>) {

    # If the line parses successfully, print;
    # otherwise, report the failure
    if ($csv->parse($line)) {
        # Extract current line's data as an array
        my @data = $csv->fields();

        # Duplicates are only removed within this one line
        remove_duplicates(\@data);
        print "@data\n";
    }
    else {
        print 'Unable to parse CSV line: ', $line, "\n";
    }
}

# Close file handle
close($fh);

In reply to duplicate records in a csv file by sanju7
