Best method of munging CSV data - using Text::CSV::Simple?

billie_t has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks: I've never been particularly proficient in Perl, and I've been away from it for a long time. I've come to a complete blank when trying to get some data from an LDAP dump and rework it into a mail delivery table. A sample of the original data is this:

"CN=MBX_MANA,OU= Mailboxes,DC=doman,DC=com",SMTP:Manager@domain.com;sm
+tp:MBX_MANA@domain.com;FAX:MBX_MANA@domain.com;X400:c=us\;a= \;p=doma
+in\;o=Exchange\;s=MANAGER\;,MBX_MANA,/o=Exchange Org/ou=First Adminis
+trative Group/cn=Configuration/cn=Servers/cn=SERVER
"CN=Guest,CN=Users,DC=domain,DC=com",,Guest,
[download]

The valid info we want to extract is represented by the first line - the second column has smtp addresses that we want to grab, and associate with the third column (the username) and the final column (the server name). Eventually I want to achieve:

user@server    blah@smtpaddress
user@server    blah@smtp2address
user2@server   blah2@smtpaddress
user2@server   blah2@smtp2address

Each user can have a variable number of STMP addresses. The second line of the original data posted above is an example of an invalid line that we would like to skip (there are no smtp addresses).

I think I can figure out the regexps (mostly) to grab the data I want, such as SMTP: but not FAX: or X400: addresses. Text::CSV::Simple does a great job of ignoring the first column of each line (which is not needed), and stashing the others somewhere else, using this code:

my $parser = Text::CSV::Simple->new;
$parser->want_fields(2, 3, 4);
my @data = $parser->read_file($infile);
[download]

That appears to be the easy part. I need to achieve the remaining steps:
1. Ignore lines of data that do not contain smtp:. I was thinking of something like the following snippet, but there's a difficulty I'll explain after this list:

foreach my $line (@data) {
    chomp;
    if $line =~ /SMTP:/i 
{do stuff} }
[download]

2. Ensure that all the SMTP addresses extracted match up to the correct userid.
3. Ensure that all the SMTP addresses and userids match up to the correct server name (I'll extract the server name by using a simple substr (since it's always the last 6 characters at the end of that column)).
4. Output to another file data of the format "user@server user@smtpadddress" (outputting is not a problem)

The main problem is, by reading into a single-dimension array, I think that I'm losing the "columns" which associate the SMTP addresses with the correct userid and server. I simply don't know enough about arrays or hashes (or Perl, other than the fact it's by far the best tool for the job) to structure the data in the appropriate way so that I can extract what I want.

Any help would be appreciated. The input files will not be larger than 2MB, and the whole thing will be running on up-to-date kit (so reading into memory should not be a problem). And I'm sorry I couldn't come up with more code - not having a clue about appropriate structure is not helping me. It may be that Text::CSV::Simple is not right for the job, but I couldn't quite get to grips with the syntax of Text::CSV_XS, for example.

Comment on Best method of munging CSV data - using Text::CSV::Simple? Select or Download Code

Replies are listed 'Best First'.
Re: Best method of munging CSV data - using Text::CSV::Simple? by Cody Pendant (Prior) on Feb 08, 2006 at 05:39 UTC
Text::CSV::Simple does a great job of ignoring the first column of each line (which is not needed) Your code ignores the first two columns, because the columns are numbered from zero. I don't quite understand your overall goal, but this should help: #!/usr/bin/perl -w use strict; use Text::CSV::Simple; my $parser = Text::CSV::Simple->new; $parser->want_fields(1, 2, 3, 4); my @data = $parser->read_file(\DATA); foreach my $line (@data) { if ($line->[0] =~ m/SMTP:/i) { print "here are the parts of the first column:\n"; my @parts_of_first_column = split(';', $line->[0]); foreach my $part (@parts_of_first_column) { print " $part\n"; } } } __DATA__ "CN=MBX_MANA,OU= Mailboxes,DC=doman,DC=com",SMTP:Manager@domain.com;sm +tp:MBX_MANA@domain.com;FAX:MBX_MANA@domain.com;X400:c=us\;a= \;p=doma +in\;o=Exchange\;s=MANAGER\;,MBX_MANA,/o=Exchange Org/ou=First Adminis +trative Group/cn=Configuration/cn=Servers/cn=SERVER "CN=Guest,CN=Users,DC=domain,DC=com",,Guest, [download] Which prints out: `here are the parts of the first column: * SMTP:Manager@domain.com * smtp:MBX_MANA@domain.com * FAX:MBX_MANA@domain.com * X400:c=us\ * a= \ * p=domain\ * o=Exchange\ * s=MANAGER\` [download] So Text::CSV::Simple splits up the CSV in a smart way, and then the code splits up the first (remaining) column by semicolons. From there on you should be able to test each part for the stuff you want. ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss') =~y~b-v~a-z~s; print	[reply] [d/l] [select]
Re^2: Best method of munging CSV data - using Text::CSV::Simple? by billie_t (Sexton) on Feb 08, 2006 at 06:15 UTC
Apologies for the typo with my list of fields - I mistakenly pasted from old code - I've got the correct (1,2,3) in my new stuff, although I was pulling my hair out there for a while. That is helpful, thank you. So, the parts of the line are line[0], line[1], line[2] when they're divvied up by Text::CSV::Simple. You've dumped line[0] into an array to break up its components, that's great. So I can manipulate line[1] and line[2] to grab the other parts I need. In that loop, I should be able to output something line-by-line like (pseudocode) "@parts[0] line[1]@line[2]", which is effectively "SMTP_ADDRESS USER@SERVER". That's fantastic, I was imagining all kinds of wierd structures in my head, not a simple and understandable one.	[reply]
Re^3: Best method of munging CSV data - using Text::CSV::Simple? by Cody Pendant (Prior) on Feb 08, 2006 at 06:20 UTC
I was imagining all kinds of wierd structures in my head, not a simple and understandable one. That's why they invented Data::Dumper. I'd be lost without it. Just add `use Data::Dumper;` and do `print Dumper(\$something);` and it makes things a lot easier. That's how I figured out what you data was doing. ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss') =~y~b-v~a-z~s; print	[reply] [d/l] [select]
Re: Best method of munging CSV data - using Text::CSV::Simple? by jZed (Prior) on Feb 08, 2006 at 06:00 UTC
I'm not quite sure I understand where you're getting stuck, but it sounds like you have already done the CSV parsing bit - you're able to break each line into a series of fields. You also sounds like you know how to go about munging out the data you want from each field. So my best guess is that where you are stuck is with how to get data into and out of perl data structures. The @data that you get from Text::CSV::Simple is an array of array references. That means that each item in the array is itself a reference to an array composed of three fields. That means you need to derefernce it to get at the fields like this: `for my $row(@data){ my($smtp,$user,$server) = @$row; # munge $stmp next unless $smtp; # munge $user and $server # recombine them into a string # write the string to a file }` [download] If I've misunderstood the part you're having a problem with, let me know.	[reply] [d/l]
Re^2: Best method of munging CSV data - using Text::CSV::Simple? by billie_t (Sexton) on Feb 08, 2006 at 06:23 UTC
No, that's correct - I was getting my knickers in a knot trying to figure out when I do the munging and output to a file. So what I essentially need to do here is read a row at a time, fiddle around with it and output it before working on the next line. I knew that logically, but couldn't figure out the forest for looking at the trees. Thank you for breaking down what I need to do structurally - I have real problems with that aspect of coding (infrequent use). I think I have enough from you and Cody Pendant to be able to figure it out from here.	[reply]
Re^2: Best method of munging CSV data - using Text::CSV::Simple? by Scott7477 (Chaplain) on Feb 05, 2007 at 23:02 UTC
"The @data that you get from Text::CSV::Simple is an array of array references. That means that each item in the array is itself a reference to an array composed of three fields. That means you need to derefernce it..." This is need to know information that isn't made that clear in the module docs. I've annotated the CPAN module with a paraphrase of your explanation. Thanks, I have been butting my head for a few minutes trying to get some code using this module to work. Update: this reply was intended for jZed, I fat fingered the post.	[reply]
Re: Best method of munging CSV data - using Text::CSV::Simple? by billie_t (Sexton) on Feb 09, 2006 at 00:59 UTC
I just thought I'd post the finished work of art: #! /usr/bin/perl #parsing Exchange output file to create address aliases use strict; use warnings; my($infile) = ("C:\\recipient.txt"); my($outfile) = ("C:\\outfile.txt"); use Text::CSV::Simple; open(OUTPUT, ">$outfile") or die "Can't open $outfile : $!"; my $parser = Text::CSV::Simple->new; $parser->want_fields(1, 2, 3); my @data = $parser->read_file($infile); foreach my $line (@data) { # process only accounts that contain SMTP addesses + if ($line->[0] =~ m/SMTP:/i) { my $user = $line->[1]; my $server = substr(($line->[2]), -6); #split up the line that contains all the email addresses my @smtp = split(';', $line->[0]); foreach my $part (@smtp) { # do not process anything that is not an SMTP address # (exclude fax & X400) if ($part =~ m/SMTP:/i) { $part = substr($part, 5); print OUTPUT ("$user\@$server \t $part\n"); } } } } # CSVDE command used on domain controller to # generate recipient information for users in domain # Dumps the user name, all proxy addresses # and the home mailbox server - it WON'T exclude the DN. # csvde -m -n -g -f "C:\recipient.txt" -r "(\|(&(objectClass=user)(obje +ctCategory=person))(objectClass=groupOfNames)(objectClass=msExchDynam +icDistributionList))" -o "distinguishedName, DN" -L "name,proxyAddres +ses,msExchHomeServerName" [download] Output is exactly what I wanted. Thanks for all your help, guys!	[reply] [d/l]
Re^2: Best method of munging CSV data - using Text::CSV::Simple? by Scott7477 (Chaplain) on Feb 06, 2007 at 00:11 UTC
Here is my code that uses Text::CSV::Simple; the goal was to parse a NYSE price list from the Wall Street Journal. `use strict; use Text::CSV::Simple; #usage wsj2csv AMEX091806.txt my ($filename) = shift; print $filename."\n"; print "What is the trading date(MMDDYY)?"; my $traddt = <STDIN>; print $traddt."\n"; my $parser = Text::CSV::Simple->new; $parser->want_fields(1, 2, 3, 4, 5, 6, 8); my @data = $parser->read_file($filename); my $datalines = @data; print $datalines."is number of rows.\n"; for my $row(@data) { my @line = @$row; foreach my $row1(@line) { print $row1.","; } print "\n"; }` [download] I intend to add more functionality to this; but this thread was helpful to me in getting the parser to work	[reply] [d/l]