Parsing a complex csv, cleaning it up, and exporting it

scotttromley has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I've got a CSV file of contacts from a customer of mine. Whoever migrated his contacts into Gmail ended up putting the First Name, Last Name, Email, and a few other fields into the notes field of the CSV, and leaving the actual fields undef, and I've taken it upon myself to clean this up.

I decided the best way to handle this would be by learning perl and using it, and I've made decent progress thus far. I've cleaned up the csv as best as I can using Sublime Text 2, as I have lots of experience in it and a fair grasp of regex. Here is the situation I'm in:</p

I have written a Perl program that can (using Text::CSV) import each line of the CSV. It then saves the Notes field for the current line to a scalar $notesField. Here is an example of a few lines of the input data:

,,,,,,,,,,,,,,,,,,,,,,,,,"First Name: Dobbin Last Name: David L. Addre
+ss: david@adamsonanddobbin.com City: PO Box 1326407 Pido Road State: 
+Peterborough ZIP Code: ON Country: K9J 7H5
First Name: Dobbin Last Name: David L. E-mail: david@adamsonanddobbin.
+com Address: PO Box 1326407 Pido Road City: Peterborough State: ON ZI
+P Code: K9J 7H5",,,,,,Home,743 7790,Other,742 4524,Work,745 5751,,,,,
+,,,,,,Adamson And Dobbin Ltd. Mechanical Contractors,,General Manager
+,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,"First Name: Chapleau Last Name: Kathy, Ken A
+ddress: 666 FrankFirst Name: Chapleau Last Name: Kathy, Ken City: 666
+ Frank
",,,,,,Home,876-9863,,,,,,,,,,,,,,,Admiralty Hall,,Accountant,,,,,,,,,
+,,
[download]

The data between the double quotes is field[ 25 ] of the line, and is the notes field. You may notice the Last Name: string in the notes field is followed by the first name of the contact, as whoever migrated this data seems to have transposed those two fields

My next step, given that I can read each line in and extract the notes field, is to use regex to match the string following First Name: so I can save it to the appropriate field in that row, the Given Name field of the CSV. The pattern I've come up with is:

if ($notesField =~ /First Name:.*?(?=Last)/)
{
    print "MATCHED!";
}
[download]

Which approximately works in Sublime Text 2's Regex search to select what I need it to. However, it doesn't match anything in my Perl program.

My intention is to match each field in the row, concatenate together a new row, and write it out to a new CSV file, preserving the existing Notes field to account for any data that may be lost in the process.

Please help me figure out what I am doing wrong with my regex, as well as any other protips you could offer. I realize this is a pretty messy dataset, but I am really enjoying the challenge and am not easily discouraged. Thank you for your help, if you need any more information please let me know. Here is the full text of my program thus far:

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new ({
    sep_char => ',',
    binary => 1,
    auto_diag => 1,
    allow_loose_quotes => 1
    });
my $file = $ARGV[0] or die "Need to load a file from stdin";

open (my $data, '<', $file) or die "Couldn't load file '$file'\n";
while(my $row = $csv->getline($data)) {
    my @fields = @{$row};
    my $notesField = $fields[25];
    if (defined $notesField) 
    {
        if ($notesField =~ /First Name:.*?(?=Last)/)
        {
            print "MATCHED!";
        }
        
    }
    else
    {
        print"No Notes Field \n\n";
    }
}
    
if (not $csv->eof) {
    $csv->error_diag();
}
close $data;
[download]

Comment on Parsing a complex csv, cleaning it up, and exporting it Select or Download Code

Replies are listed 'Best First'.

Re: Parsing a complex csv, cleaning it up, and exporting it
by Athanasius (Archbishop) on Aug 28, 2014 at 02:48 UTC

Hello scotttromley, and welcome to the Monastery!

I decided the best way to handle this would be by learning perl and using it, and I've made decent progress thus far.

Bravo! Perl is an excellent tool for this type of job, and your code shows you are making excellent progress. A word of warning, though: Perl is good for a lot else besides text munging — so good, it easily becomes addictive! Just be prepared for the inevitable lifestyle changes as you find yourself spending more and more time hanging around the Monastery. ;-)

However, it doesn't match anything in my Perl program.

I don’t know why you say that — your regex works fine for me:

Read more... (3 kB)

But note that this code:

if ($notesField =~ /First Name:(.*?)(?=Last)/)
{
    print "MATCHED!: '$1'\n";
}
[download]

Matches only the first “First Name” in any given row. To get all the matches, change the if to:

while ($notesField =~ /First Name:(.*?)(?=Last)/g)
[download]

See “Global matching” in perlretut#Using-regular-expressions-in-Perl.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

[reply]
[d/l]
[select]

Re: Parsing a complex csv, cleaning it up, and exporting it
by RichardK (Parson) on Aug 28, 2014 at 01:14 UTC

perl array indexes start at 0, so are you sure you're looking at the right field? maybe 24 instead?

Try printing $notefield to see what it contains, and when you have a moment read Basic debugging checklist for tips on how to work out why any code isn't working.

A tip, you don't have to copy the $row to get at the fields, you can just use my $note = $row->[24]; see perlreftut

[reply]
[d/l]

Re: Parsing a complex csv, cleaning it up, and exporting it
by hdb (Monsignor) on Aug 28, 2014 at 06:52 UTC

My preference would be to first decompose the notes field into its parts (all of them) and to store it into some structure (like a hash) and only then extract the information you are looking for.

Here, a split would do exactly that using captures in the regex to keep the field names.

use strict;
use warnings;
use Data::Dumper;

while(<DATA>){
    my( $notesField )= /"(.*)"/;
    my @parts = split /\s?(First Name|Last Name|Address|City|State|ZIP
+ Code|E-mail): /, $notesField;
    shift @parts;
    my %parts = @parts;
    print Dumper \%parts;
}
__DATA__
,,,,,,,,,,,,,,,,,,,,,,,,,"First Name: Dobbin Last Name: David L. Addre
+ss: david@adamsonanddobbin.com City: PO Box 1326407 Pido Road State: 
+Peterborough ZIP Code: ON Country: K9J 7H5 First Name: Dobbin Last Na
+me: David L. E-mail: david@adamsonanddobbin.com Address: PO Box 13264
+07 Pido Road City: Peterborough State: ON ZIP Code: K9J 7H5",,,,,,Hom
+e,743 7790,Other,742 4524,Work,745 5751,,,,,,,,,,,Adamson And Dobbin 
+Ltd. Mechanical Contractors,,General Manager,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,"First Name: Chapleau Last Name: Kathy, Ken A
+ddress: 666 FrankFirst Name: Chapleau Last Name: Kathy, Ken City: 666
+ Frank",,,,,,Home,876-9863,,,,,,,,,,,,,,,Admiralty Hall,,Accountant,,
+,,,,,,,,,
[download]

The result would be a hash like this:

$VAR1 = {
          'First Name' => 'Dobbin',
          'ZIP Code' => 'K9J 7H5',
          'Address' => 'PO Box 1326407 Pido Road',
          'Last Name' => 'David L.',
          'City' => 'Peterborough',
          'E-mail' => 'david@adamsonanddobbin.com',
          'State' => 'ON'
        };
$VAR1 = {
          'First Name' => 'Chapleau',
          'Address' => '666 Frank',
          'Last Name' => 'Kathy, Ken',
          'City' => '666 Frank'
        };
[download]

Please note that my extraction of the notes field is only done using a regex for convenience. Your approach using Text::CSV is clearly the right way to do it.

When using a hash you would lose any duplicates. So if there are two first names, only one of them would survive.

[reply]
[d/l]
[select]