sier has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys,

You provided me some expert help the last question I asked here, and now that I've been beating my head against a wall for hours now, I figured I'd see if anyone can help me out.

I have a script that accesses an SMTP folder, downloads an email with a CSV attached, parses the CSV, and sends emails to anyone listed in the CSV. Sounds fairly straight-forward, right? The CSV is in this format:

contactname,email,city
Luke,luke@test.com,Kansas City

My problem is the client REFUSES to send a standard CSV file each time. He changes the header names (from "city" to "country"), capitalizes them, rearranges them, etc. I simply can't sit with the guy and force him to use a correctly formatted CSV file, so I am forced to idiot-proof my script. Now he wants several people to use the same mailing system, and I need a way to dynamically sniff the headers in the CSV.

I am using the Tie::Handle::CSV module to do most of my parsing, but my headers are hard-coded like this:

my $fh = Tie::Handle::CSV->new($csvfile, header => 1, + open_mode => "< :encoding(ISO 8859-1)"); while (my $csv_line = <$fh>) { $lookup_email = $csv_line->{'email'} ...


I have just started working with Perl, so bear with me as this might be an easy question...

How do I dynamically display what the names of the headers are? I am already telling the module that the headers are located on line 1, but now I need to write each of them to their own string (or throw them into an array) so I can use them later.

Sorry for being such newb :). Thanks!

Replies are listed 'Best First'.
Re: Tie::Handle::CSV dynamically detecting header names
by Herkum (Parson) on Mar 25, 2009 at 17:01 UTC

    Instead of trying to figure out the headers and formatting on the fly, you should probably prep the header row. For example, load the file, read the header line and then make it all lowercase and save it somewhere. Then use the new file for processing in your script.

    A lot of places spend time prepping/formatting data before they actually process it. It is a great place to catch errors that might otherwise be overlooked. As for the guy changing 'City' to 'Country' well there are limits on what you can be expected to do. Garbage In will result in Garbage Out.

      I've got code that does something like this. I set up a "translation alias DB" (which is just a text file that I set up to be easily edited and parsed by me). So this winds up essentially being a hash table that shows that a,b,c,x all mean City, etc. So there is a "normalization" step.

      This alias table just gets built empirically. If some name shows up that doesn't translate, I add it to the alias DB. If you are lucky this process converges and fewer and fewer "new terms" show up. Sounds like you need to be alerted when new columns show up too.

      One idea is to use DBI::CSV, that way you can make SQL queries using the standard "normalized" terms. And if some new column is in there, it won't matter. Of course if this user puts the country into the city column you will have to fix the file manually before using it! In your case say: name, contactname, organization or whatever all mean "name" or "contactname"; e-mail, EMail means email, etc.

      update:Looks like your code has the name vs column thing figured out! Great job! You do want to "disconnect" column number from field name. Also for some of my users I wrote Excel spreadsheet app for them to help thing out. Sometimes helping the user generate better data is a good way to go as it saves hassle at your end!

Re: Tie::Handle::CSV dynamically detecting header names
by roboticus (Chancellor) on Mar 26, 2009 at 13:21 UTC
    sier:

    <disclaimer>I commend your desire to make your system flexible and useful, and I'm certainly not a BOFH, but...</disclaimer>

    Sometimes, you can just make things too hard on yourself. I'm not saying that this is the case here--but in general, trying to make a system idiot proof is impossible. You can bend over backwards trying to do it, but if the users continually send you any random crap and expect your code to parse it and deliver the messages, neither you nor your clients are going to be happy with the system. In cases similar to this, I simply have the system look at the header, and if it can't find the columns it wants, then send an error EMail telling the client what the problem(s) are, and let them fix the problem and send it again. Since they're using CSV files, they're likely to be generating 'em from Excel, so they can rename the columns appropriately...

    Again, I'm not advocating that you frustrate your clients. I'm simply suggesting that sometimes there are situations that can be (1) a time sink, (2) continually frustrating, and (3) anti-profitable. Just make sure that you're not putting yourself in a situation where you're going to have to continually "fix" your code because your user can't give you what the program wants. Been there, done that, got the T-shirt, washed it enough times for the lettering to fade...

    ...roboticus
      I actually have those checks in place right now, believe it or not. If the correct headers are not found in the CSV, it emails them stating that the CSV is improperly formatted with an attachment of a correctly formatted CSV for them to use.

      Alas, my employer knows this client is important, so it is their request that I make it idiot-proof. It's very frustrating. Sometimes being a programmer really sucks.
Re: Tie::Handle::CSV dynamically detecting header names
by blahblahblah (Priest) on Mar 25, 2009 at 22:55 UTC
    Text::xSV is worth a look. You can read in the header row, add aliases to those headers, and then fetch the rest of the rows' data by header name/alias.

    This module has a couple of other features that make it my preferred module for handling unpredictable user-uploaded csv files. It handles multiline data well, it lets you to apply filters to pre-process the data, and it has very easy-to-use error handling.