Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Baffled by data cleaning regex issue

by hesco (Deacon)
on Nov 18, 2008 at 23:45 UTC ( [id://724424]=perlquestion: print w/replies, xml ) Need Help??

hesco has asked for the wisdom of the Perl Monks concerning the following question:

Hello all:

So as not to further distract the cb from its discussion of Zappa's life's work, I bring this question here.

I am working on a script to clean and import a large csv data file into a postgres backend. My duplicate checks are failing to prevent fatal errors related to unique data constraints applied to some tables. As I drill into this issue, what I'm finding is that my regex's are not working as I thought they would.

#!/usr/bin/perl -w use strict; use warnings; open ('CSV','<',$file); while (<CSV>){ $counter++; @fields = split ",",$_; my($ndx); foreach $ndx (0 .. (0 + @fields -1)){ $fields[$ndx] =~ s/\s*$//g; $fields[$ndx] =~ s/'/\\'/g; $fields[$ndx] =~ s/"//g; if($ndx == 32){ print STDERR "For field $ndx, the value is now: |$fields[$ndx]| +\n"; my @chars = split(//, $fields[$ndx]); foreach my $chr (@chars) { print STDERR '|' . ord($chr) . '|,'; } print STDERR "\n"; } } close('CSV');
This is yielding debug output which looks like this:

For field 32, the value is now: |096 | |48|,|57|,|54|,|32|,|32|,
Field #32 is only a representative sample. This issue is showing up on several fields actually.

Can anyone please explain why this is happening for me, please? And more important what I can do about it?

Thanks,

-- Hugh

if( $lal && $lol ) { $life++; }

Replies are listed 'Best First'.
Re: Baffled by data cleaning regex issue
by ikegami (Patriarch) on Nov 18, 2008 at 23:49 UTC

    Can anyone please explain why this is happening for me, please?

    Mostly cause you didn't use Text::CSV_XS.

    Update: Or maybe not. You didn't say what was wrong. I thought you were talking about your other problem.

Re: Baffled by data cleaning regex issue
by gone2015 (Deacon) on Nov 19, 2008 at 00:26 UTC

    Splitting a CSV line on ',' is optimistic, if you haven't dealt with any '"' sections... Wiping out all '"' is a bit brutal if you haven't dealt with '""' pairs... Unless you know that the data never contains either of these in '"' sections.

    People refer to CSV files as if there was some well defined standard. Using one of the CSV parsers will at least give you access to others' experience of the different ways in which CSV has been interpretted.

    But addressing your posting, I'm with brother ikegami on this: most likely there's a "096  " field. If so, should you be stripping spaces that were, notionally, "protected" by the '"'s ? I note also that while you strip trailing whitespace from items, you don't strip leading whitespace -- I cannot tell if that's deliberate :-)

Re: Baffled by data cleaning regex issue
by ikegami (Patriarch) on Nov 18, 2008 at 23:55 UTC
    Are you wondering why there are trailing spaces? I'm guessing it's because you started with '"096  "'.
Re: Baffled by data cleaning regex issue
by eye (Chaplain) on Nov 19, 2008 at 01:26 UTC
    Building on what oshalla said, you get yourself into trouble with your split unless something about your data precludes the presence of commas within quotes. If there is a quoted comma, you will end up with more fields than you expected.

    Also, in debugging a problem like this, it is extremely helpful to compare your debugging output to the input data.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://724424]
Approved by ikegami
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (3)
As of 2024-03-29 02:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found