Hi Athanasius,

The main reason behind my initial posting was that, when I used a regular expression to parse a CSV file with one or more lines, as shown in the code below, it always worked fine. I couldn't figure out why the print outside the while loop didn't work when using quotewords from Text::ParseWords.

    $file = $ARGV[0] or die "Missing CSV file on the command line\n";
    open($text, '<:encoding(UTF-8)', $file) or die "Could not open '$file' $!\n";

    @fields = ();    # initialize @fields to be empty
    while ($line = <$text>) {
        chomp($line);    # remove the newline at the end of the line
        while ($line =~ m/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/g) {
            push(@fields, defined($1) ? $1 : $3);    # add the matched field
        }
        # push(@fields, undef) if $line =~ m/,?/;    # account for an empty last field
    }

    foreach $field (0..$#fields) {
        print $field + 1 . " $fields[$field]\n";
    }
    close $text;    # close the filehandle, not the filename

I am a newcomer to Perl and I am really enjoying it. Thanks for your help.

Re^3: Parsing CSV file
by Athanasius (Archbishop) on Jul 07, 2016 at 10:35 UTC

    Hello Joma,

    I’ll make three observations on the regex code shown:

    1. There’s no point in capturing to $2 if that capture is never used. It would be better to use a non-capturing group here:

      while ($line =~ m/"([^"\\]*(?:\\.[^"\\]*)*)",?|([^,]+),?|,/g) {
      #                          ^^^
          push(@fields, defined($1) ? $1 : $2);
      #                                    ^^
      }

      See perlretut#Non-capturing-groupings.
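To make the renumbering concrete, here is a minimal sketch that runs both versions of the regex against a sample line and counts the capture groups each one returns. The sample line and the helper name `first_match_groups` are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Original pattern: the inner group captures, so the unquoted field is $3.
my $with_capture    = qr/"([^"\\]*(\\.[^"\\]*)*)",?|([^,]+),?|,/;
# Revised pattern: the inner group is (?:...), so the unquoted field is $2.
my $without_capture = qr/"([^"\\]*(?:\\.[^"\\]*)*)",?|([^,]+),?|,/;

# Hypothetical helper: return the capture groups of the first match.
sub first_match_groups {
    my ($re, $line) = @_;
    my @groups = $line =~ $re;    # list context yields ($1, $2, ...)
    return @groups;
}

my $line = 'plain,x';             # assumed sample input
my @three = first_match_groups($with_capture,    $line);
my @two   = first_match_groups($without_capture, $line);

printf "%d groups vs %d groups\n", scalar(@three), scalar(@two);
print "unquoted field is in slot ", $#three + 1, " vs slot ", $#two + 1, "\n";
```

With the non-capturing group there is one slot fewer to keep track of, and the unquoted field moves from the third slot to the second.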

    2. When testing for definedness, Perl’s // (logical defined-or) operator is useful and elegant:

      push @fields, $1 // $2;

      See perlop#Logical-Defined-Or.
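A small sketch of why `//` rather than `||` is the right operator here: a field containing `0` is defined but false, so the two operators disagree on it. The helper names below are hypothetical, and `//` needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
sub pick_defined_or { my ($x, $y) = @_; return $x // $y; }    # tests definedness
sub pick_logical_or { my ($x, $y) = @_; return $x || $y; }    # tests truth

print pick_defined_or('0', 'fallback'), "\n";      # '0' is defined, so it is kept
print pick_logical_or('0', 'fallback'), "\n";      # '0' is false, so the fallback wins
print pick_defined_or(undef, 'fallback'), "\n";    # undef falls through to the fallback
```

For CSV data this means `$1 // $2` keeps a quoted field such as `"0"` or `""`, where `$1 || $2` would silently discard it.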

    3. If you had use warnings at the head of your script (and you should!), you would get a Use of uninitialized value warning each time you try to print an array element whose value is undef. You can fix this easily by substituting an empty string:

      push @fields, $1 // $2 // '';
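Putting the three observations together, the inner loop might look something like the sketch below. The sub name `split_fields` and the sample line are assumptions for illustration; the regex is the one from the thread with the non-capturing group applied.

```perl
use strict;
use warnings;

# Hypothetical helper: split one CSV line into fields using the thread's regex.
sub split_fields {
    my ($line) = @_;
    my @fields;
    while ($line =~ m/"([^"\\]*(?:\\.[^"\\]*)*)",?|([^,]+),?|,/g) {
        # $1 is a quoted field, $2 an unquoted one; a bare comma sets neither,
        # so fall back to an empty string instead of pushing undef.
        push @fields, $1 // $2 // '';
    }
    return @fields;
}

my @fields = split_fields('a,,"c d"');    # assumed sample input
printf "%d %s\n", $_ + 1, $fields[$_] for 0 .. $#fields;
```

Because the empty middle field is stored as `''` rather than `undef`, printing it under `use warnings` no longer triggers the uninitialized-value warning.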

    Update: ++choroba for pointing out that the Branch Reset pattern (perlre#Extended-Patterns) is a more elegant option here.

    Hope that helps,

    Athanasius <°(((>< contra mundum
    Iustus alius egestas vitae, eros Piratica,

      Using $1 // $2 smells like you can use the Branch Reset pattern (5.10+), which restarts the capture group numbering on each | :

      while ($line =~ m/(?|"([^"\\]*(?:\\.[^"\\]*)*)",?|([^,]+),?|,)/g) {
          print $1 // q(), "\n";
      }
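For a self-contained version of the same idea, here is a sketch that collects the fields into an array instead of printing them. The sub name `split_fields_branch_reset` and the sample line are made up for illustration; the pattern is the branch-reset regex shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: inside (?|...) every alternative numbers its groups
# from 1, so both quoted and unquoted fields arrive in $1.
sub split_fields_branch_reset {
    my ($line) = @_;
    my @fields;
    while ($line =~ m/(?|"([^"\\]*(?:\\.[^"\\]*)*)",?|([^,]+),?|,)/g) {
        # The bare-comma alternative has no group, so $1 is undef there.
        push @fields, $1 // '';
    }
    return @fields;
}

my @fields = split_fields_branch_reset('"x, y",2,,z');    # assumed sample input
printf "%d %s\n", $_ + 1, $fields[$_] for 0 .. $#fields;
```

Only `$1` has to be checked, whichever alternative matched, which is what makes the `$1 // $2` chain unnecessary.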

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,