azaria has asked for the wisdom of the Perl Monks concerning the following question:

Hello There,
Lets say I want to read a file with the following content (a CSV format)

First Name, Last Name, Address,Tel Sara, Bon, "Hashikma 5 , Austin", 00 Bunnjy,"Doron", "Eug , Germany",22

my intenstion is to get the Address column for each line -

while(<FILE>) { @item = split(/\,/,$_); } close(FILE); print $item[2],"\n";

The problem is that for the second line I get - "Hashikma 5

How can I get - "Hashikma 5 , Austin" ?

20051031 Janitored by Corion: Added formatting

Replies are listed 'Best First'.
Re: CSV file
by prasadbabu (Prior) on Oct 31, 2005 at 12:37 UTC
      I loved that, thanks !!! Simona
Re: CSV file
by rev_1318 (Chaplain) on Oct 31, 2005 at 12:38 UTC
    Have a look at the Text::CSV_XS module. that can do just what you want.

    Paul

Re: CSV file
by cyclist38 (Hermit) on Oct 31, 2005 at 12:41 UTC
Re: CSV file
by pajout (Curate) on Oct 31, 2005 at 12:55 UTC
    Other people will probably advice you some module for parsing CSV files.
    I solved this problem a year ago, and I found this trick:
    sub parseCSVrow { my ($row, $fdelim, $tdelim) = @_; chomp($row); my @ret = split(/$fdelim/,$row); my $i = 0; while($i < @ret) { if (($ret[$i] =~ s/$tdelim/$tdelim/ge) % 2) { if ($i + 1 == @ret) { die "ERROR: not ending text"; } else { $ret[$i] = $ret[$i].$fdelim.$ret[$i+1]; splice(@ret,$i+1,1); } } else { $i++; } } return \@ret; }
    The idea is to join previously split parts, if some part contains odd count of text delimiters. But, there is one big mistake. CSV field can contain EOL and in this case my code piece does not work.
    Update: Respectively, works, but you cannot naively call this subroutine, but slighlty change the code.
      That's a nice trick, but will it work if you have escaped delimiters in your input?

      CountZero

      "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

        I disagree:
        When field delimiter appears as part of field value (beside text delimiters), the row will be split by this, but the parts will be joined after that.
        And I do not remember if the field value can contain text delimiter - I think that yes, but doubled - and count of text delimiters modulo 2 is invariant in this case.
      Thanks for your reply !!! What is the meaning of $fdelim, $tdelim ? Azaria
        $tdelim is text delimiter, typically quotes or apostrophes, and $fdelim is field delimiter, typically comma (Comma Sepparated Value).
Re: CSV file
by GrandFather (Saint) on Oct 31, 2005 at 19:32 UTC

    You might like to follow this thread and this thread for discussions of related CSV problems. There are many more such threads if you do a little searching!


    Perl is Huffman encoded by design.
Re: CSV file
by Moron (Curate) on Oct 31, 2005 at 15:13 UTC
    Or say it in regexp:
    while( <FILE> ) { if ( /^([^\,]*\,\s*[^\,]*)\,\s*(.*)$/ ) { print "Name: $1\n"; print "Address: $2\n"; } }
    To explain the regexp: the first ^ matches beginning of line, the ([^\,]*) captures the next contiguous set of non comma characters and places it in the first "numbered" variable ($1), the \,\s* steps over the first comma and any immediately trailing space, (update: that is now repeated to allow for the fact that there is a comma in the name field), the (.*) picks up everything thereafter placing it in the $2 variable and the $ matches with the end of the line.

    -M

    Free your mind

      ([\,]) only matches a single non-comma character. It is missing a quantifier such as + or *.

      Everytime I see (.*) I feel very uneasy, and usually I can think of a much simpler way to accomplish the same thing. In this case, all the regular expression is really doing is finding the first comma! Now, split normally splits on all the delimiters it finds in a string, but it can take an optional third argument which tells it exactly how much to split the line; in this case we just want two fields. As a bonus we can give the split fields names so we don't have to use $1 and $2.

      while (<FILE>) { chomp; my ($name, $address) = split /,\s+/, $_, 2; print "$name lives at $address\n"; }

      Another thing that would make me think twice about your solution is that you've got an if statement, but there's no sensible behaviour defined for when the if is false! Although I have to say that using a proper CSV module (Text::CSV, etc) is always preferable.

      --
      integral, resident of freenode's #perl
      
        I spotted the lack of * shortly after posting.

        re .*: having eliminated the first column and its delimiter, .* must be what is left over as the second and last column - why make the engine work harder?. It really is quite common that the remainder of a string is defined by elimination of specific features of the preceding characters and needs no further analysis - only when the .* is attempted earlier than such elimination is completed does it cause a problem.

        re "if" - I opted for this rather than allowing any exceptions to just fall through the code, whereas no else is defined because neither was any provision for such exceptions stated in the requirement. Not seeing the benefit of this turns out to be a trap for habitual negative thinking: If there was a blank line in the data, your own solution, which apart from not filtering exceptions is fine, would accept the blank data and, for example, enter it into the database!

        -M

        Free your mind

      That regex doesn't even work on the OP's data, and neither will it correctly split strings which contain delimiters within quotes. Read the top node again and you'll probably realise that a regular expression is not the right solution here.


      Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
        I missed the fact there is a comma in the name field too and modified the regular expression accordingly. Someone more handy at regexp's could probably factorise this more succinctly, however.

        But if we analyse the logic behind your argument we get "an error in analysing the requirement implies that all technical solutions resembling it must be wrong" - which I would hope reveals why I disagree with the argument, while remaining open-minded about variety of possible solutions.

        -M

        Free your mind

Re: CSV file
by Moron (Curate) on Nov 01, 2005 at 10:50 UTC
    I didn't have time yesterday to put this idea into code, but since no-one else did it meanwhile, here it is. The crux of the OP is the need for a tiny enhancement to split, which is as simple as this:
    sub MySplit2 { # args being like split but with the third INSTEAD a # preservation delimiter -- the OP requirement seems # to suggest that should be -1 for split anyway my $pat = shift; my $str = ( shift () || $_ ); my $pat2 = shift; # a preservation delimiter # i.e. not splits usual arg3 my @return = (); my $inquotes = 0; for my $qs ( defined ( $pat2 ) ? ( split( /$pat2/, $str, -1 ) ) : ( $str ) ) { push @return, $inquotes ? $qs : split( /$pat/, $qs, -1 ); $inquotes = !$inquotes; } return @return; }
    But even if taken to the extreme of including all split's existing bells and whistles, it still manages to remain reasonable in size:
    # example: # while (<FILE>) { # ( $name, $address, $tel ) = MySplit( ',',,,'"' ); # } sub MySplit { # a wrapper for split # args being like split but with a fourth for optional # preservation delimiter my $pat = shift; my $str = ( shift () || $_ ); my $cnt = shift; # copy of split's normal third argument # will be iterated down to zero here my $pat2 = shift; # the new argument my @return = (); my @topush; my $inquotes = 0; for my $qs ( defined ( $pat2 ) ? ( split( /$pat2/, $str, -1 ) ) : ( $str ) ) { $cnt or ( defined( $cnt ) and last ); # arg3>0 exhaustion @topush = $inquotes ? ($qs) : split( /$pat/, $qs, $cnt ); # update @topush and $cnt to agree with each other if ( defined( $cnt ) and ( $cnt >= 0 ) ) { $cnt -= ( $#topush + 1 ); if ( $cnt < 0 ) $#topush += $cnt; $cnt = 0; } } $inquotes = !$inquotes; push @return, @topush; } return @return; }

    -M

    Free your mind