Well...

I have a lot of gzipped CSV files with data from a legacy system that I have to import. I don't want to uncompress them on disk, and I don't want to read them entirely into memory. I also have to support multiline rows in the CSV files (fields enclosed by a string delimiter).

After searching for modules, I didn't find any that could read from a filehandle that is a pipe from gunzip, nor any that supported multiline rows. So I ended up with the following code:

This is the parser...

package CSVParse;
use strict;
use warnings;

sub new {
    my ($class, $fh) = @_;
    my $self = {
        string_delim => '"',
        escape_char  => "\\",
        field_delim  => ',',
        reg_delim    => "\n",
        fh           => $fh,
    };
    return bless $self, $class;
}

# We assume the handle is positioned at the start of a column.
sub fetch_column {
    my $self     = shift;
    my $context  = 'raw_data';
    my @contexts = ();
    my $data     = undef;
    while (1) {
        # One-character buffer.
        my $buf;
        read($self->{fh}, $buf, 1) or do { $self->{EOF} = 1; last; };
        $self->{last_char_read} = $buf;
        if ($context eq 'escape') {
            # The escaped character is taken literally.
            $data = '' unless defined $data;
            $data .= $buf;
            $context = pop @contexts;
        } elsif ($context eq 'string') {
            if ($buf eq $self->{string_delim}) {
                $context = pop @contexts;
            } else {
                $data = '' unless defined $data;
                $data .= $buf;
            }
        } else {
            if ($buf eq $self->{escape_char}) {
                push @contexts, $context;
                $context = 'escape';
            } elsif ($buf eq $self->{string_delim}) {
                push @contexts, $context;
                $context = 'string';
            } elsif ($buf eq $self->{field_delim} || $buf eq $self->{reg_delim}) {
                # Exit the loop. The delimiter stays consumed: a pipe cannot
                # seek() back one character, and fetch_row only needs
                # last_char_read to detect the end of a row.
                last;
            } else {
                $data = '' unless defined $data;
                $data .= $buf;
            }
        }
    }
    return $data;
}

sub fetch_row {
    my $self = shift;
    return undef if $self->{EOF};
    my @cols = ();
    # We assume the handle starts at a valid position (the start of a row).
    while (1) {
        my $col = $self->fetch_column();
        if ($self->{EOF}) {
            # Keep the last column of a file that lacks a trailing newline.
            push @cols, $col if defined $col;
            last;
        }
        push @cols, $col;
        # A record delimiter ends the row; a field delimiter continues it.
        last if $self->{last_char_read} eq $self->{reg_delim};
    }
    # Hitting EOF right after the final newline yields no columns: no row.
    return @cols ? \@cols : undef;
}

sub parse_file {
    my $self = shift;
    my @rows = ();
    while (1) {
        my $cols = $self->fetch_row();
        last unless defined $cols;
        push @rows, $cols;
    }
    return @rows;
}

1;

And this is some sample code...

# Use "or" rather than "||": "||" binds to the command string, so die would never run.
open(my $tabelaclientes, '-|', 'gunzip -c somefile.csv.gz') or die "Cannot run gunzip: $!";
my $csv = CSVParse->new($tabelaclientes);
while (1) {
    my $row = $csv->fetch_row();
    last unless defined $row;
    for (@$row) { utf8::decode($_); }
    print join(",", @$row) . "\n";
}
close $tabelaclientes;
daniel

In reply to CSV Parse on filehandle by ruoso
