Well...

I have a lot of gzipped CSV files with data from a legacy system that I have to import. I don't want to uncompress them on disk and I don't want to read them entirely into memory. I also have to support multiline rows in the CSV files (fields enclosed by a string delimiter).

After searching for modules, I didn't find one that could take a filehandle that is a pipe from gunzip, nor one that supports multiline rows... So I ended up with the following code:

This is the parser...

package CSVParse;
use strict;
use warnings;

sub new {
    my ($class, $fh) = @_;
    my $self =
      {
       string_delim => '"',
       escape_char => "\\",
       field_delim => ',',
       reg_delim => "\n",
       fh => $fh,
      };
    return bless $self, $class;
}

# we assume the handle is positioned at the start of a column!
sub fetch_column {
    my $self = shift;
    my $context = 'raw_data';
    my @contexts = ();
    my $data = undef;

    while (1) {
        # char buffer
        my $buf;
        read($self->{fh},$buf,1) or do {
            $self->{EOF} = 1;
            last;
        };
        $self->{last_char_read} = $buf;
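        if ($context eq 'escape') {
            # the escaped character is taken literally
            $data = '' unless defined $data;
            $data .= $buf;
            $context = pop @contexts;
        } elsif ($context eq 'string') {
            if ($buf eq $self->{escape_char}) {
                # allow an escaped delimiter inside a quoted string
                push @contexts, $context;
                $context = 'escape';
            } elsif ($buf eq $self->{string_delim}) {
                $context = pop @contexts;
            } else {
                # field and record delimiters are plain data here,
                # which is what makes multiline rows work
                $data = '' unless defined $data;
                $data .= $buf;
            }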
        } else {
            if ($buf eq $self->{escape_char}) {
                push @contexts, $context;
                $context = 'escape';
            } elsif ($buf eq $self->{string_delim}) {
                push @contexts, $context;
                $context = 'string';
            } elsif ($buf eq $self->{field_delim} ||
                 $buf eq $self->{reg_delim}) {
                # the delimiter stays consumed and is remembered in
                # last_char_read; a pipe cannot seek back anyway
                # leave the loop
                last;
            } else {
                $data = '' unless defined $data;
                $data .= $buf;
            }
        }
    }

    return $data;
}

sub fetch_row {
    my $self = shift;
    return undef if $self->{EOF};
    my @cols = ();
    # we assume the handle is positioned at the start of a row
    while (1) {
        my $col = $self->fetch_column();
        if ($self->{EOF}) {
            # keep a final column that is not terminated by a newline
            push @cols, $col if defined $col;
            last;
        }
        push @cols, $col;
        # a record delimiter ends the row; a field delimiter continues it
        last if $self->{last_char_read} eq $self->{reg_delim};
    }
    # reaching EOF without reading any data means there is no row left
    return @cols ? \@cols : undef;
}

sub parse_file {
    my $self = shift;
    my @rows = ();
    while (1) {
        my $cols = $self->fetch_row();
        last unless defined $cols;
        push @rows,$cols;
    }
    return @rows;
}

1;

And this is some sample code...

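# "or" instead of "||": with "||" the die binds to the string and never fires
open my $tabelaclientes, '-|', 'gunzip', '-c', 'somefile.csv.gz'
  or die "cannot run gunzip: $!";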
my $csv = CSVParse->new($tabelaclientes);
while (1) {
  my $row = $csv->fetch_row();
  last unless defined $row;
  for (@$row) {
    utf8::decode($_);
  }
  print join(",",@$row)."\n";
}
close $tabelaclientes;
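If you want to check the multiline and escape handling without a gzip file at hand, the same read-one-character idea can be tried against an in-memory filehandle. This is only a minimal standalone sketch, not the CSVParse module above: `next_row` is a hypothetical helper, and the delimiters (`"`, `\`, `,` and newline) are hard-coded on the assumption that they match the defaults used in the parser.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of CSVParse: reads one row from $fh,
# one character at a time, and returns an array ref (undef at EOF).
sub next_row {
    my ($fh) = @_;
    my @cols;
    my $field     = '';
    my $in_string = 0;    # true while inside "..."
    my $seen      = 0;    # true once any character of the row was read

    while (read($fh, my $ch, 1)) {
        $seen = 1;
        if ($ch eq "\\") {
            # escape: take the next character literally
            read($fh, my $next, 1) or last;
            $field .= $next;
        } elsif ($ch eq '"') {
            $in_string = !$in_string;
        } elsif (!$in_string && $ch eq ',') {
            push @cols, $field;
            $field = '';
        } elsif (!$in_string && $ch eq "\n") {
            push @cols, $field;
            return \@cols;
        } else {
            $field .= $ch;    # includes newlines inside a quoted field
        }
    }
    return undef unless $seen;
    push @cols, $field;       # last row without a trailing newline
    return \@cols;
}

# In-memory filehandle standing in for the gunzip pipe.
my $csv = qq{1,"first line\nsecond line",x\n2,"a \\"quoted\\" word",y\n};
open my $fh, '<', \$csv or die "cannot open in-memory handle: $!";

my @rows;
while (defined(my $row = next_row($fh))) {
    push @rows, $row;
}
close $fh;

printf "%d rows, %d columns in the first\n", scalar @rows, scalar @{ $rows[0] };
```

Since only `read` is used on the handle, swapping the in-memory handle for the gunzip pipe should need no other change, because nothing here seeks.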

daniel

In reply to CSV Parse on filehandle by ruoso
