Thargor has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks

I am trying to read in two files with Text::CSV_XS, collect all the rows that occur in both files into a hash, and output that hash to another CSV file. I think I am going to need a hash of arrays, with the field I am comparing on as the hash key.

So what I want to know is how to read an entire CSV file. From the documentation on CPAN I haven't been able to tell whether getline($io) returns only one line, and if it does, how do I keep track of which line of the file I am on? If any of the wise monks could explain how to efficiently read in a CSV file with Text::CSV_XS, it would help a lot.

Update:

I am pretty sure what I need is to read the lines of one file and put them all into a hash, using my comparison variable as the key, then read the other CSV file and push its lines into the same hash, thereby getting rid of any duplicates. This should work because I know where the data is coming from and know that the key I would be using is definitely unique, and it seems like it would be fairly fast. I just don't know how to read in an entire CSV file because I am horrible with file I/O.


Update 2: OK, I have everything reading in like it is supposed to, thanks tlm, and have started outputting the data. Since I am just fumbling with this in my spare time, and with my near-total ignorance of Perl, it can be tough. I started using an IO::File handle because that is what the Text::CSV_XS documentation says to use; my information writes to the file, but it all ends up on the same line. My code for the output is as follows:

my $fh = new IO::File "> output.csv";
foreach my $hash ( %hashed ) {
    if ( $hashed{$hash} ) {
        $status = $csv->print( $fh, $hashed{ $hash } );
    }
}
$fh->close;
So is there something I have done incorrectly in the output, or should I work on my %hashed information? Thanks in advance.

Re: Question about text::csv_xs
by jZed (Prior) on Apr 27, 2005 at 20:21 UTC
    #!perl -w
    use strict;
    use Text::CSV_XS;
    use IO::File;

    my $csv   = Text::CSV_XS->new( { binary => 1 } );
    my $fh    = IO::File->new('myfile.csv');
    my $count = 0;
    while ( my $cols = $csv->getline($fh) ) {
        last unless @$cols;
        print ++$count . ". @$cols\n";
    }
    update Or, to create your file #3 from files #1 and #2, I'd do:
    #!perl -w
    use strict;
    use DBI;

    my $dbh = DBI->connect("dbi:CSV(RaiseError=1):csv_eol=\n");
    $dbh->{csv_tables}->{tbl1}->{file} = 'tmp.csv';
    $dbh->{csv_tables}->{tbl2}->{file} = 'tmp2.csv';
    $dbh->{csv_tables}->{tbl3}->{file} = 'combined.csv';
    $dbh->do("CREATE TABLE tbl3 AS SELECT * FROM tbl1 NATURAL JOIN tbl2");
Re: Question about text::csv_xs
by tlm (Prior) on Apr 27, 2005 at 20:35 UTC

    getline parses a single line, but if you want to keep the original lines (so that you can determine the lines that appear in both files), then I don't think it is very useful to you. How about this (untested!)?:

    use strict;
    use warnings;
    use Text::CSV_XS;

    my ( $h1, $h2 ) = map get_parsed_lines( $_ ), qw( file_1 file_2 );
    my %keep = map  +( $_ => $h1->{ $_ } ),
               grep exists $h2->{ $_ },
               keys %$h1;

    sub get_parsed_lines {
        my $file = shift;
        my $csv  = Text::CSV_XS->new();
        my %hash;
        open my $in, '<', $file or die "Read failed: $!\n";
        while ( <$in> ) {
            $csv->parse( $_ ) or do { warn "Bad data: $_\n"; next };
            $hash{ $_ } = [ $csv->fields() ];
        }
        close $in or die "Failed to close $file: $!\n";
        return \%hash;
    }
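    If you also need to write the surviving rows out to a third CSV file, something along these lines might work (again untested; the output file name and the eol attribute are my own additions, not part of your spec):

    my $out_csv = Text::CSV_XS->new( { eol => "\n" } );   # eol appends the record terminator
    open my $out, '>', 'combined.csv' or die "Write failed: $!\n";
    $out_csv->print( $out, $keep{ $_ } ) for sort keys %keep;
    close $out or die "Failed to close combined.csv: $!\n";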

    the lowliest monk

      Thank you for your solution; it works until some of the information in the two tables differs for the same entry. Say I have one person, and part of the CSV record is the date they posted something. If one date is 4/25/2005 and the other is 5/1/2005, I still only want one copy of that person in the combined CSV file. Since the solution puts the entire line into the keys of the hash, this causes a problem. Is there an easy way to use just one of the fields of the CSV record as the hash key and put all of the information into the value? I hope that makes sense.

        Sure, that would be very easy. Suppose you wanted to use the third field as the hash key. Then get_parsed_lines would be changed to:

        sub get_parsed_lines {
            my $file = shift;
            my $csv  = Text::CSV_XS->new();
            my %hash;
            open my $in, '<', $file or die "Read failed: $!\n";
            while ( <$in> ) {
                $csv->parse( $_ ) or do { warn "Bad data: $_\n"; next };
                my @row = $csv->fields();
                $hash{ $row[ 2 ] } = \@row;
            }
            close $in or die "Failed to close $file: $!\n";
            return \%hash;
        }
        More generally, you may want to use several fields, e.g., fields 0, 2 and 3. In this case, just change the last line in the loop to
        $hash{ join( ',', @row[ 0, 2, 3] ) } = \@row;

        the lowliest monk

Re: Question about text::csv_xs
by Tanktalus (Canon) on Apr 27, 2005 at 23:19 UTC

    Looks to me like you really don't need CSV at all, but you need Text::Diff. Or, if you do need to do some mangling to the CSV, maybe Algorithm::Diff would be of help?
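    For instance, a rough, untested sketch (the file names are made up, and note that Text::Diff compares the files as plain text rather than as parsed CSV):

    use strict;
    use warnings;
    use Text::Diff;

    # diff() accepts plain file names and returns the differences as one string
    my $diff = diff 'file_1.csv', 'file_2.csv', { STYLE => 'Unified' };
    print length $diff ? $diff : "No differences found\n";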

      That was my first thought also (that the OP doesn't really need CSV parsing), but a) two CSV tables can contain identical rows in different orders so a diff approach may not work, b) fields may contain embedded newlines so a line-oriented approach may not work, c) if requirements change and the comparison is done on key fields rather than whole records, the CSV approach will scale up.
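      To illustrate (b) with a small, untested sketch (the data is made up): a quoted field can legally contain a newline, so one CSV record may span two physical lines, and a line-by-line comparison would see two unrelated "lines" where the CSV parser sees one record:

      use strict;
      use warnings;
      use Text::CSV_XS;

      my $csv = Text::CSV_XS->new( { binary => 1 } );               # binary => 1 allows embedded newlines
      open my $fh, '<', \qq{1,"line one\nline two",3\n} or die $!;  # in-memory handle holding one CSV record
      my $row = $csv->getline($fh);                                 # getline reads the whole logical record
      print scalar(@$row), " fields\n";                             # prints "3 fields"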
Re: Question about text::csv_xs
by Thargor (Scribe) on Apr 29, 2005 at 18:25 UTC
    OK, I am opening the file and using $csv->getline($io) to get an array ref; the only problem is that I am new to Perl and not sure how to use this array ref. Are there any tutorials about array refs? Or would someone like to give me a brief explanation of how Perl handles array refs?
      See perlreftut. Here are the basics:
      my $cols_arrayref = $csv->getline($fh);   # get an arrayref
      my @cols_array    = @$cols_arrayref;      # deref the entire array
      my $col           = $cols_arrayref->[0];  # or deref an element