http://qs1969.pair.com?node_id=660451

manav_gupta has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks

I have a file format like so:
200326951|_|rel_Access1|_|200315482|_|
200326951|_|rel_Access1|_|200315786|_|
200326951|_|rel_Access2|_|200315482|_|
200326951|_|rel_Access2|_|200315786|_|
I want to read this file and build a hash. I have the following code:
my %hash;
open(CMD1, "< test.txt");
while (<CMD1>) {
    my @elts = split(/\|_\|/, $_, -1);
    my $p = \\%hash;
    $p = \( ${$p}->{$_} ) for @elts;
}
use Data::Dumper;
print Dumper(\%hash);
However, that gives me a hash like so:
$VAR1 = {
    '200326951' => {
        'rel_Access2' => {
            '200315482' => { '' => undef },
            '200315786' => undef
        },
        'rel_Access1' => {
            '200315482' => { '' => undef },
            '200315786' => { '' => undef }
        }
    }
};
What can I do to get a hash like the following:
$VAR1 = {
    '200326951' => {
        'rel_Access2' => '200315786',
        'rel_Access1' => '200315482'
    }
};
Basically, I'm looking to discard duplicate values. This example is slightly tricky: the second line is discarded because its second field (rel_Access1) was already present from the first line, and the third line is discarded because its third field (200315482) had already appeared in the result.
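One reading of that rule — keep the first value stored for each key pair, and never reuse a value that has already been stored anywhere — can be sketched like this (the variable names and the exact dedup condition are my own guess at the intent, not code from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# The sample records from the question, inlined for a self-contained run.
my @records = (
    '200326951|_|rel_Access1|_|200315482|_|',
    '200326951|_|rel_Access1|_|200315786|_|',
    '200326951|_|rel_Access2|_|200315482|_|',
    '200326951|_|rel_Access2|_|200315786|_|',
);

my (%hash, %seen_val);
for my $line (@records) {
    my ($key1, $key2, $val) = split /\|_\|/, $line;

    # Skip this record if the value was already used anywhere,
    # or if this key pair already holds a value.
    next if $seen_val{$val} or exists $hash{$key1}{$key2};

    $hash{$key1}{$key2} = $val;
    $seen_val{$val} = 1;
}
print Dumper \%hash;
```

With the four sample lines this keeps rel_Access1 => 200315482 (first line), skips lines two and three, and stores rel_Access2 => 200315786 — matching the output asked for above.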

Many thanks!

Replies are listed 'Best First'.
Re: hash from CSV-like structure
by FunkyMonk (Chancellor) on Jan 04, 2008 at 20:40 UTC
    This does nearly what you want; perhaps you can adapt it the rest of the way (but see my question below).
    use Data::Dumper;

    my %hash;
    while ( <DATA> ) {
        chomp;
        my @elts = split /\|_\|/, $_;
        $hash{$elts[0]}->{$elts[1]} = $elts[2];
    }
    print Dumper \%hash;

    __DATA__
    200326951|_|rel_Access1|_|200315482|_|
    200326951|_|rel_Access1|_|200315786|_|
    200326951|_|rel_Access2|_|200315482|_|
    200326951|_|rel_Access2|_|200315786|_|

    Output:

    $VAR1 = {
        '200326951' => {
            'rel_Access2' => '200315786',
            'rel_Access1' => '200315786'
        }
    };

    In your sample output, why does 'rel_Access2' => '200315786' but 'rel_Access1' => '200315482'?

      Thanks FunkyMonk

      The reason behind that was getting rid of duplicates...
Re: hash from CSV-like structure
by jZed (Prior) on Jan 04, 2008 at 20:56 UTC
    I'm not really sure what your criterion for duplicates is, but this produces the results you requested from the data you gave:
    #!/usr/bin/perl
    use warnings;
    use strict;
    use Data::Dumper;

    my $hash;
    my %results;
    for my $row ( split("\n", join('', <DATA>)) ) {
        my ($key1, $key2, $val) = split('\|_\|', $row);
        next if $results{$val} and $hash->{$key1}->{$key2};
        $hash->{$key1}->{$key2} = $val;
        $results{$val} = 1;
    }
    print Dumper $hash;

    __DATA__
    200326951|_|rel_Access1|_|200315482|_|
    200326951|_|rel_Access1|_|200315786|_|
    200326951|_|rel_Access2|_|200315482|_|
    200326951|_|rel_Access2|_|200315786|_|
      Super, many thanks jZed - that works!

        To avoid undef warnings, you might want to make the duplicate check more like this:
        next if $results{$val} and $hash->{$key1} and $hash->{$key1}->{$key2};
      Thanks - this fits my needs exactly!
Re: hash from CSV-like structure
by naChoZ (Curate) on Jan 05, 2008 at 00:31 UTC
    I wrote a subroutine to do almost exactly the same thing just recently (though I ended up converting it to import into a db). Here it is with my db-related stuff commented out, fixed to populate a hashref as you've described, plus some sample usage based on your example.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;
    use Parse::CSV;

    my @list_of_fields = qw/ field1 field2 field3 field4 field5 field6 /;

    my $values = import_csv({
        filename => '/tmp/vals.csv',
        fields   => \@list_of_fields,
    });
    print "CSV Import complete...\n";
    print Dumper( $values );

    # {{{ import_csv
    #
    sub import_csv {
        my $args = shift;

        # die "No database handle provided for import...\n" unless defined $args->{dbh};
        # die "No database table provided for import...\n" unless defined $args->{table};
        die "No database columns provided for import...\n" unless defined $args->{fields};

        # my $dbh = $args->{dbh};
        my $table = ref $args->{table} eq 'ARRAY'
                  ? $args->{table}->[0]
                  : $args->{table};
        my $fields = $args->{fields} ? $args->{fields} : 'auto';

        my $csv = Parse::CSV->new(
            file     => $args->{filename},
            fields   => $fields,
            sep_char => '|',
        );

        my $results = {};
        while ( my $row = $csv->fetch ) {
            my @columns = @{ $fields };

            # Make certain that the number of values returned is
            # equal to the number of columns we're expecting.
            #
            die "Invalid number of columns...\n"
                unless scalar @columns == scalar keys %{ $row };

            # Creating placeholders this way so we'll always
            # have the exact right number of placeholders to
            # match the number of values in the columns list for
            # our sql statement.
            #
            # my @placeholders = map { '?' } @columns;
            # my @values = map { $row->{$_} } sort @columns;
            # my $results = insert_row({ dbh     => $dbh,
            #                            columns => \@columns,
            #                            table   => $table,
            #                            'values' => \@values,
            #                          });

            $results->{ $row->{field1} }->{ $row->{field3} } = $row->{field5};
        }
        return $results;
    }
    # }}}

    Output:

    # perl /tmp/test-parsing.pl
    CSV Import complete...
    $VAR1 = {
        '200326951' => {
            'rel_Access2' => '200315786',
            'rel_Access1' => '200315786'
        }
    };

    --
    naChoZ

    Therapy is expensive. Popping bubble wrap is cheap. You choose.

Re: hash from CSV-like structure
by doom (Deacon) on Jan 04, 2008 at 22:59 UTC
    Maybe this problem is an exception to the rule (because you're talking about "CSV-like" data, and not just "CSV"), but in general it's a really bad idea to try to roll your own CSV processing with regular expressions. CSV seems very simple, so you tend to think it'll be easier to just do it yourself, but there are enough odd little corner cases that you're almost guaranteed to do something wrong. For example: do you allow items with commas inside them if they're quoted correctly? If you allow quotes inside a quoted item, how do you escape the embedded quotes? Is it okay to allow spaces after the commas? If you do, does that break someone else's CSV parsing?

    When last I looked into this, your best bet was to use Text::CSV_XS (or DBD::CSV, which uses it internally), though you need to know that you always want to use the "binary" option (i.e. they got the default wrong).
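    For plain comma-separated data — not the |_| format above, since Text::CSV_XS's sep_char must be a single character — a minimal sketch of the corner cases just mentioned, with the binary option turned on, might look like this (the sample line is invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

# binary => 1 lets fields carry embedded newlines and other
# non-ASCII bytes; the module handles quoted commas and doubled
# embedded quotes, which a hand-rolled split would get wrong.
my $csv = Text::CSV_XS->new({ binary => 1 });

# A field with a quoted comma, and one with an escaped embedded quote:
my $line = '200326951,"rel,Access1","he said ""hi""",200315482';
$csv->parse($line) or die "parse failed: " . $csv->error_diag;
my @fields = $csv->fields;

print "$_\n" for @fields;
```

    Here the quoted comma stays inside the second field ("rel,Access1") and the doubled quotes collapse to one — exactly the cases a quick regex split tends to mishandle.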

      A couple of minor quibbles: DBD::CSV defaults to having binary on when it calls Text::CSV_XS, and while it's good advice to turn binary on by default, it really isn't true that "you always want to use the binary option" — there are several CSV formats which you *want* to fail if they encounter embedded newlines. I was tempted to point the OP to one of those modules too, but this data has no embedding and would need its terminal field separator chomped, so doing it by hand seems just as good for this one particular case.