Writing hashes as records to a CSV file

by Serene Hacker (Initiate)
on Dec 08, 2021 at 22:55 UTC

Serene Hacker has asked for the wisdom of the Perl Monks concerning the following question:

I have a bunch of records containing key,value pairs.

I'd like to create a CSV file containing one row for each record.

However, each record doesn't necessarily contain exactly the same keys as the other records. So I'd like the column-headers to be a superset of all the keys. E.g.

{AB1 => 100, NN => 200, XYZ => 400}
{AB1 => 100, XYZ => 400, MM => 300}
-->
"AB1","NN","XYZ","MM" "100","200","400","" "100","","400","300"
i.e.

AB1  NN     XYZ  MM
100  200    400  undef
100  undef  400  300
I could do this "manually" as it were, but can one do it directly (using Text::CSV), i.e. without figuring out the superset of keys myself? In Text::CSV I see one can *read* tables with headers as hashes. But can one write them just as easily?
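
For reference, the reading direction I mean looks something like this (a minimal sketch; "data.csv" stands in for a real input file with a header row):

use Text::CSV qw( csv );

# Read a CSV file with a header row; returns an array ref of
# hash refs, one per data row, keyed by the column headers.
my $rows = csv (in => "data.csv", headers => "auto");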

Replies are listed 'Best First'.
Re: Writing hashes as records to a CSV file
by choroba (Cardinal) on Dec 08, 2021 at 23:24 UTC
    You can collect the column names by iterating over the records and accumulating their keys.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    use List::Util qw{ uniq };
    use Text::CSV_XS;

    my @records = ({AB1 => 100, NN => 200, XYZ => 400},
                   {AB1 => 100, XYZ => 400, MM => 300});

    # The sorted superset of all keys becomes the column list.
    my @columns = sort +uniq(map keys %$_, @records);

    my $csv = 'Text::CSV_XS'->new({auto_diag => 1});

    # Prepend a pseudo-record that maps each column name to itself,
    # so the header row falls out of the same loop as the data rows.
    for my $record ({map { $_ => $_ } @columns}, @records) {
        $csv->say(\*STDOUT, [@$record{@columns}]);
    }
Re: Writing hashes as records to a CSV file
by Tux (Canon) on Dec 09, 2021 at 08:55 UTC

    My variation using csv:

    use Text::CSV_XS qw( csv );

    my @records = (
        { AB1 => 100, NN => 200, XYZ => 400 },
        { AB1 => 100, XYZ => 400, MM => 300 },
        );

    my %fld = map { $_ => 1 } map { keys %$_ } @records;

    csv (in => \@records, headers => [ sort keys %fld ]);

    A suggestion might be to add quote_empty => 1 to be able to distinguish between missing keys and keys holding a defined empty string. But as that still does not show the difference between missing keys and keys that point to an undefined value, I would suggest something like this (unless you don't care about those differences):

    use Text::CSV_XS qw( csv );

    my @records = (
        { AB1 => 100, NN  => 200,   XYZ => 400 },
        { AB1 => 200, XYZ => 400,   MM  => 300 },
        { AB1 => 300, XYZ => undef, MM  => "" },
        );

    my %seen;
    my @fld = sort grep { !$seen{$_}++ } map { keys %$_ } @records;

    csv (in => \@records, headers => \@fld, quote_empty => 1,
        on_in => sub {
            foreach my $f (@fld) {
                exists $_{$f} or $_{$f} = "--missing--";
            }
        });

    which gives:

    AB1,MM,NN,XYZ
    100,--missing--,200,400
    200,300,--missing--,400
    300,"",--missing--,

    To answer LanX' question: as CSV is a streaming format, you don't know what is still to come, so there is no builtin functionality to find the superset of keys, but there are enough builtin features to do it yourself.


    Enjoy, Have FUN! H.Merijn
Re: Writing hashes as records to a CSV file
by LanX (Saint) on Dec 08, 2021 at 23:27 UTC
    > but can one do it directly (using Text::CSV) ie. without figuring out the superset of keys myself

    I searched through the documentation of Text::CSV but couldn't find that. But I'm confident Tux, as one of the maintainers, will know.

    > I could do this "manually" as it were,

    Merging hashes is not complicated in Perl:

    after

    %H = (%H,%$_) for @AoH

    you'll have the superset in

    keys %H

    More importantly, you keep control over the order of the columns created (like ordering by count).

    Do you really want to leave that to Text::CSV, where the order would most probably be random?

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    update x2
    use strict;
    use warnings;
    use Data::Dump qw/pp dd/;

    my @AoH = (
        {AB => 100, NN => 200, XYZ => 400},
        {AB => 100, XYZ => 400, MM => 300},
        { map { ("A$_" => $_) } "A".."C" },
    );

    my %H;
    %H = (%H, %$_) for @AoH;

    dd keys %H;

    my %count;
    map { map { $count{$_}++ } keys %$_ } @AoH;

    dd sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    OUTPUT:
    ("XYZ", "NN", "MM", "AC", "AA", "AB") ("AB", "XYZ", "AA", "AC", "MM", "NN")

    NB: "AB" is last in the first result, even that it appears in each record.

    The second result, however, orders the columns first by count and then alphabetically.

    YMMV...

      > after %H = (%H,%$_) for @AoH you'll have the superset in keys %H

      Actually, using hash slices and empty value lists is way faster:

      @H{keys %$_} = () for @$AoH;

      DEMO:

      use v5.12;
      use warnings;
      use Test::More;
      use Benchmark qw/cmpthese/;

      my $AoH;

      for my $n_rec (1, 10, 100, 1000) {
          say;
          say "=== num of records is: ", $n_rec;

          $AoH = create_data(1, $n_rec);

          is_deeply(
              [sort &list_join],
              [sort &slice_join],
          );

          cmpthese(-1, {
              'list_join'  => \&list_join,
              'slice_join' => \&slice_join,
          });
      }

      done_testing;

      sub list_join {
          my %H;
          %H = (%H, %$_) for @$AoH;
          return keys %H;
      }

      sub slice_join {
          my %H;
          @H{keys %$_} = () for @$AoH;
          return keys %H;
      }

      sub create_data {
          my ($density, $records) = @_;
          my @AoH;
          push @AoH,
              { map { rand 100 <= $density ? ("$_" => $_) : () } "A" .. "ZZ" }
              for 1 .. $records;
          return \@AoH;
      }
      OUTPUT:
      === num of records is: 1
      ok 1
                       Rate  list_join slice_join
      list_join    238532/s         --       -65%
      slice_join   682713/s       186%         --

      === num of records is: 10
      ok 2
                       Rate  list_join slice_join
      list_join      7819/s         --       -93%
      slice_join   112993/s      1345%         --

      === num of records is: 100
      ok 3
                     Rate  list_join slice_join
      list_join    82.9/s         --       -99%
      slice_join   8533/s     10195%         --

      === num of records is: 1000
      ok 4
                     Rate  list_join slice_join
      list_join    3.66/s         --      -100%
      slice_join   1067/s     29072%         --
      1..4

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      Update

      fixed bug in sorted tests

        I compared it to the two methods I used in my examples, just out of curiosity and to learn. I never used the list_join as that intuitively feels stupid/slow, but the slice_join was also not in my default tool-box, as map feels so much more logical to me. Anyway, here we go ...

        I think the grep_map is a neat contender and reads easier than the slice_join. YMMV.
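
        For clarity, a minimal sketch of the grep_map in question, i.e. the grep/map combination from the reply further up (@records stands in for the benchmark data):

        # Walk all records, collect every key, and keep only the
        # first occurrence of each; sort for a stable column order.
        my %seen;
        my @fld = sort grep { !$seen{$_}++ } map { keys %$_ } @records;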


        Enjoy, Have FUN! H.Merijn
Re: Writing hashes as records to a CSV file
by kcott (Archbishop) on Dec 09, 2021 at 06:48 UTC

    G'day Serene Hacker,

    It would have helped if you'd shown a sample of your actual input file. The two hashrefs you show give no indication of the data source or how the data shown was extracted; in fact, I'm not at all convinced that those hashrefs are necessary, even as temporary data structures.

    Purely for demonstration purposes, let's assume your input file records are URLs; and the query string (which could be empty or even absent) represents the "containing key,value pairs".

    $ cat pm_11139481_csv_kv_pairs_data.txt
    http://example.com/?AB1=101&NN=201&XYZ=401
    http://example.net/?AB1=102&XYZ=402&MM=302
    http://example.org/?MM=303&XYZ=403&NN=203&AB1=103
    http://example.mil/?
    https://example.mil/
    http://example.edu/?PQR=504

    Note that I used unique values throughout. The last digit of each matches the record number. This facilitates quickly checking the output by visual inspection. (Note: in your example, two instances of 100, and two of 400, do not allow this.)

    Now, process that data with:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use autodie;

    use Text::CSV;

    my $source_data = 'pm_11139481_csv_kv_pairs_data.txt';
    my $created_csv = 'pm_11139481_csv_kv_pairs_data.csv';

    # %data holds one array per key, indexed by record number:
    # $data{KEY}[$record_index] = VALUE
    my %data;
    my $last_index = -1;

    extract_data($source_data, \%data, \$last_index);

    {
        my @headers = sort keys %data;
        my $csv = Text::CSV::->new();
        open my $csv_fh, '>', $created_csv;
        $csv->say($csv_fh, \@headers);
        for my $i (0 .. $last_index) {
            # Missing keys simply yield undef, i.e. an empty field.
            $csv->say($csv_fh, [map $data{$_}[$i], @headers]);
        }
    }

    sub extract_data {
        my ($in_file, $data, $last_index) = @_;
        open my $fh, '<', $in_file;
        while (<$fh>) {
            ++$$last_index;
            chomp;
            my $qs = (split /[?]/)[1];
            next unless defined $qs;
            for my $kv_pair (split /[&]/, $qs) {
                my ($key, $val) = split /[=]/, $kv_pair;
                $data->{$key}[$$last_index] = $val;
            }
        }
        return;
    }

    Obviously, you'd need to write your own version of extract_data(); however, the rest of the code should probably work pretty much as is.
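
    For instance, if your records were lines of comma-separated key=value pairs (a purely hypothetical format), extract_data() might instead look like the sketch below; the interface (fill %$data, advance $$last_index) stays the same, only the parsing differs:

    # Hypothetical variant for input lines like "AB1=100,NN=200,XYZ=400".
    sub extract_data {
        my ($in_file, $data, $last_index) = @_;
        open my $fh, '<', $in_file;
        while (<$fh>) {
            ++$$last_index;
            chomp;
            for my $kv_pair (split /[,]/) {
                my ($key, $val) = split /[=]/, $kv_pair;
                $data->{$key}[$$last_index] = $val;
            }
        }
        return;
    }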

    Here's the result:

    $ cat pm_11139481_csv_kv_pairs_data.csv
    AB1,MM,NN,PQR,XYZ
    101,,201,,401
    102,302,,,402
    103,303,203,,403
    ,,,,
    ,,,,
    ,,,504,

    — Ken
