Serene Hacker has asked for the wisdom of the Perl Monks concerning the following question:
I have a bunch of records containing key,value pairs.
I'd like to create a CSV file containing one row for each record.
However, each record doesn't necessarily contain exactly the same keys as the other records. So I'd like the column-headers to be a superset of all the keys. E.g.
{AB1 => 100, NN => 200, XYZ => 400}
{AB1 => 100, XYZ => 400, MM => 300}
-->
"AB1","NN","XYZ","MM"
"100","200","400",""
"100","","400","300"
i.e.
AB1 NN XYZ MM
100 200 400 undef
100 undef 400 300
I could do this "manually" as it were, but can one do it directly (using Text::CSV), i.e. without figuring out the superset of keys myself? In Text::CSV I see one can *read* tables with headers as hashes. But can one write them just as easily?
Re: Writing hashes as records to a CSV file
by choroba (Cardinal) on Dec 08, 2021 at 23:24 UTC
You can collect the column names by iterating over the records and accumulating their keys.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use List::Util qw{ uniq };
use Text::CSV_XS;
my @records = ({AB1 => 100, NN => 200, XYZ => 400},
{AB1 => 100, XYZ => 400, MM => 300});
my @columns = sort +uniq(map keys %$_, @records);
my $csv = 'Text::CSV_XS'->new({auto_diag => 1});
for my $record ({map { $_ => $_ } @columns}, @records) {
$csv->say(\*STDOUT, [@$record{@columns}]);
}
Re: Writing hashes as records to a CSV file
by Tux (Canon) on Dec 09, 2021 at 08:55 UTC
use Text::CSV_XS qw( csv );
my @records = (
{ AB1 => 100, NN => 200, XYZ => 400},
{ AB1 => 100, XYZ => 400, MM => 300},
);
my %fld = map { $_ => 1 } map { keys %$_ } @records;
csv (in => \@records, headers => [ sort keys %fld ]);
A suggestion might be to add quote_empty => 1 to be able to distinguish missing keys from keys with a defined but empty value. But as that does not show the difference between missing keys and keys that point to an undefined value, I would suggest something like this (unless you don't care about those differences):
use Text::CSV_XS qw( csv );
my @records = (
{ AB1 => 100, NN => 200, XYZ => 400 },
{ AB1 => 200, XYZ => 400, MM => 300 },
{ AB1 => 300, XYZ => undef, MM => "" },
);
my %seen;
my @fld = sort grep { !$seen{$_}++ } map { keys %$_ } @records;
csv (in => \@records, headers => \@fld, quote_empty => 1, on_in => sub {
foreach my $f (@fld) {
exists $_{$f} or $_{$f} = "--missing--";
}
});
->
AB1,MM,NN,XYZ
100,--missing--,200,400
200,300,--missing--,400
300,"",--missing--,
To answer LanX' question: as CSV is streaming, you don't know what is to come, so there is no builtin functionality to find unique values in CSV, but there are enough builtin features to do it yourself.
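To illustrate the "do it yourself" part, here is a minimal core-Perl sketch (no Text::CSV_XS, and with deliberately naive quoting that is only good enough for this demo): collect the header superset in one pass over the records, then emit the rows against that fixed column list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @records = (
    { AB1 => 100, NN  => 200, XYZ => 400 },
    { AB1 => 100, XYZ => 400, MM  => 300 },
);

# one pass over all records: dedupe the keys, then sort them
my %seen;
my @headers = sort grep { !$seen{$_}++ } map { keys %$_ } @records;

# naive CSV quoting, for illustration only -- use Text::CSV_XS in real code
my $q = sub { (my $v = $_[0] // "") =~ s/"/""/g; qq{"$v"} };

print join(",", map $q->($_), @headers), "\n";
for my $rec (@records) {
    print join(",", map $q->($rec->{$_}), @headers), "\n";
}
```

Missing keys simply come out as "" because the hash slice against @headers returns undef for them, which the quoting sub maps to an empty quoted field.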
Enjoy, Have FUN! H.Merijn
Re: Writing hashes as records to a CSV file
by LanX (Saint) on Dec 08, 2021 at 23:27 UTC
> but can one do it directly (using Text::CSV) ie. without figuring out the superset of keys myself
I searched through the documentation of Text::CSV but couldn't find that. But I'm confident Tux, as one of the maintainers, will know.
> I could do this "manually" as it were,
Merging hashes is not complicated in Perl:
after
%H = (%H,%$_) for @AoH
you'll have the superset in
keys %H
More importantly you can keep control over the order of columns created. (like ordering by count)
Do you really want to leave that to Text::CSV, which most probably will be random?
update x2
use strict;
use warnings;
use Data::Dump qw/pp dd/;
my @AoH =
(
{AB => 100, NN => 200, XYZ => 400},
{AB => 100, XYZ => 400, MM => 300},
{ map { ("A$_" => $_) } "A".."C" }
);
my %H;
%H = (%H,%$_) for @AoH;
dd keys %H;
my %count;
map { map { $count{$_}++ } keys %$_ } @AoH;
dd sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
OUTPUT: ("XYZ", "NN", "MM", "AC", "AA", "AB")
("AB", "XYZ", "AA", "AC", "MM", "NN")
NB: "AB" is last in the first result, even though it appears in every record.
The second result however will order the columns first by count and then by alphabetic order.
YMMV...
use v5.12;
use warnings;
use Test::More;
use Benchmark qw/cmpthese/;
my $AoH;
for my $n_rec (1, 10,100,1000) {
say "";
say "=== num of records is: ",$n_rec;
$AoH = create_data(1,$n_rec);
is_deeply(
[sort &list_join],
[sort &slice_join],
);
cmpthese(-1, {
'list_join' => \&list_join,
'slice_join' => \&slice_join,
}
);
}
done_testing;
sub list_join {
my %H;
%H = (%H,%$_) for @$AoH;
return keys %H;
}
sub slice_join {
my %H;
@H{keys %$_}=() for @$AoH;
return keys %H;
}
sub create_data {
my ( $density,$records ) = @_ ;
my @AoH;
push @AoH, { map { rand 100 <= $density ? ("$_" => $_) : () } "A".."ZZ" } for 1..$records;
return \@AoH;
}
OUTPUT: __DATA__
=== num of records is: 1
ok 1
Rate list_join slice_join
list_join 238532/s -- -65%
slice_join 682713/s 186% --
__DATA__
=== num of records is: 10
ok 2
Rate list_join slice_join
list_join 7819/s -- -93%
slice_join 112993/s 1345% --
__DATA__
=== num of records is: 100
ok 3
Rate list_join slice_join
list_join 82.9/s -- -99%
slice_join 8533/s 10195% --
__DATA__
=== num of records is: 1000
ok 4
Rate list_join slice_join
list_join 3.66/s -- -100%
slice_join 1067/s 29072% --
1..4
|
I compared it to the two methods I used in my examples, just out of curiosity and to learn. I never used the list_join as that intuitively feels stupid/slow, but the slice_join was also not in my default tool-box, as map feels so much more logical to me. Anyway, here we go ...
I think the grep_map is a neat contender and reads easier than the slice_join. YMMV.
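For readers skimming past, here is a small side-by-side sketch of the two contenders being compared. The sub names grep_map and slice_join follow the posts above; the sample data is the OP's.

```perl
use strict;
use warnings;

my @AoH = (
    { AB1 => 100, NN  => 200, XYZ => 400 },
    { AB1 => 100, XYZ => 400, MM  => 300 },
);

# grep_map: dedupe the mapped keys with a %seen hash (as in Tux's reply above)
sub grep_map {
    my %seen;
    return grep { !$seen{$_}++ } map { keys %$_ } @AoH;
}

# slice_join: let a hash slice do the dedup (the benchmark winner above)
sub slice_join {
    my %H;
    @H{ keys %$_ } = () for @AoH;
    return keys %H;
}
```

Both return the same key superset (in unspecified hash order); slice_join avoids building the intermediate key list and per-key block, which is where its speed comes from.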
Re: Writing hashes as records to a CSV file
by kcott (Archbishop) on Dec 09, 2021 at 06:48 UTC
G'day Serene Hacker,
It would have helped if you'd shown a sample of your actual input file.
The two hashrefs you show give no indication of the data source or how the data shown was extracted;
in fact, I'm not at all convinced that those hashrefs are necessary, even as temporary data structures.
Purely for demonstration purposes, let's assume your input file records are URLs;
and the query string (which could be empty or even absent) represents the "containing key,value pairs".
$ cat pm_11139481_csv_kv_pairs_data.txt
http://example.com/?AB1=101&NN=201&XYZ=401
http://example.net/?AB1=102&XYZ=402&MM=302
http://example.org/?MM=303&XYZ=403&NN=203&AB1=103
http://example.mil/?
https://example.mil/
http://example.edu/?PQR=504
Note that I used unique values throughout.
The last digit of each matches the record number.
This facilitates quickly checking the output by visual inspection.
(Note: in your example, two instances of 100, and two of 400, do not allow this.)
Now, process that data with:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Text::CSV;
my $source_data = 'pm_11139481_csv_kv_pairs_data.txt';
my $created_csv = 'pm_11139481_csv_kv_pairs_data.csv';
my %data;
my $last_index = -1;
extract_data($source_data, \%data, \$last_index);
{
my @headers = sort keys %data;
my $csv = Text::CSV::->new();
open my $csv_fh, '>', $created_csv;
$csv->say($csv_fh, \@headers);
for my $i (0 .. $last_index) {
$csv->say($csv_fh, [map $data{$_}[$i], @headers]);
}
}
sub extract_data {
my ($in_file, $data, $last_index) = @_;
open my $fh, '<', $in_file;
while (<$fh>) {
++$$last_index;
chomp;
my $qs = (split /[?]/)[1];
next unless defined $qs;
for my $kv_pair (split /[&]/, $qs) {
my ($key, $val) = split /[=]/, $kv_pair;
$data->{$key}[$$last_index] = $val;
}
}
return;
}
Obviously, you'd need to write your own version of extract_data();
however, the rest of the code should probably work pretty much as is.
Here's the result:
$ cat pm_11139481_csv_kv_pairs_data.csv
AB1,MM,NN,PQR,XYZ
101,,201,,401
102,302,,,402
103,303,203,,403
,,,,
,,,,
,,,504,