Using an SQL DB for this sort of thing is actually a pretty good idea.
I attach my SQLite multi-platform SQLite code below.
There are utilities to import CSV files, but I have never bothered to install the SQLite3 command line on my machine. The Perl DBI comes with all the functions that Perl needs. I use a browser plug-in, SQLite Manager for admin functions. Firefox removed the I/F this plug-in needs, so I installed Waterfox for the single purpose of running this plug-in.
The Perl syntax for some of the SQL command strings got a little hairy, so I included a debug statement for those. If you define a constant, like DEBUG_SQL, if you set that constant equal to 0, Perl will not even compile the "if" statement into the code. So there is no production runtime penalty for using this technique. I left those statements in the code as a demo of technique.
I didn't bother with a CSV parser because there is no evidence at all that this is a CSV file. Include that if you like.
SQL cannot make a table name a variable, but it can make a column name a variable. Unfortunately my SQL knowledge is inadequate at the moment. So I just made a "template" and ran a "prepare" on the adjusted template once per column. Impact on performance is zero.
There has been some discussion in this thread about whether the OP actually intended unique values or distinct values. Most posts rejected the OP's plain English request for unique values in favor of an "distinct" interpretation. I suppose a hold over from UNIX uniq? I calculated both ways below.
There is an easy way to vary SQLite's memory footprint. But it has been maybe 8 years since I needed to do that. I had a heavily indexed DB and found that creating the tables first and then adding the indices later was much faster than having the DB figure that stuff out "as it went" during DB creation. It was possible to vastly increase the footprint of SQLite for this indexing operation and then reduce it later. Something like that could have a huge performance impact on this SQL code.
Update: Oh, darn I see that val3 should be unique in head1. Not sure where the bug is.
OK, see **ADDED**, I wasn't ignoring blank lines.
use strict;
use warnings;
use DBI;
use constant {DEBUG_SQL => 1};
my $scratchDBname = "ScratchDB.sqlite";
my %attr = ( RaiseError => 1); # auto die with error printout
my $dbh = DBI->connect("dbi:SQLite:dbname=$scratchDBname","","",\%attr
+)
or die "Couldn't connect to database $scratchDBname: " . DBI->errs
+tr;
my $headerLine = <DATA>;
chomp $headerLine;
my @headers = split (/\|/,$headerLine);
# create main Table
#
$dbh->do("DROP TABLE IF EXISTS main");
my $createTableSQL = "CREATE TABLE main (".
join (",",map{"\"$_\" TEXT NOT NULL"}@headers).
")";
print "SQL debug Statement: $createTableSQL\n" if DEBUG_SQL;
$dbh->do($createTableSQL);
# prepare insert row statement
#
my $valuesField = "?," x @headers;
chop $valuesField; # get rid of last comma
my $insertRowSQL = "INSERT INTO main (".
join(",",@headers).")".
"VALUES (".$valuesField.")";
print "SQL debug statement: $insertRowSQL\n" if DEBUG_SQL;
+
my $insert = $dbh->prepare($insertRowSQL);
# populate main table by reading the 1/2 GB file
#
$dbh->begin_work;
while (defined (my $line =<DATA>))
{
chomp $line;
next if $line =~ /^\s*$/; #ignore blank lines ***ADDED***
my @fields = split (/\|/,$line);
$insert->execute(@fields);
}
$dbh->commit;
# prepare SQL prototypes
#
# A Column can be a variable but requires more SQL-foo than
# I have at the moment...
# But making a "prototype" and adjusting that and preparing
# on a per column basis appears to be no big deal
my $distinct_proto = "SELECT __COLUMN from main
GROUP BY __COLUMN
ORDER BY __COLUMN";
my $unique_proto = "SELECT __COLUMN FROM (
SELECT __COLUMN, count(*) as histo
FROM main GROUP BY __COLUMN ORDER BY __COLUMN)
WHERE histo = 1";
foreach my $header (@headers)
{
my $distinct_select = $distinct_proto;
$distinct_select =~ s/__COLUMN/$header/g;
my $distinct = $dbh->prepare($distinct_select);
my $unique_select = $unique_proto;
$unique_select =~ s/__COLUMN/$header/g;
my $unique = $dbh->prepare($unique_select);
print "\n*****\n";
print "Distinct values for $header:\n";
$distinct->execute();
my $distinct_array_ref = $distinct -> fetchall_arrayref();
print "@$_\n" foreach @$distinct_array_ref;
print "\nUnique values for $header:\n";
$unique->execute();
my $unique_array_ref = $unique -> fetchall_arrayref();
print "@$_\n" foreach @$unique_array_ref;
}
=Prints:
SQL debug Statement: CREATE TABLE main ("head1" TEXT NOT NULL,"head2"
+TEXT NOT NULL,"head3" TEXT NOT NULL)
SQL debug statement: INSERT INTO main (head1,head2,head3)VALUES (?,?,?
+)
*****
Distinct values for head1:
val1
val2
val3
val6
Unique values for head1:
val2
val3
val6
*****
Distinct values for head2:
val2
val4
val7
Unique values for head2:
val2
*****
Distinct values for head3:
val3
val5
Unique values for head3:
=cut
__DATA__
head1|head2|head3
val1|val2|val3
val1|val4|val5
val6|val4|val5
val2|val7|val5
val3|val7|val3