Re: [OT] Re: Get unique fields from file

Using an SQL DB for this sort of thing is actually a pretty good idea.
I attach my SQLite multi-platform SQLite code below.

There are utilities to import CSV files, but I have never bothered to install the SQLite3 command line on my machine. The Perl DBI comes with all the functions that Perl needs. I use a browser plug-in, SQLite Manager for admin functions. Firefox removed the I/F this plug-in needs, so I installed Waterfox for the single purpose of running this plug-in.

The Perl syntax for some of the SQL command strings got a little hairy, so I included a debug statement for those. If you define a constant, like DEBUG_SQL, if you set that constant equal to 0, Perl will not even compile the "if" statement into the code. So there is no production runtime penalty for using this technique. I left those statements in the code as a demo of technique.

I didn't bother with a CSV parser because there is no evidence at all that this is a CSV file. Include that if you like.

SQL cannot make a table name a variable, but it can make a column name a variable. Unfortunately my SQL knowledge is inadequate at the moment. So I just made a "template" and ran a "prepare" on the adjusted template once per column. Impact on performance is zero.

There has been some discussion in this thread about whether the OP actually intended unique values or distinct values. Most posts rejected the OP's plain English request for unique values in favor of an "distinct" interpretation. I suppose a hold over from UNIX uniq? I calculated both ways below.

There is an easy way to vary SQLite's memory footprint. But it has been maybe 8 years since I needed to do that. I had a heavily indexed DB and found that creating the tables first and then adding the indices later was much faster than having the DB figure that stuff out "as it went" during DB creation. It was possible to vastly increase the footprint of SQLite for this indexing operation and then reduce it later. Something like that could have a huge performance impact on this SQL code.

Update: Oh, darn I see that val3 should be unique in head1. Not sure where the bug is.
OK, see **ADDED**, I wasn't ignoring blank lines.

use strict;
use warnings;
use DBI;

use constant {DEBUG_SQL => 1};

my $scratchDBname = "ScratchDB.sqlite";  
my %attr = ( RaiseError => 1);  # auto die with error printout
my $dbh = DBI->connect("dbi:SQLite:dbname=$scratchDBname","","",\%attr
+)
    or die "Couldn't connect to database $scratchDBname: " . DBI->errs
+tr;

my $headerLine = <DATA>;
chomp $headerLine;
my @headers = split (/\|/,$headerLine);

# create main Table
#
$dbh->do("DROP TABLE IF EXISTS main");
my $createTableSQL = "CREATE TABLE main (".
                   join (",",map{"\"$_\" TEXT NOT NULL"}@headers).
                   ")";
print "SQL debug Statement: $createTableSQL\n" if DEBUG_SQL;
$dbh->do($createTableSQL);

# prepare insert row statement
#
my $valuesField = "?," x @headers;
chop $valuesField; # get rid of last comma
my $insertRowSQL = "INSERT INTO main (".
                      join(",",@headers).")".
                      "VALUES (".$valuesField.")";
print "SQL debug statement: $insertRowSQL\n" if DEBUG_SQL;            
+       
my $insert = $dbh->prepare($insertRowSQL);

# populate main table by reading the 1/2 GB file
#
$dbh->begin_work;
while (defined (my $line =<DATA>))
{
   chomp $line;
   next if $line =~ /^\s*$/; #ignore blank lines ***ADDED***
   my @fields = split (/\|/,$line);
   $insert->execute(@fields);
}
$dbh->commit;

# prepare SQL prototypes
#
# A Column can be a variable but requires more SQL-foo than
# I have at the moment...
# But making a "prototype" and adjusting that and preparing
# on a per column basis appears to be no big deal

my $distinct_proto = "SELECT __COLUMN from main 
                      GROUP BY  __COLUMN
                      ORDER BY  __COLUMN";
                      
my $unique_proto = "SELECT __COLUMN FROM (
                      SELECT __COLUMN, count(*) as histo 
                      FROM main GROUP BY __COLUMN ORDER BY __COLUMN)
                   WHERE histo = 1";
  
foreach my $header (@headers)
{
   my $distinct_select = $distinct_proto;
   $distinct_select =~ s/__COLUMN/$header/g;
   my $distinct = $dbh->prepare($distinct_select);
   
   my $unique_select = $unique_proto;
   $unique_select =~ s/__COLUMN/$header/g;
   my $unique = $dbh->prepare($unique_select);
   
   print "\n*****\n";
   print "Distinct values for $header:\n";
   $distinct->execute();
   my $distinct_array_ref = $distinct -> fetchall_arrayref();
   print "@$_\n" foreach @$distinct_array_ref;
   
   print "\nUnique values for $header:\n";
   $unique->execute();
   my $unique_array_ref = $unique -> fetchall_arrayref();
   print "@$_\n" foreach @$unique_array_ref;
   
}

=Prints:
SQL debug Statement: CREATE TABLE main ("head1" TEXT NOT NULL,"head2" 
+TEXT NOT NULL,"head3" TEXT NOT NULL)
SQL debug statement: INSERT INTO main (head1,head2,head3)VALUES (?,?,?
+)

*****
Distinct values for head1:
val1
val2
val3
val6

Unique values for head1:
val2
val3
val6

*****
Distinct values for head2:
val2
val4
val7

Unique values for head2:
val2

*****
Distinct values for head3:
val3
val5

Unique values for head3:

=cut

__DATA__
head1|head2|head3
val1|val2|val3
val1|val4|val5
val6|val4|val5
val2|val7|val5
val3|val7|val3
[download]

Comment on Re: [OT] Re: Get unique fields from file Download Code

Replies are listed 'Best First'.
Re^2: [OT] Re: Get unique fields from file by erix (Prior) on Jan 11, 2022 at 15:58 UTC
That's nice, too, but it is essentially different from my example: your code actually /imports/ data, whereas mine only configures a table interface on top of an underlying text file, so that you can run any SQL against that underlying text table data (one could also read a zipped text file in this way). Let's just say SQL too has more than one way to do it ;)	[reply]
Re^3: [OT] Re: Get unique fields from file by Marshall (Canon) on Jan 11, 2022 at 17:07 UTC
Well I guess there is also DBD::CSV but once I got onto SQLite, I haven't looked back. SQLite is the most used DB in the world (geez, it is on every cell phone and in almost all browsers...), well debugged and very efficient for what it does. Let's not quibble too much about the details. The main point is that using SQL especially in conjunction with a real DB is definitely "on the table" as a reasonable solution for the OP's problem. So I don't think that is OT at all. This approach will scale well into much, much larger files. Functionality like "build histogram" and "print only the first time" are heavily optimized and a real DB is pretty smart about how it uses the memory available to it. I don't think we have heard back from the OP about requirement questions, so this is really just an exercise for our own edification and enjoyment.	[reply]
Re^4: [OT] Re: Get unique fields from file by erix (Prior) on Jan 11, 2022 at 17:38 UTC
so this is really just an exercise for our own edification and enjoyment. So let me just enjoy some more light quibbling: Hipp, SQLite-man extraordinaire: "SQLite was a spin-off of postgres, a conceptual fork" "We continue to look up at postgres as our mentor" (which makes sense, why invent the wheel? I saw (just a few days ago) that the SQLIte devteam/Hipp are implementing some JSON operators, taking the example from postgres. Good!) More fun: a youtube talk by Hipp (admittedly talking to a postgres-audience, at PGcon 2014): PGcon 2014 talk by Hipp	[reply]
Re^5: [OT] Re: Get unique fields from file by Marshall (Canon) on Jan 11, 2022 at 20:14 UTC


laziness, impatience, and hubris
	PerlMonks