Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?

Re: [OT] Re: Get unique fields from file

by Marshall (Canon)
on Jan 11, 2022 at 12:46 UTC ( #11140365=note: print w/replies, xml ) Need Help??

in reply to [OT] Re: Get unique fields from file
in thread Get unique fields from file

Using an SQL DB for this sort of thing is actually a pretty good idea.
I attach my SQLite multi-platform SQLite code below.

There are utilities to import CSV files, but I have never bothered to install the SQLite3 command line on my machine. The Perl DBI comes with all the functions that Perl needs. I use a browser plug-in, SQLite Manager for admin functions. Firefox removed the I/F this plug-in needs, so I installed Waterfox for the single purpose of running this plug-in.

The Perl syntax for some of the SQL command strings got a little hairy, so I included a debug statement for those. If you define a constant, like DEBUG_SQL, if you set that constant equal to 0, Perl will not even compile the "if" statement into the code. So there is no production runtime penalty for using this technique. I left those statements in the code as a demo of technique.

I didn't bother with a CSV parser because there is no evidence at all that this is a CSV file. Include that if you like.

SQL cannot make a table name a variable, but it can make a column name a variable. Unfortunately my SQL knowledge is inadequate at the moment. So I just made a "template" and ran a "prepare" on the adjusted template once per column. Impact on performance is zero.

There has been some discussion in this thread about whether the OP actually intended unique values or distinct values. Most posts rejected the OP's plain English request for unique values in favor of an "distinct" interpretation. I suppose a hold over from UNIX uniq? I calculated both ways below.

There is an easy way to vary SQLite's memory footprint. But it has been maybe 8 years since I needed to do that. I had a heavily indexed DB and found that creating the tables first and then adding the indices later was much faster than having the DB figure that stuff out "as it went" during DB creation. It was possible to vastly increase the footprint of SQLite for this indexing operation and then reduce it later. Something like that could have a huge performance impact on this SQL code.

Update: Oh, darn I see that val3 should be unique in head1. Not sure where the bug is.
OK, see **ADDED**, I wasn't ignoring blank lines.

use strict; use warnings; use DBI; use constant {DEBUG_SQL => 1}; my $scratchDBname = "ScratchDB.sqlite"; my %attr = ( RaiseError => 1); # auto die with error printout my $dbh = DBI->connect("dbi:SQLite:dbname=$scratchDBname","","",\%attr +) or die "Couldn't connect to database $scratchDBname: " . DBI->errs +tr; my $headerLine = <DATA>; chomp $headerLine; my @headers = split (/\|/,$headerLine); # create main Table # $dbh->do("DROP TABLE IF EXISTS main"); my $createTableSQL = "CREATE TABLE main (". join (",",map{"\"$_\" TEXT NOT NULL"}@headers). ")"; print "SQL debug Statement: $createTableSQL\n" if DEBUG_SQL; $dbh->do($createTableSQL); # prepare insert row statement # my $valuesField = "?," x @headers; chop $valuesField; # get rid of last comma my $insertRowSQL = "INSERT INTO main (". join(",",@headers).")". "VALUES (".$valuesField.")"; print "SQL debug statement: $insertRowSQL\n" if DEBUG_SQL; + my $insert = $dbh->prepare($insertRowSQL); # populate main table by reading the 1/2 GB file # $dbh->begin_work; while (defined (my $line =<DATA>)) { chomp $line; next if $line =~ /^\s*$/; #ignore blank lines ***ADDED*** my @fields = split (/\|/,$line); $insert->execute(@fields); } $dbh->commit; # prepare SQL prototypes # # A Column can be a variable but requires more SQL-foo than # I have at the moment... # But making a "prototype" and adjusting that and preparing # on a per column basis appears to be no big deal my $distinct_proto = "SELECT __COLUMN from main GROUP BY __COLUMN ORDER BY __COLUMN"; my $unique_proto = "SELECT __COLUMN FROM ( SELECT __COLUMN, count(*) as histo FROM main GROUP BY __COLUMN ORDER BY __COLUMN) WHERE histo = 1"; foreach my $header (@headers) { my $distinct_select = $distinct_proto; $distinct_select =~ s/__COLUMN/$header/g; my $distinct = $dbh->prepare($distinct_select); my $unique_select = $unique_proto; $unique_select =~ s/__COLUMN/$header/g; my $unique = $dbh->prepare($unique_select); print "\n*****\n"; print "Distinct values for $header:\n"; $distinct->execute(); my $distinct_array_ref = $distinct -> fetchall_arrayref(); print "@$_\n" foreach @$distinct_array_ref; print "\nUnique values for $header:\n"; $unique->execute(); my $unique_array_ref = $unique -> fetchall_arrayref(); print "@$_\n" foreach @$unique_array_ref; } =Prints: SQL debug Statement: CREATE TABLE main ("head1" TEXT NOT NULL,"head2" +TEXT NOT NULL,"head3" TEXT NOT NULL) SQL debug statement: INSERT INTO main (head1,head2,head3)VALUES (?,?,? +) ***** Distinct values for head1: val1 val2 val3 val6 Unique values for head1: val2 val3 val6 ***** Distinct values for head2: val2 val4 val7 Unique values for head2: val2 ***** Distinct values for head3: val3 val5 Unique values for head3: =cut __DATA__ head1|head2|head3 val1|val2|val3 val1|val4|val5 val6|val4|val5 val2|val7|val5 val3|val7|val3

Replies are listed 'Best First'.
Re^2: [OT] Re: Get unique fields from file
by erix (Prior) on Jan 11, 2022 at 15:58 UTC

    That's nice, too, but it is essentially different from my example: your code actually /imports/ data, whereas mine only configures a table interface on top of an underlying text file, so that you can run any SQL against that underlying text table data (one could also read a zipped text file in this way).

    Let's just say SQL too has more than one way to do it ;)

      Well I guess there is also DBD::CSV but once I got onto SQLite, I haven't looked back. SQLite is the most used DB in the world (geez, it is on every cell phone and in almost all browsers...), well debugged and very efficient for what it does.

      Let's not quibble too much about the details.

      The main point is that using SQL especially in conjunction with a real DB is definitely "on the table" as a reasonable solution for the OP's problem. So I don't think that is OT at all. This approach will scale well into much, much larger files. Functionality like "build histogram" and "print only the first time" are heavily optimized and a real DB is pretty smart about how it uses the memory available to it.

      I don't think we have heard back from the OP about requirement questions, so this is really just an exercise for our own edification and enjoyment.

        so this is really just an exercise for our own edification and enjoyment.

        So let me just enjoy some more light quibbling:

        Hipp, SQLite-man extraordinaire:

        "SQLite was a spin-off of postgres, a conceptual fork"

        "We continue to look up at postgres as our mentor"

        (which makes sense, why invent the wheel? I saw (just a few days ago) that the SQLIte devteam/Hipp are implementing some JSON operators, taking the example from postgres. Good!)

        More fun: a youtube talk by Hipp (admittedly talking to a postgres-audience, at PGcon 2014):

        PGcon 2014 talk by Hipp

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11140365]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (2)
As of 2022-07-04 00:25 GMT
Find Nodes?
    Voting Booth?

    No recent polls found