Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris

[OT] Re: Get unique fields from file

by erix (Prior)
on Jan 08, 2022 at 11:42 UTC ( #11140266=note: print w/replies, xml ) Need Help??

in reply to Get unique fields from file

I should probably say: don't do this at home, kids.

I'm not really sure what sroux is after. I have assumed the 'intention' from the solutions that others baked.

For fun, I open up a postgres SQL interface to the data in this file. I'll admit this postgres spelunking is not for the faint-hearted or for DB beginners. The advantage is getting access to a textfile via the power of SQL.

# bash file=/tmp/flatfile.dat t=flattable sep='|' echo "head1|head2|head3 val1|val2|val3 val1|val4|val5 val6|val4|val5 val2|val7|val5 val3|val7|val3" > $file # postgres extension 'file_fdw' must be installed (but haven't we all) +: psql -qXa -c "create server if not exists fs foreign data wrapper file +_fdw;" psql -qXa << SQL_PERL drop foreign table if exists $t; create foreign table $t ($( < $file perl -ne 'chomp; my $t = "'$t'"; print join("\n , ", map{$_ .= " text"} split(/['${sep}']/)); exit;' )) server fs options(delimiter'$sep',format'csv',header'TRUE',filename'$f +ile'); SQL_PERL

running the above, on-screen the SQL that was run, echoed from psql:

create server if not exists fs foreign data wrapper file_fdw; drop foreign table if exists flattable; create foreign table flattable (head1 text , head2 text , head3 text ) server fs options(delimiter'|',format'csv',header'TRUE',filename'/tmp/ +flatfile.dat');

Now the file is accessible as a foreign table named 'flattable'.

Once a postgres table, you can run SQL against it, for instance this SELECT:

(select 'head1' "heads", array_agg(distinct head1 order by head1) "val +ues" from flattable) union all (select 'head2' "heads", array_agg(distinct head2 order by head2) "val +ues" from flattable) union all (select 'head3' "heads", array_agg(distinct head3 order by head3) "val +ues" from flattable) -- output: -- -- heads | values -- -------+----------------------- -- head1 | {val1,val2,val3,val6} -- head2 | {val2,val4,val7} -- head3 | {val3,val5} -- (3 rows)

In case of many columns, say hundreds (it happens!), generate the needed SQL. (see psql's \gexec -- left as an exercise for the reader)

Replies are listed 'Best First'.
Re: [OT] Re: Get unique fields from file
by Marshall (Canon) on Jan 11, 2022 at 12:46 UTC
    Using an SQL DB for this sort of thing is actually a pretty good idea.
    I attach my SQLite multi-platform SQLite code below.

    There are utilities to import CSV files, but I have never bothered to install the SQLite3 command line on my machine. The Perl DBI comes with all the functions that Perl needs. I use a browser plug-in, SQLite Manager for admin functions. Firefox removed the I/F this plug-in needs, so I installed Waterfox for the single purpose of running this plug-in.

    The Perl syntax for some of the SQL command strings got a little hairy, so I included a debug statement for those. If you define a constant, like DEBUG_SQL, if you set that constant equal to 0, Perl will not even compile the "if" statement into the code. So there is no production runtime penalty for using this technique. I left those statements in the code as a demo of technique.

    I didn't bother with a CSV parser because there is no evidence at all that this is a CSV file. Include that if you like.

    SQL cannot make a table name a variable, but it can make a column name a variable. Unfortunately my SQL knowledge is inadequate at the moment. So I just made a "template" and ran a "prepare" on the adjusted template once per column. Impact on performance is zero.

    There has been some discussion in this thread about whether the OP actually intended unique values or distinct values. Most posts rejected the OP's plain English request for unique values in favor of an "distinct" interpretation. I suppose a hold over from UNIX uniq? I calculated both ways below.

    There is an easy way to vary SQLite's memory footprint. But it has been maybe 8 years since I needed to do that. I had a heavily indexed DB and found that creating the tables first and then adding the indices later was much faster than having the DB figure that stuff out "as it went" during DB creation. It was possible to vastly increase the footprint of SQLite for this indexing operation and then reduce it later. Something like that could have a huge performance impact on this SQL code.

    Update: Oh, darn I see that val3 should be unique in head1. Not sure where the bug is.
    OK, see **ADDED**, I wasn't ignoring blank lines.

    use strict; use warnings; use DBI; use constant {DEBUG_SQL => 1}; my $scratchDBname = "ScratchDB.sqlite"; my %attr = ( RaiseError => 1); # auto die with error printout my $dbh = DBI->connect("dbi:SQLite:dbname=$scratchDBname","","",\%attr +) or die "Couldn't connect to database $scratchDBname: " . DBI->errs +tr; my $headerLine = <DATA>; chomp $headerLine; my @headers = split (/\|/,$headerLine); # create main Table # $dbh->do("DROP TABLE IF EXISTS main"); my $createTableSQL = "CREATE TABLE main (". join (",",map{"\"$_\" TEXT NOT NULL"}@headers). ")"; print "SQL debug Statement: $createTableSQL\n" if DEBUG_SQL; $dbh->do($createTableSQL); # prepare insert row statement # my $valuesField = "?," x @headers; chop $valuesField; # get rid of last comma my $insertRowSQL = "INSERT INTO main (". join(",",@headers).")". "VALUES (".$valuesField.")"; print "SQL debug statement: $insertRowSQL\n" if DEBUG_SQL; + my $insert = $dbh->prepare($insertRowSQL); # populate main table by reading the 1/2 GB file # $dbh->begin_work; while (defined (my $line =<DATA>)) { chomp $line; next if $line =~ /^\s*$/; #ignore blank lines ***ADDED*** my @fields = split (/\|/,$line); $insert->execute(@fields); } $dbh->commit; # prepare SQL prototypes # # A Column can be a variable but requires more SQL-foo than # I have at the moment... # But making a "prototype" and adjusting that and preparing # on a per column basis appears to be no big deal my $distinct_proto = "SELECT __COLUMN from main GROUP BY __COLUMN ORDER BY __COLUMN"; my $unique_proto = "SELECT __COLUMN FROM ( SELECT __COLUMN, count(*) as histo FROM main GROUP BY __COLUMN ORDER BY __COLUMN) WHERE histo = 1"; foreach my $header (@headers) { my $distinct_select = $distinct_proto; $distinct_select =~ s/__COLUMN/$header/g; my $distinct = $dbh->prepare($distinct_select); my $unique_select = $unique_proto; $unique_select =~ s/__COLUMN/$header/g; my $unique = $dbh->prepare($unique_select); print "\n*****\n"; print "Distinct values for $header:\n"; $distinct->execute(); my $distinct_array_ref = $distinct -> fetchall_arrayref(); print "@$_\n" foreach @$distinct_array_ref; print "\nUnique values for $header:\n"; $unique->execute(); my $unique_array_ref = $unique -> fetchall_arrayref(); print "@$_\n" foreach @$unique_array_ref; } =Prints: SQL debug Statement: CREATE TABLE main ("head1" TEXT NOT NULL,"head2" +TEXT NOT NULL,"head3" TEXT NOT NULL) SQL debug statement: INSERT INTO main (head1,head2,head3)VALUES (?,?,? +) ***** Distinct values for head1: val1 val2 val3 val6 Unique values for head1: val2 val3 val6 ***** Distinct values for head2: val2 val4 val7 Unique values for head2: val2 ***** Distinct values for head3: val3 val5 Unique values for head3: =cut __DATA__ head1|head2|head3 val1|val2|val3 val1|val4|val5 val6|val4|val5 val2|val7|val5 val3|val7|val3

      That's nice, too, but it is essentially different from my example: your code actually /imports/ data, whereas mine only configures a table interface on top of an underlying text file, so that you can run any SQL against that underlying text table data (one could also read a zipped text file in this way).

      Let's just say SQL too has more than one way to do it ;)

        Well I guess there is also DBD::CSV but once I got onto SQLite, I haven't looked back. SQLite is the most used DB in the world (geez, it is on every cell phone and in almost all browsers...), well debugged and very efficient for what it does.

        Let's not quibble too much about the details.

        The main point is that using SQL especially in conjunction with a real DB is definitely "on the table" as a reasonable solution for the OP's problem. So I don't think that is OT at all. This approach will scale well into much, much larger files. Functionality like "build histogram" and "print only the first time" are heavily optimized and a real DB is pretty smart about how it uses the memory available to it.

        I don't think we have heard back from the OP about requirement questions, so this is really just an exercise for our own edification and enjoyment.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11140266]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (2)
As of 2022-05-27 00:54 GMT
Find Nodes?
    Voting Booth?
    Do you prefer to work remotely?

    Results (94 votes). Check out past polls.