Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
Using an SQL DB for this sort of thing is actually a pretty good idea.
I attach my SQLite multi-platform SQLite code below.

There are utilities to import CSV files, but I have never bothered to install the SQLite3 command line on my machine. The Perl DBI comes with all the functions that Perl needs. I use a browser plug-in, SQLite Manager for admin functions. Firefox removed the I/F this plug-in needs, so I installed Waterfox for the single purpose of running this plug-in.

The Perl syntax for some of the SQL command strings got a little hairy, so I included a debug statement for those. If you define a constant, like DEBUG_SQL, if you set that constant equal to 0, Perl will not even compile the "if" statement into the code. So there is no production runtime penalty for using this technique. I left those statements in the code as a demo of technique.

I didn't bother with a CSV parser because there is no evidence at all that this is a CSV file. Include that if you like.

SQL cannot make a table name a variable, but it can make a column name a variable. Unfortunately my SQL knowledge is inadequate at the moment. So I just made a "template" and ran a "prepare" on the adjusted template once per column. Impact on performance is zero.

There has been some discussion in this thread about whether the OP actually intended unique values or distinct values. Most posts rejected the OP's plain English request for unique values in favor of an "distinct" interpretation. I suppose a hold over from UNIX uniq? I calculated both ways below.

There is an easy way to vary SQLite's memory footprint. But it has been maybe 8 years since I needed to do that. I had a heavily indexed DB and found that creating the tables first and then adding the indices later was much faster than having the DB figure that stuff out "as it went" during DB creation. It was possible to vastly increase the footprint of SQLite for this indexing operation and then reduce it later. Something like that could have a huge performance impact on this SQL code.

Update: Oh, darn I see that val3 should be unique in head1. Not sure where the bug is.
OK, see **ADDED**, I wasn't ignoring blank lines.

use strict; use warnings; use DBI; use constant {DEBUG_SQL => 1}; my $scratchDBname = "ScratchDB.sqlite"; my %attr = ( RaiseError => 1); # auto die with error printout my $dbh = DBI->connect("dbi:SQLite:dbname=$scratchDBname","","",\%attr +) or die "Couldn't connect to database $scratchDBname: " . DBI->errs +tr; my $headerLine = <DATA>; chomp $headerLine; my @headers = split (/\|/,$headerLine); # create main Table # $dbh->do("DROP TABLE IF EXISTS main"); my $createTableSQL = "CREATE TABLE main (". join (",",map{"\"$_\" TEXT NOT NULL"}@headers). ")"; print "SQL debug Statement: $createTableSQL\n" if DEBUG_SQL; $dbh->do($createTableSQL); # prepare insert row statement # my $valuesField = "?," x @headers; chop $valuesField; # get rid of last comma my $insertRowSQL = "INSERT INTO main (". join(",",@headers).")". "VALUES (".$valuesField.")"; print "SQL debug statement: $insertRowSQL\n" if DEBUG_SQL; + my $insert = $dbh->prepare($insertRowSQL); # populate main table by reading the 1/2 GB file # $dbh->begin_work; while (defined (my $line =<DATA>)) { chomp $line; next if $line =~ /^\s*$/; #ignore blank lines ***ADDED*** my @fields = split (/\|/,$line); $insert->execute(@fields); } $dbh->commit; # prepare SQL prototypes # # A Column can be a variable but requires more SQL-foo than # I have at the moment... # But making a "prototype" and adjusting that and preparing # on a per column basis appears to be no big deal my $distinct_proto = "SELECT __COLUMN from main GROUP BY __COLUMN ORDER BY __COLUMN"; my $unique_proto = "SELECT __COLUMN FROM ( SELECT __COLUMN, count(*) as histo FROM main GROUP BY __COLUMN ORDER BY __COLUMN) WHERE histo = 1"; foreach my $header (@headers) { my $distinct_select = $distinct_proto; $distinct_select =~ s/__COLUMN/$header/g; my $distinct = $dbh->prepare($distinct_select); my $unique_select = $unique_proto; $unique_select =~ s/__COLUMN/$header/g; my $unique = $dbh->prepare($unique_select); print "\n*****\n"; print "Distinct values for $header:\n"; $distinct->execute(); my $distinct_array_ref = $distinct -> fetchall_arrayref(); print "@$_\n" foreach @$distinct_array_ref; print "\nUnique values for $header:\n"; $unique->execute(); my $unique_array_ref = $unique -> fetchall_arrayref(); print "@$_\n" foreach @$unique_array_ref; } =Prints: SQL debug Statement: CREATE TABLE main ("head1" TEXT NOT NULL,"head2" +TEXT NOT NULL,"head3" TEXT NOT NULL) SQL debug statement: INSERT INTO main (head1,head2,head3)VALUES (?,?,? +) ***** Distinct values for head1: val1 val2 val3 val6 Unique values for head1: val2 val3 val6 ***** Distinct values for head2: val2 val4 val7 Unique values for head2: val2 ***** Distinct values for head3: val3 val5 Unique values for head3: =cut __DATA__ head1|head2|head3 val1|val2|val3 val1|val4|val5 val6|val4|val5 val2|val7|val5 val3|val7|val3

In reply to Re: [OT] Re: Get unique fields from file by Marshall
in thread Get unique fields from file by sroux

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (5)
As of 2024-04-19 10:46 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found