Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re: Rosetta Code: Long List is Long - SQLite code

by marioroy (Prior)
on Feb 08, 2023 at 02:17 UTC ( [id://11150229]=note: print w/replies, xml ) Need Help??


in reply to Rosetta Code: Long List is Long

I tried a key-value store implementation using SQLite. There are two scripts; one for importing files and another to output the results.

Note: See this thread for generating the big input files.

$ wc -l big1.txt big2.txt big3.txt 3515200 big1.txt 3515200 big2.txt 3515200 big3.txt 10545600 total $ time bash llil_sql_imp1 hash1.db big1.txt big2.txt big3.txt Importing big1.txt Importing big2.txt Importing big3.txt real 0m12.913s user 0m12.701s sys 0m0.197s $ time bash llil_sql_qry1 hash1.db >out.txt real 0m6.297s user 0m6.157s sys 0m0.158s $ ls -lh hash1.db -rw-r--r-- 1 mario mario 330M Feb 7 22:55 hash1.db

Updated on 2023-02-08

llil_sql_imp1:

The optional -reset argument to the importer script removes the DB file. The .output to /dev/null is to silence SQLite when setting PRAGMA statements. The .output without an argument restores it back to standard output.

#!/usr/bin/env bash # https://perlmonks.org/?node_id=11150229 # SQLite llil import script. # Usage: bash llil_sql_imp1 [-reset] hash1.db big1.txt big2.txt big3.t +xt # SQL function make_script () { read -r -d '' var << 'EOF' -- Set SQLite pragmas. -- .output /dev/null PRAGMA cache_size = 10000; PRAGMA page_size = 4096; PRAGMA journal_mode = OFF; PRAGMA synchronous = OFF; PRAGMA temp_store = MEMORY; .output -- Create the kv_store table. -- CREATE TABLE IF NOT EXISTS "kv_store" ( "name" TEXT NOT NULL UNIQUE, "value" INTEGER DEFAULT 0, PRIMARY KEY("name") ); -- Create a view for the kv_store table. -- CREATE VIEW IF NOT EXISTS properties("name", "value") AS SELECT "name", "value" FROM "kv_store"; -- Create a trigger for inserting into properties. -- CREATE TRIGGER IF NOT EXISTS insert_trg INSTEAD OF INSERT ON properties BEGIN -------------------------------------------------------------------- +----- -- This form (ON DUPLICATE KEY) is not supported in SQLite. -- -- INSERT INTO "kv_store" ("name", "value") VALUES (NEW.name, NEW.va +lue) -- ON DUPLICATE KEY UPDATE "value" = kv_store.value + NEW.value; -------------------------------------------------------------------- +----- -- Try this instead; supported since SQLite version 3.24.0 (2018-06- +04). -- https://sqlite.org/lang_UPSERT.html -- INSERT INTO "kv_store" ("name", "value") VALUES (NEW.name, NEW.va +lue) ON CONFLICT("name") DO UPDATE SET "value" = kv_store.value + NE +W.value; -------------------------------------------------------------------- +----- -- For SQLite before version 3.24.0 (2018-06-04) or Berkeley DB (dbs +ql). -- -- -- Try updating an existing record. -- UPDATE "kv_store" -- SET "value" = kv_store.value + NEW.value WHERE "name" = NEW.nam +e; -- -- Insert a row unless the update occurred. -- INSERT INTO "kv_store" ("name", "value") -- SELECT NEW.name, NEW.value WHERE (SELECT changes() = 0); END; -- Drop the query index to maximize import performance. -- DROP INDEX IF EXISTS "query_idx"; -- Import CSV/TXT file(s). -- .mode csv .headers off .separator ROW "\n" .separator "\t" EOF echo "$var" } # RUN if [[ "$1" == "-reset" ]]; then DB="$2"; shift; shift rm -fr "$DB" "$DB"-journal "$DB"-shm "$DB"-wal else DB="$1"; shift fi if [[ ! "$DB" =~ .db$ ]]; then echo "usage: bash $0 [ -reset ] hash1.db [ in1.txt ... ]" >&2 exit 1 fi SQLCMND=$( command -v sqlite3 ) if [[ -z "$SQLCMND" ]]; then echo "sqlite3: command not found" >&2 exit 1 fi SQLSCRIPT="$( make_script )" NL=$'\n' for infile in "$@"; do if [[ ! -r "$infile" ]]; then echo "$infile: cannot open '$infile', skipping..." >&2 continue fi SQLSCRIPT+="${NL}.print Importing ${infile}" SQLSCRIPT+="${NL}.import '${infile}' properties" done exec "$SQLCMND" "$DB" < <(echo "$SQLSCRIPT")

llil_sql_qry1:

This is the complementary script to output the data with proper formatting and sort order.

#!/usr/bin/env bash # https://perlmonks.org/?node_id=11150229 # SQLite llil query script. # Usage: bash llil_sql_qry1 hash1.db > out.txt # SQL function make_script () { read -r -d '' var << 'EOF' -- Set SQLite pragmas. -- .output /dev/null PRAGMA cache_size = 10000; PRAGMA page_size = 4096; PRAGMA journal_mode = OFF; PRAGMA synchronous = OFF; PRAGMA temp_store = MEMORY; .output -- A query index is not needed for one-time use. -- -- CREATE INDEX IF NOT EXISTS "query_idx" ON "kv_store" ( -- "value" DESC, "name" -- ); -- Output records. -- .mode csv .headers off .separator ROW "\n" .separator "\t" SELECT "name", "value" FROM "kv_store" ORDER BY "value" DESC, "name"; EOF echo "$var" } # RUN DB="$1"; shift if [[ ! "$DB" =~ .db$ ]]; then echo "usage: bash $0 hash1.db" >&2 exit 1 elif [[ ! -r "$DB" ]]; then echo "$DB: cannot open '$DB' (No such file)" >&2 exit 1 fi SQLCMND=$( command -v sqlite3 ) if [[ -z "$SQLCMND" ]]; then echo "sqlite3: command not found" >&2 exit 1 fi SQLSCRIPT="$( make_script )" exec "$SQLCMND" "$DB" < <(echo "$SQLSCRIPT")

Replies are listed 'Best First'.
Re^2: Rosetta Code: Long List is Long - SQLite code
by marioroy (Prior) on Feb 08, 2023 at 05:13 UTC

    If the requirement calls for a one-time report, then I tried another implementation using SQLite; without a view, trigger, and primary key. A group by clause is used for the select statement, now that the DB has duplicate keys. The total time sheds 5 seconds including smaller DB size. Note: Variant two is not compatible with variant one.

    running:

    $ wc -l big1.txt big2.txt big3.txt 3515200 big1.txt 3515200 big2.txt 3515200 big3.txt 10545600 total $ time bash llil_sql_imp2 hash2.db big1.txt big2.txt big3.txt Importing big1.txt Importing big2.txt Importing big3.txt real 0m4.350s user 0m4.298s sys 0m0.049s $ time bash llil_sql_qry2 hash2.db >out.txt real 0m9.668s user 0m9.537s sys 0m0.122s $ ls -lh hash2.db -rw-r--r-- 1 mario mario 160M Feb 7 22:57 hash2.db

    q - Text as Data:

    I came across q - Run SQL directly on delimited files and multi-file sqlite databases on GitHub. Specify -T for tab-delimited output. You may like this tool. Check it out.

    $ time ./linux-q -T "select name, sum(value) from hash.db:::kv_store g +roup by name order by sum(value) desc, name" >out.txt real 0m49.136s user 0m48.667s sys 0m0.423s

    llil_sql_imp2:

    Like variant one, the optional -reset argument to the importer script removes the DB file. The .output to /dev/null is to silence SQLite when setting PRAGMA statements. The .output without an argument restores it back to standard output.

    llil_sql_qry2:

    This produces output with proper formatting and sort order.

      Part three :) I tried creating a single SQLite DB file per each list file. The query script then reads all DBs by attaching other DBs to the first DB. Something like this.

      ATTACH DATABASE 'hash2.db' AS ha_2; ATTACH DATABASE 'hash3.db' AS ha_3; SELECT name, SUM(value) FROM ( SELECT * FROM main.kv_store UNION ALL SELECT * FROM ha_2.kv_store UNION ALL SELECT * FROM ha_3.kv_store ) GROUP BY name ORDER BY SUM(value) DESC, name; DETACH DATABASE ha_2; DETACH DATABASE ha_3;

      running:

      Each import takes ~ 1.5 seconds on my machine. The query takes the same amount of time as before.

      $ bash llil_sql_imp2 hash1.db big1.txt $ bash llil_sql_imp2 hash2.db big3.txt $ bash llil_sql_imp2 hash3.db big3.txt $ time bash llil_sql_qryn hash1.db hash2.db hash3.db >out.txt real 0m9.650s user 0m9.534s sys 0m0.106s

      llil_sql_qryn:

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11150229]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (9)
As of 2024-03-28 18:48 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found