Re^2: Rosetta Code: Long List is Long

If the requirement calls for a one-time report, then I tried another implementation using SQLite; without a view, trigger, and primary key. A group by clause is used for the select statement, now that the DB has duplicate keys. The total time sheds 5 seconds including smaller DB size. Note: Variant two is not compatible with variant one.

running:

$ wc -l big1.txt big2.txt big3.txt 
 3515200 big1.txt
 3515200 big2.txt
 3515200 big3.txt
10545600 total

$ time bash llil_sql_imp2 hash2.db big1.txt big2.txt big3.txt
Importing big1.txt
Importing big2.txt
Importing big3.txt

real  0m4.350s
user  0m4.298s
sys   0m0.049s

$ time bash llil_sql_qry2 hash2.db >out.txt

real  0m9.668s
user  0m9.537s
sys   0m0.122s

$ ls -lh hash2.db 
-rw-r--r-- 1 mario mario 160M Feb  7 22:57 hash2.db
[download]

q - Text as Data:

I came across q - Run SQL directly on delimited files and multi-file sqlite databases on GitHub. Specify -T for tab-delimited output. You may like this tool. Check it out.

$ time ./linux-q -T "select name, sum(value) from hash.db:::kv_store g
+roup by name order by sum(value) desc, name" >out.txt

real  0m49.136s
user  0m48.667s
sys   0m0.423s
[download]

llil_sql_imp2:

Like variant one, the optional -reset argument to the importer script removes the DB file. The .output to /dev/null is to silence SQLite when setting PRAGMA statements. The .output without an argument restores it back to standard output.

#!/usr/bin/env bash
# https://perlmonks.org/?node_id=11150232

# SQLite llil import script.
# Usage: bash llil_sql_imp2 [-reset] hash2.db big1.txt big2.txt big3.t
+xt

# SQL

function make_script () {

  read -r -d '' var << 'EOF'

-- Set SQLite pragmas.
--
.output /dev/null
PRAGMA cache_size   = 10000;
PRAGMA page_size    = 4096;
PRAGMA journal_mode = OFF;
PRAGMA synchronous  = OFF;
PRAGMA temp_store   = MEMORY;
.output

-- Create the kv_store table.
--
CREATE TABLE IF NOT EXISTS "kv_store" (
  "name"   TEXT NOT NULL,
  "value"  INTEGER DEFAULT 0
);

-- Import CSV/TXT file(s).
--
.mode csv
.headers off
.separator ROW "\n"
.separator "\t"
EOF

  echo "$var"
}

# RUN

if [[ "$1" == "-reset" ]]; then
  DB="$2"; shift; shift
  rm -fr "$DB" "$DB"-journal "$DB"-shm "$DB"-wal
else
  DB="$1"; shift
fi

if [[ ! "$DB" =~ .db$ ]]; then
  echo "usage: bash $0 [ -reset ] hash2.db [ in1.txt ... ]" >&2
  exit 1
fi

SQLCMND=$( command -v sqlite3 )

if [[ -z "$SQLCMND" ]]; then
  echo "sqlite3: command not found" >&2
  exit 1
fi

SQLSCRIPT="$( make_script )"

NL=$'\n'

for infile in "$@"; do
  if [[ ! -r "$infile" ]]; then
    echo "$infile: cannot open '$infile', skipping..." >&2
    continue
  fi
  SQLSCRIPT+="${NL}.print Importing ${infile}"
  SQLSCRIPT+="${NL}.import '${infile}' kv_store"
done

exec "$SQLCMND" "$DB" < <(echo "$SQLSCRIPT")
[download]

llil_sql_qry2:

This produces output with proper formatting and sort order.

#!/usr/bin/env bash
# https://perlmonks.org/?node_id=11150232

# SQLite llil query script.
# Usage: bash llil_sql_qry2 hash2.db > out.txt

# SQL

function make_script () {

  read -r -d '' var << 'EOF'

-- Set SQLite pragmas.
--
.output /dev/null
PRAGMA cache_size   = 10000;
PRAGMA page_size    = 4096;
PRAGMA journal_mode = OFF;
PRAGMA synchronous  = OFF;
PRAGMA temp_store   = MEMORY;
.output

-- Output records.
--
.mode csv
.headers off
.separator ROW "\n"
.separator "\t"

SELECT "name", sum("value") FROM "kv_store"
  GROUP BY "name"
  ORDER BY sum("value") DESC, "name";

EOF

  echo "$var"
}

# RUN

DB="$1"; shift

if [[ ! "$DB" =~ .db$ ]]; then
  echo "usage: bash $0 hash2.db" >&2
  exit 1
elif [[ ! -r "$DB" ]]; then
  echo "$DB: cannot open '$DB' (No such file)" >&2
  exit 1
fi

SQLCMND=$( command -v sqlite3 )

if [[ -z "$SQLCMND" ]]; then
  echo "sqlite3: command not found" >&2
  exit 1
fi

SQLSCRIPT="$( make_script )"

exec "$SQLCMND" "$DB" < <(echo "$SQLSCRIPT")
[download]

Comment on Re^2: Rosetta Code: Long List is Long - SQLite code Select or Download Code

Replies are listed 'Best First'.
Re^3: Rosetta Code: Long List is Long - SQLite code by marioroy (Prior) on Feb 08, 2023 at 23:14 UTC
Part three :) I tried creating a single SQLite DB file per each list file. The query script then reads all DBs by attaching other DBs to the first DB. Something like this. `ATTACH DATABASE 'hash2.db' AS ha_2; ATTACH DATABASE 'hash3.db' AS ha_3; SELECT name, SUM(value) FROM ( SELECT * FROM main.kv_store UNION ALL SELECT * FROM ha_2.kv_store UNION ALL SELECT * FROM ha_3.kv_store ) GROUP BY name ORDER BY SUM(value) DESC, name; DETACH DATABASE ha_2; DETACH DATABASE ha_3;` [download] running: Each import takes ~ 1.5 seconds on my machine. The query takes the same amount of time as before. `$ bash llil_sql_imp2 hash1.db big1.txt $ bash llil_sql_imp2 hash2.db big3.txt $ bash llil_sql_imp2 hash3.db big3.txt $ time bash llil_sql_qryn hash1.db hash2.db hash3.db >out.txt real 0m9.650s user 0m9.534s sys 0m0.106s` [download] llil_sql_qryn: Read more... (4 kB)	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^3: Rosetta Code: Long List is Long - SQLite code
by marioroy (Prior) on Feb 08, 2023 at 23:14 UTC

Part three :) I tried creating a single SQLite DB file per each list file. The query script then reads all DBs by attaching other DBs to the first DB. Something like this.

ATTACH DATABASE 'hash2.db' AS ha_2;
ATTACH DATABASE 'hash3.db' AS ha_3;

SELECT name, SUM(value) FROM (
  SELECT * FROM main.kv_store UNION ALL
  SELECT * FROM ha_2.kv_store UNION ALL
  SELECT * FROM ha_3.kv_store
) GROUP BY name ORDER BY SUM(value) DESC, name;

DETACH DATABASE ha_2;
DETACH DATABASE ha_3;
[download]

running:

Each import takes ~ 1.5 seconds on my machine. The query takes the same amount of time as before.

$ bash llil_sql_imp2 hash1.db big1.txt
$ bash llil_sql_imp2 hash2.db big3.txt
$ bash llil_sql_imp2 hash3.db big3.txt

$ time bash llil_sql_qryn hash1.db hash2.db hash3.db >out.txt

real  0m9.650s
user  0m9.534s
sys   0m0.106s
[download]

llil_sql_qryn: