Re^3: sorting very large text files (slander)

BrowserUk >>[...] from 5 to 10 times as long ... Maybe more.

Sort will obviously always be faster than using a database (here, PostgreSQL) for loading and querying for the same result, but I want to show that you exaggerate the overhead of loading+quering into an output file. Indexing would be redundant, but I bet even with an index it wouldn't take that much time.

I constructed a similar file, and sorted by the dna column in the two ways under consideration, (sort versus database slurp+query).

Result:
  System sort: 59 minutes.
  Postgres load (6m) + select (59m): 66 minutes
[download]

$ ls -lh dna2.txt 
-rw-rw-r-- 1 xxxxx xxxxx 12G Dec 21 16:38 dna2.txt

1588    mm_ref_chrY.fa  2902494 R       CTTGACTTTTATGTGTCACTGTGTTCAGTT
+CCTGGT
939     mm_ref_chrY.fa  2902495 F       CCAGGAACTGACCACAGTGACACATAAGAG
+TCAAAT
742     mm_ref_chrY.fa  2902497 F       AGGAACTGACCACAGTGACACATAAGAGTC
+AAATAG
1097    mm_ref_chrY.fa  2902497 F       AGGAACTGACCACAGTGACACATAAGAGTC
+AAATAG
100     mm_ref_chrY.fa  2902499 F       GAACTGACCACAGTGACACATAAAAGTCAA
+GTAGTG
1184    mm_ref_chrY.fa  2902499 F       GAACTGACCACAGTGACACATAAAAGTCAA
+GTAGTG
286     mm_ref_chrY.fa  2902505 F       ACCACAGTGCCACATAAAAGTCAAGTAGGG
+AATCCT
235     mm_ref_chrY.fa  2902513 R       ACCAGGACAGGATCCCCTACTTGACTTTTA
+TGTGGC
1744    mm_ref_chrY.fa  2902516 F       ACATAAAAGTCAAGTAGTGGACCCTGTCCT
+GGTCTG
1029    mm_ref_chrY.fa  2902519 F       TAAAAGTCAAGTAGGGGATCCTGTCCTGGT
+CTGGCA

#
#  bash + sort version:
#

$ time sort -k5 dna2.txt > dna2.sortk5.out
real    59m48.641s

#
#  Postgres version:
#

#
# loading the data:
#

$ time < dna2.txt \
  psql -d test -c "
    drop table if exists dna_test;
    create table dna_test(
      y integer, 
      chromosome text,
      genomic_location integer,
      direction text,
      seq text);
     copy dna_test from stdin csv delimiter E'\t';
  " ;
real    6m20.430s

echo "
select to_char(count(*), '999G999G999') as rowcount
from dna_test" | psql -d test
   rowcount
--------------
  181,261,572
(1 row)

#
#  querying a resultset into a sorted file:
#
$ time echo " copy (select * from dna_test order by seq) to stdout" \
  | psql -1qtAd test > dna_test.order_by_seq.out
real    59m12.569s
[download]

So:

     unix sort: real    59m48.641s
table ORDER BY: real    59m12.569s  
the latter preceded by   6m20.430s  overhead for loading table data

So much for your (BrowserUK's) guess: "database takes 5 to 10 times as long as system sort ... Maybe more"...

It almost amounts to slander :)

Update 1

Of course, I couldn't resist to trying with an index as well. and sure enough it's useless / not used:

$ time echo "
       create index dna_test_seq_idx on dna_test (seq)
" | psql -d test
CREATE INDEX
real    63m20.451s
[download]

63m to create the index - that makes sense.

But of course, Pg will not use it:


$ time echo "
   explain analyze select * from dna_test order by seq
" | psql -1qtAd test > dna_test.order_by_seq.explain_analyze.txt

real    57m44.054s

$ cat dna_test.order_by_seq.explain_analyze.txt
Sort  (cost=35550611.84..36003765.76 rows=181261568 width=62) (actual 
+time=1834228.291..3428773.879 rows=181261572 loops=1)
  Sort Key: seq
  Sort Method:  external merge  Disk: 12933032kB
  ->  Seq Scan on dna_test  (cost=0.00..3872406.68 rows=181261568 widt
+h=62) (actual time=17.966..65802.621 rows=181261572 loops=1)
Total runtime: 3463580.243 ms
[download]

update 2: removed a useless use of cat, lest I receive another uuc-award...

Comment on Re^3: sorting very large text files (slander) Select or Download Code

Replies are listed 'Best First'.
Re^4: sorting very large text files (slander) by BrowserUk (Patriarch) on Dec 21, 2009 at 20:56 UTC
It almost amounts to slander :) Nice one! A DB answer with real code and data. A rare animal. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "I'd rather go naked than blow up my ass"	[reply]