comment on

BrowserUk >>[...] from 5 to 10 times as long ... Maybe more.

Sort will obviously always be faster than using a database (here, PostgreSQL) for loading and querying for the same result, but I want to show that you exaggerate the overhead of loading+quering into an output file. Indexing would be redundant, but I bet even with an index it wouldn't take that much time.

I constructed a similar file, and sorted by the dna column in the two ways under consideration, (sort versus database slurp+query).

Result:
  System sort: 59 minutes.
  Postgres load (6m) + select (59m): 66 minutes
[download]

$ ls -lh dna2.txt 
-rw-rw-r-- 1 xxxxx xxxxx 12G Dec 21 16:38 dna2.txt

1588    mm_ref_chrY.fa  2902494 R       CTTGACTTTTATGTGTCACTGTGTTCAGTT
+CCTGGT
939     mm_ref_chrY.fa  2902495 F       CCAGGAACTGACCACAGTGACACATAAGAG
+TCAAAT
742     mm_ref_chrY.fa  2902497 F       AGGAACTGACCACAGTGACACATAAGAGTC
+AAATAG
1097    mm_ref_chrY.fa  2902497 F       AGGAACTGACCACAGTGACACATAAGAGTC
+AAATAG
100     mm_ref_chrY.fa  2902499 F       GAACTGACCACAGTGACACATAAAAGTCAA
+GTAGTG
1184    mm_ref_chrY.fa  2902499 F       GAACTGACCACAGTGACACATAAAAGTCAA
+GTAGTG
286     mm_ref_chrY.fa  2902505 F       ACCACAGTGCCACATAAAAGTCAAGTAGGG
+AATCCT
235     mm_ref_chrY.fa  2902513 R       ACCAGGACAGGATCCCCTACTTGACTTTTA
+TGTGGC
1744    mm_ref_chrY.fa  2902516 F       ACATAAAAGTCAAGTAGTGGACCCTGTCCT
+GGTCTG
1029    mm_ref_chrY.fa  2902519 F       TAAAAGTCAAGTAGGGGATCCTGTCCTGGT
+CTGGCA

#
#  bash + sort version:
#

$ time sort -k5 dna2.txt > dna2.sortk5.out
real    59m48.641s

#
#  Postgres version:
#

#
# loading the data:
#

$ time < dna2.txt \
  psql -d test -c "
    drop table if exists dna_test;
    create table dna_test(
      y integer, 
      chromosome text,
      genomic_location integer,
      direction text,
      seq text);
     copy dna_test from stdin csv delimiter E'\t';
  " ;
real    6m20.430s

echo "
select to_char(count(*), '999G999G999') as rowcount
from dna_test" | psql -d test
   rowcount
--------------
  181,261,572
(1 row)

#
#  querying a resultset into a sorted file:
#
$ time echo " copy (select * from dna_test order by seq) to stdout" \
  | psql -1qtAd test > dna_test.order_by_seq.out
real    59m12.569s
[download]

So:

     unix sort: real    59m48.641s
table ORDER BY: real    59m12.569s  
the latter preceded by   6m20.430s  overhead for loading table data

So much for your (BrowserUK's) guess: "database takes 5 to 10 times as long as system sort ... Maybe more"...

It almost amounts to slander :)

Update 1

Of course, I couldn't resist to trying with an index as well. and sure enough it's useless / not used:

$ time echo "
       create index dna_test_seq_idx on dna_test (seq)
" | psql -d test
CREATE INDEX
real    63m20.451s
[download]

63m to create the index - that makes sense.

But of course, Pg will not use it:


$ time echo "
   explain analyze select * from dna_test order by seq
" | psql -1qtAd test > dna_test.order_by_seq.explain_analyze.txt

real    57m44.054s

$ cat dna_test.order_by_seq.explain_analyze.txt
Sort  (cost=35550611.84..36003765.76 rows=181261568 width=62) (actual 
+time=1834228.291..3428773.879 rows=181261572 loops=1)
  Sort Key: seq
  Sort Method:  external merge  Disk: 12933032kB
  ->  Seq Scan on dna_test  (cost=0.00..3872406.68 rows=181261568 widt
+h=62) (actual time=17.966..65802.621 rows=181261572 loops=1)
Total runtime: 3463580.243 ms
[download]

update 2: removed a useless use of cat, lest I receive another uuc-award...

In reply to Re^3: sorting very large text files (slander) by erix
in thread sorting very large text files by rnaeye

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.