BrowserUk >>[...] from 5 to 10 times as long ... Maybe more.
Sort will obviously always be faster than using a database (here, PostgreSQL) for loading and querying for the same result, but I want to show that you exaggerate the overhead of loading+quering into an output file. Indexing would be redundant, but I bet even with an index it wouldn't take that much time.
I constructed a similar file, and sorted by the dna column in the two ways under consideration, (sort versus database slurp+query).
Result:
System sort: 59 minutes.
Postgres load (6m) + select (59m): 66 minutes
$ ls -lh dna2.txt
-rw-rw-r-- 1 xxxxx xxxxx 12G Dec 21 16:38 dna2.txt
1588 mm_ref_chrY.fa 2902494 R CTTGACTTTTATGTGTCACTGTGTTCAGTT
+CCTGGT
939 mm_ref_chrY.fa 2902495 F CCAGGAACTGACCACAGTGACACATAAGAG
+TCAAAT
742 mm_ref_chrY.fa 2902497 F AGGAACTGACCACAGTGACACATAAGAGTC
+AAATAG
1097 mm_ref_chrY.fa 2902497 F AGGAACTGACCACAGTGACACATAAGAGTC
+AAATAG
100 mm_ref_chrY.fa 2902499 F GAACTGACCACAGTGACACATAAAAGTCAA
+GTAGTG
1184 mm_ref_chrY.fa 2902499 F GAACTGACCACAGTGACACATAAAAGTCAA
+GTAGTG
286 mm_ref_chrY.fa 2902505 F ACCACAGTGCCACATAAAAGTCAAGTAGGG
+AATCCT
235 mm_ref_chrY.fa 2902513 R ACCAGGACAGGATCCCCTACTTGACTTTTA
+TGTGGC
1744 mm_ref_chrY.fa 2902516 F ACATAAAAGTCAAGTAGTGGACCCTGTCCT
+GGTCTG
1029 mm_ref_chrY.fa 2902519 F TAAAAGTCAAGTAGGGGATCCTGTCCTGGT
+CTGGCA
#
# bash + sort version:
#
$ time sort -k5 dna2.txt > dna2.sortk5.out
real 59m48.641s
#
# Postgres version:
#
#
# loading the data:
#
$ time < dna2.txt \
psql -d test -c "
drop table if exists dna_test;
create table dna_test(
y integer,
chromosome text,
genomic_location integer,
direction text,
seq text);
copy dna_test from stdin csv delimiter E'\t';
" ;
real 6m20.430s
echo "
select to_char(count(*), '999G999G999') as rowcount
from dna_test" | psql -d test
rowcount
--------------
181,261,572
(1 row)
#
# querying a resultset into a sorted file:
#
$ time echo " copy (select * from dna_test order by seq) to stdout" \
| psql -1qtAd test > dna_test.order_by_seq.out
real 59m12.569s
So:
unix sort: real 59m48.641s
table ORDER BY: real 59m12.569s
the latter preceded by 6m20.430s overhead for loading table data
So much for your (BrowserUK's) guess: "database takes 5 to 10 times as long as system sort ... Maybe more"...
It almost amounts to slander :)
Update 1
Of course, I couldn't resist to trying with an index as well. and sure enough it's useless / not used:
$ time echo "
create index dna_test_seq_idx on dna_test (seq)
" | psql -d test
CREATE INDEX
real 63m20.451s
63m to create the index - that makes sense.
But of course, Pg will not use it:
$ time echo "
explain analyze select * from dna_test order by seq
" | psql -1qtAd test > dna_test.order_by_seq.explain_analyze.txt
real 57m44.054s
$ cat dna_test.order_by_seq.explain_analyze.txt
Sort (cost=35550611.84..36003765.76 rows=181261568 width=62) (actual
+time=1834228.291..3428773.879 rows=181261572 loops=1)
Sort Key: seq
Sort Method: external merge Disk: 12933032kB
-> Seq Scan on dna_test (cost=0.00..3872406.68 rows=181261568 widt
+h=62) (actual time=17.966..65802.621 rows=181261572 loops=1)
Total runtime: 3463580.243 ms
update 2: removed a useless use of cat, lest I receive another uuc-award...
|