I finally had a few minutes to think about the LLIL problem and looked into PostgreSQL for a solution. As a timing reference, I used the data generation and Perl reference solution from eyepopslikeamosquito's post Rosetta Code: Long List is Long. With the 3 files, the reference script takes nearly 90 seconds on my puny work laptop:

$ time perl llil.pl big1.txt big2.txt big3.txt > big.tmp
llil start
get_properties : 6 secs
sort + output  : 78 secs
total          : 84 secs

real    1m28,236s
user    1m27,327s
sys    0m0,816s

LLIL is treated as a one-time problem, so my PostgreSQL solution is certainly not ideal for it. But if you think of it as "these are logs and they grow over time", a database is certainly something to consider.

The working directory needs to be readable and writable by the PostgreSQL server process. Then we can bulk import and export the data with the COPY command. Here is the SQL script:

CREATE TABLE llil_raw (
    name text,
    count bigint
);

-- INDEX makes it *way* slower
--CREATE INDEX llil_raw_idx ON llil_raw(name);

COPY llil_raw (name, count) FROM '/home/cavac/src/long_list_is_long/big1.txt' ( FORMAT TEXT);
COPY llil_raw (name, count) FROM '/home/cavac/src/long_list_is_long/big2.txt' ( FORMAT TEXT);
COPY llil_raw (name, count) FROM '/home/cavac/src/long_list_is_long/big3.txt' ( FORMAT TEXT);


CREATE TABLE llil_result (
    name text,
    count bigint
);

INSERT INTO llil_result (name, count)
    SELECT name, sum(count) AS total FROM llil_raw GROUP BY name ORDER BY total;

COPY llil_result (name, count) TO '/home/cavac/src/long_list_is_long/result.txt' ( FORMAT TEXT);

DROP TABLE llil_result;
DROP TABLE llil_raw;

And the result:

$ time psql -U Earthrise_Server -d Test_DB -f llil.sql 
CREATE TABLE
COPY 3515200
COPY 3515200
COPY 3515200
CREATE TABLE
INSERT 0 10367359
COPY 10367359
DROP TABLE
DROP TABLE

real    0m19,675s
user    0m0,022s
sys    0m0,009s

Yes, it's a lot slower than the optimized one-time solutions. But in practice, I imagine, the data is probably produced over time. A database could then keep the running aggregate in llil_result up to date on every insert, without having to parse a gazillion files every time. It would also be way more flexible as soon as you need filtering or want to mix and match it with other data.
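
To make the "running aggregate" idea a bit more concrete, here is a rough sketch of how it could look with a row-level trigger and an upsert. I have not run this against the LLIL data. The table llil_running and the function and trigger names are made up for illustration, and it assumes PostgreSQL 11 or newer (older versions spell the last line EXECUTE PROCEDURE):

```sql
-- Hypothetical running-total table. Unlike llil_result in the script above,
-- it needs a primary key on name so that ON CONFLICT has something to match.
CREATE TABLE llil_running (
    name  text PRIMARY KEY,
    count bigint NOT NULL
);

-- Add each newly inserted raw row to the total for its name.
-- Assumes llil_raw.count is never NULL in the incoming data.
CREATE FUNCTION llil_add_to_running() RETURNS trigger AS $$
BEGIN
    INSERT INTO llil_running (name, count)
    VALUES (NEW.name, NEW.count)
    ON CONFLICT (name)
    DO UPDATE SET count = llil_running.count + EXCLUDED.count;
    RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER llil_raw_running
    AFTER INSERT ON llil_raw
    FOR EACH ROW EXECUTE FUNCTION llil_add_to_running();
```

The trigger also fires for every row loaded via COPY, so I would expect it to slow down bulk imports considerably. It is meant for the "data trickles in over time" case, not for the one-time benchmark.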

I also didn't do any optimization or parallelization (inherited subtables with exclusive primary key ranges and/or other partitioning tricks).
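
For the curious, a partitioned layout could look roughly like the following. This is a sketch only, using declarative hash partitioning (PostgreSQL 11 or newer) instead of the inheritance approach I mentioned. The table names and the choice of four partitions are arbitrary, and I have no numbers on whether it actually helps here:

```sql
-- Hypothetical partitioned variant of llil_raw. Every name hashes to
-- exactly one partition, so each partition can be aggregated on its own.
CREATE TABLE llil_raw_part (
    name  text,
    count bigint
) PARTITION BY HASH (name);

CREATE TABLE llil_raw_part_0 PARTITION OF llil_raw_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE llil_raw_part_1 PARTITION OF llil_raw_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE llil_raw_part_2 PARTITION OF llil_raw_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE llil_raw_part_3 PARTITION OF llil_raw_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Allow the planner to aggregate per partition (off by default).
SET enable_partitionwise_aggregate = on;

SELECT name, sum(count) AS total
FROM llil_raw_part
GROUP BY name;
```

COPY into llil_raw_part routes each row to the matching partition, so the import part of the script would only need the table name changed.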

What I DID learn in this experiment is that COPY operations can be blazingly fast for larger datasets, especially compared to INSERTs. (Most of the time spent here went into the INSERT INTO ... SELECT that generates the result table.)
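
Since most of the time goes into filling llil_result, one thing that might be worth trying is to let COPY export the aggregate query directly and skip the intermediate table. A sketch, not timed by me on this data; it reuses the path and column names from the script above:

```sql
-- Export the aggregate straight from the query, without llil_result.
COPY (
    SELECT name, sum(count) AS total
    FROM llil_raw
    GROUP BY name
    ORDER BY total
) TO '/home/cavac/src/long_list_is_long/result.txt' (FORMAT TEXT);
```

A side effect is that the ORDER BY now applies directly to the exported rows, instead of relying on the order in which a plain table happens to be scanned. psql's \copy meta-command accepts the same query form and does the file I/O on the client side, so the server process would not need access to the directory.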

In reply to Re: [OT] The Long List is Long resurrected by cavac
in thread [OT] The Long List is Long resurrected by marioroy