I finally had a few minutes to think about the LLIL problem and looked into PostgreSQL for a solution. For timing reference, I use the data generation and Perl reference solution from eyepopslikeamosquito's post Rosetta Code: Long List is Long. With the 3 files, the reference script takes nearly 90 seconds on my puny work laptop:

$ time perl llil.pl big1.txt big2.txt big3.txt > big.tmp
llil start
get_properties : 6 secs
sort + output : 78 secs
total : 84 secs
real 1m28,236s
user 1m27,327s
sys 0m0,816s

LLIL is treated as a one-time problem, so my PostgreSQL solution is certainly not ideal for it. But if you think of it as a problem of "these are logs and they grow over time", a database is certainly something to consider.

The working directory needs to be readable and writable by the PostgreSQL server process, because COPY with a file name reads and writes the file on the server side. Then we can bulk import/export the data with the COPY command. Here is the SQL script:

CREATE TABLE llil_raw (
    name text,
    count bigint
);

-- INDEX makes it *way* slower
--CREATE INDEX llil_raw_idx ON llil_raw(name);

COPY llil_raw (name, count) FROM '/home/cavac/src/long_list_is_long/big1.txt' ( FORMAT TEXT);
COPY llil_raw (name, count) FROM '/home/cavac/src/long_list_is_long/big2.txt' ( FORMAT TEXT);
COPY llil_raw (name, count) FROM '/home/cavac/src/long_list_is_long/big3.txt' ( FORMAT TEXT);

CREATE TABLE llil_result (
    name text,
    count bigint
);

INSERT INTO llil_result (name, count)
    SELECT name, sum(count) AS total FROM llil_raw GROUP BY name ORDER BY total;

COPY llil_result (name, count) TO '/home/cavac/src/long_list_is_long/result.txt' ( FORMAT TEXT);

DROP TABLE llil_result;
DROP TABLE llil_raw;

And the result:

$ time psql -U Earthrise_Server -d Test_DB -f llil.sql
CREATE TABLE
COPY 3515200
COPY 3515200
COPY 3515200
CREATE TABLE
INSERT 0 10367359
COPY 10367359
DROP TABLE
DROP TABLE
real 0m19,675s
user 0m0,022s
sys 0m0,009s

Yes, it's a lot slower than the optimized one-time solutions. But in practice, I imagine, the data is probably produced over time, and a database could keep the running aggregate in llil_result up to date on every insert, without having to parse a gazillion files every time. It would also be way more flexible as soon as you need filtering or want to mix and match it with other data.
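To illustrate the "up to date on every insert" idea, here is a minimal, untested sketch using a row-level trigger and an upsert. The table llil_totals, the function llil_raw_accumulate and the trigger name are hypothetical names made up for this example; I use a separate table instead of llil_result because ON CONFLICT needs a unique constraint on name. It assumes PostgreSQL 11 or newer (older versions spell it EXECUTE PROCEDURE).

```sql
-- Hypothetical running-totals table; the primary key is required by ON CONFLICT.
CREATE TABLE llil_totals (
    name  text PRIMARY KEY,
    total bigint NOT NULL
);

-- Hypothetical trigger function: fold each new raw row into the running total.
CREATE FUNCTION llil_raw_accumulate() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO llil_totals (name, total)
    VALUES (NEW.name, COALESCE(NEW.count, 0))
    ON CONFLICT (name)
    DO UPDATE SET total = llil_totals.total + EXCLUDED.total;
    RETURN NULL;  -- the return value is ignored for AFTER ROW triggers
END;
$$;

-- Fires for every row added to llil_raw, including rows loaded via COPY.
CREATE TRIGGER llil_raw_accumulate_trg
AFTER INSERT ON llil_raw
FOR EACH ROW EXECUTE FUNCTION llil_raw_accumulate();
```

With this in place, the report becomes a plain SELECT name, total FROM llil_totals ORDER BY total. The trade-off is that every insert now pays for an extra upsert, so the initial bulk COPY would be noticeably slower than in the timing above.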

I also didn't do any optimization or parallelization (inherited subtables with exclusive primary key ranges and/or other partitioning tricks).
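For the curious, one of those partitioning tricks could look roughly like the sketch below. Note that this swaps the inherited subtables I mentioned for declarative hash partitioning (PostgreSQL 11 or newer). The table name llil_raw_part and the choice of 4 partitions are assumptions for illustration, and I have not measured whether it actually helps with this data set.

```sql
-- Hypothetical partitioned variant of llil_raw, split by a hash of the name.
CREATE TABLE llil_raw_part (
    name  text,
    count bigint
) PARTITION BY HASH (name);

CREATE TABLE llil_raw_part_0 PARTITION OF llil_raw_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE llil_raw_part_1 PARTITION OF llil_raw_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE llil_raw_part_2 PARTITION OF llil_raw_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE llil_raw_part_3 PARTITION OF llil_raw_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Let the planner aggregate each partition on its own (off by default).
SET enable_partitionwise_aggregate = on;

SELECT name, sum(count) AS total
FROM llil_raw_part
GROUP BY name
ORDER BY total;
```

Because the GROUP BY key is the same as the partition key, every name lives in exactly one partition, so the per-partition sums never need to be merged across partitions.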

What I DID learn in this experiment is that COPY operations can be blazingly fast for larger datasets, especially compared to INSERTs. (Most of the time spent here went into the INSERT INTO ... SELECT that generates the result table.)


In reply to Re: [OT] The Long List is Long resurrected by cavac
in thread [OT] The Long List is Long resurrected by marioroy
