We are matching two flat files and producing a join. For example:
Input_File_A has two fields sorted on PK_A_Field:
    PK_A_Field   Data_A_Field
    D            x
    D            y
    E            t
Input_File_B has two fields sorted on PK_B_Field:
    PK_B_Field   Data_B_Field
    D            m
    D            n
    D            o
    E            m
    E            s
The output file has three fields:
    PK_Field   Data_A_Field   Data_B_Field
    D          x              m
    D          x              n
    D          x              o
    D          y              m
    D          y              n
    D          y              o
    E          t              m
    E          t              s
i.e. some kind of Cartesian join: within each key, every record of file A is paired with every record of file B. The input files contain 65 million and 72 million records, and the output file has 1.7 trillion records. The program is taking weeks, albeit on a low-end server. Profiling with Devel::NYTProf shows the biggest chunk of time goes to writing the output records.
Any way to reduce this time?
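Since the profile points at the per-record print, one idea we are weighing is to batch records into a large scalar and flush it in big chunks. A minimal sketch, where $out_fh, $pk, @a_vals and @b_vals are placeholder names for illustration:

    # Sketch: batch many output records into one scalar and print it
    # in large chunks, amortizing the per-record cost of print().
    my $buf   = '';
    my $LIMIT = 1 << 20;                     # flush roughly every 1 MB

    for my $a (@a_vals) {
        for my $b (@b_vals) {
            $buf .= "$pk\t$a\t$b\n";
            if (length($buf) >= $LIMIT) {
                print {$out_fh} $buf;
                $buf = '';
            }
        }
    }
    print {$out_fh} $buf if length $buf;     # flush the remainder

This trades a little memory for far fewer calls into the I/O layer.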
Would it help to shift output records through a queue to another process, or to a set of round-robin processes? Something like IBM MQ or RabbitMQ?
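Or, short of full message-queue middleware, perhaps a plain OS pipe to a child process would give the same overlap of join work and writing. A minimal sketch, where the gzip command, the file name and @records are illustrative only:

    # Sketch: hand output to a child process through an OS pipe, so
    # compression/writing runs in parallel with the join loop.
    open my $out_fh, '|-', 'gzip -1 > part_0001.gz'
        or die "cannot start writer: $!";

    print {$out_fh} $_ for @records;   # kernel pipe buffer decouples the two

    close $out_fh or die "writer exited abnormally";

The kernel's pipe buffer lets the join loop run ahead of the writer without any extra middleware.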
We have tried several measures: use integer, IO::File, and paring down the main processing loop. We also tried splitting the input files and running on multiple machines in parallel. Nothing much gained as yet; after all, it is a simple single-pass program (we store the data for one key from one file in an array and write out the matching records from the other file), as in the sketch below.
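For concreteness, a stripped-down sketch of that single pass as a sort-merge join; tab-separated fields and the file names are assumptions for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch of a single-pass sort-merge join. Both inputs must
    # already be sorted on their first (key) field.
    open my $fa,  '<', 'Input_File_A' or die "Input_File_A: $!";
    open my $fb,  '<', 'Input_File_B' or die "Input_File_B: $!";
    open my $out, '>', 'Output_File'  or die "Output_File: $!";

    my $line_a = <$fa>;
    my $line_b = <$fb>;

    while (defined $line_a && defined $line_b) {
        my ($ka) = split /\t/, $line_a;
        my ($kb) = split /\t/, $line_b;

        if    ($ka lt $kb) { $line_a = <$fa>; next }
        elsif ($ka gt $kb) { $line_b = <$fb>; next }

        # Keys match: collect every B value sharing this key...
        my @b_vals;
        while (defined $line_b) {
            chomp(my ($k, $v) = split /\t/, $line_b);
            last if $k ne $ka;
            push @b_vals, $v;
            $line_b = <$fb>;
        }

        # ...then cross it with every A record carrying the same key.
        while (defined $line_a) {
            chomp(my ($k, $v) = split /\t/, $line_a);
            last if $k ne $ka;
            print {$out} "$ka\t$v\t$_\n" for @b_vals;
            $line_a = <$fa>;
        }
    }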
By the way, to manage the huge disk space needed, we close the output file every 100 million records, compress it, and delete the uncompressed file just written. We then open the next output file with an incremented file name. Onward processing happens on another machine.
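One refinement we are considering, extending the pipe idea above, is to fold the rotation and compression into one step by writing each chunk straight through gzip, so the uncompressed intermediate never touches the disk. The record count, file-name pattern and gzip level here are illustrative:

    # Sketch: rotate output every N records, writing through gzip so
    # no uncompressed "until-then" file is ever created.
    my $RECS_PER_FILE = 100_000_000;
    my ($out_fh, $file_no, $count) = (undef, 0, 0);

    sub write_record {
        my ($record) = @_;
        if ($count % $RECS_PER_FILE == 0) {          # time to rotate
            if ($out_fh) { close $out_fh or die "gzip exited abnormally" }
            my $name = sprintf 'output_%04d.gz', ++$file_no;
            open $out_fh, '|-', "gzip -1 > $name"
                or die "cannot start gzip for $name: $!";
        }
        print {$out_fh} $record;
        $count++;
    }

That would also remove the separate compress-and-delete pass over each 100-million-record chunk.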