Sanjay has asked for the wisdom of the Perl Monks concerning the following question:
We are matching two flat files and making a join, e.g.
Input_File_A has two fields sorted on PK_A_Field:
    PK_A_Field  Data_A_Field
    D           x
    D           y
    E           t
Input_File_B has two fields sorted on PK_B_Field:
    PK_B_Field  Data_B_Field
    D           m
    D           n
    D           o
    E           m
    E           s
Output file has three fields
    PK_Field  Data_A_Field  Data_B_Field
    D         x             m
    D         x             n
    D         x             o
    D         y             m
    D         y             n
    D         y             o
    E         t             m
    E         t             s
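To make the size of such a join concrete: for each key, the number of output rows is the number of A records with that key times the number of B records with that key. The following is only a small sketch of that arithmetic on the sample keys above; the sub name join_row_count is made up for illustration and is not from the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: count the rows a key-wise cross join would emit.
# For each key, rows = (occurrences in A) * (occurrences in B).
sub join_row_count {
    my ($keys_a, $keys_b) = @_;
    my (%count_a, %count_b);
    $count_a{$_}++ for @$keys_a;
    $count_b{$_}++ for @$keys_b;

    my $total = 0;
    for my $key (keys %count_a) {
        # Keys missing from B contribute nothing.
        $total += $count_a{$key} * ($count_b{$key} // 0);
    }
    return $total;
}

# Keys from the sample files: D gives 2 * 3 rows, E gives 1 * 2 rows.
my $rows = join_row_count([qw(D D E)], [qw(D D D E E)]);
print "$rows\n";    # prints 8, matching the eight output rows shown
```

The same per-key product explains how 65 million and 72 million input records can grow into a far larger output when keys repeat often on both sides.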
i.e. some kind of Cartesian join within each key. The input files contain 65 million and 72 million records, and the output file has 1.7 trillion records. The program is taking weeks, albeit on a low-end server. We profiled with Devel::NYTProf: the biggest chunk of time goes to writing the output records.
Any way to reduce this time?
Would it help to use a queue to hand output records off to another process, or to a set of round-robin processes? Something like IBM MQ or RabbitMQ?
We tried some measures such as use integer, IO::File, and paring down the main processing loop. We also tried splitting the input files and running on multiple machines in parallel. Nothing much has been achieved so far; after all, it is a simple single-pass program (we store the records for one key from one file in an array and write out the matching records of the other file).
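For readers who want to see the shape of the single-pass approach described above, here is a minimal sketch, not the poster's actual code. It assumes whitespace-separated "key value" lines, both inputs sorted in the same byte order that Perl's lt/gt use, and it buffers the B-side group so the output order matches the sample. The names next_rec and merge_join are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical reader: return (key, value) for the next line,
# or an empty list at end of file.
sub next_rec {
    my ($fh) = @_;
    my $line = <$fh>;
    return unless defined $line;
    chomp $line;
    return split ' ', $line, 2;
}

# Merge join of two key-sorted inputs. The B records for the current key
# are held in an array; each matching A record is written against them.
sub merge_join {
    my ($fh_a, $fh_b, $out) = @_;
    my ($ka, $va) = next_rec($fh_a);
    my ($kb, $vb) = next_rec($fh_b);

    while (defined $ka && defined $kb) {
        if    ($ka lt $kb) { ($ka, $va) = next_rec($fh_a) }
        elsif ($ka gt $kb) { ($kb, $vb) = next_rec($fh_b) }
        else {
            my $key = $ka;
            my @b_vals;
            while (defined $kb && $kb eq $key) {
                push @b_vals, $vb;
                ($kb, $vb) = next_rec($fh_b);
            }
            while (defined $ka && $ka eq $key) {
                # One print call per A record rather than one per output row.
                print {$out} map { "$key $va $_\n" } @b_vals;
                ($ka, $va) = next_rec($fh_a);
            }
        }
    }
}

# Demo on the sample data, using in-memory filehandles.
open my $fa,  '<', \ "D x\nD y\nE t\n"           or die "open A: $!";
open my $fb,  '<', \ "D m\nD n\nD o\nE m\nE s\n" or die "open B: $!";
open my $out, '>', \ my $buf                     or die "open out: $!";
merge_join($fa, $fb, $out);
close $out;
print $buf;
```

Building all rows for one A record and passing them to a single print is one way to cut the number of output calls; whether it helps in practice would have to be measured with the profiler.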
By the way, to manage the huge disk space needed, we close the output file every 100 million records, compress it, and delete the uncompressed file. We then open the next output file with an incremented number in the file name. Further processing happens on another machine.
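The rotation scheme just described could look roughly like the sketch below. It is an illustration under assumptions, not the poster's code: make_rotating_writer and the numbered file name pattern are invented, and the compress-and-delete step is left to a caller-supplied callback so that no particular compressor is assumed.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: returns ($write, $finish) closures. $write appends a
# record and closes the file once it holds $limit records; each closed file
# name is passed to $on_close, where compression and deletion could happen.
sub make_rotating_writer {
    my ($prefix, $limit, $on_close) = @_;
    my ($fh, $name, $count, $seq) = (undef, undef, 0, 0);

    my $finish = sub {
        return unless $fh;
        close $fh or die "close $name: $!";
        $on_close->($name);
        undef $fh;
    };
    my $write = sub {
        my ($record) = @_;
        if (!$fh) {
            # Start the next numbered file, e.g. out.000001, out.000002, ...
            $name = sprintf '%s.%06d', $prefix, ++$seq;
            open $fh, '>', $name or die "open $name: $!";
            $count = 0;
        }
        print {$fh} $record or die "write $name: $!";
        $finish->() if ++$count >= $limit;
    };
    return ($write, $finish);
}

# Demo with a tiny limit of 3; the post uses 100 million records per file.
my $dir = tempdir(CLEANUP => 1);
my @finished;
my ($write, $finish) =
    make_rotating_writer("$dir/out", 3, sub { push @finished, $_[0] });
$write->("row $_\n") for 1 .. 7;
$finish->();    # close the last, partially filled file
print scalar(@finished), " files\n";
```

Keeping the compression step behind a callback also makes it easy to try running it in a separate process, so that the writer does not wait for it.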
2019-07-03 Athanasius added code tags
Replies are listed 'Best First'.
Re: Speed up file write taking weeks
by dave_the_m (Monsignor) on Jun 29, 2019 at 14:56 UTC
Re: Speed up file write taking weeks
by marto (Cardinal) on Jun 29, 2019 at 16:02 UTC
Re: Speed up file write taking weeks
by holli (Abbot) on Jun 29, 2019 at 14:48 UTC
Re: Speed up file write taking weeks
by LanX (Saint) on Jun 29, 2019 at 23:46 UTC
by Sanjay (Sexton) on Jul 01, 2019 at 06:28 UTC
by dave_the_m (Monsignor) on Jul 01, 2019 at 10:55 UTC
by Sanjay (Sexton) on Jul 02, 2019 at 13:32 UTC
by dave_the_m (Monsignor) on Jul 02, 2019 at 14:12 UTC
by Corion (Patriarch) on Jul 02, 2019 at 13:40 UTC
by Marshall (Canon) on Jul 02, 2019 at 00:43 UTC
by Sanjay (Sexton) on Nov 22, 2019 at 16:00 UTC
by Marshall (Canon) on Nov 30, 2019 at 01:46 UTC
by karlgoethebier (Abbot) on Jul 02, 2019 at 19:22 UTC
Re: Speed up file write taking weeks
by marioroy (Prior) on Dec 01, 2019 at 06:55 UTC