We are matching two flat files and producing a join. For example:
Input_File_A has two fields sorted on PK_A_Field:
    PK_A_Field   Data_A_Field
    D            x
    D            y
    E            t
Input_File_B has two fields sorted on PK_B_Field:
    PK_B_Field   Data_B_Field
    D            m
    D            n
    D            o
    E            m
    E            s
The output file has three fields:
    PK_Field   Data_A_Field   Data_B_Field
    D          x              m
    D          x              n
    D          x              o
    D          y              m
    D          y              n
    D          y              o
    E          t              m
    E          t              s
i.e. some kind of Cartesian join: within each key, every record of file A is paired with every record of file B. The input files contain 65 million and 72 million records, and the output file has 1.7 trillion records. The program is taking weeks, albeit on a low-end server. Profiling with Devel::NYTProf shows the biggest chunk of time goes to writing the output records.
Any way to reduce this time?
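Since the profile points at the per-record print, one idea we are weighing is to batch records into a large scalar and flush it in big chunks. A minimal sketch, where $out_fh, $pk, @a_vals and @b_vals are placeholder names for illustration:

    # Sketch: batch many output records into one scalar and print it
    # in large chunks, amortizing the per-record cost of print().
    my $buf   = '';
    my $LIMIT = 1 << 20;                     # flush roughly every 1 MB

    for my $a (@a_vals) {
        for my $b (@b_vals) {
            $buf .= "$pk\t$a\t$b\n";
            if (length($buf) >= $LIMIT) {
                print {$out_fh} $buf;
                $buf = '';
            }
        }
    }
    print {$out_fh} $buf if length $buf;     # flush the remainder

This trades a little memory for far fewer calls into the I/O layer.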
Would it help to shift output records through a queue to another process, or to a set of round-robin processes? Something like IBM MQ or RabbitMQ?
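Or, short of full message-queue middleware, perhaps a plain OS pipe to a child process would give the same overlap of join work and writing. A minimal sketch, where the gzip command, the file name and @records are illustrative only:

    # Sketch: hand output to a child process through an OS pipe, so
    # compression/writing runs in parallel with the join loop.
    open my $out_fh, '|-', 'gzip -1 > part_0001.gz'
        or die "cannot start writer: $!";

    print {$out_fh} $_ for @records;   # kernel pipe buffer decouples the two

    close $out_fh or die "writer exited abnormally";

The kernel's pipe buffer lets the join loop run ahead of the writer without any extra middleware.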
We have tried several measures: use integer, IO::File, and paring down the main processing loop. We also tried splitting the input files and running on multiple machines in parallel. Nothing much gained as yet; after all, it is a simple single-pass program (we store the data for one key from one file in an array and write out the matching records from the other file), as in the sketch below.
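For concreteness, a stripped-down sketch of that single pass as a sort-merge join; tab-separated fields and the file names are assumptions for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch of a single-pass sort-merge join. Both inputs must
    # already be sorted on their first (key) field.
    open my $fa,  '<', 'Input_File_A' or die "Input_File_A: $!";
    open my $fb,  '<', 'Input_File_B' or die "Input_File_B: $!";
    open my $out, '>', 'Output_File'  or die "Output_File: $!";

    my $line_a = <$fa>;
    my $line_b = <$fb>;

    while (defined $line_a && defined $line_b) {
        my ($ka) = split /\t/, $line_a;
        my ($kb) = split /\t/, $line_b;

        if    ($ka lt $kb) { $line_a = <$fa>; next }
        elsif ($ka gt $kb) { $line_b = <$fb>; next }

        # Keys match: collect every B value sharing this key...
        my @b_vals;
        while (defined $line_b) {
            chomp(my ($k, $v) = split /\t/, $line_b);
            last if $k ne $ka;
            push @b_vals, $v;
            $line_b = <$fb>;
        }

        # ...then cross it with every A record carrying the same key.
        while (defined $line_a) {
            chomp(my ($k, $v) = split /\t/, $line_a);
            last if $k ne $ka;
            print {$out} "$ka\t$v\t$_\n" for @b_vals;
            $line_a = <$fa>;
        }
    }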
By the way, to manage the huge disk space needed, we close the output file every 100 million records, compress it, and delete the uncompressed file just written. We then open the next output file with an incremented file name. Onward processing happens on another machine.
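One refinement we are considering, extending the pipe idea above, is to fold the rotation and compression into one step by writing each chunk straight through gzip, so the uncompressed intermediate never touches the disk. The record count, file-name pattern and gzip level here are illustrative:

    # Sketch: rotate output every N records, writing through gzip so
    # no uncompressed "until-then" file is ever created.
    my $RECS_PER_FILE = 100_000_000;
    my ($out_fh, $file_no, $count) = (undef, 0, 0);

    sub write_record {
        my ($record) = @_;
        if ($count % $RECS_PER_FILE == 0) {          # time to rotate
            if ($out_fh) { close $out_fh or die "gzip exited abnormally" }
            my $name = sprintf 'output_%04d.gz', ++$file_no;
            open $out_fh, '|-', "gzip -1 > $name"
                or die "cannot start gzip for $name: $!";
        }
        print {$out_fh} $record;
        $count++;
    }

That would also remove the separate compress-and-delete pass over each 100-million-record chunk.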