in reply to Speed up file write taking weeks

> ... The output file has 1.7 trillion records

Sorry, but the whole concept is questionable.

You are multiplying two files, effectively wasting disk space on highly redundant information.

I'd rather consider putting those two sources in a DB (or something similar) and joining them on demand.

If the target consumer really needs files, try faking them with a filesystem facade that interfaces with the DB.

I wouldn't be surprised if this bridge was faster than accessing a huge static file on a hard disk.
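
For illustration, here is a minimal sketch of the join-on-demand idea using DBI with SQLite. The database file, table and column names are made up and are not from the original post; adapt them to the real schema.

    #!/usr/bin/perl
    # Sketch: stream the joined rows to the consumer instead of
    # materializing them in a huge flat file. All names are illustrative.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=big.db', '', '',
        { RaiseError => 1, AutoCommit => 1 } );

    my $sth = $dbh->prepare(q{
        SELECT a.id, a.data_1, b.data_2
        FROM   t_a AS a
        JOIN   t_b AS b ON b.id = a.id
    });
    $sth->execute;

    while ( my ( $id, $d1, $d2 ) = $sth->fetchrow_array ) {
        # hand each row to whatever would otherwise read the flat file
        print join( "\t", $id, $d1, $d2 ), "\n";
    }

The same statement handle could sit behind a file-like facade (for example a tied filehandle), so the consumer never notices that there is no real file on disk.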

Cheers Rolf
(addicted to the Perl Programming Language :)

Re^2: Speed up file write taking weeks
by Sanjay (Sexton) on Jul 01, 2019 at 06:28 UTC

    Thank you for your response. We tried Oracle & then SQLite; either we ran out of disk space or the system hung. We reluctantly adopted this method. At least it's inching somewhere.

    BTW, the problem presented is a simplified one & part of a larger solution. We guess that the 1.7 trillion records will generate around 100 million unique records (process still going on).

      Generating trillions of records from initial data containing millions of records, then deduplicating that back to millions of records sounds like a really bad idea. If you gave us a better idea of what data you start with (how many rows, columns, how many unique keys, average size of a record etc), and what the final result is that you want, we might be able to come up with some suggestions.

      Dave.

        Let us move from here to the original problem:

        Two tables used as "input" to an SQL statement:

        1. T_1_DAT ( t_1_id* , t_1_data_1 ) - 65 million rows - * = primary key

        2. T_2_KEY ( t_2_id@ , t_2_data_2 ) - 72 million rows - @ = part of primary key and an index on this field alone

        SQL

        insert into T_3_XYZ ( t_3_id , t_3_data_1 , t_3_data_2 ) select distinct t_1_id , t_1_data_1 , t_2_data_2 from T_1_DAT , T_2_KEY where t_1_id = t_2_id

        This works for databases that support DISTINCT over multiple columns. 1.7 trillion rows are generated if the DISTINCT is removed; we are guessing around 100 million with it.

        Now that we are doing it with the flat-file method, we are "joining" the two files, generating the 1.7 trillion rows of output in instalments. After this we will sort it and "coalesce" it, in stages I guess!
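
        If both flat files can be pre-sorted on the join key (for example with an external sort such as GNU sort), a single-pass merge join can emit only the distinct ( t_1_id , t_1_data_1 , t_2_data_2 ) combinations directly, so the 1.7 trillion intermediate rows never have to be written at all. A minimal sketch, assuming tab-separated files dat.txt ( t_1_id , t_1_data_1 , one row per id ) and key.txt ( t_2_id , t_2_data_2 ), both sorted lexically on the id column; the file names and layout are assumptions:

            #!/usr/bin/perl
            # Sketch: merge join of two key-sorted, tab-separated files with
            # per-key deduplication, so the full cross product is never stored.
            use strict;
            use warnings;

            open my $dat, '<', 'dat.txt' or die $!;   # t_1_id \t t_1_data_1
            open my $key, '<', 'key.txt' or die $!;   # t_2_id \t t_2_data_2

            my $k = <$key>;
            while ( my $d = <$dat> ) {
                chomp $d;
                my ( $id, $data1 ) = split /\t/, $d;

                # skip key rows whose id sorts before the current data id
                while ( defined $k && ( split /\t/, $k )[0] lt $id ) {
                    $k = <$key>;
                }

                my %seen;    # dedup data_2 values within this id group only
                while ( defined $k ) {
                    chomp( my $row = $k );
                    my ( $id2, $data2 ) = split /\t/, $row;
                    last if $id2 ne $id;
                    print join( "\t", $id, $data1, $data2 ), "\n"
                        unless $seen{$data2}++;
                    $k = <$key>;
                }
            }

        Memory use stays bounded by the number of distinct t_2_data_2 values per id, because %seen is reset for every key group.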

      You wrote: "The input files contain 65 million and 72 million records. The output file has 1.7 trillion records." And then, "We guess that the 1.7 trillion records will generate around 100 million unique records".

      If the final result is the generation of these 100 million "unique records" (whatever that means), what is your plan for doing that from this humongous flat file of 1.7 trillion records? A reduction by a factor of roughly 17,000 is a lot!

      It is plausible to have an SQL DB with 65 + 72 million records. If those 2 tables combine to produce a smaller table (less than the sum of the input rows) of 100 million, I suspect there is a much more efficient algorithm to do that. However, I just don't know enough about what you are doing! My gosh, what will you do with this 1.7 trillion record file after you generate it? How will you arrive at the 100 million unique records?

        Summarise based on another field
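
        If "summarise" here means collapsing the sorted output down to its distinct rows, a single streaming pass over the already-sorted flat file is enough; the sorting itself can be left to an external merge sort such as GNU sort, which works on disk. A sketch, with the file name assumed:

            #!/usr/bin/perl
            # Sketch: one-pass dedup of an already key-sorted flat file;
            # prints each distinct line exactly once.
            use strict;
            use warnings;

            open my $in, '<', 'joined_sorted.txt' or die $!;

            my $prev;
            while ( my $line = <$in> ) {
                print $line if !defined $prev || $line ne $prev;
                $prev = $line;
            }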

      "...out of...space..."

      This has nothing to do with the database you choose. As others have already mentioned: upgrade your hardware. You can't manage a big database with poor hardware. You may also consider partitioning your tables. Regards, Karl
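
      SQLite has no built-in table partitioning, but the insert can be chunked by key range by hand, so each chunk needs only bounded temporary space and progress is committed as it goes. A sketch along those lines, assuming a numeric t_1_id and the table names from the post above; the chunk size is arbitrary:

          #!/usr/bin/perl
          # Sketch: run the join/dedup in key ranges ("manual partitioning")
          # so each chunk stays small. DISTINCT per chunk is safe because the
          # chunks are disjoint on t_1_id, which is part of every output row.
          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( 'dbi:SQLite:dbname=big.db', '', '',
              { RaiseError => 1, AutoCommit => 0 } );

          my ( $min, $max ) = $dbh->selectrow_array(
              'SELECT MIN(t_1_id), MAX(t_1_id) FROM T_1_DAT' );
          my $step = 1_000_000;    # tune to the data and the hardware

          for ( my $lo = $min; $lo <= $max; $lo += $step ) {
              my $hi = $lo + $step;
              $dbh->do( q{
                  INSERT INTO T_3_XYZ ( t_3_id, t_3_data_1, t_3_data_2 )
                  SELECT DISTINCT t_1_id, t_1_data_1, t_2_data_2
                  FROM   T_1_DAT JOIN T_2_KEY ON t_1_id = t_2_id
                  WHERE  t_1_id >= ? AND t_1_id < ?
              }, undef, $lo, $hi );
              $dbh->commit;
          }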

      «The Crux of the Biscuit is the Apostrophe»

      perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'