in reply to Possible faster way to do this?

While I wrote SQL::Type::Guess, it really, really wants to have all the data in memory already, which is asking a bit much for a 5 TB file.

If you are content with a quick approach, consider taking either the first 100 GB or a random sample of rows from your input file(s) and using those with SQL::Type::Guess to determine good types for your columns.
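
For illustration, a minimal, untested sketch of the "first N rows" variant, assuming a delimited input file with a header line and using Text::CSV_XS; the file name, sample size and table name below are placeholders:

    use strict;
    use warnings;
    use Text::CSV_XS;
    use SQL::Type::Guess;

    my $file   = 'huge_input.csv';   # placeholder file name
    my $sample = 100_000;            # how many rows to look at

    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
    open my $fh, '<', $file or die "Can't open '$file': $!";

    # assume the first line holds the column names
    $csv->column_names( @{ $csv->getline( $fh ) } );

    # read only the first $sample rows into memory
    my @rows;
    while (my $row = $csv->getline_hr( $fh )) {
        push @rows, $row;
        last if @rows >= $sample;
    }

    my $g = SQL::Type::Guess->new();
    $g->guess( @rows );
    print $g->as_sql( table => 'my_table' ), "\n";

A random sample instead of the first rows guards against the start of the file not being representative, at the cost of having to read more of it.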

Alternatively, a more "manual" approach of reading the file line by line and feeding each row to SQL::Type::Guess could also work:

    while (my $hashref = read_next_row_from_file_as_hashref( $fh )) {
        $sqltypes->guess( $hashref );
    }
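
Filling in the hypothetical read_next_row_from_file_as_hashref helper with Text::CSV_XS gives a rough sketch like the following (again, file and table names are placeholders); since guess() only accumulates per-column type information, the whole file never has to sit in memory at once:

    use strict;
    use warnings;
    use Text::CSV_XS;
    use SQL::Type::Guess;

    my $file = 'huge_input.csv';   # placeholder file name

    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
    open my $fh, '<', $file or die "Can't open '$file': $!";
    $csv->column_names( @{ $csv->getline( $fh ) } );   # header line

    my $sqltypes = SQL::Type::Guess->new();

    # one row at a time; each call refines the guessed column types
    while (my $hashref = $csv->getline_hr( $fh )) {
        $sqltypes->guess( $hashref );
    }

    print $sqltypes->as_sql( table => 'my_table' ), "\n";

This of course still reads all 5 TB, so it mainly saves memory, not time.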

Have you looked at how long it takes to read the 5 TB file without doing any processing? Maybe two days isn't all that bad.

Re^2: Possible faster way to do this?
by Eily (Monsignor) on Jun 25, 2019 at 09:44 UTC

    Maybe two days isn't all that bad.
    When it's for just one column out of a total of 95, my guess would be that this is pretty bad :P. But rather than the 5 TB file itself, it's sorting and searching for unique values in 50 GB (~5 TB / 95) that I'd be most worried about personally.