in reply to Possible faster way to do this?

in order to find out the column specifications I should use, I need to know what type of data each column has
You don't need to figure that out up-front! Load all the data into a database right now, with every field provisionally typed as text or blob. That way the data passes into the database completely lossless and unmolested.
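A minimal sketch of that staging step, assuming PostgreSQL; the table name, column names, file path and delimiter are all placeholders for whatever the real dump looks like:

    -- Staging table: every field is plain text, so nothing gets rejected on load.
    CREATE TABLE raw_import (
        col1 text,
        col2 text,
        col3 text
        -- ... one text column per field in the source file
    );

    -- Bulk-load the raw file; COPY is far faster than row-by-row INSERTs.
    COPY raw_import FROM '/data/dump.tsv' WITH (FORMAT text, DELIMITER E'\t');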

Next, use database queries and other features to figure out what the data in each column looks like; a database handles large amounts of data much better than you fumbling around with command-line tools.
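For instance, a profiling query along these lines (PostgreSQL regex syntax, with the hypothetical staged column col3 from above) tells you whether the column could safely become an integer:

    -- Longest value, rows that don't look like integers, and empty/NULL rows.
    SELECT max(length(col3))                                 AS max_len,
           count(*) FILTER (WHERE col3 !~ '^-?[0-9]+$')      AS non_integer_rows,
           count(*) FILTER (WHERE col3 IS NULL OR col3 = '') AS empty_rows
    FROM   raw_import;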

After you have made a conservative guess at the correct type of a column, alter the column type. If you were wrong and a row datum does not meet the constraints of the new type, the conversion will throw an error and abort the type change. You can then revise your guess and try again.
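In PostgreSQL that conversion looks roughly like this (same placeholder names as above); if a single value fails the cast, the statement aborts and the column stays text:

    -- The USING clause casts every existing value to the new type.
    ALTER TABLE raw_import
        ALTER COLUMN col3 TYPE integer USING col3::integer;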

Re^2: Possible faster way to do this?
by holli (Abbot) on Jun 26, 2019 at 07:33 UTC
    Good luck running (multiple) ALTER TABLE statements against a 5 TB database.


    holli

    You can lead your users to water, but alas, you cannot drown them.
Re^2: Possible faster way to do this?
by cmk (Initiate) on Jun 29, 2019 at 15:59 UTC
    There's a lot of ambiguity here, so I would second the notion of just throwing it all into a database to analyze. I imagine even indexing will take patience, so maybe make a table per column so that the initial data-type determinations stand alone. Then dump it into the final db spec when the decisions are made. Normalizing it may dictate having multiple tables at the end of the day after all, and the dump/restore might benefit from having the columns separated at that point as well.
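    As a rough sketch of that final hand-off, assuming PostgreSQL and placeholder table/column names: once the types have been decided, the staged text is cast into the properly typed table in one pass.

        -- Final table with the types settled on during profiling.
        CREATE TABLE final_data (
            id     bigint,
            name   text,
            amount numeric(12,2)
        );

        -- Move the staged rows over, casting as we go; any bad value aborts the insert.
        INSERT INTO final_data (id, name, amount)
        SELECT col1::bigint, col2, col3::numeric(12,2)
        FROM   raw_import;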