I don't know why it is necessary to find all of the unique values for a particular column in order to get started.
One thought would be to build a histogram of the number of characters in each field of a row. This would spot weirdo outlier cases where a column is almost always 12 chars but very rarely 112 chars. It also might help with CHAR(x) versus VARCHAR(x) decisions. A minimal sketch of such an analyzer is below.
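Something along these lines in Perl, assuming a tab-delimited file called data.txt (the delimiter and file name are placeholders to adjust for your actual data):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One-pass length histogram per column.
# $hist[$col]{$len} = number of rows where column $col has length $len
my @hist;

open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split /\t/, $line, -1;   # -1 keeps trailing empty fields
    for my $col (0 .. $#fields) {
        $hist[$col]{ length $fields[$col] }++;
    }
}
close $fh;

# Report the length distribution for each column.
for my $col (0 .. $#hist) {
    print "Column $col:\n";
    for my $len (sort { $a <=> $b } keys %{ $hist[$col] }) {
        printf "  length %3d : %d rows\n", $len, $hist[$col]{$len};
    }
}
```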
This is going to be a big DB, and I assume that you have an appropriate DB engine for that. It would be interesting to know the target DB.
I would also consider an iterative approach... Write a one-pass analyzer program as I mentioned. Create a DB based upon that info and your judgement. Then, if you want to know the unique values for a column, ask the DB! In other words, scan the data in one pass to make "good enough" judgments for an initial DB, then query this first DB to help you find the best schema for the final DB. The DB will be more efficient than Perl at finding uniques, and a fancy DB can use multiple cores to speed things up. An example of that kind of query is sketched below.
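For instance, once the data is loaded into a staging table, something like this with DBI reports distinct counts and maximum lengths (the DSN, credentials, table and column names below are just placeholders):

```perl
use strict;
use warnings;
use DBI;

# Ask the staging DB about a column instead of re-scanning the flat file.
# DSN, credentials, table and column names are placeholders.
my $dbh = DBI->connect('dbi:mysql:dbname=staging', 'user', 'password',
                       { RaiseError => 1 });

my ($distinct, $max_len) = $dbh->selectrow_array(
    'SELECT COUNT(DISTINCT some_column), MAX(LENGTH(some_column)) FROM raw_import'
);
print "some_column: $distinct distinct values, longest is $max_len chars\n";

$dbh->disconnect;
```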
Update:
Figuring out the optimal data types for your specific DB can be difficult if you want to minimize storage (and I guess that you do want to do that? - but note that sometimes minimal storage means less-than-optimal query speed). For example, MySQL has many data types to choose from (see the MySQL data types documentation).
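As a rough illustration only (the thresholds and type choices below are arbitrary examples, not rules), the per-column statistics from the one-pass analyzer could be mapped to a first-guess MySQL type something like this:

```perl
use strict;
use warnings;

# Very rough first guess at a MySQL type from per-column statistics.
# $max_len is the longest observed value; the numeric flags would come
# from the same one-pass scan. Signs and overflow are not handled here.
sub guess_type {
    my ($max_len, $all_numeric, $all_integer) = @_;
    if ($all_integer) {
        return $max_len <= 2 ? 'TINYINT'       # up to 2 digits
             : $max_len <= 9 ? 'INT'           # up to 9 digits
             :                 'BIGINT';
    }
    return 'DECIMAL(18,6)' if $all_numeric;
    return $max_len <= 8   ? "CHAR($max_len)"  # short, fixed-ish strings
         : $max_len <= 255 ? "VARCHAR($max_len)"
         :                   'TEXT';
}

print guess_type(7, 1, 1), "\n";    # INT
print guess_type(42, 0, 0), "\n";   # VARCHAR(42)
```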
Two days to analyze one column out of 95 is way outside the boundaries of what is needed. It is difficult for me to believe that you don't know what most of these columns represent. You should be able to figure out a reasonable data type for most of them without even seeing any of the actual data.