I don't know why it is necessary to find all of the unique values for a particular column in order to get started.

One thought would be to build a histogram of number of chars in each field of a row. This would spot weirdo outlier cases where a column is almost always 12, but very rarely 112 chars. This also might help with char(x) versus varchar(x) decisions.

This is gonna be a big DB and I assume that you have an appropriate DB for that. Might be interesting to know the target DB?

I would also consider an iterative approach... Write a one pass analyzer program as I mentioned. Create a DB based upon that info and your judgement. Then if you want to know the unique values for a column, ask the DB! In other words, scan data in one pass to make "good enough judgments" to make an initial DB. Then query this first DB to to help you find the best schema for the final DB. The DB will be more efficient than Perl at finding uniques and a fancy DB can use multi-cores to speed things up.

Update:
Figuring out the optimal data types for your specific DB can be difficult if you want to minimize storage (and I guess that you do want to do that? - but note that sometimes minimal storage means less than optimal query speed).
For example MySQL has many data types: MySQL data types.

Two days to analyze one column from 95 is way outside the boundaries of what is needed. It is difficult for me to believe that you don't know what most of these columns represent. You should be able to figure out a reasonable data type for most of the columns without even seeing any of the actual data.


In reply to Re^3: Possible faster way to do this? by Marshall
in thread Possible faster way to do this? by Anonymous Monk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.