in reply to Re: Possible faster way to do this?
in thread Possible faster way to do this?

Hi all, many interesting points here!
Corion, I do like your approach and will probably use it at some point; however, in order to create the tables correctly, I would need ALL of the unique values, so that I can see whether e.g. VARCHAR(15) is enough, or whether somewhere in the hundreds of millions of rows of this big file (rather than in a 10,000-row sample) there is a value that needs e.g. VARCHAR(16). So, as I see it, there is not really much to do, except using uniq directly, without sorting...
The values vary widely, but, as correctly pointed out, they are biology-related data (this is a bioinformatics project), so some of them are e.g. chromosome positions (so, numbers only), others are text, etc. And most of them are repeated, of course. However, I really need to know all the unique values before I can design my table schema...

Re^3: Possible faster way to do this?
by Marshall (Canon) on Jun 25, 2019 at 19:25 UTC
    I don't know why it is necessary to find all of the unique values for a particular column in order to get started.

    One thought would be to build a histogram of the number of chars in each field of a row. This would spot weirdo outlier cases where a column is almost always 12 chars wide, but very rarely 112 chars. It might also help with char(x) versus varchar(x) decisions.
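    Something along these lines would build that histogram in one pass (a sketch only; it assumes tab-separated input on STDIN, and the output format is just illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Per-column histogram of field lengths: $hist[$col]{$length} = count
    my @hist;
    while ( my $line = <STDIN> ) {
        chomp $line;
        my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
        $hist[$_]{ length $fields[$_] }++ for 0 .. $#fields;
    }

    # Report each column's length distribution, shortest first.
    for my $col ( 0 .. $#hist ) {
        print "column $col:\n";
        printf "  length %3d : %d rows\n", $_, $hist[$col]{$_}
            for sort { $a <=> $b } keys %{ $hist[$col] };
    }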

    This is gonna be a big DB and I assume that you have an appropriate DB for that. Might be interesting to know the target DB?

    I would also consider an iterative approach... Write a one-pass analyzer program as I mentioned. Create a DB based upon that info and your judgement. Then, if you want to know the unique values for a column, ask the DB! In other words, scan the data in one pass to make "good enough" judgements for an initial DB, then query this first DB to help you find the best schema for the final DB. The DB will be more efficient than Perl at finding uniques, and a fancy DB can use multiple cores to speed things up.
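    For the "ask the DB" step, a minimal DBI sketch would do it (the DSN, credentials, table and column names below are placeholders for your own setup):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Connect to the hypothetical first-pass staging database.
    my $dbh = DBI->connect( 'dbi:mysql:database=staging', 'user', 'password',
        { RaiseError => 1 } );

    # Let the database find the distinct values and the longest value.
    my ($n_distinct) = $dbh->selectrow_array(
        'SELECT COUNT(DISTINCT some_column) FROM staging_table');
    my ($max_len) = $dbh->selectrow_array(
        'SELECT MAX(LENGTH(some_column)) FROM staging_table');

    print "some_column: $n_distinct distinct values, longest is $max_len chars\n";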

    Update:
    Figuring out the optimal data types for your specific DB can be difficult if you want to minimize storage (and I guess that you do want to do that? - but note that sometimes minimal storage means less than optimal query speed).
    For example MySQL has many data types: MySQL data types.
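    As a rough illustration, the signed integer types could be chosen from the observed minimum/maximum like this (the cut-offs are the documented MySQL signed ranges; this sketch ignores unsigned and non-integer types):

    # Smallest signed MySQL integer type that holds every value in [$min, $max].
    sub mysql_int_type {
        my ( $min, $max ) = @_;
        return 'TINYINT'   if $min >= -128        && $max <= 127;
        return 'SMALLINT'  if $min >= -32768      && $max <= 32767;
        return 'MEDIUMINT' if $min >= -8388608    && $max <= 8388607;
        return 'INT'       if $min >= -2147483648 && $max <= 2147483647;
        return 'BIGINT';
    }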

    Two days to analyze one column out of 95 is way beyond what should be needed. It is difficult for me to believe that you don't know what most of these columns represent. You should be able to figure out a reasonable data type for most of the columns without even seeing any of the actual data.

Re^3: Possible faster way to do this?
by dsheroh (Monsignor) on Jun 26, 2019 at 06:59 UTC
    I really need to know all the unique values before I can design my table scheme.
    No, you don't need to know all the unique values. You only need a few pieces of metadata describing the kind of data in each column: the length of the longest value (so you know whether VARCHAR(15) is enough or whether one or more values require VARCHAR(16)); possibly the length of the shortest value (if you want to use CHAR instead of VARCHAR where possible); a flag for whether the column is numeric-only (if you want to use numeric types instead of VARCHAR where possible); and, for numeric columns, the minimum/maximum values (so you can choose a numeric type with an appropriate range).
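    A single pass over the file can collect exactly that metadata. A sketch (assuming tab-separated input on STDIN and a deliberately crude "integer-only" test):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Util qw(min max);

    my @meta;    # per column: min/max length, numeric flag, numeric min/max
    while ( my $line = <STDIN> ) {
        chomp $line;
        my @fields = split /\t/, $line, -1;
        for my $i ( 0 .. $#fields ) {
            my $v   = $fields[$i];
            my $m   = $meta[$i] //= { numeric => 1 };
            my $len = length $v;
            $m->{min_len} = min( $len, $m->{min_len} // $len );
            $m->{max_len} = max( $len, $m->{max_len} // $len );
            if ( $v =~ /\A-?\d+\z/ ) {
                $m->{min_val} = min( $v, $m->{min_val} // $v );
                $m->{max_val} = max( $v, $m->{max_val} // $v );
            }
            else {
                $m->{numeric} = 0;
            }
        }
    }

    for my $i ( 0 .. $#meta ) {
        my $m = $meta[$i];
        printf "column %d: len %d..%d, %s\n", $i, $m->{min_len}, $m->{max_len},
            $m->{numeric} ? "integer, range $m->{min_val}..$m->{max_val}"
                          : "not purely numeric";
    }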

    The only scenario which would require you to know all possible values for each column would be if you were intending to define each column's type as an enum, which I would consider to fall under the "pathological database designs" heading.

Re^3: Possible faster way to do this?
by bliako (Abbot) on Jun 25, 2019 at 11:26 UTC

    Sorry, GNU's uniq only filters out adjacent duplicate lines. See my edits in my first post Re: Possible faster way to do this? for a Perl one-liner and a C++ uniq command with the "proper" functionality, using hashmaps.
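    (That post isn't quoted here, but the idea is a hash-based uniq that needs no sorting, roughly:

    perl -ne 'print unless $seen{$_}++' input.tsv

    where input.tsv is a placeholder for your extracted column.)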

    bw, bliako

      So, to understand this properly, the cut command cannot be avoided, right? The file is from a public database, so I can't really find out who made it...

        If you want to stay with a shell-based solution, you will have to stay with cut; but in Perl you can easily avoid cut by using either split (if your input data is well-formed enough) or Text::CSV_XS->getline to read the tab-separated input.
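        For example (a sketch; the column index and the file name are placeholders):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Text::CSV_XS;

        my $col = 0;    # first column, i.e. what cut -f1 would give you
        my $csv = Text::CSV_XS->new( { sep_char => "\t", binary => 1, auto_diag => 1 } );
        open my $fh, '<', 'input.tsv' or die "input.tsv: $!";

        while ( my $row = $csv->getline($fh) ) {
            print $row->[$col], "\n";
            # (with well-formed input, split /\t/ on each line would do the same job)
        }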

        Personally, I wouldn't waste time (and RAM) on making the input data unique, and would instead just update the best-fitting data type directly from each input value as it is read. That reduces the amount of data you need to keep around far more than making the input unique does.

        I think the benefits of using Perl will become apparent later, when you expand your pipeline. However, just for trying out ideas, there is also awk, which does what cut does and more, and also has hashmaps (associative arrays), so:

        Edit: -vN=1 tells awk to use the first column of the input (change N to pick another column); -F'\t' makes it split fields on tabs only, like cut does.

        awk -F'\t' -vN=1 '{uniq[$N]++} END{for(k in uniq){print k," => ",uniq[k]}}'