in reply to Possible faster way to do this?

sort -u column1 > column1.uniq

Ouch!

You are right to seek a hash-based solution, if only to eliminate sorting as the means of achieving uniqueness, which sounds like a poor-man's dinner at a fat-industrialist's gala!

IMPORTANT EDIT: I was too fast to croak, sorry: uniq only filters out repeated adjacent lines. It seems you have to write your own...

Which, btw, can be achieved by GNU's uniq which - my guess - is the most efficient way to do that.

Thank you Richard Stallman: cut -f1 my_big_file | uniq > column1.uniq

Edit 2: just to remedy my previous gaffe, here are a Perl and a C++ uniq which work without the "adjacent lines" restriction; they read only from STDIN. Edit 3: added an awk solution too:

perl -ne 'chomp; $uniq{$_}++; END{print $_ . " => " . $uniq{$_} . "\n" for keys %uniq}'

or

perl -lne '$uniq{$_}++; END{print $_ . " => " . $uniq{$_} for keys %uniq}'
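
For a single column you can feed either one-liner through cut, as in the cut example above (assuming tab-separated columns; column1.uniq is just an illustrative output name):

cut -f1 my_big_file | perl -lne '$uniq{$_}++; END{print $_ . " => " . $uniq{$_} for keys %uniq}' > column1.uniq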

using awk

awk -vN=1 '{uniq[$N]++}END{for(k in uniq){print k," => ",uniq[k]}}'
/* reads whole lines from stdin and counts how many times each was repeated,
   overall and NOT "adjacently"
   by bliako, 25.06.2019
   for https://perlmonks.org/?displaytype=edit;node_id=11101856
   g++ -O puniq.cpp -o puniq
*/
#include <unordered_map>
#include <string>
#include <iostream>

using namespace std;

int main(void){
    unordered_map<string,int> uniq;
    string aline;
    // count each distinct input line in a hashmap, regardless of adjacency
    while(getline(cin, aline)){
        if( uniq.count(aline) == 0 ) uniq[aline] = 1;
        else uniq[aline]++;
    }
    // print every distinct line together with its count
    for(unordered_map<string,int>::iterator itr=uniq.begin();itr!=uniq.end();itr++){
        cout << itr->first << " => " << itr->second << endl;
    }
}

End Edit 2

But Eily is right to suspect that, depending on your data, you may end up with a hash several GB in size - all values being unique keys! And I love Corion's idea of random sampling, except that it can never be 100% accurate. Especially if this is yet another of those bioinformatics projects where quite a few instrument errors cause outliers, impossible values and NAs. But the approach of making an educated guess based on a few samples and then checking whether it is *mostly* correct through random sampling could work. The key is the percentage of disagreements and whether you can afford to throw the whole row away just because it is too expensive to re-work your assumptions. Beware that even 1% discards per column can eliminate up to 95% of the rows of a 95-column data set!
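
As a back-of-the-envelope check (assuming, only for the sake of the estimate, that the 1% discards are independent across the 95 columns):

perl -le 'print 1 - 0.99**95'

That prints roughly 0.62, i.e. about 62% of rows lost on average, and up to the full 95% in the worst case where the discarded rows never overlap.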

And there is the parallelise way, which may, just may, be able to overcome the bottleneck of N threads reading sequentially from the same file on disk, provided the data chunk per thread is large enough that overall you gain. Each thread reads a chunk of the data and processes it; when all threads are done, re-unique their combined output, which will hopefully be much smaller.
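
For illustration only, here is a rough Perl sketch of that chunked approach. It assumes the big file has already been split on line boundaries into chunk files (e.g. with GNU split -n l/4 my_big_file chunk_) and that columns are tab-separated; all file names are made up:

#!/usr/bin/env perl
# rough sketch: one child process per pre-split chunk file counts the unique
# values of column 1 in its chunk; the parent then re-uniques the (hopefully
# much smaller) per-chunk outputs
use strict;
use warnings;

my @chunks = grep { !/\.uniq$/ } glob 'chunk_*';   # made beforehand by split
die "no chunk files found\n" unless @chunks;

for my $chunk (@chunks) {
    defined(my $pid = fork) or die "fork: $!";
    next if $pid;                          # parent: move on to the next chunk
    my %uniq;                              # child: unique values in this chunk
    open my $in, '<', $chunk or die "$chunk: $!";
    while (<$in>) {
        chomp;
        my ($col1) = split /\t/, $_, 2;    # keep column 1 only
        $uniq{$col1}++;
    }
    open my $out, '>', "$chunk.uniq" or die "$chunk.uniq: $!";
    print {$out} "$_\n" for keys %uniq;
    exit 0;
}
1 while wait() != -1;                      # wait for all children to finish

my %uniq;                                  # merge: the union of the per-chunk sets
for my $chunk (@chunks) {
    open my $in, '<', "$chunk.uniq" or die "$chunk.uniq: $!";
    while (<$in>) { chomp; $uniq{$_}++ }
}
print "$_\n" for keys %uniq;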

Edit 3: how to parallelise on the GNU bash shell - I will use the awk-based solution, but any of the other solutions combined with cut will also do:

Ncols=95
Nthreads=4
for acol in $(seq 1 ${Ncols}); do
    echo "awk -vN=${acol} '"'{uniq[$N]++}END{for(k in uniq){print k," => ",uniq[k]}}'"' my_big_file > column${acol}.uniq" > tmpscript${acol}.sh
    echo doing column $acol
    sem -j${Nthreads} sh tmpscript${acol}.sh
done
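
Note: sem (from GNU parallel) starts each job in the background and returns as soon as a slot is free, so if you need all columns to be finished before you touch the column*.uniq files, block at the end with:

sem --wait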

End Edit3

Splitting the data into smaller chunks/files, rather than storing it in one solid file, will benefit the parallelised approach a little bit, and a lot if those chunks are stored on physically separate disks.

Additional problem: col1's type can be integer or string depending on the value of col2. So give us some data.

bw, bliako

Re^2: Possible faster way to do this?
by hippo (Archbishop) on Jun 25, 2019 at 10:29 UTC
    Which, btw, can be achieved by GNU's uniq which - my guess - is the most efficient way to do that.

    That isn't what uniq does. It removes adjacent duplicate lines only. This is why everyone sorts before running uniq.

    $ echo -e "foo\nbar\nfoo" | uniq
    foo
    bar
    foo
    $ echo -e "foo\nbar\nfoo" | sort | uniq
    bar
    foo
    $

      yep, sorry for spreading the wrong information, I have edited my answer (and added a proper Perl and a proper C++ uniq).

Re^2: Possible faster way to do this?
by Anonymous Monk on Jun 25, 2019 at 10:16 UTC
    Hi all, many interesting points here!
    Corion, I do like your approach and will probably use it at some point; however, in order to create the tables correctly, I would need ALL of the unique values, in order to see if e.g. VARCHAR(15) is enough, or if 10,000 of the hundreds of millions of rows in this big file need e.g. VARCHAR(16). So, as I see it, there is not really much to do, except use uniq directly, without sorting...
    The values vary widely but, as correctly pointed out, they are biology-related data (this is a bioinformatics project), so some of them are e.g. chromosome positions (so, only numbers), others are text, etc. And most of them are repeated, of course. However, I really need to know all the unique values before I can design my table scheme...
      I don't know why it is necessary to find all of the unique values for a particular column in order to get started.

      One thought would be to build a histogram of the number of chars in each field of a row. This would spot weirdo outlier cases where a column is almost always 12 chars but very rarely 112. This also might help with CHAR(x) versus VARCHAR(x) decisions.
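
      A rough one-pass sketch of that histogram idea in Perl, assuming tab-separated columns (purely illustrative; run it as e.g. perl field_len_hist.pl my_big_file, the script name being made up):

      #!/usr/bin/env perl
      # one-pass field-length histogram per column: count how many rows have
      # each field length, to spot rare outliers (e.g. almost always 12 chars,
      # very rarely 112) and to inform CHAR(x) versus VARCHAR(x) decisions
      use strict;
      use warnings;

      my @len_hist;                         # $len_hist[$col]{$length} = row count
      while (<>) {
          chomp;
          my @fields = split /\t/, $_, -1;  # -1 keeps trailing empty fields
          $len_hist[$_]{ length $fields[$_] }++ for 0 .. $#fields;
      }
      for my $col (0 .. $#len_hist) {
          print "column ", $col + 1, ":\n";
          for my $len (sort { $a <=> $b } keys %{ $len_hist[$col] }) {
              printf "  length %3d : %d rows\n", $len, $len_hist[$col]{$len};
          }
      }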

      This is gonna be a big DB and I assume that you have an appropriate DB for that. Might be interesting to know the target DB?

      I would also consider an iterative approach... Write a one-pass analyzer program as I mentioned. Create a DB based upon that info and your judgement. Then, if you want to know the unique values for a column, ask the DB! In other words, scan the data in one pass to make "good enough" judgements for an initial DB. Then query this first DB to help you find the best schema for the final DB. The DB will be more efficient than Perl at finding uniques, and a fancy DB can use multiple cores to speed things up.
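
      For instance, a sketch only, assuming the data has been bulk-loaded into a staging table named staging in, say, SQLite via DBI (the table and column names are made up):

      #!/usr/bin/env perl
      # sketch: let the staging DB compute the metadata needed to pick column types
      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect('dbi:SQLite:dbname=staging.db', '', '',
                             { RaiseError => 1 });

      # longest value and number of distinct values in a column - the DB, not
      # Perl, does the heavy lifting here
      my ($maxlen, $ndistinct) = $dbh->selectrow_array(
          'SELECT MAX(LENGTH(col1)), COUNT(DISTINCT col1) FROM staging'
      );
      print "col1: longest value is $maxlen chars, $ndistinct distinct values\n";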

      Update:
      Figuring out the optimal data types for your specific DB can be difficult if you want to minimize storage (and I guess that you do want to do that? - but note that sometimes minimal storage means less than optimal query speed).
      For example MySQL has many data types: MySQL data types.

      Two days to analyze one column out of 95 is way outside the bounds of what is needed. It is difficult for me to believe that you don't know what most of these columns represent. You should be able to figure out a reasonable data type for most of the columns without even seeing any of the actual data.

      I really need to know all the unique values before I can design my table scheme.
      No, you don't need to know all the unique values. You only need a few pieces of metadata describing the kind of data in each column: the length of the longest value (so you know whether VARCHAR(15) is enough or whether one or more values require VARCHAR(16)); possibly the length of the shortest value (if you want to try to use CHAR instead of VARCHAR where possible); a flag for whether the column is numeric-only (if you want to use numeric types instead of VARCHAR where possible); and, for numeric columns, the minimum/maximum values (so you can choose a numeric type with an appropriate range).
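
      A one-pass Perl sketch of collecting exactly that metadata, assuming tab-separated columns and counting anything that matches a plain number pattern as numeric (only an illustration):

      #!/usr/bin/env perl
      # per column, record longest/shortest value, whether every value looks
      # numeric, and the numeric min/max - enough to choose VARCHAR(n), CHAR(n)
      # or a numeric type without keeping every unique value in memory
      use strict;
      use warnings;

      my @meta;    # $meta[$col] = { maxlen, minlen, numeric, min, max }
      while (<>) {
          chomp;
          my @f = split /\t/, $_, -1;
          for my $c (0 .. $#f) {
              my $m = $meta[$c] //= { maxlen => 0, numeric => 1 };
              my $len = length $f[$c];
              $m->{maxlen} = $len if $len > $m->{maxlen};
              $m->{minlen} = $len if !defined $m->{minlen} or $len < $m->{minlen};
              if ($f[$c] =~ /\A-?\d+(?:\.\d+)?\z/) {
                  $m->{min} = $f[$c] if !defined $m->{min} or $f[$c] < $m->{min};
                  $m->{max} = $f[$c] if !defined $m->{max} or $f[$c] > $m->{max};
              }
              else {
                  $m->{numeric} = 0;    # text, NA, malformed value, etc.
              }
          }
      }
      for my $c (0 .. $#meta) {
          my $m = $meta[$c];
          printf "col %d: length %d..%d, %s\n", $c + 1, $m->{minlen}, $m->{maxlen},
              $m->{numeric} && defined $m->{min}
                  ? "numeric, range $m->{min}..$m->{max}" : "text";
      }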

      The only scenario which would require you to know all possible values for each column would be if you were intending to define each column's type as an enum, which I would consider to fall under the "pathological database designs" heading.

      Sorry, GNU's uniq just filters out adjacent duplicate lines. See my edits in my first post, Re: Possible faster way to do this?, for Perl one-liners and a C++ uniq with the "proper" functionality, using hashmaps.

      bw, bliako

        So, to understand this properly, the cut command cannot be avoided, right? The file is from a public database, so I can't really find out who made it...