sort -u column1 > column1.uniq

Ouch!

You are right to seek a hash-based solution, if only to eliminate sorting as the means of achieving uniqueness - which sounds like throwing a fat-industrialist's gala in order to serve a poor man's dinner!

IMPORTANT EDIT: I was too fast to croak, sorry: uniq only filters out repeated *adjacent* lines (e.g. printf 'a\nb\na\n' | uniq prints all three lines). It seems you have to write your own...

Which, btw, can be achieved by GNU's uniq which - my guess - is the most efficient way to do that.

Thank you Richard Stallman: cut -f1 my_big_file | uniq > column1.uniq

Edit 2: just to remedy my previous gaffe, here are a Perl and a C++ uniq which work without the "adjacent lines" restriction; both read only from STDIN. Edit 3: added an awk solution too:

perl -ne 'chomp; $uniq{$_}++; END{print $_ . " => " . $uniq{$_} . "\n" for keys %uniq}'

or

perl -lne '$uniq{$_}++; END{print $_ . " => " . $uniq{$_} for keys %uniq}'
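
For example, to apply it to the first column of the original file (cut's default delimiter is TAB):

cut -f1 my_big_file | perl -lne '$uniq{$_}++; END{print $_ . " => " . $uniq{$_} for keys %uniq}' > column1.uniq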

using awk

awk -vN=1 '{uniq[$N]++}END{for(k in uniq){print k," => ",uniq[k]}}'
/* reads whole lines from stdin and counts how many times each was
   repeated, overall and NOT "adjacently"
   by bliako, 25.06.2019
   for https://perlmonks.org/?displaytype=edit;node_id=11101856
   g++ -O puniq.cpp -o puniq
*/
#include <unordered_map>
#include <iostream>
#include <string>

using namespace std;

int main(void){
    unordered_map<string,int> uniq;
    string aline;
    // count every distinct input line, adjacent or not
    while( getline(cin, aline) ){
        if( uniq.count(aline) == 0 ) uniq[aline] = 1;
        else uniq[aline]++;
    }
    for(unordered_map<string,int>::iterator itr=uniq.begin();itr!=uniq.end();itr++){
        cout << itr->first << " => " << itr->second << endl;
    }
    return 0;
}
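
It can be used in the same way, e.g.:

cut -f1 my_big_file | ./puniq > column1.uniq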

End Edit 2

But Eily is right to suspect that, depending on your data, you may end up with a several-GB hash - all values being unique keys! And I love Corion's idea of random sampling, except that it can never be 100% accurate. Especially if this is yet another of those bioinformatics projects where quite a few instrument errors cause outliers, impossible values and NAs. But the approach of making an educated guess based on a few samples, and then checking through random sampling whether it is *mostly* correct, could work. The key is the percentage of disagreements, and whether you can afford to throw a whole row away just because it is too expensive to re-work your assumptions. Beware that even 1% of discards per column can eliminate up to 95% of a 95-column dataset (in the worst case, where no two columns discard the same row; even fully independent discards would still lose 1 - 0.99^95 ≈ 62% of the rows)!
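
For what it's worth, here is a minimal sketch of that guess-then-sample idea in Perl. The sample size, the tab separator and the "col1 is an integer" guess are all just assumptions for illustration:

#!/usr/bin/env perl
# sketch: keep a random reservoir of K lines, then measure how often
# column 1 disagrees with a guess about it (here: "it is an integer").
# usage: perl sample_col1.pl < my_big_file
use strict;
use warnings;

my $K = 1000;  # sample size, assumed
my @sample;
my $n = 0;
while( my $line = <STDIN> ){
    $n++;
    if( @sample < $K ){ push @sample, $line }
    elsif( rand($n) < $K ){ $sample[int rand $K] = $line }  # reservoir sampling
}
die "no input\n" unless @sample;
my $bad = 0;
for my $line (@sample){
    chomp $line;
    my $col1 = (split /\t/, $line)[0] // '';
    $bad++ unless $col1 =~ /^-?\d+$/;  # disagrees with the "integer" guess
}
printf "%.2f%% of %d sampled rows disagree with the guess\n",
    100 * $bad / @sample, scalar @sample;

Reservoir sampling keeps memory bounded by K no matter how large the input is; you still read the whole file once, but with no hash of all values and no sort.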

And there is the parallelise way, which may, just may, be able to overcome the bottleneck of N threads reading sequentially from the same file on disk, provided the data chunk per thread is large enough that overall you gain. Each thread reads a chunk of the data and processes it; when all threads are done, re-unique their combined output, which will hopefully be much smaller.
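
A rough sketch of that chunk idea in Perl, using fork rather than threads (the chunkN.uniq filenames and the tab-separated input are my assumptions):

#!/usr/bin/env perl
# sketch: parallel per-chunk unique-counting of column 1 via fork;
# the parent then merges the (hopefully much smaller) per-chunk counts
use strict;
use warnings;

my ($file, $nworkers) = ('my_big_file', 4);
my $size = -s $file or die "can not stat '$file'";
my @pids;
for my $w (0 .. $nworkers-1){
    my $from = int( $w    * $size / $nworkers);
    my $to   = int(($w+1) * $size / $nworkers);
    defined(my $pid = fork) or die "fork: $!";
    if( $pid == 0 ){  # child: count uniques in byte range [from,to)
        open my $fh, '<', $file or die "open: $!";
        # a line belongs to the chunk containing its first byte,
        # so skip the line straddling the left boundary
        if( $from > 0 ){ seek $fh, $from-1, 0; scalar <$fh>; }
        my %uniq;
        while( tell($fh) < $to ){
            defined(my $line = <$fh>) or last;
            chomp $line;
            $uniq{ (split /\t/, $line)[0] // '' }++;
        }
        open my $out, '>', "chunk$w.uniq" or die "open: $!";
        print $out "$_\t$uniq{$_}\n" for keys %uniq;
        exit 0;
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;
# parent: re-unique the combined per-chunk output
my %uniq;
for my $w (0 .. $nworkers-1){
    open my $fh, '<', "chunk$w.uniq" or die "open: $!";
    while(<$fh>){ chomp; my ($k, $n) = split /\t/; $uniq{$k} += $n; }
}
print "$_ => $uniq{$_}\n" for keys %uniq;

Whether this wins over a single sequential pass depends entirely on how the disk copes with N concurrent readers.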

Edit 3: how to parallelise on the GNU bash shell, using sem from GNU parallel - I will use the awk-based solution, but any of the other solutions combined with cut will also do:

Ncols=95
Nthreads=4
for acol in $(seq 1 ${Ncols}); do
    echo "awk -vN=${acol} '"'{uniq[$N]++}END{for(k in uniq){print k," => ",uniq[k]}}'"' my_big_file > column${acol}.uniq" > tmpscript${acol}.sh
    echo doing column $acol
    sem -j${Nthreads} sh tmpscript${acol}.sh
done
sem --wait  # wait for the remaining background jobs to finish

End Edit 3

Splitting the data into smaller chunks/files, rather than storing it in one solid file, will benefit the parallelise way a little bit - and a lot if those chunks are stored on physically separate disks.

Additional problem: col1's type can be integer or string depending on the value of col2. So give us some data.

bw, bliako

