in reply to Re^2: Possible faster way to do this?
in thread Possible faster way to do this?

Sorry, GNU's uniq just filters out adjacent duplicate lines. See my edits in my first post Re: Possible faster way to do this? for a Perl one-liner and a C++ uniq command with the "proper" functionality using hashmaps.

bw, bliako

Re^4: Possible faster way to do this?
by Anonymous Monk on Jun 25, 2019 at 11:49 UTC
    So, to understand this properly, the cut command cannot be avoided, right? The file is from a public database, so I can't really find out who made it...

      If you want to stay with a shell-based solution, you will have to keep cut, but in Perl you can easily avoid cut by using either split (if your input data is well-formed enough) or Text::CSV_XS->getline to read the tab-separated input.
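      As a minimal sketch of both approaches (assuming tab-separated input and that you want the first column; the file handle and field index are just placeholders):

          # split: good enough if fields never contain embedded tabs or quotes
          while (my $line = <STDIN>) {
              chomp $line;
              my @fields = split /\t/, $line;
              my $value  = $fields[0];
              # ... process $value ...
          }

          # Text::CSV_XS: more robust with quoted/escaped fields
          use Text::CSV_XS;
          my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1 });
          while (my $row = $csv->getline(\*STDIN)) {
              my $value = $row->[0];
              # ... process $value ...
          }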

      Personally, I wouldn't waste time (and RAM) on making the input data unique; instead I would determine the best data type directly for each input value. This reduces the size of the data you need to remember far more than making the input unique does.
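      For example, one way to stream the file and keep only the "widest" type seen so far for each column (just a sketch; the int/float/string checks here are placeholders, not necessarily the criteria you need):

          use Scalar::Util qw(looks_like_number);
          my %type;    # column index => 'int' | 'float' | 'string'
          sub widen {
              my ($old, $new) = @_;
              return $new unless defined $old;
              my %rank = (int => 0, float => 1, string => 2);
              return $rank{$new} > $rank{$old} ? $new : $old;
          }
          while (my $line = <STDIN>) {
              chomp $line;
              my @fields = split /\t/, $line;
              for my $i (0 .. $#fields) {
                  my $t = !looks_like_number($fields[$i]) ? 'string'
                        : $fields[$i] =~ /^-?\d+$/        ? 'int'
                        :                                    'float';
                  $type{$i} = widen($type{$i}, $t);
              }
          }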

      I think the benefits of using Perl will become apparent later, when you expand your pipeline. However, just for trying out ideas, there is also awk, which does what cut does and more, and also has hashmaps (associative arrays), so:

      Edit: N=1 specifies using the first column of the input.

      awk -v N=1 '{ uniq[$N]++ } END{ for (k in uniq) print k, " => ", uniq[k] }'
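      Roughly the same thing as a Perl one-liner (a sketch only; column 1 assumed as in the awk example, and file.tsv stands in for your input):

          perl -F'\t' -lane '$uniq{$F[0]}++; END { print "$_ => $uniq{$_}" for keys %uniq }' file.tsv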