in reply to Re^2: Possible faster way to do this?
in thread Possible faster way to do this?

Sorry, GNU's uniq just filters out adjacent duplicate lines. See my edits in my first post Re: Possible faster way to do this? for a Perl one-liner and a C++ uniq command with the "proper" functionality using hashmaps.

bw, bliako

Re^4: Possible faster way to do this?
by Anonymous Monk on Jun 25, 2019 at 11:49 UTC
    So, to understand this properly, the cut command cannot be avoided, right? The file is from a public database, so I can't really find out who made it...

      If you want to stay with a shell-based solution, you will have to keep cut, but in Perl you can easily avoid cut by using either split (if your input data is well-formed enough) or Text::CSV_XS->getline to read the tab-separated input.
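      As a minimal sketch of both approaches (assuming tab-separated input and that you want the first column; the file handle and field index are just placeholders):

          # split: good enough if fields never contain embedded tabs or quotes
          while (my $line = <STDIN>) {
              chomp $line;
              my @fields = split /\t/, $line;
              my $value  = $fields[0];
              # ... process $value ...
          }

          # Text::CSV_XS: more robust with quoted/escaped fields
          use Text::CSV_XS;
          my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1 });
          while (my $row = $csv->getline(\*STDIN)) {
              my $value = $row->[0];
              # ... process $value ...
          }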

      Personally, I wouldn't waste time (and RAM) on making the input data unique; instead I would determine the best data type directly for each input value. This reduces the size of the data you need to remember far more than making the input unique does.
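      For example, one way to stream the file and keep only the "widest" type seen so far for each column (just a sketch; the int/float/string checks here are placeholders, not necessarily the criteria you need):

          use Scalar::Util qw(looks_like_number);
          my %type;    # column index => 'int' | 'float' | 'string'
          sub widen {
              my ($old, $new) = @_;
              return $new unless defined $old;
              my %rank = (int => 0, float => 1, string => 2);
              return $rank{$new} > $rank{$old} ? $new : $old;
          }
          while (my $line = <STDIN>) {
              chomp $line;
              my @fields = split /\t/, $line;
              for my $i (0 .. $#fields) {
                  my $t = !looks_like_number($fields[$i]) ? 'string'
                        : $fields[$i] =~ /^-?\d+$/        ? 'int'
                        :                                    'float';
                  $type{$i} = widen($type{$i}, $t);
              }
          }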

      I think the benefits of using Perl will become apparent later, when you expand your pipeline. However, just for trying out ideas, there is also awk, which does what cut does and more, and also has hashmaps (associative arrays), so:

      Edit: N=1 specifies using the first column of the input.

      awk -v N=1 '{ uniq[$N]++ } END{ for (k in uniq) print k, " => ", uniq[k] }'
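      Roughly the same thing as a Perl one-liner (a sketch only; column 1 assumed as in the awk example, and file.tsv stands in for your input):

          perl -F'\t' -lane '$uniq{$F[0]}++; END { print "$_ => $uniq{$_}" for keys %uniq }' file.tsv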