in reply to Re^3: Possible faster way to do this?
in thread Possible faster way to do this?

So, to understand this properly, the cut command cannot be avoided, right? The file is from a public database, so I can't really find out who made it...

Re^5: Possible faster way to do this?
by Corion (Patriarch) on Jun 25, 2019 at 12:11 UTC

    If you want to stay with a shell-based solution, you will have to keep cut; but in Perl you can easily avoid it by using either split (if your input data is well-formed enough) or Text::CSV_XS->getline to read the tab-separated input.
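    A minimal sketch of both alternatives; the file name (input.tsv) and the choice of the first column are assumptions, so adjust them to the real data:

        use strict;
        use warnings;
        use Text::CSV_XS;

        # Read tab-separated input without shelling out to cut.
        my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1, auto_diag => 1 });

        open my $fh, '<', 'input.tsv' or die "input.tsv: $!";
        my %count;
        while (my $row = $csv->getline($fh)) {
            $count{ $row->[0] }++;            # tally each value of the first column
        }
        close $fh;

        # If the data is well-formed enough (no embedded tabs or quotes),
        # a plain split is even cheaper:
        #   while (<$fh>) {
        #       chomp;
        #       my ($first) = split /\t/, $_, 2;   # stop after the first field
        #       $count{$first}++;
        #   }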

    Personally, I wouldn't waste time (and RAM) on making the input data unique; instead, I'd calculate the best input type directly for each input value. That reduces the amount of data you need to keep in memory far more than making the input unique does.
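    For illustration only, a sketch of that streaming approach: keep one small summary per column instead of a hash of every distinct value. What counts as the "best type" here (integer vs. float vs. text, plus a maximum width) is an assumption about the task, so substitute the real check:

        use strict;
        use warnings;

        my (@type, @width);                         # one small summary per column
        my %rank = ( int => 0, float => 1, text => 2 );

        while (my $line = <STDIN>) {
            chomp $line;
            my @fields = split /\t/, $line;
            for my $i (0 .. $#fields) {
                my $v = $fields[$i];
                my $t = $v =~ /\A-?\d+\z/       ? 'int'
                      : $v =~ /\A-?\d*\.\d+\z/  ? 'float'
                      :                           'text';
                # widen the column's type/width only when this value demands it
                $type[$i]  = $t
                    if !defined $type[$i] or $rank{$t} > $rank{ $type[$i] };
                $width[$i] = length $v
                    if !defined $width[$i] or length $v > $width[$i];
            }
        }
        print "column $_: $type[$_] (max width $width[$_])\n" for 0 .. $#type;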

Re^5: Possible faster way to do this?
by bliako (Abbot) on Jun 25, 2019 at 13:28 UTC

    I think the benefits of using Perl will become apparent later, when you expand your pipeline. However, just for trying out ideas, there is also awk, which does what cut does (and more) and also has hashmaps (associative arrays), so:

    Edit: N=1 tells awk to use the first column of the input.

    awk -vN=1 '{ uniq[$N]++ } END { for (k in uniq) print k, " => ", uniq[k] }'
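
    For comparison (and since the rest of the pipeline is Perl anyway), roughly the same tally as a Perl one-liner; the file name is a placeholder, and -F'\t' pins the separator to tabs, whereas awk's default field splitting also breaks on spaces:

        perl -F'\t' -lane '$uniq{$F[0]}++; END { print "$_ => $uniq{$_}" for keys %uniq }' input.tsv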