in reply to Re: search/grep perl/*nix
in thread search/grep perl/*nix

Thanks, haukex. As the dataset grows over time, am I right in assuming that the approach (i.e., the code snippet) you've provided is likely to have a much larger memory footprint, whereas a straight grep has an extremely light one?

Re^3: search/grep perl/*nix
by haukex (Archbishop) on Nov 25, 2017 at 17:36 UTC
    a straight grep

    The best way to get an idea is to measure: produce several fake input data sets of increasing size, representative of the data you expect to get in the future, and benchmark the various approaches against them. You've said "grep" twice now but haven't shown an example of it, so without that we can't really compare performance objectively.
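
    For a rough sense of what such a measurement could look like, here is one possible shape; it is only a sketch, and the ':' delimiter, first-field key, fake-data generator, and file names are assumptions rather than anything from the OP's actual data. For memory specifically, watching RSS from outside (e.g. /usr/bin/time -v on Linux) while each variant runs is usually more telling than wall-clock time alone.

        #!/usr/bin/env perl
        # Sketch: generate fake inputs of increasing size and time two approaches.
        # Delimiter, field index, and data layout are placeholder assumptions.
        use warnings;
        use strict;
        use Benchmark qw/timethese/;

        for my $size (10_000, 100_000, 1_000_000) {
            my $file = "fake_$size.txt";
            open my $fh, '>', $file or die "$file: $!";
            # roughly half as many distinct keys as lines, so duplicates exist
            printf {$fh} "user%d:some:other:fields\n", int rand($size/2) for 1..$size;
            close $fh;

            timethese( 5, {
                perl_hash  => sub {
                    my %seen;
                    open my $in, '<', $file or die "$file: $!";
                    while (<$in>) { $seen{ (split /:/)[0] }++ }
                    close $in;
                },
                shell_pipe => sub {
                    my @uniq = `cut -d: -f1 $file | sort -u`;
                },
            } );
        }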

    As for the code shown so far, I think the Perl code I posted should have a significantly smaller memory footprint than cut | sort | uniq (or cut | sort -u, as hippo said), since the only thing my code keeps in memory is the resulting output data set (that is, the keys of the hash; the numeric hash values shouldn't add a ton of overhead). I haven't measured yet though! (it's Saturday evening here after all ;-) )
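
    The snippet itself isn't reproduced in this subthread, but the approach under discussion is a %seen-style dedup along these lines; the colon delimiter, first-field key, and file name below are placeholders, not the OP's real format.

        # Minimal sketch of the hash-based dedup being compared here.
        use warnings;
        use strict;

        my %seen;
        open my $fh, '<', 'data.txt' or die "data.txt: $!";
        while (my $line = <$fh>) {
            my ($key) = split /:/, $line;   # take the first field as the key
            $seen{$key}++;                  # only unique keys stay in memory
        }
        close $fh;
        print "$_\n" for sort keys %seen;

        # Roughly equivalent pipeline (sorts the whole column before deduping):
        #   cut -d: -f1 data.txt | sort -u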

Re^3: search/grep perl/*nix
by 1nickt (Canon) on Nov 25, 2017 at 17:31 UTC
      The memory footprint may not grow very fast, but it will most probably grow, because the %seen hash is very likely to get larger as the file gets bigger (unless the input data contains a very high proportion of duplicates as it grows).

        Correction accepted; I was thinking only in terms of reading in the file, since in the OP, post-reading data storage seemed to be moot. But you are quite right.

        The way forward always starts with a minimal test.
Re^3: search/grep perl/*nix
by Anonymous Monk on Nov 25, 2017 at 17:39 UTC

    That snippet will only store the result data set (i.e. the unique keys). If you anticipate result sets larger than the available RAM, you'll have to revise the general approach (use a database, for example), since none of the straight-up in-memory solutions will be workable in that case.
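
    One way to push the unique-key set out of RAM while keeping the same dedup logic is a tied, disk-backed hash; this sketch assumes DB_File (Berkeley DB) is available, and the file names and delimiter are again placeholders.

        # Sketch of a disk-backed variant of the same dedup.
        use warnings;
        use strict;
        use DB_File;
        use Fcntl qw/O_RDWR O_CREAT/;

        # %seen now lives in seen.db on disk rather than in RAM.
        tie my %seen, 'DB_File', 'seen.db', O_RDWR|O_CREAT, 0666, $DB_HASH
            or die "Cannot tie seen.db: $!";

        open my $fh, '<', 'data.txt' or die "data.txt: $!";
        while (my $line = <$fh>) {
            my ($key) = split /:/, $line;
            $seen{$key}++;
        }
        close $fh;

        print "$_\n" for keys %seen;   # keys come back in on-disk order
        untie %seen;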