in reply to search/grep perl/*nix

I'd have thought that writing to a file and reading it back would have slowed me down, but it didn't!

Without having time to run a benchmark at the moment, I'd guess that most of the time is probably being spent on cut | sort | uniq on a 100MB file, not on reading/writing a 50KB temp file or on the pipe to Perl. The main point I wanted to make is the following:

Is there a bias towards either of these approaches (driven by performance) should the dataset get significantly larger?

I'd have a bias against both of the approaches ;-) All of those tasks can be done in pure Perl, without launching four separate processes. I wrote about the topic of running external processes at length here, but the only advice from there that seems to apply at the moment is "just do it in Perl".

On a simple test file, the following produces the same output as "cut -d"," -f2 /tmp/input.txt | sort | uniq". Note that since I'm locating duplicates with a hash, I don't need to sort the input data first, meaning I can process the file line-by-line without loading all of it into memory. See also How can I remove duplicate elements from a list or array?

use warnings;
use strict;

my $filename = '/tmp/input.txt';
open my $fh, '<:encoding(UTF-8)', $filename or die "$filename: $!";

my %seen;
while (<$fh>) {                # process the file line by line
    chomp;
    my @fields = split /,/;
    $seen{ $fields[1] }++;     # second field; duplicates collapse into one hash key
}
close $fh;

my @data = sort keys %seen;
print $_, "\n" for @data;

If your input file is CSV, you might consider using Text::CSV_XS instead of split, since it will more robustly handle cases like quotes or escaped/quoted separators within fields. Update before posting: 1nickt just showed an example of that. Note the line-by-line approach can also be used with Text::CSV_XS with its getline method (I showed a short example e.g. here).
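For reference, a minimal sketch of that line-by-line Text::CSV_XS variant might look like the following (assuming the same /tmp/input.txt and a plain comma separator as above; untested against the OP's real data):

use warnings;
use strict;
use Text::CSV_XS;

my $filename = '/tmp/input.txt';
open my $fh, '<:encoding(UTF-8)', $filename or die "$filename: $!";

# binary => 1 allows embedded newlines etc., auto_diag => 2 dies on parse errors
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 });

my %seen;
while ( my $row = $csv->getline($fh) ) {   # $row is an arrayref of fields
    $seen{ $row->[1] }++;                  # second column, like $fields[1] above
}
close $fh;

print "$_\n" for sort keys %seen;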

Update: Added the file encoding from the OP (was it edited?) and minor edits for clarity.

Re^2: search/grep perl/*nix
by Gtforce (Sexton) on Nov 25, 2017 at 17:17 UTC

    Thanks, haukex. As the dataset grows over time, am I right in assuming that the approach (i.e., the code snippet) you've provided is likely to have a much larger memory footprint, whereas a straight grep has an extremely light one?

      a straight grep

      The best way to get an idea is to measure: produce several fake input data sets of increasing size that are representative of the data you expect in the future, and benchmark the various approaches against them. You've said "grep" twice now but haven't shown an example of it, so we can't really compare performance objectively yet.
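      For example, a minimal sketch of such a fake-data generator might look like this (the three-column "id,category,value" layout, row count, and value ranges are just placeholders, not the OP's real data):

      use warnings;
      use strict;

      # write N rows of fake "id,category,value" CSV data to benchmark against
      my $rows    = shift // 1_000_000;
      my $outfile = '/tmp/fake_input.txt';
      open my $out, '>:encoding(UTF-8)', $outfile or die "$outfile: $!";
      for my $i ( 1 .. $rows ) {
          my $category = sprintf 'cat%03d', int rand 500;   # ~500 distinct values in field 2
          print $out join( ',', $i, $category, int rand 10_000 ), "\n";
      }
      close $out;
      print "wrote $rows rows to $outfile\n";

      Running that for a few increasing row counts and then timing the Perl script against the shell pipeline (on Linux, GNU time's /usr/bin/time -v also reports peak memory) would give objective numbers to compare.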

      As for the code shown so far, I think the Perl code I posted should have a significantly smaller memory footprint than cut | sort | uniq (or cut | sort -u, as hippo said), since the only thing my code keeps in memory is the resulting output data set (that is, the keys of the hash; the numeric hash values shouldn't add a ton of overhead). I haven't measured yet though! (it's Saturday evening here after all ;-) )

        The memory footprint may not grow very fast, but it will most probably grow, because the %seen hash is very likely to gain keys as the file gets bigger (unless the additional data consists mostly of values that have already been seen).

      That snippet will only store the result data set (i.e. the unique keys). If you anticipate result sets larger than the available RAM, you'll have to revise the general approach (e.g. use a database), since none of the straight-up solutions will work in that case.
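      If it does come to that, one possible sketch is to move the "seen" set out of RAM and into an on-disk database, e.g. with DBI and DBD::SQLite (the file names and schema here are just placeholders):

      use warnings;
      use strict;
      use DBI;

      # keep the set of seen values in a SQLite file instead of a Perl hash
      my $dbh = DBI->connect( 'dbi:SQLite:dbname=/tmp/seen.sqlite', '', '',
          { RaiseError => 1, AutoCommit => 0 } );
      $dbh->do('CREATE TABLE IF NOT EXISTS seen (value TEXT PRIMARY KEY)');
      my $ins = $dbh->prepare('INSERT OR IGNORE INTO seen (value) VALUES (?)');

      my $filename = '/tmp/input.txt';
      open my $fh, '<:encoding(UTF-8)', $filename or die "$filename: $!";
      while (<$fh>) {
          chomp;
          my @fields = split /,/;
          $ins->execute( $fields[1] );   # duplicates are dropped by the PRIMARY KEY
      }
      close $fh;
      $dbh->commit;

      # stream the unique values back out in sorted order, one row at a time
      my $sth = $dbh->prepare('SELECT value FROM seen ORDER BY value');
      $sth->execute;
      while ( my ($value) = $sth->fetchrow_array ) { print "$value\n" }
      $dbh->disconnect;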