Re: search/grep perl/*nix
by haukex (Archbishop) on Nov 25, 2017 at 16:48 UTC
I'd have thought that writing to a file and reading it back would have slowed me down, but it didn't!
Without having time to investigate with a benchmark at the moment, my guess is that most of the time is being spent on cut | sort | uniq over the 100MB file, not on reading/writing a 50KB temp file or piping to Perl. The main point I wanted to make is the following:
Is there a bias towards either of these approaches (driven by performance) should the dataset get significantly larger?
I'd have a bias against both of the approaches ;-) All of those tasks can be done in pure Perl, without launching four separate processes. I wrote about the topic of running external processes at length here, but the only advice from there that seems to apply at the moment is "just do it in Perl".
On a simple test file, the following produces the same output as "cut -d"," -f2 /tmp/input.txt | sort | uniq". Note that since I'm locating duplicates with a hash, I don't need to sort the input data first, meaning I can process the file line-by-line without loading all of it into memory. See also How can I remove duplicate elements from a list or array?
use warnings;
use strict;
my $filename = '/tmp/input.txt';
open my $fh, '<:encoding(UTF-8)', $filename or die "$filename: $!";
my %seen;
while (<$fh>) {
    chomp;
    my @fields = split /,/;
    $seen{ $fields[1] }++;    # count the second field; the keys end up unique
}
close $fh;
my @data = sort keys %seen;
print $_,"\n" for @data;
If your input file is CSV, you might consider using Text::CSV_XS instead of split, since it will more robustly handle cases like quotes or escaped/quoted separators within fields. Update before posting: 1nickt just showed an example of that. Note the line-by-line approach can also be used with Text::CSV_XS with its getline method (I showed a short example e.g. here).
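For illustration only, here's a rough sketch of that line-by-line getline approach combined with the %seen hash (the filename and column index are just the assumptions from above):
use warnings;
use strict;
use Text::CSV_XS;
my $filename = '/tmp/input.txt';
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $fh, '<:encoding(UTF-8)', $filename or die "$filename: $!";
my %seen;
# getline returns one parsed record (arrayref) per call, so the file is
# still processed line-by-line, never loaded whole into memory
while ( my $row = $csv->getline($fh) ) {
    $seen{ $row->[1] }++;     # second field, as with split above
}
close $fh;
print "$_\n" for sort keys %seen;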
Update: Added the file encoding from the OP (was it edited?) and minor edits for clarity.
Thanks, haukex
As the dataset grows over a period of time, am I right in assuming that the approach (i.e., the code snippet) you've provided is likely to have a much larger memory footprint, whereas a straight grep has an extremely light one?
a straight grep
The best way to get an idea is to measure: produce several fake input data sets of increasing size, representative of the data you expect to get in the future, and benchmark the various approaches against them. You've said "grep" twice now, but haven't shown an example of it, so we can't really talk about performance comparisons objectively.
As for the code shown so far, I think the Perl code I posted should have a significantly smaller memory footprint than cut | sort | uniq (or cut | sort -u, as hippo said), since the only thing my code keeps in memory is the resulting output data set (that is, the keys of the hash; the numeric hash values shouldn't add a ton of overhead). I haven't measured yet though! (it's Saturday evening here after all ;-) )
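Just as a sketch of what I mean by fake data sets (the 20 columns and the value range are arbitrary assumptions; adjust to match your real data):
use warnings;
use strict;
# Write $lines records of 20 random integer fields each; bump $lines
# (or run this several times with different values) to get test files
# of increasing size.
my ($outfile, $lines) = ('/tmp/fake_input.csv', 1_000_000);
open my $out, '>', $outfile or die "$outfile: $!";
for (1 .. $lines) {
    print {$out} join(',', map { int rand 1_000_000 } 1 .. 20), "\n";
}
close $out or die "$outfile: $!";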
Hi, haukex will provide his own answer no doubt, but: No, the memory footprint should not grow, since
while ( my $line = <$FILEHANDLE> ) { ... }
does *not* slurp the entire file into memory, but reads it one line at a time. See, for example, https://perldoc.perl.org/perlfaq5.html#How-can-I-read-in-an-entire-file-all-at-once%3f for a discussion of the issue.
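To make the difference concrete (filename assumed), compare:
use warnings;
use strict;
# Line-by-line: only one line is held in memory at a time.
open my $fh, '<', '/tmp/input.txt' or die $!;
while ( my $line = <$fh> ) {
    # process $line ...
}
close $fh;
# Slurping: the whole file ends up in @lines at once, so memory use
# grows with the file size.
open my $fh2, '<', '/tmp/input.txt' or die $!;
my @lines = <$fh2>;
close $fh2;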
The way forward always starts with a minimal test.
Re: search/grep perl/*nix
by hippo (Archbishop) on Nov 25, 2017 at 16:55 UTC
The first piece of code runs a shell grep and writes the output into a file.
Actually, it doesn't. Partly this is because it doesn't compile due to the missing semi-colon at the end of the first line, but even if you fix that you should see that it doesn't shell out at all. You probably meant to use the qx operator instead of qw, but that would make no sense because you are using it in void context.
With these mistakes it is hard to know what code you are really running (it certainly can't be what you've posted). This makes it very difficult to provide any insight (as discussed in How do I post a question effectively?). Instead, here's a free tip. Never do this:
$ foo | sort | uniq
if you are concerned about optimisation. sort has a -u flag, which is much more efficient than firing up a separate uniq to de-duplicate the dataset.
G'day Gtforce,
"Both typos corrected on the post (the original code that I have is ok), apologies."
It's fine to make corrections to your posts,
but you also need to indicate what's changed at the point where the change occurred
(e.g. you've said you made a correction here, several screenfuls away from the correction,
but there's no indication of that in the OP itself).
When I first read this thread,
I couldn't initially understand why people were saying use 'qx' instead of 'qw':
your OP had no apparent 'qw'.
See "How do I change/delete my post?" for a more complete discussion.
Because it's directly related, and to save writing a separate reply, in "Re^2: search/grep perl/*nix" you wrote:
"It is qx and I did have the ending semicolon (daft of me to make these typos on the post), apologies."
The easiest way to avoid this, and what I do, is to just copy and paste your code directly into your post.
This is a lot less work than actually typing your code and, because there's no typing involved, you won't make typos.
Re: search/grep perl/*nix
by 1nickt (Canon) on Nov 25, 2017 at 16:41 UTC
Since you are dealing with comma-separated values, you should use a module that's optimized for such work (i.e. handles quoting, null values, etc etc), such as Text::CSV. (If you install Text::CSV_XS your code will run faster.)
Try the following, and Benchmark your results compared to your other solutions. As you say the dataset you are working with is pretty small; I would try to work up a much larger sample for benchmarking.
Personally, I avoid shelling out at almost all costs.
$ cat 1204245.csv
foo,bar,baz,qux
fred,barney,wilma,betty
apple,orange,banana,pear
use strict; use warnings; use feature 'say';
use Text::CSV_XS;
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', './1204245.csv' or die "Died: $!";
my @column = map { $_->[2] } @{ $csv->getline_all( $fh ) };
say for @column;
__END__
Output:
$ perl 1204245.pl
baz
wilma
banana
Hope this helps!
The way forward always starts with a minimal test.
Re: search/grep perl/*nix
by shmem (Chancellor) on Nov 25, 2017 at 19:19 UTC
I'd have thought that writing to a file and reading it back would have slowed me down, but it didn't!
There's only a difference in the filehandle types involved. In the first, the shell opens/closes $tmpfile; in the second, it opens/closes a pipe attached to the Perl-side pipe filehandle created by qx (which perl creates anyway). So it is no surprise there is no difference, especially if you are working with an SSD instead of an old washing-machine type of disk drum (modern disks might hold the entire file in the controller cache, so perl can read the file even before it is physically allocated via magnetism).
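A sketch of what that pipe filehandle looks like when opened explicitly instead of via qx (field number taken from the OP, the rest assumed):
use warnings;
use strict;
my $file = '/tmp/input.txt';
# '-|' attaches a read pipe to the shell pipeline; perl then reads the
# de-duplicated column line by line, no temp file involved.
open my $pipe, '-|', qq{cut -d"," -f17 $file | sort -u} or die "pipe: $!";
while ( my $value = <$pipe> ) {
    chomp $value;
    # ... use $value ...
}
close $pipe or warn "pipeline exited with status $?";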
It would be more interesting to benchmark the shell chain against a pure Perl solution, in which case perl loses here. Why? Because allocating the necessary data structures in Perl means some overhead, whereas the cut, sort and uniq utilities deal only with char arrays[1] and are seasoned, i.e. optimized for their specific tasks.
Here's a file of ~132MB, one million records, created with
$ perl -E 'say join",",map{int rand 1000000} 1..20 for 1..1000000' > sample.csv
and a quick shot at timing:
$ time cut -d"," -f 17 sample.csv | sort | uniq > out
real 0m4.391s
user 0m4.788s
sys 0m0.060s
$ time perl -F, -E '$s{$F[16]}++ }{ say for sort keys %s' sample.csv > out
real 0m6.716s
user 0m6.668s
sys 0m0.048s
This could make a difference with huge files. I haven't looked at the memory footprint, which might be another clue for deciding for or against a (dogmatic) "pure perl solution".
The bias is always qw(laziness impatience hubris) in an order that fits best.
[1] afaik those utilities are UTF-8 agnostic
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: search/grep perl/*nix
by Laurent_R (Canon) on Nov 25, 2017 at 22:34 UTC
I'd have thought that writing to a file and reading it back would have slowed me down, but it didn't! I'm currently running over a fairly small'ish set of data ($file = approx.100Mb, and $tmpfile = 50Kb).
Yes, it is quite probably slowing you down a tiny bit. (Although I am not entirely sure that Linux isn't writing some data to a temporary file during the process; but let's assume it doesn't, as that does not change the reasoning here anyway.)
You're speaking about reading a 100 MB file in both cases, and also about writing and reading a file 2,000 times smaller ($tmpfile = 50KB) in one of the cases but not in the other. I would think that these latter operations are completely negligible compared to the time needed to read the original file. Even with the best benchmarking tools, on a machine completely dedicated to those tests and doing absolutely nothing else, there is no way you can make sense of a 0.1% difference in execution time.
Re: search/grep perl/*nix
by eyepopslikeamosquito (Archbishop) on Nov 25, 2017 at 22:23 UTC
To help you with improving your Perl technique,
some minor Perl style advice on your originally posted code:
- Always start your scripts with "use strict" and "use warnings"
- You don't need to quote "$tmpfile" in your open call
- Your use of split on newline is pointless; you're reading line-by-line and have already chomp'ed the newline
- Prefer the close function to the ->close method call
That is, I would write your originally posted code:
open my $fh1, "<:encoding(utf-8)","$tmpfile" or die "$tmpfile: $!";
while (<$fh1>) {
chomp;
push @names, split (/\n/);
}
$fh1->close;
as:
use strict;
use warnings;
my $tmpfile = 'f.tmp';   # test file used only for testing this script standalone
my @names;
open my $fh1, "<:encoding(utf-8)", $tmpfile or die "$tmpfile: $!";
while (<$fh1>) {
    chomp;
    push @names, $_;
}
close $fh1;
That said, I strongly endorse the other comments exhorting you to write the whole thing in Perl
without using Unix shell at all.
As for why, see: Unix shell versus Perl
Re: search/grep perl/*nix
by pryrt (Abbot) on Nov 25, 2017 at 17:06 UTC
I was going to have a much more detailed response, but ++1nickt and ++haukex beat me to the "how to do it inside perl".
In other news, your first qw{cut -d"," -f17 $file | sort | uniq > $tmpfile} does not do what you think. Specifically, the qw form splits the text contained within on whitespace, making a list of 'cut', '-d","', ... in void context; use warnings would have told you:
Possible attempt to separate words with commas at 1204245.pl line 13.
Useless use of a constant ("cut") in void context at 1204245.pl line 13.
Useless use of a constant ("-d\",\"") in void context at 1204245.pl line 13.
...
and then it would have died with a message like
1204245-data.tmp: No such file or directory at 1204245.pl line 15.
I am assuming you actually ran qx{cut -d"," -f17 $file | sort | uniq > $tmpfile}, which would have done what you claimed, but would have been better implemented with system, because qx takes the output and puts it in a string, which you were using in void context; system just executes the command.
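In other words (a sketch, assuming $file and $tmpfile as in the OP):
# system() just runs the pipeline and reports its exit status; nothing
# is captured, which is all you need when the output goes to $tmpfile:
system(qq{cut -d"," -f17 $file | sort | uniq > $tmpfile}) == 0
    or die "pipeline failed: $?";
# qx// also runs the command but returns its output as a string, so it
# only makes sense if you actually assign that string to something:
my $output = qx{cut -d"," -f17 $file | sort | uniq};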
That said, follow 1nickt's and haukex's advice for how to do it in Perl, without invoking external commands, much more efficiently.
(argh: ++hippo even beat me to pointing out this error, plus the missing semicolon which I forgot to mention; I'm only still posting because of my sunk cost. *sigh*)
Re: search/grep perl/*nix
by karlgoethebier (Abbot) on Nov 25, 2017 at 19:11 UTC
Did you mean something like this?
Data like 1nickt's but with dups.
fred,barney,wilma,betty
foo,bar,baz,qux
foo,bar,baz,qux
fred,barney,wilma,betty
fred,barney,wilma,betty
apple,orange,banana,pear
apple,orange,banana,pear
foo,bar,baz,qux
fred,barney,wilma,betty
apple,orange,banana,pear
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use Data::Dump;
use Iterator::Simple qw(iter);
use IO::File;    # loaded explicitly for the IO::File->new call below
my $file = q(1204245.csv);
my %column;
open my $fh, q(<), $file or die $!;
while (<$fh>) {
    chomp;
    $column{ ( split /,/ )[2] } = undef;
}
close $fh;
dd \%column;
# what ever..
say for sort { $b cmp $a } keys %column;
say q(--);
# or
my $iterator = iter( IO::File->new($file) );
while (<$iterator>) {
    chomp;
    $column{ ( split /,/ )[2] } = undef;
}
dd \%column;
say for sort { $a cmp $b } keys %column;
__END__
karls-mac-mini:monks karl$ ./1204245.pl
{ banana => undef, baz => undef, wilma => undef }
wilma
baz
banana
--
{ banana => undef, baz => undef, wilma => undef }
banana
baz
wilma
"...dataset get significantly larger..."
Perhaps the solution with Iterator::Simple performs better. But that's just a guess - your mileage may vary. See also Benchmark.
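Something like this rough sketch could be used to compare the two loops (the file name and the -5 run time are just assumptions):
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Iterator::Simple qw(iter);
use IO::File;
my $file = q(1204245.csv);
# Run each variant for at least 5 CPU seconds and print a comparison table.
cmpthese( -5, {
    plain_readline => sub {
        my %column;
        open my $fh, q(<), $file or die $!;
        while (<$fh>) { chomp; $column{ ( split /,/ )[2] } = undef }
        close $fh;
    },
    iterator_simple => sub {
        my %column;
        my $it = iter( IO::File->new($file) );
        while (<$it>) { chomp; $column{ ( split /,/ )[2] } = undef }
    },
});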
Best regards, Karl
«The Crux of the Biscuit is the Apostrophe»
perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'