Re: search/grep perl/*nix
by haukex (Archbishop) on Nov 25, 2017 at 16:48 UTC
I'd have thought that writing to a file and reading it back would have slowed me down, but it didn't!
Without having time to investigate with a benchmark at the moment, my guess is that most of the time is being spent on cut | sort | uniq over the 100MB file, not on reading/writing a 50KB temp file or piping to Perl. The main point I wanted to make is the following:
Is there a bias towards either of these approaches (driven by performance) should the dataset get significantly larger?
I'd have a bias against both of the approaches ;-) All of those tasks can be done in pure Perl, without launching four separate processes. I wrote about the topic of running external processes at length here, but the only advice from there that seems to apply at the moment is "just do it in Perl".
On a simple test file, the following produces the same output as "cut -d"," -f2 /tmp/input.txt | sort | uniq". Note that since I'm locating duplicates with a hash, I don't need to sort the input data first, meaning I can process the file line-by-line without loading all of it into memory. See also How can I remove duplicate elements from a list or array?
use warnings;
use strict;
my $filename = '/tmp/input.txt';
open my $fh, '<:encoding(UTF-8)', $filename or die "$filename: $!";
my %seen;
while (<$fh>) {
    chomp;
    my @fields = split /,/;
    $seen{ $fields[1] }++;    # count the second field; the keys end up unique
}
close $fh;
my @data = sort keys %seen;
print $_,"\n" for @data;
If your input file is CSV, you might consider using Text::CSV_XS instead of split, since it will more robustly handle cases like quotes or escaped/quoted separators within fields. Update before posting: 1nickt just showed an example of that. Note the line-by-line approach can also be used with Text::CSV_XS with its getline method (I showed a short example e.g. here).
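For illustration only, here's a rough sketch of that line-by-line getline approach combined with the %seen hash (the filename and column index are just the assumptions from above):
use warnings;
use strict;
use Text::CSV_XS;
my $filename = '/tmp/input.txt';
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $fh, '<:encoding(UTF-8)', $filename or die "$filename: $!";
my %seen;
# getline returns one parsed record (arrayref) per call, so the file is
# still processed line-by-line, never loaded whole into memory
while ( my $row = $csv->getline($fh) ) {
    $seen{ $row->[1] }++;     # second field, as with split above
}
close $fh;
print "$_\n" for sort keys %seen;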
Update: Added the file encoding from the OP (was it edited?) and minor edits for clarity.
Thanks, haukex
As the dataset grows over a period of time, am I right in assuming that the approach (i.e., the code snippet) you've provided is likely to have a much larger memory footprint, whereas a straight grep has an extremely light one?
a straight grep
The best way to get an idea is to measure: produce several fake input data sets of increasing size, representative of the data you expect to get in the future, and benchmark the various approaches against them. You've said "grep" twice now, but haven't shown an example of it, so we can't really talk about performance comparisons objectively.
As for the code shown so far, I think the Perl code I posted should have a significantly smaller memory footprint than cut | sort | uniq (or cut | sort -u, as hippo said), since the only thing my code keeps in memory is the resulting output data set (that is, the keys of the hash; the numeric hash values shouldn't add a ton of overhead). I haven't measured yet though! (it's Saturday evening here after all ;-) )
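Just as a sketch of what I mean by fake data sets (the 20 columns and the value range are arbitrary assumptions; adjust to match your real data):
use warnings;
use strict;
# Write $lines records of 20 random integer fields each; bump $lines
# (or run this several times with different values) to get test files
# of increasing size.
my ($outfile, $lines) = ('/tmp/fake_input.csv', 1_000_000);
open my $out, '>', $outfile or die "$outfile: $!";
for (1 .. $lines) {
    print {$out} join(',', map { int rand 1_000_000 } 1 .. 20), "\n";
}
close $out or die "$outfile: $!";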
Hi, haukex will provide his own answer no doubt, but: No, the memory footprint should not grow, since
while ( my $line = <$FILEHANDLE> ) { ... }
does *not* slurp the entire file into memory, but reads it one line at a time. See, for example, https://perldoc.perl.org/perlfaq5.html#How-can-I-read-in-an-entire-file-all-at-once%3f for a discussion of the issue.
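To make the difference concrete (filename assumed), compare:
use warnings;
use strict;
# Line-by-line: only one line is held in memory at a time.
open my $fh, '<', '/tmp/input.txt' or die $!;
while ( my $line = <$fh> ) {
    # process $line ...
}
close $fh;
# Slurping: the whole file ends up in @lines at once, so memory use
# grows with the file size.
open my $fh2, '<', '/tmp/input.txt' or die $!;
my @lines = <$fh2>;
close $fh2;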
The way forward always starts with a minimal test.
Re: search/grep perl/*nix
by hippo (Archbishop) on Nov 25, 2017 at 16:55 UTC
The first piece of code runs a shell grep and writes the output into a file.
Actually, it doesn't. Partly this is because it doesn't compile due to the missing semi-colon at the end of the first line, but even if you fix that you should see that it doesn't shell out at all. You probably meant to use the qx operator instead of qw, but that would make no sense because you are using it in void context.
With these mistakes it is hard to know what code you are really running (it certainly can't be what you've posted). This makes it very difficult to provide any insight (as discussed in How do I post a question effectively?). Instead, here's a free tip. Never do this:
$ foo | sort | uniq
if you are concerned about optimisation. sort has a -u flag, which is much more efficient than firing up a separate uniq to de-duplicate the dataset.
G'day Gtforce,
"Both typos corrected on the post (the original code that I have is ok), apologies."
It's fine to make corrections to your posts,
but you also need to indicate what's changed at the point where the change occurred
(e.g. you've said you made a correction here, several screenfuls away from the correction,
but there's no indication of that in the OP itself).
When I first read this thread,
I couldn't initially understand why people were saying use 'qx' instead of 'qw':
your OP had no apparent 'qw'.
See "How do I change/delete my post?" for a more complete discussion.
Because it's directly related, and to save writing a separate reply, in "Re^2: search/grep perl/*nix" you wrote:
"It is qx and I did have the ending semicolon (daft of me to make these typos on the post), apologies."
The easiest way to avoid this, and what I do, is to just copy and paste your code directly into your post.
This is a lot less work than actually typing your code and, because there's no typing involved, you won't make typos.
Re: search/grep perl/*nix
by 1nickt (Canon) on Nov 25, 2017 at 16:41 UTC
Since you are dealing with comma-separated values, you should use a module that's optimized for such work (i.e. handles quoting, null values, etc etc), such as Text::CSV. (If you install Text::CSV_XS your code will run faster.)
Try the following, and Benchmark your results compared to your other solutions. As you say the dataset you are working with is pretty small; I would try to work up a much larger sample for benchmarking.
Personally, I avoid shelling out at almost all costs.
$ cat 1204245.csv
foo,bar,baz,qux
fred,barney,wilma,betty
apple,orange,banana,pear
use strict; use warnings; use feature 'say';
use Text::CSV_XS;
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', './1204245.csv' or die "Died: $!";
my @column = map { $_->[2] } @{ $csv->getline_all( $fh ) };
say for @column;
__END__
Output:
$ perl 1204245.pl
baz
wilma
banana
Hope this helps!
The way forward always starts with a minimal test.
Re: search/grep perl/*nix
by shmem (Chancellor) on Nov 25, 2017 at 19:19 UTC
I'd have thought that writing to a file and reading it back would have slowed me down, but it didn't!
There's only a difference in the filehandle types involved. In the first, the shell opens/closes $tmpfile; in the second, it opens/closes a pipe attached to the Perl-side pipe filehandle created by qx (which perl creates anyway). So it is no surprise there is no difference, especially if you are working with an SSD instead of an old washing-machine type of disk drum (modern disks might hold the entire file in the controller cache, so perl can read the file even before it is physically allocated via magnetism).
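A sketch of what that pipe filehandle looks like when opened explicitly instead of via qx (field number taken from the OP, the rest assumed):
use warnings;
use strict;
my $file = '/tmp/input.txt';
# '-|' attaches a read pipe to the shell pipeline; perl then reads the
# de-duplicated column line by line, no temp file involved.
open my $pipe, '-|', qq{cut -d"," -f17 $file | sort -u} or die "pipe: $!";
while ( my $value = <$pipe> ) {
    chomp $value;
    # ... use $value ...
}
close $pipe or warn "pipeline exited with status $?";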
It would be more interesting to benchmark the shell chain against a pure Perl solution, in which case perl loses here. Why? Because allocating the necessary data structures in Perl means some overhead, whereas the cut, sort and uniq utilities deal only with char arrays[1] and are seasoned, i.e. optimized for their specific tasks.
Here's a file of ~132MB, one million records, created with
$ perl -E 'say join",",map{int rand 1000000} 1..20 for 1..1000000' > sample.csv
and a quick shot at timing:
$ time cut -d"," -f 17 sample.csv | sort | uniq > out
real 0m4.391s
user 0m4.788s
sys 0m0.060s
$ time perl -F, -E '$s{$F[16]}++ }{ say for sort keys %s' sample.csv > out
real 0m6.716s
user 0m6.668s
sys 0m0.048s
This could make a difference with huge files. I haven't looked at the memory footprint, which might be another clue for deciding for or against a (dogmatic) "pure perl solution".
The bias is always qw(laziness impatience hubris) in an order that fits best.
[1] afaik those utilities are UTF-8 agnostic
perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: search/grep perl/*nix
by Laurent_R (Canon) on Nov 25, 2017 at 22:34 UTC
I'd have thought that writing to a file and reading it back would have slowed me down, but it didn't! I'm currently running over a fairly small'ish set of data ($file = approx.100Mb, and $tmpfile = 50Kb).
Yes, it is quite probably slowing you down a tiny bit. (Although I am not entirely sure that Linux isn't writing some data to a temporary file during the process; but let's assume it doesn't, as that does not change the reasoning here anyway.)
You're speaking about reading a 100 MB file in both cases, and also about writing and reading a file 2,000 times smaller ($tmpfile = 50KB) in one of the cases but not in the other. I would think that these latter operations are completely negligible compared to the time needed to read the original file. Even with the best benchmarking tools, on a machine completely dedicated to those tests and doing absolutely nothing else, there is no way you can make sense of a 0.1% difference in execution time.
Re: search/grep perl/*nix
by eyepopslikeamosquito (Archbishop) on Nov 25, 2017 at 22:23 UTC
To help you with improving your Perl technique,
some minor Perl style advice on your originally posted code:
- Always start your scripts with "use strict" and "use warnings"
- You don't need to quote "$tmpfile" in your open call
- Your use of split on newline is pointless; you're reading line-by-line and have already chomp'ed the newline
- Prefer the close function to the ->close method call
That is, I would write your originally posted code:
open my $fh1, "<:encoding(utf-8)","$tmpfile" or die "$tmpfile: $!";
while (<$fh1>) {
chomp;
push @names, split (/\n/);
}
$fh1->close;
as:
use strict;
use warnings;
my $tmpfile = 'f.tmp';   # test file used only for testing this script standalone
my @names;
open my $fh1, "<:encoding(utf-8)", $tmpfile or die "$tmpfile: $!";
while (<$fh1>) {
    chomp;
    push @names, $_;
}
close $fh1;
That said, I strongly endorse the other comments exhorting you to write the whole thing in Perl
without using Unix shell at all.
As for why, see: Unix shell versus Perl
Re: search/grep perl/*nix
by pryrt (Abbot) on Nov 25, 2017 at 17:06 UTC
I was going to have a much more detailed response, but ++1nickt and ++haukex beat me to the "how to do it inside perl".
In other news, your first qw{cut -d"," -f17 $file | sort | uniq > $tmpfile} does not do what you think. Specifically, the qw form splits the text contained within on whitespace, making a list of 'cut', '-d","', ... in void context; use warnings would have told you:
Possible attempt to separate words with commas at 1204245.pl line 13.
Useless use of a constant ("cut") in void context at 1204245.pl line 13.
Useless use of a constant ("-d\",\"") in void context at 1204245.pl line 13.
...
and then it would have died with a message like
1204245-data.tmp: No such file or directory at 1204245.pl line 15.
I am assuming you actually ran qx{cut -d"," -f17 $file | sort | uniq > $tmpfile}, which would have done what you claimed, but would have been better implemented with system, because qx takes the output and puts it in a string, which you were using in void context; system just executes the command.
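In other words (a sketch, assuming $file and $tmpfile as in the OP):
# system() just runs the pipeline and reports its exit status; nothing
# is captured, which is all you need when the output goes to $tmpfile:
system(qq{cut -d"," -f17 $file | sort | uniq > $tmpfile}) == 0
    or die "pipeline failed: $?";
# qx// also runs the command but returns its output as a string, so it
# only makes sense if you actually assign that string to something:
my $output = qx{cut -d"," -f17 $file | sort | uniq};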
That said, follow 1nickt's and haukex's advice for how to do it in Perl, without invoking external commands, much more efficiently.
(argh: ++hippo even beat me to pointing out this error, plus the missing semicolon which I forgot to mention; I'm only still posting because of my sunk cost. *sigh*)
Re: search/grep perl/*nix
by karlgoethebier (Abbot) on Nov 25, 2017 at 19:11 UTC
Did you mean something like this?
Data like 1nickt's but with dups.
fred,barney,wilma,betty
foo,bar,baz,qux
foo,bar,baz,qux
fred,barney,wilma,betty
fred,barney,wilma,betty
apple,orange,banana,pear
apple,orange,banana,pear
foo,bar,baz,qux
fred,barney,wilma,betty
apple,orange,banana,pear
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use Data::Dump;
use Iterator::Simple qw(iter);
use IO::File;    # loaded explicitly for the IO::File->new call below
my $file = q(1204245.csv);
my %column;
open my $fh, q(<), $file or die $!;
while (<$fh>) {
    chomp;
    $column{ ( split /,/ )[2] } = undef;
}
close $fh;
dd \%column;
# what ever..
say for sort { $b cmp $a } keys %column;
say q(--);
# or
my $iterator = iter( IO::File->new($file) );
while (<$iterator>) {
    chomp;
    $column{ ( split /,/ )[2] } = undef;
}
dd \%column;
say for sort { $a cmp $b } keys %column;
__END__
karls-mac-mini:monks karl$ ./1204245.pl
{ banana => undef, baz => undef, wilma => undef }
wilma
baz
banana
--
{ banana => undef, baz => undef, wilma => undef }
banana
baz
wilma
"...dataset get significantly larger..."
Perhaps the solution with Iterator::Simple performs better. But that's just a guess - your mileage may vary. See also Benchmark.
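Something like this rough sketch could be used to compare the two loops (the file name and the -5 run time are just assumptions):
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Iterator::Simple qw(iter);
use IO::File;
my $file = q(1204245.csv);
# Run each variant for at least 5 CPU seconds and print a comparison table.
cmpthese( -5, {
    plain_readline => sub {
        my %column;
        open my $fh, q(<), $file or die $!;
        while (<$fh>) { chomp; $column{ ( split /,/ )[2] } = undef }
        close $fh;
    },
    iterator_simple => sub {
        my %column;
        my $it = iter( IO::File->new($file) );
        while (<$it>) { chomp; $column{ ( split /,/ )[2] } = undef }
    },
});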
Best regards, Karl
«The Crux of the Biscuit is the Apostrophe»
perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'