in reply to Nested greps w/ Perl

G'day wackattack,

"I'm using arrays because these are big files (gigabytes) and I need to do thousands of searches without having to load the file from hard drive each time."

I would neither load the file into an array nor read it, in its entirety, from disk (any number of times). Instead, reading a file of this size line by line would probably be a better option. Here's how I might tackle this task.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my $file_to_search        = 'file_to_search';
my $file_of_search_terms  = 'file_of_search_terms';
my $file_of_search_counts = 'file_of_search_counts';

my %count;

{
    open my $search_terms_fh, '<', $file_of_search_terms;
    %count = map { chomp; $_ => 0 } <$search_terms_fh>;
}

my @search_terms = keys %count;

{
    open my $in_fh, '<', $file_to_search;

    while (<$in_fh>) {
        chomp;
        next if -1 == index $_, 'Z';
        for my $search_term (@search_terms) {
            next if -1 == index $_, $search_term;
            ++$count{$search_term};
            last;
        }
    }
}

{
    open my $out_fh, '>', $file_of_search_counts;
    print $out_fh "$_ : $count{$_}\n" for sort @search_terms;
}

I used this dummy data for testing:

$ cat file_to_search
100008020Z
Z100008020
100008020
100008030Z
Z100008030
100008030
100008040Z
Z100008040
100008040

$ cat file_of_search_terms
100008010
100008020
100008030
100008040
100008050

Here's the output:

$ cat file_of_search_counts
100008010 : 0
100008020 : 2
100008030 : 2
100008040 : 2
100008050 : 0

— Ken

Re^2: Nested greps w/ Perl
by wackattack (Sexton) on Dec 20, 2016 at 19:01 UTC
    Thank you Ken. I limited the search terms to 500, and the program you wrote, while it works flawlessly, has been running for over an hour and is still going. I don't know when it will end.

    Interestingly enough, this command processes all the search terms and the complete file in 3 minutes 24 seconds.

    time grep -P 'Z' file_to_search | awk '{print $1}' | sort | uniq --count > uniq.count

      You updated your OP since I posted my solution.

      I suspect you don't need that inner (for) loop at all.
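
      If the term really does sit in a fixed field, which is what your grep/awk pipeline seems to assume, that loop can be replaced by a single hash lookup per line. Here's a rough sketch of the idea; the field positions are only a guess based on your pipeline, so adjust the split to match your real data:

      #!/usr/bin/env perl
      use strict;
      use warnings;
      use autodie;

      # Load the search terms once, as hash keys, so every line of the big
      # file needs one O(1) lookup instead of a scan over all the terms.
      my %count;
      {
          open my $terms_fh, '<', 'file_of_search_terms';
          %count = map { chomp; $_ => 0 } <$terms_fh>;
      }

      {
          open my $in_fh, '<', 'file_to_search';
          while (<$in_fh>) {
              # Guessed layout: the term is the first whitespace-separated
              # field and a literal 'Z' is the second.
              my ($term, $flag) = split;
              next unless defined $flag and $flag eq 'Z';
              ++$count{$term} if exists $count{$term};
          }
      }

      {
          open my $out_fh, '>', 'file_of_search_counts';
          print {$out_fh} "$_ : $count{$_}\n" for sort keys %count;
      }

      That way the cost per line no longer grows with the number of search terms.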

      You really need to provide us with a representative sample of your input. You originally posted a search for 100008020; now you seem to be saying that they're not numbers at all but first names. And, if they are indeed names, are there any called Zoë, Zachary, etc.?

      — Ken

      grep -P 'Z' file_to_search | awk '{print $1}' | sort | uniq --count > uniq.count

      One perlish equivalent is

      perl -ane '$s{$F[0]}++ if $F[1] eq "Z"; END {print "$_ $s{$_}\n" for keys %s}' file_to_search > uniq.count
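
      The hash iteration order in that END block isn't sorted and the fields come out term first rather than count first, so the result won't match the uniq --count output byte for byte. If that matters, something along these lines (same assumptions about the data) sorts the keys before printing:

      perl -ane '$s{$F[0]}++ if $F[1] eq "Z"; END {print "$s{$_} $_\n" for sort keys %s}' file_to_search > uniq.count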

      See how the timings compare on your platform.


      Edit: (TIMTOWTDI) s/The/One/