in reply to Nested greps w/ Perl

G'day wackattack,

"I'm using arrays because these are big files (gigabytes) and I need to do thousands of searches without having to load the file from hard drive each time."

I would neither load the file into an array nor read it, in its entirety, from disk (any number of times). Instead, reading a file of this size line by line would probably be a better option. Here's how I might tackle this task.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my $file_to_search        = 'file_to_search';
my $file_of_search_terms  = 'file_of_search_terms';
my $file_of_search_counts = 'file_of_search_counts';

my %count;

{
    open my $search_terms_fh, '<', $file_of_search_terms;
    %count = map { chomp; $_ => 0 } <$search_terms_fh>;
}

my @search_terms = keys %count;

{
    open my $in_fh, '<', $file_to_search;

    while (<$in_fh>) {
        chomp;
        next if -1 == index $_, 'Z';
        for my $search_term (@search_terms) {
            next if -1 == index $_, $search_term;
            ++$count{$search_term};
            last;
        }
    }
}

{
    open my $out_fh, '>', $file_of_search_counts;
    print $out_fh "$_ : $count{$_}\n" for sort @search_terms;
}

I used this dummy data for testing:

$ cat file_to_search
100008020Z
Z100008020
100008020
100008030Z
Z100008030
100008030
100008040Z
Z100008040
100008040

$ cat file_of_search_terms
100008010
100008020
100008030
100008040
100008050

Here's the output:

$ cat file_of_search_counts
100008010 : 0
100008020 : 2
100008030 : 2
100008040 : 2
100008050 : 0

— Ken

Re^2: Nested greps w/ Perl
by wackattack (Sexton) on Dec 20, 2016 at 19:01 UTC
    Thank you Ken. I limited the search terms to 500, and the program you wrote, while it works flawlessly, has been running for over an hour and is still going. I don't know when it will end.

    Interestingly enough, this command processes all the search terms and the complete file in 3 minutes 24 seconds.

    time grep -P 'Z' file_to_search | awk '{print $1}' | sort | uniq --count > uniq.count

      You updated your OP since I posted my solution.

      I suspect you don't need that inner (for) loop at all.
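
      If the term really does sit in a fixed field, which is what your grep/awk pipeline seems to assume, that loop can be replaced by a single hash lookup per line. Here's a rough sketch of the idea; the field positions are only a guess based on your pipeline, so adjust the split to match your real data:

      #!/usr/bin/env perl
      use strict;
      use warnings;
      use autodie;

      # Load the search terms once, as hash keys, so every line of the big
      # file needs one O(1) lookup instead of a scan over all the terms.
      my %count;
      {
          open my $terms_fh, '<', 'file_of_search_terms';
          %count = map { chomp; $_ => 0 } <$terms_fh>;
      }

      {
          open my $in_fh, '<', 'file_to_search';
          while (<$in_fh>) {
              # Guessed layout: the term is the first whitespace-separated
              # field and a literal 'Z' is the second.
              my ($term, $flag) = split;
              next unless defined $flag and $flag eq 'Z';
              ++$count{$term} if exists $count{$term};
          }
      }

      {
          open my $out_fh, '>', 'file_of_search_counts';
          print {$out_fh} "$_ : $count{$_}\n" for sort keys %count;
      }

      That way the cost per line no longer grows with the number of search terms.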

      You really need to provide us with a representative sample of your input. You originally posted a search for 100008020; now you seem to be saying that they're not numbers at all but first names. And, if they are indeed names, are there any called Zoë, Zachary, etc.?

      — Ken

      grep -P 'Z' file_to_search | awk '{print $1}' | sort | uniq --count > uniq.count

      One perlish equivalent is

      perl -ane '$s{$F[0]}++ if $F[1] eq "Z"; END {print "$_ $s{$_}\n" for keys %s}' file_to_search > uniq.count
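
      The hash iteration order in that END block isn't sorted and the fields come out term first rather than count first, so the result won't match the uniq --count output byte for byte. If that matters, something along these lines (same assumptions about the data) sorts the keys before printing:

      perl -ane '$s{$F[0]}++ if $F[1] eq "Z"; END {print "$s{$_} $_\n" for sort keys %s}' file_to_search > uniq.count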

      See how the timings compare on your platform.


      Edit: (TIMTOWTDI) s/The/One/