wackattack has asked for the wisdom of the Perl Monks concerning the following question:

I can do this easily in bash, but it's slow on big files due to hard drive I/O limitations, so I need to load the files into memory and do the same thing there. Here is the method via bash:

grep $SEARCH_TERM $FILE_1 | grep -c Z

Basically it greps each line of a file for two conditions and returns the total count. How do I do this in Perl? I tried loading the entire file into an array:
open my $C_LOC, '<', $C_LOCATIONS_FILE;
chomp(my @C_LOC_ARRAY = <$C_LOC>);
close $C_LOC;
But scanning the array turns into a mess pretty quickly. Again, I'm using arrays because these are big files (gigabytes) and I need to do thousands of searches without having to load the file from the hard drive each time. Thank you so much for the help! EDIT:
================
This takes 46 seconds:

my @foo = grep (/100008020/, @C_LOC_ARRAY); my @foo2 = grep (/Z/,@foo);

This takes 0.823 seconds
 grep 100008020 OT.file | grep -c Z

How do I speed up my perl?

EDIT #2
================
I have a flat (text file) that looks like this:

Tommy Z
Tommy Z
Chris Z
Chris B
Chris Z
Jake Z
Jake Y

I'm simply counting how many Z's each person has and ignoring all other letters.

Output would look like this for 8 million people

Tommy 2
Chris 2
Jake 1

Replies are listed 'Best First'.
Re: Nested greps w/ Perl
by kennethk (Abbot) on Dec 19, 2016 at 21:46 UTC
    Can you describe what you mean by "scanning the array turns into a mess pretty quickly"? I'm worried this is an XY Problem; based on your spec, I'd assume the real concern would be the memory consumption associated with holding the whole array in memory.

    For a naive reading of your spec, it'd probably be something like:

    my $count = grep /\Q$SEARCH_TERM\E/ && /Z/, @C_LOC_ARRAY;
    though if you were committed to chained greps, you could write
    my $count = grep /Z/, grep /\Q$SEARCH_TERM\E/, @C_LOC_ARRAY;
    You might also get a speed boost by using index, depending on your particular need.
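
    For illustration only, a minimal sketch of the index variant, assuming the same $SEARCH_TERM and @C_LOC_ARRAY as above and that the search term is a fixed string (which the \Q...\E already implies):

    # count lines containing both fixed strings, without touching the regex engine
    my $count = grep {
        index( $_, $SEARCH_TERM ) > -1 && index( $_, 'Z' ) > -1
    } @C_LOC_ARRAY;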

    Depending on the type of search you are trying to perform, I would say a database might be a cleaner solution; an in-memory SQLite database would be very fast (though with a big footprint) and a file-based database would allow you to index only once.
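
    For illustration only, a minimal sketch of the in-memory SQLite route with DBI and DBD::SQLite; the file name, the single-column table, and the search term are all assumptions:

    use strict;
    use warnings;
    use DBI;

    # ':memory:' keeps the whole database in RAM
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
                            { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do('CREATE TABLE lines (line TEXT)');

    # bulk-load the flat file once, inside a single transaction for speed
    open my $fh, '<', 'OT.file' or die "open: $!";
    my $ins = $dbh->prepare('INSERT INTO lines (line) VALUES (?)');
    while (<$fh>) {
        chomp;
        $ins->execute($_);
    }
    close $fh;
    $dbh->commit;

    # one count per search term; LIKE '%...%' still scans, but entirely in RAM
    my $sth = $dbh->prepare(
        q{SELECT COUNT(*) FROM lines WHERE line LIKE ? AND line LIKE '%Z%'}
    );
    $sth->execute('%100008020%');    # hypothetical search term
    my ($count) = $sth->fetchrow_array;
    print "$count\n";

    A real schema (separate columns plus an index on the keyed column) is what would make the file-based variant pay off after indexing once.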


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Thank you so much for the help. So I just ran the first command


       my $count = grep /\Q$SEARCH_TERM\E/ && /Z/, @C_LOC_ARRAY;

      on my file which is 1.4 GB in size. It takes 51 times longer than a simple bash command line grep. It also uses 16 GB of virtual RAM.

      By mess I meant that nested for loops can get ugly when I have thousands of search terms in one file to search against a database. I'm simply trying to find out how many times each search term appears in the database.
        The problem is largely driven by memory footprint. If you instead run
        open my $C_LOC, '<', $C_LOCATIONS_FILE;
        while (<$C_LOC>) {
            chomp;
            $count += /\Q$SEARCH_TERM\E/ && /Z/;
        }
        close $C_LOC;
        you should see a substantial reduction in time. To speed it up much beyond that, you need to avoid readline's buffering, at which point you can implement Matching in huge files.
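
        This is not the code from that node, just a minimal sketch of the chunked-sysread idea; the file name, search term, and 8 MB chunk size are assumptions:

        use strict;
        use warnings;

        my $SEARCH_TERM = '100008020';    # hypothetical search term
        my $count       = 0;
        my $tail        = '';             # partial line carried between chunks

        open my $fh, '<', 'OT.file' or die "open: $!";
        while ( sysread $fh, my $buf, 8 * 1024 * 1024 ) {
            my $chunk = $tail . $buf;
            my $nl    = rindex $chunk, "\n";
            if ( $nl < 0 ) {              # no complete line in this chunk yet
                $tail = $chunk;
                next;
            }
            $tail = substr $chunk, $nl + 1;    # keep the trailing partial line
            for my $line ( split /\n/, substr( $chunk, 0, $nl ) ) {
                ++$count if index( $line, $SEARCH_TERM ) > -1
                         && index( $line, 'Z' ) > -1;
            }
        }
        close $fh;

        # a final line without a trailing newline still needs checking
        ++$count if length $tail
                 && index( $tail, $SEARCH_TERM ) > -1
                 && index( $tail, 'Z' ) > -1;

        print "$count\n";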

        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        Huh?

        I have thousands of search terms in one file to search against a database. I'm simply trying to find out how many times each search term appears in the database

        Does this mean that you're dumping out the content of a SQL database to file and then using grep to search the data?

        You realise this defeats the entire purpose of having a database, right? Assuming your database is correctly indexed (and you have sufficient RAM), you should be able to run a query that gives you exactly what you want in a fraction of the time it takes to even dump out the entire table(s) for external processing.
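
        For example, if the data lived in a table with name and letter columns, one aggregate query would replace millions of greps; the DSN, table, and column names below are entirely hypothetical:

        use strict;
        use warnings;
        use DBI;

        # everything below (DSN, table, columns) is hypothetical
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=people.db', '', '',
                                { RaiseError => 1 } );

        my $sth = $dbh->prepare(q{
            SELECT name, COUNT(*)
            FROM   people
            WHERE  letter = 'Z'
            GROUP  BY name
        });
        $sth->execute;
        while ( my ( $name, $n ) = $sth->fetchrow_array ) {
            print "$name $n\n";
        }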

      So I want to do about 16 million grep counts on a 1.4 Gigabyte file. I'm simply counting how many times a Z or Z pops up relative to another variable. It will take me 10 years to do this with nested greps. And 545 years to do this via the perl script you recommended. This shouldn't take this long. It should go pretty quickly.
        I'm unclear on the particulars of your math, but you should note that grep -c Z does not count the number of Z's, but rather the number of lines that contain a Z. If this takes you 10 years, then you have very slow drive access. The majority of the processor time for the code block I posted is consumed by the slurp and memory access, unless you've done something very wrong.

        #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

        I'm dealing almost daily with files of several GB; the simple searches that you describe usually take a couple of minutes, perhaps ten or fifteen if the search process is really complicated, but certainly not days, let alone centuries. So my guess is that you're not telling us everything.

        Please explain exactly the search you're doing.

        So I want to do about 16 million grep counts on a 1.4 Gigabyte file. I'm simply counting how many times a Z or Z pops up relative to another variable. It will take me 10 years to do this with nested greps. And 545 years to do this via the perl script you recommended.

        Load an array by grepping the file for only those lines that contain 'Z' (or 'Z'????) and then grep that resultant subset for each of your 16e6 search terms; it should take around 15 seconds per term, which would reduce your total time to 7.6 days.

        Do the 'Z' (or 'Z'???) filter as one pass, and then you can run your 16 million secondary filters concurrently, one per core, reducing that to a little under two days assuming 4 cores.

        And in the first quarter of next year you'll be able to buy a sub-$2000 machine with 8 cores/16 threads that will reduce that to half a day.
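
        A minimal sketch of that approach in a single process (the file names are assumptions, and the fan-out across cores is left out):

        use strict;
        use warnings;

        # one pass over the big file: keep only the lines that contain a 'Z'
        open my $in, '<', 'OT.file' or die "open: $!";
        my @z_lines = grep { index( $_, 'Z' ) > -1 } <$in>;
        close $in;

        # each search term now only has to scan the (much smaller) 'Z' subset
        open my $terms, '<', 'search_terms.txt' or die "open: $!";
        while ( my $term = <$terms> ) {
            chomp $term;
            my $count = grep { index( $_, $term ) > -1 } @z_lines;
            print "$term $count\n";
        }
        close $terms;

        Splitting search_terms.txt into one slice per core and running a copy of this against each slice is the concurrent variant described above.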


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Nested greps w/ Perl
by LanX (Saint) on Dec 19, 2016 at 21:41 UTC
    • Use a sliding window to load the files in chunks
    • loop thru a @pattern array with regexes and count matches
    • accumulate all counts in a @count array.
    • update: of course you will also need a @pos array for individual search positions

    If you just need the sum of all pattern matches instead of individual counts, consider using just one regex with all patterns joined with | conditions.
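
    As a rough sketch of that last idea (the file name and the example patterns below are placeholders), the patterns can be escaped and joined into one compiled alternation:

    use strict;
    use warnings;

    my @pattern = qw(100008020 100008030 100008040);   # placeholder patterns

    # escape any metacharacters, then join everything into a single regex
    my $any = join '|', map quotemeta, @pattern;
    my $re  = qr/$any/;

    my $total = 0;
    open my $fh, '<', 'OT.file' or die "open: $!";
    while (<$fh>) {
        ++$total if /$re/ && /Z/;
    }
    close $fh;
    print "$total\n";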

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Re: Nested greps w/ Perl
by kcott (Archbishop) on Dec 20, 2016 at 08:06 UTC

    G'day wackattack,

    "I'm using arrays because these are big files (gigabytes) and I need to do thousands of searches without having to load the file from hard drive each time."

    I would neither load the file into an array nor read it, in its entirety, from disk (any number of times). Instead, reading a file of this size line by line would probably be a better option. Here's how I might tackle this task.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use autodie;

    my $file_to_search        = 'file_to_search';
    my $file_of_search_terms  = 'file_of_search_terms';
    my $file_of_search_counts = 'file_of_search_counts';

    my %count;

    {
        open my $search_terms_fh, '<', $file_of_search_terms;
        %count = map { chomp; $_ => 0 } <$search_terms_fh>;
    }

    my @search_terms = keys %count;

    {
        open my $in_fh, '<', $file_to_search;
        while (<$in_fh>) {
            chomp;
            next if -1 == index $_, 'Z';
            for my $search_term (@search_terms) {
                next if -1 == index $_, $search_term;
                ++$count{$search_term};
                last;
            }
        }
    }

    {
        open my $out_fh, '>', $file_of_search_counts;
        print $out_fh "$_ : $count{$_}\n" for sort @search_terms;
    }

    I used this dummy data for testing:

    $ cat file_to_search
    100008020Z
    Z100008020
    100008020
    100008030Z
    Z100008030
    100008030
    100008040Z
    Z100008040
    100008040
    $ cat file_of_search_terms
    100008010
    100008020
    100008030
    100008040
    100008050

    Here's the output:

    $ cat file_of_search_counts
    100008010 : 0
    100008020 : 2
    100008030 : 2
    100008040 : 2
    100008050 : 0

    — Ken

      Thank you Ken. I limited the search terms to 500 and that program you wrote, while it works flawlessly, has been running for over an hour and is still going. I don't know when it will end.

      Interestingly enough, this command processes all search terms and the complete file in 3 minutes 24 seconds.

      time grep -P 'Z' file_to_search | awk '{print $1}' | sort | uniq --count > uniq.count

        You updated your OP since I posted my solution.

        I suspect you don't need that inner (for) loop at all.

        You really need to provide us with a representative sample of your input. You originally posted a search for 100008020, now you seem to be saying that they're not numbers at all but first names. And, if they are indeed names, are there any called Zoë, Zachary, etc.?
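
        Not code from this thread, just a minimal sketch of what dropping the inner loop could look like, given the two-column "Name Letter" layout from the updated OP (file names reused from the example above):

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use autodie;

        my %count;

        open my $in_fh, '<', 'file_to_search';
        while (<$in_fh>) {
            my ( $name, $letter ) = split;
            ++$count{$name} if defined $letter && $letter eq 'Z';
        }
        close $in_fh;

        open my $out_fh, '>', 'file_of_search_counts';
        print $out_fh "$_ : $count{$_}\n" for sort keys %count;
        close $out_fh;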

        — Ken

        grep -P 'Z' file_to_search | awk '{print $1}' | sort | uniq --count > uniq.count

        One perlish equivalent is

        perl -ae '$s{$F[0]}++ if $F[1] eq "Z"; END {print "$_ $s{$_}\n" for keys %s}' file_to_search > uniq.count

        See how the timings compare on your platform.


        Edit: (TIMTOWTDI) s/The/One/