in reply to Nested greps w/ Perl

Can you describe what you mean by "scanning the array turns into a mess pretty quickly"? I'm worried this is an XY Problem; based on your spec, I'd assume the real concern is the memory consumption associated with holding the whole array in memory.

For a naive reading of your spec, it'd probably be something like:

my $count = grep /\Q$SEARCH_TERM\E/ && /Z/, @C_LOC_ARRAY;
though if you were committed to chained greps, you could write
my $count = grep /Z/, grep /\Q$SEARCH_TERM\E/, @C_LOC_ARRAY;
You might also get a speed boost by using index, depending on your particular need.
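For instance, if $SEARCH_TERM is always a literal string rather than a pattern (an assumption on my part), a sketch like this avoids the regex engine entirely:

# Sketch only: index() does plain substring searches, which can beat a regex
# when the search term is a fixed string.
my $count = grep { index($_, $SEARCH_TERM) >= 0 && index($_, 'Z') >= 0 } @C_LOC_ARRAY;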

Depending on the type of search you are trying to perform, I would say a database might be a cleaner solution; an in-memory SQLite database would be very fast (though with a large memory footprint), and a file-based database would let you index only once.
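A rough sketch of the file-based route (the schema, file name, and column names here are placeholders, since I don't know your data layout; requires DBD::SQLite):

use DBI;

# Sketch only: a one-time import into SQLite, then cheap indexed queries.
my $dbh = DBI->connect('dbi:SQLite:dbname=locations.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS loc (term TEXT, code TEXT)');
$dbh->do('CREATE INDEX IF NOT EXISTS idx_term_code ON loc (term, code)');

# After populating the table once from the flat file, each count is a single query:
my ($count) = $dbh->selectrow_array(
    'SELECT COUNT(*) FROM loc WHERE term = ? AND code = ?',
    undef, $SEARCH_TERM, 'Z',
);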


#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re^2: Nested greps w/ Perl
by wackattack (Sexton) on Dec 19, 2016 at 22:06 UTC
    Thank you so much for the help. So I just ran the first command


     my $count = grep /\Q$SEARCH_TERM\E/ && /Z/, @C_LOC_ARRAY;

    on my file, which is 1.4 GB in size. It takes 51 times longer than a simple bash command-line grep and uses 16 GB of virtual RAM.

    By "mess" I meant that nested for loops can get ugly when I have thousands of search terms in one file to search against a database. I'm simply trying to find out how many times each search term appears in the database.
      The problem is largely driven by memory footprint. If you instead run
      open my $C_LOC, '<', $C_LOCATIONS_FILE;
      while (<$C_LOC>) {
          chomp;
          $count += /\Q$SEARCH_TERM\E/ && /Z/;
      }
      close $C_LOC;
      you should see a substantial reduction in time. To speed it up further, you need to avoid readline's buffering, at which point you can implement Matching in huge files.
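      A sketch of that block-reading idea (the 1 MB block size is an arbitrary choice, and it assumes newline-terminated records):

      # Sketch only: read in large blocks with sysread instead of line by line.
      open my $C_LOC, '<', $C_LOCATIONS_FILE or die "open: $!";
      my ($count, $buf, $tail) = (0, '', '');
      while (sysread($C_LOC, $buf, 1024 * 1024)) {
          $buf = $tail . $buf;
          my $nl = rindex($buf, "\n");
          if ($nl < 0) { $tail = $buf; next }      # no complete line in this block yet
          $tail   = substr($buf, $nl + 1);         # carry the partial last line forward
          $count += grep { /\Q$SEARCH_TERM\E/ && /Z/ } split /\n/, substr($buf, 0, $nl);
      }
      $count++ if length $tail && $tail =~ /\Q$SEARCH_TERM\E/ && $tail =~ /Z/;
      close $C_LOC;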

      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      Huh?

      I have thousands of search terms in one file to search against a database. I'm simply trying to find out how many times each search term appears in the database

      Does this mean that you're dumping out the content of a SQL database to file and then using grep to search the data?

      You realise this defeats the entire purpose of having a database, right? Assuming your database is correctly indexed (and you have sufficient RAM), you should be able to run a query that gives you exactly what you want in a fraction of the time it takes to even dump out the entire table(s) for external processing.

        It's a flat file (a text document). Although I'm wondering if I should put the file into an SQLite database, I can't help but think I should be able to do this query rather quickly.

        All I'm doing is asking

        How many Z's does Jake have?
        How many Z's does Lisa have?
        How many Z's does Tommy have?

        And doing that for 8 million people.
Re^2: Nested greps w/ Perl
by wackattack (Sexton) on Dec 19, 2016 at 22:29 UTC
    So I want to do about 16 million grep counts on a 1.4 gigabyte file. I'm simply counting how many times a Z or Z pops up relative to another variable. It will take me 10 years to do this with nested greps, and 545 years to do it via the Perl script you recommended. This shouldn't take this long; it should go pretty quickly.
      I'm unclear on the particulars of your math, but you should note that grep -c Z does not count the number of Z's, but rather the number of lines that contain a Z. If this takes you 10 years, then you have very slow drive access. Unless you've done something very wrong, the majority of processor time for the code I posted is consumed by reading the file and by memory access.
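      To illustrate the distinction in Perl terms (a sketch; @lines is just a stand-in for your data):

      # Sketch only: on the line "ZZZ", the first counts 1 (one matching line),
      # the second counts 3 (three Z characters).
      my $lines_with_z = grep { /Z/ } @lines;    # what grep -c Z reports
      my $total_zs     = 0;
      $total_zs += tr/Z// for @lines;            # every Z, even repeats on a line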

      #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      I'm dealing almost daily with files of several GB; doing the simple searches that you describe usually takes a couple of minutes, perhaps ten or fifteen if the search process is really complicated, but certainly not days, let alone centuries. So my guess is that you're not telling us everything.

      Please explain exactly the search you're doing.

        I have a flat file (a text file) that looks like this:

        Tommy Z
        Tommy Z
        Chris Z
        Chris B
        Chris Z
        Jake Z
        Jake Y

        I'm simply counting how many Z's each person has and ignoring all other letters.

        Output would look like

        Tommy 2
        Chris 2
        Jake 1
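
        For that shape of data, a single pass over the file with a hash keyed on the name gives every count at once, with no per-term grep at all. A sketch (the file name and the whitespace-separated two-column layout are assumed from the sample above):

        # Sketch only: one pass, one hash; counts every name's Z's simultaneously.
        my %z_count;
        open my $fh, '<', 'people.txt' or die "open: $!";
        while (<$fh>) {
            my ($name, $letter) = split ' ';
            $z_count{$name}++ if defined $letter && $letter eq 'Z';
        }
        close $fh;
        print "$_ $z_count{$_}\n" for sort keys %z_count;

        That is one scan of the 1.4 GB file instead of millions of separate scans, so the running time is dominated by a single read of the file.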

      So I want to do about 16 million grep counts on a 1.4 Gigabyte file. I'm simply counting how many times a Z or Z pops up relative to another variable. It will take me 10 years to do this with nested greps. And 545 years to do this via the perl script you recommended.

      Load an array by grepping the file for only those lines that contain 'Z' (or 'Z'????), and then grep that resultant subset for each of your 16e6 search terms; it should take around 15 seconds per term, which would reduce your total time to 7.6 days.

      Do the 'Z' (or 'Z'???) filter as one pass, and then you can run your 16 million secondary filters concurrently, one per core, reducing that to a little under two days assuming 4 cores.
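      A sketch of that two-stage approach (@SEARCH_TERMS is just a placeholder for however you hold your 16e6 terms):

      # Sketch only: one pass keeps just the 'Z' lines, then each term is
      # counted against that much smaller in-memory subset.
      open my $fh, '<', $C_LOCATIONS_FILE or die "open: $!";
      my @z_lines = grep { /Z/ } <$fh>;
      close $fh;

      my %count;
      for my $term (@SEARCH_TERMS) {
          $count{$term} = grep { /\Q$term\E/ } @z_lines;
      }

      The per-core part is then just a matter of splitting @SEARCH_TERMS into chunks and running one such loop per process.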

      And in the first quarter of next year you'll be able to buy a sub-$2000 machine with 8 cores/16 threads, which will reduce that to half a day.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
      In the absence of evidence, opinion is indistinguishable from prejudice.