Re: Nested greps w/ Perl
by kennethk (Abbot) on Dec 19, 2016 at 21:46 UTC
Can you describe what you mean by "scanning the array turns into a mess pretty quickly"? I'm worried this is an XY Problem; based on your spec, I'd assume the real concern is the memory consumption associated with holding the whole array in memory.
For a naive reading of your spec, it'd probably be something like:
my $count = grep /\Q$SEARCH_TERM\E/ && /Z/, @C_LOC_ARRAY;
though if you were committed to chained greps, you could write
my $count = grep /Z/,
            grep /\Q$SEARCH_TERM\E/,
            @C_LOC_ARRAY;
You might also get a speed boost by using index, depending on your particular need.
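For example, a minimal sketch of the index-based variant (reusing the $SEARCH_TERM and @C_LOC_ARRAY names from the snippets above) might be:
# Sketch only: count lines containing both the literal term and a 'Z',
# using index() instead of the regex engine.
my $count = grep { index($_, $SEARCH_TERM) > -1 && index($_, 'Z') > -1 }
            @C_LOC_ARRAY;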
Depending on the type of search you are trying to perform, a database might be a cleaner solution; an in-memory SQLite database would be very fast (though it would have a large memory footprint), and a file-based database would let you index only once.
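A rough sketch of the in-memory SQLite idea (DBI with DBD::SQLite assumed available; the file name, table layout, and use of @ARGV for the search term are made up for illustration):
#!/usr/bin/env perl
# Sketch only: load the big file once into an in-memory SQLite table,
# then each search term becomes a single COUNT(*) query.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do('CREATE TABLE locations (line TEXT)');

my $ins = $dbh->prepare('INSERT INTO locations (line) VALUES (?)');
open my $fh, '<', 'file_to_search' or die $!;
while (my $line = <$fh>) {
    chomp $line;
    $ins->execute($line);
}
close $fh;
$dbh->commit;

# Count lines containing both the term (from the command line) and a 'Z'.
my ($count) = $dbh->selectrow_array(
    'SELECT COUNT(*) FROM locations WHERE line LIKE ? AND line LIKE ?',
    undef, "%$ARGV[0]%", '%Z%',
);
print "$ARGV[0] : $count\n";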
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Thank you so much for the help.
So I just ran the first command
my $count = grep /\Q$SEARCH_TERM\E/ && /Z/, @C_LOC_ARRAY;
on my file, which is 1.4 GB in size. It takes 51 times longer than a simple bash command-line grep, and it uses 16 GB of virtual RAM.
By "mess" I meant that nested for loops get ugly when I have thousands of search terms in one file to search against a database. I'm simply trying to find out how many times each search term appears in the database.
The problem is largely driven by memory footprint. If you instead run
open my $C_LOC, '<', $C_LOCATIONS_FILE;
while (<$C_LOC>) {
    chomp;
    $count += /\Q$SEARCH_TERM\E/ && /Z/;
}
close $C_LOC;
you should see a substantial reduction in time. To speed it up significantly beyond that, you need to avoid readline's buffering, at which point you can implement the approach from Matching in huge files.
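A minimal sketch in that spirit (not the code from that node; the file name and search term are taken from @ARGV here) reads the file in large raw chunks and holds back the trailing partial line so a match can't be split across two reads:
#!/usr/bin/env perl
# Sketch only: count lines containing both the literal term and a 'Z'
# by reading the file in large chunks instead of line by line.
use strict;
use warnings;

my ($C_LOCATIONS_FILE, $SEARCH_TERM) = @ARGV;

my $count = 0;
my $tail  = '';
my $buf;

open my $fh, '<:raw', $C_LOCATIONS_FILE or die $!;
while (sysread $fh, $buf, 16 * 1024 * 1024) {    # 16 MB chunks
    $buf = $tail . $buf;
    # Hold back any trailing partial line for the next iteration.
    $tail = $buf =~ s/([^\n]*)\z// ? $1 : '';
    # Count complete lines that contain both the term and a 'Z'.
    $count += () = $buf =~ /^(?=.*\Q$SEARCH_TERM\E)(?=.*Z).*$/mg;
}
$count++ if $tail =~ /\Q$SEARCH_TERM\E/ && $tail =~ /Z/;
close $fh;

print "$count\n";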
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Huh?
I have thousands of search terms in one file to search against a database. I'm simply trying to find out how many times each search term appears in the database
Does this mean that you're dumping out the content of a SQL database to file and then using grep to search the data?
You realise this defeats the entire purpose of having a database, right? Assuming your database is correctly indexed (and you have sufficient RAM), you should be able to run a query that gives you exactly what you want in a fraction of the time it takes to even dump out the entire table(s) for external processing.
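For illustration only, since the real schema isn't shown in the thread: if the data lived in a table with a term column and a flag column, plus a table of search terms, one aggregate query could return every count in a single pass, e.g. via DBI:
# Hypothetical schema: locations(term, flag) holds the data rows and
# search_terms(term) holds the terms to count.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=locations.db', '', '',
                       { RaiseError => 1 });

my $sth = $dbh->prepare(<<'SQL');
    SELECT t.term, COUNT(l.term) AS hits
    FROM   search_terms AS t
    LEFT JOIN locations AS l
           ON l.term = t.term AND l.flag = 'Z'
    GROUP  BY t.term
SQL
$sth->execute;
while (my ($term, $hits) = $sth->fetchrow_array) {
    print "$term : $hits\n";
}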
So I want to do about 16 million grep counts on a 1.4 gigabyte file. I'm simply counting how many times a Z or Z pops up relative to another variable.
It will take me 10 years to do this with nested greps, and 545 years to do it via the Perl script you recommended. This shouldn't take this long; it should go pretty quickly.
I'm unclear on the particulars of your math, but you should note that grep -c Z does not count the number of Z's; rather, it counts the number of lines that contain a Z. If this takes you 10 years, then you have very slow drive access. The majority of processor time for the code block I posted is consumed by the slurp and memory access, unless you've done something very wrong.
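A quick illustration of that distinction, using a made-up one-line string with two Z's:
my $line = 'aZZb';
my $matching_lines = $line =~ /Z/ ? 1 : 0;    # what grep -c Z counts: 1
my $z_total        = () = $line =~ /Z/g;      # every individual Z: 2
print "$matching_lines matching line, $z_total Z's\n";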
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
So I want to do about 16 million grep counts on a 1.4 Gigabyte file. I'm simply counting how many times a Z or Z pops up relative to another variable. It will take me 10 years to do this with nested greps. And 545 years to do this via the perl script you recommended.
Load an array by grepping the file for only those lines that contain 'Z' (or 'Z'????), and then grep that resultant subset for each of your 16e6 search terms; it should take around 15 seconds per term, which would reduce your total time to 7.6 days.
Do the 'Z' (or 'Z'???) filter as one pass, and then you can run your 16 million secondary filters concurrently, one per core, and reduce that to a little under two days, assuming 4 cores.
And in the first quarter of next year you'll be able to buy a sub-$2000 machine with 8 cores/16 threads, which will reduce that to about half a day.
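A minimal sketch of that pre-filter-then-count approach (file names come from @ARGV; the per-core parallel fan-out is left out):
#!/usr/bin/env perl
# Sketch only: one pass keeps just the lines containing 'Z'; each term
# then only has to scan that (much smaller) in-memory subset.
use strict;
use warnings;

my ($terms_file, $big_file) = @ARGV;

my @z_lines;
open my $big, '<', $big_file or die "open $big_file: $!";
while (<$big>) {
    push @z_lines, $_ if index($_, 'Z') > -1;
}
close $big;

open my $terms, '<', $terms_file or die "open $terms_file: $!";
while (my $term = <$terms>) {
    chomp $term;
    my $count = grep { index($_, $term) > -1 } @z_lines;
    print "$term : $count\n";
}
close $terms;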
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Nested greps w/ Perl
by LanX (Saint) on Dec 19, 2016 at 21:41 UTC
- Use a sliding window to load the files in chunks
- Loop through a @pattern array of regexes and count matches
- Accumulate all counts in a @count array
- Update: of course you will also need a @pos array for individual search positions
If you just need the sum of all pattern matches instead of individual counts, consider using just one regex with all patterns joined by | alternation.
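A minimal sketch of that single-regex variant (the pattern list and file name below are placeholders; quotemeta keeps literal terms safe in the alternation):
# Sketch only: sum all matches using one alternation built from the patterns.
use strict;
use warnings;

my @patterns    = qw(100008020 100008030 100008040);   # placeholder terms
my $alternation = join '|', map { quotemeta } @patterns;
my $re          = qr/$alternation/;

my $total = 0;
open my $fh, '<', 'file_to_search' or die $!;
while (my $line = <$fh>) {
    $total += () = $line =~ /$re/g;    # count every match on this line
}
close $fh;
print "total matches: $total\n";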
Re: Nested greps w/ Perl
by kcott (Archbishop) on Dec 20, 2016 at 08:06 UTC
G'day wackattack,
"I'm using arrays because these are big files (gigabytes) and I need to do thousands of searches without having to load the file from hard drive each time."
I would neither load the file into an array nor read it, in its entirety, from disk (any number of times).
Instead, reading a file of this size line by line would probably be a better option.
Here's how I might tackle this task.
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my $file_to_search        = 'file_to_search';
my $file_of_search_terms  = 'file_of_search_terms';
my $file_of_search_counts = 'file_of_search_counts';

my %count;

# Read the search terms once and initialise every count to zero.
{
    open my $search_terms_fh, '<', $file_of_search_terms;
    %count = map { chomp; $_ => 0 } <$search_terms_fh>;
}

my @search_terms = keys %count;

# Read the big file line by line; skip lines without a 'Z' cheaply,
# then credit the first search term found on the line.
{
    open my $in_fh, '<', $file_to_search;
    while (<$in_fh>) {
        chomp;
        next if -1 == index $_, 'Z';
        for my $search_term (@search_terms) {
            next if -1 == index $_, $search_term;
            ++$count{$search_term};
            last;
        }
    }
}

# Write one "term : count" line per search term.
{
    open my $out_fh, '>', $file_of_search_counts;
    print $out_fh "$_ : $count{$_}\n" for sort @search_terms;
}
I used this dummy data for testing:
$ cat file_to_search
100008020Z
Z100008020
100008020
100008030Z
Z100008030
100008030
100008040Z
Z100008040
100008040
$ cat file_of_search_terms
100008010
100008020
100008030
100008040
100008050
Here's the output:
$ cat file_of_search_counts
100008010 : 0
100008020 : 2
100008030 : 2
100008040 : 2
100008050 : 0
Thank you Ken. I limited the search terms to 500, and that program you wrote, while it works flawlessly, has been running for over an hour and is still going. I don't know when it will end.
Interestingly enough, this command processes all search terms and the complete file in 3 minutes 24 seconds:
time grep -P 'Z' file_to_search | awk '{print $1}' | sort | uniq --count > uniq.count
You updated your OP since I posted my solution.
I suspect you don't need that inner (for) loop at all.
You really need to provide us with a representative sample of your input.
You originally posted a search for 100008020; now you seem to be saying that they're not numbers at all but first names.
And, if they are indeed names, are there any called Zoë, Zachary, etc.?
perl -ae '$s{$F[0]}++ if $F[1] eq "Z"; END {print "$_ $s{$_}\n" for keys %s}' file_to_search > uniq.count
See how the timings compare on your platform.
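Spelled out as a commented script (assuming, as the awk pipeline above also implies, whitespace-separated lines with the term in the first field and the flag in the second), the one-liner is roughly:
#!/usr/bin/env perl
# Rough expansion of the -a one-liner above, for readability only.
use strict;
use warnings;

my %seen;
while (<>) {
    my @F = split ' ';                           # what -a (autosplit) does
    next unless defined $F[1] && $F[1] eq 'Z';
    $seen{ $F[0] }++;
}
print "$_ $seen{$_}\n" for keys %seen;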
Edit: (TIMTOWTDI) s/The/One/