Re: Nested greps w/ Perl
by kennethk (Abbot) on Dec 19, 2016 at 21:46 UTC
Can you describe what you mean by "scanning the array turns into a mess pretty quickly"? I'm worried this is an XY Problem; based on your spec, I'd assume the real concern is the memory consumption associated with holding the whole array in memory.
For a naive reading of your spec, it'd probably be something like:
my $count = grep /\Q$SEARCH_TERM\E/ && /Z/, @C_LOC_ARRAY;
though if you were committed to chained greps, you could write
my $count = grep /Z/,
            grep /\Q$SEARCH_TERM\E/,
            @C_LOC_ARRAY;
You might also get a speed boost by using index, depending on your particular need.
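For example, a minimal sketch of the index-based variant (reusing the $SEARCH_TERM and @C_LOC_ARRAY names from the snippets above) might be:
# Sketch only: count lines containing both the literal term and a 'Z',
# using index() instead of the regex engine.
my $count = grep { index($_, $SEARCH_TERM) > -1 && index($_, 'Z') > -1 }
            @C_LOC_ARRAY;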
Depending on the type of search you are trying to perform, a database might be a cleaner solution; an in-memory SQLite database would be very fast (though it would have a large memory footprint), and a file-based database would let you index only once.
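A rough sketch of the in-memory SQLite idea (DBI with DBD::SQLite assumed available; the file name, table layout, and use of @ARGV for the search term are made up for illustration):
#!/usr/bin/env perl
# Sketch only: load the big file once into an in-memory SQLite table,
# then each search term becomes a single COUNT(*) query.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do('CREATE TABLE locations (line TEXT)');

my $ins = $dbh->prepare('INSERT INTO locations (line) VALUES (?)');
open my $fh, '<', 'file_to_search' or die $!;
while (my $line = <$fh>) {
    chomp $line;
    $ins->execute($line);
}
close $fh;
$dbh->commit;

# Count lines containing both the term (from the command line) and a 'Z'.
my ($count) = $dbh->selectrow_array(
    'SELECT COUNT(*) FROM locations WHERE line LIKE ? AND line LIKE ?',
    undef, "%$ARGV[0]%", '%Z%',
);
print "$ARGV[0] : $count\n";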
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Thank you so much for the help.
So I just ran the first command
my $count = grep /\Q$SEARCH_TERM\E/ && /Z/, @C_LOC_ARRAY;
on my file, which is 1.4 GB in size. It takes 51 times longer than a simple bash command-line grep, and it uses 16 GB of virtual RAM.
By "mess" I meant that nested for loops get ugly when I have thousands of search terms in one file to search against a database. I'm simply trying to find out how many times each search term appears in the database.
The problem is largely driven by memory footprint. If you instead run
open my $C_LOC, '<', $C_LOCATIONS_FILE;
while (<$C_LOC>) {
    chomp;
    $count += /\Q$SEARCH_TERM\E/ && /Z/;
}
close $C_LOC;
you should see a substantial reduction in time. To speed it up significantly beyond that, you need to avoid readline's buffering, at which point you can implement the approach from Matching in huge files.
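A minimal sketch in that spirit (not the code from that node; the file name and search term are taken from @ARGV here) reads the file in large raw chunks and holds back the trailing partial line so a match can't be split across two reads:
#!/usr/bin/env perl
# Sketch only: count lines containing both the literal term and a 'Z'
# by reading the file in large chunks instead of line by line.
use strict;
use warnings;

my ($C_LOCATIONS_FILE, $SEARCH_TERM) = @ARGV;

my $count = 0;
my $tail  = '';
my $buf;

open my $fh, '<:raw', $C_LOCATIONS_FILE or die $!;
while (sysread $fh, $buf, 16 * 1024 * 1024) {    # 16 MB chunks
    $buf = $tail . $buf;
    # Hold back any trailing partial line for the next iteration.
    $tail = $buf =~ s/([^\n]*)\z// ? $1 : '';
    # Count complete lines that contain both the term and a 'Z'.
    $count += () = $buf =~ /^(?=.*\Q$SEARCH_TERM\E)(?=.*Z).*$/mg;
}
$count++ if $tail =~ /\Q$SEARCH_TERM\E/ && $tail =~ /Z/;
close $fh;

print "$count\n";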
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Huh?
I have thousands of search terms in one file to search against a database. I'm simply trying to find out how many times each search term appears in the database
Does this mean that you're dumping out the content of a SQL database to file and then using grep to search the data?
You realise this defeats the entire purpose of having a database, right? Assuming your database is correctly indexed (and you have sufficient RAM), you should be able to run a query that gives you exactly what you want in a fraction of the time it takes to even dump out the entire table(s) for external processing.
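For illustration only, since the real schema isn't shown in the thread: if the data lived in a table with a term column and a flag column, plus a table of search terms, one aggregate query could return every count in a single pass, e.g. via DBI:
# Hypothetical schema: locations(term, flag) holds the data rows and
# search_terms(term) holds the terms to count.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=locations.db', '', '',
                       { RaiseError => 1 });

my $sth = $dbh->prepare(<<'SQL');
    SELECT t.term, COUNT(l.term) AS hits
    FROM   search_terms AS t
    LEFT JOIN locations AS l
           ON l.term = t.term AND l.flag = 'Z'
    GROUP  BY t.term
SQL
$sth->execute;
while (my ($term, $hits) = $sth->fetchrow_array) {
    print "$term : $hits\n";
}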
So I want to do about 16 million grep counts on a 1.4 gigabyte file. I'm simply counting how many times a Z or Z pops up relative to another variable.
It will take me 10 years to do this with nested greps, and 545 years to do it via the Perl script you recommended. This shouldn't take this long; it should go pretty quickly.
I'm unclear on the particulars of your math, but you should note that grep -c Z does not count the number of Z's; rather, it counts the number of lines that contain a Z. If this takes you 10 years, then you have very slow drive access. The majority of processor time for the code block I posted is consumed by the slurp and memory access, unless you've done something very wrong.
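A quick illustration of that distinction, using a made-up one-line string with two Z's:
my $line = 'aZZb';
my $matching_lines = $line =~ /Z/ ? 1 : 0;    # what grep -c Z counts: 1
my $z_total        = () = $line =~ /Z/g;      # every individual Z: 2
print "$matching_lines matching line, $z_total Z's\n";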
#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
So I want to do about 16 million grep counts on a 1.4 Gigabyte file. I'm simply counting how many times a Z or Z pops up relative to another variable. It will take me 10 years to do this with nested greps. And 545 years to do this via the perl script you recommended.
Load an array by grepping the file for only those lines that contain 'Z' (or 'Z'????), and then grep that resultant subset for each of your 16e6 search terms; it should take around 15 seconds per term, which would reduce your total time to 7.6 days.
Do the 'Z' (or 'Z'???) filter as one pass, and then you can run your 16 million secondary filters concurrently, one per core, and reduce that to a little under two days, assuming 4 cores.
And in the first quarter of next year you'll be able to buy a sub-$2000 machine with 8 cores/16 threads, which will reduce that to about half a day.
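A minimal sketch of that pre-filter-then-count approach (file names come from @ARGV; the per-core parallel fan-out is left out):
#!/usr/bin/env perl
# Sketch only: one pass keeps just the lines containing 'Z'; each term
# then only has to scan that (much smaller) in-memory subset.
use strict;
use warnings;

my ($terms_file, $big_file) = @ARGV;

my @z_lines;
open my $big, '<', $big_file or die "open $big_file: $!";
while (<$big>) {
    push @z_lines, $_ if index($_, 'Z') > -1;
}
close $big;

open my $terms, '<', $terms_file or die "open $terms_file: $!";
while (my $term = <$terms>) {
    chomp $term;
    my $count = grep { index($_, $term) > -1 } @z_lines;
    print "$term : $count\n";
}
close $terms;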
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Nested greps w/ Perl
by LanX (Saint) on Dec 19, 2016 at 21:41 UTC
- Use a sliding window to load the files in chunks
- Loop through a @pattern array of regexes and count matches
- Accumulate all counts in a @count array
- Update: of course you will also need a @pos array for individual search positions
If you just need the sum of all pattern matches instead of individual counts, consider using just one regex with all patterns joined by | alternation.
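A minimal sketch of that single-regex variant (the pattern list and file name below are placeholders; quotemeta keeps literal terms safe in the alternation):
# Sketch only: sum all matches using one alternation built from the patterns.
use strict;
use warnings;

my @patterns    = qw(100008020 100008030 100008040);   # placeholder terms
my $alternation = join '|', map { quotemeta } @patterns;
my $re          = qr/$alternation/;

my $total = 0;
open my $fh, '<', 'file_to_search' or die $!;
while (my $line = <$fh>) {
    $total += () = $line =~ /$re/g;    # count every match on this line
}
close $fh;
print "total matches: $total\n";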
Re: Nested greps w/ Perl
by kcott (Archbishop) on Dec 20, 2016 at 08:06 UTC
G'day wackattack,
"I'm using arrays because these are big files (gigabytes) and I need to do thousands of searches without having to load the file from hard drive each time."
I would neither load the file into an array nor read it, in its entirety, from disk (any number of times).
Instead, reading a file of this size line by line would probably be a better option.
Here's how I might tackle this task.
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my $file_to_search        = 'file_to_search';
my $file_of_search_terms  = 'file_of_search_terms';
my $file_of_search_counts = 'file_of_search_counts';

my %count;

# Read the search terms once and initialise every count to zero.
{
    open my $search_terms_fh, '<', $file_of_search_terms;
    %count = map { chomp; $_ => 0 } <$search_terms_fh>;
}

my @search_terms = keys %count;

# Read the big file line by line; skip lines without a 'Z' cheaply,
# then credit the first search term found on the line.
{
    open my $in_fh, '<', $file_to_search;
    while (<$in_fh>) {
        chomp;
        next if -1 == index $_, 'Z';
        for my $search_term (@search_terms) {
            next if -1 == index $_, $search_term;
            ++$count{$search_term};
            last;
        }
    }
}

# Write one "term : count" line per search term.
{
    open my $out_fh, '>', $file_of_search_counts;
    print $out_fh "$_ : $count{$_}\n" for sort @search_terms;
}
I used this dummy data for testing:
$ cat file_to_search
100008020Z
Z100008020
100008020
100008030Z
Z100008030
100008030
100008040Z
Z100008040
100008040
$ cat file_of_search_terms
100008010
100008020
100008030
100008040
100008050
Here's the output:
$ cat file_of_search_counts
100008010 : 0
100008020 : 2
100008030 : 2
100008040 : 2
100008050 : 0
Thank you Ken. I limited the search terms to 500, and that program you wrote, while it works flawlessly, has been running for over an hour and is still going. I don't know when it will end.
Interestingly enough, this command processes all search terms and the complete file in 3 minutes 24 seconds:
time grep -P 'Z' file_to_search | awk '{print $1}' | sort | uniq --count > uniq.count
You updated your OP since I posted my solution.
I suspect you don't need that inner (for) loop at all.
You really need to provide us with a representative sample of your input.
You originally posted a search for 100008020; now you seem to be saying that they're not numbers at all but first names.
And, if they are indeed names, are there any called Zoë, Zachary, etc.?
perl -ae '$s{$F[0]}++ if $F[1] eq "Z"; END {print "$_ $s{$_}\n" for keys %s}' file_to_search > uniq.count
See how the timings compare on your platform.
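Spelled out as a commented script (assuming, as the awk pipeline above also implies, whitespace-separated lines with the term in the first field and the flag in the second), the one-liner is roughly:
#!/usr/bin/env perl
# Rough expansion of the -a one-liner above, for readability only.
use strict;
use warnings;

my %seen;
while (<>) {
    my @F = split ' ';                           # what -a (autosplit) does
    next unless defined $F[1] && $F[1] eq 'Z';
    $seen{ $F[0] }++;
}
print "$_ $seen{$_}\n" for keys %seen;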
Edit: (TIMTOWTDI) s/The/One/