in reply to Re: Nested greps w/ Perl
in thread Nested greps w/ Perl

So I want to do about 16 million grep counts on a 1.4 Gigabyte file. I'm simply counting how many times a Z or Z pops up relative to another variable. It will take me 10 years to do this with nested greps. And 545 years to do this via the perl script you recommended. This shouldn't take this long. It should go pretty quickly.

Replies are listed 'Best First'.
Re^3: Nested greps w/ Perl
by kennethk (Abbot) on Dec 19, 2016 at 22:45 UTC
    I'm unclear on the particulars of your math, but you should note that grep -c Z does not count the number of Z's; rather, it counts the number of lines that contain a Z. If this takes you 10 years, then you have very slow drive access. Unless you've done something very wrong, the majority of processor time for the code block I posted is consumed by the slurp and memory access.
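    The lines-versus-occurrences distinction is easy to demonstrate; a minimal sketch (the file demo.txt is hypothetical, not from the thread):

```shell
# A hypothetical two-line file to show the difference:
printf 'ZZ\nZ\n' > demo.txt

grep -c Z demo.txt           # counts matching LINES: prints 2
grep -o Z demo.txt | wc -l   # counts individual Z's: prints 3
```

    grep -o emits each match on its own line, so piping it through wc -l gives a true occurrence count.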

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re^3: Nested greps w/ Perl
by Laurent_R (Canon) on Dec 19, 2016 at 23:15 UTC
    I deal almost daily with files of several GB, and the simple searches you describe usually take a couple of minutes, perhaps ten or fifteen if the search process is really complicated, but certainly not days, let alone centuries. So my guess is that you're not telling us everything.

    Please explain exactly the search you're doing.

      I have a flat file (a test file) that looks like this:

      Tommy Z
      Tommy Z
      Chris Z
      Chris B
      Chris Z
      Jake Z
      Jake Y

      I'm simply counting how many Z's each person has and ignoring all other letters.

      Output would look like

      Tommy 2
      Chris 2
      Jake 1

        OK, here is a test. I started with a words.txt file having 113809 unique entries:

            $ wc words.txt
            113809 113809 1016714 words.txt

        From there, I created a test_file.txt file, in which one 1200-character string was added to each entry of the words.txt file, and the whole set repeated 26 times (once per letter A..Z), giving me a 3-million-line, 3.5 GB test_file.txt test file:

            $ perl -ne 'chomp; for my $let (A..Z) { print "$_ " . "foobar" x 200, " $let\n" }' words.txt > test_file.txt
            $ wc test_file.txt
            2959034 8877102 3586152466 test_file.txt
        Then, I ran a program counting, for each word, the number of occurrences of 'Z':
            use strict;
            use warnings;
            use feature qw/say/;

            my %histogram;
            open my $IN, "<", "test_file.txt"
                or die "could not open input file $!";
            while (<$IN>) {
                chomp;
                my ($key, undef, $val) = split / /, $_;
                $histogram{$key}++ if defined $val and $val eq 'Z';
            }
            print scalar keys %histogram;
        Running it takes less than 14 seconds:
            $ time perl test_Z_bigfile.pl
            113809

            real    0m13.973s
            user    0m11.734s
            sys     0m2.140s
        My input data file certainly does not look like yours, but I could count all entries ending with a 'Z' in a 3-million-line, 3.5 GB test file in 14 seconds (on a simple laptop). I fail to see why it should take centuries on your side.

        So, again, you're probably not telling everything or not doing it the right way.

Re^3: Nested greps w/ Perl
by BrowserUk (Patriarch) on Dec 19, 2016 at 23:35 UTC
    So I want to do about 16 million grep counts on a 1.4 Gigabyte file. I'm simply counting how many times a Z or Z pops up relative to another variable. It will take me 10 years to do this with nested greps. And 545 years to do this via the perl script you recommended.

    Load an array by grepping the file for only those lines that contain 'Z' (or 'Z'????), and then grep that resultant subset for each of your 16e6 search terms; it should take around 15 seconds each, which would reduce your total time to 7.6 days.

    Do the 'Z' (or 'Z'???) filter as one pass, and then you can run your 16 million secondary filters concurrently, one per core, reducing that to a little under two days assuming 4 cores.
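
    A rough sketch of that two-phase idea (file names hypothetical; the pattern assumes the letter is the last field on each line, as in the sample posted above):

```shell
# Phase 1, run once: keep only the lines that end in Z
printf 'Tommy Z\nTommy Z\nChris Z\nChris B\nChris Z\nJake Z\nJake Y\n' > big_file.txt
grep ' Z$' big_file.txt > z_only.txt

# Phase 2, repeated per search term, now against the much smaller subset
grep -c '^Chris ' z_only.txt   # prints 2
```

    Every secondary count now scans only the Z-bearing subset rather than the full 1.4 GB file, which is where the speedup comes from.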

    And in the first quarter of next year you'll be able to buy a sub-$2000 machine with 8 cores/16 threads that will reduce that to half a day.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice.