in reply to Re^2: Nested greps w/ Perl
in thread Nested greps w/ Perl

I deal almost daily with files of several GB. The simple searches you describe usually take a couple of minutes, perhaps ten or fifteen if the search is really complicated, but certainly not days, let alone centuries. So my guess is that you're not telling us everything.

Please explain exactly the search you're doing.

Re^4: Nested greps w/ Perl
by wackattack (Sexton) on Dec 20, 2016 at 16:26 UTC
    I have a flat file (a test file) that looks like this:

    Tommy Z
    Tommy Z
    Chris Z
    Chris B
    Chris Z
    Jake Z
    Jake Y

    I'm simply counting how many Z's each person has and ignoring all other letters.

    Output would look like

    Tommy 2
    Chris 2
    Jake 1

      OK, here is a test. I started with a words.txt file having 113809 unique entries:
          $ wc words.txt
          113809 113809 1016714 words.txt
      From there, I created test_file.txt: a 1200-character string was appended to each entry of words.txt, and each entry was then written out 26 times (once per letter A to Z), giving me a test file of about 3 million lines and 3.5 GB:
          $ perl -ne 'chomp; for my $let (A..Z) { print "$_ ". "foobar" x 200, " $let\n" }' words.txt > test_file.txt
          $ wc test_file.txt
          2959034 8877102 3586152466 test_file.txt
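As a sanity check, the wc figures for test_file.txt can be derived from those for words.txt. This is only a sketch of the arithmetic, assuming one word per line with a single-byte newline; the variable names are mine and not part of the test itself.

```perl
use strict;
use warnings;

# Figures reported by `wc words.txt` above: lines and bytes.
my ($word_lines, $word_bytes) = (113809, 1016714);

my $copies  = 26;         # one output line per letter A..Z
my $padding = 6 * 200;    # length of "foobar" x 200

# Every input line yields $copies output lines of three fields each.
my $lines = $word_lines * $copies;
my $words = 3 * $lines;

# Characters in the words themselves, without their newlines.
my $word_chars = $word_bytes - $word_lines;

# Each output line: word, space, padding, space, letter, newline.
my $bytes = $copies * ($word_chars + $word_lines * ($padding + 4));

print "$lines $words $bytes\n";
```

This prints 2959034 8877102 3586152466, which agrees with the wc output for test_file.txt.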
      Then I ran a program that counts, for each word, the number of occurrences of a 'Z':
          use strict;
          use warnings;
          use feature qw/say/;

          my %histogram;
          open my $IN, "<", "test_file.txt" or die "could not open input file $!";
          while (<$IN>) {
              chomp;
              # fields: word, padding string, letter
              my ($key, undef, $val) = split / /, $_;
              $histogram{$key}++ if defined $val and $val eq 'Z';
          }
          # number of distinct words with at least one 'Z'
          print scalar keys %histogram;
      Running it takes less than 14 seconds:
          $ time perl test_Z_bigfile.pl
          113809
          real    0m13.973s
          user    0m11.734s
          sys     0m2.140s
      My input data file certainly does not look like yours, but I could count all entries ending with a "Z" from a 3-million-line and 3.5 GB test file in 14 seconds (on a simple laptop). I fail to see why it should take centuries on your side.
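For the two-column format shown in your post, the same single-pass approach might look like the sketch below. It assumes whitespace-separated fields with the name first and the letter last; the count_z helper and the in-memory sample are only for illustration, and a real run would open your data file instead.

```perl
use strict;
use warnings;

# Count, per name, the lines whose last field is 'Z'.
# Returns the counts and the names in order of first appearance.
sub count_z {
    my ($fh) = @_;
    my (%count, @order);
    while (my $line = <$fh>) {
        my @fields = split ' ', $line;    # also drops the trailing newline
        next unless @fields >= 2 and $fields[-1] eq 'Z';
        push @order, $fields[0] unless exists $count{ $fields[0] };
        $count{ $fields[0] }++;
    }
    return (\%count, \@order);
}

# Sample data from the question, read through an in-memory filehandle.
my $sample = join "\n",
    'Tommy Z', 'Tommy Z', 'Chris Z', 'Chris B', 'Chris Z', 'Jake Z', 'Jake Y', '';
open my $in, '<', \$sample or die "could not open in-memory file: $!";
my ($count, $order) = count_z($in);
close $in;

print "$_ $count->{$_}\n" for @$order;
```

On the sample this prints Tommy 2, Chris 2 and Jake 1, one per line. The file is read once and only one counter per name is kept in memory, so the run time should grow roughly linearly with file size.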

      So, again, you're probably not telling us everything, or you're not doing it the right way.