calebcall has asked for the wisdom of the Perl Monks concerning the following question:

What would be the best way to extract, record, and count a single field in a very large log file (an Apache access log)? The file is intentionally large (for a legal request): it is 1.5TB in size. What I would like to do is pull the date from each line, count how many requests there were per date, then report the number of requests alongside each date. If the file wasn't so large, I could just do something like:

cat logfile.log | awk '{print $4}' | sort | uniq -c

However, reading a 1.5TB file into memory just isn't going to work :)

Where would I start?

Re: Working with a very large log file (parsing data out)
by BrowserUk (Patriarch) on Feb 20, 2013 at 07:50 UTC
    If the file wasn't so large, I could just do something like:

        cat logfile.log | awk '{print $4}' | sort | uniq -c

    However, reading a 1.5TB file into memory just isn't going to work :)

    That command chain ought to work as-is -- even with a very large file -- because each process in the chain (except sort) handles the data line by line. And although sort does need to process the entire file, it knows how to spill intermediate results to temporary files, avoiding memory exhaustion.

    I'm not saying it will be fast. But it should work.
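
    For instance, with GNU sort you can point the temporary spill files at a filesystem that has room for the intermediate data and give it a bigger in-memory buffer (the /big/tmp path is just a placeholder):

    awk '{print $4}' logfile.log | sort -T /big/tmp -S 4G | uniq -c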

    However, something like this should also do the trick and be substantially faster (~60 hours; my original estimate of ~1.25 hours was off, see the exchange below):

    perl -anle"++$h{ $F[ 4 ] } }{ print qq[$h{ $_ } $_] for sort keys %h" theLogFile > resultsFile

    Update: You might need $F[3]; I can't remember offhand whether awk's field numbers are zero-based or one-based.
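
    A quick way to check from a Unix shell (the sample line is made up):

    echo "a b c d e" | awk '{print $4}'             # prints "d" -- awk's fields are 1-based
    echo "a b c d e" | perl -anle 'print $F[3]'     # prints "d" -- Perl's @F is 0-based

    So awk's $4 corresponds to $F[3] here.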


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      ... and be substantially faster (~1.25 hours)

      How did you figure the time?

        By running it on a 5.4GB logfile -- that took 12.5 minutes -- and then scaling: 1.5TB / 5.4GB = ~285; 285 * 12.5 minutes = ~3,562 minutes; 3,562 / 60 = ~59.4; + (a bit for contingency) = 75 hours.

        And then making the mistake of treating that as minutes instead of hours!

        Thank you for the heads up, I'll correct the above!


Re: Working with a very large log file (parsing data out)
by MidLifeXis (Monsignor) on Feb 20, 2013 at 13:12 UTC

    If the file is already sorted by date, you can do something like this

    my $date  = '';
    my $count = 0;
    while (<>) {
        @data = split(...);
        if ( $data[3] ne $date ) {
            if ($date) {
                print $date, "\t", $count, "\n";
            }
            $count = 0;
            $date  = $data[3];
        }
        $count++;
    }
    # Catch the last one
    print $date, "\t", $count, "\n";

    update: forgot last print

    --MidLifeXis

Re: Working with a very large log file (parsing data out)
by tmharish (Friar) on Feb 20, 2013 at 07:18 UTC
    open( my $log_file_handle, '<', 'logfile.log' ) or die( "Could not open file\n" );
    while ( <$log_file_handle> ) {
        my $single_line_in_log_file = $_;
        ## Now we are dealing with only one line
        ...
    }
    close( $log_file_handle );

    You might also want to record how far you have got every so often -- the byte offset from tell -- and seek back to it on the next run, so you can continue from where you left off if the program dies.
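
    A rough sketch of that checkpointing idea (the logfile.pos file name and the once-per-million-lines interval are just placeholders):

    use strict;
    use warnings;

    my $checkpoint = 'logfile.pos';
    open( my $log_file_handle, '<', 'logfile.log' ) or die( "Could not open file\n" );

    # Resume from the byte offset saved by a previous run, if any.
    if ( open( my $cp, '<', $checkpoint ) ) {
        my $offset = <$cp>;
        close( $cp );
        seek( $log_file_handle, $offset, 0 ) if defined $offset;   # 0 == SEEK_SET
    }

    while ( <$log_file_handle> ) {
        # ... process the line ...

        if ( $. % 1_000_000 == 0 ) {    # save progress now and then
            open( my $cp, '>', $checkpoint ) or die( "Could not write checkpoint\n" );
            print {$cp} tell( $log_file_handle );
            close( $cp );
        }
    }
    close( $log_file_handle );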

Re: Working with a very large log file (parsing data out)
by mbethke (Hermit) on Feb 20, 2013 at 17:20 UTC

    ++ to what MidLifeXis said. As logs tend to be sorted already, you can likely avoid the sort, which is the only part liable to be a problem memory-wise.

    To add to that, for data this size it may be worth running a little preprocessor in C, especially if your log format has fixed-size fields or other delimiters easily recognized with C string functions. That way you could both split the parsing over two CPU cores and avoid running slow regexen (or even substr() which is fast for Perl but still doesn't even come close to C). Something like this (largely untested but you get the idea):

    #include <stdlib.h>
    #include <stdio.h>
    #include <errno.h>
    #include <string.h>

    int main(int argc, char *argv[])
    {
        char buf[10000];
        FILE *fh;

        if (2 != argc) {
            fputs("Usage: filter <log>\n", stderr);
            exit(1);
        }
        if (!(fh = fopen(argv[1], "r"))) {
            perror("Cannot open log");
            exit(1);
        }
        while (fgets(buf, sizeof(buf), fh)) {
            static const size_t START_OFFSET = 50;
            size_t len = strlen(buf);
            char *endp;

            if ('\n' != buf[len - 1]) {
                fputs("WARNING: line did not fit in buffer, skipped\n", stderr);
                continue;
            }
            endp = buf + START_OFFSET;
            len = 20;
            /* To search for a blank after the field instead of using a fixed width:
             * endp = strchr(buf + START_OFFSET, ' ');
             * len = endp ? endp - (buf + START_OFFSET) : len - START_OFFSET;  // careful with strchr()==NULL
             */
            fwrite(buf + START_OFFSET, 1, len, stdout);
            putchar('\n');    /* newline-terminate each record for the counting step below */
        }
        fclose(fh);
        return 0;
    }

    Edit: jhourcle's post just reminded me of the part I missed initially, namely that it's an Apache log. So if you use the standard combined format you could just use START_OFFSET=9 and len=11 to print only the date, if you don't want to differentiate by result code. Then a simple

    my %h;
    chomp, $h{$_}++ while (<>);

    would get the requests-per-date counts; the only slightly trickier part is getting them sorted chronologically on output. Something like

    use Date::Parse;    # for str2time()

    for ( sort { $a->[0] <=> $b->[0] }
          map  { [ Date::Parse::str2time($_) => $_ ] }
          keys %h )
    {
        print $_->[1], ": ", $h{ $_->[1] }, "\n";
    }
      As logs tend to be sorted already, it's likely you can avoid the sort as the only part that's likely to be a problem memory-wise

      They're sorted by the time that they finish, but the time logged is when the request was made ... so a long-running CGI, or a request to transfer a large file, at the end of dayN might show up in the log after lines for dayN+1.

      But you still don't have to sort the whole file: count as you go in (mostly) date order, then make a second pass over the much smaller summary and add up the counts for any dates that got split across runs.
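
      For instance, if the first pass writes "date<TAB>count" lines (as MidLifeXis's snippet above does), the second pass over that much smaller summary could look like this rough sketch:

      my %total;
      while (<>) {
          chomp;
          my ( $date, $count ) = split /\t/;
          $total{$date} += $count;
      }
      print "$_\t$total{$_}\n" for sort keys %total;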

        True, I missed the part where he said it's an Apache log m(

        I'd try and avoid making several passes over 1.5TB in Perl though. If you just accumulate request counts in a hash keyed by date as I just added above, you don't have to.

Re: Working with a very large log file (parsing data out)
by generator (Pilgrim) on Feb 21, 2013 at 00:32 UTC
    I'd build a hash using the log entry date as the key and a running count as the (numeric) value. As each line of the source log file is read, test for the existence of the key: if it is found, increment the value; if not, create a new key/value pair in the hash. Sorting the hash after the file has been processed should be significantly less memory-intensive, as you'll be sorting the summary records rather than the detail records. That's my 2 cents, for what it's worth.
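
    A minimal sketch of that approach (the field index and date cleanup are assumptions about the Apache log layout; the final sort is lexicographic -- see mbethke's reply above if you need chronological order):

    use strict;
    use warnings;

    my %count_for_date;
    while ( my $line = <> ) {
        my $date = ( split ' ', $line )[3];    # e.g. "[20/Feb/2013:07:50:41"
        $date =~ s/^\[//;                      # drop the leading bracket
        $date =~ s/:.*//;                      # keep only the date portion
        $count_for_date{$date}++;              # creates the key the first time a date is seen
    }

    # Sort and print the small summary, not the 1.5TB of detail lines.
    print "$_\t$count_for_date{$_}\n" for sort keys %count_for_date;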
    <><

    generator

Re: Working with a very large log file (parsing data out)
by topher (Scribe) on Feb 25, 2013 at 17:05 UTC

    At my previous job, I did a *lot* of log processing. As much as I love Perl, for quick and dirty ad hoc log mangling, awk was frequently my go-to tool. For cases exactly as you describe, I used the following:

    cat logfile.log | awk '{count[$4]++}; END {for (x in count) {print count[x], x}};' | sort -nr

    By using an associative array (hash) to track the unique values, you reduce the amount of data you have to sort by orders of magnitude (potentially).

    Note: This is not a "max performance" solution. It is a "usually fast enough" solution. If you want maximum performance, there are lots of additional things you can do to make this faster. One of the easiest things (that often pays quick dividends on modern multi-core/CPU systems) is to compress your log files. This decreases the disk IO, and for many systems will be faster than reading the whole uncompressed file from disk.

    zcat logfile.log.gz | awk '{count[$4]++}; END {for (x in count) {print count[x], x}};' | sort -nr

    Another possible speedup would be to use a Perl equivalent of the awk, but stop the line split at the number of fields you care about (plus 1 for "the rest"). This will frequently be faster than the awk example, but it is slightly less suited to typing in manually every time you hit a log file with an ad hoc query. Although, looking at them side by side, it's really not much more difficult; I think it's just the hundreds of times I've typed the awk version that makes it pop quickly from my fingers.

    zcat logfile.log.gz | perl -ne '@line = split " ", $_, 5; $count{$line[3]}++; END {print "$count{$_} $_\n" for (keys %count); };'