Paraxial has asked for the wisdom of the Perl Monks concerning the following question:

Good morning, Monks!

I'm working with a rather large and ever-changing text log file (think Apache access log). What I wish to do is generate statistics from it, such as requests per hour, HTTPS connections per hour, failed requests per hour, etc.

In order to speed up this process, rather than going through the file linearly, I'd like to do a binary search until I reach the entries from one hour ago, and then parse each line from there to the end of the file.

I've been scratching my head over this one; can anyone help?

Thanks in advance.

Re: Binary Search Timestamps
by CountZero (Bishop) on Jul 16, 2014 at 10:05 UTC
    As you want to generate various statistics on your log files, this means you will have to read these log files again and again and again ...

    Much better, then, to put these log files in a database and run standard SQL queries against it.

    You will only have to enter each line of the log file once into the database, nicely split into its various fields. It won't be terribly difficult to automate that. Then you can have all these statistics updated just by running your SQL queries, which will be much faster than re-reading the log file again and again.
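
    For what it's worth, a minimal sketch of that approach, assuming DBI with DBD::SQLite and hypothetical file and column names:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # Hypothetical paths and schema; adjust to taste.
        my $db = DBI->connect('dbi:SQLite:dbname=access.db', '', '',
                              { RaiseError => 1 });
        $db->do('CREATE TABLE IF NOT EXISTS requests
                 (ts TEXT, host TEXT, method TEXT, path TEXT, status INTEGER)');
        my $ins = $db->prepare('INSERT INTO requests VALUES (?, ?, ?, ?, ?)');

        open my $log, '<', 'access.log' or die "access.log: $!";
        $db->begin_work;    # one transaction makes the bulk insert fast
        while (my $line = <$log>) {
            # Common log format: host ident user [timestamp] "request" status size
            next unless $line =~ m{^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3})};
            $ins->execute($2, $1, $3, $4, $5);    # ts, host, method, path, status
        }
        $db->commit;

    After that, every statistic is a single SELECT away.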

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      In an ideal world this is the method I'd prefer too, but unfortunately we don't have the resources to do it at the moment.

      Thanks for your input however, much appreciated.

        This is false economy. Depending on your skill level, putting it all into SQLite is semi-trivial (Pg or MySQL only slightly harder) and would obviate the need for repetitive, selective, time-consuming, and error-prone reparsing.

        Sometimes taking a time hit up front for extra code, structure, or learning will pay off tenfold down the road.

        I agree with Your Mother that it will take only a small effort (in time, people, computer resources and cost) to put this data in a database. Parsing your log files into an SQLite database will not need more than 50 lines of Perl, and SQLite needs next to no maintenance once it has been set up. A MySQL database, more robust and more scalable, needs more effort to set up, but much of that is a one-time cost which you may already have largely paid if you have a MySQL server running somewhere in the company.
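
        To make that concrete, here is what the hourly statistics could look like as queries, assuming the hypothetical requests table sketched earlier in the thread:

            use strict;
            use warnings;
            use DBI;

            # Hypothetical database and table from the sketch further up.
            my $db = DBI->connect('dbi:SQLite:dbname=access.db', '', '',
                                  { RaiseError => 1 });

            # The first 14 characters of an Apache timestamp ("16/Jul/2014:10")
            # identify the hour, so hourly statistics are simple GROUP BYs.
            my $per_hour = $db->selectall_arrayref(
                'SELECT substr(ts,1,14) AS hour, count(*) FROM requests GROUP BY hour'
            );
            my $failed = $db->selectall_arrayref(
                'SELECT substr(ts,1,14) AS hour, count(*) FROM requests
                   WHERE status >= 400 GROUP BY hour'
            );
            printf "%s  %d\n", @$_ for @$per_hour;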

        And if you are running an Apache web server, there is always mod_log_mysql (although its installation is not for the faint-hearted, it seems), which logs directly into a MySQL database.

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        My blog: Imperial Deltronics
Re: Binary Search Timestamps
by Athanasius (Archbishop) on Jul 16, 2014 at 09:54 UTC

    Hello Paraxial, and welcome to the Monastery!

    Since you will need the data from all the logs up to an hour old, you don’t really need a binary search. Just read the file backwards until the latest record read is more than an hour old. I haven’t used it, but the module File::ReadBackwards is designed for just this task:

    This module reads a file backwards line by line. It is simple to use, memory efficient and fast. ...
    It is intended for processing log and other similar text files which typically have their newest entries appended to them.
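
    Something like the following might do it (untested; the file name and Apache-style "[16/Jul/2014:10:05:00 +0000]" timestamp format are assumptions, and the timezone offset is ignored for brevity):

        use strict;
        use warnings;
        use File::ReadBackwards;
        use Time::Piece;

        my $cutoff = time() - 3600;    # one hour ago

        my $bw = File::ReadBackwards->new('access.log')
            or die "access.log: $!";

        my @last_hour;
        while (defined(my $line = $bw->readline)) {
            my ($ts) = $line =~ /\[([^\s\]]+)/ or next;
            my $epoch = Time::Piece->strptime($ts, '%d/%b/%Y:%H:%M:%S')->epoch;
            last if $epoch < $cutoff;       # older than an hour: done
            unshift @last_hour, $line;      # keep chronological order
        }
        # @last_hour now holds the last hour's entries, oldest first.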

    Hope that helps,

    Athanasius <°(((>< contra mundum

      I did in fact look at this method yesterday, and it does seem a good way to do this vs. binary searching, especially when working with things like log files.

      With that said, the File::SortedSeek module seems a better fit for what I'm doing, but thanks for your input. As a complete newbie to Perl, it's nice to see I didn't go too far off the mark when looking for a solution to this.

Re: Binary Search Timestamps
by AppleFritter (Vicar) on Jul 16, 2014 at 09:46 UTC

    Howdy Paraxial, welcome to the Monastery!

    Binary Searches on Sorted Text Files has a useful snippet of code for binary-searching text files. As it stands it's written for plain sorted files, but all you'd really need to do to adapt it to your needs is modify the conditions for the recursive calls. In fact, I'd pass a callback function in as an extra parameter instead of hardcoding anything specific.

    One of the comments on that node also points out File::SortedSeek, which looks like it may well be useful.
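
    If you go that route, a sketch of the idea (the log name and timestamp format are assumptions, and the exact munge-callback contract should be checked against the module's documentation):

        use strict;
        use warnings;
        use File::SortedSeek;
        use Time::Piece;

        # Munge callback: extract the epoch time from an Apache-style
        # log line so the lines can be compared numerically.
        sub to_epoch {
            my ($line) = @_;
            my ($ts) = $line =~ /\[([^\s\]]+)/ or return;
            return Time::Piece->strptime($ts, '%d/%b/%Y:%H:%M:%S')->epoch;
        }

        open my $fh, '<', 'access.log' or die "access.log: $!";

        # Binary-search to the first line no older than an hour.
        File::SortedSeek::numeric($fh, time() - 3600, \&to_epoch);

        while (my $line = <$fh>) {
            # ... tally requests, failures, etc. for the last hour ...
        }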

      Thanks for this! I found the link to the page on binary searching sorted text files before I posted this, but failed to see the comment mentioning File::SortedSeek!

      You're right, it does exactly what I need and I've now managed to get it working with the log file as expected, so thank you!

      I've found I'll probably run this script every 5-10 minutes in order to keep down the amount of RAM it needs, as it does get quite hungry.

Re: Binary Search Timestamps
by sundialsvc4 (Abbot) on Jul 16, 2014 at 10:58 UTC
    Arrange to have the file rotated, then process the files that have rotated off, putting the results into a database. Also bear in mind that there are many existing programs out there which already do this job quite completely, fun though it may seem to write yet another one. There must be a hundred already.
Re: Binary Search Timestamps
by Anonymous Monk on Jul 16, 2014 at 20:16 UTC

    Another option is to have a (daemon) process reading the log file (think tail -f); this could wake up, say, every 5 minutes, consume the new lines, update its counter bins, and write out the statistics page or meta log file.
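
    A sketch of that shape using CPAN's File::Tail (the file name, interval, and counter structure are illustrative only):

        use strict;
        use warnings;
        use File::Tail;

        my $tail = File::Tail->new(name => 'access.log', maxinterval => 300);

        my %hits_per_hour;
        while (defined(my $line = $tail->read)) {   # blocks until new lines arrive
            # Bucket by the hour part of an Apache-style timestamp,
            # i.e. the first 14 characters: "16/Jul/2014:10".
            my ($hour) = $line =~ /\[([^\s\]]{14})/ or next;
            $hits_per_hour{$hour}++;
            # ... periodically write %hits_per_hour out to a stats page ...
        }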