Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. This is more about software design and algorithms than about Perl proper; I hope that is okay.

I already STFW, searched CPAN and perlmonks, but only came up with search engines that provide channel topic data and the like. I want to search IRC logs efficiently, as grep(1) doesn't cut it anymore. The log files are one per day. That is why I cannot use existing local search engines like namazu, which index each document as a whole. Imagine two search terms that occur on different lines: namazu returns the whole document as a result, but this is useless, because different people will have said the words at different times.

Now before I venture to design in detail and program this on my own, do you know about software that already does what I want?

If not, I've thought about two approaches:

  1. Split each day's log into files of one line each so a regular document search engine can digest them. Only I'm stuck without reiser as the file system, and I'm afraid of what this will do to disk performance.
  2. Process the day logs: remove control characters and punctuation, lowercase every word, truncate each word to its first 20 characters, and output each word with a combined date/timestamp and the line number. The result I can feed into an RDB. I don't know much about that, so I'm going to need your help there. Is it correct that I need an index on the word and the date/time columns?
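The processing step in approach 2 can be sketched in a few lines of Perl. This is only a sketch: the sub name is invented, and the date component of each tuple would come from the file name (one file per day) rather than being shown here.

```perl
use strict;
use warnings;

# Reduce one log line to indexable words: drop control characters
# and punctuation, lowercase, truncate each word to 20 characters.
sub index_words {
    my ($line) = @_;
    $line =~ s/[[:cntrl:][:punct:]]+/ /g;
    return map { substr lc($_), 0, 20 } split ' ', $line;
}

# Emit "word <TAB> line-number" tuples for a given day's log,
# ready to be bulk-loaded into a database.
if (@ARGV) {
    while (my $line = <>) {
        print "$_\t$.\n" for index_words($line);
    }
}
```

Run against one day's log, this produces one row per word occurrence, which is exactly what the word/time index in approach 2 needs.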

If that's all bunk, please advise how you would go about it.

Replies are listed 'Best First'.
Re: IRC log search
by FitTrend (Pilgrim) on Feb 22, 2005 at 04:50 UTC

    I'm assuming that you will have a large volume of data to process. If so, a database is the way to go to speed up searching and data crunching. However, the database schema will depend on what you are trying to get out of your project. What kind of reporting or searching will you be performing?

    If you are unsure about the database schema, insert each line into a database (I prefer mysql) and then utilize some database analysis queries to help optimize your structure. The MySQL documentation at http://dev.mysql.com/doc/mysql/en/optimizing-database-structure.html is a good place to start.

    On the Perl side: DBI, DBD::mysql, and Class::DBI are a few good modules to help you talk to your database. There are certainly more modules and examples on CPAN.

    Hope this helps.

      In go one or several search terms, perhaps sometimes constrained to the most recent n months or so. Out come the day and the line number where someone mentioned all the terms. With those data I can look up the immediate context in the log file.

      Currently I'm not treating the nickname at the front of each line differently from the spoken words. This means a nickname can be given as an additional search term to restrict results to what a certain person said.

      Is that enough to make a schema?

      Yes, good pointers. I looked at DBI; that's straightforward, but I don't understand Class::DBI on a conceptual level. Still reading the MySQL guide.

        Based on what I know, I would say the schema should minimally be:

        id, INT(11) AUTO_INCREMENT
        nickname, VARCHAR(100)
        message, TEXT
        timestamp, INT(11) (epoch seconds)
        dateStamp, DATETIME

        Some would argue that you don't need two timestamp fields. However, in my experience I sometimes need to search based on an EPOCH time range (timestamp), and at other times to use MySQL's date system to search between certain dates (dateStamp).
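        Spelled out as DDL (wrapped in Perl here), that schema could look like the following. The table name irc_log, the exact types, and the index choices are one reasonable reading of the field list above, not gospel:

```perl
use strict;
use warnings;

# CREATE TABLE matching the fields above; indexes on nickname and
# on both time columns support the query patterns discussed in
# this thread.
my $ddl = q{
    CREATE TABLE irc_log (
        id        INT(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
        nickname  VARCHAR(100),
        message   TEXT,
        timestamp INT(11),     -- epoch seconds
        dateStamp DATETIME,
        INDEX (nickname),
        INDEX (timestamp),
        INDEX (dateStamp)
    )
};

# With a connected DBI handle:
# $dbh->do($ddl);
```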

        My logic for this database is:

        • You can limit queries based on nickname
        • You can limit queries based on time range
        • You can perform certain matches based on text
        • Use a combination of any/all three
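        A hedged sketch of how any combination of those three constraints could be turned into one query. The table and column names follow the schema above; the sub build_search and its options are invented for illustration:

```perl
use strict;
use warnings;

# Build SQL and bind values for any combination of nickname,
# epoch time range, and search terms (a row must match all terms).
sub build_search {
    my (%opt) = @_;
    my (@where, @bind);
    if (defined $opt{nick}) {
        push @where, 'nickname = ?';
        push @bind,  $opt{nick};
    }
    if (defined $opt{from} && defined $opt{to}) {
        push @where, 'timestamp BETWEEN ? AND ?';
        push @bind,  $opt{from}, $opt{to};
    }
    for my $term (@{ $opt{terms} || [] }) {
        push @where, 'message LIKE ?';
        push @bind,  "%$term%";
    }
    my $sql = 'SELECT dateStamp, nickname, message FROM irc_log';
    $sql .= ' WHERE ' . join(' AND ', @where) if @where;
    return ($sql, @bind);
}

my ($sql, @bind) = build_search(nick => 'somenick', terms => ['foo', 'bar']);
# With a connected DBI handle:
# my $rows = $dbh->selectall_arrayref($sql, {}, @bind);
```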

        I've included an ID field. This field exists per common database practice and allows for future growth. I feel you may need to code a sub or script to parse a raw log and insert it into the database properly. It may get tricky depending on how the IRC client formats the time stamps in its logs.
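        A minimal sketch of such a parsing/loading script. The "[HH:MM] <nick> message" line format, the date-in-file-name convention, and the connection details are all assumptions to adjust for your own logs:

```perl
use strict;
use warnings;
use Time::Local 'timegm';

# Parse one log line of the form "[12:34] <nick> message"
# (format is an assumption); returns (hour, minute, nickname,
# message), or an empty list if the line doesn't match.
sub parse_line {
    my ($line) = @_;
    return $line =~ /^\[(\d\d):(\d\d)\]\s+<([^>]+)>\s+(.*)/;
}

if (@ARGV) {
    # One file per day: take the date from a name like "2005-02-22.log".
    my ($y, $mo, $d) = $ARGV[0] =~ /(\d{4})-(\d\d)-(\d\d)/
        or die "no date in file name: $ARGV[0]\n";

    require DBI;
    my $dbh = DBI->connect('dbi:mysql:database=irclogs', 'user', 'pass',
                           { RaiseError => 1 });
    my $ins = $dbh->prepare(
        'INSERT INTO irc_log (nickname, message, timestamp, dateStamp)
         VALUES (?, ?, ?, ?)');

    while (my $line = <>) {
        my ($h, $mi, $nick, $msg) = parse_line($line) or next;
        $ins->execute($nick, $msg,
                      timegm(0, $mi, $h, $d, $mo - 1, $y),
                      sprintf('%04d-%02d-%02d %02d:%02d:00',
                              $y, $mo, $d, $h, $mi));
    }
}
```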

        Hope this helps