Hi all. This is more about software design and algorithms than about Perl proper; I hope that is okay.

I have already searched the web, CPAN and PerlMonks, but only came up with search engines that provide channel topic data and the like. I want to search IRC logs efficiently, as grep(1) doesn't cut it anymore. The logs are stored as one file per day. That is why I cannot use existing local search engines like namazu: they index each document as a whole. Imagine two search terms that occur on different lines. namazu returns the whole document as a result, but this is useless, because different people will have said the words at different times.

Now, before I venture to design this in detail and program it on my own, do you know of software that already does what I want?

If not, I've thought about two approaches:

  1. Split all day logs into files of one line each so a regular document search engine can digest them. However, I'm stuck without ReiserFS as my file system, and I'm afraid of what that many tiny files will do to disk performance.
  2. Process the day logs to remove control characters and punctuation, lowercase every word, truncate it to its first 20 characters, and output each word with a combined date/timestamp and the line number. I can feed the result into a relational database (RDB). I don't know much about those, so I'm going to need your help there. Is it correct that I need an index on the word column and the date/time column?
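
To make approach 2 concrete, here is a minimal sketch of what I have in mind. It is in Python with the standard sqlite3 module purely for illustration; the same idea should carry over to Perl with DBI. Everything named here is an assumption of mine: the table `words`, the functions `tokenize`, `index_day` and `search`, and above all the log format (each line starting with an `HH:MM` timestamp followed by a space). The single composite index is only my guess at what the database needs.

```python
import re
import sqlite3

MAX_WORD_LEN = 20  # keep only the first 20 characters of each word


def tokenize(text):
    """Return the set of normalised words in one piece of text."""
    # Remove control characters (IRC formatting codes), but leave tab and
    # newline alone so they still act as word separators below.
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)
    # Lowercase, then split on anything that is not a word character.
    return {w[:MAX_WORD_LEN] for w in re.split(r"\W+", text.lower()) if w}


def create_schema(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        " word TEXT NOT NULL, stamp TEXT NOT NULL, lineno INTEGER NOT NULL)"
    )
    # Assumption: one composite index led by the word, since every lookup
    # filters on the word first and then needs stamp and line number.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS words_word_idx"
        " ON words (word, stamp, lineno)"
    )


def index_day(conn, date, lines):
    """Index one day log; `date` is 'YYYY-MM-DD' (hypothetical log format)."""
    rows = []
    for lineno, line in enumerate(lines, start=1):
        # Assumed layout: 'HH:MM rest of the line'.
        time, _, message = line.rstrip("\n").partition(" ")
        stamp = f"{date} {time}"
        rows.extend((word, stamp, lineno) for word in tokenize(message))
    conn.executemany("INSERT INTO words VALUES (?, ?, ?)", rows)


def search(conn, terms):
    """Return (stamp, lineno) for every line that contains all of `terms`."""
    wanted = sorted({w for term in terms for w in tokenize(term)})
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = (
        f"SELECT stamp, lineno FROM words WHERE word IN ({placeholders})"
        " GROUP BY stamp, lineno HAVING COUNT(DISTINCT word) = ?"
        " ORDER BY stamp, lineno"
    )
    return conn.execute(sql, (*wanted, len(wanted))).fetchall()


conn = sqlite3.connect(":memory:")
create_schema(conn)
index_day(conn, "2006-03-01",
          ["12:00 <alice> Perl is fun", "12:05 <bob> search the logs"])
print(search(conn, ["perl", "fun"]))  # prints [('2006-03-01 12:00', 1)]
```

Because the words are grouped by timestamp and line number, a search for "perl" and "logs" returns nothing here: the two words occur on the same day but on different lines, which is exactly the case where a whole-document index gives a useless hit.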

If that's all bunk, please advise how you would go about it.


In reply to IRC log search by Anonymous Monk
