cool256 has asked for the wisdom of the Perl Monks concerning the following question:

Oh, ye wise ones,

I have a basic search to perform. The issue I face is optimization and speed.
Perhaps someone can give me a better idea of how to accomplish this in Perl rather than going to C.
My situation is the following:
1. Log sizes range up to 1015289 lines of text.
2. I have 4 strings I need to find within these logs.
3. I have hundreds of these logs to go through and generate reports from.

The problem: slow, slow, slow
Any suggestions from search masters in here?

Thanks in advance

Replies are listed 'Best First'.
Re: Optimizing string searches
by perrin (Chancellor) on Sep 05, 2008 at 16:38 UTC
    Use a sliding window, and if you're looking for constants, use index() instead of a regex.
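
    A minimal sketch of the index() part of that advice, assuming four hypothetical constant strings and a hypothetical file name:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical constants -- substitute the real strings you search for.
        my @needles = ('ERROR', 'TIMEOUT', 'FATAL', 'OOM');

        open my $fh, '<', 'some.log' or die "Cannot open some.log: $!";
        while (my $line = <$fh>) {
            for my $needle (@needles) {
                # index() is a plain substring search (returns -1 on no match),
                # so the regex engine is never involved.
                if (index($line, $needle) >= 0) {
                    print $line;
                    last;    # one hit per line is enough here
                }
            }
        }
        close $fh;
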
Re: Optimizing string searches
by moritz (Cardinal) on Sep 05, 2008 at 16:24 UTC
    If the strings you are looking for are constants and start similarly, try perl 5.10.0; it optimizes the heck out of constant alternations.

    But one of the Unix (or GNU) grep tools is probably faster.

Re: Optimizing string searches
by Illuminatus (Curate) on Sep 05, 2008 at 17:53 UTC
    I concur on grep. Most versions have an option to accept Perl REs, so you wouldn't even have to do any mods. However, your vague description seems to indicate GBs' worth of text to search, and you don't mention how often it has to run. If grep, for some reason, is not an answer, I would first baseline 'slow': write a small program that just reads in all the lines of all the files you need to process. You obviously aren't going to get any faster than that using Perl.
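
    A minimal baseline sketch along those lines, assuming the log files are passed on the command line:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Baseline: just read every line of every file and count them.
        # No real search can beat this, so it sets the floor for 'slow'.
        my $lines = 0;
        for my $file (@ARGV) {
            open my $fh, '<', $file or die "Cannot open $file: $!";
            $lines++ while <$fh>;
            close $fh;
        }
        print "$lines lines read\n";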

    I assume these files are actually being created on many different machines. Can you add a small program to each that processes each log file as it is created (i.e. spread the pain)? On Linux, just 'tail -f' the log file and pipe it to your parser.
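
    The receiving end of that pipe could be as small as this sketch (the file name and strings are hypothetical):

        #!/usr/bin/perl
        # Invoked as:  tail -f /var/log/app.log | perl watch.pl
        use strict;
        use warnings;

        my $pattern = qr/ERROR|TIMEOUT|FATAL|OOM/;    # hypothetical strings
        while (my $line = <STDIN>) {
            print $line if $line =~ $pattern;
        }
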

Re: Optimizing string searches
by Anonymous Monk on Sep 05, 2008 at 16:06 UTC
    "grep -l"?
Re: Optimizing string searches
by johndageek (Hermit) on Sep 05, 2008 at 18:50 UTC
    Rather vague description, but Perl may be better than the OS's grep.

    If the files reside on multiple machines, run the search on the separate machines if possible.

    If the files reside on a single machine, process them locally.

    Do not open files across the network.

    Suggested outline (a minimal Perl sketch of it follows below):
    create a list of log files
    loop: open the next log file
        loop: read the next record from the current file
            regex string1 (if match, write output)
            regex string2 (if match ...)
            regex string3 (if match ...)
            regex string4 (if match ...)
        next record
    next log file

    Assumes: you will only parse the log files for these 4 strings, and there will be no reason to search the same log files again for other strings.
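
    A minimal Perl sketch of that outline, with a hypothetical glob for the log list and placeholder strings:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical inputs -- replace with your real log list and search strings.
        my @logfiles = glob('/var/log/myapp/*.log');
        my @strings  = ('string1', 'string2', 'string3', 'string4');

        open my $out, '>', 'report.txt' or die "Cannot open report.txt: $!";

        for my $log (@logfiles) {                 # loop: open log files
            open my $fh, '<', $log or die "Cannot open $log: $!";
            while (my $line = <$fh>) {            # loop: read current file
                for my $s (@strings) {
                    if ($line =~ /\Q$s\E/) {      # \Q...\E treats the string literally
                        print {$out} "$log: $line";
                        last;                     # move on to the next record
                    }
                }
            }
            close $fh;
        }
        close $out;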

    Enjoy!
    Dageek

      regex string1 (if match, write output) regex string2 (if match ...) regex string3 (if match ...) regex string4 (if match ...)

      It's usually faster to build one regex with four alternations and match that instead of matching four single regexes against a string.
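
      For example, a sketch with four hypothetical strings combined into one compiled regex:

          use strict;
          use warnings;

          # quotemeta protects any regex metacharacters in the strings.
          my @strings  = ('string1', 'string2', 'string3', 'string4');
          my $combined = join '|', map { quotemeta } @strings;
          my $re       = qr/$combined/;    # one compiled regex, one match per line

          while (my $line = <>) {
              print $line if $line =~ $re;
          }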

        Thanks Moritz!

        Enjoy!
        Dageek

      Thanks for all the suggestions.
      Indeed, my question was a bit vague. Since the search strings may change at any given time, hardcoding the regex was not an option.
      Instead, I generate a Perl file at runtime that contains a regex built on the fly from the search strings.
      This boosted the performance, and I'm fairly happy with the results.
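
      For comparison, a sketch of building such a regex in-process at run time rather than writing a generated file; the strings file name here is hypothetical:

          use strict;
          use warnings;

          # Read whatever search strings are currently in effect.
          open my $fh, '<', 'search_strings.txt' or die "Cannot open search_strings.txt: $!";
          chomp(my @strings = <$fh>);
          close $fh;

          my $alt = join '|', map { quotemeta } @strings;
          my $re  = qr/$alt/;    # built at run time, compiled once

          while (my $line = <>) {
              print $line if $line =~ $re;
          }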

      Thanks again :)
Re: Optimizing string searches
by holli (Abbot) on Sep 05, 2008 at 21:24 UTC
    Go and buy a faster hard disk.


    holli, /regexed monk/