plexy has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys

We have a service which we provide to customers where I work, and this service sends an SMS directly to customers when "something bad happens". Every time an SMS is sent, it gets logged in a log file (which contains a bunch of other things as well). What I'm trying to achieve is to extract and count the messages sent within a certain period of time from the log file.

* Logs are stored in files pr hour, and the filename has the format <filename>.log.<yyyy>-<mm>-<dd>-<hh>, for example <filename>.log.2011-11-08-09
. * Every line in the log starts with a timestamp, for example "2011-11-09 09:00:00,000"
* Each lines start with a timestamp, for example "2011-11-08 09:00:00,000"
* Log entries which contain sent messages have the "key"(...) "sendSMS", and there are 3 different contents in the messages which I wish to count separately:
1) 2011-11-08 09:00:03,473 INFO <<SMAPISender>> sendSMS: sender = <name>, recipient[0] = <number>, message = SMS for Sub-service 1
2) 2011-11-08 09:03:11,681 INFO <<SMAPISender>> sendSMS: sender = <name>, recipient[0] = <number>, message = SMS for sub-service 2
3) 2011-11-08 09:18:55,193 INFO <<SMAPISender>> Error sending SMS

Now, the script is run from cron every five minutes. The script starts with quite a lot of time-checking to determine which period to look at. Once the start and end points are defined, it's supposed to find the SMS messages between these values. Example: the script runs at 09:05:00. $start_point is set to "2011-11-08 09:00:00", $end_point is set to "2011-11-08 09:04:59,999"
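
For illustration, the time-checking step could be reduced to a few lines. This is only a sketch, not the original script: window_for is a hypothetical helper, and it assumes a half-open window, so the end point is "09:05:00" (exclusive) instead of "09:04:59,999". It also assumes the local UTC offset is a multiple of five minutes.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: given the epoch time of the cron run, return the
# start and end of the previous five-minute window, plus the suffix of
# the hourly log file that the window starts in.
sub window_for {
    my ($now) = @_;
    my $end   = $now - ($now % 300);    # round down to a 5-minute boundary
    my $start = $end - 300;             # the window is [start, end)
    my $fmt   = '%Y-%m-%d %H:%M:%S';
    return (
        strftime($fmt, localtime $start),
        strftime($fmt, localtime $end),
        strftime('%Y-%m-%d-%H', localtime $start),    # e.g. 2011-11-08-09
    );
}

my ($start_point, $end_point, $suffix) = window_for(time);
print "$start_point .. $end_point (file suffix $suffix)\n";
```

Taking the file suffix from the start of the window matters for the run at the top of the hour: at 10:00:00 the window is 09:55 to 10:00, which lives in the 09 file.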

Here's where my problem begins (and hopefully ends). I have absolutely no idea how to make Perl extract the lines between $start_point and $end_point. Here's the code for two approaches I've tried (based on Google results, but without understanding the approach itself):

Approach #1:
    open(FH, $filehandle);
    while (<FH>) {
        if (/$start_point/../$end_point/) {    # the line in question
            $line = $_;
            chomp $line;
            if ($line =~ m/SMS for Sub-service 1/) { $httpSMS++; }
            if ($line =~ m/SMS for sub-service 2/) { $sipSMS++; }
            if ($line =~ m/Error sending SMS/)     { $errorSMS++; }
        }
    }
    close(FH);

Approach #2:
    open(FH, $filehandle);
    while (<FH>) {
        $line = $_;
        chomp $line;
        if ($line =~ m/$start_point(.*)$end_point$/s) {    # the line in question
            if ($line =~ m/SMS for Sub-service 1/) { $httpSMS++; }
            if ($line =~ m/SMS for Sub-service 2/) { $sipSMS++; }
            if ($line =~ m/Error sending SMS/)     { $errorSMS++; }
        }
    }
    close(FH);

Any ideas how to solve this?

Best regards,
Andre Solheim

Replies are listed 'Best First'.
Re: Extract lines between two values from file
by choroba (Cardinal) on Nov 08, 2011 at 13:14 UTC
    Both of your approaches require that the start point and the end point actually appear in the data. Is this the case? If not, you have to do a bit more work: compare the times in the log with the start/end points.
    Update: Approach #2 seems odd. The regular expression would only make sense if you had read all the lines into one string.
Re: Extract lines between two values from file
by ww (Archbishop) on Nov 08, 2011 at 13:43 UTC
    Andre:

    Did you mean to write
        (my comments/questions/interpolations in italic) ...

    • Logs are stored in files per (s/pr/per/ ?) hour, and the filename has the format <filename>.log.<yyyy>-<mm>-<dd>-<hh>, for example <filename>.log.2011-11-08-09. (trailing dot from your next line?)
    • Every line in the log starts with a timestamp, for example "2011-11-09 09:00:00,000"
    • (close duplicate omitted?)
    • Log entries which contain sent messages have the "key"(...) "sendSMS", and there are 3 different contents in the messages which I wish to count separately:
    1. 2011-11-08 09:00:03,473 INFO <<SMAPISender>> sendSMS: sender = <name>, recipient[0] = <number>, message = SMS for Sub-service 1
    2. 2011-11-08 09:03:11,681 INFO <<SMAPISender>> sendSMS: sender = <name>, recipient[0] = <number>, message = SMS for sub-service 2
    3. 2011-11-08 09:18:55,193 INFO <<SMAPISender>> Error sending SMS

    And:

    1. are there just two "Sub-services"?
          or
    2. are 1 and 2 the only ones you care about but others exist?
          or
    3. some other meaning?

    If you hope to have the Monks earn your salary, it's helpful to state your question clearly (and to use the formatting tags <c>...</c> around data as well as around code).

    If I've come close to your meanings, you should read perlrequick and perlretut with special attention to "Lookahead" -- (?=...) and, negated, (?!...). 'Lookbehind' could also be useful, since it appears that your application would fall within its 'fixed length' limitation.

    Updated several times to fix markup, clarify.

      Hi ww,

      Thanks for your response as well.

    • Yes, I meant "per".
    • Duplicate, indeed.
    • Multiple "sub-services" exist, but the two I defined are the only ones I'm interested in, in addition to messages that fail to be sent

      For the record, I personally thought I did a fair job of explaining what I wanted to achieve, as well as describing the "source" :-) But of course that's quite a subjective opinion, as I know what I'm looking at/for and what I want to achieve.

      Like I mentioned in my reply to choroba, the (silly) problem was straightened out. But I'll look into perlrequick and perlretut for future reference.

      By the way, are you saying that you and the rest of the Monks will earn my salary for me from now on if I only learn to format my posts properly? :-) You're a true angel!

        "...are you saying that you and the rest of the Monks will earn my salary for my from now on if ...."

        ...errr; uh... (gulp) Not exactly. :-()

        But, as a continued association with us will demonstrate, there are those who mistake the Monastery for a free, code-writing machine. In this case, however, the remark was intended as a humorous intro to a suggestion that avoiding ambiguity sometimes requires more verbosity than is needed to adequately frame a question for resolution between one's own ears.

        Welcome to PM!

Re: Extract lines between two values from file
by plexy (Initiate) on Nov 08, 2011 at 13:34 UTC

    Hi choroba,

    Thanks for your reply.

    You're absolutely right, the value of $end_point does not exist in the file. I read up on the flip-flop operator (which I of course should've done before posting), and one person describes it like this: "The first operand (the left-hand expression) is evaluated to see if it is true or false. If it is false then the operator returns false and nothing happens. If it is true, however, the operator returns true and continues to return true on subsequent calls until the second operand (the right-hand expression) returns true."

    So, in other words, I can change $end_point to "09:05:00". Then I'll still get everything up to the latest entry which contains "09:04:59", all the way up to "...,999" if that exists. And as the right-hand expression becomes true when a line contains "09:05:00,xxx", the range ends there and I won't go on collecting values from that minute (which is what I want).

    It was a simple and easy solution, but I don't think I could've gotten out of the deep and frustrating hole I was in if it wasn't for your "thank you captain Obvious"-response, so thanks a lot. :-)

      This will still present a problem if there don't happen to be any events between 09:05:00 and 09:05:01, though. As choroba said, $start_point and $end_point have to appear literally at least once each in the file for pattern matching to work in this way. If $end_point is "09:05:00", it's not going to know to stop between lines having times of "09:04:59" and "09:05:01".
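
A small demonstration of that failure mode, as a sketch only. The four log lines are made up for the example (they are not from the real logs), and there is deliberately no entry during the second 09:05:00:

```perl
use strict;
use warnings;

# Made-up sample data: note the gap between 09:04:59 and 09:05:01.
my @log = (
    '2011-11-08 09:00:00,120 INFO a',
    '2011-11-08 09:04:59,640 INFO b',
    '2011-11-08 09:05:01,002 INFO c',
    '2011-11-08 09:05:02,310 INFO d',
);

my ($start_point, $end_point) = ('2011-11-08 09:00:00', '2011-11-08 09:05:00');

my @picked;
for (@log) {
    # The flip-flop turns on at the first line matching $start_point and
    # only turns off after a line matching $end_point, which never comes.
    push @picked, $_ if /^\Q$start_point\E/ .. /^\Q$end_point\E/;
}

print scalar(@picked), " of ", scalar(@log), " lines selected\n";
# prints "4 of 4 lines selected"
```

Two related details: if no line matches $start_point, nothing is selected at all, and when a line does match $end_point, that line itself is still part of the range.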

      To make that work, you'll need to loop through the file, extracting the date/time from each line and comparing it to $start_point until a line's date reaches that value, then start processing lines and doing the same with $end_point until you hit a line with a date greater than $end_point, and then stop.
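
The loop described above might look roughly like this. It is a sketch under a few assumptions: lines_in_window is a hypothetical name, the end point is treated as exclusive, and timestamps are compared as plain strings, which works here because "yyyy-mm-dd hh:mm:ss" sorts the same way alphabetically as chronologically.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines whose timestamp falls in
# [$start, $end), assuming the log is in chronological order.
sub lines_in_window {
    my ($fh, $start, $end) = @_;
    my @picked;
    while (my $line = <$fh>) {
        # Grab the leading "yyyy-mm-dd hh:mm:ss"; skip lines without one.
        my ($ts) = $line =~ /^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)/ or next;
        next if $ts lt $start;    # before the window: keep reading
        last if $ts ge $end;      # past the window: stop reading
        push @picked, $line;
    }
    return @picked;
}

# Usage with an in-memory "file" of made-up lines.
my $data = join '',
    "2011-11-08 08:59:58,100 INFO before\n",
    "2011-11-08 09:00:03,473 INFO inside\n",
    "2011-11-08 09:04:59,999 INFO inside\n",
    "2011-11-08 09:05:01,002 INFO after\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print lines_in_window($fh, '2011-11-08 09:00:00', '2011-11-08 09:05:00');
close $fh;
```

Neither end point has to appear literally in the file, because every line is compared against the bounds instead of being pattern-matched.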

      That also assumes that your log entries are always in date-order. That's usually true, but it's at least theoretically possible that different processes could write to the same log file slightly out of order, in which case you'd want to just check all times for being greater than $start_point and less than $end_point, and not worry about starting or stopping at certain lines.
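
If ordering cannot be relied on, every line can simply be tested against both bounds. The sketch below also does the counting from the original question; count_sms and the hash keys are hypothetical names, the end point is again treated as exclusive, and the /i flag is an assumption made because the sample lines spell "Sub-service" with inconsistent capitalisation.

```perl
use strict;
use warnings;

# Hypothetical helper: count the three kinds of message whose timestamp
# falls in [$start, $end), without assuming the lines are in order.
sub count_sms {
    my ($fh, $start, $end) = @_;
    my %count = (sub1 => 0, sub2 => 0, error => 0);
    while (my $line = <$fh>) {
        my ($ts) = $line =~ /^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)/ or next;
        next unless $ts ge $start and $ts lt $end;    # outside the window
        if    ($line =~ /SMS for Sub-service 1/i) { $count{sub1}++ }
        elsif ($line =~ /SMS for Sub-service 2/i) { $count{sub2}++ }
        elsif ($line =~ /Error sending SMS/)      { $count{error}++ }
    }
    return %count;
}

# Usage with made-up lines, deliberately out of order.
my $data = <<'LOG';
2011-11-08 09:00:03,473 INFO <<SMAPISender>> sendSMS: sender = <name>, recipient[0] = <number>, message = SMS for Sub-service 1
2011-11-08 09:05:00,004 INFO <<SMAPISender>> Error sending SMS
2011-11-08 09:03:11,681 INFO <<SMAPISender>> sendSMS: sender = <name>, recipient[0] = <number>, message = SMS for sub-service 2
2011-11-08 09:04:55,193 INFO <<SMAPISender>> Error sending SMS
LOG
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my %count = count_sms($fh, '2011-11-08 09:00:00', '2011-11-08 09:05:00');
close $fh;
printf "sub1=%d sub2=%d error=%d\n", @count{qw(sub1 sub2 error)};
# prints "sub1=1 sub2=1 error=1"
```

The price of dropping the ordering assumption is that the whole file is always read, since there is no line at which it is safe to stop early.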

        Hi aaron_baugher,

        Yes, after reading up on flip-flop I became aware of the fact that both values must exist. :-) It's 99.9999% guaranteed that these logs have something written in them each and every second.

        A good enough "solution" to make it stop (at least approximately around) the time it should, if the $end_point second is non-existent, might be to just set $end_point to "09:05". Worst case scenario, the same SMS-message(s) might be counted within two different time periods, but like I said it's almost sure that something is written in the logs every second.

        As you probably understood, I'm not a Perl programmer, and I can barely call myself a Google-Perl'er. Therefore I'm not sure I want to try other methods when this is (probably/hopefully) going to work most of the time.

        Of course, like you say, it's possible that the timestamps are out of order, and that the script should be more robust. I guess I'll have to google a bit after all... :-)

        Thanks a lot for your input!