Ben328 has asked for the wisdom of the Perl Monks concerning the following question:

Here's a log from an application. I want to parse everything after "INFO -", up to the date on the next line, as one variable. You can see each line is similar up to "INFO -", but the last portion is all over the place.

[2012-09-14 16:55:22,498] [ACTIVE] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)' (com.this.perl.seems.kinda.Cool:di sconnectCORBA:154) INFO - Well this is just one line text

[2012-09-14 16:55:22,498] [ACTIVE] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)' (com.this.perl.seems.kinda.Cool:di sconnectCORBA:154) INFO - Well this is just multiple line text

With formats Like this

***** Some other text **** then some more text on another line

[2012-09-14 16:55:22,498] [ACTIVE] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)' (com.this.perl.seems.kinda.Cool:di sconnectCORBA:154) INFO - Well once again this is part starts with bracket and blah blah blah

I tried to write this program and was able to match exactly up to "INFO -", but the last variable matches a single line and not multiple lines. I tried adding \ms but that didn't help. Can we not say "match everything from here on out until you match my first variable, the date, on the next line"? Basically, I want the last variable to match everything until it sees the next date. Any help in the right direction would be much appreciated.

while ( $line = <> ) {
    if ($line =~ m/^(\[\d\d\d\d[-]\d\d[-]\d\d \d\d[:]\d\d[:]\d\d[,]\d\d\d\]) (\[\S+\]) (\S+) (\'\d\d?\d?\') (\S+ \S+) (\'.*\') (\(.*\)) (\S+) ([ -]) (.*)/) {
        print $1;
        print $2;
        print $3;
        print $4;
        print $5;
        print $6;
        print $7;
        print "$8 ";
        print "$9 ";
        print "$10 \n";
    }
}

Replies are listed 'Best First'.
Re: Can't seem to match from one varibale to next one.
by tobyink (Canon) on Sep 16, 2012 at 18:45 UTC

    This bit:

    while ( $line = <> )

    ... is only reading a single line at a time. Adjusting your regular expression won't make a difference because the variable that you're matching it against is only ever one line!

    You need to loop through lines, not doing anything except accumulating them into a variable; and only when you hit the start of a new record, processing that accumulated variable (then resetting it).

    Here's a somewhat simplified example:

    use Data::Dumper;

    sub process_record {
        my $record = join q[], @{+shift};
        warn "Malformed record: $record"
            unless $record =~ /^ \[ (.+?) \] \s+ \[ (.+?) \] \s+ (.+) $/xs;
        local $Data::Dumper::Terse = 1;
        print "Got record ", Dumper +{
            datetime => $1,
            status   => $2,
            info     => $3,
        };
    }

    my $current_record;
    while (<DATA>) {
        # we have hit a new record
        if (/^ \[ \d{4}-\d{2}-\d{2} /x) {
            process_record($current_record) if $current_record;
            $current_record = []; # start a new record
        }
        push @$current_record, $_;
    }

    # don't forget to process the final record
    process_record($current_record) if $current_record;

    __DATA__
    [2012-09-14 16:55:22,498] [ACTIVE] INFO - this is a single line
    [2012-09-14 16:55:22,498] [ACTIVE] INFO - this is a multi
    line record
    [2012-09-14 16:55:22,500] [ACTIVE] INFO - this is another single line
    [2012-09-14 16:55:22,500] [ACTIVE] INFO - this is yet single line
    [2012-09-14 16:55:22,500] [ACTIVE] INFO - this one is on
    two lines
    [2012-09-14 16:55:22,500] [ACTIVE] INFO - one last record
    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
      Thank you for your answer. Since I am a newbie to Perl, I am also trying to understand what you are actually doing in the script. I looked online, but the descriptions of the following were pretty vague. Here are my questions:

      What is "join q[], @{+shift}" doing? Normally join is like join("expression", "list")

      And what is this line doing? local $Data::Dumper::Terse = 1;

      Hope you can help me understand this script better. Thanks, Ben

        G'day Ben328,

        Welcome to Perl and the monastery.

        I don't know where you're looking online for Perl documentation. Except in very rare cases, the following two sites have provided all the Perl documentation I've needed for many years:

        • perldoc.perl.org - Perl Programming Documentation. You'll find Perl syntax, functions, built-in modules, etc. here.
        • search.cpan.org - CPAN (Comprehensive Perl Archive Network). You'll find user contributed modules here.

        All documentation links I provide below are to pages on the first of those sites.

        What is "join q[], @{+shift}" doing? Normally join is like join("expression", "list")

        If online documentation has said that join("expression", "list") is normal syntax, it is wrong and I wouldn't use that site again. Did it say that or have you paraphrased what it said or, perhaps, taken it out of context? That piece of code evaluates to just "list" making join("expression", and the closing parenthesis completely pointless:

        $ perl -Mstrict -Mwarnings -E '
            my $x = join("expression", "list");
            say $x;
            my $y = "list";
            say $y;
        '
        list
        list

        Take a look at join which shows the syntax as: join EXPR, LIST. (It only has one example which shows join(EXPR, LIST) - read on for a further explanation).

        When you write a subroutine, e.g. sub some_function { ... }, you'd call it like this: some_function(arg1, ..., argN). Perl's built-in functions can, but don't need to, use the parentheses. It's normally perfectly fine to omit the parentheses; here's an example where you might include them:

        print 'Tabbed items: ', join("\t", @items), "\n";

        [Advanced exception: There is a way to make your functions act like Perl functions. It's generally a bad idea to do this. I strongly recommend that you do not do this - certainly not until you are way past the "newbie" stage. As you may see it in others' code, here's what I'm recommending you don't use: perlsub - Prototypes.]

        It is all too easy to read '' as ". Can you see the difference in your browser? Perl has a number of Quote-Like Operators which you can use to avoid this potential confusion. That documentation shows q/.../, q!...! and q(...); tobyink has used q[...]; my personal preference is for q{...}; you can pick some other delimiter if you want. q[] is a zero-length string: it is unambiguous and doesn't require you to decide if you're looking at one double-quote or two single-quotes.
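As a small illustration of the point about quote-like operators, the sketch below joins the same list with four spellings of the empty string. The variable names and the sample list are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical sample list, used only for this illustration.
my @parts = ('Perl', 'Monks');

# All four spellings below denote the same zero-length string.
my $with_single_quotes = join '',  @parts;
my $with_double_quotes = join "",  @parts;
my $with_q_brackets    = join q[], @parts;
my $with_q_braces      = join q{}, @parts;

print "$with_q_brackets\n";    # prints "PerlMonks"
print length(q[]), "\n";       # prints 0
```

Which delimiter to use after q is purely a readability choice; the resulting string is the same.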

        The start of the process_record function could have been written more verbosely as:

        sub process_record {
            my $array_ref = shift;
            my $record = join q[], @{$array_ref};
            ...

        See shift if you're not sure how that works. The code @{shift} is ambiguous and could be interpreted as @shift or @{shift(@_)}. Adding the + tells Perl you mean the second interpretation. See perlop - Symbolic Unary Operators for a discussion of this.
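To see that the unary plus only affects parsing, here is a minimal sketch with three hypothetical helper subs (the names are invented for this example, they are not from tobyink's script). Each takes a reference to an array of strings and should return the same joined string.

```perl
use strict;
use warnings;

# Unary plus makes Perl parse "shift" as a function call, not a variable name.
sub join_with_plus { return join q[], @{+shift} }

# Explicit parentheses after shift have the same effect.
sub join_with_parens { return join q[], @{ shift() } }

# The long-hand version with a named variable.
sub join_verbose {
    my $array_ref = shift;
    return join q[], @{$array_ref};
}

# Hypothetical sample input: a reference to an array of lines.
my @lines = ("first line\n", "second line\n");

print join_with_plus(\@lines);
print join_with_parens(\@lines);
print join_verbose(\@lines);
```

All three calls print the same two lines, so the choice between them is a matter of taste and brevity.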

        And what is this line doing? local $Data::Dumper::Terse = 1;

        I don't know what part of that you're having trouble with. Have a read of local and Data::Dumper ($Data::Dumper::Terse is mentioned in a few places). If you're still in the dark with one or more parts of that line of code, please specify where you're having problems.
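In case a concrete run helps, the following sketch shows both halves of that line: what $Data::Dumper::Terse changes in the output, and how local limits the change to the enclosing block. The hash contents and variable names are invented for this example.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical record, used only for this illustration.
my $record = { status => 'ACTIVE' };

# Default behaviour: Dumper wraps the structure in an assignment to $VAR1.
my $default_dump = Dumper($record);

my $terse_dump;
{
    # local gives the package variable a temporary value; the previous
    # value is restored automatically when this block is left.
    local $Data::Dumper::Terse = 1;
    $terse_dump = Dumper($record);
}

# Outside the block the setting is back to what it was before.
my $after_block = Dumper($record);

print "default:\n$default_dump";
print "terse:\n$terse_dump";
```

In the original script the local sits inside process_record, so the terse setting lasts only for the duration of each call.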

        -- Ken

        my $record = join q[], @{+shift}; does a whole lot of stuff in one statement, but basically it takes a reference to an array of strings and joins them all into a single string.

        It could alternatively be written as:

        my $ref_to_list_of_strings = shift @_;
        my @list_of_strings = @{ $ref_to_list_of_strings };
        my $empty_string = q[];
        my $record = join($empty_string, @list_of_strings);

        ... but that sort of coding has been known to cause repetitive strain injuries. What's more, the short way also avoids creating a bunch of temporary variables, so probably runs (ever so slightly) faster.

        perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re: Can't seem to match from one varibale to next one.
by Kenosis (Priest) on Sep 16, 2012 at 21:54 UTC

    Can we not say match everything from here on out till you match my first varible of date in next line ?

    Yes, but only if all of your data is in a single string. As tobyink pointed out, if you're reading your log file line by line, e.g.:

    ... while ( $line = <> ) ...

    and you've just read the first line of a multiline record, you can't match against the next record's date.

    Although tobyink provided an excellent solution, if you don't have an enormous log file, the following option may also help with your situation, as it reads the entire log file into a variable for processing:

    logFile.txt:

    [2012-09-14 16:55:22,498] [ACTIVE] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)' (com.this.perl.seems.kinda.Cool:di sconnectCORBA:154) INFO - Well this is just one line text
    [2012-09-14 16:55:22,498] [ACTIVE] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)' (com.this.perl.seems.kinda.Cool:di sconnectCORBA:154) INFO - Well this is just multiple line text
    With formats Like this
    ***** Some other text **** then some more text on another line
    [2012-09-14 16:55:22,498] [ACTIVE] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)' (com.this.perl.seems.kinda.Cool:di sconnectCORBA:154) INFO - Well once again this is part starts with bracket and blah blah blah

    Script:

    use strict;
    use warnings;

    my @infos;

    {
        local $/;
        open my $fh, '<', 'logFile.txt' or die $!;
        my $data = <$fh>;
        push @infos, $1 while $data =~ /INFO - (.+?)(\[\d{4}|\Z)/gs;
    }

    chomp @infos;
    print join "\n--##--\n", @infos;

    Output:

    Well this is just one line text
    --##--
    Well this is just multiple line text
    With formats Like this
    ***** Some other text **** then some more text on another line
    --##--
    Well once again this is part starts with bracket and blah blah blah

    Unless I'm mistaken, you're interested in grabbing the text after the "INFO" in each record. In the script above, @infos will contain that text, and the regex grabs it by matching as you've described, and includes matching on the last line where no date follows.
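One caveat worth sketching: the pattern above ends a record at the first "[" followed by four digits, wherever that occurs, so INFO text that happens to contain something like "[2012" would be cut short. The variant below is only a sketch under the assumption that every record starts at the beginning of a line; it uses an in-memory string with invented sample data instead of logFile.txt.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the slurped contents of a log file.
my $data = <<'END_LOG';
[2012-09-14 16:55:22,498] [ACTIVE] INFO - one line text
[2012-09-14 16:55:22,498] [ACTIVE] INFO - multiple line text
with a [2012 look-alike] in the middle
[2012-09-14 16:55:22,498] [ACTIVE] INFO - last record
END_LOG

my @infos;

# /m lets ^ match at the start of every line, so a record only ends where a
# line begins with "[" and four digits, or at the end of the string (\Z).
# The lookahead (?= ... ) tests for that boundary without consuming it.
push @infos, $1 while $data =~ /INFO - (.+?)(?=^\[\d{4}|\Z)/gms;

chomp @infos;
print join("\n--##--\n", @infos), "\n";
```

With this input the second element of @infos keeps both of its lines, including the bracketed look-alike, because that bracket does not sit at the start of a line.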

    Hope this helps!

      Thanks Kenosis. This is great. With your solution, how do I define other variables within the line, though? Basically, I also want to be able to pull any portion of the line into a variable so that I can compare each variable separately later on.

      For example, I want to define variables for ACTIVE, ExecuteThread: and so forth as well. Do I do that as follows:

      push @infos, $1, $2, $3, while $data =~ / \[ (.+?) \] \s+ \[ (.+?) \] ... and so forth

      Does this make sense? Hopefully, I am asking the right question. In any case, I have been helped tremendously already. Thanks,

        You're most welcome, Ben328! Am glad it worked for you...

        The following will capture all four fields from each record of your data set, so you can work with them as needed:

        use strict;
        use warnings;

        {
            local $/;
            open my $fh, '<', 'logFile.txt' or die $!;
            my $data = <$fh>;

            while (
                $data =~ /
                    (?=\[\d{4})                 # Start record
                    \[(?<date>.+?)\]\s*
                    \[(?<active>.+?)\]\s*
                    ExecuteThread:\s*(?<executeThread>.+?)
                    INFO\s*-\s*(?<info>.+?)
                    (?=\[\d{4}|\Z)              # End record
                /gsx
              )
            {
                print 'date: ',          "$+{date}\n";
                print 'active: ',        "$+{active}\n";
                print 'executeThread: ', "$+{executeThread}\n";
                print 'info: ',          "$+{info}\n";
            }
        }

        Output:

        date: 2012-09-14 16:55:22,497
        active: ACTIVE
        executeThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)' (com.this.perl.seems.kinda.Cool:di sconnectCORBA:154)
        info: Well this is just one line text

        date: 2012-09-14 16:55:22,498
        active: ACTIVE
        executeThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)' (com.this.perl.seems.kinda.Cool:di sconnectCORBA:154)
        info: Well this is just multiple line text
        With formats Like this
        ***** Some other text **** then some more text on another line

        date: 2012-09-14 16:55:22,499
        active: ACTIVE
        executeThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)' (com.this.perl.seems.kinda.Cool:di sconnectCORBA:154)
        info: Well once again this is part starts with bracket and blah blah blah

        The regex gets a bit 'ugly,' but manageable. Named captures were used, so $+{info} contains the single or multiline INFO text.
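If the goal is to compare fields later rather than print them straight away, one possible approach is to copy the named captures into a hash per record. The sketch below is an adaptation, not Kenosis's script: the sample data is invented and shortened, the data comes from an in-memory string, and the pattern adds \s* trimming and a line-start anchor that the original does not have.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample data standing in for the log file contents.
my $data = <<'END_LOG';
[2012-09-14 16:55:22,498] [ACTIVE] ExecuteThread: '8' for queue: 'q' (a.b:1) INFO - one line text
[2012-09-14 16:55:22,499] [ACTIVE] ExecuteThread: '9' for queue: 'q' (a.b:2) INFO - multiple line text
second line of text
END_LOG

my @records;
while (
    $data =~ /
        \[(?<date>.+?)\]\s*
        \[(?<active>.+?)\]\s*
        ExecuteThread:\s*(?<executeThread>.+?)\s*
        INFO\s*-\s*(?<info>.+?)\s*
        (?=^\[\d{4}|\z)     # next record at start of a line, or end of data
    /gmsx
  )
{
    # %+ is overwritten by the next successful match, so take a copy.
    my %fields = %+;
    push @records, \%fields;
}

# Each record is now a hash reference whose fields can be compared later.
for my $record (@records) {
    printf "%s | %s | %s\n",
        $record->{date}, $record->{active}, $record->{executeThread};
}
```

After the loop, something like $records[1]{info} holds the complete multiline text of the second record, and any two records can be compared field by field.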