hunagyp has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I've the following code snippet:
sub splitLine {
    my $line = shift;
    my $pattern = shift; # by default, it is "ERROR"
    my %header;
    # DD.MM.YYYY HH:MM:SEC USEC + ERROR/WARN [pool-def] class-name msg
    my $sPattern = '(\d{2}.\d{2}.\d{4}).*?(\d{2}:\d{2}:\d{2}).\d{0,3}.*?\*(' . $pattern . ')\*.*?(\[.*?\]).(.*?\..*?\s+?)(.*)';
    if ($line =~ /$sPattern/s) {
        my $ts = parseLogEntryTimeStamp($1, $2);
        %header = (
            'timestamp' => $ts,
            'date'      => $1,
            'time'      => $2,
            'severity'  => $3,
            'thread'    => $4,
            'class'     => $5,
            'msg'       => $6);
        print "$7 \n";
        #doTrace %header;
    }
    return %header;
}
The example input PARAM ($line) for this is:
30.08.2016 08:00:00.004 *ERROR* [pool-7-thread-5] com.day.cq.reporting.impl.snapshots.SnapshotServiceImpl Error accessing repository during creation of report snapshot data
javax.jcr.LoginException: Cannot derive user name for bundle com.day.cq.cq-reporting [313] and sub service null
My goal is to cut this example text into meaningful pieces. My regex above works almost fine, except for the last capture group. In Perl, the last capture group only gives back this: 'Error accessing repository during creation of report snapshot data'

In an online tester (https://regex101.com/r/eB7cR3/1) with the /s modifier, the last capture group gives back everything up to the last character. Does anyone have any idea why Perl does not do the same? (Or can you suggest another approach to this regex? It might be quite "messy" :D)

Thanks a lot in advance for any advice!

Replies are listed 'Best First'.
Re: regex with /s modifier does not provide the expected result
by choroba (Cardinal) on Aug 31, 2016 at 14:10 UTC
    It works for me. Are you sure you are sending both lines to the subroutine? If so, consider renaming the $line variable, as it can contain multiple lines.

    Also, are you sure the class name should include the whitespace following it? I'd move the \s+? outside the capture group.

      Thank you, you are totally right. I'm embarrassed that I overlooked this: the $line variable did not contain the rest of the text. As for your second point: yes, I noticed it, but I wanted to take care of it later. Anyway, thanks for the heads up! :)
Re: regex with /s modifier does not provide the expected result
by coicles (Sexton) on Sep 01, 2016 at 02:27 UTC

    I think choroba is correct, that the real problem is the $line variable passed to the sub has already been truncated at the newline character, and there is nothing wrong with your regex.
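
    To illustrate the truncation point: the /s modifier only lets "." match a newline that is actually in the string, so a sub that receives one physical line at a time never sees the continuation. The following is a minimal sketch, not the original poster's reading code (which isn't shown). It assumes the log is read line by line and that every new entry starts with a DD.MM.YYYY date; the sample lines and the names @raw_lines, $entry_start and @entries are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: two log entries, the first with a continuation line.
my @raw_lines = (
    "30.08.2016 08:00:00.004 *ERROR* [pool-7-thread-5] com.example.Foo First message\n",
    "javax.jcr.LoginException: continuation of the first entry\n",
    "30.08.2016 08:00:01.000 *ERROR* [pool-7-thread-6] com.example.Bar Second message\n",
);

# Assumption: a new entry always begins with a DD.MM.YYYY date.
my $entry_start = qr/^\d{2}\.\d{2}\.\d{4}\s/;

my @entries;
for my $raw (@raw_lines) {
    if (!@entries || $raw =~ $entry_start) {
        push @entries, $raw;      # start a new entry
    }
    else {
        $entries[-1] .= $raw;     # glue the continuation onto the current entry
    }
}
chomp @entries;                   # drop only the trailing newline of each entry

printf "%d entries\n", scalar @entries;
```

    Each element of @entries can then be handed to the parsing sub as one multi-line string, which is where /s starts to matter.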

    Anyway, here are some things I noticed:

    It looks like you forgot to escape the dot characters in your regex's date capture group, to match dots rather than any character.

    Same with the dot before the microseconds field.

    No space characters are required between the digits at the end of the date field and the beginning of the time field, which leads to the need to over-specify the time and date fields in order to avoid ambiguity and false matches.

    The single character separating the "pool-def" and "class-name" fields should probably be an explicit space, rather than any character.

    The space which terminates "class-name" is included in the capture, plus specifying non-greedy repetition doesn't make sense to me here.
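
    The point about unescaped dots can be shown with a short sketch. The input string here is made up for the demonstration, and $loose / $strict are just illustrative names for the date part of the pattern with and without escaping:

```perl
use strict;
use warnings;

my $loose  = qr/^(\d{2}.\d{2}.\d{4})/;     # unescaped dots match any character
my $strict = qr/^(\d{2}\.\d{2}\.\d{4})/;   # escaped dots match only a literal "."

my $bogus = '30x08y2016 not a real date';  # hypothetical malformed input

my $loose_hit  = $bogus =~ $loose  ? 1 : 0;
my $strict_hit = $bogus =~ $strict ? 1 : 0;

print "loose: $loose_hit, strict: $strict_hit\n";
```

    The loose pattern accepts the malformed date because "x" and "y" satisfy the bare dots, while the escaped version rejects it.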

    So for what it's worth, here is how I would handle the regex:

    my $line = "30.08.2016 08:00:00.004 *ERROR* [pool-7-thread-5] com.day.cq.reporting.impl.snapshots.SnapshotServiceImpl Error accessing repository during creation of report snapshot data\njavax.jcr.LoginException: Cannot derive user name for bundle com.day.cq.cq-reporting [313] and sub service null";
    my $pattern = 'ERROR';
    my $r = qr/([\d\.]+)\s+([\d:]+)\.?\d*\s+\*(\Q$pattern\E)\*\s+(\[.*?\])\s+([\w\.]+)\s*(.*)/s;
    my @fields = $line =~ $r;
    unshift(@fields, undef); # 1-base to match regex capture indexes
    for my $i (1..$#fields) {
        print "[$i] '$fields[$i]'\n";
    }

    Output:

    [1] '30.08.2016'
    [2] '08:00:00'
    [3] 'ERROR'
    [4] '[pool-7-thread-5]'
    [5] 'com.day.cq.reporting.impl.snapshots.SnapshotServiceImpl'
    [6] 'Error accessing repository during creation of report snapshot data
    javax.jcr.LoginException: Cannot derive user name for bundle com.day.cq.cq-reporting [313] and sub service null'

    I don't like to over-specify fields when I expect the input's format to be sane (of course, I have only seen one sample of your log entries, so grain of salt...), so my regex's date field matches any blob of digits with dots, and its time field matches any blob of digits and colons. I think this helps with maintainability, as well as tolerance to small changes to the log file's format in the future.

    My regex's class field is similarly a blob of word characters and dots, but it is actually more tightly specified than yours, which amounts to: something with a dot in it, one character after "pool-def", terminated by a space.

    Also, all fields in my regex are explicitly separated by one or more whitespace characters, which helps avoid undesired matches due to ambiguity introduced by the under-specified date and time fields.
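
    For completeness, here is one way the captures could be fed back into a hash shaped like the %header from the original question. This is only a sketch: it reuses the regex and sample line from this reply, takes the key names from the question, and leaves out the 'timestamp' key because parseLogEntryTimeStamp isn't shown in the thread. The @keys and @captures names are illustrative.

```perl
use strict;
use warnings;

my $line = "30.08.2016 08:00:00.004 *ERROR* [pool-7-thread-5] "
         . "com.day.cq.reporting.impl.snapshots.SnapshotServiceImpl "
         . "Error accessing repository during creation of report snapshot data\n"
         . "javax.jcr.LoginException: Cannot derive user name for bundle "
         . "com.day.cq.cq-reporting [313] and sub service null";

my $pattern = 'ERROR';
my $r = qr/([\d\.]+)\s+([\d:]+)\.?\d*\s+\*(\Q$pattern\E)\*\s+(\[.*?\])\s+([\w\.]+)\s*(.*)/s;

# Key names follow the %header hash in the question, minus 'timestamp'.
my @keys = qw(date time severity thread class msg);

my %header;
if (my @captures = $line =~ $r) {
    @header{@keys} = @captures;   # hash slice: one key per capture group
}

print "$_ => '$header{$_}'\n" for grep { defined $header{$_} } @keys;
```

    Assigning through a hash slice keeps the key list and the capture order in one place, so adding or removing a capture group means touching only the regex and @keys.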