Re: DateTime::Format::Flexible; for Log Parse with multiple formatted lines

In your code, you slurp the entire file into an array, join all the lines using a fixed string, and then use a regex that specifically requires that fixed string as the last thing to match. I don't quite understand why you're doing it this way; I don't see the advantage over a normal while (<$filehandle>) { ... } loop. I didn't really test your code because the sample log entry you provided doesn't actually match your regex, but from what I can tell, your code will silently skip any log entries that don't match the regex, and it will always skip the last log entry.

I see a couple of other issues with your code: You don't use strict and warnings, and you don't check some opens for errors. In your regexes, you don't need to put (...) capturing groups around things you don't actually want to capture into the $1, $2, ... variables, e.g. you can say /...T.../ instead of /...(T).../. You might also want to look into the /x regex modifier (perlre) to make your regexes easier to read and follow. Also, I'd strongly recommend using an appropriate module such as Text::CSV for CSV output.
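
To illustrate the points about capture groups and /x, here is a minimal sketch using core Perl only. The sample line is taken from the data further down; the variable names ($line, $date, $time) and the exact pattern are my own assumptions for illustration, not the original poster's regex.

```perl
use warnings;
use strict;

my $line = '2017-02-20T09:30:53.177000 20[] 0000000000000000 Error Description One';

# With /x, whitespace and comments inside the pattern are ignored,
# so each piece of the regex can be documented in place.
my ($date, $time) = $line =~ m{
        ^ (\d{4}-\d{2}-\d{2})    # first capture: the date
        T                        # literal "T", matched but not captured
        (\d{2}:\d{2}:\d{2})      # second capture: the time
        (?: \. \d+ )?            # optional fraction, grouped but not captured
    }x
    or die "Line did not match: $line\n";

print "date=$date time=$time\n";
```

Only the two parenthesized pieces end up in variables; the "T" and the fractional seconds are matched without occupying a capture slot.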

I'm not sure I fully understand your questions. Instead, I can show you how I might have coded this. Personally, I like to validate the format of input files a little bit as I read them. Instead of DateTime::Format::Flexible, I'd use several DateTime::Format::Strptime parsers, and first use a heuristic to decide which format the log line has. It seems from your sample inputs that the log line formats are quite different, which is why I've duplicated the parsing and output logic in the if statements below. If your log lines are instead similar, you should of course not duplicate that code, but move the common parsing code outside of the ifs.

#!/usr/bin/env perl
use warnings;
use strict;
use 5.010; # for /p and ${^MATCH}
use DateTime;
use DateTime::Format::Strptime;
use Text::CSV;

my $strp_one = DateTime::Format::Strptime->new(on_error=>'croak',
    time_zone=>'UTC', pattern => '%Y-%m-%dT%H:%M:%S.%6N');
my $strp_two = DateTime::Format::Strptime->new(on_error=>'croak',
    time_zone=>'UTC',  pattern => '%a %b %d %H:%M:%S %Y');
my $csv = Text::CSV->new({binary=>1, always_quote=>1, blank_is_undef=>1,
    eol=>$/, auto_diag=>2});

while (<DATA>) {
    chomp;
    if (/^\d{4,}-[\d\-T\:\.]+(?=\s+)/p) {
        my ($dts,$rest) = (${^MATCH}, ${^POSTMATCH});
        my $dt = $strp_one->parse_datetime($dts);
        # parse "$rest" and break it into more fields here
        $csv->print(select, [
            $dt->strftime('%Y-%m-%d-%H-%M-%S-%6N-%Z'),
            $rest ] );
    }
    elsif (/^\w+\s+\w+\s+\d+\s+[\d\:]+\s+\d{4,}(?=\s+)/p) {
        my ($dts,$rest) = (${^MATCH}, ${^POSTMATCH});
        my $dt = $strp_two->parse_datetime($dts);
        # parse "$rest" and break it into more fields here
        $csv->print(select, [
            $dt->strftime('%Y-%m-%d-%H-%M-%S-%6N-%Z'),
            $rest ] );
    }
    else
        { warn "Skipping unknown line format: $_" }
}

__DATA__
2017-02-20T09:30:53.177000 20[] 0000000000000000 Error Description One
Mon Feb 20 09:31:25 2017 [INFO] [AGENTEXEC] Error Description Two
2017-02-20T09:30:53.177000 20[] 0000000000000000 Error Description Three
Mon Feb 20 09:31:25 2017 [INFO] [AGENTEXEC] Error Description Four

Output:

"2017-02-20-09-30-53-177000-UTC"," 20[] 0000000000000000 Error Description One"
"2017-02-20-09-31-25-000000-UTC"," [INFO] [AGENTEXEC] Error Description Two"
"2017-02-20-09-30-53-177000-UTC"," 20[] 0000000000000000 Error Description Three"
"2017-02-20-09-31-25-000000-UTC"," [INFO] [AGENTEXEC] Error Description Four"

One disadvantage of the above approach is that if you have a lot of different date/time formats in your log files, you'd have to add more and more parsers. So if that's the case, you can also try using DateTime::Format::Flexible, and the same basic idea as above (use a regex to pull the date/time string from the beginning of the line before attempting to parse it) applies.

Re^2: DateTime::Format::Flexible; for Log Parse with multiple formatted lines
by TCLion (Novice) on Mar 27, 2017 at 13:14 UTC

I now understand the DATA portion, and this looks good. My boss doesn't like it, though; he says his script was better because it was simpler. I would still like to use and modify this one (thank you), but I am having a problem breaking up $rest. I am trying to add in my first regex with the positions, but it's not working correctly. I think $dts has something left over that is being pushed out and making $rest not the same, but I probably don't have the code correct.

    if (/^\d{4,}-[\d\-T\:\.]+(?=\s+)/p) {
        my ($dts,$rest) = (${^MATCH}, ${^POSTMATCH});
        my $dt = $strp_one->parse_datetime($dts);
        # parse "$rest" and break it into more fields here
        while ($rest =~ /(\w+\s+)(\w+\s+)(.+?)/smg) {
            print "New Error Found...\n";
            print "0 $0\n";
            print "1 $1\n";
            print "2 $2\n";
            print "3 $3\n";
            print "4 $4\n";
            print "5 $5\n";
            print "6 $6\n";
        }
        $csv->print(select, [
            $dt->strftime('%Y-%m-%d,%H:%M:%S'), #'%Y-%m-%d-%H-%M-%S-%6N-%Z'
            $rest ] );
    }

I did add the full data strings:

__DATA__
2017-02-20T09:30:53.177000    20848[30892]    0000000000000000    [DM_MQ_I_DAEMON_START]info:  "Message queue daemon (tid : 27944, session 0102b20d80000456) is started sucessfully."
2017-02-20T09:30:53.193000    20848[17732]    0102b20d80000003    [DM_DOCBROKER_I_PROJECTING]info:  "Sending information to Docbroker located on host (PWDOCPRDCON32) with port (1489).  Information: (Config(server), Proximity(1), Status(Open), Dormancy Status(Active))."
2017-02-20T09:30:53.193000    20848[17732]    0102b20d80000003    [DM_DOCBROKER_I_PROJECTING]info:  "Sending information to Docbroker located on host (server) with port (1354).  Information: (Config(server), Proximity(2), Status(Open), Dormancy Status(Active))."
2017-02-20T09:30:53.193000    20848[17732]    0102b20d80000003    [DM_DOCBROKER_I_PROJECTING]info:  "Sending information to Docbroker located on host (server) with port (1354).  Information: (Config(Server), Proximity(3), Status(Open), Dormancy Status(Active))."
Mon Feb 20 09:31:25 2017 [INFORMATION] [AGENTEXEC 26816] Detected during program initialization: Version: 7.2.0160.0297  Win64
Mon Feb 20 09:31:30 2017 [INFORMATION] [AGENTEXEC 26816] Detected during program initialization: Agent Exec connected to server server:  [DM_SESSION_I_SESSION_START]info:  "Session 0102b20d80397508 started for user user."

I do need to pull out the severity word (info:), since it would say error: if there were a problem, but then add it back in for the full message, DM...info: "message..", and that's what I need.
20848[17732] and 0102b20... are not needed, nor are INFORMATION and AGENTEXEC 26816.

So what did I do wrong in the code when trying to find the positions?

Re^3: DateTime::Format::Flexible; for Log Parse with multiple formatted lines

by haukex (Archbishop) on Mar 27, 2017 at 14:54 UTC

Two problems I see with that code are: First, just like in the original code, while ($str =~ /.../g) without a \G regex anchor ($str =~ /\G.../g) will skip over stuff in $str that doesn't match the regex, possibly resulting in missed data. Second, as 1nickt already said, $0 is not a regex capture (it holds the name of the program being run; see $0 in perlvar), and the regex only has three capture groups, so $4 and above will never be populated by that regex.
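
The difference the \G anchor makes can be seen in a small sketch. The input string and the variable names here are made up for illustration; "???" stands in for a log entry the regex was not written for.

```perl
use warnings;
use strict;

my $str = "foo bar ??? baz";   # "???" stands in for an unexpected entry

# Without \G, the engine scans forward past "???" to the next match,
# so the unexpected part is silently dropped.
my @unanchored;
while ($str =~ /([a-z]+)\s*/g) {
    push @unanchored, $1;
}

# With \G, each match must begin exactly where the previous one ended.
# The /c modifier keeps pos() intact when the match finally fails.
my @anchored;
while ($str =~ /\G([a-z]+)\s*/gc) {
    push @anchored, $1;
}
my $leftover = substr $str, pos($str) // 0;

print "unanchored: @unanchored\n";
print "anchored:   @anchored\n";
print "unparsed:   '$leftover'\n" if length $leftover;
```

The anchored loop stops at the first thing it cannot parse, and the leftover text can then be reported instead of being lost.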

Based on your regex, it looks like you're trying to break up the string based on whitespace, in which case a simple my @parts = split ' ', $rest; might be easiest.

However, I see that your log entries have quoted strings, so that might not be appropriate either. Your first couple of example log entries could possibly be broken apart like this: my @parts = split /\s*[\[\]]\s*/, $rest, 5; or you'll have to write regexes that actually match the log entries, e.g. /^ \s* (\d+) \s* \[(\d+)\] \s+ (\S+) \s+ \[(.+?)\] \s* (\w+): \s* (.*?) \s* $/x.

To match quoted strings, you could use Regexp::Common::delimited or the core module Text::Balanced. Good resources on regexes in general are perlretut, perlrequick, and perlre.
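
As a rough sketch of the regex approach, the following applies the /x regex from above to one sample entry of each format, with the timestamp already removed. The second regex, the variable names, and the choice of output columns are my assumptions for illustration; adjust them to the fields you actually need.

```perl
use warnings;
use strict;

# $rest values as they look after the timestamp has been stripped
my $rest_one = '    20848[30892]    0000000000000000    [DM_MQ_I_DAEMON_START]info:  "Message queue daemon (tid : 27944, session 0102b20d80000456) is started sucessfully."';
my $rest_two = ' [INFORMATION] [AGENTEXEC 26816] Detected during program initialization: Version: 7.2.0160.0297  Win64';

my @rows;
for my $rest ($rest_one, $rest_two) {
    if ( my ($pid, $tid, $session, $code, $severity, $message) = $rest =~
        /^ \s* (\d+) \s* \[(\d+)\] \s+ (\S+) \s+ \[(.+?)\] \s* (\w+): \s* (.*?) \s* $/x ) {
        # keep the severity and rebuild the message as "CODE severity: text"
        push @rows, [ "$severity:", "$code $severity: $message" ];
    }
    elsif ( my ($level, $text) = $rest =~
        /^ \s* \[(\w+)\] \s* \[[^\]]*\] \s* (.*?) \s* $/x ) {
        # second format: keep the level, drop the [AGENTEXEC nnn] part
        push @rows, [ $level, $text ];
    }
    else {
        warn "Unrecognized entry: $rest\n";
    }
}

print join("\t", @$_), "\n" for @rows;
```

Entries that match neither regex produce a warning rather than being skipped silently.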

Re^4: DateTime::Format::Flexible; for Log Parse with multiple formatted lines

by TCLion (Novice) on Mar 27, 2017 at 17:08 UTC

Please remember I am new to Perl. I am trying to understand your code examples but have not been able to get them to work in any way. I did go through and look up the code to understand it, but I am missing something. I don't want every whitespace to separate fields, just the first few, and then the error message. One of the lines doesn't have a separate severity like INFORMATION, but has an info: in the line. So in the spot for info: I would be able to pull out error: if it was an error. I am still trying to use $1 $2 $3 to understand and find the positions to place where I want them, but without success. I do understand the higher numbers would pull nothing, but they are there because I am trying to break it up and see what is there as I go, to see if I missed one. Also, $0 printed the file location, which is a good separator when going through the results.

Ok maybe this will help explain what I am trying to do. Here is the data and the desired output for both line formats leaving out unnecessary info.

__DATA__
Mon Feb 20 09:31:25 2017 [INFORMATION] [AGENTEXEC 26816] Detected during program initialization: Version: 7.2.0160.0297  Win64
2017-02-20T09:30:53.177000    20848[30892]    0000000000000000    [DM_MQ_I_DAEMON_START]info:  "Message queue daemon (tid : 27944, session 0102b20d80000456) is started sucessfully."

Server Name	2017-02-20	09:30:53	info:	DM_MQ_I_DAEMON_START info: "Message queue daemon (tid : 27944, session 0102b20d80000456) is started sucessfully."
Server Name	2017-02-20	09:31:25	INFORMATION	Detected during program initialization: Version: 7.2.0160.0297 Win64

I do appreciate your time for helping and explaining this to me.

Re^5: DateTime::Format::Flexible; for Log Parse with multiple formatted lines

by haukex (Archbishop) on Mar 27, 2017 at 18:07 UTC

Re^5: DateTime::Format::Flexible; for Log Parse with multiple formatted lines

by poj (Abbot) on Mar 27, 2017 at 20:18 UTC

Re^5: DateTime::Format::Flexible; for Log Parse with multiple formatted lines

by 1nickt (Canon) on Mar 27, 2017 at 17:35 UTC

Re^2: DateTime::Format::Flexible; for Log Parse with multiple formatted lines
by TCLion (Novice) on Apr 04, 2017 at 14:23 UTC

What is the reason my original code will skip the last entry?

Re^3: DateTime::Format::Flexible; for Log Parse with multiple formatted lines

by huck (Prior) on Apr 04, 2017 at 14:35 UTC

join only puts the first argument between array entries, not after each one

$myfixedlog =~ /......(ENDOFLINE)/smg

Re^3: DateTime::Format::Flexible; for Log Parse with multiple formatted lines

by haukex (Archbishop) on Apr 04, 2017 at 14:33 UTC

What is the reason my original code will skip the last entry?

The problem is using join to join the lines using some string, which only inserts that string between elements of the array, and then using a regex that requires all entries to end on that string. In the following example, based on your original code, I'll demonstrate the problem: note how "Baz" is missing from the output because it is not followed by "ENDOFLINE". The other problem I mentioned was that log entries that don't match the regex will be skipped (and may possibly even cause other entries to be parsed incorrectly, as this example shows):

use warnings;
use strict;
use Data::Dumper;
$Data::Dumper::Useqq=1;

my @mylog = <DATA>;
my $myfixedlog = join("ENDOFLINE", @mylog);
print Dumper $myfixedlog;

while ($myfixedlog =~ /([A-Za-z]+)\nENDOFLINE/smg) {
    print Dumper $1;
}

__DATA__
Foo
123
Bar
Quz
Baz

Output:

$VAR1 = "Foo\nENDOFLINE123\nENDOFLINEBar\nENDOFLINEQuz\nENDOFLINEBaz\n";
$VAR1 = "Foo";
$VAR1 = "ENDOFLINEBar";
$VAR1 = "Quz";
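
For comparison, here is a minimal sketch of the same input processed one line at a time, which avoids both problems. The in-memory filehandle and the variable names are only there to keep the example self-contained; with a real log you would open the file instead.

```perl
use warnings;
use strict;

# Same lines as in the __DATA__ section above, read via an in-memory filehandle
my $input = "Foo\n123\nBar\nQuz\nBaz\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

my (@matched, @skipped);
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^([A-Za-z]+)$/) {
        push @matched, $1;       # entry has the expected format
    }
    else {
        push @skipped, $line;    # record it instead of silently dropping it
    }
}
close $fh;

print "matched: @matched\n";
print "skipped: @skipped\n";
```

Each line is tested on its own, so the last entry is handled like any other, and a line that doesn't match can no longer bleed into its neighbors.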