DateTime::Format::Flexible; for Log Parse with multiple formatted lines

TCLion has asked for the wisdom of the Perl Monks concerning the following question:

Basic info: the log is plain text lines. I am combining all lines and then breaking the result up into separate tags/lines. The log files are copies used for development and are not currently being updated.

I am trying to figure out the best way to get the time from lines in log files. I parse each line for the date, time, severity and error message and output them to CSV. This works on some of my log files, but one in particular has two different date/time formats, so the code does not pick up the lines that are not formatted the expected way.

This line is in the OK format:
2017-02-20T09:30:53.177000 20[] 0000000000000000 Error Description

Second (problem) line:
Mon Feb 20 09:31:25 2017 INFO AGENTEXEC Error Description

Since I now have two formats in one log, I have installed DateTime::Format::Flexible, hoping it will fix my issue by converting one format to match the other.

#Original Working code prior to 2nd date format
#!/usr/bin/perl
use IO::File;

my $file = 'file.location.log';
open(MYLOG, "<$file") or die "Can't open $file: " . $!;
my @mylog = <MYLOG>;
close(MYLOG);

my $myfixedlog = join("ENDOFLINE", @mylog);
#                           Date      1               ,T 2,        Time  3                 ,          4,         5 ,       6    ,   7     , 8  ,   9
while ($myfixedlog =~ /([0-9]{4}-[0-9][0-9]-[0-9][0-9])(T)([0-9][0-9]:[0-9][0-9]:[0-9][0-9])(.\d+\s\d+.\d+.\s)(.{16}\s)(\[\w+\])(\w+\S\s+)(.+?)(ENDOFLINE)/smg) {
                $date = $1;
                $severity = $7;
                $timestamp = $3;
                $errormsg = $6;
                chomp($errormsg);


                      print "New Error Found...\n";
                      print "0 $0\n";
                      print "1 $1\n";
                      print "2 $2\n";
                      print "3 $3\n";
                      print "4 $4\n";
                      print "5 $5\n";
                      print "6 $6\n";
                      print "7 $7\n";
                      print "8 $8\n";
                      print "9 $9\n";
}

That was to find the capture positions for the output.
The desired output is then produced with this code:

use 5.18.0;
use warnings;
use IO::File;
use Time::Piece;


my $servername = "Server1";

# This is the Location of the Original Log File
my $OLF = '\\\server\logs\server.log';

# This is the location of the file parsed
my $file = 'server.log';
open(MYLOG, "<$file") or die "Can't open $file: " . $!;
my @mylog = <MYLOG>;
close(MYLOG);

# This is the Location\Name of the file Being Created
my $CSVLOG = 'output.csv';
open(OUTLOG, ">>$CSVLOG") ;
my @outlog = <OUTLOG>;


my $runtimestamp = localtime(time);

my $myfixedlog = join("ENDOFLINE", @mylog);


print OUTLOG "$OLF\n";
print OUTLOG "Server, Date, TimeStamp, Severity, ErrorMsg, \n";
#                           Date      1               ,T 2,        Time  3                 ,          4,         5 ,       6    ,   7  , 8  ,   9
while ($myfixedlog =~ /([0-9]{4}-[0-9][0-9]-[0-9][0-9])(T)([0-9][0-9]:[0-9][0-9]:[0-9][0-9])(.\d+\s\d+.\d+.\s)(.{16}\s)(\[\w+\])(\w+\S)(.+?)(ENDOFLINE)/smg) {
             my   $date = $1;
             my   $timestamp = $3;
             my   $severity = $7;
             my   $errormsg = join "",$6, $7, $8;
                  chomp($errormsg);
                
           
                 if ($errormsg =~ /DfException/ || $severity eq "error:" || $errormsg =~ /started in/){


                          foreach ($errormsg =~ /DfException/ || $severity eq "error:" || $errormsg =~ /started in/)
                          {print OUTLOG "$servername, $date, $timestamp, $severity, $errormsg \n"}};
        #                               print OUTLOG "$servername, $date, $timestamp, $severity, $errormsg, \n"
}

 
print OUTLOG @outlog;
close(OUTLOG);
print "CSV - File Created  Log File: $CSVLOG";

So now I am trying to add the new code, but I can't figure out how to set the desired format, or how to have it collect and identify the fields the way the original code did. This is the code I have found so far, added to the script that finds the positions in $myfixedlog:

DateTime::Format::Flexible;
my $dt = DateTime::Format::Flexible->parse_datetime( $date, lang => ['en'], );

OK, so I guess the questions are:
How do I declare $date if it's already being used (in code I am not yet familiar with)?
If this code pulls out the date, would it have a capture position, or would it just be placed in $dt for each line?
If the date is pulled out and formatted correctly, would I still break up my line the same way?

Looking at the preview, it looks like I might have grabbed code for two different log files, but the basics are the same.
Any assistance would be appreciated. And no, I did not write the original code, but I have modified the scripts listed here to work. I am still new to Perl, but it is getting easier the more I do.
Hopefully this makes sense to someone.


Replies are listed 'Best First'.
Re: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by haukex (Archbishop) on Mar 23, 2017 at 19:25 UTC
In your code, you slurp the entire file into an array, then join all the lines using a fixed string, and then use a regex that specifically includes the fixed string as the last thing to match. I don't quite understand why you're doing it this way: I don't see the advantage of this over a normal `while (<$filehandle>) { ... }` loop. I didn't really test your code because the sample log entry you provided doesn't actually match your regex, but from what I can tell, your code will silently skip any log entries that don't match the regex, and it will always skip the last log entry.

I see a couple of other issues with your code. You don't use strict and warnings, and you don't check some opens for errors. In your regexes, you don't need to put `(...)` capturing groups around things you don't actually want to capture into the `$1, $2, ...` variables, e.g. you can say `/...T.../` instead of `/...(T).../`. You might also want to look into the `/x` regex modifier (perlre) to make your regexes easier to read and follow. Also, I'd strongly recommend using an appropriate module such as Text::CSV for CSV output.

I'm not sure I fully understand your questions. Instead, I can show you how I might have coded this. Personally, I like to validate the format of input files a little bit as I read them. Instead of DateTime::Format::Flexible, I'd use several DateTime::Format::Strptime parsers, and first use a heuristic to decide which format the log line has. It seems from your sample inputs that the log line formats are quite different, which is why I've duplicated the parsing and output logic in the `if` statements below. If your log lines are instead similar, you should of course not duplicate that code and move the common parsing code outside of the `if`s.
#!/usr/bin/env perl
use warnings;
use strict;
use 5.010; # for /p and ${^MATCH}
use DateTime;
use DateTime::Format::Strptime;
use Text::CSV;

my $strp_one = DateTime::Format::Strptime->new(on_error=>'croak',
    time_zone=>'UTC', pattern => '%Y-%m-%dT%H:%M:%S.%6N');
my $strp_two = DateTime::Format::Strptime->new(on_error=>'croak',
    time_zone=>'UTC', pattern => '%a %b %d %H:%M:%S %Y');
my $csv = Text::CSV->new({binary=>1, always_quote=>1,
    blank_is_undef=>1, eol=>$/, auto_diag=>2});

while (<DATA>) {
    chomp;
    if (/^\d{4,}-[\d\-T\:\.]+(?=\s+)/p) {
        my ($dts,$rest) = (${^MATCH}, ${^POSTMATCH});
        my $dt = $strp_one->parse_datetime($dts);
        # parse "$rest" and break it into more fields here
        $csv->print(select, [
            $dt->strftime('%Y-%m-%d-%H-%M-%S-%6N-%Z'), $rest ] );
    }
    elsif (/^\w+\s+\w+\s+\d+\s+[\d\:]+\s+\d{4,}(?=\s+)/p) {
        my ($dts,$rest) = (${^MATCH}, ${^POSTMATCH});
        my $dt = $strp_two->parse_datetime($dts);
        # parse "$rest" and break it into more fields here
        $csv->print(select, [
            $dt->strftime('%Y-%m-%d-%H-%M-%S-%6N-%Z'), $rest ] );
    }
    else { warn "Skipping unknown line format: $_" }
}
__DATA__
2017-02-20T09:30:53.177000 20[] 0000000000000000 Error Description One
Mon Feb 20 09:31:25 2017 [INFO] [AGENTEXEC] Error Description Two
2017-02-20T09:30:53.177000 20[] 0000000000000000 Error Description Three
Mon Feb 20 09:31:25 2017 [INFO] [AGENTEXEC] Error Description Four

Output:

"2017-02-20-09-30-53-177000-UTC"," 20[] 0000000000000000 Error Description One"
"2017-02-20-09-31-25-000000-UTC"," [INFO] [AGENTEXEC] Error Description Two"
"2017-02-20-09-30-53-177000-UTC"," 20[] 0000000000000000 Error Description Three"
"2017-02-20-09-31-25-000000-UTC"," [INFO] [AGENTEXEC] Error Description Four"

One disadvantage of the above approach is that if you have a lot of different date/time formats in your log files, you'd have to add more and more parsers.
So if that's the case, you can also try using DateTime::Format::Flexible, and the same basic idea as above (use a regex to pull the date/time string from the beginning of the line before attempting to parse it) applies.
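For readers who cannot install extra modules, the same idea (a regex pulls the timestamp off the front of the line, and only that string is handed to a parser) can be sketched with the core module Time::Piece, which the original script already loads. This is a swapped-in technique, not the DateTime-based code from the reply above. The helper name `split_timestamp` and the output format are assumptions made for illustration, and the sketch simply drops fractional seconds instead of parsing them.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper (not from the thread): returns the list
# (normalized timestamp, rest of line), or an empty list when the
# line starts with neither of the two known formats.
sub split_timestamp {
    my ($line) = @_;

    # Format one: 2017-02-20T09:30:53.177000 ...
    if ($line =~ /^(\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d)(?:\.\d+)?\s+(.*)$/s) {
        my ($stamp, $rest) = ($1, $2);
        my $t = Time::Piece->strptime($stamp, '%Y-%m-%dT%H:%M:%S');
        return ($t->strftime('%Y-%m-%d %H:%M:%S'), $rest);
    }

    # Format two: Mon Feb 20 09:31:25 2017 ...
    if ($line =~ /^(\w{3}\s+\w{3}\s+\d{1,2}\s+\d\d:\d\d:\d\d\s+\d{4})\s+(.*)$/s) {
        my ($stamp, $rest) = ($1, $2);
        my $t = Time::Piece->strptime($stamp, '%a %b %d %H:%M:%S %Y');
        return ($t->strftime('%Y-%m-%d %H:%M:%S'), $rest);
    }

    return;    # unknown format: let the caller decide what to do
}

# The two sample lines from the question.
for my $line (
    '2017-02-20T09:30:53.177000 20[] 0000000000000000 Error Description',
    'Mon Feb 20 09:31:25 2017 INFO AGENTEXEC Error Description',
) {
    my ($when, $rest) = split_timestamp($line)
        or do { warn "Skipping unknown line format: $line\n"; next };
    print "$when | $rest\n";
}
```

Both formats end up as the same `YYYY-MM-DD HH:MM:SS` string, so the rest of the line can be handled by whatever field-splitting logic is already in place.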
Re^2: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by TCLion (Novice) on Mar 27, 2017 at 13:14 UTC
I now understand the DATA section, and this looks good. My boss doesn't like it, though: he says his script was better because it was simpler. I would still like to use and modify this one (thank you), but I am having a problem breaking up $rest. I tried adding the position-printing loop from my first script, but it is not working correctly. I think $dts has something left over that is being pushed out, so $rest is not what I expect. But I probably don't have the code correct:

if (/^\d{4,}-[\d\-T\:\.]+(?=\s+)/p) {
    my ($dts,$rest) = (${^MATCH}, ${^POSTMATCH});
    my $dt = $strp_one->parse_datetime($dts);
    # parse "$rest" and break it into more fields here
    while ($rest =~ /(\w+\s+)(\w+\s+)(.+?)/smg) {
        print "New Error Found...\n";
        print "0 $0\n";
        print "1 $1\n";
        print "2 $2\n";
        print "3 $3\n";
        print "4 $4\n";
        print "5 $5\n";
        print "6 $6\n";}
    $csv->print(select, [
        $dt->strftime('%Y-%m-%d,%H:%M:%S'),#'%Y-%m-%d-%H-%M-%S-%6N-%Z'
        $rest ] );
}

I did add full data strings:

__DATA__
2017-02-20T09:30:53.177000 20848[30892] 0000000000000000 [DM_MQ_I_DAEMON_START]info: "Message queue daemon (tid : 27944, session 0102b20d80000456) is started sucessfully."
2017-02-20T09:30:53.193000 20848[17732] 0102b20d80000003 [DM_DOCBROKER_I_PROJECTING]info: "Sending information to Docbroker located on host (PWDOCPRDCON32) with port (1489). Information: (Config(server), Proximity(1), Status(Open), Dormancy Status(Active))."
2017-02-20T09:30:53.193000 20848[17732] 0102b20d80000003 [DM_DOCBROKER_I_PROJECTING]info: "Sending information to Docbroker located on host (server) with port (1354). Information: (Config(server), Proximity(2), Status(Open), Dormancy Status(Active))."
2017-02-20T09:30:53.193000 20848[17732] 0102b20d80000003 [DM_DOCBROKER_I_PROJECTING]info: "Sending information to Docbroker located on host (server) with port (1354). Information: (Config(Server), Proximity(3), Status(Open), Dormancy Status(Active))."
Mon Feb 20 09:31:25 2017 [INFORMATION] [AGENTEXEC 26816] Detected during program initialization: Version: 7.2.0160.0297 Win64
Mon Feb 20 09:31:30 2017 [INFORMATION] [AGENTEXEC 26816] Detected during program initialization: Agent Exec connected to server server: [DM_SESSION_I_SESSION_START]info: "Session 0102b20d80397508 started for user user."

I do need to pull out the severity word (info:), because it would say error: if there were a problem, but I also need it kept in the full message (DM...info: "message.."). The 20848[17732] and 0102b20... fields are not needed, and neither are INFORMATION and AGENTEXEC 26816. So what did I do wrong in the code when trying to find the positions?
Re^3: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by haukex (Archbishop) on Mar 27, 2017 at 14:54 UTC
Two problems I see with that code are: First, just like in the original code, `while ($str =~ /.../g)` without a `\G` regex anchor (`$str =~ /\G.../g`) will skip over stuff in `$str` that doesn't match the regex, possibly resulting in missed data. Second, as 1nickt already said, `$0` is not a regex capture (see $0 in perlvar), and the regex only has three capture groups, so `$4` and above will never be populated by that regex.

Based on your regex, it looks like you're trying to break up the string based on whitespace, in which case a simple `my @parts = split ' ', $rest;` might be easiest. However, I see that your log entries have quoted strings, so that might not be appropriate either. Your first couple of example log entries could possibly be broken apart like this: `my @parts = split /\s*[\[\]]\s*/, $rest, 5;`. Otherwise you'll have to write regexes that actually match the log entries, for example `/^ \s* (\d+) \s* \[(\d+)\] \s+ (\S+) \s+ \[(.+?)\] \s* (\w+): \s* (.*?) \s* $/x`. To match quoted strings, you could use Regexp::Common::delimited or the core module Text::Balanced.

Good resources on regexes in general are perlretut, perlrequick, and perlre.
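As an illustration of the "write a regex that actually matches the log entry" suggestion, here is a minimal sketch for the first log format only, applied to the `$rest` part that follows the timestamp. The helper name `parse_rest` and the hash keys are hypothetical. The process id, thread id and session id are matched but not captured, since the follow-up says they are not needed, and the severity word is both returned on its own and kept in the reassembled full message.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): breaks the part of a
# first-format log line that follows the timestamp into fields.
# Returns a hash reference, or nothing if the text does not match.
sub parse_rest {
    my ($rest) = @_;
    return unless $rest =~ /
        ^ \s* \d+ \s* \[ \d* \]    # process id and thread id, not captured
        \s+ \S+                    # session id, not captured
        \s+ \[ (.+?) \]            # 1: message code in square brackets
        \s* (\w+) :                # 2: severity word such as info or error
        \s* (.*?) \s* $            # 3: message text
    /xs;
    my ($code, $severity, $message) = ($1, $2, $3);
    return {
        code     => $code,
        severity => $severity,
        message  => $message,
        full     => sprintf('[%s]%s: %s', $code, $severity, $message),
    };
}

# First sample entry from the follow-up post, timestamp already removed.
my $rest = ' 20848[30892] 0000000000000000 [DM_MQ_I_DAEMON_START]info:  '
         . '"Message queue daemon (tid : 27944, session 0102b20d80000456) '
         . 'is started sucessfully."';

if (my $fields = parse_rest($rest)) {
    print "severity: $fields->{severity}\n";
    print "message:  $fields->{full}\n";
}
else {
    warn "Could not parse: $rest\n";
}
```

Because the pattern is anchored at both ends and used without `/g`, a line that does not fit is reported instead of being silently skipped. The second log format would need its own pattern.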
Re^4: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by TCLion (Novice) on Mar 27, 2017 at 17:08 UTC
Re^5: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by haukex (Archbishop) on Mar 27, 2017 at 18:07 UTC
Re^5: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by poj (Abbot) on Mar 27, 2017 at 20:18 UTC
Re^5: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by 1nickt (Canon) on Mar 27, 2017 at 17:35 UTC
Re^2: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by TCLion (Novice) on Apr 04, 2017 at 14:23 UTC
What is the reason my original code will skip the last entry?
Re^3: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by huck (Prior) on Apr 04, 2017 at 14:35 UTC
join only puts its first argument between array entries, not after each one, so there is no ENDOFLINE text after the last entry, but your regex requires one: `$myfixedlog =~ /......(ENDOFLINE)/smg`
Re^3: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by haukex (Archbishop) on Apr 04, 2017 at 14:33 UTC
"What is the reason my original code will skip the last entry?"

The problem is using join to join the lines using some string, which only inserts that string between elements of the array, and then using a regex that requires all entries to end on that string. In the following example, based on your original code, I'll demonstrate the problem. Note how in the output, "Baz" is missing because it is not followed by "ENDOFLINE". The other problem I mentioned was that log entries that don't match the regex will be skipped (and may possibly even cause other entries to be parsed incorrectly, as this example shows):

use warnings;
use strict;
use Data::Dumper;
$Data::Dumper::Useqq=1;

my @mylog = <DATA>;
my $myfixedlog = join("ENDOFLINE", @mylog);
print Dumper $myfixedlog;

while ($myfixedlog =~ /([A-Za-z]+)\nENDOFLINE/smg) {
    print Dumper $1;
}
__DATA__
Foo
123
Bar
Quz
Baz

Output:

$VAR1 = "Foo\nENDOFLINE123\nENDOFLINEBar\nENDOFLINEQuz\nENDOFLINEBaz\n";
$VAR1 = "Foo";
$VAR1 = "ENDOFLINEBar";
$VAR1 = "Quz";
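The silent skipping can be made visible with the `\G` anchor mentioned earlier in the thread. The sketch below is an illustration only: the helper name `scan_entries` and its return structure are made up for this example. It runs the same kind of `/g` loop twice over a small string, once unanchored and once with `\G` plus the `/c` modifier, so that the position of the first unparsable entry can be reported.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collects alphabetic
# entries from a newline-separated string, either with or without
# the \G anchor.
sub scan_entries {
    my ($str, $anchored) = @_;
    my @found;
    if ($anchored) {
        # \G pins each match to where the previous one ended; /c keeps
        # pos() in place when a match fails, so it can be reported.
        push @found, $1 while $str =~ /\G([A-Za-z]+)\n/gc;
        return { found => \@found, stopped_at => pos($str) // 0 };
    }
    # Without \G the regex engine searches forward past anything
    # that does not match, so bad entries vanish without a trace.
    push @found, $1 while $str =~ /([A-Za-z]+)\n/g;
    return { found => \@found, stopped_at => undef };
}

my $log = "Foo\n123\nBar\n";

my $loose  = scan_entries($log, 0);
my $pinned = scan_entries($log, 1);

printf "unanchored found: %s\n", join(', ', @{ $loose->{found} });
printf "anchored found:   %s (stopped at offset %d of %d)\n",
    join(', ', @{ $pinned->{found} }), $pinned->{stopped_at}, length $log;
```

When the anchored loop stops before the end of the string, the remaining text starts with an entry the regex could not handle, which is usually a better signal than a shorter-than-expected CSV file.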
Re: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by 1nickt (Canon) on Mar 23, 2017 at 19:12 UTC
It appears that you are not only reading in all the lines of the file at once, but then combining them all into one string. This is probably not a scalable solution. Instead, consider using while to go through the lines one at a time, processing the text and extracting the data you need in a loop.

The other thing I saw at a glance is that you are trying to use `$0` as a regex capture, which will not do what you expect.

Edit: Also please place your sample input into `<code></code>` tags, as it's not rendering accurately at the moment and can't be used for testing. Please consider sharing an SSCCE, which in this case would include a couple of sample lines in the `__DATA__` section, the regexp you're using to extract the fields, and then the date handling code ... all in one script of 20 - 30 lines.

The way forward always starts with a minimal test.
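A minimal sketch of the line-at-a-time loop suggested above might look like the following. The helper name `collect_dated_lines` and the returned structure are invented for illustration, and an in-memory filehandle stands in for the real log file so that the example runs on its own. With a real file, the open line would take the file name instead of the scalar reference.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reads any filehandle one
# line at a time. Returns two array references: the parsed entries and
# the line numbers that did not start with an ISO-style timestamp.
sub collect_dated_lines {
    my ($fh) = @_;
    my (@hits, @skipped);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^(\d{4}-\d\d-\d\d)T(\d\d:\d\d:\d\d)/) {
            push @hits, { date => $1, time => $2, line_no => $. };
        }
        else {
            push @skipped, $.;    # $. is the current input line number
        }
    }
    return (\@hits, \@skipped);
}

# The two sample lines from the question stand in for the log file.
my $sample = "2017-02-20T09:30:53.177000 20[] 0000000000000000 Error Description\n"
           . "Mon Feb 20 09:31:25 2017 INFO AGENTEXEC Error Description\n";

open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my ($hits, $skipped) = collect_dated_lines($fh);
close $fh;

print "$_->{date} $_->{time} (line $_->{line_no})\n" for @$hits;
print "unrecognized lines: @$skipped\n" if @$skipped;
```

Only one line is held in memory at a time, and `$1`, `$2` and so on are the only regex captures used; `$0` holds the program name and is never set by a match.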
Re^2: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by TCLion (Novice) on Mar 23, 2017 at 19:48 UTC
I did use the code tags. It works fine in my browser, just in a small window, and the download link works on all of them. So instead of one continuous string, how would you use while for this code in a loop? I am not sure what you are asking for with an SSCCE. I did what I believe I was told to do for SoPW, and I gave all the code (properly, as I understand it) and example lines. What am I missing or not understanding? Where is a data section? Part of my problem is that I am unsure of the date handling code, because DateTime::Format::Flexible is new to me and I have not understood the documentation completely, which is why I am asking in the first place. Please don't take this as me being an ass. I just don't understand and need clarification.
Re^3: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by Marshall (Canon) on Mar 23, 2017 at 22:08 UTC
You need to use code tags, not only for the program code, but also for the data that the program is supposed to read. In the text of your question, please use code tags to display: `Mon Feb 20 09:31:25 2017 [INFO] [AGENTEXEC] Error Description`. Code tags also put things into a fixed-width font.

Your program is unable to parse this line: `2017-02-20T09:30:53.177000 20[] 0000000000000000 Error Description`

In Perl, it is possible to define a file that is contained within the program code itself! This is called a DATA segment. A simple example:

#!/usr/bin/perl
use strict;
use warnings;

while (<DATA>)
{
    print;
}
__DATA__
Some example lines that could be in a file

Everything after the `__DATA__` line can be read through the pre-opened `DATA` file handle. There are ways to put multiple input files within the code, but a DATA segment for a single file is the most often used.

It is not clear to me what the desired output is. Please include an example of that in your post.
Re^3: DateTime::Format::Flexible; for Log Parse with multiple formatted lines by 1nickt (Canon) on Mar 23, 2017 at 21:39 UTC
Hi, I was referring to the sample data lines you included, which are not in code tags and thus are mangled. This had prevented me from trying to solve your problem ...

An SSCCE here would, as I said, contain only enough code and data to demonstrate your issue. Using the `__DATA__` section in a file allows you to include data and code in the same file but keep them separate. Perfect for an SSCCE. For example:

use strict; use warnings; use feature 'say';
use DateTime::Format::Flexible;
use Test::More tests => 2;

my $wanted = '2017-02-20 09:30:53';

for my $string ( <DATA> ) {
    chomp $string;
    my $dt = DateTime::Format::Flexible->parse_datetime( $string );
    is( $dt->strftime('%F %T'), $wanted, "with >$string<" );
}
__DATA__
Mon Feb 20 09:30:53 2017
2017-02-20T09:30:53.177000

Output:

1..2
ok 1 - with >Mon Feb 20 09:30:53 2017<
ok 2 - with >2017-02-20T09:30:53.177000<

(edit: updated example with OP's data)

Hope this helps!

The way forward always starts with a minimal test.