in reply to DateTime::Format::Flexible; for Log Parse with multiple formatted lines

In your code, you slurp the entire file into an array, join all the lines using a fixed string, and then use a regex that specifically includes the fixed string as the last thing to match. I don't quite understand why you're doing it this way; I don't see the advantage over a normal while (<$filehandle>) { ... } loop. I didn't really test your code because the sample log entry you provided doesn't actually match your regex, but from what I can tell, your code will silently skip any log entries that don't match the regex, and it will always skip the last log entry.

I see a couple of other issues with your code: You don't use strict and warnings, and you don't check some of your opens for errors. In your regexes, you don't need to put (...) capturing groups around things you don't actually want to capture into the $1, $2, ... variables, e.g. you can say /...T.../ instead of /...(T).../. You might also want to look into the /x regex modifier (perlre) to make your regexes easier to read and follow. Also, I'd strongly recommend using an appropriate module such as Text::CSV for CSV output.
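To illustrate the two regex suggestions, here is a minimal sketch. The sample line and its field layout are made up for illustration and are not taken from your code:

```perl
use warnings;
use strict;

# Hypothetical sample line; the field layout is an assumption for illustration.
my $line = '2017-02-20T09:30:53 some message';

# Before: every piece is wrapped in (...), so the literal "T" lands in $2.
my @noisy = $line =~ /^(\d{4}-\d\d-\d\d)(T)(\d\d:\d\d:\d\d)\s+(.*)$/;

# After: the x modifier allows whitespace and comments inside the regex,
# and only the fields that are actually wanted are captured.
my @clean = $line =~ /
    ^ (\d{4}-\d\d-\d\d)   # date
    T                     # literal separator, not captured
    (\d\d:\d\d:\d\d)      # time
    \s+ (.*) $            # rest of the line
/x;

print scalar(@noisy), " vs ", scalar(@clean), " captures\n";
```

Both regexes match the same text; the second just returns three fields instead of four and is easier to extend later.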

I'm not sure I fully understand your questions. Instead, I can show you how I might have coded this. Personally, I like to validate the format of input files a little bit as I read them. Instead of DateTime::Format::Flexible, I'd use several DateTime::Format::Strptime parsers, and first use a heuristic to decide which format the log line has. It seems from your sample inputs that the log line formats are quite different, which is why I've duplicated the parsing and output logic in the if statements below, but if your log lines are instead similar, you should of course not duplicate that code and move the common parsing code outside of the ifs.

#!/usr/bin/env perl
use warnings;
use strict;
use 5.010; # for /p and ${^MATCH}
use DateTime;
use DateTime::Format::Strptime;
use Text::CSV;

my $strp_one = DateTime::Format::Strptime->new(on_error=>'croak',
    time_zone=>'UTC', pattern => '%Y-%m-%dT%H:%M:%S.%6N');
my $strp_two = DateTime::Format::Strptime->new(on_error=>'croak',
    time_zone=>'UTC', pattern => '%a %b %d %H:%M:%S %Y');
my $csv = Text::CSV->new({binary=>1, always_quote=>1,
    blank_is_undef=>1, eol=>$/, auto_diag=>2});

while (<DATA>) {
    chomp;
    if (/^\d{4,}-[\d\-T\:\.]+(?=\s+)/p) {
        my ($dts,$rest) = (${^MATCH}, ${^POSTMATCH});
        my $dt = $strp_one->parse_datetime($dts);
        # parse "$rest" and break it into more fields here
        $csv->print(select, [
            $dt->strftime('%Y-%m-%d-%H-%M-%S-%6N-%Z'), $rest ] );
    }
    elsif (/^\w+\s+\w+\s+\d+\s+[\d\:]+\s+\d{4,}(?=\s+)/p) {
        my ($dts,$rest) = (${^MATCH}, ${^POSTMATCH});
        my $dt = $strp_two->parse_datetime($dts);
        # parse "$rest" and break it into more fields here
        $csv->print(select, [
            $dt->strftime('%Y-%m-%d-%H-%M-%S-%6N-%Z'), $rest ] );
    }
    else { warn "Skipping unknown line format: $_" }
}
__DATA__
2017-02-20T09:30:53.177000 20[] 0000000000000000 Error Description One
Mon Feb 20 09:31:25 2017 [INFO] [AGENTEXEC] Error Description Two
2017-02-20T09:30:53.177000 20[] 0000000000000000 Error Description Three
Mon Feb 20 09:31:25 2017 [INFO] [AGENTEXEC] Error Description Four

Output:

"2017-02-20-09-30-53-177000-UTC"," 20[] 0000000000000000 Error Description One"
"2017-02-20-09-31-25-000000-UTC"," [INFO] [AGENTEXEC] Error Description Two"
"2017-02-20-09-30-53-177000-UTC"," 20[] 0000000000000000 Error Description Three"
"2017-02-20-09-31-25-000000-UTC"," [INFO] [AGENTEXEC] Error Description Four"

One disadvantage of the above approach is that if you have a lot of different date/time formats in your log files, you'd have to add more and more parsers. So if that's the case, you can also try using DateTime::Format::Flexible, and the same basic idea as above (use a regex to pull the date/time string from the beginning of the line before attempting to parse it) applies.

Re^2: DateTime::Format::Flexible; for Log Parse with multiple formatted lines
by TCLion (Novice) on Mar 27, 2017 at 13:14 UTC

    I now understand the DATA portion, and this looks good. Boss doesn't like it... says his script was better because it was simpler. I would still like to use and modify this one (thank you), but I am having a problem with breaking up the $rest. I'm trying to add in my first with the positions, but it's not working correctly. I am thinking that $dts has something extra left over that is being pushed out, making $rest not the same. But I probably don't have the code correct.

    if (/^\d{4,}-[\d\-T\:\.]+(?=\s+)/p) {
        my ($dts,$rest) = (${^MATCH}, ${^POSTMATCH});
        my $dt = $strp_one->parse_datetime($dts);
        # parse "$rest" and break it into more fields here
        while ($rest =~ /(\w+\s+)(\w+\s+)(.+?)/smg) {
            print "New Error Found...\n";
            print "0 $0\n";
            print "1 $1\n";
            print "2 $2\n";
            print "3 $3\n";
            print "4 $4\n";
            print "5 $5\n";
            print "6 $6\n";
        }
        $csv->print(select, [
            $dt->strftime('%Y-%m-%d,%H:%M:%S'),#'%Y-%m-%d-%H-%M-%S-%6N-%Z'
            $rest ] );
    }

    I did add full data strings

    __DATA__
    2017-02-20T09:30:53.177000 20848[30892] 0000000000000000 [DM_MQ_I_DAEMON_START]info: "Message queue daemon (tid : 27944, session 0102b20d80000456) is started sucessfully."
    2017-02-20T09:30:53.193000 20848[17732] 0102b20d80000003 [DM_DOCBROKER_I_PROJECTING]info: "Sending information to Docbroker located on host (PWDOCPRDCON32) with port (1489). Information: (Config(server), Proximity(1), Status(Open), Dormancy Status(Active))."
    2017-02-20T09:30:53.193000 20848[17732] 0102b20d80000003 [DM_DOCBROKER_I_PROJECTING]info: "Sending information to Docbroker located on host (server) with port (1354). Information: (Config(server), Proximity(2), Status(Open), Dormancy Status(Active))."
    2017-02-20T09:30:53.193000 20848[17732] 0102b20d80000003 [DM_DOCBROKER_I_PROJECTING]info: "Sending information to Docbroker located on host (server) with port (1354). Information: (Config(Server), Proximity(3), Status(Open), Dormancy Status(Active))."
    Mon Feb 20 09:31:25 2017 [INFORMATION] [AGENTEXEC 26816] Detected during program initialization: Version: 7.2.0160.0297 Win64
    Mon Feb 20 09:31:30 2017 [INFORMATION] [AGENTEXEC 26816] Detected during program initialization: Agent Exec connected to server server: [DM_SESSION_I_SESSION_START]info: "Session 0102b20d80397508 started for user user."

    I do need to pull out the word in the severity position (info:), since it would say error: if there were a problem, but then add it back in for the full message, DM...info: "message..", and that's what I need.
    20848[17732] and 0102b20... are not needed, and neither are INFORMATION and AGENTEXEC 26816.

    So what did I do wrong with the code in trying to find the positions?

      Two problems I see with that code are: First, just like in the original code, while ($str =~ /.../g) without a \G regex anchor ($str =~ /\G.../g) will skip over stuff in $str that doesn't match the regex, possibly resulting in missed data. Second, as 1nickt already said, $0 is not a regex capture (see $0), and the regex only has three capture groups, so $4 and above will never be populated by that regex.
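      Here is a minimal sketch of the first problem, using a made-up string rather than your log data. The /c modifier is my addition so that pos() survives the failed match and the unparsed remainder can be shown:

```perl
use warnings;
use strict;

# Made-up input: the "???" in the middle does not fit the word-plus-space pattern.
my $str = "foo bar ??? baz ";

# Without \G, the engine scans forward past "???" and carries on.
my @loose;
while ($str =~ /(\w+)\s+/g) { push @loose, $1 }

# With \G (and the c modifier to keep pos() on failure), matching stops at
# the first gap, so the unparsed remainder can be reported.
my @strict;
pos($str) = 0;
while ($str =~ /\G(\w+)\s+/gc) { push @strict, $1 }
my $leftover = substr($str, pos($str));

print "loose:  @loose\n";
print "strict: @strict\n";
print "unparsed: '$leftover'\n";
```

      The first loop returns foo, bar and baz with no hint that anything was skipped; the second stops after bar and leaves "??? baz " to be reported.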

      Based on your regex, it looks like you're trying to break up the string based on whitespace, in which case a simple my @parts = split ' ', $rest; might be easiest.

      However, I see that your log entries have quoted strings, so that might not be appropriate either. Your first couple of example log entries could possibly be broken apart like this: my @parts = split /\s*[\[\]]\s*/, $rest, 5;, or, you'll have to write regexes that actually match the log entries, e.g. /^ \s* (\d+) \s* \[(\d+)\] \s+ (\S+) \s+ \[(.+?)\] \s* (\w+): \s* (.*?) \s* $/x, for example.
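      As a sketch of how those two approaches behave, here they are applied to the first sample entry from this thread, with the timestamp already removed. The variable names ($pid, $tid and so on) are my guesses at what the fields mean, not something the log format documents:

```perl
use warnings;
use strict;

# $rest as it would look after the timestamp has been stripped from the
# first sample line posted in this thread (note the leading space).
my $rest = ' 20848[30892] 0000000000000000 [DM_MQ_I_DAEMON_START]info: '
         . '"Message queue daemon (tid : 27944, session 0102b20d80000456) is started sucessfully."';

# Approach 1: split on brackets, limited to 5 fields so the message stays whole.
my @parts = split /\s*[\[\]]\s*/, $rest, 5;

# Approach 2: a regex that describes the whole entry (field names are guesses).
my ($pid, $tid, $session, $code, $severity, $message) = $rest =~ /
    ^ \s* (\d+) \s* \[(\d+)\]   # leading number and bracketed number
    \s+ (\S+)                   # session id
    \s+ \[(.+?)\]               # message code in brackets
    \s* (\w+):                  # severity word such as info or error
    \s* (.*?) \s* $             # the message itself
/x or die "Line did not match: $rest";

print scalar(@parts), " parts from split\n";
print "$severity | $code | $message\n";
```

      With split, the first field keeps its leading space and the severity stays glued to the message, which is why the explicit regex is probably the better fit if you want info: or error: in its own column.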

      To match quoted strings, you could use Regexp::Common::delimited or the core module Text::Balanced. Good resources on regexes in general are perlretut, perlrequick, and perlre.

        Please remember I am new to Perl. I am trying to understand your code examples but have not been able to get them to work in any way. I did go through and look up the code to understand it, but I am missing something. I don't want every whitespace to separate, just the first few and then the error message. One of the lines doesn't have a separate severity like INFORMATION but has an info: in the line. So from the spot for info: I would be able to pull out error: if it was an error. I am still trying to use $1 $2 $3 to understand and find the positions to place where I want them, but unsuccessfully. I do understand the higher numbers would pull nothing, but they are there because I am trying to break it up and see what is there as I go, to see if I missed one. Also, $0 printed the file location, which is a good separator when going through the results.

        Ok maybe this will help explain what I am trying to do. Here is the data and the desired output for both line formats leaving out unnecessary info.

        __data__
        Mon Feb 20 09:31:25 2017 [INFORMATION] [AGENTEXEC 26816] Detected during program initialization: Version: 7.2.0160.0297 Win64
        2017-02-20T09:30:53.177000 20848[30892] 0000000000000000 [DM_MQ_I_DAEMON_START]info: "Message queue daemon (tid : 27944, session 0102b20d80000456) is started sucessfully."
        Server Name 2017-02-20 09:30:53 info: DM_MQ_I_DAEMON_START info: "Message queue daemon (tid : 27944, session 0102b20d80000456) is started sucessfully."
        Server Name 2017-02-20 09:31:25 INFORMATION Detected during program initialization: Version: 7.2.0160.0297 Win64

        I do appreciate your time for helping and explaining this to me.

Re^2: DateTime::Format::Flexible; for Log Parse with multiple formatted lines
by TCLion (Novice) on Apr 04, 2017 at 14:23 UTC

    What is the reason my original code will skip the last entry?

      join only puts the first argument between array entries, not after each one

      $myfixedlog =~ /......(ENDOFLINE)/smg
      so there is no text ENDOFLINE after the last entry

      What is the reason my original code will skip the last entry?

      The problem is using join to join the lines using some string, which only inserts that string between elements of the array, and then using a regex that requires all entries to end on that string. The following example, based on your original code, demonstrates the problem: note how "Baz" is missing from the output because it is not followed by "ENDOFLINE". The other problem I mentioned was that log entries that don't match the regex will be skipped (and may possibly even cause other entries to be parsed incorrectly, as this example shows):

      use warnings;
      use strict;
      use Data::Dumper;
      $Data::Dumper::Useqq=1;

      my @mylog = <DATA>;
      my $myfixedlog = join("ENDOFLINE", @mylog);
      print Dumper $myfixedlog;
      while ($myfixedlog =~ /([A-Za-z]+)\nENDOFLINE/smg) {
          print Dumper $1;
      }
      __DATA__
      Foo
      123
      Bar
      Quz
      Baz

      Output:

      $VAR1 = "Foo\nENDOFLINE123\nENDOFLINEBar\nENDOFLINEQuz\nENDOFLINEBaz\n";
      $VAR1 = "Foo";
      $VAR1 = "ENDOFLINEBar";
      $VAR1 = "Quz";
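      For comparison, here is a minimal sketch of the line-by-line loop applied to the same five sample lines. The in-memory filehandle is only there to keep the sketch self-contained; a real script would open the log file instead:

```perl
use warnings;
use strict;

# Same five sample lines as in the example above, read from an in-memory
# filehandle so the sketch runs without an external file.
my $data = "Foo\n123\nBar\nQuz\nBaz\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";

my (@matched, @skipped);
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^([A-Za-z]+)$/) { push @matched, $1 }
    else                          { push @skipped, $line }  # reported, not silently lost
}
close $fh;

print "matched: @matched\n";
print "skipped: @skipped\n";
```

      Each line is tested on its own, so the last entry is treated like any other, and a line that doesn't match cannot bleed into its neighbours.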