Amblikai has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks, I spent the last day or so writing a quick logfile parser. It simply searches the file for categories of search terms. I was happy with my new creation!

Until....

I've just realised that when the logfile is created, newlines are being randomly inserted into the file. I've no idea why. It seems just to be the way these logs are created!

The problem is that occasionally my search terms are cut in half by a newline! How on earth am I supposed to deal with that?

Any advice would be greatly appreciated. Thank you!


Re: Help parsing badly constructed logfiles
by AppleFritter (Vicar) on Jul 13, 2014 at 16:16 UTC

    I don't suppose fixing whatever's creating those logfiles is an option?

    Here's what I might do: remember the previously-read line, and match your search term against both the current line and the concatenation of the previous and current line. Off the top of my head (completely untested):

    use feature 'say';

    my $previous = "";
    while (defined(my $line = <>)) {
        chomp $line;
        ($previous . $line) =~ m/$pattern/ and say $line;
        $previous = $line;
    }

    (This isn't perfect, obviously; it'll also print lines where the previous line just so happened to match, even though there's no match across the line boundary. But it should get you started.)
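
    If the spurious matches bother you, you can check *where* the match falls: a hit that starts inside the previous line and ends inside the current one is a genuine boundary-spanning match, while anything else was already reported on its own line. A rough sketch (still lightly tested at best; the pattern and sample data are made up, substitute your real search terms):

        use strict;
        use warnings;

        my $pattern = qr/ERROR: disk full/;    # hypothetical search term

        sub scan_lines {
            my @hits;
            my $previous = "";
            for my $raw (@_) {
                my $line = $raw;               # copy so we don't clobber the caller's data
                chomp $line;
                if ($line =~ $pattern) {
                    push @hits, $line;         # match wholly inside this line
                }
                elsif ("$previous$line" =~ $pattern
                       and $-[0] < length $previous    # match starts in $previous...
                       and $+[0] > length $previous) { # ...and ends in $line
                    push @hits, "$previous$line";      # report the rejoined pair
                }
                $previous = $line;
            }
            return @hits;
        }

        # e.g.: print "$_\n" for scan_lines(<>);

    The `@-`/`@+` arrays hold the start and end offsets of the last successful match, which is what lets you tell a boundary-spanning hit from a repeat of one you already printed.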

      That's great advice. I hadn't thought of that!

      I'm still trying to get to the bottom of what is causing the logfile to be so badly formatted. It seems it might actually be the platform LSF which is messing with the output files somehow. I have little to no experience of using the LSF though, (which is another problem!)

      Thanks again!

Re: Help parsing badly constructed logfiles
by Bethany (Scribe) on Jul 13, 2014 at 17:22 UTC

    (N.B.: This is a blue-sky suggestion. Depending on the logfiles' sizes and the resources available, it might be impractical -- I don't know. But the approach has worked well for me for a certain class of text-file messed-up-ness.)

    Independent of newlines, is there a reliable way to tell where "real" lines, meaning actual log entries, end?

    For instance, if whole log entries have a fixed length, you could remove all newlines and then break the combined text into fixed-length records. If it were that easy you'd probably have tried it already, but maybe the approach can be adapted to your situation -- if entries always end with some characteristic pattern that doesn't occur elsewhere in an entry, you could split just after each match of that pattern, and so forth. Or maybe a combination -- looking for the pattern within a certain range of possible entry lengths -- will do the trick.
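
    To illustrate the "strip the newlines, then re-split on an end-of-entry marker" variant, here's a rough sketch. The "[done]" terminator is purely hypothetical -- substitute whatever reliably ends your entries and never appears inside one:

        use strict;
        use warnings;

        sub rebuild_entries {
            my ($text, $terminator) = @_;
            $text =~ s/\n//g;    # drop all newlines, stray or otherwise
            # Split *after* each terminator (zero-width lookbehind),
            # so the terminator stays attached to its entry.
            return grep { length } split /(?<=\Q$terminator\E)/, $text;
        }

        # e.g.: my @entries = rebuild_entries(do { local $/; <> }, "[done]");

    Note the `\Q...\E` around the interpolated terminator, so regex metacharacters in it (like the brackets here) are taken literally.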