text parsing question

perlAffen has asked for the wisdom of the Perl Monks concerning the following question:

I have a log file of SNMP traps in which each trap should sit on a single line. However, some pieces of a trap (the varbinds) contain newlines, which makes the data hard to interpret. I was thinking I could strip all newlines and then insert a new one at the date, since the date string marks the start of a new entry. But the junk with the newlines is so verbose that I need to discard it, because it makes the line exceed the buffer I am working with. I will try to insert a sample here. The junk that looks like MIME-encoded data is the culprit, with the newlines; traps without this junk are all on one line, but vary in format.

09:59:58 09/28/07 1 192.168.0.7 1.3.6.1.2.1.1.3.0 TimeTick 335604562 1.3.6.1.6.3.1.1.4.1.0 OID 1.3.6.1.4.1.9.9.383.0.1 1.3.6.1.4.1.9.9.383.1.1.1 Counter64 1119125906 1.3.6.1.4.1.9.9.383.1.1.2 String 07d7091c09361400 1.3.6.1.4.1.9.9.383.1.2.14 String AC4AMAAAAFcAaQBuAGQAbwB3AHMAIAAyADAAMAAwACAATABBAE4AIABNAGEA
bgBhAGcAZQByAAAAAAAAOP9TTUJ1AAAAAJgHyAAAAAAAAAAAAAAAAAIg//4B
MMCNB/8AOAABAP8BAAD/AQAABwBJUEMAAAAAAAAAh/9TTUKiAAAAAJgHyAAA
AAAAAAAAAAAAAAIgNA8BMACOKv8AhwAADcABAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAACAAAAAABAAAAAAAAAAAAAAAAAAAAIA/wUAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAmwESAJsBEgAAAA== 1.3.6.1.4.1.9.9 String aQBuAGQAbwB3AHMAIAAyADAAMAAyACAANQAuADEAAAAAAAAAAE7/U01CdQAF
AAAYB8gAAAAAAAAAAAAAAAAAAP/+ATDAjQT/AE4ACAABACMAAFwAXABWAEUA
TgBTAE4AQQBcAEkAUABDACQAAAA/Pz8/PwAAAABk/1NNQqIAAAAAGAfIAAAA
AAAAAAAAAAAAAiA0DwEwAI4Y/wDe3gAOABYAAAAAAAAAnwECAAAAAAAAAAAA
AAAAAAMAAAABAAAAQABAAAIAAAABEQAAXABzAHIAdgBzAHYAYwAAAAAAAIj/
U01CLwAAAAAYB8gAAAAAAAAAAAAAAAACIP/+ATBAjg== 1.3.6.1.4.1.9.9.383.1.2.16 String 192.168.20.60:3089 1.3.6.1.4.1.9.9.383.1.2.17 String osIdSource="unknown" osRelevance="relevant" osType="unknown" 192.168.0.23:139

09:59:58 09/28/07 1 192.168.0.7 1.3.6.1.2.1.1.3.0 TimeTick 335604562 1.3.6.1.6.3.1.1.4.1.0 OID 1.3.6.1.4.1.9.9.383.0.1 1.3.6.1.4.1.9.9.383.1.1.1

Any ideas on how to get rid of those offending lines (in bold: the bold part has newlines, the rest doesn't; sorry for the format) and reattach the latter part of the trap to the beginning? I was thinking I could discard lines with really long words (not sure how to do that), but I can't get my head around reattaching the remainder to the prefix.
Thanks

Replies are listed 'Best First'.
Re: text parsing question by throop (Chaplain) on Oct 01, 2007 at 14:06 UTC
Conceptually, to separate the sheep from the goats, you either need to really know what a sheep looks like, or really know what a goat looks like. Do the traps that you want to keep generally start with a newline? Do all the lines you want start with a time tag? Is there a reliable way to recognize the end of a trap? Do the junk lines have any regularity to them at all? More generally, how good does your filter have to be? How big a deal is it if some of the junk slips thru? How big a deal is it if your filter throws a few real traps away?
Re^2: text parsing question by roboticus (Chancellor) on Oct 01, 2007 at 22:37 UTC
throop: ++, and if I could do it a couple more times, I would! ...roboticus
Re: text parsing question by ikegami (Patriarch) on Oct 01, 2007 at 15:38 UTC
Have you considered fixing the underlying problem: the bug limiting the size of your buffer? Here's a solution if you can't. If you remove the newlines, then you can just search for very long terms and replace them.

    use strict;
    use warnings;

    my $file = '...';
    open(my $fh, '<', $file)
        or die("Unable to open file \"$file\": $!\n");

    my $reader = Reader->new($fh);
    while (defined(my $line = $reader->get_line())) {
        $line =~ s/\S{40,}/.../g;
        print($line);
    }

(The posted code is truncated here; the `Reader` package it relies on is not shown.)

Output:

    09:59:58 09/28/07 1 192.168.0.7 1.3.6.1.2.1.1.3.0 TimeTick 335604562 1.3.6.1.6.3.1.1.4.1.0 OID 1.3.6.1.4.1.9.9.383.0.1 1.3.6.1.4.1.9.9.383.1.1.1 Counter64 1119125906 1.3.6.1.4.1.9.9.383.1.1.2 String 07d7091c09361400 1.3.6.1.4.1.9.9.383.1.2.14 String ... 1.3.6.1.4.1.9.9 String ... 1.3.6.1.4.1.9.9.383.1.2.16 String 192.168.20.60:3089 1.3.6.1.4.1.9.9.383.1.2.17 String osIdSource="unknown" osRelevance="relevant" osType="unknown" 192.168.0.23:139
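To see the substitution on its own, here is a minimal sketch that leaves the `Reader` class out and works on one trap already held in a string. The sub name `shorten_trap` and the shortened sample data are made up for illustration; only the 40-character threshold comes from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the raw text of ONE trap (possibly spread over
# several physical lines) and returns it as a single line, with the long
# encoded blobs replaced by "...".
sub shorten_trap {
    my ($trap) = @_;
    $trap =~ s/\n//g;           # glue the broken lines back together
    $trap =~ s/\S{40,}/.../g;   # any run of 40+ non-space characters is junk
    return $trap;
}

# Made-up sample: the blob is split over two physical lines.
my $raw = "09:59:58 09/28/07 1 192.168.0.7 String "
        . ("A" x 60) . "\n" . ("B" x 30) . "== "
        . "1.3.6.1.4.1.9.9.383.1.2.16 String 192.168.20.60:3089\n";

print shorten_trap($raw), "\n";
```

Removing the newlines first matters: the pieces of a blob then form one unbroken run of non-space characters, so a single `\S{40,}` match swallows the whole thing. The OIDs and addresses in the sample log are all well under 40 characters, so they survive.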
Re^2: text parsing question by perlAffen (Sexton) on Oct 01, 2007 at 18:08 UTC
Can't 'fix' the buffer problem; it is something I can't touch. This works very well. Thanks. Now I need to figure out how it works so I may absorb the wisdom.
Re^3: text parsing question by ikegami (Patriarch) on Oct 01, 2007 at 19:02 UTC
Here are some notes to help you understand `get_line`:

`!defined($buf)` is only true the first time `get_line` is called. It proceeds to trash all the lines before the first timestamped line, if any.

The file is processed as follows:

    File               Read by      Returned by
    -----------------  -----------  -----------
    line               1st call(*)  scrapped
    line               1st call(*)  scrapped
    timestamped line   1st call(*)  1st call
    line               1st call     1st call
    line               1st call     1st call
    timestamped line   1st call     2nd call
    line               2nd call     2nd call
    line               2nd call     2nd call
    timestamped line   2nd call     3rd call
    line               3rd call     3rd call
    line               3rd call     3rd call
    timestamped line   3rd call     4th call
    EOF                4th call     5th call

    (*) By the body of `if (!defined($buf))`.

Between calls, `$buf` contains the line that has been read, but not returned.

`our $var; local *var = \($self->[$idx]);` creates an alias so that any change to `$var` is reflected in `$self->[$idx]`.

`for (;;)` can be read as "for ever". The loop will loop until `last`, `return`, `die`, `exit` or other exceptional means are used to exit it.

`return ((undef, $buf) = ($buf, $line))[0];` is short for `my $temp = $buf; $buf = $line; return $temp;`
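Since the `Reader` code itself is not shown in this thread, here is a rough sketch of a class that behaves the way the notes above describe: it reads one line ahead, keeps the next timestamped line in a buffer between calls, and scraps anything before the first timestamp. This is a guess at the shape, not the posted code. The slot names `IDX_FH` and `IDX_BUF`, the timestamp pattern and the sample log are all assumptions, and it uses plain `$self->[...]` accesses and explicit assignments in place of the alias and swap idioms.

```perl
use strict;
use warnings;

{
    package Reader;

    # Slots in the array-based object (names are assumptions).
    use constant { IDX_FH => 0, IDX_BUF => 1 };

    my $TIMESTAMP = qr/^\d\d:\d\d:\d\d /;

    sub new {
        my ($class, $fh) = @_;
        return bless [ $fh, undef ], $class;
    }

    # Returns the next complete trap as one line, or undef at end of input.
    sub get_line {
        my ($self) = @_;
        my $fh = $self->[IDX_FH];

        # Nothing buffered yet: scrap lines until the first timestamped one.
        if (!defined $self->[IDX_BUF]) {
            while (defined(my $line = <$fh>)) {
                next if $line !~ $TIMESTAMP;
                $self->[IDX_BUF] = $line;
                last;
            }
            return if !defined $self->[IDX_BUF];   # nothing (left) to return
        }

        my $record = $self->[IDX_BUF];
        $self->[IDX_BUF] = undef;
        chomp $record;
        while (defined(my $line = <$fh>)) {
            if ($line =~ $TIMESTAMP) {     # start of the next trap:
                $self->[IDX_BUF] = $line;  # read now, returned by the next call
                last;
            }
            chomp $line;
            $record .= $line;              # continuation line: glue it on
        }
        return "$record\n";
    }
}

# Made-up sample log, read through an in-memory filehandle.
my $log = join '',
    "stray line before any timestamp\n",
    "09:59:58 09/28/07 1 192.168.0.7 String AAAA\n",
    "BBBB== 1.3.6.1 String ok\n",
    "\n",
    "10:00:01 09/28/07 1 192.168.0.7 String done\n";

open my $fh, '<', \$log or die "Cannot open in-memory file: $!";
my $reader = Reader->new($fh);
my @lines;
while (defined(my $line = $reader->get_line())) {
    push @lines, $line;
}
print @lines;
```

Each returned line can then go through the `s/\S{40,}/.../g` substitution before printing.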
Re: text parsing question by jdporter (Paladin) on Oct 01, 2007 at 14:20 UTC
My feeling is that you don't need to worry about it. AFAICT, parsing should be as follows:

Records are "paragraph" delimited; you can read records from the input stream by setting `$/ = '';`

Fields within records are separated by literal space characters: `my @fields = split / /;`

Then some of the fields (the "String" ones) may have newlines in them; those should be easy to find and discard if you want.

A word spoken in Mind will reach its own level, in the objective world, by its own weight
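A small sketch of the paragraph-mode reading described above, assuming the traps really are separated by blank lines in the log. The sub name `read_records` and the two-record sample are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: read blank-line-separated records from a filehandle.
sub read_records {
    my ($fh) = @_;
    local $/ = '';          # paragraph mode: blank line(s) end a record
    my @records;
    while (my $record = <$fh>) {
        chomp $record;      # in paragraph mode this strips all trailing newlines
        push @records, $record;
    }
    return @records;
}

# Made-up sample: the first trap has a field broken across two lines.
my $log = "09:59:58 first trap String AAAA\nBBBB== tail\n\n"
        . "10:00:01 second trap\n";

open my $fh, '<', \$log or die "Cannot open in-memory file: $!";
my @records = read_records($fh);
printf "%d records\n", scalar @records;

# Splitting on a literal space keeps the embedded newline inside its field.
my @fields = split / /, $records[0];
print "field with newline: $_\n" for grep { /\n/ } @fields;
```

Because `split / /` only breaks on spaces, a blob that was wrapped over several lines stays together as one field, newlines included, which is what makes it easy to spot.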
Re^2: text parsing question by perlAffen (Sexton) on Oct 01, 2007 at 15:25 UTC
So once I place all the words of the paragraph in an array, how do I eval each one and rebuild the paragraph or string? I think I know how to do this in bash....

    mynewline=""
    for i in ${fields[@]} ; do
        if ! echo $i | grep -q "newlinechar" ; then
            mynewline="$mynewline $i "
        fi
    done
    mynewline=`echo $mynewline | sed "s/^ //g"`

How to do this in Perl is beyond me at the moment.
Re^3: text parsing question by jdporter (Paladin) on Oct 01, 2007 at 15:35 UTC
In Perl, the opposite of split is join.
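Spelled out, the bash loop above comes down to `split`, `grep` and `join`. A minimal sketch follows; the sub name `drop_multiline_fields` and the sample record are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every space-separated field that contains a
# newline, then glue the survivors back together with single spaces.
sub drop_multiline_fields {
    my ($record) = @_;
    $record =~ s/\n+\z//;                    # the record's own trailing newline(s)
    my @fields = split / /, $record;         # break into words
    my @keep   = grep { !/\n/ } @fields;     # keep words without a newline
    return join ' ', @keep;                  # rebuild the line
}

# Made-up sample: "AAAA\nBBBB==" is the field that was wrapped.
my $record = "09:59:58 09/28/07 String AAAA\nBBBB== 1.3.6.1 String 192.168.20.60:3089\n";
print drop_multiline_fields($record), "\n";
```

The `grep` plays the part of the `if ! echo $i | grep -q ...` test, and `join` puts exactly one space between the kept words, so no `sed` clean-up is needed afterwards.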
Re: text parsing question by snopal (Pilgrim) on Oct 01, 2007 at 14:14 UTC
This may not be optimal for your situation, but it is possible to change your line-break value (`$/`) to something more suitable to your needs:

    # Untested
    {   # localize the change to $/
        local $/ = q{ };    # a space character is the new "newline"
        open my $fh, '<', $logfile_name
            or die "File open failure\n\t$!\n";
        my $buffer = '';
        while (my $word = <$fh>) {
            if ($word =~ /^(\S*)\n+(\d\d:\d\d:\d\d )\z/) {
                # the end of one trap and the timestamp of the next
                # are read as a single "word"; split them apart
                print "$buffer$1\n";
                $buffer = $2;
            }
            elsif ($word !~ /\n./) {    # skip words with embedded newlines
                $buffer .= $word;
            }
        }
        $buffer =~ s/\s+\z//;
        print "$buffer\n" if length $buffer;
        close $fh;
    }
Re: text parsing question by ikegami (Patriarch) on Oct 01, 2007 at 13:55 UTC
You could discard lines that don't start with a timestamp.

    perl -ne "print if /^\d\d:\d\d:\d\d /" log >log.fixed

Update: Oh, never mind, you want to keep the `osIdSource` part. I'll have to get back to you if no one beats me to it.
Re: text parsing question by aquarium (Curate) on Oct 01, 2007 at 15:11 UTC
Is your log file an ASCII/UTF-formatted file? I ask because it looks suspiciously like a binary file that happens to have some ASCII text in it. If it is a binary file, either use the appropriate tool to dump it as text before processing, or proceed with due caution, as regexes and other text-based Perl functions may not work the way you want when working on binary data.

the hardest line to type correctly is: stty erase ^H
Re^2: text parsing question by perlAffen (Sexton) on Oct 01, 2007 at 15:29 UTC
It is pure text. Thanks.