perlAffen has asked for the wisdom of the Perl Monks concerning the following question:

I have a log file (of SNMP traps) that should have all the info for each trap on a single line; however, some pieces of the trap (varbinds) contain newlines that make the data hard to interpret. I was thinking I could strip all newlines and then insert a new one at the date, since the date string indicates a new entry. However, the junk with newlines is so verbose that I need to discard it, because it makes the line exceed the buffer I am working with. I will try to insert a sample here. The junk that looks like MIME-encoded data is the culprit, with the newlines; traps without this junk are all on one line, but vary in format.

09:59:58 09/28/07 1 192.168.0.7 1.3.6.1.2.1.1.3.0 TimeTick 335604562 1.3.6.1.6.3.1.1.4.1.0 OID 1.3.6.1.4.1.9.9.383.0.1 1.3.6.1.4.1.9.9.383.1.1.1 Counter64 1119125906 1.3.6.1.4.1.9.9.383.1.1.2 String 07d7091c09361400 1.3.6.1.4.1.9.9.383.1.2.14 String AC4AMAAAAFcAaQBuAGQAbwB3AHMAIAAyADAAMAAwACAATABBAE4AIABNAGEA
bgBhAGcAZQByAAAAAAAAOP9TTUJ1AAAAAJgHyAAAAAAAAAAAAAAAAAIg//4B
MMCNB/8AOAABAP8BAAD/AQAABwBJUEMAAAAAAAAAh/9TTUKiAAAAAJgHyAAA
AAAAAAAAAAAAAAIgNA8BMACOKv8AhwAADcABAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAACAAAAAABAAAAAAAAAAAAAAAAAAAAIA/wUAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAmwESAJsBEgAAAA== 1.3.6.1.4.1.9.9 String aQBuAGQAbwB3AHMAIAAyADAAMAAyACAANQAuADEAAAAAAAAAAE7/U01CdQAF
AAAYB8gAAAAAAAAAAAAAAAAAAP/+ATDAjQT/AE4ACAABACMAAFwAXABWAEUA
TgBTAE4AQQBcAEkAUABDACQAAAA/Pz8/PwAAAABk/1NNQqIAAAAAGAfIAAAA
AAAAAAAAAAAAAiA0DwEwAI4Y/wDe3gAOABYAAAAAAAAAnwECAAAAAAAAAAAA
AAAAAAMAAAABAAAAQABAAAIAAAABEQAAXABzAHIAdgBzAHYAYwAAAAAAAIj/
U01CLwAAAAAYB8gAAAAAAAAAAAAAAAACIP/+ATBAjg== 1.3.6.1.4.1.9.9.383.1.2.16 String 192.168.20.60:3089 1.3.6.1.4.1.9.9.383.1.2.17 String osIdSource="unknown" osRelevance="relevant" osType="unknown" 192.168.0.23:139

09:59:58 09/28/07 1 192.168.0.7 1.3.6.1.2.1.1.3.0 TimeTick 335604562 1.3.6.1.6.3.1.1.4.1.0 OID 1.3.6.1.4.1.9.9.383.0.1 1.3.6.1.4.1.9.9.383.1.1.1

Any ideas on how to get rid of those offending lines (in bold; the bold part has newlines, the rest doesn't; sorry for the format) and reattach the latter part of the trap to the beginning? I was thinking I could discard lines with really long words (not sure how to do that), but I can't get my head around reattaching the remainder to the prefix.
Thanks

Replies are listed 'Best First'.
Re: text parsing question
by throop (Chaplain) on Oct 01, 2007 at 14:06 UTC
    Conceptually, to separate the sheep from the goats, you either need to really know what a sheep looks like, or really know what a goat looks like.
    • Do the traps that you want to keep generally start with a newline?
    • Do all the lines you want start with a time tag?
    • Is there a reliable way to recognize the end of a trap?
    • Do the junk lines have any regularity to them at all?
    More generally, how good does your filter have to be?
    • How big a deal is it if some of the junk slips thru?
    • How big a deal is it if your filter throws a few real traps away?
    throop
      throop:

      ++, and if I could do it a couple more times, I would!

      ...roboticus

Re: text parsing question
by ikegami (Patriarch) on Oct 01, 2007 at 15:38 UTC

    Have you considered fixing the underlying problem: the bug limiting the size of your buf? Here's a solution if you can't.

    If you remove the newlines, then you can just search for very long terms and replace them.

    use strict;
    use warnings;

    my $file = '...';

    open(my $fh, '<', $file)
        or die("Unable to open file \"$file\": $!\n");

    my $reader = Reader->new($fh);
    while (defined(my $line = $reader->get_line())) {
        $line =~ s/\S{40,}/.../g;
        print($line);
    }

    Output:

    09:59:58 09/28/07 1 192.168.0.7 1.3.6.1.2.1.1.3.0 TimeTick 335604562 1.3.6.1.6.3.1.1.4.1.0 OID 1.3.6.1.4.1.9.9.383.0.1 1.3.6.1.4.1.9.9.383.1.1.1 Counter64 1119125906 1.3.6.1.4.1.9.9.383.1.1.2 String 07d7091c09361400 1.3.6.1.4.1.9.9.383.1.2.14 String ... 1.3.6.1.4.1.9.9 String ... 1.3.6.1.4.1.9.9.383.1.2.16 String 192.168.20.60:3089 1.3.6.1.4.1.9.9.383.1.2.17 String osIdSource="unknown" osRelevance="relevant" osType="unknown" 192.168.0.23:139
      I can't 'fix' the buffer problem; it is something I can't touch. This works very well. Thanks. Now I need to figure out how it works so I can absorb the wisdom.

        Here are some notes to help you understand get_line:

        • !defined($buf) is only true the first time get_line is called. It proceeds to trash all the lines before the first timestamped line, if any.

        • The file is processed as follows:

          File               Read by       Returned by
          -----------------  ------------  -----------
          line               1st call(*)   scrapped
          line               1st call(*)   scrapped
          timestamped line   1st call(*)   1st call
          line               1st call      1st call
          line               1st call      1st call
          timestamped line   1st call      2nd call
          line               2nd call      2nd call
          line               2nd call      2nd call
          timestamped line   2nd call      3rd call
          line               3rd call      3rd call
          line               3rd call      3rd call
          timestamped line   3rd call      4th call
          EOF                4th call      5th call

          (*) By the body of if (!defined($buf)).

        • Between calls, $buf contains the line that has been read, but not returned.

        • our $var; local *var = \($self->[$idx]);
          creates an alias so that any change to $var is reflected in $self->[$idx].

        • for (;;) can be read as "for ever". The loop will loop until last, return, die, exit or other exceptional means are used to exit it.

        • return ((undef, $buf) = ($buf, $line))[0];
          is short for
          my $temp = $buf;
          $buf = $line;
          return $temp;
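
The Reader class that these notes describe is not shown in the thread. The following is only a rough reconstruction from the notes above, not ikegami's actual code: the index constants IDX_FH and IDX_BUF, the timestamp pattern (borrowed from the one-liner elsewhere in the thread), and the choice to join continuation lines with no separator are all assumptions. It uses the explicit temporary-variable form rather than the list-assignment shorthand.

```perl
use strict;
use warnings;

package Reader;

# Hypothetical slot numbers for the array-based object.
use constant { IDX_FH => 0, IDX_BUF => 1 };

sub new {
    my ($class, $fh) = @_;
    return bless [ $fh, undef ], $class;
}

# Returns the next trap as one line, or undef when the input is exhausted.
sub get_line {
    my ($self) = @_;
    my $fh = $self->[IDX_FH];

    # Alias $buf to the object's buffer slot, so that changes to $buf
    # persist between calls.
    our $buf; local *buf = \($self->[IDX_BUF]);

    if (!defined($buf)) {
        # Discard everything before the first timestamped line.
        for (;;) {
            $buf = <$fh>;
            return if !defined($buf);    # EOF: nothing left to return
            chomp($buf);
            last if $buf =~ /^\d\d:\d\d:\d\d /;
        }
    }

    for (;;) {
        my $line = <$fh>;
        chomp($line) if defined($line);

        # EOF or the start of the next trap: hand back what has been
        # accumulated and remember the line just read (undef at EOF).
        if (!defined($line) || $line =~ /^\d\d:\d\d:\d\d /) {
            my $temp = $buf;
            $buf = $line;
            return "$temp\n";
        }

        # Continuation line: glue it on without a separator so the
        # encoded junk becomes one long word.
        $buf .= $line;
    }
}

package main;
```

With this in the same file as the script above, every line returned by get_line starts with a timestamp, and the newline-broken junk has been joined into single words of 40 or more characters that s/\S{40,}/.../g can then replace.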

Re: text parsing question
by jdporter (Paladin) on Oct 01, 2007 at 14:20 UTC

    My feeling is that you don't need to worry about it. AFAICT, parsing should be as follows:

    1. Records are "paragraph" delimited; you can read records from the input stream by setting $/ = '';
    2. Fields within records are separated by literal space characters: my @fields = split / /;
    Then some of the fields (the "String" ones) may have newlines in them; those should be easy to find and discard if you want.

    A word spoken in Mind will reach its own level, in the objective world, by its own weight
      So once I place all the words of the paragraph in an array, how do I evaluate each one and rebuild the paragraph or string? I think I know how to do this in bash:
      mynewline=""
      for i in ${fields[@]} ; do
          if ! echo $i | grep -q "newlinechar" ; then
              mynewline="$mynewline $i "
          fi
      done
      mynewline=`echo $mynewline | sed "s/^ //g"`

      How to do this in perl is beyond me at the moment.
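
A minimal Perl sketch of the same loop, following jdporter's outline, might look like the following. The helper name clean_record is made up for illustration, and the sketch assumes that records really are separated by blank lines and that only the unwanted fields contain embedded newlines.

```perl
use strict;
use warnings;

# Takes one record as read in paragraph mode and returns it as a single
# line with every newline-containing field dropped.
# (clean_record is a hypothetical helper name.)
sub clean_record {
    my ($record) = @_;
    $record =~ s/\n+\z//;                 # remove the record terminator
    my @fields = split / /, $record;      # fields are space separated
    my @kept   = grep { !/\n/ } @fields;  # keep fields without a newline
    return join(' ', @kept);
}

# Typical use, assuming $fh is an open handle on the log:
#     local $/ = '';                      # paragraph mode
#     while (my $record = <$fh>) {
#         print clean_record($record), "\n";
#     }
```

Here grep plays the role of the bash if/grep test, and join rebuilds the line without the leading-space cleanup that the sed call was doing.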
Re: text parsing question
by snopal (Pilgrim) on Oct 01, 2007 at 14:14 UTC

    This may not be optimal for your situation, but it is possible to change your linebreak value to something more suitable to your needs:

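    #Untested
    { # localize the block
        local $/ = q{ }; # space character is the new newline
        my $fh;
        unless (open $fh, "<".$logfile_name) {
            die ("File open failure\n\t$!\n");
        }
        my $buffer;
        while (<$fh>) {
            # Skip any chunk that contains a newline. Note that the last
            # word of one trap and the timestamp of the next are separated
            # only by newlines, so they arrive as one chunk and are skipped
            # too; that case still needs separate handling.
            next if /\n/;
            if (/^\d\d:\d\d:\d\d/) {
                print $buffer." ";
                $buffer = $_;
            }
            else {
                $buffer .= $_." ";
            }
        }
        print $buffer;
        close $fh;
    }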
Re: text parsing question
by ikegami (Patriarch) on Oct 01, 2007 at 13:55 UTC

    You could discard lines that don't start with a timestamp.

    perl -ne "print if /^\d\d:\d\d:\d\d /" log >log.fixed

    Update: Oh, never mind, you want to keep the osIdSource part. I'll have to get back to you if no one beats me to it.

Re: text parsing question
by aquarium (Curate) on Oct 01, 2007 at 15:11 UTC
    Is your log file an ASCII/UTF formatted file? It looks suspiciously like a binary file that happens to have some ASCII text in it. If it is a binary file, either use the appropriate tool to dump it as text before processing, or proceed with due caution, as regexes and other text-based Perl functions may not work the way you want on binary data.
    the hardest line to type correctly is: stty erase ^H
      It is pure text. Thanks.