Re: text parsing question
by throop (Chaplain) on Oct 01, 2007 at 14:06 UTC
|
Conceptually, to separate the sheep from the goats, you either need to really know what a sheep looks like, or really know what a goat looks like.
- Do the traps that you want to keep generally start with a newline?
- Do all the lines you want start with a time tag?
- Is there a reliable way to recognize the end of a trap?
- Do the junk lines have any regularity to them at all?
More generally, how good does your filter have to be?
- How big a deal is it if some of the junk slips thru?
- How big a deal is it if your filter throws a few real traps away?
throop | [reply] |
Re: text parsing question
by ikegami (Patriarch) on Oct 01, 2007 at 15:38 UTC
|
Have you considered fixing the underlying problem: the bug limiting the size of your buffer? Here's a solution if you can't.
If you remove the newlines, then you can just search for very long terms and replace them.
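use strict;
use warnings;

my $file = '...';
open(my $fh, '<', $file)
    or die("Unable to open file \"$file\": $!\n");

# Reader is a helper class whose definition is not shown in this post;
# get_line() presumably returns one record per call with the newlines removed.
my $reader = Reader->new($fh);
while (defined(my $line = $reader->get_line())) {
    $line =~ s/\S{40,}/.../g;    # replace any run of 40+ non-space characters
    print($line);
}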
Output:
09:59:58 09/28/07 1 192.168.0.7 1.3.6.1.2.1.1.3.0 TimeTick 335604562 1
+.3.6.1.6.3.1.1.4.1.0 OID 1.3.6.1.4.1.9.9.383.0.1 1.3.6.1.4.1.9.9.383.
+1.1.1 Counter64 1119125906 1.3.6.1.4.1.9.9.383.1.1.2 String 07d7091c0
+9361400 1.3.6.1.4.1.9.9.383.1.2.14 String ... 1.3.6.1.4.1.9.9 String
+... 1.3.6.1.4.1.9.9.383.1.2.16 String 192.168.20.60:3089 1.3.6.1.4.1.
+9.9.383.1.2.17 String osIdSource="unknown" osRelevance="relevant" osT
+ype="unknown" 192.168.0.23:139
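The Reader class that the snippet calls is not shown in the post. What follows is only a guess at what such a helper could look like, under the assumption that a record starts on a line beginning with an HH:MM:SS time tag and that every following line without one is a continuation whose line break should simply be dropped. The package body and the clean_log wrapper are hypothetical names made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Reader class used in the post.
# Assumption: a line starting with "HH:MM:SS " begins a new record, and
# any other line continues the current record.
package Reader;

sub new {
    my ($class, $fh) = @_;
    return bless { fh => $fh, pending => undef }, $class;
}

# Return the next complete record (newline-terminated), or undef at EOF.
sub get_line {
    my ($self) = @_;
    my $fh     = $self->{fh};
    my $record = delete $self->{pending};   # start line saved by the last call
    while (defined(my $line = <$fh>)) {
        chomp $line;
        if ($line =~ /^\d\d:\d\d:\d\d /) {
            if (defined $record) {
                $self->{pending} = $line;   # keep it for the next call
                return "$record\n";
            }
            $record = $line;
        }
        elsif (defined $record) {
            $record .= $line;               # glue the continuation on
        }
        # lines seen before the first time tag are ignored
    }
    return defined $record ? "$record\n" : undef;
}

package main;

# Hypothetical wrapper: run the loop from the post over a string of log text.
sub clean_log {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "Unable to open in-memory file: $!\n";
    my $reader = Reader->new($fh);
    my @out;
    while (defined(my $line = $reader->get_line())) {
        $line =~ s/\S{40,}/.../g;           # same substitution as in the post
        push @out, $line;
    }
    return @out;
}
```

Because the continuation lines are joined without a separator, a long value that was chopped into pieces becomes one long run of non-space characters again, which is what lets the \S{40,} substitution catch it.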
| [reply] [d/l] [select] |
|
|
I can't 'fix' the buffer problem; it is something I can't touch. This works very well. Thanks. Now I need to figure out how it works so I may absorb the wisdom.
| [reply] |
Re: text parsing question
by jdporter (Paladin) on Oct 01, 2007 at 14:20 UTC
|
My feeling is that you don't need to worry about it. AFAICT, parsing should be as follows:
- Records are "paragraph" delimited; you can read records from the input stream by setting $/ = '';
- Fields within records are separated by literal space characters: my @fields = split / /;
Then some of the fields (the "String" ones) may have newlines in them; those should be easy to find and discard if you want.
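As a rough sketch of the two steps above: the sample log text is made up for illustration and the read_records name is hypothetical. The code reads records in paragraph mode, splits each on literal spaces, and reports how many fields contain a newline.

```perl
use strict;
use warnings;

# Read paragraph-delimited records from a filehandle and split each one
# on literal spaces. Returns a list of array refs, one per record.
sub read_records {
    my ($fh) = @_;
    local $/ = '';          # paragraph mode: a record ends at blank line(s)
    my @records;
    while (defined(my $record = <$fh>)) {
        chomp $record;      # in paragraph mode this strips all trailing newlines
        push @records, [ split / /, $record ];
    }
    return @records;
}

# Made-up sample: two records, the first with a String value that was
# broken across two lines.
my $log = "09:59:58 09/28/07 String 07d7\n0936 1\n\n10:00:01 09/28/07 String ok 1\n";
open(my $fh, '<', \$log) or die "Unable to open in-memory file: $!\n";

for my $fields (read_records($fh)) {
    my $broken = grep { /\n/ } @$fields;    # count of fields holding a newline
    printf "%d fields, %d containing a newline\n", scalar @$fields, $broken;
}
```

Using local keeps the change to $/ confined to the function, so the rest of the program still reads line by line.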
A word spoken in Mind will reach its own level, in the objective world, by its own weight
| [reply] [d/l] [select] |
|
|
So once I place all the words of the paragraph in an array, how do I evaluate each one and rebuild the paragraph or string? I think I know how to do this in bash....
mynewline=""
for i in ${fields[@]} ; do
if ! echo $i |grep -q "newlinechar" ; then
mynewline="$mynewline $i "
fi
done
mynewline=`echo $mynewline | sed "s/^ //g"`
How to do this in Perl is beyond me at the moment.
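One possible Perl counterpart of the bash loop, offered as a sketch rather than the only way to do it: grep keeps the fields that contain no newline, and join rebuilds the string. The rebuild_record name and the sample record are made up for this example.

```perl
use strict;
use warnings;

# Drop every field that contains a newline and glue the remaining fields
# back together with single spaces.
sub rebuild_record {
    my ($record) = @_;
    my @fields = split / /, $record;        # fields are separated by literal spaces
    my @kept   = grep { !/\n/ } @fields;    # keep only fields without a newline
    return join ' ', @kept;
}

# Made-up sample record with one String value broken across lines.
my $record = "09:59:58 09/28/07 String 07d7\n0936 String ok 1";
print rebuild_record($record), "\n";
```

Since join only puts the separator between elements, there is no leading space to trim afterwards, which is what the final sed step in the bash version was for.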
| [reply] [d/l] |
Re: text parsing question
by snopal (Pilgrim) on Oct 01, 2007 at 14:14 UTC
|
This may not be optimal for your situation, but it is possible to change your line-break value (the input record separator, $/) to something more suitable to your needs:
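# Untested
{   # localize the change to $/
    local $/ = q{ };    # space character is the new "newline"
    open my $fh, '<', $logfile_name
        or die "File open failure\n\t$!\n";
    my $buffer = '';
    while (<$fh>) {
        # Drops any word containing a newline. Note that this includes a
        # time tag that directly follows a line break.
        next if /\n/;
        if (/^\d\d:\d\d:\d\d/) {    # a time tag starts a new record
            print $buffer, "\n" if length $buffer;
            $buffer = $_;
        }
        else {
            $buffer .= $_;          # $_ already ends with the space separator
        }
    }
    print $buffer, "\n" if length $buffer;    # flush the last record
    close $fh;
}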
| [reply] [d/l] |
Re: text parsing question
by ikegami (Patriarch) on Oct 01, 2007 at 13:55 UTC
|
perl -ne "print if /^\d\d:\d\d:\d\d /" log >log.fixed
Update: Oh, never mind, you want to keep the osIdSource part. I'll have to get back to you if no one beats me to it.
| [reply] [d/l] [select] |
Re: text parsing question
by aquarium (Curate) on Oct 01, 2007 at 15:11 UTC
|
| [reply] |