saintmike has asked for the wisdom of the Perl Monks concerning the following question:
There's been a lot of talk about how to parse given logfile formats, but my question today is: what's the best way to log a set of fields per line in a somewhat readable format so that they can be parsed efficiently later?
Text::CSV comes to mind. Logging fields comma-separated, escaping comma-containing fields with double quotes ("...") and escaping quotes by doubling them ("") works nicely:

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new();

while (<DATA>) {
    chomp;
    $csv->parse($_) or die "Parse error";
    my @cols = $csv->fields();          # fields of the line just parsed
    print join('/', @cols), "\n";
}

__DATA__
foo,bar,baz
foo,"bar baz","foo ""the bar"" baz"
```

However, Text::CSV is slow and chokes on special characters (try parsing "fü,bär,bäz").
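Since the question is about writing the log as much as reading it, here is a minimal sketch of the quoting rules described above in plain Perl. The helper name `csv_line` is made up for illustration and is not part of Text::CSV; it only covers the comma, double-quote, and line-break cases.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::CSV): turn a list of field
# values into one CSV log line using the rules described above.
sub csv_line {
    my @fields = @_;                    # work on a copy of the arguments
    for my $f (@fields) {
        # Only fields containing a comma, a quote, or a line break need quoting.
        next unless $f =~ /[",\r\n]/;
        (my $q = $f) =~ s/"/""/g;       # escape quotes by doubling them
        $f = qq{"$q"};                  # wrap the field in double quotes
    }
    return join(',', @fields);
}

print csv_line('foo', 'bar baz', 'foo "the bar" baz'), "\n";
print csv_line('a,b', 'c'), "\n";
```

A field with only spaces in it is left unquoted here, which any CSV parser should still read back as a single field.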
Being able to determine the log format yourself is certainly an advantage. How about the following: separate fields by spaces, escape literal spaces with a backslash (\ ) and escape backslashes with another backslash (\\). The following code parses a quoted variant of this format, in which a field containing spaces is wrapped in double quotes and embedded quotes and backslashes are escaped with a backslash (\" and \\):

```perl
use strict;
use warnings;

while (<DATA>) {
    my @columns;
    while (/("(?:\\\\|\\"|.)*?")    # "quoted"
            |
            (\S+)                   # unquoted
           /gx) {
        my $match;
        if (defined $1) {           # matched quoted
            $match = $1;
            $match =~ s/^"//;       # remove opening "
            $match =~ s/"$//;       # remove closing "
        } else {
            $match = $2;
        }
        # Unescape \\ -> \ and \" -> " in a single pass, so that an
        # already-unescaped backslash is never treated as an escape again.
        $match =~ s#\\(["\\])#$1#g;
        push(@columns, $match);     # store
    }
    print join('/', @columns), "\n";
}

__DATA__
foo bar baz
foo "bar \\ baz" "foo \"the bar\" baz"
fü bär bäz
```

I bet there are plenty of other ways to do this. What's the most efficient one? My conditions are that the format is readable (so no \0 as field separator) and that the separator may show up in a field value.
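For comparison, the unquoted format from the description (spaces written as `\ `, backslashes as `\\`) can be sketched roughly as below. The helper names `encode_line` and `decode_line` are invented for this example, and the sketch assumes field values are non-empty and contain no whitespace other than plain spaces.

```perl
use strict;
use warnings;

# Hypothetical helpers for the space-separated format described above:
# a literal space is written as "\ " and a literal backslash as "\\".

sub encode_line {
    my @fields = @_;                    # copy, so the caller's values stay untouched
    s/([\\ ])/\\$1/g for @fields;       # put a backslash before each backslash or space
    return join(' ', @fields);
}

sub decode_line {
    my ($line) = @_;
    my @fields;
    # A field is a run of escaped characters (\x) or plain characters
    # that are neither whitespace nor a backslash.
    while ($line =~ /((?:\\.|[^\\\s])+)/g) {
        (my $field = $1) =~ s/\\(.)/$1/g;   # drop the escaping backslashes in one pass
        push @fields, $field;
    }
    return @fields;
}

my $line = encode_line('foo', 'bar baz', 'C:\\temp');
print "$line\n";
print join('/', decode_line($line)), "\n";
```

An empty field has no representation in this format, so a logger would need a placeholder such as `-` for missing values; that choice is left open here.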
Who's got the most efficient format/algorithm?
Replies are listed 'Best First'.

Re: Who beats Text::CSV?
by jZed (Prior) on Jun 19, 2004 at 07:11 UTC

Re: Who beats Text::CSV?
by tachyon (Chancellor) on Jun 19, 2004 at 10:30 UTC

Re: Who beats Text::CSV?
by biosysadmin (Deacon) on Jun 19, 2004 at 23:01 UTC