Data-munching monks,

there's been a lot of talk about how to parse given logfile formats -- but my question today is: What's the best way to log a set of fields per line in a reasonably readable format so that they can be parsed efficiently later?

Text::CSV comes to mind. Logging fields comma-separated, escaping comma-containing fields with double-quotes ("...") and escaping quotes by doubling them ("") works nicely:

use Text::CSV;

my $csv = Text::CSV->new();

while(<DATA>) {
    $csv->parse($_) or die "Parse error";
    my @cols = $csv->fields();
    print join('/', @cols), "\n";
}

__DATA__
foo,bar,baz
foo,"bar baz","foo ""the bar"" baz"

However, Text::CSV is slow and chokes on special characters (try parsing "fü,bär,bäz").

Being able to choose the log format is certainly an advantage. How about the following: separate fields by spaces, wrap fields containing spaces in double quotes ("..."), escape literal quotes with a backslash (\") and escape backslashes with another backslash (\\). The following code parses this format:

while(<DATA>) {
    my @columns;
    while(/("(?:\\\\|\\"|.)*?")|    # "quoted"
           (\S+)                    # unquoted
          /gx) {
        my $match;
        if(defined $1) {            # matched quoted
            $match = $1;
            $match =~ s/^"//;       # remove opening "
            $match =~ s/"$//;       # remove closing "
        } else {
            $match = $2;
        }
        # Unescape in a single pass (\\ -> \ and \" -> "), so that a
        # backslash produced by one rule is never re-read by the other.
        $match =~ s#\\([\\"])#$1#g;
        push(@columns, $match);     # store
    }
    print join('/', @columns), "\n";
}

__DATA__
foo bar baz
foo "bar \\ baz" "foo \"the bar\" baz"
fü bär bäz

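The logging side of the same format is not shown above, so here is a minimal sketch of it. The names format_field and format_line are made up for illustration, and the sketch assumes that field values never contain newlines, since the parser reads one record per line.

```perl
use strict;
use warnings;

# Hypothetical helper: encode a single field for the space-separated format.
sub format_field {
    my ($field) = @_;

    # Escape backslashes and double quotes with a leading backslash.
    (my $escaped = $field) =~ s/([\\"])/\\$1/g;

    # Quote only when needed: empty fields, or fields containing
    # whitespace or a double quote. Everything else is logged bare.
    return ($field eq '' || $field =~ /[\s"]/) ? qq{"$escaped"} : $escaped;
}

# Hypothetical helper: encode a list of fields as one log line.
# Assumption: no field contains a newline.
sub format_line {
    return join(' ', map { format_field($_) } @_) . "\n";
}

print format_line('foo', 'bar \\ baz', 'foo "the bar" baz');
```

Quoting only when necessary keeps the common case (plain words) as readable as the unquoted sample lines, at the cost of one regex test per field.
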
I bet there are plenty of other ways to do this -- what's the most efficient one? My conditions are that the format is readable (so no \0 as field separator) and that the separator can show up in a field value.

Who's got the most efficient format/algorithm?


In reply to Who beats Text::CSV? by saintmike
