Data-munching monks,

there's been a lot of talk about how to parse given logfile formats -- but my question today is: what's the best way to log a set of fields per line in a reasonably readable format, so that they can be parsed efficiently later?

Text::CSV comes to mind. Logging fields comma-separated, escaping comma-containing fields with double-quotes ("...") and escaping quotes by doubling them ("") works nicely:

    use Text::CSV;
    my $csv = Text::CSV->new();
    while(<DATA>) {
        $csv->parse($_) or die "Parse error";
        my @cols = $csv->fields();
        print join('/', @cols), "\n";
    }
    __DATA__
    foo,bar,baz
    foo,"bar baz","foo ""the bar"" baz"

However, Text::CSV is slow and chokes on special characters (try parsing "fü,bär,bäz").

Being able to choose the log format ourselves is certainly an advantage, so how about the following: separate fields by spaces, wrap fields that contain spaces in double quotes ("..."), escape literal quotes with a backslash (\") and escape backslashes with another backslash (\\). The following code parses this format:

    while(<DATA>) {
        my @columns;
        while(/("(?:\\\\|\\"|.)*?")| # "quoted"
                (\S+)                 # unquoted
               /gx) {
            my $match;
            if(defined $1) {         # matched quoted
                $match = $1;
                $match =~ s/^"//;    # remove opening "
                $match =~ s/"$//;    # remove closing "
            } else {
                $match = $2;
            }
            $match =~ s#\\(["\\])#$1#g;  # \\ -> \ and \" -> " in one pass
            push(@columns, $match);  # store
        }

        print join('/', @columns), "\n";
    }

    __DATA__
    foo bar baz
    foo "bar \\ baz" "foo \"the bar\" baz"
    fü bär bäz
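For completeness, the writing side of this format could look roughly like the sketch below. The helper name `format_log_line` is made up for illustration, and the sketch assumes that field values never contain newlines (one record per line).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the parser above): turn a list of
# field values into one log line in the quoted, space-separated format.
# Assumption: field values contain no newlines.
sub format_log_line {
    my @fields = @_;
    my @out;
    for my $field (@fields) {
        my $f = $field;
        $f =~ s/(["\\])/\\$1/g;    # escape " and \ with a backslash
        # Quote empty fields and fields containing whitespace, so the
        # parser can tell them apart from field separators.
        $f = qq("$f") if $f eq '' || $f =~ /\s/;
        push @out, $f;
    }
    return join(' ', @out);
}

# prints: foo "bar \\ baz" "foo \"the bar\" baz"
print format_log_line('foo', 'bar \\ baz', 'foo "the bar" baz'), "\n";
```

The printed line is identical to the second sample line in the `__DATA__` section above, so the parser should turn it back into the original three values.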

I bet there are plenty of other ways to do this -- what's the most efficient one? My conditions are that the format is readable (so no \0 as field separator) and that the separator may show up in a field value.

Who's got the most efficient format/algorithm?
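To compare candidates, something like the following sketch with the core Benchmark module might serve as a starting point. `parse_quoted` wraps the regex tokenizer from above in a sub; `parse_split` is a made-up baseline that ignores quoting entirely, so it is not a valid parser for this format and only gives a rough lower bound for splitting a line. The iteration count is arbitrary.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The regex tokenizer from above, wrapped in a sub (name is hypothetical).
sub parse_quoted {
    my ($line) = @_;
    my @columns;
    while ($line =~ /("(?:\\\\|\\"|.)*?")|(\S+)/g) {
        # Quoted match: drop the surrounding quotes; otherwise take as-is.
        my $match = defined $1 ? substr($1, 1, -1) : $2;
        $match =~ s/\\(["\\])/$1/g;    # unescape \" and \\ in one pass
        push @columns, $match;
    }
    return @columns;
}

# Baseline only: splits on whitespace and does NOT honor quotes or escapes.
sub parse_split {
    my ($line) = @_;
    return split ' ', $line;
}

# Single-quoted heredoc, so the backslashes are taken literally.
my $line = <<'END';
foo "bar \\ baz" "foo \"the bar\" baz"
END
chomp $line;

# Run each candidate a fixed number of times and print a comparison table.
cmpthese(200_000, {
    regex => sub { my @c = parse_quoted($line) },
    split => sub { my @c = parse_split($line) },
});
```

Any other candidate parser can be added as another entry in the hash passed to `cmpthese`, as long as it is fed the same input line.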

In reply to Who beats Text::CSV? by saintmike
