saintmike has asked for the wisdom of the Perl Monks concerning the following question:

Data-munching monks,

there's been a lot of talk about how to parse given logfile formats -- but my question today is: what's the best way to log a set of fields per line in a somewhat readable format, so that they can be parsed efficiently later?

Text::CSV comes to mind. Logging fields comma-separated, wrapping comma-containing fields in double quotes ("...") and escaping embedded quotes by doubling them ("") works nicely:

    use Text::CSV;

    my $csv = Text::CSV->new();

    while(<DATA>) {
        $csv->parse($_) or die "Parse error";
        my @cols = $csv->fields();
        print join('/', @cols), "\n";
    }

    __DATA__
    foo,bar,baz
    foo,"bar baz","foo ""the bar"" baz"
However, Text::CSV is slow and chokes on special characters (try parsing "fü,bär,bäz").

Being able to determine the log format is certainly an advantage, so how about the following: separate fields by single spaces, put fields that contain whitespace in double quotes ("..."), and escape embedded quotes (\") and literal backslashes (\\) with a backslash. The following code parses this format:

    while(<DATA>) {
        my @columns;
        while(/("(?:\\\\|\\"|.)*?")    # "quoted"
              |(\S+)                   # unquoted
              /gx) {
            my $match;
            if(defined $1) {           # matched quoted
                $match = $1;
                $match =~ s/^"//;      # remove opening "
                $match =~ s/"$//;      # remove closing "
            } else {
                $match = $2;
            }
            $match =~ s#\\\\#\\#g;     # \\ -> \
            $match =~ s#\\"#"#g;       # \" -> "
            push(@columns, $match);    # store
        }
        print join('/', @columns), "\n";
    }

    __DATA__
    foo bar baz
    foo "bar \\ baz" "foo \"the bar\" baz"
    fü bär bäz
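For completeness, here's a sketch of a matching writer (a hypothetical log_line() helper, untested) that applies the same escaping before the fields hit the log:

    # hypothetical helper: escape backslashes and quotes, then
    # quote any field that contains whitespace (or is empty)
    sub log_line {
        my @fields = @_;
        return join(' ', map {
            my $f = $_;
            $f =~ s/\\/\\\\/g;    # \ -> \\
            $f =~ s/"/\\"/g;      # " -> \"
            $f = qq("$f") if $f =~ /\s/ || $f eq '';
            $f;
        } @fields) . "\n";
    }

    print log_line('foo', 'bar baz', 'foo "the bar" baz');
    # prints: foo "bar baz" "foo \"the bar\" baz"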
I bet there are plenty of other ways to do this -- what's the most efficient one? My conditions are that the format stays readable (so no \0 as the field separator) and that the separator may show up in field values.

Who's got the most efficient format/algorithm?

Re: Who beats Text::CSV?
by jZed (Prior) on Jun 19, 2004 at 07:11 UTC
    I think you'll find the answer here. Text::CSV_XS is much faster than Text::CSV and handles "special" characters just fine as long as the binary flag is set.
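    An untested sketch of what I mean -- the same loop as in the question, but with the XS module and the binary flag set:

        use Text::CSV_XS;

        # binary => 1 lets bytes like "fü,bär,bäz" through unharmed
        my $csv = Text::CSV_XS->new({ binary => 1 });

        while(<DATA>) {
            chomp;    # strip the trailing newline before parsing
            $csv->parse($_) or die "Parse error: " . $csv->error_input();
            print join('/', $csv->fields()), "\n";
        }

        __DATA__
        fü,bär,bäz
        foo,"bar baz","foo ""the bar"" baz"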
Re: Who beats Text::CSV?
by tachyon (Chancellor) on Jun 19, 2004 at 10:30 UTC

    I prefer TAB separated files. You can *almost* always do $field =~ s/[\n\r\t]/ /g, join the fields with tabs, and output them. The need for TABs and NEWLINEs in data fields is far rarer than the need for commas, so you lose a lot of complexity: there is no need for quoting or parsing. On reading, a simple split "\t" on the line gives you your fields -- I dare say it does not get any faster.
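    A quick sketch of that round trip (untested, field values invented):

        my @fields = ("foo", "bar\tbaz", "multi\nline");

        # write: squash any embedded separators, then join on tabs
        my $line = join "\t", map { (my $f = $_) =~ s/[\n\r\t]/ /g; $f } @fields;
        print "$line\n";

        # read: a plain split, no quoting rules needed
        my @cols = split "\t", $line;
        print join('/', @cols), "\n";    # foo/bar baz/multi line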

    Hand edits on TAB files can cause issues: by default the tab char is 'invisible', so a file can look OK after a hand edit but be broken. One advantage is that fields may line up (then again, it does not take much difference in field width before they won't). Tab-separated files import into spreadsheets just as easily as CSV. As always, YMMV.

    If you *need* more speed and want to use CSV look to a C/XS based module. Text::CSV_XS or similar.

    cheers

    tachyon

Re: Who beats Text::CSV?
by biosysadmin (Deacon) on Jun 19, 2004 at 23:01 UTC
    Like anything, this depends on your data. If your data set contains a lot of commas, then using a comma as the delimiter is not a very good idea. I generally use tab- or whitespace-delimited text files for quick and easy data input/output (which is part of the reason why tab-delimited text files are very popular in Bioinformatics).

    Is there a specific class or type of logging that you're doing? Text::CSV_XS seems like a good fit, especially if you're doing system-level logging of network, user, and filesystem accounting. Using an XS module for data input and output is fast, and it could easily be extended to use your delimiter of choice.
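    For example, an untested sketch (field values invented) of the same module writing tab-delimited lines instead of CSV:

        use Text::CSV_XS;

        # sep_char switches the delimiter; binary keeps non-ASCII bytes safe
        my $tsv = Text::CSV_XS->new({ sep_char => "\t", binary => 1 });

        # combine() quotes a field only if it actually needs it
        $tsv->combine('host1', 'alice', 'read /var/log') or die "combine failed";
        print $tsv->string(), "\n";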