I was never considering single-byte anything. Writing code in Perl means that I don't have to (unlike writing code in XS). Yes, I actually meant what I said. Yes, I realized that your example was using multi-byte single-character tokens.

The reason that single-character vs. multi-character (usually) leads to different approaches is that [^"\\]+ as part of a regex works fine for single-character quote and escape values (respectively) but isn't even close to what you have to do if either of those is multi-character.
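
A minimal sketch of that difference. The tokens, the variable names, and the leading_plain helper are all hypothetical, chosen only for illustration; none of this comes from Text::xSV or any other module.

```perl
use strict;
use warnings;

# Hypothetical tokens, chosen only for illustration.
my $quote  = "\x{201C}";   # one character, three bytes in UTF-8
my $escape = "\x{00AC}";   # one character, two bytes in UTF-8

# Single-character tokens: a negated character class is all it takes,
# no matter how many bytes each character needs when encoded.
my $plain_single = qr/[^\Q$quote$escape\E]+/;

# Multi-character tokens: a character class would exclude each constituent
# character separately, so test for the whole token at every position.
my $mquote  = '<<';
my $mescape = '%%';
my $plain_multi = qr/(?:(?!\Q$mquote\E|\Q$mescape\E).)+/s;

# Hypothetical helper: the run of "plain" text at the start of $text.
sub leading_plain {
    my ($re, $text) = @_;
    return $text =~ /\A($re)/ ? $1 : '';
}
```

With the multi-character tokens, a lone < or % is ordinary data, which is exactly what a character class cannot express.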

And you are quite wrong about:

"One glance at the source code and it's obvious the author doesn't mean single character; he means single byte."

For one, the author of Text::xSV didn't have to think about multi-byte characters. Their module is written in Perl so, unless they do something moderately strange or stupid, multi-byte characters "just work" (provided the user of the module does the little bit of extra work to ensure that Perl has decoded, or will decode, the strings/streams being given to the module).
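
A small sketch of what that "little bit of extra work" buys, assuming the input arrives as UTF-8 bytes. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xE2\x80\x9C";            # raw UTF-8 bytes for U+201C
my $chars = decode('UTF-8', $bytes);   # the same data as one Perl character

# Undecoded, the regex engine sees three separate "characters";
# decoded, it sees one, so a character class treats it as a unit.
my $bytes_is_one_char = $bytes =~ /\A[^"]\z/ ? 1 : 0;
my $chars_is_one_char = $chars =~ /\A[^"]\z/ ? 1 : 0;

# For streams, an I/O layer does the same job while reading, e.g.:
#   open my $fh, '<:encoding(UTF-8)', $path or die "$path: $!";
```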

Looking at the code for Text::xSV in some detail, I see that 90% of the uses of the separator character would work completely fine with a separator that is even composed of more than one multi-byte character. There is one important place where the code would break for a multi-character separator (but that, indeed, continues to work for a separator that is a single multi-byte character):

my $start_field_ms = qr/\G([^"$q_sep]*)/;

Now, fixing the unfortunate hard-coding of the quote character is probably quite a simple task. And that would probably be sufficient to make the module work fine on multi-byte quote characters. Certainly much easier than trying to get multi-byte character support into a much more complex XS module.
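
One way such a fix might look, as a sketch only. make_start_field is a hypothetical helper, not part of Text::xSV, and it assumes the quote and the separator are each exactly one character (of any byte length), since both go into a character class.

```perl
use strict;
use warnings;

# Hypothetical helper: build the "start of field" pattern from a
# configurable quote character instead of a hard-coded ".
sub make_start_field {
    my ($quote, $sep) = @_;
    my ($q_quote, $q_sep) = map { quotemeta($_) } ($quote, $sep);
    return qr/\G([^$q_quote$q_sep]*)/;
}

# Usage: capture the text up to the first quote or separator.
my $start_field = make_start_field("\x{201C}", "\x{2016}");
my ($field) = "foo\x{2016}bar" =~ $start_field;
```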

"Why not?"

Because you haven't done the tiny bit of work to fix Text::xSV? Or the small amount of work to write a simple CSV parser in Perl?

No matter. I'm almost done writing my new CSV module.

- tye        


In reply to Re^9: Speeds vs functionality (utf8 csv) by tye
in thread Speeds vs functionality by Tux
