in reply to Re^6: Peculiar Reference To U+00FE In Text::CSV_XS Documentation
in thread Peculiar Reference To U+00FE In Text::CSV_XS Documentation
If U+00AE is just a placeholder for newlines *inside* fields, my proposed solution works fine.
I have thought about BOM handling quite a few times already, and came to the same conclusion every time: the advantage is not worth the performance penalty, which is huge.
Text::CSV_XS is written for sheer speed, and having to check for a BOM at every record start (yes, eventually that is what it turns out to be if one wants to support streams) is not worth it. It is relatively easy to
Any of these would imply a speed penalty, even if I were to allow and implement it. That is because the parser is a state machine, which means that the internal structure would have to change both to allow multi-byte characters and to handle them (first a check at the start of each of them, then a read-ahead to see whether the next byte is part of the "character", and so on). I already allow this for eol, up to 8 characters, which was a pain in the ass to do safely. I'm not saying it is impossible, but I'm not sure it is worth the development time.
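To make the cost concrete, here is a minimal pure-Perl sketch of that "check the first byte, then read ahead" pattern. It is not how Text::CSV_XS is implemented (the real parser is C/XS); `split_on_token` is a hypothetical helper, and the two-byte token is simply the UTF-8 encoding of U+00FE, chosen as an example.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Text::CSV_XS: split a byte string on a
# token that may be longer than one byte.
sub split_on_token {
    my ($line, $token) = @_;
    my $first  = substr $token, 0, 1;   # first byte of the token
    my $len    = length $token;
    my @fields = ("");
    my $i      = 0;
    while ($i < length $line) {
        my $byte = substr $line, $i, 1;
        # Step 1: cheap test against the first byte only.
        # Step 2: read ahead to confirm the rest of the token follows.
        if ($byte eq $first && substr($line, $i, $len) eq $token) {
            push @fields, "";           # token complete: start a new field
            $i += $len;
            next;
        }
        $fields[-1] .= $byte;           # ordinary byte: append to current field
        $i++;
    }
    return @fields;
}

# "\xC3\xBE" is U+00FE encoded as UTF-8 (two bytes).
my @parts = split_on_token("a\xC3\xBEb\xC3\xBEc", "\xC3\xBE");
print join("|", @parts), "\n";
```

Every byte that happens to equal the first byte of the token triggers the extra comparison, even when it turns out to belong to some other character. A single-byte separator never pays that price.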
You can still use Text::CSV_XS if you are sure that there are no U+0014 characters inside fields, but I bet you cannot be (binary fields tend to hold exactly what causes trouble).
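For the case where that guarantee does hold, something along these lines might work. This is only a sketch, and it assumes the layout this thread appears to be about: U+0014 as field separator, U+00FE around every field, and U+00AE standing in for newlines inside fields. The sample record is made up.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Assumed layout: U+0014 separates fields, U+00FE wraps each field,
# U+00AE stands in for an embedded newline.
my $csv = Text::CSV_XS->new({
    binary      => 1,
    sep_char    => "\x14",
    quote_char  => undef,   # no quoting: U+00FE is stripped by hand below
    escape_char => undef,
}) or die "Cannot create parser: " . Text::CSV_XS->error_diag;

# Made-up sample record with two fields.
my $record = "\xFEid\xFE\x14\xFEline one\xAEline two\xFE";
$csv->parse($record) or die "Parse failed: " . $csv->error_diag;

my @fields = map {
    (my $f = $_) =~ s/\A\xFE|\xFE\z//g;  # drop the wrapping U+00FE
    $f =~ tr/\xAE/\n/;                   # placeholder back to newline
    $f;
} $csv->fields;

print scalar(@fields), " fields\n";
```

With quoting switched off the parser only ever sees a single-byte separator, so it stays on its fast path. The price is that a stray U+0014 inside a field silently yields an extra column.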
Re^8: Peculiar Reference To U+00FE In Text::CSV_XS Documentation
by Jim (Curate) on Dec 10, 2012 at 21:24 UTC