Re^2: utf8, locale and regexp

The correct way of handling encodings in Perl is not caring about. If you're caring too much, you're doing the wrong way...

I wish I could agree with this statement... but I'm afraid I can't.

During the last few months at work, I've been involved in a number of Perl projects in Japanese and Chinese environments, where correct handling of encodings is of paramount importance (in particular on Windows, with its unholy mixture of encodings, like UCS-2, UTF-8 and various legacy codepages). During that time, I've run into several encoding issues where you just have to "care too much" (to use your words), or else things simply won't work.

For one, Perl doesn't (yet) provide any convenient abstraction layer for handling file names (as opposed to file contents), which means you have to take care of everything yourself (by writing wrapper functions, using Encode::(en|de)code explicitly, etc.). In case you're interested in the details, look here for the kind of things I have in mind.
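A minimal sketch of what such wrapper functions might look like. The helper names (fs_encode, fs_decode, open_named, read_dir_names) and the $FS_ENCODING setting are hypothetical, made up for illustration; the right encoding depends on the platform (a Japanese Windows box would more likely need something like cp932 than UTF-8).

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# ASSUMPTION: the byte encoding the OS expects for file names.
# Adjust per platform; this value is only an example.
my $FS_ENCODING = 'UTF-8';

# Character string -> bytes suitable for open(), opendir(), etc.
# LEAVE_SRC keeps Encode from modifying the caller's string when a
# CHECK value such as FB_CROAK is given.
sub fs_encode {
    my ($name) = @_;
    return encode($FS_ENCODING, $name,
                  Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Bytes as returned by readdir(), glob(), etc. -> character string.
sub fs_decode {
    my ($bytes) = @_;
    return decode($FS_ENCODING, $bytes,
                  Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Open a file whose name is given as a character string.
sub open_named {
    my ($mode, $name) = @_;
    open(my $fh, $mode, fs_encode($name))
        or die "Cannot open file: $!";
    return $fh;
}

# List a directory, returning the entries as character strings.
sub read_dir_names {
    my ($dir) = @_;
    opendir(my $dh, fs_encode($dir))
        or die "Cannot open directory: $!";
    my @names = map  { fs_decode($_) }
                grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);
    return @names;
}
```

The point is simply that every call that takes or returns a file name has to go through one of these by hand; nothing in the core does it for you.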

This isn't the only problem, though. There are a few "borderline" bugs, like the one I posted recently, in the hope of getting some feedback on whether other people would also consider this a bug. (Didn't work out, btw. Not a single reply -- which makes me conclude that, with respect to Unicode issues, there's not exactly an overwhelming amount of interest in the Perl community. Kind of a pity, but such is life.) Anyway, what I mean to say is that having to figure out that you need to specify :raw:encoding(ucs-2le):crlf:utf8 to read/write ordinary UCS-2 files (as frequently encountered on Windows platforms) is just a bit too much "having to care too much" for my taste... Not to forget the bug revealed in this thread, and other oddities related to subtle differences between use utf8 and use encoding 'utf8', for example.
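For illustration, here is a small sketch of how that layer stack would be used with open(). The layer string is the one quoted above; the helper names and the temporary file are assumptions made up for the example. Note that no BOM is written or stripped here, whereas real-world Windows files often start with one.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# The layer stack from the post, applied bottom-up:
#   :raw                 plain bytes, no platform default layers
#   :encoding(ucs-2le)   UCS-2LE bytes <-> characters
#   :crlf                "\n" <-> "\r\n", done on the character side
#   :utf8                flag the topmost layer as carrying characters
my $LAYERS = ':raw:encoding(ucs-2le):crlf:utf8';

# Hypothetical helper: write lines of text as a UCS-2LE file with CRLF.
sub write_ucs2_file {
    my ($path, @lines) = @_;
    open(my $out, ">$LAYERS", $path) or die "Cannot write $path: $!";
    print {$out} "$_\n" for @lines;
    close($out) or die "Cannot close $path: $!";
    return;
}

# Hypothetical helper: read such a file back into a list of lines.
sub read_ucs2_file {
    my ($path) = @_;
    open(my $in, "<$LAYERS", $path) or die "Cannot read $path: $!";
    my @lines = <$in>;
    close($in);
    chomp @lines;
    return @lines;
}

# Round trip through a temporary file (removed again at exit).
my $dir  = tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($dir, 'sample.txt');
write_ucs2_file($path, "\x{65E5}\x{672C}", 'abc');
my @back = read_ucs2_file($path);
```

The order matters: :crlf has to sit above the encoding layer, because in UCS-2 the line ending is the four bytes 0D 00 0A 00, not 0D 0A.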

Of course, whether something is a bug is always somewhat subjective, as it largely depends on your expectations of how things should work, but I think we're not doing ourselves a favor by pretending that everything encoding-related in Perl works without hassles...

Sorry for the rant, and don't get me wrong. I'm a big fan of Perl, and I would surely advocate Perl wherever appropriate. However, in one of the projects mentioned above, I've had a rather hard time convincing my clients to stick with Perl, and not switch to some other language altogether. This involved investing quite a few unpaid hours on my side (spent on debugging and working around various peculiarities) to keep the price competitive.

Hope you can forgive the somewhat emotional tone of this post. In any case it's not meant to attack you personally, ruoso. Just needed to vent a little... and I'm feeling better now :)

Re^3: utf8, locale and regexp by Joost (Canon) on Apr 14, 2007 at 01:39 UTC
Perl's Unicode support is far from complete - especially when you consider outside-the-base-distro modules that everyone relies on. I've been running into bugs in DBD::mysql myself. I even supplied a couple of patches.

Right now I would say that Perl's Unicode support is better than that of most programming languages, if you only look at the base language. I believe Perl's internal distinction between the 8-bit (latin-1) encoding and the internal, multibyte (utf8) representation is the right choice for a language that has to keep strings == bytearray backward compatibility. It also keeps C <-> Perl translation relatively straightforward.

Also, I must say I've not run into any Unicode bugs in Perl itself since I started working on a fairly large multi-language system 7 months ago. But like I said, it's not quite like that when you consider modules. Most modules on CPAN aren't under the kind of scrutiny that the base Perl distro is under. Right now I'm examining a DBD::mysql bug that seems not to affect the system I'm working on, but I can't figure out why it doesn't. <--- that means no one's going to pay me for fixing it, probably. :-)

"What should it profit a man, if he should win a flame war, yet lose his cool?"
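To make the 8-bit versus internal utf8 distinction from the reply above concrete, here is a small sketch using the core utf8:: functions. The variable names are just for illustration. The same four characters can be stored either way, and Perl treats the two as the same string.

```perl
use strict;
use warnings;

# A string built from code points below 0x100 starts out in the
# 8-bit (latin-1) internal form.
my $bytes_form = "caf\xE9";

# Copy it and switch the copy to the internal multibyte (utf8) form.
# utf8::upgrade changes the storage only, not the characters.
my $utf8_form = $bytes_form;
utf8::upgrade($utf8_form);

my %report = (
    same_string     => ($bytes_form eq $utf8_form)   ? 1 : 0,
    length_bytes    => length($bytes_form),
    length_utf8     => length($utf8_form),
    flag_bytes_form => utf8::is_utf8($bytes_form)    ? 1 : 0,
    flag_utf8_form  => utf8::is_utf8($utf8_form)     ? 1 : 0,
);

printf "%-16s %s\n", $_, $report{$_} for sort keys %report;
```

Trouble usually starts only where a string leaves Perl (files, sockets, database drivers) and code looks at the internal bytes instead of the characters, which is presumably where XS modules like the one mentioned above can go wrong.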
Re^3: utf8, locale and regexp by ruoso (Curate) on Apr 12, 2007 at 10:53 UTC
Actually, your post was very informative. Thank you. And yes, I do believe the points you made are about bugs (especially the crlf issue). As to filenames, I think this is a wishlist item for File::Spec: as far as I understand, File::Spec should also be able to deal with the encoding used by the operating system (or at least be able to receive the information about which encoding to use).

daniel