I've had to deal with translating my code into multiple languages for a while. However, until recently, I hadn't actually written any Perl code that gets run directly by a paying user, so all translations have been for code in C and shell script (with teammates also adding Java to the mix). These each have their own quirks, but none has quite the same expressiveness as Perl, though I think much of what I've learned could be applied to Java by anyone who wanted to. So I hope to share some of the translation planning I've gone through, in case it helps you think about it.

Start from the very beginning

And that's the first lesson. If you don't plan on translation up front, it becomes much more difficult to handle later. Adding a second language to your output becomes very painful, as you have to go through all of your code hunting for every string, unless you planned up front and at least forced all your text through some well-named dummy method. If you stop reading here, you will still have received the most important piece of information I can share.
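A minimal sketch of such a dummy method, assuming a hypothetical function name loc() (my name for illustration, not from any particular module). For now it does nothing but printf-style substitution, yet every user-visible string already passes through one choke point:

```perl
use strict;
use warnings;

# Hypothetical pass-through; the name loc() is an assumption for illustration.
# For now it returns the text (with any arguments filled in), but it marks
# every user-visible string and gives one place to hook in translation later.
sub loc {
    my ($text, @args) = @_;
    return @args ? sprintf($text, @args) : $text;
}

my $dir = '/tmp/missing';
print loc("Directory %s is not found.", $dir), "\n";
```

When translation finally arrives, only the body of loc() has to change, and a grep for loc( finds all the text.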

Location of the text

There are two basic approaches here. One of them is, in my experience, somewhat unique to Perl. The general method is to create a file with keys and strings (usually English, but that may be my unilingual bias here). This file then goes to the translator(s), comes back with the text translated, and gets checked in to your version-control system under either a new name, a new directory, or both. Meanwhile, in your code, whenever you need to output text, you call your text-handling module with the key of the string you want and any replacement variables, and it returns the text to you. In some languages, like C, this can impose additional memory-management requirements on the caller (either pass in a big enough buffer, or call back to free it, or maybe it's a static and you must copy it, such as by printing it, before calling back in to the library for another string). Perl, shell, and Java obviously don't have to worry about that detail.
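The key-and-string approach can be sketched roughly as follows. The %catalog hash, the key name, and get_message() are all assumptions for illustration; in practice each language's strings would be loaded from the file the translators send back:

```perl
use strict;
use warnings;

# Hypothetical catalogs keyed by language, then by message key. In practice
# each language would live in its own file returned by the translators.
my %catalog = (
    en => { ERR_DIR_NOT_FOUND => 'Directory %s is not found.' },
    fr => { ERR_DIR_NOT_FOUND => 'Le dossier %s est introuvable.' },
);

# Look up a key in the requested language, falling back to English,
# then fill in the replacement variables.
sub get_message {
    my ($lang, $key, @args) = @_;
    my $text = $catalog{$lang}{$key} // $catalog{en}{$key};
    die "Unknown message key: $key\n" unless defined $text;
    return sprintf($text, @args);
}

print get_message('en', 'ERR_DIR_NOT_FOUND', '/tmp/missing'), "\n";
print get_message('fr', 'ERR_DIR_NOT_FOUND', '/tmp/missing'), "\n";
```

The returned string is an ordinary Perl scalar, so the caller has none of the buffer-ownership questions that the C version raises.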

The second option is to leave your text in your code. This has some significant advantages, but also some very serious disadvantages. In my opinion, not always shared by management, these disadvantages can be significantly mitigated, if not nearly eliminated, and I'll get into that, too.

The advantages are straightforward:

But, the disadvantages are:

Choose an interpolation method

The de facto way of dealing with all of this text seems to be Locale::Maketext, simply by virtue of its being part of core. That's what I started with, too, until our requirements got too complex for it. And, really, I'm already thankful that we outgrew it within our first release with translations.

First off, let's go over a few options. Depending on the tool you're using, you may be forced to have text as above: "Directory %s is not found." This is obviously very simple for C users: just take the text, use it as input to (f)printf, pass in the directory name, and you're golden: printf(get_message(ERR_FILE_NOT_FOUND), directory); Trivial to use, but also very error-prone. Getting %s vs %d vs %u all set properly is annoying. Some of these libraries mandate that everything is a string just to make it a bit easier should you need to change something later (e.g., changing %f to %.2f would be a change that affects all the translations, but if you formatted the number in your code and used %s in the message, you could change it all you want). And if you have multiple variables to interpolate, some languages might find it more natural to reverse the order, but you can't deal with that with plain printf (at least you couldn't at one time, and positional specifiers are still not available on all platforms). And, of course, plurals are annoying: "%d directory(ies) deleted." Given the myriad pluralisation rules, the number of permutations can grow immensely. Unfortunately, this was the state of the art for so long that some places have it as their gold standard from which One Must Not Deviate.

Another option is much like the above, but more explicit. Everything must be a string, but we use %1, %2, %3, etc., to say which argument to interpolate where. This was, IIRC, the next state of the art, and is likely where printf's %1$s specifier comes from. IMO, this option is made redundant by printf's newer formatting, but that's not available in libc on all platforms, so this is still useful for some portability. (Of course, Perl's sprintf has this and is portable.)
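As a small illustration of the explicit-index form in Perl's own sprintf: the file and directory names, and the reordered wording, are made up for the example. The same arguments are passed in the same order to both format strings, and only the text decides where each one lands:

```perl
use strict;
use warnings;

# The same two arguments, in the same order, for both format strings.
my @args = ('report.txt', '/var/tmp');

# Single quotes keep Perl from interpolating the "$s" in the format.
my $english   = 'File %1$s was not found in %2$s.';
# Hypothetical translation that needs the directory mentioned first.
my $reordered = 'In %2$s, file %1$s was not found.';

my $out_en = sprintf($english,   @args);
my $out_re = sprintf($reordered, @args);

print "$out_en\n$out_re\n";
```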

Then Java came along. Radical newness in their MessageFormat class: instead of %1, %2, %3, use {0}, {1}, {2}. Okay, that's not all that radical. Even when you add in the optional format type, e.g., {0,number,integer}, that's not really any different from %1$d in C. There are some other, more useful types (time, date), so that's cool, but not quite radical. Once we get to their custom choice formats, then we see some radicalness: "Deleted {0,choice,0#no directories|1#one directory|1<{0,number,integer} directories}." And for languages that have more or fewer pluralisations, translators can use more or fewer choices. A bit arcane, but that's kind of the price to pay.

The next option is Locale::Maketext. It is much like Java, offering a few options on formatting, though handling "int" vs "string" isn't much of an issue in Perl, of course, where such conversions are automatic and hidden. For users where seeing "File(s)" isn't a big deal, there's really not much here over %1 %2 (which is much like Java, really). If you want to handle pluralisation, there's some rudimentary support, where you say [quant,_1,file] and that becomes "1 file" or "0 files" or "10 files" or whatever, or [quant,_1,box,boxes] becomes "0 boxes", "1 box", "10 boxes", etc. However, it's not trivial for translators to adjust for their language if it doesn't have exactly two forms, singular for 1 and plural for everything else. It's possible, but not trivial.
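A minimal sketch of the [quant,...] notation in action. The My::L10N class names are assumptions for illustration, and _AUTO is used so the English text doubles as the lexicon key; a real project would keep each language class in its own file:

```perl
use strict;
use warnings;

# Hypothetical project base class; the My::L10N names are assumptions.
package My::L10N;
use parent 'Locale::Maketext';

package My::L10N::en;
use parent -norequire, 'My::L10N';
# _AUTO lets the English text act as its own lexicon key.
our %Lexicon = (_AUTO => 1);

package main;

my $lh = My::L10N->get_handle('en')
    or die "No language handle available\n";

# One-form quant: the plural is guessed by appending "s".
my @lines = map { $lh->maketext('[quant,_1,file] deleted.', $_) } 0, 1, 10;
# Two-form quant: singular and plural are spelled out.
push @lines, $lh->maketext('[quant,_1,box,boxes] found.', 1);

print "$_\n" for @lines;
```

A language with more than two plural forms would need its language class to override quant (or numerate), which is the "possible, but not trivial" part.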

I've gone a fair bit into plural vs singular, but the biggest problem with all of the above is ordering. Except for the first old format of %s, where parameters are inserted in the same order they're passed to printf, yes, parameters can be reordered. The third parameter can show up before the first with a simple "%3$s %1$s" / "{2} {0}" / "[_3] [_1]". No big deal. But that's not the problem. The real problem is that the developer and the code reviewer must painstakingly go through the message (and some of our messages are 500-1000 bytes long, at least in English) to ensure that the parameters we pass in are in the order the text expects, which is usually, but not necessarily, the order in which they show up in the text. And the translator needs to ensure that the variables get moved around properly, and likely needs to painstakingly ensure that the context for each number is correct. Pain pain pain.

And I hate pain.

I'm a wuss.

The solution I've gone with, as we've done many a time in Perl already, is named parameters. Instead of {0}, how about {dir_name}? And then in the code, we just pass in dir_name => $dir, regardless of ordering. Suddenly, all that pain disappears. Even the translators, who necessarily understand English to do the translation, can at a glance see what the tokens are and evaluate that they are still in the correct context.
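To make the idea concrete, here is a rough sketch of named-parameter interpolation using a hand-rolled helper. The fill_named() function and its {name} token syntax are assumptions for illustration, not the solution I actually settled on:

```perl
use strict;
use warnings;

# Hypothetical helper: replace each {name} token with the matching named
# argument, and fail loudly if the caller forgot one.
sub fill_named {
    my ($text, %param) = @_;
    my $lookup = sub {
        my ($name) = @_;
        die "Missing parameter '$name'\n" unless exists $param{$name};
        return $param{$name};
    };
    $text =~ s/\{(\w+)\}/$lookup->($1)/ge;
    return $text;
}

my $dir = '/tmp/missing';
print fill_named('Directory {dir_name} not found.', dir_name => $dir), "\n";

# Argument order no longer matters, in the call or in the text.
print fill_named('{count} entries in {dir_name}.',
                 dir_name => $dir, count => 3), "\n";
```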

Of course, we have a bazillion modules that already handle named parameter interpolation. Rather than reinventing such a system (I know, it's a rite of initiation, but I'm already initiated, I think), I ended up going with Template Toolkit. So now my text reads "Directory [% dir_name %] not found.". Simple. Especially since I'm already using Template Toolkit for a bunch of other templates anyway. But even if I weren't, this makes things really easy. For pluralisations, if we needed them, we could theoretically either explain how to do switch statements in TT, or provide other plugins (but we're not allowed).
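Roughly, processing such a message with Template Toolkit looks like the sketch below. It assumes the Template module from CPAN is installed, and the directory name is made up for the example:

```perl
use strict;
use warnings;
use Template;

my $tt = Template->new() or die Template->error(), "\n";

# The message text, as it might come back from the message catalog.
my $text = 'Directory [% dir_name %] not found.';
my $out  = '';

# A scalar ref as input means "template text", and a scalar ref as
# output collects the result instead of printing it.
$tt->process(\$text, { dir_name => '/tmp/missing' }, \$out)
    or die $tt->error(), "\n";

print "$out\n";
```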

Just yesterday, one of the translation centers sent back a query about what was going into a {0}. Turns out it was a numeric return code. But if that had been Perl code, the [% ret_code %] might have given them enough of a hint not to need to send a question back to development. And we've saved a fair bit of time just reading the code and seeing what everything is at a glance. And when I see "message => $rc", I have an indication it's not right before I even scan any further. More free code smells. :-)

Update: Reduce poor-taste humour.


In reply to Translation for Perl for Fun and Profit by Tanktalus
