I've had to deal with translating my code into multiple languages for a while. However, until recently, I hadn't actually written any Perl code that gets run directly by a paying user, so all translations have been for code in C and shell script (with teammates also adding Java to the mix). These all have their own quirks, but none have quite the same expressiveness as Perl, though I think much of what I've learned could be applied to Java if someone wanted to. So, I hope to share some of the planning for translation that I've gone through in case it helps you think about it.

Start from the very beginning

And that's the first lesson. If you don't plan on translation up front, it becomes much more difficult to handle later. Adding a second language to your output becomes very painful, as you have to go through all of your code finding every piece of user-visible text, unless you plan up front and at least force all your text through some well-named dummy method. If you stop reading here, you will still have received the most important piece of information I can share.

Location of the text

There are two basic approaches here. One of them is actually, in my experience, somewhat unique to Perl. The general method is to create a file with keys and strings (usually English, but that may be my unilingual bias here). This file then goes to the translator(s), comes back with the text translated, and then gets checked in to your version-control system under either a new name, a new directory, or both. Meanwhile, in your code, whenever you need to output text, you call your text-handling module with the key of the string you want, and any replacement variables, and it returns the text to you. In some languages, like C, this can impose additional memory-management requirements on the caller (either pass in a big enough buffer, or call back to free it, or maybe it's a static and you must copy it, such as by printing it, before calling back in to the library for another string). Perl, shell, and Java obviously don't need to worry about that detail.
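
A minimal sketch of the key-based approach in Perl might look like the following. The %MESSAGES table, the German text, and the get_message signature are assumptions made up for illustration; in practice the table would be loaded from the per-language resource files rather than written inline.

```perl
use strict;
use warnings;

# Hypothetical in-memory resource tables, one per language.
# A real system would load these from the translated resource files.
my %MESSAGES = (
    en => { ERR_FILE_NOT_FOUND => 'File %s is not found.' },
    de => { ERR_FILE_NOT_FOUND => 'Datei %s wurde nicht gefunden.' },
);

# Look up a message by key for the given language, fall back to English,
# and fill in the replacement variables with sprintf.
sub get_message {
    my ($lang, $key, @args) = @_;
    my $format = $MESSAGES{$lang}{$key} // $MESSAGES{en}{$key}
        // die "Unknown message key '$key'\n";
    return sprintf $format, @args;
}

print get_message('de', 'ERR_FILE_NOT_FOUND', '/tmp/data.txt'), "\n";
```

The caller only ever sees the key, which is exactly why the reviewer has to open the resource file to check what the message really says.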

The second option is to leave your text in your code. This has some significant advantages, but also some very serious disadvantages. In my opinion, not always shared by management, these disadvantages can be significantly mitigated if not nearly completely eliminated, and I'll get into that, too.

The advantages are straightforward:

By keeping the text close to the code that uses it, code reviews become easier. Code reviews can now trivially review your message, the way you're using it, that you're passing in all required parameters, and everything else. The reviewer does not need to open your resource file and read it in concert with your code, jumping around the resource file with the code. Reviewing the two separately, even if in the same review session, is more likely to miss the fact that your replacement values are ordered incorrectly, for example, so it has to be done simultaneously. By making the code review trivial, you make it more likely that the review will be completely done, and less likely to miss something either by taking short cuts or just forgetting/not noticing.
You also make it easier to code. During development, you no longer need to jump between your code and your resource file, just to add a new message.
And you're going to reduce contention on your resource files. Generally, you don't have one resource file per source file, you have one resource file per team, or one resource file per programming language per team. This results in many people likely wanting to edit the same resource file at the same time. And, depending on your version-control system, this means either the file is repeatedly locked and unlocked, introducing synchronisation delays, or you deal with merge conflicts when you go to check-in/deliver/commit (unless you're lucky and/or the merge conflict resolution is really good). By having the text in your code, and by having plenty of code files (many modules, for example), you only have contention when you happen to be editing the same file, but that is the same with or without translation. This method eliminates the file that everyone needs regular access to.

But, the disadvantages are:

Translators don't want to go through your code. You don't want them to go through your code. You don't want all the translations to be in your code. You need a way to get the text to translators, they need a way to get the text back to you, and you need a way to use their text when your environment says you're in that language. I have a multi-pronged approach to this.
First is that all my text is an object (I can hear the groaning already). This is largely so I can overload q[""] (more groaning, I think). The object's constructor takes the English text, and then looks it up in the translated files (which are all perl code using big hashes and a use utf8; at the top). And then, during the stringification, inserts all the interpolation, and returns the text.
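
A rough sketch of such a text object follows. The package name My::Text, the $LANG variable, the inline %TRANSLATIONS hash, the French text, and the tiny regex that expands [% name %] tokens are all assumptions for illustration; the real translated files and interpolation engine are not shown in this post.

```perl
use strict;
use warnings;

package My::Text {
    # Overload stringification so the object can be used wherever a
    # string is expected (print, concatenation, eq, ...).
    use overload q[""] => \&as_string, fallback => 1;

    our $LANG = 'en';    # hypothetical "current language" setting

    # Hypothetical stand-in for the translated files (big hashes keyed
    # by the English text).
    our %TRANSLATIONS = (
        fr => {
            'Directory [% dir_name %] not found.' =>
                'Le dossier [% dir_name %] est introuvable.',
        },
    );

    # The constructor takes the English text and looks up the translation,
    # keeping the English if none exists.
    sub new {
        my ($class, $english, %vars) = @_;
        my $text = $TRANSLATIONS{$LANG}{$english} // $english;
        return bless { text => $text, vars => \%vars }, $class;
    }

    # Interpolation happens only when the object is turned into a string.
    sub as_string {
        my ($self) = @_;
        my $out = $self->{text};
        $out =~ s{\[%\s*(\w+)\s*%\]}
                 { defined $self->{vars}{$1} ? $self->{vars}{$1} : "[% $1 %]" }ge;
        return $out;
    }
}

# Short, distinctive constructor name used throughout the application code.
sub _T { return My::Text->new(@_) }

print _T('Directory [% dir_name %] not found.', dir_name => '/tmp/x'), "\n";
```

Because the lookup is keyed on the English text, the call site reads like an ordinary message while the translation stays out of the code.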

The second prong is that the constructor is designed to be unique. In C++, I've seen this type of thing done using a function called _ (just an underscore). That doesn't work in Perl, where that name is already taken. So I adopted _T. Then I wrote a bunch of code using PPI to find all calls to _T, pull them out, and put them into my English resource file. Right before sending the text to translation, I run this script, which scans all of the code in my workspace, and check in the result (after a manual sanity check). Translators get a single file, I get my text in my code.
This has a secondary advantage: it makes it reasonably possible to detect messages that are no longer in use because all the code calling it is removed.
Message re-use. When you have a single resource file, each key is short and nearly meaningless. By that I mean that a message key might be "ERR_FILE_NOT_FOUND" but, over the years, the message has morphed into "Directory %s is not found." Okay, that might be extreme, but for bigger messages, it's harder to capture their entire meaning in a small key. However, by having the full message in your code, we have just gone to the other extreme. If you need that "File not found" error in multiple locations in your code, you need to copy it to multiple locations. Cut-and-paste is already a bug, but when you need to fix a typo in that message, you now have to get all locations. Sure, ack can help, but it's still annoying.

My solution is to go back to the message ID concept. Except that creating message IDs is another synchronisation point, and thus painful. So, instead, I generate the message IDs. The code scanner above doesn't just find all the messages, it finds their IDs from the _T call, and, if there isn't one, assigns it one, inserts it back into the code so the message ID remains constant forevermore, and also uses it in the translated files. Arguably, this can speed up the search in the hash by using smaller strings for the hash (while the lookup might be O(1), the calculation of the hash key is probably O(n) on the length of the key), but I don't really care about that.
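
The scanning step can be sketched roughly as below. This is a deliberately crude regex stand-in, not the PPI-based scanner described above: it only recognises calls whose first argument is a single-quoted literal, and it ignores message IDs entirely because the post does not show how an ID appears inside a _T call. The function names are hypothetical.

```perl
use strict;
use warnings;

# Return the distinct English texts of every _T('...') call in a chunk
# of source code, in order of first appearance.
sub extract_messages {
    my ($source) = @_;
    my (%seen, @found);
    while ($source =~ /\b_T\(\s*'((?:[^'\\]|\\.)*)'/g) {
        my $text = $1;
        $text =~ s/\\(['\\])/$1/g;    # undo single-quote escapes
        push @found, $text unless $seen{$text}++;
    }
    return @found;
}

# Given the current resource hash (keyed by English text) and the messages
# found in the code, list the resource entries nothing refers to any more.
sub unused_messages {
    my ($resource, @in_code) = @_;
    my %used = map { $_ => 1 } @in_code;
    return sort grep { !$used{$_} } keys %$resource;
}

my $code = <<'END_CODE';
print _T('Directory [% dir_name %] not found.', dir_name => $dir);
warn  _T('Done.');
END_CODE

print "$_\n" for extract_messages($code);
```

A real scanner needs a proper parser to cope with other quoting styles, comments, and strings that merely look like calls, which is where PPI earns its keep.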

Once the key is generated, I can reuse just that key in other code. The downside here is that we kind of get back to having to watch two different pieces of code during a code review. Or, another solution, which I've not yet implemented just because it hasn't come up yet, would be that because all the places in the code that use that message now have the same ID (that was generated after the first one was used), the code scan tool can detect whether all locations have the same text or not. This also has a downside: having to keep everything in sync. But the tool can automatically warn when something is out of sync. We've not entirely decided which way to go, but we have the flexibility here to pick one. Or both, really.

In my experience, this actually turns out to be rare. Much more rare than having multiple developers modifying a single resource file at the same time. So while there is a cost either way, eating these disadvantages seems to be the lesser cost to me. YMMV
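
The out-of-sync check mentioned above has not been implemented, so the following is only a sketch of how it might work. It assumes the scanner has already produced a list of call sites, each a hash with hypothetical id, text, and where fields, and it reports every ID whose call sites disagree on the text.

```perl
use strict;
use warnings;

# Return a hash ref mapping each conflicting message ID to the sorted list
# of distinct texts found for it. IDs used consistently are omitted.
sub find_out_of_sync {
    my (@call_sites) = @_;

    # Collect the set of distinct texts seen for each ID.
    my %texts_for;
    for my $site (@call_sites) {
        $texts_for{ $site->{id} }{ $site->{text} } = 1;
    }

    # Keep only the IDs that have more than one distinct text.
    my %conflicts;
    for my $id (keys %texts_for) {
        my @texts = sort keys %{ $texts_for{$id} };
        $conflicts{$id} = \@texts if @texts > 1;
    }
    return \%conflicts;
}

my $conflicts = find_out_of_sync(
    { id => 'M1', text => 'File not found.',  where => 'a.pm:10' },
    { id => 'M1', text => 'File not found!',  where => 'b.pm:42' },
    { id => 'M2', text => 'Done.',            where => 'a.pm:20' },
);
warn "Message $_ has differing texts\n" for sort keys %$conflicts;
```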

Choose an interpolation method

The de facto way of dealing with all of this text seems to be Locale::Maketext simply by virtue of it being part of core. And that's what I started with, too, until our requirements got too complex for it. And, really, I'm already thankful that we've grown past it inside of our first release with translations.

First off, let's go over a few options. Depending on the tool you're using, you may be forced to have text as above: "Directory %s is not found." This is obviously very simple for C users: just take the text, use it as input to (f)printf, pass in the directory name, and you're golden: printf(get_message(ERR_FILE_NOT_FOUND), directory); Very trivial to use. But also very error prone. Getting %s vs %d vs %u all set properly is annoying. Some of these libraries mandate that everything is a string just to make it a bit easier should you need to change something later (e.g., changing %f to %.2f would be a change that affects all the translations, but if you did the formatting in your code and used %s in the message, you could change it all you want). And, of course, if you have multiple variables to interpolate, some languages might find it more natural to reverse the order, but you can't deal with that with plain printf (at least you couldn't at one time, and you still can't on all platforms). And, of course, plurals are annoying: "%d directory(ies) deleted.". Due to the myriad of pluralisation rules, the number of permutations here can grow immensely. Unfortunately, this was state of the art for so long that some places have this as their gold standard from which One Must Not Deviate.

Another option is much like the above, but more explicit. Everything must be strings, but we use %1 %2 %3, etc., to say which argument to interpolate where. This was, IIRC, the next state of the art, and is likely where printf's %1$s specifier comes from. IMO, this option is made redundant by printf's newer formatting, but that's not available in libc on all platforms, so this is still useful for some portability. (Of course, Perl's sprintf has this and is portable.)
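
As a small illustration of explicit argument indexes with Perl's sprintf, consider the sketch below. The %FORMAT table and the "reordered" entry are invented for the example; the second entry simply stands in for a translation whose grammar wants the destination first.

```perl
use strict;
use warnings;

# Single quotes matter here: '%1$s' must reach sprintf untouched.
my %FORMAT = (
    en        => 'Copied %1$s to %2$s.',
    reordered => 'To %2$s, %1$s was copied.',   # hypothetical translation
);

# The caller always passes (source, destination) in the same order;
# each format decides where the arguments appear.
sub copied_message {
    my ($lang, $from, $to) = @_;
    return sprintf $FORMAT{$lang}, $from, $to;
}

print copied_message('en',        'a.txt', 'b.txt'), "\n";
print copied_message('reordered', 'a.txt', 'b.txt'), "\n";
```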

Then Java came along. Radical newness in their MessageFormat class: instead of %1, %2, %3, use {0}, {1}, {2}. Okay, that's not all that radical. Even when you add in the optional "type" flag, e.g., {0,number,integer}, that's not really any different from %1$d in C. There are some other, more useful types (time, date), so that's cool, but not quite radical. Once we get to their custom choice formats, then we see some radicalness: "Deleted {0,choice,0#no directories|1#one directory|1<{0,number,integer} directories}." And for languages that have more or fewer pluralisations, they can use more or fewer choices. A bit arcane, but that's kind of the price to pay.

The next option is Locale::Maketext. It is much like Java, offering a few options on formatting, though handling "int" vs "string" isn't much of an issue in Perl, of course, where such conversions are automatic and hidden. For users where seeing "File(s)" isn't a big deal, there's really not much here over %1 %2 (which is much like Java, really). If you want to handle pluralisation, there's some rudimentary support, where you say [quant,_1,file] and that becomes "1 file" or "0 files" or "10 files" or whatever, or [quant,_1,box,boxes] becomes "0 boxes", "1 box", "10 boxes", etc. However, it's not trivial for translators to adjust for their language if the text doesn't just have two forms (singular and plural), with 1==singular and !1==plural. It's possible, but not trivial.
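
For reference, here is a minimal Locale::Maketext setup exercising the quant examples above. The MyApp::L10N class names are hypothetical, and the lexicon uses _AUTO so that the bracket-notation text acts as its own key.

```perl
use strict;
use warnings;

# Hypothetical project base class; Locale::Maketext expects one per project.
package MyApp::L10N {
    use parent 'Locale::Maketext';
}

# English lexicon. _AUTO => 1 lets any unknown key serve as its own text.
package MyApp::L10N::en {
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = (_AUTO => 1);
}

my $lh = MyApp::L10N->get_handle('en')
    or die "No language handle available\n";

for my $count (0, 1, 10) {
    # One form given: the plural is the headword plus "s".
    print $lh->maketext('[quant,_1,file] deleted.', $count), "\n";
    # Two forms given: explicit singular and plural.
    print $lh->maketext('[quant,_1,box,boxes] packed.', $count), "\n";
}
```

The default quant only distinguishes "exactly 1" from everything else, so a language with more plural forms needs its own quant or numerate method in its language class.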

I've gone a fair bit into plural vs singular, but the biggest problem with all of the above is ordering. Except for the first old format of %s where parameters are inserted in the same order they're passed in to printf, yes, parameters can be reordered. The third parameter can show up before the first with a simple "%3$s %1$s" / "{2} {0}" / "[_3] [_1]". No big deal. But that's not the problem. The real problem is that the developer and the code reviewer must painstakingly go through the message (and some of our messages are 500-1000 bytes long, at least in English) to ensure that the parameters we pass in are in the same order that the text expects them, which, of course, is not necessarily the same as they show up in the text, but usually is. And the translator needs to ensure that the variables get moved around properly, and likely needs to painstakingly ensure that the context for each number is correct. Pain pain pain.

And I hate pain.

I'm a wuss.

The solution I've gone with, as we've done many a time in Perl already, is named parameters. Instead of {0}, how about {dir_name}? And then in the code, we just pass in dir_name => $dir, regardless of ordering. Suddenly, all that pain disappears. Even the translators, who necessarily understand English to do the translation, can at a glance see what the tokens are and evaluate that they are still in the correct context.
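
To make the idea concrete, here is a small sketch of named-parameter interpolation using the {dir_name} style of token. The fill_named function and the sample messages are made up for illustration and are not the system described in this post.

```perl
use strict;
use warnings;

# Replace each {name} token with the value passed under that name.
# A token with no matching value is treated as a programming error.
sub fill_named {
    my ($text, %vars) = @_;
    $text =~ s/\{(\w+)\}/exists $vars{$1} ? $vars{$1} : die "No value supplied for '$1'\n"/ge;
    return $text;
}

# The order of the named arguments no longer matters, and a translation is
# free to move the tokens around without the calling code changing.
my %args = (src => 'a.txt', dst => 'b.txt');
print fill_named('Moved {src} to {dst}.',    %args), "\n";
print fill_named('To {dst}: {src} moved.',   %args), "\n";
```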

Of course, we have a bazillion modules that already handle named parameter interpolation. Rather than reinventing such a system (I know, it's a rite of initiation, but I'm already initiated, I think), I ended up going with Template Toolkit. So now my text reads "Directory [% dir_name %] not found.". Simple. Especially since I'm already using Template Toolkit for a bunch of other templates anyway. But, even if I wasn't, this makes things really easy. For pluralisations, if we needed them, we could theoretically either explain how to do switch statements in TT, or provide other plugins (but we're not allowed).

Just yesterday, one of the translation centers sent back a query about what was going in to a {0}. Turns out it was a numeric return code. But if that were perl code, the [% ret_code %] might have at least given them enough of a hint not to need to send back a question to development. And we've saved a fair bit of time just reading the code and seeing what everything is at a glance. And when I see "message => $rc", I have an indication it's not right before I even scan any further. More free code smells. :-)

Update: Reduce poor-taste humour.

In reply to Translation for Perl for Fun and Profit by Tanktalus
