in reply to Perl to convert US to UK punctuation/spelling?

Believe it or not, this sounds fairly easy. The real problem comes when you have no proper quote marks, only the straight " and '. Then you need a very slippery set of heuristics to decide what they become. In your case you're just looking to swap balanced “ … ” with balanced ‘ … ’ and vice versa (you might need a placeholder pass between the two), knowing you can leave alone \w’\w (contractions) and ’\w (shortened terms like ’cause for because). I've never used Text::Balanced, but I suspect you just need to adapt these ideas to work with it and whatever RTF parser you like. I personally would keep the utf8, but there is nothing wrong with using entities instead if you're more comfortable with them (encode them to entities up front to save utf8 woes).
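
The swap-with-placeholder idea can be sketched roughly as follows. This is a minimal illustration in Python rather than Perl (the same substitutions port to s/// with little change), and it makes several assumptions: the text is already decoded Unicode with curly quotes, the two private-use placeholder characters do not occur in the input, and the function name swap_quotes is made up for the example. It does not touch RTF or Text::Balanced at all.

```python
import re

# Private-use code points act as temporary placeholders (assumption: the
# input never contains them), so the two swaps cannot clobber each other.
PH_OPEN, PH_CLOSE = "\ue000", "\ue001"


def swap_quotes(text: str) -> str:
    """Swap curly double quotes with curly single quotes and vice versa."""
    # Pass 1: park the double quotes on placeholders.
    text = text.replace("\u201c", PH_OPEN).replace("\u201d", PH_CLOSE)

    # Pass 2: single -> double. An opening single quote is always a quote.
    text = text.replace("\u2018", "\u201c")
    # A right single quote counts as a closing quote only when no word
    # character follows it; that leaves \w'\w contractions and '\w elisions
    # alone. Known gap: a plural possessive (dogs' bowls) is misread as a
    # closing quote and would need a human or a smarter heuristic.
    text = re.sub(r"\u2019(?!\w)", "\u201d", text)

    # Pass 3: placeholders -> single quotes.
    return text.replace(PH_OPEN, "\u2018").replace(PH_CLOSE, "\u2019")
```

Because the double quotes are parked first, the order of the later passes does not matter much; without the placeholder pass the second swap would undo the first.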

For spelling, IIRC, there are only a couple of hundred words that differ. Just dig up that list, turn it into a hash, and bingo.
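
A minimal sketch of the list-as-hash approach, again in Python (a dict standing in for the Perl hash). The three word pairs are only a sample, not the real list, and the names US_TO_UK and to_uk_spelling are invented for the example. It matches whole words only, so inflected forms such as "colorful" would need their own entries.

```python
import re

# Tiny illustrative sample; a real table would hold the full word list.
US_TO_UK = {
    "color": "colour",
    "theater": "theatre",
    "center": "centre",
}

# One alternation of all keys, anchored on word boundaries.
_WORD = re.compile(
    r"\b(" + "|".join(map(re.escape, US_TO_UK)) + r")\b",
    re.IGNORECASE,
)


def _match_case(found: str, replacement: str) -> str:
    """Carry the capitalisation of the matched word over to the replacement."""
    if found.isupper():
        return replacement.upper()
    if found[0].isupper():
        return replacement.capitalize()
    return replacement


def to_uk_spelling(text: str) -> str:
    """Replace every whole-word match with its UK spelling."""
    return _WORD.sub(
        lambda m: _match_case(m.group(0), US_TO_UK[m.group(0).lower()]),
        text,
    )
```

This handles pure spelling differences only; it has no way to tell when a word should be left alone because of context.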

Re^2: Perl to convert US to UK punctuation/spelling?
by dragonchild (Archbishop) on Jun 16, 2008 at 01:22 UTC
    It's not just 'color' vs. 'colour' and 'theater' vs. 'theatre'. It's also 'gas' vs. 'petrol' and 'trunk' (of a car) vs. 'boot', but not confusing that with trunk of an elephant or trunk in which one carries things or trunk of a tree.

    Honestly, the best approach is to have a program suggest changes (using highlighting which RTF supports) and have a human go through the documents. Finding people who know proper British spelling and are willing to work for cheap is easy - India was a British colony and the largest group of people doing O and A levels is there.
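
    The suggest-don't-replace approach might look something like the sketch below. It is in Python, the candidate table holds only the two examples from above, and the names CANDIDATES, suggest and annotate are hypothetical. It marks candidates with plain [[ ... ]] brackets as a stand-in for RTF highlighting, which is not attempted here.

```python
import re

# Hypothetical candidate table: US term -> (UK suggestion, note for reviewer).
CANDIDATES = {
    "gas": ("petrol", "only when it means motor fuel"),
    "trunk": ("boot", "only for a car; not an elephant, a tree, or luggage"),
}

_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CANDIDATES)) + r")\b",
    re.IGNORECASE,
)


def suggest(text):
    """List (offset, found word, suggestion, note); the text is not changed."""
    results = []
    for m in _PATTERN.finditer(text):
        uk, note = CANDIDATES[m.group(0).lower()]
        results.append((m.start(), m.group(0), uk, note))
    return results


def annotate(text):
    """Mark each candidate inline so a human can accept or reject it."""
    return _PATTERN.sub(
        lambda m: "[[%s -> %s?]]" % (m.group(0), CANDIDATES[m.group(0).lower()][0]),
        text,
    )
```

    Every hit is flagged, including the elephant's trunk; deciding which ones to accept is left to the human reviewer.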


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      Finding people who know proper British spelling and are willing to work for cheap is easy - India was a British colony and the largest group of people doing O and A levels is there.

      Oooh! Dangerous assumptions. The Indians have their own dialect(s) of English. As do the Antipodeans (at least two flavours); the Caribbeans (half a dozen or more); the Irish (North and South). And every other native English-speaking country and group.

      Heck. Go anywhere north of the Watford Gap and if a girl talks about "being made-up", it's as likely that she is happy about something as it is that she is wearing cosmetics. (And there are at least two other interpretations of that two-word phrase: "made-up ground" and "bottle of made-up vodka & orange" [bnc].)

      And if you start considering colloquialisms, you're into nearly as many regional variations within the British Isles as there are counties. And that's before you even begin to consider things like youth culture and so-called business-speak.

      Once you go beyond 'Received Pronunciation', which not even the Queen speaks any more, there is no such thing as "Standard British English". Neither in pronunciation, nor in grammar. Even British academics are having to show flexibility in what is acceptable these days.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Excellent point on dialects. The United States has more than you can shake a stick at. Tempers flare over everything from the simple (what does "barbeque" really mean?) to the complicated ('Civil War' vs. 'War of Northern Aggression').

      Even worse are local metaphor and other colloquialisms.

      How do you translate to American English expressions such as "took the piss" or "Bob's your uncle" in an automated way? And sometimes, you have to be careful, or you accidentally leave in an expression like "pack of fags" and offend someone.

      I'd completely agree with dragonchild -- translation is a case for real humans to look at it.

      Good point, and good approach. I keep teetering on the edge of writing an XHTML acronym/abbr inserter that works this way. Perhaps it would ask up front what general information domain a document inhabits, and then, for each potential candidate, present the list of spelled-out forms so you can insert the tag with the title attribute; the choices would be ordered by the initial preference (default on "return," skip on...).

      I concur with dragonchild in as much as this should be treated as any other piece of language translation, and can't be managed by a translation dictionary alone.

      However I'd disagree that "outsourcing" to India is a viable solution -- in any translation work you should only translate into your native language, and this would be particularly true for the case of US -> UK English where nuance is important and where pop culture English is dominantly the US version.

      The detailed requirements will likely vary a lot depending on the nature of the work being translated and its intended purpose. In many regards scientific or technical material will be the easiest to translate, though this will often still require significant work: my father worked for several years with a team that translated RAF technical manuals from UK English into US English for the USAF. Anything using idiom, allegory, wordplay, etc. is likely to be far more problematic; this is the traditional "beauty versus truth" problem, manifesting here as the question of when to faithfully preserve the text and when to interpret its meaning.

Re^2: Perl to convert US to UK punctuation/spelling?
by ww (Archbishop) on Jun 16, 2008 at 02:46 UTC
    ...you're just looking to swap balanced “ … ” with balanced ‘ … ’ and vice versa...

    True... but that doesn't necessarily display correctly for all of us. What I see (FF 2.0.0.14, W2k) is two straight, slanted, end-double-quotes separated by ... and two straight, slanted, single quotes, similarly separated. (The fact that they're straight, not curly quotes is NOT relevant; what is on point is that all slant in the same direction.)

    So why did I bother to mention this?

    Well, I didn't discover that until I cut'n'pasted from Your_Mother's node (which appears to have been written with character entities, "“ … ” with balanced ‘ … ’"), with the intention of pointing out that (what I saw) was not what OP was asking for. Hence, this, FWIW, for others whose display is not what Your_Mother intended.

      utf8 doesn't work on PM. :) This has no bearing on whether it works in another program.

        True, again, but that has considerable bearing on whether readers see what you intended.

        At least one reader, /me, saw something quite different.