regex for translation

klayman has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone. For the past few days I have been struggling with a problem and am hoping that someone can point me in the right direction. The project I am working on is written in Mason and uses an older version of Perl (5.10.1). It is an online CMS that uses a simple translation module which needs to be extended. At the moment the CMS pulls in templates containing a few tags, such as {T_OPTION} or {T_SELECTION}, which are replaced before the page is served to the client. These can be replaced with strings like __("pick") or __('shirt'), and a regular expression then matches those strings and replaces them with translations for the language selected by the client:

$c =~ s/__\(["'](.*?)["']\)/&gettr($1)/egis
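
A minimal, self-contained sketch of what this substitution does. The `%po` table and this `gettr` are hypothetical stand-ins for the CMS's real translation module, and the Polish words are only illustrative values:

```perl
use strict;
use warnings;

# Hypothetical translation table; the real CMS reads its strings from a po file.
my %po = (
    'pick'  => 'wybierz',
    'shirt' => 'koszula',
);

# Hypothetical stand-in for the CMS's gettr(): fall back to the key itself
# when no translation is known.
sub gettr {
    my ($key) = @_;
    return exists $po{$key} ? $po{$key} : $key;
}

my $c = q{<b>__("pick")</b> <i>__('shirt')</i> __("unknown")};

# Same substitution as above: every __("...") or __('...') is replaced
# by the return value of gettr().
$c =~ s/__\(["'](.*?)["']\)/gettr($1)/egis;

print "$c\n";
```

Because the pattern stops at the first quote followed by `)`, it only works while the translated string contains no such sequence of its own.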

The problem arises if we want to put tags that the CMS replaces into the translatable sentences in the templates, e.g.
__("Please choose your {T_OPTION}")
I have tried to take the tags out of the translated text so that they can be used anywhere in the sentence and so that the final sentence can still be matched against the po file that contains all the translated strings, e.g.
__("Please choose your [_0]",{T_OPTION})
This also lets the sentence use a different word order in different languages. [_0] will be replaced by whatever is supplied as {T_OPTION}, so after tag replacement the whole translation text may look like __("Please choose your [_0]",__("shirt")).
Here you can see my problem. I need the regex to match the first part as the translation string and the second part (everything after '",') as parameters, and then substitute the parameters into the translated sentence. I have a sub that does this. So I tried to create a regex that matches these cases:

s/__\(('.*?[^\\']'|".*?[^\\"]")(?:,(,?.*?[^'"]))?\)/&gettr($1, $2)/egis;

This accounts for escaped apostrophes and brackets as well, and it works; however, it slows our CMS down so much that it is almost unusable. I tried standard Perl modules such as Text::Balanced and Regexp::Common but could not get them working the way I wanted. Could there be a simpler solution to this problem? (I cannot switch to a different translation module, as many templates would need to be rewritten to use it.)


Replies are listed 'Best First'.
Re: regex for translation by hdb (Monsignor) on Oct 02, 2013 at 13:31 UTC
Not sure I really understand what you are looking for. My understanding is that you are looking for substrings like `__(...)` with various possible types of strings between the parentheses. I would suggest using a regex to find these general patterns and then calling a function to process (with further regexes) what's in between. It could look like this:

    sub process { print shift, "\n"; }
    my $string = "ccc __(dddd)ccc __(eeee)ssss";
    $string =~ s/__\(([^)]+)\)/process($1)/egis;

The advantage is that the complex regex is split into two pieces: one applied to the larger text, and one for the template bits you have extracted. You could even call it recursively. Whether or not this makes sense or is faster, you would have to try.
Re^2: regex for translation by klayman (Initiate) on Oct 02, 2013 at 16:06 UTC
I agree. I will need to split the regular expression into at least two, since the string within __(...) can contain another __(...) string. I am working on two regexes that will accomplish this. The problem is that the text in the translation tags __(...) can contain any combination of `(` or `)`, or even `__`, or even `"` and `'`, which I am using in the regex to find the translation tags `__(" ... ")`.
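
One way the two-step idea could be sketched on Perl 5.10 is shown below: an outer pattern finds a whole `__(...)` call, nested calls included, and a small function then walks its arguments. This is only a sketch under assumptions. It expects the parameter syntax `__("text [_0]", param)` from the question, it expects quotes inside a string to be backslash-escaped, and the names `%po`, `gettr`, `unquote`, `translate_call` and `expand` are all hypothetical:

```perl
use strict;
use warnings;

# Hypothetical translation table; both entries come from the example in this thread.
my %po = (
    'Thank you for choosing [_0] for your reservation'
        => 'Dziekujemy za wybranie [_0] do dokonania rezerwacji.',
    'our system' => 'naszego systemu',
);

# A quoted string with backslash escapes. Possessive quantifiers (Perl 5.10+)
# stop the engine from backtracking into a string it has already consumed.
my $str_re = qr{ " (?: [^"\\]++ | \\. )*+ " | ' (?: [^'\\]++ | \\. )*+ ' }xs;

# __( STRING [, STRING-or-nested-call]... ), using 5.10 named recursion.
my $call_re = qr{
    (?<call> __\( \s* $str_re
        (?: \s* , \s* (?: (?&call) | $str_re ) )*
    \s* \) )
}xs;

# Hypothetical gettr(): look up the key, then fill [_0], [_1], ... from @params.
sub gettr {
    my ($key, @params) = @_;
    my $text = exists $po{$key} ? $po{$key} : $key;
    $text =~ s/\[_(\d+)\]/defined $params[$1] ? $params[$1] : '[_' . $1 . ']'/ge;
    return $text;
}

# Drop the surrounding quotes and undo backslash escapes.
sub unquote {
    my ($s) = @_;
    $s = substr($s, 1, -1);
    $s =~ s/\\(.)/$1/gs;
    return $s;
}

# Translate one complete call, recursing into nested calls first.
sub translate_call {
    my ($call) = @_;
    my $inner = substr($call, 3, -1);    # strip the leading "__(" and the trailing ")"
    my @args;
    while ($inner =~ /\G \s* ,? \s* (?: ($str_re) | ($call_re) )/gcx) {
        my ($str, $nested) = ($1, $2);   # copy before recursing
        push @args, defined $str ? unquote($str) : translate_call($nested);
    }
    return gettr(@args);
}

# Replace every top-level call in a template.
sub expand {
    my ($text) = @_;
    $text =~ s/($call_re)/translate_call($1)/ge;
    return $text;
}

my $out = expand(
    q{<p>__("Thank you for choosing [_0] for your reservation", __("our system"))</p> __('a (b) \'c\'')}
);
print "$out\n";
```

Parentheses and `__` inside a quoted string are harmless here because the string pattern consumes them before the outer pattern looks for the closing `)`. An unescaped quote inside a string, as in `__("... __("our system") ...")`, cannot be handled this way.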
Re: regex for translation by pvaldes (Chaplain) on Oct 02, 2013 at 13:47 UTC
"I need to somehow let regex match first part of the code as translation string and second (everything after '",') as parameters and then replace them in the sentence"
If I'm understanding the problem correctly, in pseudocode: if (something) is found, replace $1; save $POSTMATCH (linked to the match?) and do something with it. This seems to me like a job for a hash, $hash{item} = "translation", where the translation could perhaps be an array of options. As the number of tags is probably limited, you could check and validate only for some predefined values. If those names are well defined, you probably don't need to care about saving the ("") wrapper.
Re^2: regex for translation by klayman (Initiate) on Oct 03, 2013 at 07:26 UTC
Yes, that's how it is used at the moment. A simple translation string may be __('here is something'), and $hash{'here is something'} holds the translation in the specific language. This works fine. The problem arises if I introduce more complicated translations in which I can replace words inside the translated sentence (like the word 'something'), and I do not know in advance how many words can be replaced: the sentence 'here is something' could become 'here is anything', 'here is nothing' or 'here is the thing'.
Re: regex for translation by jethro (Monsignor) on Oct 02, 2013 at 16:49 UTC
I think your regex takes too long because it tries to match the parameter as a first part again and fails (I am not sure whether this is possible; it depends on where the regex continues after a match, and I'm a little rusty on that part of the regex lore and too lazy to look it up). But to fail, it has to search through to the end of the file before it can be sure it has failed, because of the evil "s" modifier on your regex, which makes ".*" mean the rest of the file instead of just the rest of the line. How to correct that depends. Maybe you can change the "__" in front of the parameter to something else. Or it might make sense to split on "__" and then work on the array piece by piece, avoiding the g modifier on your regex. Or change the global matching to happen in a loop and make sure the matching starts after the parameter.
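
A small sketch of one way the quoted-string part of the pattern can run further than intended under the "s" modifier, next to a quote-aware alternative that uses possessive quantifiers (available since Perl 5.10). The sample text and variable names are made up for illustration, and this does not by itself prove where the time goes in the real templates:

```perl
use strict;
use warnings;

# Quoted-string sub-pattern taken from the regex in the question.
my $lazy = qr/"(.*?[^\\"])"/s;

# Quote-aware alternative: runs of ordinary characters or backslash escapes,
# matched possessively so the engine never backtracks into the string.
my $unrolled = qr/"((?:[^"\\]++|\\.)*+)"/s;

my $text = q{__("") <a href="x.html">__("say \"hi\"")</a>};

# The lazy version needs at least one character between the quotes, so on
# the empty string "" it keeps scanning until a later quote fits.
my ($lazy_first) = $text =~ $lazy;

# The unrolled version matches each quoted string exactly.
my ($unrolled_first) = $text =~ $unrolled;
my @unrolled_all     = $text =~ /$unrolled/g;

print "lazy:     [$lazy_first]\n";
print "unrolled: [$unrolled_first]\n";
print "all:      [$_]\n" for @unrolled_all;
```

With `.*?` and /s the only limit on how far such a scan can run is the end of the input, which is the kind of behaviour described above.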
Re^2: regex for translation by klayman (Initiate) on Oct 03, 2013 at 07:17 UTC
The problem is that I need to go through the whole file, as it contains HTML markup that can have translation strings anywhere in it. But you are right about splitting it on the __ symbol. An issue arises, however, if there is a translation string inside a translation string: I need to separate them somehow and make sure that the translation string, or anything within __(' ... '), doesn't contain __(' ... ') as well.
Re^3: regex for translation by jethro (Monsignor) on Oct 03, 2013 at 11:11 UTC
As I said, you have more than one option. If my hypothesis is right, I would do the following:

1) Make sure the hypothesis is right. If possible, call the code from a small test script that runs it a few hundred thousand times, and time that. Then use two test files: one with a few translation strings at the beginning of a long file, the other with the same translation strings at the end of the file. If the first file takes much longer, you can be pretty sure that a runaway regex search is the culprit. Another possibility would be to execute the code (either all of it or extracted parts in a test script) with a newer Perl version and use debugging features like "use re "debug";".

2) Change your program to do the search and replace in a loop. If you call a regex with the g modifier in scalar context, it finds only one occurrence and stops, but it remembers where it left off (you can find out where with pos(), and change where it continues with pos() as well). What I would propose would be something like this:

    my $result = "";
    # capture parens removed; the alternation is wrapped in (?: ) so that
    # "__\(" applies to both branches
    while (m/__\((?:'.*?[^\\']'|".*?[^\\"]")(?:,,?.*?[^'"])?\)/gis) {
        my ($start, $end) = ($-[0], $+[0]);    # offsets of the match within $_
        $result .= substr($_, 0, $start);      # copy the text before the match
        my $transtext = substr($_, $start, $end - $start);
        # <here $transtext has your complete translation string. Do the
        #  substitution on $transtext; you can use the code you already
        #  used or even simplify it>
        $result .= $transtext;
        # remove the already translated part from $_
        substr($_, 0, $end) = '';
        # we reset the search to begin at position 0 again
        pos() = 0;
    }
    $_ = $result . $_;

Untested code, but this should theoretically work. It has to parse the translation string twice, so it will naturally be twice as slow as your original simple regex. But it should not bring your webserver to its knees.

Clarification update: "twice as slow" applies only to the parsing of the string, not to the complete regex execution. gettr() will still be called only once.
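
The timing test from step 1 could be sketched with the core Benchmark module as follows. The filler text, the iteration count and the variable names are arbitrary assumptions, and the replacement is a constant so that only the pattern matching is measured; real templates would have to be substituted for the generated input before drawing conclusions:

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# The regex from the question, compiled once.
my $pattern = qr/__\(('.*?[^\\']'|".*?[^\\"]")(?:,(,?.*?[^'"]))?\)/is;

# Two hypothetical inputs: the same translation strings at the start
# and at the end of a long stretch of markup.
my $filler = "<p>plain html without tags</p>\n" x 2000;
my %input  = (
    tags_at_start => q{__("pick") __('shirt')} . $filler,
    tags_at_end   => $filler . q{__("pick") __('shirt')},
);

my %count;
for my $name (sort keys %input) {
    my $t = timeit(20, sub {
        my $copy = $input{$name};
        # s///g returns the number of substitutions it made.
        $count{$name} = ($copy =~ s/$pattern/X/g);
    });
    print "$name: $count{$name} substitutions, ", timestr($t), "\n";
}
```

If the two timings differ widely on real input, that supports the runaway-search hypothesis; if they are close, the slowdown probably lies elsewhere.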
Re: regex for translation by Anonymous Monk on Oct 02, 2013 at 08:53 UTC
"Could there be any simpler solution to this problem?"
Um, maybe, if the sample input and wanted output were presented.
Re^2: regex for translation by klayman (Initiate) on Oct 02, 2013 at 13:10 UTC
Sure, one of the sentences in the templates reads
`__("Thank you for choosing {T_SITE} for your reservation")`
After tag replacement it will be
`__("Thank you for choosing __("our system") for your reservation")`
The translation file contains this key:
`__("Thank you for choosing new system for your reservation")`
so I would need to take the tag that is going to be replaced out of the sentence and add it as a parameter that can be parsed:
`__("Thank you for choosing [_0] for your reservation", {T_SITE})`
After replacement it will be
`__("Thank you for choosing [_0] for your reservation", __("our system"))`
Both parts can then be translated, as
`Dziekujemy za wybranie [_0] do dokonania rezerwacji.` and `naszego systemu`
and joined as
`Dziekujemy za wybranie naszego systemu do dokonania rezerwacji.`
The problem is detecting where the translation starts and ends, and where the parameters I want to pass start and end. The regex I posted detects that, but unfortunately it is very slow.
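
The final joining step described here can be sketched as below. The question says a sub for this already exists, so this `gettr` and its `%po` table are hypothetical stand-ins, filled with the strings from the example:

```perl
use strict;
use warnings;

# Hypothetical excerpt of the translation table, using the strings above.
my %po = (
    'Thank you for choosing [_0] for your reservation'
        => 'Dziekujemy za wybranie [_0] do dokonania rezerwacji.',
    'our system' => 'naszego systemu',
);

# Hypothetical gettr(): translate the key, then replace [_0], [_1], ...
# with the already translated parameters. Placeholders without a matching
# parameter are left untouched.
sub gettr {
    my ($key, @params) = @_;
    my $text = exists $po{$key} ? $po{$key} : $key;
    $text =~ s/\[_(\d+)\]/defined $params[$1] ? $params[$1] : '[_' . $1 . ']'/ge;
    return $text;
}

my $site     = gettr('our system');
my $sentence = gettr('Thank you for choosing [_0] for your reservation', $site);

print "$sentence\n";
```

Since the placeholder is part of the po key, each language can put [_0] wherever its word order requires.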
Re^3: regex for translation by Anonymous Monk on Oct 02, 2013 at 15:13 UTC
I'm sorry, but I don't understand that any better than what you originally posted; too many targets. Is this it?

    #!/usr/bin/perl --
    use strict; use warnings;
    use Data::Dump qw/ dd pp /;

    my @stuff = (
        {
            in   => q{__("Thank you for choosing {T_SITE} for your reservation")},
            want => q{__("Thank you for choosing [_0] for your reservation",{T_SITE})},
        },
    );

    for my $test ( @stuff ){
        my( $in, $want ) = @{$test}{qw/ in want /};
        my $out = $in;
        $out =~ s{
            (?:
                \Q__("\E
                (.+)    # $1
                \Q")\E
            )
        }{
            something( $1 );
        }xegis;
        dd({ -in, $in, -out, $out, -want, $want });
    }

    sub something {
        my( $what ) = @_;
        use vars '$fudge';
        local $fudge;
        $what =~ s{ \{ ( [^\}]+ ) \} }{
            $fudge = $1;
            q{[_0]}
        }sex;
        if( not defined $fudge){
        }
        qq{__("$what",{$fudge})};
    }
    __END__
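
If a sentence can contain more than one tag, the same rewrite could number the placeholders in order of appearance. A sketch only: the name `parameterize` is hypothetical, and it assumes that tags consist of upper-case letters, digits and underscores inside braces, as in {T_SITE} and {T_OPTION}:

```perl
use strict;
use warnings;

# Rewrite __("... {TAG} ...") into __("... [_0] ...",{TAG}), numbering
# the placeholders in the order the tags appear in the sentence.
sub parameterize {
    my ($call) = @_;

    # Split the call into its quote character and the quoted body;
    # anything that does not look like a simple call is returned unchanged.
    my ($quote, $body) = $call =~ /\A__\((["'])(.*)\1\)\z/s
        or return $call;

    my @tags;
    $body =~ s{(\{[A-Z0-9_]+\})}{ push @tags, $1; '[_' . (@tags - 1) . ']' }ge;
    return $call unless @tags;

    return '__(' . $quote . $body . $quote . join('', map { ",$_" } @tags) . ')';
}

my $one = parameterize(q{__("Thank you for choosing {T_SITE} for your reservation")});
my $two = parameterize(q{__("Your {T_OPTION} from {T_SITE}")});

print "$one\n$two\n";
```

The first result matches the `want` string in the test data above; the second shows two tags becoming [_0] and [_1].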