Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I've been developing a primitive translation package along the lines of _Intermediate Perl_, specifically using module-starter and customizing an appropriate Makefile.PL. I may have put files in loopy places due to inexperience, but I am getting preliminary results: 2 of my 3 test files are translated faithfully. I have output and source to show, to motivate a couple of questions:

Here are the driver script and the package:

#!/usr/bin/perl -w
use 5.011;
use WWW::Google::Translate;
use Data::Dumper;
use open OUT => ':utf8';
use Path::Tiny;
use lib ".";
use translate;
binmode STDOUT, 'utf8';
use POSIX qw(strftime);

### values to initialize (customize these to suit)
my $ini_path = qw( /home/bob/Documents/html_template_data/3.values.ini );
my $sub_hash = "google";
my ( $from, $to ) = ( 'en', 'ru' );    #put defaults here
my $input_directory =
  qw( /home/bob/Documents/meditations/castaways/translate/data );
my $output_appendage = "output";

## get values for google from an .ini file
my $key = get_config( $ini_path, $sub_hash );

say "Would you like to see the possibilities?";
my $prompt1 = <STDIN>;
chomp $prompt1;
if ( $prompt1 eq ( "y" | "Y" ) ) {
  show_lang_codes();
}

say "Would you like to change the from language?";
$prompt1 = <STDIN>;
chomp $prompt1;
if ( $prompt1 eq ( "y" | "Y" ) ) {
  $from = get_lang($from);
}

say "Would you like to change the to language?";
$prompt1 = <STDIN>;
chomp $prompt1;
if ( $prompt1 eq ( "y" | "Y" ) ) {
  $to = get_lang($to);
}

# create output directory
say "Creating output directory as nephew of input";
say "using localtime for uniqueness";
my $munge   = strftime( "%d-%m-%Y-%H-%M-%S\.txt", localtime );
my $parent  = path($input_directory)->parent;
my $out_dir = path( $parent, $output_appendage, $munge );

my $wgt = WWW::Google::Translate->new(
  {
    key            => $key,
    default_source => $from,
    default_target => $to,
  }
);

my @texts = path("$input_directory")->children(qr/\.txt$/);
say "texts are @texts";

for my $file (@texts) {
  local $/ = "";
  open my $fh, '<', $file;
  my $base_name = path("$file")->basename;
  my $out_file  = path( $out_dir, $base_name )->touchpath;
  say "out_file is $out_file";
  while (<$fh>) {
    print "New Paragraph: $_";
    my $r = get_trans( $wgt, $_ );
    for my $trans_rh ( @{ $r->{data}->{translations} } ) {

      #print $trans_rh->{translatedText}, "\n";
      my $result = $trans_rh->{translatedText};
      say "result is $result ";
      my @lines = split /\n/, $result;
      push @lines, "\n";
      path("$out_file")->append_utf8(@lines);
    }
  }
  close $fh;
}

package translate;
use 5.006;
use strict;
use warnings;
require Exporter;
our @ISA     = qw(Exporter);
our @EXPORT  = qw( get_config get_trans get_lang show_lang_codes reverse_trans);
our $VERSION = '0.01';

=head1 SYNOPSIS

    use translate;
    my $key = get_config('path-to-ini-file', $sub_hash);
    my $from = get_lang($from_default);
    my $to = get_lang($to_default);
    my $trans_output_file = get_trans($input_file, $from, $to, $key);
    my $reverse = reverse_trans($trans_output_file, $to, $from, $key);

=cut

sub get_config {
  use Config::Tiny;
  use Data::Dumper;
  use open OUT => ':encoding(UTF-8)';
  use Path::Tiny;
  use 5.011;

  my ( $ini_path, $sub_hash ) = @_;
  say "ini path is $ini_path";
  say "sub_hash is $sub_hash";
  my $Config = Config::Tiny->new;
  $Config = Config::Tiny->read( $ini_path, 'utf8' );
  say Dumper $Config;
  my $key = $Config->{$sub_hash}{'api_key_1'};
  return $key;
}

sub get_lang {
  use Path::Tiny;
  use 5.011;

  my $lang = shift;
  say "Would you like to change languages?";
  my $prompt1 = <STDIN>;
  chomp $prompt1;
  if ( $prompt1 eq ( "y" | "Y" ) ) {
    say "enter new lang: ";
    $prompt1 = <STDIN>;
    chomp $prompt1;
    $lang = $prompt1;
  }
  return $lang;
}

sub show_lang_codes {
  use Path::Tiny;
  use 5.011;

  my $path_to_langs = path( "my_data", "lang_data", "1.langlist.txt" );
  my $data          = $path_to_langs->slurp_utf8;
  say "$data";
}

sub get_trans {
  use Path::Tiny;
  use 5.011;
  use WWW::Google::Translate;

  my ( $wgt, $paragraph ) = @_;
  my $r = $wgt->translate( { q => $paragraph } );
  return $r;
}

1;    # End of translate

This is typical output for my first two input files:

New Paragraph:         Did the last version also pass all tests? Were the changes required? Were new tests added to cover the changes?

result is Последняя версия также прошла все тесты? Были ли необходимы изменения? Были ли добавлены новые тесты для покрытия изменений?

It chunks up nicely and formats well in the output file too.

My final input file did not. It is Shelley's Frankenstein from Gutenberg texts online. With my input record separator as it is, it seems to slurp in the entire book at once:

, source => en, format => text
unsuccessful translate POST for 450783 bytes: Request payload size exceeds the limit: 204800 bytes.
check that BOB-THINKPAD-SL510 has API Access for this API key
at https://console.developers.google.com/cloud-resource-manager
$

I took a look at it in the hex editor, and the Shelley text has

0D 0A

as line endings, while the first two have 0A. Is this not the Unix versus Windows line-ending problem? If so, it must be a well-worn path.

Q1: How do I rewrite my script so that I get paragraph-sized chunks getting sent to google regardless of line feed encoding?

my $r = get_trans( $wgt, $_ );
for my $trans_rh ( @{ $r->{data}->{translations} } ) {

  #print $trans_rh->{translatedText}, "\n";
  my $result = $trans_rh->{translatedText};
  say "result is $result ";
  my @lines = split /\n/, $result;
  push @lines, "\n";
  path("$out_file")->append_utf8(@lines);
}

Q2: Do I really need all of this to extract one paragraph of translation?

Thanks for your comment,

Re: chunking up texts correctly for online translation
by haukex (Archbishop) on Jun 14, 2019 at 22:02 UTC
    Q1: How do I rewrite my script so that I get paragraph-sized chunks getting sent to google regardless of line feed encoding?

    You're right that paragraph mode ($/ = "") doesn't work right when reading a CRLF file on *NIX. You could enable the :crlf PerlIO layer, which will leave plain LF as is but convert CRLF to LF, and paragraph mode will work: open my $fh, '<:crlf', $file
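    To make the difference concrete, here is a minimal sketch that writes a small CRLF sample to a temporary file and counts the records paragraph mode returns with and without the :crlf layer. The sample text and the count_paragraphs helper are made up for illustration; on a Unix-like system the plain open should see the whole sample as a single record, while the :crlf open should see two paragraphs.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample with Windows (CRLF) line endings: two paragraphs.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
binmode $tmp_fh;    # raw bytes, so "\r\n" is written exactly as given
print {$tmp_fh} "First line\r\nsecond line\r\n\r\nNext paragraph\r\n";
close $tmp_fh or die "close failed: $!";

# Hypothetical helper: count the records paragraph mode yields
# for a given open mode.
sub count_paragraphs {
    my ( $mode, $file ) = @_;
    open my $fh, $mode, $file or die "Cannot open $file: $!";
    local $/ = "";    # paragraph mode
    my @paragraphs = <$fh>;
    close $fh;
    return scalar @paragraphs;
}

my $plain = count_paragraphs( '<',      $tmp_name );
my $crlf  = count_paragraphs( '<:crlf', $tmp_name );
print "plain open: $plain record(s), :crlf open: $crlf record(s)\n";
```

    Since :crlf leaves bare LF untouched, the same open mode should be usable for all three input files.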

    By the way, to pick two nits: Module names in all lowercase are reserved (by convention) for pragmas, so I'd name your module Translate. Also, you're not checking your open for errors.

    Q2: Do I really need all of this to extract one paragraph of translation?

    In general I'd say get it fully working first, and leave the simplification of the code for a little later :-)

      enable the :crlf PerlIO layer

      Thx haukex, that worked. Even so, the input to google exceeded their rate limit, so I had to slow it down. I added sleep time and a means to keep track of how long a file takes to translate.

      for my $file (@texts) {
        local $/ = "";
        open my $fh, '<:crlf', $file or die;
        my $base_name = path("$file")->basename;
        my $out_file  = path( $out_dir, $base_name )->touchpath;
        say "out_file is $out_file";
        ## time it
        use Benchmark;
        my $t0 = Benchmark->new;
        while (<$fh>) {
          print "New Paragraph: $_";
          my $r = get_trans( $wgt, $_ );
          for my $trans_rh ( @{ $r->{data}->{translations} } ) {
            my $result = $trans_rh->{translatedText};
            say "result is $result ";
            my @lines = split /\n/, $result;
            push @lines, "\n";
            path("$out_file")->append_utf8(@lines);
            sleep(1);
          }
        }
        my $t1 = Benchmark->new;
        my $td = timediff( $t1, $t0 );
        print "$file took:", timestr($td), "\n";
        sleep(3);
        close $fh;
      }

      84-0.txt is Shelley's Frankenstein, which is about 450 k in length. Of the $300 credit they give anyone to sign up for their API, I used 7 cents of it, so I'm down to $297.22 left. It made for an interesting way to skim both the original and the translation. This ballparks 20 minutes as an outer limit:

      /home/bob/Documents/meditations/castaways/Translate1/data/84-0.txt took:1180 wallclock secs (23.34 usr + 1.36 sys = 24.70 CPU)
      $

      Q3: What do the usr and sys numbers mean?

      Module names in all lowercase are reserved (by convention) for pragmas, so I'd name your module Translate. Also, you're not checking your open for errors.

      I did fix both of these but went with Translate1 . The reason I did this is that I know there is going to be a Translate2 that will not work with Translate1. I've heard such naming called "trampolining," and something to be avoided. Q4: Am I supposed to not have such collisions using version numbers or clever use of git? The features of the package change quickly, and sometimes, I have to roll back to something that actually worked.

      I found that I had to go back to make clean every time I made a change in the script, so I wrote a little helper bash script:

      $ cat 1.google.sh
      #!/bin/bash
      pwd
      make clean
      perl Makefile.PL
      make
      make test
      make install
      ls
      cd blib
      cd script
      ./3.my_script.pl
      $

      I offer this as a keystroke reduction mechanism, not wanting to be OT.

      The translations went well with the exception of certain characters. Let's look at a couple paragraphs with differing tags. Here is output with pre tags

      New Paragraph: €œAre you mad, my friend?€ said he. €œOr whither does your
      senseless curiosity lead you? Would you also create for yourself and the
      world a demoniacal enemy? Peace, peace! Learn my miseries and do not seek
      to increase your own.€
      
      result is - Ты злишься, друг мой? - спросил он. Или куда ты
      бессмысленное любопытство приведет тебя? Не могли бы вы также создать для себя и
      мир демонический враг? Мир, мир! Узнай мои страдания и не ищи
      увеличить свой собственный. 
      
       
      New Paragraph: Frankenstein discovered that I made notes concerning his history; he asked
      to see them and then himself corrected and augmented them in many places,
      but principally in giving the life and spirit to the conversations he held
      with his enemy. €œSince you have preserved my narration,€ said
      he, €œI would not that a mutilated one should go down to
      posterity.€
      
      result is Франкенштейн обнаружил, что я делал заметки, касающиеся его истории; он спросил
      чтобы увидеть их, а затем сам исправить и дополнить их во многих местах,
      но главным образом в том, чтобы дать жизнь и дух разговорам, которые он вел
      со своим врагом. "Так как вы сохранили мое повествование", сказал
      он, Я бы не хотел, чтобы изуродованный
      posterity.

      Here is what the 1st paragraph looks like in code tags:

      New Paragraph: &#128;&#156;Are you mad, my friend?&#128; said he. &#128;&#156;Or whither does your
      senseless curiosity lead you? Would you also create for yourself and the
      world a demoniacal enemy? Peace, peace! Learn my miseries and do not seek
      to increase your own.&#128;

      For some reason, Shelley quotes paragraphs as a matter of course, and they are getting garbled as I read in under these conditions:

      #!/usr/bin/perl -w
      use 5.011;
      use WWW::Google::Translate;
      use Data::Dumper;
      use open OUT => ':utf8';
      use Path::Tiny;
      use lib ".";
      use translate;
      binmode STDOUT, 'utf8';
      use POSIX qw(strftime);

      Google sometimes gives the correct rendering of quotes in Russian, which looks somewhat like this: << >> .

      Q5: How do I change my script so that these characters are rendered correctly? They look right as I read them in gedit.

      Finally, as I look at the arguments in Makefile.PL:

      my %WriteMakefileArgs = (
        NAME               => 'Translate1',
        AUTHOR             => q{gilligan <gilligan@island.coconut>},
        VERSION_FROM       => 'lib/Translate1.pm',
        LICENSE            => 'artistic_2',
        MIN_PERL_VERSION   => '5.006',
        CONFIGURE_REQUIRES => {
          'ExtUtils::MakeMaker' => '0',
        },
        TEST_REQUIRES => {
          'Test::More' => '0',
        },
        PREREQ_PM => {
          #'ABC'              => '1.6',
          #'Foo::Bar::Module' => '5.0401',
        },
        EXE_FILES => ['lib/3.my_script.pl'],
        dist      => { COMPRESS => 'gzip -9f', SUFFIX => 'gz', },
        clean     => { FILES    => 'Translate1-*' },
      );

      Q6: How would I determine which version of WWW::Google::Translate to require?

      Thank you for your comments,

        Try <:crlf:encoding(UTF-8), see PerlIO.
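        As a rough illustration of what the extra layer changes, the sketch below writes the UTF-8 bytes of a curly-quoted phrase to a temporary file and reads it back twice: once as raw bytes and once through the suggested layers. The sample phrase and variable names are invented for the example. Bytes read without decoding and then printed through a :utf8 output layer are encoded a second time, which is one common way to end up with garbage in place of curly quotes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Sample bytes: UTF-8 curly quotes around a phrase, with a CRLF line ending.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
binmode $tmp_fh;
print {$tmp_fh} "\xE2\x80\x9CAre you mad?\xE2\x80\x9D\r\n";
close $tmp_fh or die "close failed: $!";

# Raw read: every byte of each quote becomes a separate character.
open my $raw_fh, '<', $tmp_name or die "Cannot open $tmp_name: $!";
binmode $raw_fh;
my $bytes = <$raw_fh>;
close $raw_fh;

# Layered read: CRLF is folded to LF and the UTF-8 bytes are decoded,
# so each curly quote is a single character.
open my $fh, '<:crlf:encoding(UTF-8)', $tmp_name
  or die "Cannot open $tmp_name: $!";
my $text = <$fh>;
close $fh;

printf "raw read:     %d characters\n", length $bytes;
printf "layered read: %d characters\n", length $text;
```

        With the input decoded on the way in and the existing :utf8 layer on the way out, the quotes should survive the round trip.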

        I was going through replies and I noticed there were some unanswered questions:

        took:1180 wallclock secs (23.34 usr +  1.36 sys = 24.70 CPU) Q3: What do the usr and sys numbers mean?

        "User time" is the amount of time spent in user-mode code (your code plus any libraries it's using), and "system time" is the amount of time spent in the kernel, such as system calls.
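        A small sketch of the distinction, using Perl's built-in times and Time::HiRes (the one-second sleep and the loop size are arbitrary choices for the demo): time spent waiting counts toward wallclock time but barely registers as CPU time. That would be consistent with 1180 wallclock seconds against roughly 25 CPU seconds for a script that mostly sleeps and waits on the network.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

my $wall_start = time;
my ( $user_start, $sys_start ) = times;    # CPU seconds used so far

sleep 1;    # waiting (like a network call) costs wallclock time, not CPU

my $busy = 0;
$busy += $_ for 1 .. 200_000;    # a short loop spends user-mode CPU time

my ( $user_end, $sys_end ) = times;
my $wall = time - $wall_start;
my $cpu  = ( $user_end - $user_start ) + ( $sys_end - $sys_start );

printf "wallclock %.2f s, CPU (usr + sys) %.2f s\n", $wall, $cpu;
```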

        Q4: Am I supposed to not have such collisions using version numbers or clever use of git? The features of the package change quickly, and sometimes, I have to roll back to something that actually worked.

        This depends very much on how you plan on using and releasing this module. If this is something you're going to release on CPAN, then it's definitely important to put some thought into naming and versioning. For example, it'd be best to work beneath a single namespace (just for example Lingua::Translate::*), and especially not to pollute the top level with multiple namespaces such as Translate1:: and Translate2:: - instead, it'd be best to use a naming scheme such as Translate::MyEngine::V1 and Translate::MyEngine::V2.

        On the other hand, if this is something for your personal use, then you are free to do whatever you like and what is practical for you - you can do version control with Git, or, if you think that you'll be using multiple versions in parallel, naming like Translate1 and Translate2 (or maybe better: Translate::V1 and Translate::V2) would probably work too. Of course, it's also possible to switch between these two development modes - I've done rapid prototyping in a repository that ended up being quite littered with experiments etc., and then when it came time to release, I set up a new, clean repository into which I just put the files that should be released, added proper versioning, better naming, etc.

        “Are you mad, my friend?” said he. ... Q5: How do I change my script so that these characters are rendered correctly?

        That's definitely an encoding problem, but you'd have to show us a Short, Self-Contained, Correct Example that reproduces the issue. I showed an example of what information to provide in the case of encoding issues here.

        Q6: How would I determine which version of WWW::Google::Translate to require?

        That depends on what features of the module you're using, or whether older versions had bugs that your code is having problems with. For example, the changelog shows that the format parameter was added in 0.06, headers in 0.08, and model in 0.10. Another thing to look for might be whether newer versions changed the dependencies. Usually, I'll require the lowest possible version of a module, unless there have been egregious bugs in older versions, in which case I'll require the version after those bugs were fixed.
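        As a sketch of how that might look in the Makefile.PL shown earlier: the fragment below assumes, purely for illustration, that the code depends on the format parameter, so 0.06 is the lowest version that qualifies. The right number depends on which features your code really uses.

```perl
# Fragment of Makefile.PL (not a complete file).
# To see which version is installed locally, something like this works:
#   perl -MWWW::Google::Translate -e 'print WWW::Google::Translate->VERSION, "\n"'
PREREQ_PM => {
    # Assumption for this example: the "format" parameter (added in 0.06)
    # is the newest feature the code relies on.
    'WWW::Google::Translate' => '0.06',
},
```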

Re: chunking up texts correctly for online translation
by bliako (Abbot) on Jun 14, 2019 at 22:41 UTC

    Hello Aldebaran,

    haukex provided a solution to your question.

    I have a suggestion regarding user interaction: Getopt::Long can parse parameters passed on to a script via the command line, i.e. the well known pattern translate.pl --from-lang XYZ --to-lang ZYX --config ABC --outfile 123 or even translate.pl --help to show available languages.

    This is straightforward to implement and will help you to abstract and automate even more, for example, by creating higher level bash scripts, e.g. translate.bash "hello there" or translate.bash < /dev/telephone1 > /dev/telephone2

    Here is an example:

    use Getopt::Long;

    my $outfile    = undef;
    my $configfile = undef;
    my $infile     = undef;
    if( ! Getopt::Long::GetOptions(
        "outfile=s",    \$outfile,
        "infile=s",     \$infile,
        "configfile=s", \$configfile,
        "help", sub {
            print "Usage : $0 --configfile C [--outfile O] [--infile I] [--help]\n";
            exit 0;
        },
    ) ){ die "error, commandline" }
    die "configfile is needed (via --configfile)" unless defined $configfile;

    # read input from stdin by default, unless an infile is provided
    my $inFH;
    if( defined $infile ){
        open $inFH, '<', $infile or die "opening input file $infile, $!";
    } else {
        $inFH = \*STDIN;
    }
    my $instr;
    { local $/ = undef; $instr = <$inFH> }
    if( defined $infile ){ close $inFH }
    # do similar for outfile and STDOUT ...
    # and call your module translate, input text is in $instr ...

      Struggling to get basic functionality here. I'm removing the option to use STDIN for input, as I have a translate shell already that covers this functionality for me. Parts are working, for example, I die if no value for config is supplied. I can't seem to get to our favorite example text of late:

      $ ./1.get_opt.pl --configfile C --outfile /home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master/output/1.txt --infile /home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master/data/2.short.shelley.txt
      Use of uninitialized value $inFH in <HANDLE> at ./1.get_opt.pl line 34.
      readline() on unopened filehandle at ./1.get_opt.pl line 34.
      $ cat 1.get_opt.pl
      #!/usr/bin/perl -w
      use 5.011;
      use Getopt::Long;

      my $outfile    = undef;
      my $configfile = undef;
      my $infile     = undef;

      if (
        !Getopt::Long::GetOptions(
          "outfile=s",    \$outfile,
          "infile=s",     \$infile,
          "configfile=s", \$configfile,
          "help",
          sub {
            print "Usage : $0 --configfile C [--outfile O] [--infile I] [--help]\n";
            exit 0;
          },
        )
        )
      {
        die "error, commandline";
      }
      die "configfile is needed (via --configfile)" unless defined $configfile;

      my $inFH;
      if ( defined($infile) ) {
        open my $inFH, '<', $infile or die "opening input file $infile, $!";
      }
      my $instr;
      { local $/ = undef; $instr = <$inFH> }

      if ( defined($instr) ) {
        say "input is $instr";
      }
      $
        my $inFH;
        if ( defined($infile) ) {
          open my $inFH, '<', $infile or die "opening input file $infile, $!";
        }
        my $instr;
        { local $/ = undef; $instr = <$inFH> }

        if ( defined($instr) ) {
          say "input is $instr";
        }

        $inFH is being abused! You declare it, then you declare it again in an inner scope. Whatever value it gets from that inner scope is forgotten as soon as it comes out (of that scope). Then you slurp it ... but it's already closed (I mean $inFH) because a filehandle exiting the scope is closed automatically, see When do filehandles close?.

        So perhaps something like this?:

        my $instr;
        if ( defined($infile) ) {
          my $inFH;
          open $inFH, '<', $infile or die "opening input file $infile, $!";
          { local $/ = undef; $instr = <$inFH> }
          close $inFH;    # just polite to close it
        }

        if ( defined($instr) ) {
          say "input is $instr";
        }
        else {
          die "sorry, you did not specify an input either via a file or via [other ways of specifying an instr] ";
        }

Re: chunking up texts correctly for online translation
by jwkrahn (Abbot) on Jun 15, 2019 at 00:20 UTC
    my $ini_path = qw( /home/bob/Documents/html_template_data/3.values.ini );
    ...
    my $input_directory = qw( /home/bob/Documents/meditations/castaways/translate/data );

    You are assigning a list to a scalar variable. This only works because there is only one element in the list. The correct way is to assign a scalar value to a scalar variable:

    my $ini_path = '/home/bob/Documents/html_template_data/3.values.ini';

    Or a list to a list:

    my ( $ini_path ) = qw( /home/bob/Documents/html_template_data/3.values.ini );

    if ( $prompt1 eq ( "y" | "Y" ) ) { show_lang_codes(); }

    $ perl -le'use Data::Dumper; my $x = ( "y" | "Y" ); print Dumper $x'
    $VAR1 = "y";

    That code is only comparing $prompt1 to a lower case "y" but not to an upper case "Y". If you want to compare to both lower and upper case you can do the following:

    if ( $prompt1 eq "y" || $prompt1 eq "Y" ) { show_lang_codes(); }

    Or:

    if ( lc $prompt1 eq "y" ) { show_lang_codes(); }

    Or:

    if ( $prompt1 =~ /\Ay\z/i ) { show_lang_codes(); }
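
    To see why the original comparison misbehaves: when both operands are strings (and the bitwise feature of newer Perls is not enabled), | ORs the character codes together instead of expressing "either of these". The short sketch below shows that, followed by a hypothetical is_yes helper (not part of the original script) that could replace the repeated prompt checks.

```perl
use strict;
use warnings;

# On two strings, | is a bitwise OR of the character codes.
my $combined = "y" | "Y";    # 0x79 | 0x59 is 0x79, which is "y"
my $surprise = "a" | "B";    # 0x61 | 0x42 is 0x63, which is "c"
print "combined: $combined, surprise: $surprise\n";

# Hypothetical helper: true for y, Y, yes, YES and so on.
sub is_yes {
    my ($answer) = @_;
    return 0 unless defined $answer;
    chomp $answer;
    return $answer =~ /\Ay(?:es)?\z/i ? 1 : 0;
}

print "$_ => ", is_yes($_), "\n" for "y", "Y", "yes", "n";
```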

      qq looks more suitable in this case (for OP: it just double-quotes the parameter and returns a scalar)

        I've been looking at the source listing for Translate.pm and wonder whether I should imitate it in significant ways. For example, should I use Readonly for these values?

        my ( $REST_HOST, $REST_URL, $CONSOLE_URL, %SIZE_LIMIT_FOR );
        {
            Readonly $REST_HOST   => 'translation.googleapis.com';
            Readonly $REST_URL    => "https://$REST_HOST/language/translate/v2";
            Readonly $CONSOLE_URL => "https://console.developers.google.com/cloud-resource-manager";
            Readonly %SIZE_LIMIT_FOR => (
                translate => 2000,    # google states 2K but observed results vary
                detect    => 2000,
                languages => 9999,    # N/A
            );
        }

        Also, I want to build a central hash to store the data. Values would go in with Getopt::Long and end up in %self . This treatment creates a class of it. One thing I don't see is where the value of $class gets passed to the sub new:

        sub new {
            my ( $class, $param_hr ) = @_;

            my %self = (
                key            => 0,
                format         => 0,
                model          => 0,
                prettyprint    => 0,
                default_source => 0,
                default_target => 0,
                data_format    => 'perl',
                timeout        => 60,
                force_post     => 0,
                rest_url       => $REST_URL,
                agent          => ( sprintf '%s/%s', __PACKAGE__, $VERSION ),
                cache_file     => 0,
                headers        => {},
            );
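
        On the question of where $class comes from: in Perl, a method call written with -> passes whatever stands to the left of the arrow as the first argument. The sketch below uses a made-up class, My::Translator, with a trimmed-down constructor in the same shape as the one quoted above; it is an illustration, not the module's real code.

```perl
use strict;
use warnings;

package My::Translator;    # hypothetical class, for illustration only

sub new {
    my ( $class, $param_hr ) = @_;
    # Defaults first, then whatever the caller passed in.
    my %self = ( timeout => 60, %{ $param_hr || {} } );
    return bless \%self, $class;
}

package main;

# The arrow call puts the class name first in @_ ...
my $obj = My::Translator->new( { timeout => 10 } );

# ... so, ignoring inheritance, it behaves like this plain function call.
my $same = My::Translator::new( 'My::Translator', { timeout => 10 } );

print ref($obj), " ", ref($same), " timeout=$obj->{timeout}\n";
```

        The same mechanism applies to the driver script: WWW::Google::Translate->new( {...} ) hands the string 'WWW::Google::Translate' to new as $class.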