chunking up texts correctly for online translation

Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I've been developing a primitive package that translates along the lines of _Intermediate Perl_, specifically using module-starter and customizing an appropriate Makefile.PL. I may have put files in loopy places due to inexperience, but I am getting preliminary results. Two of my three test files are translated faithfully. I have output and source to show, to motivate a couple of questions:

Here are the driver script and the package:

#!/usr/bin/perl -w
use 5.011;

use WWW::Google::Translate;
use Data::Dumper;
use open OUT => ':utf8';
use Path::Tiny;
use lib ".";
use translate;
binmode STDOUT, 'utf8';
use POSIX qw(strftime);

### values to initialize  (customize these to suit)

my $ini_path = qw( /home/bob/Documents/html_template_data/3.values.ini );
my $sub_hash = "google";
my ( $from, $to ) = ( 'en', 'ru' );    #put defaults here
my $input_directory =
  qw( /home/bob/Documents/meditations/castaways/translate/data
);
my $output_appendage = "output";

## get values for google from an .ini file
my $key = get_config( $ini_path, $sub_hash );

say "Would you like to see the possibilities?";
my $prompt1 = <STDIN>;
chomp $prompt1;
if ( $prompt1 eq ( "y" | "Y" ) ) {
  show_lang_codes();
}

say "Would you like to change the from language?";
$prompt1 = <STDIN>;
chomp $prompt1;
if ( $prompt1 eq ( "y" | "Y" ) ) {
  $from = get_lang($from);
}

say "Would you like to change the to language?";
$prompt1 = <STDIN>;
chomp $prompt1;
if ( $prompt1 eq ( "y" | "Y" ) ) {
  $to = get_lang($to);
}

# create output directory
say "Creating output directory as nephew of input";
say "using localtime for uniqueness";
my $munge = strftime( "%d-%m-%Y-%H-%M-%S\.txt", localtime );

my $parent = path($input_directory)->parent;
my $out_dir = path( $parent, $output_appendage, $munge );

my $wgt = WWW::Google::Translate->new(
  {
    key            => $key,
    default_source => $from,
    default_target => $to,
  }
);

my @texts = path("$input_directory")->children(qr/\.txt$/);
say "texts are @texts";
for my $file (@texts) {

  local $/ = "";
  open my $fh, '<', $file;
  my $base_name = path("$file")->basename;
  my $out_file = path( $out_dir, $base_name )->touchpath;
  say "out_file is $out_file";
  while (<$fh>) {

    print "New Paragraph: $_";
    my $r = get_trans( $wgt, $_ );
    for my $trans_rh ( @{ $r->{data}->{translations} } ) {
      #print $trans_rh->{translatedText}, "\n";
      my $result = $trans_rh->{translatedText};
      say "result is $result ";
      my @lines = split /\n/, $result;
      push @lines, "\n";
      path("$out_file")->append_utf8(@lines);
    }

  }

  close $fh;

}

package translate;

use 5.006;
use strict;
use warnings;

require Exporter;

our @ISA    = qw(Exporter);
our @EXPORT = qw( get_config get_trans get_lang show_lang_codes reverse_trans );
our $VERSION = '0.01';

=head1 SYNOPSIS



  use translate;

  my $key = get_config('path-to-ini-file', $sub_hash);
  my $from = get_lang($from_default);
  my $to = get_lang($to_default);
  my $trans_output_file = get_trans($input_file, $from, $to, $key);
  my $reverse = reverse_trans($trans_output_file, $to, $from, $key);



=cut

sub get_config {

  use Config::Tiny;
  use Data::Dumper;
  use open OUT => ':encoding(UTF-8)';
  use Path::Tiny;
  use 5.011;

  my ( $ini_path, $sub_hash ) = @_;
  say "ini path is $ini_path";
  say "sub_hash is $sub_hash";
  my $Config = Config::Tiny->new;
  $Config = Config::Tiny->read( $ini_path, 'utf8' );
  say Dumper $Config;
  my $key = $Config->{$sub_hash}{'api_key_1'};
  return $key;
}

sub get_lang {

  use Path::Tiny;
  use 5.011;

  my $lang = shift;

  say "Would you like to change languages?";
  my $prompt1 = <STDIN>;
  chomp $prompt1;
  if ( $prompt1 eq ( "y" | "Y" ) ) {
    say "enter new lang: ";
    $prompt1 = <STDIN>;
    chomp $prompt1;

    $lang = $prompt1;

  }

  return $lang;

}

sub show_lang_codes {

  use Path::Tiny;
  use 5.011;

  my $path_to_langs = path( "my_data", "lang_data", "1.langlist.txt" );
  my $data = $path_to_langs->slurp_utf8;
  say "$data";

}

sub get_trans {

  use Path::Tiny;
  use 5.011;
  use WWW::Google::Translate;

  my ( $wgt, $paragraph ) = @_;

  my $r = $wgt->translate( { q => $paragraph } );
  return $r;

}

1;    # End of translate

This is typical output for my first two input files:

New Paragraph:         Did the last version also pass all tests? Were the changes required? Were new tests added to cover the changes?

result is         Последняя версия также прошла все тесты? Были ли необходимы изменения? Были ли добавлены новые тесты для покрытия изменений?

It chunks up nicely and formats well in the output file too.

My final input file did not. It is Shelley's Frankenstein from Gutenberg texts online. With my input record separator as it is, it seems to slurp in the entire book at once:

,
source => en,
format => text
unsuccessful translate POST for 450783 bytes: Request payload size exceeds the limit: 204800 bytes.
check that BOB-THINKPAD-SL510 has API Access for this API key
at https://console.developers.google.com/cloud-resource-manager
$

I took a look at it in a hex editor, and the Shelley text has

0D 0A

as line endings, while the first two have 0A. Is this not the Unix versus Windows line-ending problem? If so, it must be a well-worn path.

Q1: How do I rewrite my script so that I get paragraph-sized chunks getting sent to google regardless of line feed encoding?

    my $r = get_trans( $wgt, $_ );
    for my $trans_rh ( @{ $r->{data}->{translations} } ) {
      #print $trans_rh->{translatedText}, "\n";
      my $result = $trans_rh->{translatedText};
      say "result is $result ";
      my @lines = split /\n/, $result;
      push @lines, "\n";
      path("$out_file")->append_utf8(@lines);
    }

Q2: Do I really need all of this to extract one paragraph of translation?

Thanks for your comment,


Replies are listed 'Best First'.
Re: chunking up texts correctly for online translation by haukex (Archbishop) on Jun 14, 2019 at 22:02 UTC
Q1: How do I rewrite my script so that I get paragraph-sized chunks getting sent to google regardless of line feed encoding?

You're right that paragraph mode (`$/ = ""`) doesn't work right when reading a CRLF file on *NIX. You could enable the `:crlf` PerlIO layer, which will leave plain LF as is but convert CRLF to LF, and paragraph mode will work: `open my $fh, '<:crlf', $file`

By the way, to pick two nits: Module names in all lowercase are reserved (by convention) for pragmas, so I'd name your module `Translate`. Also, you're not checking your open for errors.

Q2: Do I really need all of this to extract one paragraph of translation?

In general I'd say get it fully working first, and leave the simplification of the code for a little later :-)
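To see the `:crlf` suggestion in isolation, here is a minimal sketch. The helper name `read_paragraphs` and the sample text are made up for illustration and are not part of the original script; the temporary file only exists to give the demo a CRLF input.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a file in paragraph mode, tolerating LF and CRLF.
sub read_paragraphs {
    my ($file) = @_;
    open my $fh, '<:crlf', $file or die "Cannot open $file: $!";
    local $/ = "";          # paragraph mode: a record ends at one or more blank lines
    my @paragraphs = <$fh>;
    close $fh;
    chomp @paragraphs;      # in paragraph mode chomp removes all trailing newlines
    return @paragraphs;
}

# Build a small CRLF sample file (made-up content) to exercise the helper.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
binmode $tmp_fh;            # write the bytes exactly as given
print {$tmp_fh} "First paragraph,\r\nline two.\r\n\r\nSecond paragraph.\r\n";
close $tmp_fh;

my @paragraphs = read_paragraphs($tmp_name);
printf "%d paragraphs\n", scalar @paragraphs;
```

Without the `:crlf` layer, the blank line in this file would arrive as `\r\n` on a Unix system, which paragraph mode does not treat as empty, so the whole file would come back as one record.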
Re^2: chunking up texts correctly for online translation by Aldebaran (Curate) on Jun 17, 2019 at 23:46 UTC
enable the :crlf PerlIO layer

Thx haukex, that worked. Even so, the input to google exceeded their rate limit, so I had to slow it down. I added sleep time and a means to keep track of how long a file takes to translate.

    for my $file (@texts) {

      local $/ = "";
      open my $fh, '<:crlf', $file or die;
      my $base_name = path("$file")->basename;
      my $out_file = path( $out_dir, $base_name )->touchpath;
      say "out_file is $out_file";
      ## time it
      use Benchmark;
      my $t0 = Benchmark->new;
      while (<$fh>) {
        print "New Paragraph: $_";
        my $r = get_trans( $wgt, $_ );
        for my $trans_rh ( @{ $r->{data}->{translations} } ) {
          my $result = $trans_rh->{translatedText};
          say "result is $result ";
          my @lines = split /\n/, $result;
          push @lines, "\n";
          path("$out_file")->append_utf8(@lines);
          sleep(1);
        }
      }
      my $t1 = Benchmark->new;
      my $td = timediff( $t1, $t0 );
      print "$file took:", timestr($td), "\n";
      sleep(3);
      close $fh;
    }

84-0.txt is Shelley's Frankenstein, which is about 450 k in length. Of the $300 credit they give anyone to sign up for their API, I used 7 cents of it, so I'm down to $297.22 left. It made for an interesting way to skim both the original and the translation. This ballparks 20 minutes as an outer limit:

    /home/bob/Documents/meditations/castaways/Translate1/data/84-0.txt took:1180 wallclock secs (23.34 usr + 1.36 sys = 24.70 CPU)
    $

Q3: What do the usr and sys numbers mean?

Module names in all lowercase are reserved (by convention) for pragmas, so I'd name your module Translate. Also, you're not checking your open for errors.

I did fix both of these but went with Translate1. The reason I did this is that I know there is going to be a Translate2 that will not work with Translate1. I've heard such naming called "trampolining," and something to be avoided.

Q4: Am I supposed to not have such collisions using version numbers or clever use of git? The features of the package change quickly, and sometimes, I have to roll back to something that actually worked.

I found that I had to go back to make clean every time I made a change in the script, so I wrote a little helper bash script:

    $ cat 1.google.sh
    #!/bin/bash
    pwd
    make clean
    perl Makefile.PL
    make
    make test
    make install
    ls
    cd blib
    cd script
    ./3.my_script.pl
    $

I offer this as a keystroke reduction mechanism, not wanting to be OT.

The translations went well with the exception of certain characters. Let's look at a couple paragraphs with differing tags. Here is output with pre tags:

New Paragraph: т€œAre you mad, my friend?т€Э said he. т€œOr whither does your senseless curiosity lead you? Would you also create for yourself and the world a demoniacal enemy? Peace, peace! Learn my miseries and do not seek to increase your own.т€Э

result is - Ты злишься, друг мой? - спросил он. лИли куда ты бессмысленное любопытство приведет тебя? Не могли бы вы также создать для себя и мир демонический враг? Мир, мир! Узнай мои страдания и не ищи увеличить свой собственный. А

New Paragraph: Frankenstein discovered that I made notes concerning his history; he asked to see them and then himself corrected and augmented them in many places, but principally in giving the life and spirit to the conversations he held with his enemy. т€œSince you have preserved my narration,т€Э said he, т€œI would not that a mutilated one should go down to posterity.т€Э

result is Франкенштейн обнаружил, что я делал заметки, касающиеся его истории; он спросил чтобы увидеть их, а затем сам исправить и дополнить их во многих местах, но главным образом в том, чтобы дать жизнь и дух разговорам, которые он вел со своим врагом. "Так как вы сохранили мое повествование", сказал он, лЯ бы не хотел, чтобы изуродованный posterity.т

Here is what the 1st paragraph looks like in code tags:

    New Paragraph: тAre you mad, my friend?тЭ said he. тOr whither does your senseless curiosity lead you? Would you also create for yourself and the world a demoniacal enemy? Peace, peace! Learn my miseries and do not seek to increase your own.тЭ

For some reason, Shelley quotes paragraphs as a matter of course, and they are getting garbled as I read in under these conditions:

    #!/usr/bin/perl -w
    use 5.011;

    use WWW::Google::Translate;
    use Data::Dumper;
    use open OUT => ':utf8';
    use Path::Tiny;
    use lib ".";
    use translate;
    binmode STDOUT, 'utf8';
    use POSIX qw(strftime);

Google sometimes gives the correct rendering of quotes in Russian. They do it somewhat like this: << >> .

Q5: How do I change my script so that these characters are rendered correctly? They look right as I read them in gedit.

Finally, as I look at the arguments in Makefile.PL:

    my %WriteMakefileArgs = (
      NAME             => 'Translate1',
      AUTHOR           => q{gilligan <gilligan@island.coconut>},
      VERSION_FROM     => 'lib/Translate1.pm',
      LICENSE          => 'artistic_2',
      MIN_PERL_VERSION => '5.006',
      CONFIGURE_REQUIRES => {
        'ExtUtils::MakeMaker' => '0',
      },
      TEST_REQUIRES => {
        'Test::More' => '0',
      },
      PREREQ_PM => {
        #'ABC'              => '1.6',
        #'Foo::Bar::Module' => '5.0401',
      },
      EXE_FILES => ['lib/3.my_script.pl'],
      dist  => { COMPRESS => 'gzip -9f', SUFFIX => 'gz', },
      clean => { FILES => 'Translate1-*' },
    );

Q6: How would I determine which version of WWW::Google::Translate to require?

Thank you for your comments,
Re^3: chunking up texts correctly for online translation by daxim (Curate) on Jun 19, 2019 at 11:15 UTC
Try `<:crlf:encoding(UTF-8)`, see PerlIO.
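A small sketch of what that layer stack does, assuming the input really is UTF-8 (as the Gutenberg file appears to be). The sample line is made up; only the `open` mode string comes from the suggestion above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write one CRLF-terminated line wrapped in curly quotes, as raw UTF-8 bytes.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
binmode $tmp_fh;
# \xE2\x80\x9C and \xE2\x80\x9D are the UTF-8 encodings of U+201C and U+201D.
print {$tmp_fh} "\xE2\x80\x9CPeace, peace!\xE2\x80\x9D\r\n";
close $tmp_fh;

# :crlf normalises the line ending, :encoding(UTF-8) decodes bytes to characters.
open my $fh, '<:crlf:encoding(UTF-8)', $tmp_name
    or die "Cannot open $tmp_name: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# Encode again on the way out so the terminal receives valid UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
printf "%d characters, first is U+%04X\n", length($line), ord($line);
```

If the file is opened without the encoding layer, each curly quote arrives as three separate bytes, and a later UTF-8 output layer encodes each byte again, which is one common way to end up with garbage where the quotes should be.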
Re^4: chunking up texts correctly for online translation by Aldebaran (Curate) on Jun 27, 2019 at 18:15 UTC
Re^5: chunking up texts correctly for online translation by hippo (Archbishop) on Jun 27, 2019 at 21:20 UTC
Re^5: chunking up texts correctly for online translation by haukex (Archbishop) on Jun 30, 2019 at 09:27 UTC
Re^3: chunking up texts correctly for online translation by haukex (Archbishop) on Jul 07, 2019 at 08:45 UTC
I was going through replies and I noticed there were some unanswered questions:

`took:1180 wallclock secs (23.34 usr + 1.36 sys = 24.70 CPU)`

Q3: What do the usr and sys numbers mean?

"User time" is the amount of time spent in user-mode code (your code plus any libraries it's using), and "system time" is the amount of time spent in the kernel, such as system calls.

Q4: Am I supposed to not have such collisions using version numbers or clever use of git? The features of the package change quickly, and sometimes, I have to roll back to something that actually worked.

This depends very much on how you plan on using and releasing this module. If this is something you're going to release on CPAN, then it's definitely important to put some thought into naming and versioning. For example, it'd be best to work beneath a single namespace (just for example `Lingua::Translate::`), and especially not to pollute the top level with multiple namespaces such as `Translate1::` and `Translate2::` - instead, it'd be best to use a naming scheme such as `Translate::MyEngine::V1` and `Translate::MyEngine::V2`.

On the other hand, if this is something for your personal use, then you are free to do whatever you like and what is practical for you - you can do version control with Git, or, if you think that you'll be using multiple versions in parallel, naming like `Translate1` and `Translate2` (or maybe better: `Translate::V1` and `Translate::V2`) would probably work too.

Of course, it's also possible to switch between these two development modes - I've done rapid prototyping in a repository that ended up being quite littered with experiments etc., and then when it came time to release, I set up a new, clean repository into which I just put the files that should be released, added proper versioning, better naming, etc.

`тАЬAre you mad, my friend?тАЭ said he.` ...

Q5: How do I change my script so that these characters are rendered correctly?

That's definitely an encoding problem, but you'd have to show us a Short, Self-Contained, Correct Example that reproduces the issue. I showed an example of what information to provide in the case of encoding issues here.

Q6: How would I determine which version of WWW::Google::Translate to require?

That depends on what features of the module you're using, or whether older versions had bugs that your code is having problems with. For example, the changelog shows that the `format` parameter was added in 0.06, `headers` in 0.08, and `model` in 0.10. Another thing to look for might be whether newer versions changed the dependencies. Usually, I'll require the lowest possible version of a module, unless there have been egregious bugs in older versions, in which case I'll require the version after those bugs were fixed.
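Coming back to Q3, the split between wallclock, usr and sys can be reproduced with a few lines of the core Benchmark module. This is only an illustrative sketch: the summing loop is a made-up stand-in for real work, and the actual numbers will differ from machine to machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(timediff timestr);

my $t0 = Benchmark->new;

# Pure computation: this shows up almost entirely as "usr" time.
my $sum = 0;
$sum += $_ for 1 .. 200_000;

my $t1 = Benchmark->new;

# timestr formats the difference as "N wallclock secs ( X usr + Y sys = Z CPU)".
my $report = timestr( timediff( $t1, $t0 ) );
print "loop took: $report\n";
```

In the translation run quoted above, the wallclock figure (1180 s) dwarfs the CPU figure (24.70 s) because the script spends most of its time waiting, in `sleep` and on the network, and waiting consumes neither user nor system CPU time.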
Re: chunking up texts correctly for online translation by bliako (Abbot) on Jun 14, 2019 at 22:41 UTC
Hello Aldebaran, haukex provided a solution to your question. I have a suggestion regarding user interaction: Getopt::Long can parse parameters passed on to a script via the command line, i.e. the well-known pattern `translate.pl --from-lang XYZ --to-lang ZYX --config ABC --outfile 123` or even `translate.pl --help` to show available languages.

This is straightforward to implement and will help you to abstract and automate even more, for example, by creating higher level bash scripts, e.g. `translate.bash "hello there"` or `translate.bash < /dev/telephone1 > /dev/telephone2`

Here is an example:

    use Getopt::Long;

    my $outfile = undef;
    my $configfile = undef;
    my $infile = undef;
    if( ! Getopt::Long::GetOptions(
        "outfile=s", \$outfile,
        "infile=s", \$infile,
        "configfile=s", \$configfile,
        "help", sub { print "Usage : $0 --configfile C [--outfile O] [--infile I] [--help]\n"; exit 0; },
    ) ){ die "error, commandline" }

    die "configfile is needed (via --configfile)" unless defined $configfile;

    # read input from stdin by default, unless an in file is provided
    my $inFH = \*STDIN;
    if( defined $infile ){
        open my $fh, '<', $infile or die "opening input file $infile, $!";
        $inFH = $fh;
    }
    my $instr; {local $/ = undef; $instr = <$inFH> }
    if( defined($infile) ){ close $inFH }

    # do similar for outfile and STDOUT
    ...
    # and call your module translate, input text is in $instr
    ...
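A variation on the same idea that collects everything into one hash, which makes it easy to hand the options on to a module. This is a sketch under assumptions: the option names `from` and `to`, the helper `parse_args`, and the sample argument list are hypothetical, and `GetOptionsFromArray` is used instead of `GetOptions` only so the example can run on a fixed list rather than the real `@ARGV`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse a list of arguments into a hash of options.
sub parse_args {
    my @args = @_;
    my %opt  = ( from => 'en', to => 'ru' );    # defaults as in the driver script
    GetOptionsFromArray(
        \@args, \%opt,
        'from=s', 'to=s', 'configfile=s', 'infile=s', 'outfile=s',
    ) or die "error parsing command line\n";
    die "configfile is needed (via --configfile)\n"
        unless defined $opt{configfile};
    return \%opt;
}

# In a real script this would be parse_args(@ARGV); the list here is made up.
my $opt = parse_args(qw(--configfile 3.values.ini --to de));
print "$_ => $opt->{$_}\n" for sort keys %$opt;
```

Options given on the command line overwrite the defaults in the hash, and anything not mentioned keeps its default, so the interactive "would you like to change the language?" prompts become unnecessary.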
Re^2: chunking up texts correctly for online translation by Aldebaran (Curate) on Jul 02, 2019 at 21:39 UTC
Struggling to get basic functionality here. I'm removing the option to use STDIN for input, as I have a translate shell already that covers this functionality for me. Parts are working, for example, I die if no value for config is supplied. I can't seem to get to our favorite example text of late:

    $ ./1.get_opt.pl --configfile C --outfile /home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master/output/1.txt --infile /home/bob/Documents/meditations/Algorithm-Markov-Multiorder-Learner-master/data/2.short.shelley.txt
    Use of uninitialized value $inFH in <HANDLE> at ./1.get_opt.pl line 34.
    readline() on unopened filehandle at ./1.get_opt.pl line 34.
    $ cat 1.get_opt.pl
    #!/usr/bin/perl -w
    use 5.011;
    use Getopt::Long;

    my $outfile    = undef;
    my $configfile = undef;
    my $infile     = undef;
    if (
      !Getopt::Long::GetOptions(
        "outfile=s",    \$outfile,
        "infile=s",     \$infile,
        "configfile=s", \$configfile,
        "help",
        sub {
          print "Usage : $0 --configfile C [--outfile O] [--infile I] [--help]\n";
          exit 0;
        },
      )
      )
    {
      die "error, commandline";
    }

    die "configfile is needed (via --configfile)" unless defined $configfile;

    my $inFH;
    if ( defined($infile) ) {
      open my $inFH, '<', $infile or die "opening input file $infile, $!";
    }
    my $instr;
    { local $/ = undef; $instr = <$inFH> }
    if ( defined($instr) ) {
      say "input is $instr";
    }
    $
Re^3: chunking up texts correctly for online translation by bliako (Abbot) on Jul 03, 2019 at 00:39 UTC
    my $inFH;
    if ( defined($infile) ) {
      open my $inFH, '<', $infile or die "opening input file $infile, $!";
    }
    my $instr;
    { local $/ = undef; $instr = <$inFH> }
    if ( defined($instr) ) {
      say "input is $instr";
    }

`$inFH` is being abused! You declare it, then you declare it again in an inner scope. Whatever value it gets from that inner scope is forgotten as soon as it comes out (of that scope). Then you slurp it ... but it's already closed (I mean `$inFH`) because a filehandle exiting the scope is closed automatically, see When do filehandles close?. So perhaps something like this?:

    my $instr;
    if ( defined($infile) ) {
      my $inFH;
      open $inFH, '<', $infile or die "opening input file $infile, $!";
      { local $/ = undef; $instr = <$inFH> }
      close $inFH;    # just polite to close it
    }
    if ( defined($instr) ) {
      say "input is $instr";
    }
    else {
      die "sorry, you did not specify an input either via a file or via [other ways of specifying an instr] ";
    }
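One way to make this kind of scoping mistake harder to repeat is to keep the handle entirely inside a small function, so that nothing outside ever needs to see it. A minimal sketch follows; the name `slurp_file` and the sample content are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: the lexical handle lives and dies inside this sub.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "opening input file $path: $!";
    local $/;                  # undef record separator: read the whole file at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Made-up sample input so the sketch can run on its own.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "line one\nline two\n";
close $tmp_fh;

my $instr = slurp_file($tmp_name);
print "input is $instr";
```

Since the scripts in this thread already load Path::Tiny, `path($infile)->slurp_utf8` would do the same job in one call, as `show_lang_codes` in the original module already does.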
Re: chunking up texts correctly for online translation by jwkrahn (Abbot) on Jun 15, 2019 at 00:20 UTC
    my $ini_path = qw( /home/bob/Documents/html_template_data/3.values.ini );
    ...
    my $input_directory =
      qw( /home/bob/Documents/meditations/castaways/translate/data );

You are assigning a list to a scalar variable. This only works because there is only one element in the list. The correct way is to assign a scalar value to a scalar variable:

    my $ini_path = '/home/bob/Documents/html_template_data/3.values.ini';

Or a list to a list:

    my ( $ini_path ) = qw( /home/bob/Documents/html_template_data/3.values.ini );

Next:

    if ( $prompt1 eq ( "y" | "Y" ) ) {
      show_lang_codes();
    }

    $ perl -le'use Data::Dumper; my $x = ( "y" | "Y" ); print Dumper $x'
    $VAR1 = "y";

That code is only comparing `$prompt1` to a lower case "y" but not to an upper case "Y". If you want to compare to both lower and upper case you can do the following:

    if ( $prompt1 eq "y" || $prompt1 eq "Y" ) {
      show_lang_codes();
    }

Or:

    if ( lc $prompt1 eq "y" ) {
      show_lang_codes();
    }

Or:

    if ( $prompt1 =~ /\Ay\z/i ) {
      show_lang_codes();
    }
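Since the same yes/no test appears four times in the original script and module, it could be pulled into one small function. The sketch below is only a suggestion: the name `is_yes` is hypothetical, and accepting "yes" as well as "y" is an added convenience that the original code does not have.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true for y, Y, yes, YES and so on, false for anything else.
sub is_yes {
    my ($answer) = @_;
    return 0 unless defined $answer;    # e.g. end of input on STDIN
    chomp $answer;
    return $answer =~ /\A\s*y(?:es)?\s*\z/i ? 1 : 0;
}

# The bitwise string OR from the original condition, shown on its own:
# "y" is 0x79 and "Y" is 0x59, and 0x79 | 0x59 is 0x79, i.e. "y" again.
my $or = "y" | "Y";

print "bitwise or gives: $or\n";
print "is_yes('Y') gives: ", is_yes("Y\n"), "\n";
print "is_yes('n') gives: ", is_yes("n\n"), "\n";
```

With such a helper each prompt in the driver script reduces to something like `show_lang_codes() if is_yes( scalar <STDIN> );`.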
Re^2: chunking up texts correctly for online translation by bliako (Abbot) on Jun 15, 2019 at 08:55 UTC
`qq` looks more suitable in this case (for OP: it just double-quotes the parameter and returns a scalar)
Re^3: chunking up texts correctly for online translation by Aldebaran (Curate) on Jun 30, 2019 at 18:12 UTC
I've been looking at the source for the pm (source listing for Translate.pm) and wonder if I shouldn't imitate it in significant ways. For example, should I use Readonly for these values?

    my ( $REST_HOST, $REST_URL, $CONSOLE_URL, %SIZE_LIMIT_FOR );
    {
        Readonly $REST_HOST    => 'translation.googleapis.com';
        Readonly $REST_URL     => "https://$REST_HOST/language/translate/v2";
        Readonly $CONSOLE_URL  => "https://console.developers.google.com/cloud-resource-manager";
        Readonly %SIZE_LIMIT_FOR => (
            translate => 2000,    # google states 2K but observed results vary
            detect    => 2000,
            languages => 9999,    # N/A
        );
    }

Also, I want to build a central hash to store the data. Values would go in with Getopt::Long and end up in %self. This treatment creates a class of it. One thing I don't see is where the value of $class gets passed to the sub new:

    sub new {
        my ( $class, $param_hr ) = @_;

        my %self = (
            key            => 0,
            format         => 0,
            model          => 0,
            prettyprint    => 0,
            default_source => 0,
            default_target => 0,
            data_format    => 'perl',
            timeout        => 60,
            force_post     => 0,
            rest_url       => $REST_URL,
            agent          => ( sprintf '%s/%s', __PACKAGE__, $VERSION ),
            cache_file     => 0,
            headers        => {},
        );
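On the `$class` question: with the arrow syntax `Some::Class->new(...)`, Perl itself passes whatever stands to the left of the arrow as the first argument, so the caller never writes it out. The sketch below illustrates this with a made-up class; `My::Translator` and its fields are hypothetical and only loosely modelled on the constructor quoted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package My::Translator;    # hypothetical class, for illustration only

sub new {
    # For My::Translator->new({...}), $class is the string 'My::Translator'.
    my ( $class, $param_hr ) = @_;
    my %self = (
        default_source => 'en',
        default_target => 'ru',
        timeout        => 60,
        %{ $param_hr || {} },    # caller-supplied values override the defaults
    );
    return bless \%self, $class;
}

package main;

my $obj = My::Translator->new( { default_target => 'de' } );
print ref($obj), " translates $obj->{default_source} to $obj->{default_target}\n";
```

The same mechanism is at work in the driver script's `WWW::Google::Translate->new( { key => $key, ... } )` call: the class name arrives as `$class` and the hash reference as `$param_hr`.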
Re^4: chunking up texts correctly for online translation by bliako (Abbot) on Jul 01, 2019 at 08:58 UTC
Re^5: chunking up texts correctly for online translation by Aldebaran (Curate) on Jul 01, 2019 at 22:15 UTC