Aldebaran has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I present a transformed version of my HTML template, where the scope is limited to Russian crosswords. A lot of things just don't seem to want to work the same when you use Cyrillic characters, and I hope that the lessons I seek will generalize to others' Unicode projects. Let me start with a listing of the main script in readmore tags, and then I'll pull out the parts that need grease.

$ cat 7.cw1.pl
#!/usr/bin/perl -w
use 5.011;
use lib "template_stuff";
use html7;
use trans2;    ##yandex option available
use Path::Tiny;
use utils1;
use utf8;
use Encode;
use open OUT => ':encoding(UTF-8)', ':std';
use Net::SFTP::Foreign;
use Data::Dumper;

# initializations that must precede main data structure
my $ts          = "template_stuff";
my $images      = "aimages";
my $captions    = "captions";
my $ruscaptions = "ruscaptions";

## turning things to Path::Tiny
# decode paths
my $abs   = path(__FILE__)->absolute;
my $path1 = Path::Tiny->cwd;
my $title = $path1->basename;
$abs   = decode( 'UTF-8', $abs );
$path1 = decode( 'UTF-8', $path1 );
$title = decode( 'UTF-8', $title );
say "title is $title";
say "path1 is $path1";
say "abs is $abs";
my $path2 = path( $path1, $ts );

# page params
my %vars = (
  title        => $title,
  headline     => undef,
  place        => 'Vancouver',
  base_url     => 'http://www.merrillpjensen.com',
  css_file     => "${title}1.css",
  header       => path( $path2, "hc_input2.txt" ),
  footer       => path( $path2, "footer_center3.txt" ),
  body         => path( $path2, "rebus5.tmpl" ),
  print_script => "1",
  code_tmpl    => path( $path2, "code2.tmpl" ),
  oitop        => path( $path2, "oitop.txt" ),
  oibottom     => path( $path2, "oibottom.txt" ),
  to_images    => path( $path2, $images ),
  eng_captions => path( $path2, $captions ),
  rus_captions => path( $path2, $ruscaptions ),
  translations => path( $path2, 'translations' ),
  bottom       => path( $path2, "bottom1.txt" ),
  book         => 'Crosswords: ',
  chapter      => 'Кроссворды',
  make_puzzle  => 1,
  print_module => 0,
  script_file  => $abs,
  module_tmpl  => path( $path2, "code3.tmpl" ),
  server_dir   => 'perlmonks',
  image_dir    => 'pmimage',
  ts           => 'template_system',
  css_path     => $path2,
  ini_path     => path('/home/bob/Documents/html_template_data/3.ценности.ini'),
  cw           => path( $path2, 'crosswords' ),
);

my $rvars = \%vars;
my $word  = 'cw';

foreach my $child ( $vars{cw}->children ) {
  next unless $child->is_dir;
  say "dyetya is $child";
  my $base_dir = $child->basename;
  say "base dir is $base_dir";
  $vars{$base_dir} = path($child);
  say "dir is $vars{$base_dir}";
}

my $sftp = get_тайный($rvars);
say "result is $sftp";
my $dir2 = $vars{"server_dir"};
say "dir2 is $dir2";
my $ls = $sftp->ls( "/$dir2", wanted => qr/$word/ )
  or warn "unable to retrieve " . $sftp->error;

#print "$_->{filename}\n" for (@$ls);
my @remote_files = map { $_->{filename} } @$ls;

#say "files are @remote_files";
my $rref = \@remote_files;

#say Dumper $rref;
say "ultimate disposition of main hash-------";
say Dumper $rvars;
__END__
$

One reason I'm hiding this under a readmore tag is that perltidy doesn't want to format it. Right now, this is what I use for all my perltidy commands:

$ cat .bash_aliases
alias pt='perltidy -i=2 -b '
$ pt 7.cw1.pl
## Please see file 7.cw1.pl.ERR
$ cat 7.cw1.pl.ERR
Perltidy version is 20180220
85: unexpected character decimal 209 (�) in script
85: unexpected character decimal 130 (�) in script
...
85: unexpected character decimal 185 (�) in script
85: Giving up after error
$

Q1) Can I alter my perltidy command so that these chars are not problematic?

My second issue arises when I try to loop over Cyrillic directories. The relevant source is:

foreach my $child ( $vars{cw}->children ) {
  next unless $child->is_dir;
  say "dyetya is $child";
  my $base_dir = $child->basename;
  say "base dir is $base_dir";
  $vars{$base_dir} = path($child);
  say "dir is $vars{$base_dir}";
}

The path is built for the ASCII directory and loaded into the main data structure as a blessed entity:

dyetya is /home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/eugene
base dir is eugene

but there's nothing from the Cyrillic paths. Bash shows them here with pre tags:

$ pwd
/home/bob/2.scripts/pages/7.cw/template_stuff/crosswords
$ ls
caption_filled.gif  eugene  захват  изображение  подписи
$ ls -l
total 24
-rw-r--r-- 1 bob bob 7882 Jan  7 22:58 caption_filled.gif
drwxr-xr-x 3 bob bob 4096 Jan 31 16:41 eugene
drwxr-xr-x 2 bob bob 4096 Jan  7 22:28 захват
drwxr-xr-x 2 bob bob 4096 Jan  7 22:45 изображение
drwxr-xr-x 2 bob bob 4096 Feb  1 13:04 подписи
$ 

Q2) How do I convince Path::Tiny to give me all these paths?

My third question regards what I'm trying to build a path to:

$ pwd
/home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/подписи
$ cat a.txt
л лесопарк
о и о о е 
комар р й 
л б трасса
ё  к   т б
пасс   ухо
  менеджер
фауна раки
  тоска  г
шкант м  е
    атаман
    в т т 
    н ухаб
    игроки
    к грач

$ 

I think this syntax from Path::Tiny could work:

@lines = $file->lines_utf8;

What I seek to do here is simply read it in, make sure that whitespace is removed, and have it be echoed out in its precise locations. What I have now addresses the encoding issue, but not the spacing:

sub print_aoa_utf8 {
  use warnings;
  use 5.011;
  use utf8;    # a la François
  use open OUT => ':encoding(utf8)';
  use open ':std';
  my $a   = shift;
  my @AoA = @$a;
  for my $i ( 0 .. $#AoA ) {
    my $aref = $AoA[$i];
    for my $j ( 0 .. $#{$aref} ) {
      print "elt $i $j is $AoA[$i][$j]\n";
    }
  }
  return $a;
}
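
For the spacing part, one possible approach is to split every decoded line into single characters, so that a blank keeps its own cell in the array of arrays. The following is only a sketch: the sample rows are copied from a.txt above, and lines_to_grid and describe_grid are hypothetical helper names, not part of any module. In the real script the rows would presumably come from something like $file->lines_utf8( { chomp => 1 } ), which hands back already-decoded strings.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal Cyrillic

# Hypothetical sample: three rows copied from a.txt above.
my @lines = ( 'л лесопарк', 'о и о о е ', 'комар р й ' );

# Split each decoded line into single characters, keeping the blanks.
sub lines_to_grid {
    my @rows = @_;
    return [ map { [ split //, $_ ] } @rows ];
}

# Describe every cell together with its row and column.
sub describe_grid {
    my ($grid) = @_;
    my @out;
    for my $i ( 0 .. $#$grid ) {
        for my $j ( 0 .. $#{ $grid->[$i] } ) {
            my $cell  = $grid->[$i][$j];
            my $shown = $cell eq ' ' ? '(blank)' : $cell;
            push @out, sprintf( 'row %d col %d is %s', $i, $j, $shown );
        }
    }
    return @out;
}

my $grid = lines_to_grid(@lines);
binmode STDOUT, ':encoding(UTF-8)';    # one encoding layer, added once
print "$_\n" for describe_grid($grid);
```

Because the split happens on characters rather than bytes, the column index of each letter matches its position in the puzzle, provided the lines were decoded before splitting.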

One final question. I wrote my newest version of my "get_tiny" sftp object creator as a homonym:

my $sftp = get_тайный($rvars);
because I thought it would be a little harder to crack if it were done in a non-ASCII way. (Of course, I may have dished myself to those who do.) Then I watched a NOVA episode that said that with the advent of quantum computing, public key encryption would be passé. I failed to understand how this is going to happen. Should we all just give up?

Thanks for your comment and cheers,

Replies are listed 'Best First'.
Re: help with cyrillic characters in odd places
by hippo (Archbishop) on Feb 06, 2019 at 11:21 UTC
    Q2) How do I convince Path::Tiny to give me all these paths?
    use strict;
    use warnings;
    use utf8;
    
    use Test::More tests => 2;
    use Path::Tiny;
    use Encode ('decode');
    
    my $dir = '/tmp/foo';
    mkdir $dir unless -d $dir;
    mkdir "$dir/eugene" or die $!;
    mkdir "$dir/захват" or die $!;
    
    my $pt = path ($dir);
    my @children = $pt->children;
    is (decode ('UTF-8', "$children[1]"), "$dir/eugene");
    is (decode ('UTF-8', "$children[0]"), "$dir/захват");
    
    $pt->remove_tree;
    

    Your environment may use some other collation but the above works just fine for me.

    (PS. <pre> required as usual for the UTF-8 strings.)

      Thx, hippo, this syntax gets us started

      my @children = $pt->children;

      but it seems to draw me away from Path::Tiny because @children appears to be a bunch of strings. (Progress is having them all there, so yay that.)

      children are /home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/caption_filled.gif /home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/за
      ва‚ /home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/изоб€ажение /home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/подписи /home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/eugene
      dyetya is /home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/caption_filled.gif
      dyetya is /home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/захват
      Can't locate object method "basename" via package "/home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/захват" (perhaps you forgot to load "/home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/захват"?) at ./7.cw2.pl line 83.
      $ 
      

      I suppose one could pull out File::Basename, but that's a module I'm trying to abrogate.

      my @children = $vars{cw}->children;
      say "children are @children";
      
      foreach my $child (@children) {
        $child = decode( 'UTF-8', $child );
        say "dyetya is $child";
        next unless -d $child;
      
        my $base_dir = $child->basename;
        say "base dir is $base_dir";
        $vars{$base_dir} = path($child);
        say "dir is $vars{$base_dir}";
      
      }
      Update:

      I believe that the Path::Tiny way to do this is using the iterator method:

      my $iter = $vars{cw}->iterator;
      while ( my $next = $iter->() ) {
        say "next is $next";
        my $next_decode = decode( 'UTF-8', $next );
        my $base        = $next->basename();
        say "base is $base";
        my $base_decode = decode( 'UTF-8', $base );
        say "next decode is $next_decode";
        say "base decode is $base_decode";
      }
      next is /home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/изоб€ажение
      base is изоб€ажение
      next decode is /home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/изображение
      base decode is изображение
      
        but it seems to draw me away from Path::Tiny because @children appears to be a bunch of strings.
        say "children are @children";

        They only appear to be a bunch of strings because you've stringified them in the say statement.

Re: help with cyrillic characters in odd places
by bliako (Abbot) on Feb 06, 2019 at 10:16 UTC

    FWIW, your directory lister (without loading your html7 etc. modules) works for me but prints gibberish. Removing :std fixes that (v5.26.2). Maybe use strict/warnings will give you additional hints? The rest I can't help sorry. (re: utf8 sub names, I did not know that!)

      Thanks all for replies. I'm truly humbled that people a half a world away give my problems the time of their day. I have been able to make progress: I threw in use strict and use warnings even though they should have been implied by the -w in the shebang and using 5.011. I have to admit that I don't know what :std does...get back to you on that. I did make a reply to you regarding usage in your recent meditation that I hope you can take a look at Re^5: n-dimensional statistical analysis of DNA sequences (or text, or ...).

Re: help with cyrillic characters in odd places
by Anonymous Monk on Feb 06, 2019 at 13:29 UTC
    Q1) Can I alter my perltidy command so that these chars are not problematic?
    This doesn't seem to be mentioned in perldoc perltidy, but see what I found in Perl-Tidy-20181120/docs/perltidy.html:

    -enc=s, --character-encoding=s

    where s=none or utf8. This flag tells perltidy the character encoding of both the input and output character streams. The value utf8 causes the stream to be read and written as UTF-8. The value none causes the stream to be processed without special encoding assumptions. At present there is no automatic detection of character encoding (even if there is a 'use utf8' statement in your code) so this flag must be set for streams encoded in UTF-8. Incorrectly setting this parameter can cause data corruption, so please carefully check the output.

    The default is none.

    The abbreviations -utf8 or -UTF8 are equivalent to -enc=utf8. So to process a file named file.pl which is encoded in UTF-8 you can use:

    perltidy -utf8 file.pl

    Regarding your Cyrillic path issues, Path::Tiny seems to work with file names as byte-strings, not character-strings. Allow me to demonstrate:

    $ ls -l
    итого 0
    -rw-r--r-- 1 user user 0 фев  6 16:20 привет
    
    $ perl -MData::Dump=dd -MPath::Tiny=path -E'dd path(".")->children'
    bless([
      pack("H*","d0bfd180d0b8d0b2d0b5d182"),
      pack("H*","d0bfd180d0b8d0b2d0b5d182"),
    ], "Path::Tiny")
    $ perl -MData::Dump=dd -E'dd "привет"'
    pack("H*","d0bfd180d0b8d0b2d0b5d182")
    
    Data::Dump::dd output is the same for $path->children and a simple string consisting of UTF-8-encoded bytes. Perl wide characters are different when dumped using Data::Dump::dd or Data::Dumper::Dumper:
    $ perl -MData::Dump=dd -Mutf8 -E'dd "привет"'
    "\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}"
    You seem to be using an IOLayer to encode all characters being printed to STDOUT from wide characters to UTF-8. Since for Perl code, wide strings and byte strings are mostly the same data type, except that wide strings can have ord values > 255, encode doesn't really know whether it is encoding actual wide characters into UTF-8 or wrongly encoding UTF-8 bytes as if they were Unicode code points. So when you are trying to print file names, they undergo an unnecessary conversion and get garbled in the process. Your options include decode'ing them back into wide characters before printing (beware: filenames can contain invalid UTF-8 and arbitrary bytes!) or disabling the IOLayer that encodes the strings.
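
The unnecessary conversion can be made visible with nothing but the core Encode module. This is a small sketch, not taken from the original script: it starts from the same twelve bytes that dd printed above and compares what an encoding layer would produce for the decoded characters and for the raw bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The file name as the file system hands it over: "привет" as UTF-8 bytes.
my $bytes = pack 'H*', 'd0bfd180d0b8d0b2d0b5d182';

# Decoding turns the bytes into Perl characters.
my $chars = decode( 'UTF-8', $bytes );

# What an :encoding(UTF-8) output layer effectively does to each string.
my $good = encode( 'UTF-8', $chars );    # characters back to the original bytes
my $bad  = encode( 'UTF-8', $bytes );    # each byte treated as a code point

printf "raw: %d bytes, decoded: %d characters\n", length $bytes, length $chars;
printf "encoded from characters: %d bytes\n", length $good;
printf "encoded from raw bytes:  %d bytes\n", length $bad;
print $good eq $bytes ? "round trip ok\n" : "round trip broken\n";
```

The string encoded from raw bytes is longer than the original, which is the garbling seen in the terminal: every byte above 127 has been expanded into two bytes.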

    with the advent of quantum computing, public key encryption would be passé
    See Post-quantum_cryptography. This may be true for "classical" asymmetric cryptography which may be easy to break after we get powerful enough quantum computers (we don't, yet), but new approaches are already being developed that wouldn't rely on integer factorization / discrete logarithm / elliptic-curve discrete logarithm problems to be secure and also wouldn't be vulnerable to quantum computers.
      perltidy -utf8 file.pl

      Thanks again for your generous comments, Anonymous Monk. I changed my perltidy command in .bash_aliases :

      $ cat .bash_aliases
      alias pt='perltidy -i=2 -utf8 -b '

      I was able to replicate your results and see for myself. I thought I was gonna beat it by using Path::Tiny methods, but I seem only to have dug myself in deeper:

      I'm completely fanning on getting ->is_dir to work for me, encoded or decoded....
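
One way to keep the directory test and the decoding from interfering with each other is to run file tests on the raw bytes and decode only what ends up as a hash key or on the screen. The sketch below deliberately uses core opendir/readdir instead of Path::Tiny so that it runs anywhere; subdirs_of is a hypothetical helper, and it assumes the file names on disk are UTF-8. The same split (bytes for the file system, characters for display) should apply to Path::Tiny objects as well.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Spec;

# Return a hash that maps decoded directory names to their raw byte paths.
# Assumption: names on disk are UTF-8; anything else decodes with U+FFFD.
sub subdirs_of {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open $dir: $!";
    my %found;
    for my $name ( readdir $dh ) {
        next if $name eq '.' or $name eq '..';
        my $raw = File::Spec->catdir( $dir, $name );    # still bytes
        next unless -d $raw;                            # test the bytes
        $found{ decode( 'UTF-8', $name ) } = $raw;      # decode the key only
    }
    closedir $dh;
    return %found;
}

my %dirs = subdirs_of('.');
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for sort keys %dirs;
```

The values of the hash stay as bytes, so they can be handed straight back to open, opendir or path() without another round of encoding.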

        A good strategy for you would be either to start splitting your code into smaller and smaller parts until some of them start working - and seeing which change made a difference - or combining small self-contained reproducible examples we provide back into a whole that resembles your current code - and seeing when it stops working.

        Right now, your encoding handling is doing the wrong thing. Let's start with a small file and get it to output UTF-8 from Perl wide characters:

        use warnings;
        binmode STDOUT, ":utf8";
        print "\x{44b}\n";

        No warnings, the string literal is definitely wide, and the output is evidently UTF-8. This was achieved by adding a perl IO layer to STDOUT that encodes wide characters to UTF-8 bytes. We can verify that:

        use Data::Dumper;
        binmode STDOUT, ":utf8";
        use PerlIO;
        print Dumper [ PerlIO::get_layers \*STDOUT, output => 1 ];
        __END__
        $VAR1 = [ 'unix', 'perlio', 'utf8' ];

        Your code,

        use open qw/:std :utf8/;
        use open OUT => ':encoding(UTF-8)', ':std';
        adds the UTF-8 encoding layer multiple times:
        $VAR1 = [ 'unix', 'perlio', 'utf8', 'encoding(utf-8-strict)', 'utf8' ];

        That would be one of the reasons why you are getting Mojibake instead of Cyrillic characters. It may be helpful to use more simple and explicit code for now, until you understand better the machinery that makes it all tick. Start with binmode STDOUT, ":utf8" and get your code to output correctly-encoded UTF-8 to STDOUT (after reading your code, I think you are almost there: everywhere you get UTF-8 bytes, you decode them correctly before printing). Once that works, start adding pragmas like open that save you typing.
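
A guard against stacking the layer twice could look like the sketch below. It relies only on the built-in PerlIO::get_layers shown above; ensure_utf8_output is a hypothetical helper name, and the pattern it matches is based on the layer names that appear in the two dumps in this post.

```perl
use strict;
use warnings;

# Add a UTF-8 encoding layer to a handle only if it does not have one yet.
# Returns the list of output layers after the call.
sub ensure_utf8_output {
    my ($fh) = @_;
    my @layers = PerlIO::get_layers( $fh, output => 1 );
    binmode $fh, ':encoding(UTF-8)'
      unless grep { /^(?:utf8|encoding\(utf-?8)/i } @layers;
    return PerlIO::get_layers( $fh, output => 1 );
}

ensure_utf8_output( \*STDOUT );
ensure_utf8_output( \*STDOUT );    # second call leaves the layers alone
print join( ' ', PerlIO::get_layers( \*STDOUT, output => 1 ) ), "\n";
print "\x{44b}\n";
```

Calling it from a single place at the top of the script, instead of combining several open pragmas, makes it easier to see which layer is responsible when output comes out garbled.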

        I am not sure why your code would (appear to) entirely skip non-ASCII files and directories, but perhaps we could shed some light on it once we get the Unicode display problem resolved.