in reply to Re: help with cyrillic characters in odd places
in thread help with cyrillic characters in odd places

perltidy -utf8 file.pl

Thanks again for your generous comments, Anonymous Monk. I changed my perltidy command in .bash_aliases :

$ cat .bash_aliases
alias pt='perltidy -i=2 -utf8 -b '

I was able to replicate your results and see for myself. I thought I was gonna beat it by using Path::Tiny methods, but I seem only to have dug myself in deeper:

$ pwd
/home/bob/2.scripts/pages/7.cw/template_stuff/crosswords
$ ls -R
.:
1.txt  caption_filled.gif  eugene  захват  изображение  подписи

...
./изображение:
z.1.cw.jpg

./подписи:
a.txt
$ 

With pre tags, one can see that the Cyrillic directories have something in them. I just can't get to them. This is the output with code tags:

$ ./21.clone.pl 7.cw 2.scratch
...
making directories
abs to template is /home/bob/2.scripts/pages/2.scratch/template_stuff
string abs from is /home/bob/2.scripts/pages/7.cw/template_stuff
-------------
copying files
child is /home/bob/2.scripts/pages/7.cw/template_stuff/дирекÑ‚Ð¾Ñ€Ð¸Ñ
copy path is /home/bob/2.scripts/pages/3.scratch/template_stuff/&#1076;иректория
/home/bob/2.scripts/pages/7.cw/template_stuff/директоÑ€Ð¸Ñ is neither file nor directory
...
copy path is /home/bob/2.scripts/pages/2.scratch/template_stuff/ruscaptions
We are now in is_dir
create directory return is 1
this ^^^ should be a return for mkpath
...
cross path is /home/bob/2.scripts/pages/7.cw/template_stuff/crosswords
----------trying visit method
$VAR1 = {
  '/home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/eugene/hs_ref_GRCh38.p12_chr20.fa' => 65455484,
  ...
  "/home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/\x{d0}\x{b8}\x{d0}\x{b7}\x{d0}\x{be}\x{d0}\x{b1}\x{d1}\x{80}\x{d0}\x{b0}\x{d0}\x{b6}\x{d0}\x{b5}\x{d0}\x{bd}\x{d0}\x{b8}\x{d0}\x{b5}" => undef,
  ...
  "/home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/\x{d0}\x{b7}\x{d0}\x{b0}\x{d1}\x{85}\x{d0}\x{b2}\x{d0}\x{b0}\x{d1}\x{82}" => undef,
  ...
  "/home/bob/2.scripts/pages/7.cw/template_stuff/crosswords/\x{d0}\x{bf}\x{d0}\x{be}\x{d0}\x{b4}\x{d0}\x{bf}\x{d0}\x{b8}\x{d1}\x{81}\x{d0}\x{b8}" => undef,
  ...
-----------------
$

In this last part, with the visit method, there are no entries for the files inside the Russian directories. Source:

$ cat 21.clone.pl
#!/usr/bin/perl -w
use 5.011;
use utf8;
use open qw/:std :utf8/;
use Path::Tiny;
use Encode;
use open OUT => ':encoding(UTF-8)', ':std';

# This script clones the template directory in $1 to $2.
# Some names need munging.
# $from is a populated child directory; $to is child dir to be created.

######
## enabling cyrillic
## decode argv and current

###### revision for crosswords...losing $pop in argv....

say "argv is @ARGV";
foreach (@ARGV) {
  say "before decode is $_";
  $_ = decode( 'UTF-8', $_ );
  say "after decode is $_";
}
my ( $from, $to) = @ARGV;
my $current = Path::Tiny->cwd;
$current = decode( 'UTF-8', $current );
say "current is $current";
say "-------------";
say "making directories";

# define the paths within the target directory:
my $ts = "template_stuff";
my $abs_to = path( $current, $to, $ts );
$abs_to->mkpath;
say "abs to template is $abs_to";

# $from template directory:
my $abs_from = path( $current, $from, $ts );
say "string abs from is $abs_from";
say "-------------";
say "copying files";

#### using iterator method to copy template stuff
my $iter = $abs_from->iterator();
while ( my $child = $iter->() ) {
  say "child is $child";
  my $base = $child->basename;
  $base = decode( 'UTF-8', $base );    ### added to handle cyrillic
  # adding the following line in hopes of shoestring tackle:
  #$child = decode( 'UTF-8', $child );
  my $copy_path = path( $abs_to, $base );
  say "copy path is $copy_path";

  if ( $child->is_dir ) {
    say "We are now in is_dir";
    my $return7 = $copy_path->mkpath;
    say "create directory return is $return7";
    say "this ^^^ should be a return for mkpath";
  }
  elsif ( $child->is_file ) {
    my $return8 = path($child)->copy($copy_path);
    say "copy file return is $return8";
  }
  else {
    say "$child is neither file nor directory";
  }
}

#### exploring more of Path::Tiny
my $cross_path = path($abs_from, "crosswords");
say "cross path is $cross_path";
say "----------trying visit method";
my $sizes = $cross_path->visit(
  sub {
    my ( $path7, $state ) = @_;
    return if $path7->is_dir;
    $state->{$path7} = -s $path7;
  },
  { recurse => 1 }
);

use Data::Dumper;
print Dumper $sizes;
#say "sizes is $sizes";
say "-----------------";
$

I'm completely fanning on getting ->is_dir to work for me, encoded or decoded....

Re^3: help with cyrillic characters in odd places
by Anonymous Monk on Feb 16, 2019 at 17:37 UTC

    A good strategy for you would be either to start splitting your code into smaller and smaller parts until some of them start working - and seeing which change made a difference - or combining small self-contained reproducible examples we provide back into a whole that resembles your current code - and seeing when it stops working.

    Right now, your encoding handling is doing the wrong thing. Let's start with a small file and get it to output UTF-8 from Perl wide characters:

    use warnings;
    binmode STDOUT, ":utf8";
    print "\x{44b}\n";

    No warnings, the string literal is definitely wide, and the output is evidently UTF-8. This was achieved by adding a perl IO layer to STDOUT that encodes wide characters to UTF-8 bytes. We can verify that:

    use Data::Dumper;
    binmode STDOUT, ":utf8";
    use PerlIO;
    print Dumper [ PerlIO::get_layers \*STDOUT, output => 1 ];
    __END__
    $VAR1 = [ 'unix', 'perlio', 'utf8' ];

    Your code,

    use open qw/:std :utf8/;
    use open OUT => ':encoding(UTF-8)', ':std';

    adds the UTF-8 encoding layer multiple times:

    $VAR1 = [ 'unix', 'perlio', 'utf8', 'encoding(utf-8-strict)', 'utf8' ];

    That would be one of the reasons why you are getting Mojibake instead of Cyrillic characters. It may be helpful to use simpler and more explicit code for now, until you better understand the machinery that makes it all tick. Start with binmode STDOUT, ":utf8" and get your code to output correctly-encoded UTF-8 to STDOUT (after reading your code, I think you are almost there: everywhere you get UTF-8 bytes, you decode them correctly before printing). Once that works, start adding pragmas like open that save you typing.
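
    To make "encoded twice" concrete, here is a minimal sketch using only the core Encode module. It shows what a doubled encoding step does to a string in general; it does not prove that the stacked layers are the culprit in your script. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One wide character: CYRILLIC SMALL LETTER YERU, the same \x{44b} as above.
my $chars = "\x{44b}";

# Encoding once gives the two UTF-8 bytes D1 8B.
my $once = encode( 'UTF-8', $chars );

# Encoding those bytes again treats each byte as a Latin-1 character,
# so the result is four bytes that display as Mojibake.
my $twice = encode( 'UTF-8', $once );

printf "chars: %d, once: %d bytes, twice: %d bytes\n",
  length($chars), length($once), length($twice);

# A single decode of the doubled string only gets back to the bytes,
# not to the original character.
my $half = decode( 'UTF-8', $twice );
print "one decode recovers the text? ",
  ( $half eq $chars ? "yes" : "no" ), "\n";
```

    If the byte count of a string doubles somewhere between input and output, that is the place where an extra encode (or a missing decode) is hiding.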

    I am not sure why your code would (appear to) entirely skip non-ASCII files and directories, but perhaps we could shed some light on it once we get the Unicode display problem resolved.
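
    One hypothesis worth testing (a guess, not something I can confirm from the output alone): the parent path has been decoded to characters, while directory entries come back from the filesystem as bytes, and joining the two re-encodes the entry name when the path is handed to the OS. The sketch below uses only core functions rather than Path::Tiny and assumes a Unix-like filesystem that stores names as UTF-8 bytes; the directory names and variable names are made up for the test.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# Create <tmp>/шаблон/захват using raw UTF-8 bytes for the names.
my $root     = tempdir( CLEANUP => 1 );
my $parent_b = "$root/" . encode( 'UTF-8', "\x{448}\x{430}\x{431}\x{43b}\x{43e}\x{43d}" );
my $child_b  = encode( 'UTF-8', "\x{437}\x{430}\x{445}\x{432}\x{430}\x{442}" );
mkdir($parent_b)            or die "mkdir parent: $!";
mkdir("$parent_b/$child_b") or die "mkdir child: $!";

# readdir hands back byte strings.
opendir( my $dh, $parent_b ) or die "opendir: $!";
my ($entry_b) = grep { !/^\.\.?$/ } readdir $dh;
closedir $dh;

# The same parent, decoded to characters (as done with cwd and @ARGV).
my $parent_c = decode( 'UTF-8', $parent_b );

my $bytes_path = "$parent_b/$entry_b";    # bytes joined with bytes
my $mixed_path = "$parent_c/$entry_b";    # characters joined with bytes
my $chars_path = "$parent_c/" . decode( 'UTF-8', $entry_b );

my %is_dir = (
    bytes_only => ( -d $bytes_path ? 1 : 0 ),
    mixed      => ( -d $mixed_path ? 1 : 0 ),
    chars_only => ( -d $chars_path ? 1 : 0 ),
);
print "$_: $is_dir{$_}\n" for sort keys %is_dir;
```

    If "mixed" is the only one that fails on your machine, then keeping paths as bytes for every filesystem call and decoding copies only for display would be the thing to try next.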

      Hmm, no, that was wrong, use open qw/:std :utf8/; use open OUT => ':encoding(UTF-8)', ':std'; alone doesn't cause Mojibake on my system (instead, I get proper UTF-8 output). Something else is going on.
        Something else is going on.

        I've just tried to avoid that something else as I kept cutting out parts that weren't working, especially with caller, and moved as much of the html stuff out of it as I could.

        I went all the way to hard-coding the paths. There are only two right now.

        # crossword params
        my %vars = (
        
          cw          => path( $path2, 'crosswords' ),
          изображение => path( $path2, 'crosswords', 'изображение', "1.атаман.jpg" ),
          подписи     => path( $path2, 'crosswords', 'подписи', "1.кгосс.txt" ),
        
        );
        
        if ( $vars{изображение}->is_file ) {
          say "We have a path to the image";
        }
        
        my $rvars           = \%vars;
        my $ref_html_values = init_values( $title, $path2, $abs );
        my %html_vars       = %$ref_html_values;
        
        # append returned hash from init
        @vars{ keys %html_vars } = values %html_vars;

        If that last line is how I append what init_values() returns, am I combining the hashes in a robust way?
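
        For reference, a minimal sketch of the same hash-slice merge with a check for colliding keys. The sample keys and values are made up for illustration; only the slice assignment itself comes from the code above. The slice silently overwrites any key that already exists, so listing the clashes first shows whether that matters.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for %vars and the hash returned by init_values().
my %vars      = ( cw    => 'crosswords', title => 'draft' );
my %html_vars = ( title => 'final',      abs   => '/tmp/site' );

# Keys present in both hashes would be overwritten by the merge.
my @clashes = sort grep { exists $vars{$_} } keys %html_vars;
print "overwriting keys: @clashes\n" if @clashes;

# keys and values walk an unmodified hash in the same order,
# so the slice pairs each key with its own value.
@vars{ keys %html_vars } = values %html_vars;

print "$_ => $vars{$_}\n" for sort keys %vars;
```

        An equivalent spelling is %vars = ( %vars, %html_vars ); either way the right-hand hash wins on duplicate keys.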

        I'm able to read the captions in, and the output makes sense when viewed a few different ways.

        This is with the following sub:

        sub make_russian_crossword {
          use 5.011;
          use warnings;
          use POSIX qw(strftime);
          use Path::Tiny;
          use Encode;
          use open OUT => ':encoding(UTF-8)', ':std';
          use Data::Dumper;
          use utf8;

          my $rvars = shift;
          my %vars  = %$rvars;

          my $munge   = strftime( "p%d-%m-%Y-%H-%M-%S\.txt", localtime );
          my $in_path = path( $vars{rus_captions}, $munge )->touchpath;
          say "in make russian xword------";
          ##Let mother know that you created a file, *verb* a reference:
          $vars{log_file} = $in_path;

          ## input use Path::Tiny methods
          my @lines = $vars{подписи}->lines_utf8;
          say "lines are @lines";
          my $ref_lines = \@lines;
          $vars{data} = $ref_lines;    # the posted code had "vars{data}", missing the $ sigil
          say "with Dumper------";
          say Dumper $ref_lines;
          say "with Dump------";
          use Data::Dump;
          dd $ref_lines;

          #print_aoa_utf8($ref_lines);
          say "using slurp method--------";
          my $guts = $vars{подписи}->slurp_utf8;
          say "guts are $guts";
          my $width = 10;
          ## trim off white space beyond ten for every line
          # trim leading or trailing whitespace
          return $rvars;
        }

        Where I'm at is looking for ways to get these data rectangular. In particular, I want to trim the lines to length ten and trim leading and trailing whitespace.

        It seems like something that perl would do elegantly.
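
        A minimal sketch of one way to do that, assuming the lines are already decoded character strings (as lines_utf8 returns them). The sub name rectangularize and the sample lines are hypothetical, and padding short lines out to the full width is my reading of "rectangular", not something settled above.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Trim each line, cut it to $width characters, and pad short lines
# with spaces so every row has exactly the same length.
sub rectangularize {
    my ( $width, @lines ) = @_;
    return map {
        my $line = $_;
        $line =~ s/^\s+|\s+$//g;    # leading/trailing whitespace, newline included
        sprintf '%-*s', $width, substr( $line, 0, $width );
    } @lines;
}

my @rows = rectangularize( 10, "  кот \n", "длинноеслово12345\n", "" );
print "[$_]\n" for @rows;
```

        Because the strings are characters rather than bytes, substr and sprintf count Cyrillic letters one apiece; run on undecoded UTF-8 bytes, the same code would cut each line at ten bytes instead.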