in reply to create clone script for utf8 encoding

G'day Aldebaran,

"All of my scripts have a taxonomy of a positive integer followed by a period, followed by a word."

I'd probably aim for something like this in Perl:

$ perl -E 'my $x = "2.X"; my ($y, $z) = ($x =~ /^(\d+)(.*)$/); say $x; say ++$y . $z'
2.X
3.X

This allows leading zeros in your filenames to be retained when incremented:

$ perl -E 'my $x = "02.X"; my ($y, $z) = ($x =~ /^(\d+)(.*)$/); say $x; say ++$y . $z'
02.X
03.X
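
For anyone wondering why the zero survives: as far as I understand perlop, `++` on a value that has only ever been used as a string, and that matches /^[a-zA-Z]*[0-9]*\z/, is incremented as a string rather than as a number. Here is a minimal sketch of the same idea wrapped in a sub; next_name() is a hypothetical helper name, not something from the original script.

```perl
use strict;
use warnings;
use feature 'say';

# next_name() is a hypothetical helper, for illustration only.
sub next_name {
    my ($name) = @_;

    # Capture the leading digits and everything after them.
    my ($num, $rest) = $name =~ /^(\d+)(.*)$/
        or return;    # no leading digits: nothing to increment

    ++$num;           # string increment keeps the width: "02" -> "03"
    return $num . $rest;
}

say next_name('02.X');    # 03.X
say next_name('09.X');    # 10.X
say next_name('2.X');     # 3.X

# Numeric addition, by contrast, drops the padding:
my ($n) = '02.X' =~ /^(\d+)/;
say $n + 1;               # 3
```

Something like `$number + 1` forces numeric context, so a zero-padded name only survives when the increment is done with `++` on the captured string.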

Which you may want to consider for a future filename convention to facilitate sorting. Compare these:

$ perl -E 'say for sort qw{1.X 9.X}'
1.X
9.X
$ perl -E 'say for sort qw{1.X 9.X 10.X}'
1.X
10.X
9.X
$ perl -E 'say for sort qw{01.X 09.X 10.X}'
01.X
09.X
10.X

"I'd like to write a perl equivalent that would give me freedom to choose the underlying encoding."

There are various ways to do this. You don't say how the encoding is chosen, so the following is just general information. See the open pragma and the open function; both contain links to additional information. See also any of the perluni* links in http://perldoc.perl.org/perl.html.
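
As one possible illustration (a sketch only, since I don't know how you intend to pick the encoding): with the three-argument form of open, the `:encoding(...)` layer is part of the mode string, so it can be built from a variable at run time. The sub name copy_with_encoding() and its four arguments are hypothetical.

```perl
use strict;
use warnings;

# copy_with_encoding() and its arguments are hypothetical, for illustration only.
sub copy_with_encoding {
    my ($src, $dst, $in_enc, $out_enc) = @_;

    # The layer is part of the MODE string, so it can be chosen at run time.
    open my $in,  "<:encoding($in_enc)",  $src or die "Can't read '$src': $!";
    open my $out, ">:encoding($out_enc)", $dst or die "Can't write '$dst': $!";

    # Lines are decoded on input and re-encoded on output.
    print {$out} $_ while <$in>;

    close $in;
    close $out or die "Can't close '$dst': $!";
    return;
}
```

A call such as `copy_with_encoding('in.txt', 'out.txt', 'UTF-8', 'ISO-8859-1')` would then read UTF-8 and write Latin-1. The open pragma, by contrast, sets default layers for a lexical scope at compile time, which suits a fixed encoding better than one chosen per file.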

"Does ascii have a representation for Ü?"

I see that AM has answered this. In Unicode, that character is:

U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS
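
If you'd rather check this sort of thing from Perl itself, a small sketch along these lines should do it. ASCII only covers U+0000 to U+007F, so U+00DC falls outside it; the variable names here are just for illustration.

```perl
use strict;
use warnings;
use utf8;                 # this source file contains a literal Ü
use charnames ();         # for charnames::viacode()
use Encode qw(encode);

my $char = 'Ü';

# Code point and official Unicode name.
my $code_point = sprintf 'U+%04X', ord $char;
my $name       = charnames::viacode(ord $char);

# The bytes used to store the character in two common encodings.
my $utf8_hex   = join ' ', map { sprintf('%02X', ord $_) } split //, encode('UTF-8', $char);
my $latin1_hex = join ' ', map { sprintf('%02X', ord $_) } split //, encode('ISO-8859-1', $char);

# \p{ASCII} matches only code points 0 to 127.
my $is_ascii = $char =~ /\A\p{ASCII}\z/ ? 'yes' : 'no';

print "$code_point $name\n";
print "UTF-8 bytes:      $utf8_hex\n";
print "ISO-8859-1 bytes: $latin1_hex\n";
print "In ASCII range?   $is_ascii\n";
```

The same character is one byte (DC) in ISO-8859-1 and two bytes (C3 9C) in UTF-8, which is why the encoding a file is written in matters.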

You might want to look at the Unicode Code Charts. That particular character is in (PDF link) "C1 Controls and Latin-1 Supplement". You may find it useful to familiarise yourself with the blocks of characters you deal with most often, perhaps even download a copy for quick reference; alternatively, if this is something you'll only want occasionally, individual internet searches will find the same information (searching for just "Unicode Ü" worked for me).

"I'd show previous attempts, but they look awful."

Showing what you've tried, even unsuccessful attempts, will often help us to better help you. We can see where your thought process (via code logic) was heading and perhaps steer you on a better course. A few minor tweaks may turn "awful" into "awesome". It can also help future readers who might be trying similar things.

— Ken

Re^2: create clone script for utf8 encoding
by Aldebaran (Curate) on Dec 19, 2018 at 05:26 UTC
    Showing what you've tried, even unsuccessful attempts, will often help us to better help you. We can see where your thought process (via code logic) was heading and perhaps steer you on a better course. A few minor tweaks may turn "awful" into "awesome". It can also help future readers who might be trying similar things.

    If you really want to see some broken code, I've got one more problem with regex for highest number; it's actually this related regex that is failing for me as it rounds the bend into double digit territory.

    I have a page like a garden-variety webpage, where it's clearly not a finished product, but it won't give me a number beyond ten, using this:

    sub highest_number{
        use strict;
        use File::Basename;
        use Cwd;
        my ($aref, $filetype, $word) = @_;
        my $number;
        my @matching;
        my $ext = ".".$filetype;
        push (@matching, 0); #min returned value
        for my $file (@{$aref}) {
            #print "file is $file\n";
            if ($file =~ /^$word(\d*)$ext$/){
                #print "matching is $file\n";
                push (@matching, $1);
            }
        }
        @matching = sort @matching;
        my $winner = pop @matching;
        return $winner
    }

    Did I not promise awful? I don't even got how this code could run without a semi-colon after $winner .
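
For later readers, the likely reason this stalls before double digits: `sort` with no comparison block compares as strings, so "9" sorts after "10" and `pop` returns 9. A minimal sketch of one possible fix follows, using max() from List::Util instead of sorting (a numeric `sort { $a <=> $b }` would work too). The sample file names are made up, and the pattern is tightened in ways the original did not do: \Q...\E quoting, and `\d+` rather than `\d*`.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Sketch of a corrected version; same arguments as the original sub.
sub highest_number {
    my ($aref, $filetype, $word) = @_;

    # \Q...\E stops the '.' before the extension (and anything in $word)
    # from being treated as regex metacharacters.
    my @numbers = map { /\A\Q$word\E(\d+)\.\Q$filetype\E\z/ ? $1 : () } @{$aref};

    # max() compares numerically; 0 is the minimum returned value, as before.
    return max(0, @numbers);
}

# Hypothetical file names, for illustration only.
my @files = qw(page1.html page2.html page9.html page10.html page11.html notes.txt);
print highest_number(\@files, 'html', 'page'), "\n";    # 11
```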

    But I have made progress on the central task, and thank you for your response. Until I can find a meaningful name and place for it, I have called it 1.a.pl:

    $ history | tail - 10
    ==> standard input <==
     1987  pt 1.a.pl
    ...
     1990  ./1.a.pl 1.k.pl
     1991  rm 2.k.pl
     1992  ./1.a.pl 1.k.pl
     1993  file -i *.pl
     1994  cat 1.manifest
    ...
     1996  history | tail - 10
    tail: cannot open '10' for reading: No such file or directory
    $

    This is getting there:

    $ cat 1.a.pl
    #!/usr/bin/perl -w
    use 5.011;
    use Path::Tiny;
    use Encode;
    use utf8;    # a la François
    use open OUT => ':encoding(utf8)';
    use open ':std';

    # This script increments and clones the file in $1.

    ## enabling cyrillic
    ## decode argv and current
    say "argv is @ARGV";
    foreach (@ARGV) {
        say "before decode is $_";
        $_ = decode( 'UTF-8', $_ );
        say "after decode is $_";
    }
    my (@in_files) = @ARGV;
    my $current = Path::Tiny->cwd;
    $current = decode( 'UTF-8', $current );
    say "current is $current";
    say "-------------";
    say "in_file: @in_files";

    for (@in_files) {
        my $tiny_in = path($_);
        ## use Path::Tiny
        my $file_contents = $tiny_in->slurp_utf8;
        $_ =~ m/^(\d+)(.*)$/;
        my $number    = $1;
        my $rest      = $2;
        my $increment = $number + 1;
        my $new_base  = $increment . $rest;
        say "new base is $new_base";

        ## use Path::Tiny to create new file
        my $save_file = path( $current, $new_base )->touchpath;
        say "save path is $save_file";
        my $return = $tiny_in->copy($save_file);
        $return->chmod(0755);
        say "return is $return";

        ## write to local manifest
        my $manifest_name = "1.manifest";
        path($manifest_name)->append_utf8( $new_base . "\n" );
        system "cat $manifest_name";
        system "cat $save_file";
    }
    $

    It looks like I got the spacing on the manifest file squared away:

    $ cat 1.manifest
    2.haukex.pl
    3.haukex.pl
    4.haukex.pl4.ping3a.pl5.haukex.pl5.ping3a.pl2.k.pl/n2.k.pl
    3.k.pl
    $

      "I don't even got how this code could run without a semi-colon after $winner ."

      It's the last line of "sub highest_number { ... }". Technically, that is terminated by the final brace: while this is valid code, I wouldn't recommend it (certainly not for production code). The problem is that if more code is added you'll have a statement that is neither terminated by a brace nor a semicolon.

      The same applies to arrays (and hashes) and commas. Consider this contrived example:

      my @numbers = (
          'one',
          'three',
          'two'
      );

      Oops! I'll just reorder those (a simple ddp sequence in vi):

      my @numbers = (
          'one',
          'two'
          'three',
      );

      Now you've got an even bigger "Oops!".

      Having said that, the biggest, single improvement you could make to this code would be the use of consistent indentation. It took me some time, tracking back and forth between opening and closing braces, to see where the related blocks of code were. This type of code is highly error-prone. Perhaps look at perlstyle; and perltidy may prove useful.

      — Ken

        > The problem is that if more code is added you'll have a statement that is neither terminated by a brace nor a semicolon

        In fact, in my own code, I never put a semicolon after return: Adding code after

        return $winner;
        makes no sense, as that code wouldn't be reachable. So I'm glad I'll get some kind of a syntax error.

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

        It also makes nicer diffs when adding lines to the list. You see a line being added instead of a line being changed on top of a line being added.