
a file's encoding is not some kind of metadata attribute secretly attached to a file. A file just contains bytes, and it is up to the reading and writing programs to interpret that sequence of bytes to and from the more abstract concept of "characters" (I am using that term loosely here) on reading and writing.
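
To make the quoted point concrete, here is a minimal sketch (the byte values are my own choice for illustration): the same two bytes turn into different "characters" depending only on which encoding the reading program assumes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes 0xD0 0xB9 are how UTF-8 stores the Cyrillic letter U+0439.
my $bytes = "\xD0\xB9";

# Read as UTF-8, the two bytes become a single character.
my $as_utf8 = decode('UTF-8', $bytes);

# Read as ISO-8859-1, each byte becomes its own character.
my $as_latin1 = decode('ISO-8859-1', $bytes);

printf "bytes: %d, UTF-8 chars: %d, Latin-1 chars: %d\n",
    length($bytes), length($as_utf8), length($as_latin1);
printf "UTF-8 reading gives U+%04X\n", ord($as_utf8);
```

Nothing in the bytes themselves says which reading is the intended one; that knowledge has to come from the program or from a guess.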

Am I correct that what my OS is telling me is its best guess as to how to interpret this file and have it make any sense?

$ file -i *.pl
18.clone.pl: text/x-perl; charset=us-ascii
1.a.pl:      text/x-perl; charset=utf-8
1.haukex.pl: text/x-perl; charset=us-ascii
1.k.pl:      text/x-perl; charset=us-ascii
2.haukex.pl: text/x-perl; charset=us-ascii
3.haukex.pl: text/x-perl; charset=utf-8
3.ping3a.pl: text/x-perl; charset=us-ascii
4.haukex.pl: text/x-perl; charset=utf-8
4.ping3a.pl: text/x-perl; charset=us-ascii
5.ping3a.pl: text/x-perl; charset=us-ascii
$
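
As a rough illustration of the kind of guess involved, the sketch below classifies raw bytes the way the listing above appears to: content with only 7-bit bytes is called us-ascii, and content with high bytes that decode cleanly as UTF-8 is called utf-8. `guess_charset` is a hypothetical helper written for this post, not what `file` actually runs; the real tool checks more encodings than this.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: a simplified content-based guess, not file(1)'s algorithm.
sub guess_charset {
    my ($bytes) = @_;

    # No byte above 0x7F: nothing distinguishes the content from plain ASCII.
    return 'us-ascii' unless $bytes =~ /[^\x00-\x7F]/;

    # Strict decode dies on malformed input; work on a copy because
    # decode() may modify its argument when a CHECK value is given.
    my $copy = $bytes;
    return 'utf-8' if eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };

    return 'unknown';
}

# Classify every file named on the command line, reading raw bytes.
for my $path (@ARGV) {
    open my $fh, '<:raw', $path or die "$path: $!";
    my $bytes = do { local $/; <$fh> } // '';
    close $fh;
    printf "%s: %s\n", $path, guess_charset($bytes);
}
```

Under this sketch, a script flips from us-ascii to utf-8 the moment its first non-ASCII character is saved as UTF-8, with no other change to the file.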

What seems to be very much the case is that the OS reports a file as utf-8 only if it contains non-ASCII characters encoded as UTF-8. I did nothing to the #.haukex scripts to change them from us-ascii to utf-8 other than begin to include Cyrillic characters, like so, with pre tags:

$ ./1.a.pl 3.haukex.pl
argv is 3.haukex.pl
before decode is 3.haukex.pl
after decode is 3.haukex.pl
current is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw
-------------
in_file: 3.haukex.pl
new base is 4.haukex.pl
save path is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw/4.haukex.pl
return is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw/4.haukex.pl
2.haukex.pl
3.haukex.pl
4.haukex.pl

#!/usr/bin/perl -w
use 5.011;
use Carp;
use Data::Alias 'alias';
use Data::Dumper;
use utf8;   # a la François
use open OUT => ':encoding(utf8)';
use open ':std';

sub rangeparse {
	local $_ = shift;
	my @o;  #  row1,col1, row2,col2  (-1 = last row/col)
	if (@o=/\AR([0-9]+|n)C([0-9]+|n):R([0-9]+|n)C([0-9]+|n)\z/) {}
	elsif (/\AR([0-9]+|n):R([0-9]+|n)\z/) { @o=($1,1,$2,-1) }
	elsif (/\AC([0-9]+|n):C([0-9]+|n)\z/) { @o=(1,$1,-1,$2) }
	elsif (/\AR([0-9]+|n)C([0-9]+|n)\z/) { @o=($1,$2,$1,$2) }
	elsif (/\AR([0-9]+|n)\z/) { @o=($1,1,$1,-1) }
	elsif (/\AC([0-9]+|n)\z/) { @o=(1,$1,-1,$1) }
	else { croak "failed to parse '$_'" }
	$_ eq 'n' and $_=-1 for @o;
	return \@o;
}



use Test::More tests=>2;



is_deeply rangeparse("RnC2:RnC5"), [ -1, 2, -1, 5 ];
is_deeply rangeparse("R3C2:RnCn"), [  3, 2, -1,-1 ];

my $data = ['й', ' ', ' ', 'л', ' ', ' ',  'с', ' ', ' ', 1..9];

say Dumper $data; 


sub getsubset {
	my ($data,$range) = @_;
	my $cols = @{$$data[0]};
	@$_==$cols or croak "data not rectangular" for @$data;
	$range = rangeparse($range) unless ref $range eq 'ARRAY';
	@$range==4 or croak "bad size of range";
	my @max = (0+@$data,$cols)x2;
	for my $i (0..3) {
		$$range[$i]=$max[$i] if $$range[$i]<0;
		croak "index $i out of range"
			if $$range[$i]<1 || $$range[$i]>$max[$i];
	}
	croak "bad rows $$range[0]-$$range[2]" if $$range[0]>$$range[2];
	croak "bad cols $$range[1]-$$range[3]" if $$range[1]>$$range[3];
	my @cis = $$range[1]-1 .. $$range[3]-1;
	return [ map { sub{\@_}->(@{$$data[$_]}[@cis]) }
		$$range[0]-1 .. $$range[2]-1 ]
}

This is a trimmed-down version of haukex's result in Selecting Ranges of 2-Dimensional Data. I'm populating it with Cyrillic values and hope to run some tests, but I still want to get this clone tool squared away. I'm still working through the other parts of your post.
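
A note for the tests to come: getsubset dereferences `$$data[0]` as an array and croaks unless every row has the same length, so it appears to want one array reference per row. Here is a minimal sketch of Cyrillic data in that shape, with the slice written out by hand. The 3x3 layout and its values are my assumption for illustration, and the hand-written slice copies values rather than aliasing them as the original does.

```perl
use strict;
use warnings;
use utf8;                               # the source contains literal Cyrillic
binmode STDOUT, ':encoding(UTF-8)';     # encode characters on the way out

# Assumed layout: one array ref per row, every row the same length.
my $data = [
    [ 'й', 'л', 'с' ],
    [ 1,   2,   3   ],
    [ 4,   5,   6   ],
];

# Rows 1-2, columns 2-3 (1-based), which is what "R1C2:R2C3" would ask for.
my @cis    = (1 .. 2);                  # zero-based column indices
my $subset = [ map { [ @{ $data->[$_] }[@cis] ] } 0 .. 1 ];

print join(' ', @$_), "\n" for @$subset;
```

The `binmode` line is what keeps the Cyrillic cells from triggering "Wide character" warnings when they are printed.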

In reply to Re^2: create clone script for utf8 encoding by Aldebaran
in thread create clone script for utf8 encoding by Aldebaran
