in reply to Re: create clone script for utf8 encoding
in thread create clone script for utf8 encoding

A file's encoding is not some kind of metadata attribute secretly attached to the file. A file just contains bytes, and it is up to the programs reading and writing it to translate that sequence of bytes to and from the more abstract concept of "characters" (I am using that term loosely here).
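That point can be made concrete with a short sketch (my own illustration, not from the clone script): the same two bytes come out as different characters depending on which decoding the reading program applies.

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

# The same pair of bytes, handed to two different decoders:
my $bytes = "\xD0\xB9";
print decode('UTF-8',      $bytes), "\n";   # one character:  'й'
print decode('ISO-8859-1', $bytes), "\n";   # two characters: 'Ð¹'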

Am I correct that what my OS is telling me is its best guess as to how to interpret this file and have it make any sense?

$ file -i *.pl
18.clone.pl: text/x-perl; charset=us-ascii
1.a.pl:      text/x-perl; charset=utf-8
1.haukex.pl: text/x-perl; charset=us-ascii
1.k.pl:      text/x-perl; charset=us-ascii
2.haukex.pl: text/x-perl; charset=us-ascii
3.haukex.pl: text/x-perl; charset=utf-8
3.ping3a.pl: text/x-perl; charset=us-ascii
4.haukex.pl: text/x-perl; charset=utf-8
4.ping3a.pl: text/x-perl; charset=us-ascii
5.ping3a.pl: text/x-perl; charset=us-ascii
$

What seems to be very much the case is that the OS reports the file as utf-8 as soon as it contains non-ASCII characters that form valid UTF-8. I did nothing to the #.haukex scripts to change them from us-ascii to utf-8 except begin to include Cyrillic characters, like so with pre tags:

$ ./1.a.pl 3.haukex.pl
argv is 3.haukex.pl
before decode is 3.haukex.pl
after decode is 3.haukex.pl
current is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw
-------------
in_file: 3.haukex.pl
new base is 4.haukex.pl
save path is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw/4.haukex.pl
return is /home/bob/2.scripts/pages/1.cw/template_stuff/translations/rus.cw/4.haukex.pl
2.haukex.pl
3.haukex.pl
4.haukex.pl

#!/usr/bin/perl -w
use 5.011;
use Carp;
use Data::Alias 'alias';
use Data::Dumper;
use utf8;   # a la François
use open OUT => ':encoding(UTF-8)', ':std';  # strict UTF-8 on output handles, incl. STDOUT/STDERR

sub rangeparse {
	local $_ = shift;
	my @o;  #  row1,col1, row2,col2  (-1 = last row/col)
	if (@o=/\AR([0-9]+|n)C([0-9]+|n):R([0-9]+|n)C([0-9]+|n)\z/) {}
	elsif (/\AR([0-9]+|n):R([0-9]+|n)\z/) { @o=($1,1,$2,-1) }
	elsif (/\AC([0-9]+|n):C([0-9]+|n)\z/) { @o=(1,$1,-1,$2) }
	elsif (/\AR([0-9]+|n)C([0-9]+|n)\z/) { @o=($1,$2,$1,$2) }
	elsif (/\AR([0-9]+|n)\z/) { @o=($1,1,$1,-1) }
	elsif (/\AC([0-9]+|n)\z/) { @o=(1,$1,-1,$1) }
	else { croak "failed to parse '$_'" }
	$_ eq 'n' and $_=-1 for @o;
	return \@o;
}



use Test::More tests=>2;



is_deeply rangeparse("RnC2:RnC5"),   -1, 2, -1, 5 ;
is_deeply rangeparse("R3C2:RnCn"),    3, 2, -1,-1 ;

# getsubset() below expects an array of equal-length rows
my $data = [['й', ' ', ' ', 'л', ' ', ' ', 'с', ' ', ' '], [1..9]];

say Dumper $data; 


sub getsubset {
	my ($data,$range) = @_;
	my $cols = @{$$data[0]};
	@$_==$cols or croak "data not rectangular" for @$data;
	$range = rangeparse($range) unless ref $range eq 'ARRAY';
	@$range==4 or croak "bad size of range";
	my @max = (0+@$data,$cols)x2;
	for my $i (0..3) {
		$$range[$i]=$max[$i] if $$range[$i]<0;
		croak "index $i out of range"
			if $$range[$i]<1 || $$range[$i]>$max[$i];
	}
	croak "bad rows $$range[0]-$$range[2]" if $$range[0]>$$range[2];
	croak "bad cols $$range[1]-$$range[3]" if $$range[1]>$$range[3];
	my @cis = $$range[1]-1 .. $$range[3]-1;
	return [ map { sub{\@_}->(@{$$data[$_]}[@cis]) }
		$$range[0]-1 .. $$range[2]-1 ]
}
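
To sanity-check getsubset against the two-row $data above, a quick call like the following (my own addition, not part of the original script) should print a one-row subset:

my $subset = getsubset($data, "R1C1:R1C3");   # row 1, columns 1 through 3
say Dumper $subset;   # expect [['й', ' ', ' ']]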

This is a trimmed-down version of haukex's result in Selecting Ranges of 2-Dimensional Data. I'm populating it with Cyrillic values and hope to run some tests, but I still want to get this clone tool squared away. Still working through other parts of your post....

Re^3: create clone script for utf8 encoding
by haukex (Archbishop) on Dec 19, 2018 at 10:15 UTC
    Am I correct that what my OS is telling me is its best guess as to how to interpret this file and have it make any sense?

    Yes, with the emphasis being that it's just a guess.

    the OS thinks the doc is utf8 if there are utf8 non-ascii characters in it.

    Yes, and it might be important to note that there are certain sequences of bytes that are not valid UTF-8 (see e.g. UTF-8), which means that in some cases, it's possible to differentiate between random bytes and valid UTF-8 text. Also, just to nitpick a little, it's not the OS guessing the file's encoding, it's the file tool.
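
    A minimal way to do that check in Perl (a sketch of my own, not what file itself does internally): decode the bytes with the strict "UTF-8" encoding and see whether it croaks.

    use Encode qw(decode FB_CROAK);

    # True if the given bytes form valid UTF-8, false otherwise.
    sub looks_like_utf8 {
        my $bytes = shift;
        return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
    }

    print looks_like_utf8("\xD0\xB9") ? "valid\n" : "invalid\n";   # valid ('й')
    print looks_like_utf8("\xFF\xFE") ? "valid\n" : "invalid\n";   # invalid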

    I did nothing with the #.haukex scripts to change from us-ascii to utf8 but begin to include cyrillic characters

    Note that if you have a file that is originally ASCII and you add non-ASCII characters to it, it's up to the editor to choose which encoding it will use when saving the file. Many editors will default to UTF-8, but some may not!
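
    In a Perl program you can make that choice explicit instead of leaving it to an editor, by opening the file with an encoding layer (the filename here is just for illustration):

    use strict;
    use warnings;

    # The program writing the file decides the encoding, explicitly:
    open my $fh, '>:encoding(UTF-8)', 'cyrillic.txt' or die "cyrillic.txt: $!";
    print $fh "\x{0439}\n";   # 'й', stored on disk as the two bytes 0xD0 0xB9
    close $fh or die "close: $!";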

    with pre tags

    You may have noticed that when using <pre> tags, you have to escape square brackets, [ is &#91; and ] is &#93;.
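
    A throwaway one-liner to do that escaping before pasting (my own sketch, not a PerlMonks feature):

    perl -pe 's/\[/&#91;/g; s/\]/&#93;/g' 3.haukex.pl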

    Update: Improved wording of first paragraph.

      it might be important to note that there are certain sequences of bytes that are not valid UTF-8 (see e.g. UTF-8), which means that in some cases, it's possible to differentiate between random bytes and valid UTF-8 text.

      I see.

      Also, just to nitpick a little, it's not the OS guessing the file's encoding, it's the file tool.

      Thank you for the delousifying reference to file. I pulled out what I thought was relevant. I've "known" this before, but if you get behind on reading, things change.

      You may have noticed that when using pre tags, you have to escape square brackets

      I do now. Life is like a box of chocolates with pre tags for this particular Forrest Gump. The engine that parses the XML is gonna look at [ ] and create a hyperlink, isn't it? I think I'm gonna go back to code tags, even when the content has Cyrillic. Others can then make a clean download without having to copy and paste off the screen.