Special character not being captured

Lady_Aleena has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone. I am having a problem with the character Æ in the string Æon Flux not being captured by the first_alpha subroutine below. A space character is being returned. This problem arose when I recently re-encoded my files from Windows-1252 to UTF-8. I am baffled. As always, I appreciate all the help I get.

sub first_alpha {
  my $alpha = shift;
  $alpha = ucfirst($alpha) if $alpha =~ /^\l./;
  $alpha =~ s/\s*\b(A|a|An|an|The|the)(_|\s)//xi;
  if ($alpha =~ /^\d/) {
    $alpha = '#';
  }
  elsif ($alpha !~ /^\p{uppercase}/) {
    $alpha = '!';
  }
  else {
    $alpha =~ s/^(.)(\w|\W)+/$1/;
  }
  return $alpha;
}

Even substr('Æon Flux', 0, 1) returns a space character.

The weird thing is that the full string Æon Flux was returned when I ran the data from the file through my make_hash subroutine and then ran that hash through my alpha_hash subroutine, both in the same module as first_alpha. (You can see the full module here.)

sub make_hash {
  my %opt = @_;
  my $file = $opt{'file'} && ref($opt{'file'}) eq 'ARRAY' ? data_file(@{$opt{'file'}}) : $opt{'file'};
  open(my $fh, '<', $file) || die "Can not open $file $!";

  my @headings = $opt{'headings'} ? @{$opt{'headings'}} : ('heading');

  my %hash;
  while (my $line = <$fh>) {
    chomp $line;
    die "This file is not for Util::Data! Stopped $!" if $line =~ /no Util::Data/i;

    my @values = split(/\|/,$line);
    my $key = scalar @headings > 1 ? $values[0] : shift @values;

    my $n = 0;
    for my $r_heading (@headings) {
      if (defined($values[$n]) && length($values[$n]) > 0) {
        my $split = $r_heading =~ /\+$/ ? 1 : 0;
        (my $heading = $r_heading) =~ s/\+$//;

        my $value = $split == 1 ? [map { $_ =~ s/^ //; $_ } split(/;/,$values[$n])] : $values[$n];

        if (scalar @headings > 1) {
          $hash{$key}{$heading} = $value;
        }
        else {
          $hash{$key} = $value;
        }
      }
      $n++;
    }
  }
  return \%hash;
}

sub alpha_hash {
  my ($org_list, $opt) = @_;
  my %alpha_hash;
  for my $org_value (keys %{$org_list}) {
    my $alpha = !$opt->{article} ? first_alpha($org_value) : substr($org_value, 0, 1);
    $alpha_hash{$alpha}{$org_value} = $org_list->{$org_value};
  }
  return \%alpha_hash;
}

The following is truncated output from alpha_hash.

          'A' => [
                   'Alphas',
                   'Arrow',
                   'Ash vs. Evil Dead'
                 ],
          '�' => [ # in my terminal, there is just a blank space between the quotes.
                   'Æon Flux (1991)'
                 ],
          'I' => [
                   'I Spy (1965)',
                   'The Invisible Man (2000)'
                 ],

As I said earlier, any and all help is appreciated.

No matter how hysterical I get, my problems are not time sensitive. So, relax, have a cookie and a very nice day!

Lady Aleena


Re: Special character not being captured
by choroba (Cardinal) on Jun 17, 2019 at 19:31 UTC

> Even substr('Æon Flux', 0, 1) returns a space character.

No, it works correctly. But you need to handle the encoding appropriately: tell Perl what encoding is used on input (binmode on *DATA below), what encoding is used in the source code (here utf8), and what encoding is used on output (binmode on *STDOUT). Configure the terminal to use the output encoding.

#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use feature qw{ say };

binmode *STDOUT, ':encoding(UTF-8)';
say substr('Æon Flux', 0, 1);

binmode *DATA, ':encoding(UTF-8)';
say substr <DATA>, 0, 1;

__DATA__
Æon Flux

Output:

Æ
Æ

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]


Re^2: Special character not being captured

by vr (Curate) on Jun 18, 2019 at 12:49 UTC

I'm curious how the \x{FFFD} became a hash key?

>perl -MEncode=decode -MData::Dump=dd -E "dd decode q(UTF-8), substr qq(\xC3\x86),0,1"
"\x{FFFD}"

but I don't see Lady_Aleena decoding anything.


Re^3: Special character not being captured

by choroba (Cardinal) on Jun 18, 2019 at 12:59 UTC

decode

You need

Encode::encode("UTF-8", substr(Encode::decode("UTF-8", "\xC3\x86"),0,1))


Re^4: Special character not being captured

by vr (Curate) on Jun 18, 2019 at 13:09 UTC

Re^5: Special character not being captured

by choroba (Cardinal) on Jun 18, 2019 at 13:17 UTC

Re^3: Special character not being captured

by Lady_Aleena (Priest) on Jun 18, 2019 at 20:39 UTC

Do you want to see step-by-step how I got to the point where the special character Æ got into first_alpha with all code along the way?


Re^4: Special character not being captured

by vr (Curate) on Jun 19, 2019 at 08:50 UTC

Re^2: Special character not being captured

by Lady_Aleena (Priest) on Jun 17, 2019 at 19:57 UTC

Please, would you give me your opinion why I did not need to specify encoding while the original data file was encoded as Windows-1252?


Re^3: Special character not being captured

by choroba (Cardinal) on Jun 17, 2019 at 20:09 UTC


Re^3: Special character not being captured

by ikegami (Patriarch) on Jun 24, 2019 at 22:26 UTC

First of all, ignore any explanation that mentions latin-1. Perl doesn't know anything about latin-1.

The lack of decoding of inputs plus the lack of encoding of outputs means the bytes were copied through. So if your input was encoded using cp1252, and if this is output to a terminal (or browser) expecting cp1252, it works.^[1]

The problem with this approach is that lots of tools expect decoded text (strings of Unicode Code Points), not encoded text (string of cp1252 bytes).

For example,

/\w/ will fail to work properly.
uc will fail to work properly.
length might not do what you want (for some encodings).
substr might not do what you want (for some encodings).
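
A minimal sketch of the difference, assuming the two bytes 0xC3 0x86 quoted elsewhere in this thread followed by ASCII text; the variable names are made up for illustration. It compares the same data as encoded bytes and as decoded text:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the bytes 0xC3 0x86 from this thread, followed by ASCII text.
my $bytes = "\xC3\x86on Flux";          # encoded text: a string of UTF-8 bytes
my $text  = decode('UTF-8', $bytes);    # decoded text: a string of code points

my $len_bytes = length $bytes;          # counts bytes
my $len_text  = length $text;           # counts characters

my $first_byte = substr $bytes, 0, 1;   # only the first byte of a two-byte sequence
my $first_char = substr $text,  0, 1;   # the whole first character

# Does \w match, and does lc change anything?
my $word_byte = $first_byte =~ /\w/ ? 'yes' : 'no';
my $word_char = $first_char =~ /\w/ ? 'yes' : 'no';
my $lc_byte   = lc($first_byte) ne $first_byte ? 'yes' : 'no';
my $lc_char   = lc($first_char) ne $first_char ? 'yes' : 'no';

printf "length:     %d (bytes) vs %d (text)\n", $len_bytes, $len_text;
printf "first unit: 0x%02X (bytes) vs U+%04X (text)\n", ord $first_byte, ord $first_char;
printf "\\w matches: %s (bytes) vs %s (text)\n", $word_byte, $word_char;
printf "lc changes: %s (bytes) vs %s (text)\n", $lc_byte, $lc_char;
```

Only ASCII is printed, so the script does not depend on any output layer or on how the terminal is configured.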

In detail:

[1] Perl expects the source file to be encoded using ASCII (no utf8;) or UTF-8 (use utf8;).^[2] That said, when expecting ASCII (no utf8;), bytes outside of ASCII in string literals produce a character with the same value in the resulting string.
For example, say Perl expects ASCII (no utf8;) and it encounters a string literal that contains byte 0x80. This is illegal ASCII, but it's "€" in cp1252. Perl will produce a string that contains character 0x80. If you were to later print this out to a terminal expecting cp1252 (without doing any form of encoding), you'd see "€".

[2] EBCDIC machines expect EBCDIC and UTF-EBCDIC rather than ASCII and UTF-8.
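
A small sketch of the byte 0x80 case. It uses Encode's cp1252 support on a one-byte string rather than a source-file literal; that is an assumption made so the example does not depend on how the script itself is saved. The variable names are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $byte = "\x80";                     # one byte with value 0x80

my $undecoded = ord $byte;             # left alone: still character 0x80
my $euro = decode('cp1252', $byte);    # decoded: the code point cp1252 assigns to 0x80
my $back = encode('cp1252', $euro);    # encoded again: back to the single byte

printf "undecoded: U+%04X\n", $undecoded;
printf "decoded:   U+%04X\n", ord $euro;
printf "round trip matches original byte: %s\n", ($back eq $byte ? 'yes' : 'no');
```

The undecoded string only looks right when whatever displays it happens to expect cp1252; the decoded one is what functions like uc and regex classes are designed to work on.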


Re^2: Special character not being captured

by Anonymous Monk on Jun 17, 2019 at 20:04 UTC

Note that `use utf8;` also applies to the DATA section, so the binmode on *DATA isn't needed.


Re^2: Special character not being captured

by ikegami (Patriarch) on Jun 24, 2019 at 22:07 UTC

DATA is the handle used by Perl to read the source file. As a result, use utf8; affects not just the source file, but DATA as well. Specifically, it adds a :utf8 layer to DATA. Since DATA already has a :utf8 layer, adding :encoding(UTF-8) is incorrect (though harmless).

Furthermore, use open ':std', ':encoding(UTF-8)'; adds :encoding(UTF-8) not just to STDOUT, but also to STDIN and STDERR. (It also causes instances of open in scope to add that layer by default.) And it does so at compile time. This is usually the better route.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use utf8;
use open ':std', ':encoding(UTF-8)';

say substr('Æon Flux', 0, 1);
say substr <DATA>, 0, 1;

__DATA__
Æon Flux


Re^3: Special character not being captured

by Lady_Aleena (Priest) on Jun 25, 2019 at 18:13 UTC

I'm not using the DATA handle. The strings that get passed to first_alpha come from hashes most of the time. I asked earlier in this thread, but do I need to post the entire process that led to the problem with this one character not being "seen" properly by first_alpha?


Re^2: Special character not being captured

by Lady_Aleena (Priest) on Jun 20, 2019 at 18:15 UTC

choroba, it confuses me why make_hash returns the correct strings as keys and values without my having to specify an encoding, but when I go to get the first character with either the first_alpha subroutine or substr, I suddenly need to specify the encoding. All of these subroutines are in the same module, where encoding is not specified anywhere. It is confusing that some subroutines return the correct strings without an encoding being specified while others do not.

If this helps, I am including my locale.

me@office:~$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

As an aside, I rewrote first_alpha. The original horrified me. I hope the rewrite is cleaner.

Original, followed by the rewrite:

sub first_alpha {
  my $alpha = shift;
  $alpha = ucfirst($alpha) if $alpha =~ /^\l./;
  $alpha =~ s/\s*\b(A|a|An|an|The|the)(_|\s)//xi;
  if ($alpha =~ /^\d/) {
    $alpha = '#';
  }
  elsif ($alpha !~ /^\p{uppercase}/) {
    $alpha = '!';
  }
  else {
    $alpha =~ s/^(.)(\w|\W)+/$1/;
  }
  return $alpha;
}

sub first_alpha {
  my $string = shift;
  $string =~ s/\s*\b(A|a|An|an|The|the)(_|\s)//xi;

  my $alpha = uc substr($string, 0, 1);
  if ($alpha =~ /^\d/) {
    $alpha = '#';
  }
  elsif ($alpha !~ /^\p{uppercase}/) {
    $alpha = '!';
  }
  return $alpha;
}

Re^3: Special character not being captured

by choroba (Cardinal) on Jun 21, 2019 at 07:15 UTC

> when I go to get the first character (...) I suddenly need to specify the encoding

UTF-8 is a multi-byte encoding. It means that some characters, Æ being one of them, are encoded by more than one byte (in this case, two bytes: 0xC3 0x86). If a string starts with such a character, but Perl doesn't know the encoding, it assumes Latin-1, which is a single-byte encoding. The first character then corresponds to the first byte only, which is 0xC3. On its own, that byte doesn't have any meaning in UTF-8, so it's transformed into �, the replacement character.
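
A hedged sketch of how this plays out when reading a file, assuming a pipe-delimited line similar to what make_hash reads. The in-memory file and the first_line helper are hypothetical and not part of the module. Without a layer the bytes pass through untouched, which is why whole strings still print correctly on a UTF-8 terminal, while substr picks up only the first byte:

```perl
use strict;
use warnings;

# Assumption: one pipe-delimited line held in memory as UTF-8 bytes
# (0xC3 0x86 is the two-byte sequence quoted above).
my $file_bytes = "\xC3\x86on Flux (1991)|animated\n";

# Hypothetical helper: read the first line, optionally through an I/O layer.
sub first_line {
  my ($layer) = @_;
  open(my $fh, "<$layer", \$file_bytes) or die "Can not open in-memory file: $!";
  my $line = <$fh>;
  close $fh;
  chomp $line;
  return $line;
}

my $raw     = first_line('');                  # no decoding: a string of bytes
my $decoded = first_line(':encoding(UTF-8)');  # decoded: a string of characters

printf "raw:     first unit 0x%02X, length %d\n", ord(substr $raw, 0, 1), length $raw;
printf "decoded: first unit U+%04X, length %d\n", ord(substr $decoded, 0, 1), length $decoded;
```

In a module like the one in this thread, the equivalent change would be adding the layer to the open call (or using the open pragma) and encoding again on output; treat that as a suggestion to test rather than a confirmed fix.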


Re^4: Special character not being captured

by Lady_Aleena (Priest) on Jun 23, 2019 at 17:47 UTC

Re^5: Special character not being captured

by choroba (Cardinal) on Jun 24, 2019 at 07:19 UTC

