Norah has asked for the wisdom of the Perl Monks concerning the following question:

I know this type of question has been asked many times, but none of the answers are working for me. Here is what I want to do:

1) read a file in UTF-8 encoding
2) do a bunch of processing - character by character, concatenations, substrings, concatenating it with other variables and other constants
3) write the resulting record out in UTF-8

I have tried using "< :encoding(UTF-8)" on the open. It won't read a single line - there is no error message - the program hangs and aborts.

I have tried reading a record without encoding and then doing a decode("UTF-8",$_) (note I have put $_ into a variable, and I do chomp it). This works, but it reads the file and processes the record just as if it were ASCII.

As I understand it, Perl has its own internal encoding. All the variables are in that encoding. If I don't specify a thing I assume the reads and writes default to the machine's encoding which is ASCII (Windows is what I want it to run on). So I think all I have to do is read the input file using encoding, do my work, then encode the result and write the resulting record. It isn't working at all. I did add the "use" as well. And note - I am doing the read with the diamond operator in a while... while (<INFILE>)

Any sample program or advice would be much appreciated. I cannot even find a sample program on the internet yet that reads a file and writes some output.

Replies are listed 'Best First'.
Re: Read and write UTF-8
by Norah (Novice) on Oct 14, 2016 at 22:44 UTC

    Here is an example:

    #!/usr/bin/perl
    use Encode qw/encode decode/;

    open (INFILE, "< :encoding(UTF-8)", "utf8.txt") || die "blah blah blah";
    open (OUTFILE, "> :encoding(UTF-8)", "oututf8.txt") || die "blah blah";

    while (<INFILE>) {
        $line = $_;
        chomp ($line);
        $linestart = substr($line,0,20);
        $outline = "First 20: "."$linestart";
        print OUTFILE "$outline\n";
    }
    close (INFILE);

    Actually this one reads and writes the non-ASCII characters, but when there is a non-ASCII character in the record it doesn't count the correct # of characters.

      You talk about characters - when using UTF-8, length and substr count characters, not octets, so in the output, you can find more than 21 octets. If you were already talking about characters, not octets, can you please show some short example input that exhibits the problem, preferably together with a hexdump of the relevant portion of the file?
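To illustrate the characters-versus-octets distinction, here is a minimal sketch. The sample string is an assumption (a fragment modelled on the "GJ349áBruce" record from the posted input), and only Encode's decode and encode are used:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample: the octets of "GJ349áBruce" as they would sit in a UTF-8 file.
# "á" (U+00E1) is stored as the two octets C3 A1.
my $octets      = "GJ349\xC3\xA1Bruce";
my $octet_count = length $octets;          # 12: length of the undecoded octet string

# Decoding turns the octets into a string of characters.
my $chars      = decode('UTF-8', $octets);
my $char_count = length $chars;            # 11: "á" is now a single character

print "octets: $octet_count\n";
print "chars:  $char_count\n";

# substr on the decoded string counts characters, so "á" occupies one position,
# but the result grows again when it is encoded for output.
my $first6 = substr $chars, 0, 6;          # "GJ349á"
printf "first 6 characters encode to %d octets\n",
    length encode('UTF-8', $first6);       # 7
```

This is why a 20-character substr can legitimately produce more than 20 octets in the output file.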

        If you want to extract a given number of bytes from a UTF-8 string, use bytes::substr:

        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };
        use open IO => ':encoding(UTF-8)', ':std';

        my $string = join q(), map chr, 9312 .. 9321;
        say $string;
        say substr $string, 0, 7;
        say bytes::substr $string, 0, 7;

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      it doesn't count the correct # of characters.

      With no sample data, that's pretty hard to verify. Have you read "length() miscounting UTF8 characters?" There's lots of good info in there (length works analogously to substr) along with easy examples of how to provide code complete with data which demonstrates the issue.

      And to be clear - what it did here - it did the exact same thing when I removed the encoding from the open statements.
      Here is simple input data:
      Year*JEDocSrcP_USERE_DATE P_DATE CurLine
      2011617 GJ448 Bruce12/20/1101/01/11USD1500
      2011617 GJ349áBruce12/20/1101/01/11USD1500
      2011617 GJ350 Bruce12/20/1101/01/11USD1500
      2011617 GJ351 Bruce12/20/1101/01/11USD1500
      The output looks like this:
      First 20: Year*JEDocSrcP_
      First 20: 2011617 GJ448 Bruce1
      First 20: 2011617 GJ349áBruce
      First 20: 2011617 GJ350 Bruce1
      First 20: 2011617 GJ351 Bruce1
      Note that the asterisk * is really the UTF-8 heart symbol ♥. It wasn't displaying correctly here so I just put an asterisk there. I will work on the hex dump for you.

      but I think it is counting octets

        Note that the hexdump of your input data has a Byte Order Mark (BOM) at the front of it, which Perl counts as a character once the line is decoded.

        Discounting the BOM, I get the expected output with the following program:

        #!/usr/bin/perl -w
        use strict;
        use Encode qw/encode decode/;

        open (INFILE, "<:encoding(UTF-8)", "utf8.txt") || die "blah blah blah";
        open (OUTFILE, ">:encoding(UTF-8)", "oututf8.txt") || die "blah blah";
        binmode STDOUT, ':encoding(UTF-8)';

        print "Ruler : [12345678901234567890]\n";
        while (my $line = <INFILE>) {
            chomp ($line);
            print "Input : [$line]\n";
            my $linestart = substr($line,0,20);
            my $outline = $linestart;
            print "20 : [$outline]\n";
            print "---\n";
            print OUTFILE "$outline\n";
        }
        close (INFILE);

        To remove the BOM at the start of your file, you can probably just use

        $line =~ s!^\N{BYTE ORDER MARK}!!;
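As a rough sketch of how the BOM affects the count, the following reads a line through the :encoding(UTF-8) layer from an in-memory file and measures it before and after stripping the BOM. The sample octets are an assumption (a shortened header line in the style of the posted data), and the BOM is written as \x{FEFF}, the code point that \N{BYTE ORDER MARK} names:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sample: BOM (EF BB BF), "Year", a heart symbol (E2 99 A5), "JEDoc", CRLF.
my $file_octets = "\xEF\xBB\xBFYear\xE2\x99\xA5JEDoc\r\n";

# Open the octets as an in-memory file so the sketch needs no file on disk.
open my $in, '<:encoding(UTF-8)', \$file_octets
    or die "Cannot open in-memory file: $!";
my $line = <$in>;
close $in;

# Strip the line ending; this also handles the CR of a Windows file read elsewhere.
$line =~ s/\r?\n\z//;

my $with_bom = length $line;       # the decoded BOM counts as one character
$line =~ s/^\x{FEFF}//;            # remove it from the first line only
my $without_bom = length $line;

print "with BOM: $with_bom, without BOM: $without_bom\n";
```

In a real loop the substitution only needs to run on the first line of the file, for example guarded by `$. == 1`.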
Re: Read and write UTF-8
by fishmonger (Chaplain) on Oct 15, 2016 at 00:10 UTC

    I have no experience writing scripts for utf8 but I recently viewed several YAPC 2016 youtube videos on the subject and they are very interesting and should answer your question. Ricardo Signes is one of the leading perl unicode experts and his talks are very good and humorous.

    Here is one of them. https://www.youtube.com/watch?v=TmTeXcEixE.

Re: Read and write UTF-8
by Norah (Novice) on Oct 15, 2016 at 17:50 UTC
    The hex for the first line of the input file (I think this should be enough):

    EF BB BF 59 65 61 72 E2 99 A5 4A 45 44 6F 63 53 72 63 50 5F 55 53 45 5F 44 41 54 45 20 20 50 5F 44 41 54 45 20 20 43 75 72 4C 69 6E 65 0D 0A

Re: Read and write UTF-8
by Anonymous Monk on Oct 17, 2016 at 20:33 UTC
    All

    Thank you for the thoughts. It works now but it was a strange sequence of issues. One issue was I had to remove the hyphen in "UTF-8". Whatever version I am using simply doesn't like that hyphen. Second, I had tabs in the input file and kept thinking they were a series of blanks. So it was counting characters correctly all along (silly me).

    Thank you for the suggestion to remove the BOM. That was important as it was counting that as a character too.

    Thanks again.

      Note that UTF8 and UTF-8 aren't equivalent:
       $ perl -lwE 'binmode STDOUT, ":encoding(UTF8)"; print chr 10240000;'
      Code point 0x9C4000 is not Unicode, may not be portable at -e line 1.
      �����
       $ perl -lwE 'binmode STDOUT, ":encoding(UTF-8)"; print chr 10240000;'
      Code point 0x9C4000 is not Unicode, may not be portable at -e line 1.
      "\x{9c4000}" does not map to utf8 at -e line 1.
      \x{9C4000}
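The same difference can be sketched with the Encode module directly instead of an I/O layer. This is only an illustration under the assumption that Encode behaves as the layers do in the one-liners: "utf8" is the lax, Perl-extended encoding and "UTF-8" is the strict one. Encode::FB_CROAK makes the strict encoder die instead of substituting:

```perl
#!/usr/bin/perl
use strict;
use warnings;
no warnings 'utf8';    # silence "Code point ... is not Unicode" for this demo
use Encode qw(encode);

# Same out-of-range code point as in the one-liners: 0x9C4000, beyond U+10FFFF.
my $wide = chr 10240000;

# Lax "utf8": Perl's extended encoding accepts the code point.
my $lax_octets = encode('utf8', $wide);
printf "utf8  : encoded to %d octets\n", length $lax_octets;

# Strict "UTF-8": with FB_CROAK the encoder dies on the same input.
# A copy is passed because encode may modify its argument when a check is set.
my $copy      = $wide;
my $strict_ok = eval { encode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;

if ($strict_ok) {
    print "UTF-8 : accepted\n";
}
else {
    print "UTF-8 : rejected: $@";
}
```

For files that are meant to be interchanged with other programs, the strict "UTF-8" spelling is usually the safer choice.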
      
