PerlMonks  

Using pragma utf8::all in processing non-utf data.

by humble (Acolyte)
on Sep 02, 2013 at 09:57 UTC ( [id://1051913]=perlquestion )

humble has asked for the wisdom of the Perl Monks concerning the following question:

Good time of the day.

I use utf8::all everywhere in my scripts now. I want the scripts to: a) warn me, rather than die, when I open a file that contains non-UTF-8 characters; b) be able to process data that is not Unicode (for example, to save the results of the Linux "find" command, which returns paths with both Unicode and non-Unicode characters, to run regexps on that data, etc.). Do I have to stop using the utf8::all module while opening the file, then enable it again and convert all the data to Unicode? Or are there other ways?


Replies are listed 'Best First'.
Re: Using pragma utf8::all in processing non-utf data.
by daxim (Curate) on Sep 02, 2013 at 10:08 UTC
    utf8::all has no unimport routine, so you cannot disable it with "no utf8::all". Either do not use it where you do not need it, or undo its effects on filehandles:

    binmode $fh, ':pop';

    See PerlIO.
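Once a handle delivers bytes, each record can be checked by hand. What follows is only a minimal sketch using the core Encode module, not part of utf8::all: the helper name decode_utf8_or_keep_bytes and the sample data are made up for illustration, and it assumes an explicit ':raw' layer in the open mode is enough to get undecoded bytes. Valid UTF-8 records are decoded to characters, anything else produces a warning and is kept as raw bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return decoded text when $bytes is valid UTF-8,
# otherwise warn and hand the bytes back unchanged.
sub decode_utf8_or_keep_bytes {
    my ($bytes) = @_;
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;
    warn "not valid UTF-8, keeping raw bytes\n";
    return $bytes;
}

# Stand-in for "find" output: the same name once as UTF-8, once as KOI8-R.
my $listing = join '',
    encode('UTF-8',  "dir/\x{416}\n"),
    encode('koi8-r', "dir/\x{416}\n");

# An in-memory handle keeps the sketch self-contained; ':raw' asks for bytes.
open my $fh, '<:raw', \$listing or die "cannot open in-memory handle: $!";
my @paths;
while ( my $line = <$fh> ) {
    chomp $line;
    push @paths, decode_utf8_or_keep_bytes($line);
}
close $fh;
```

The first element of @paths ends up as a character string, the second stays a byte string, and one warning goes to STDERR. Regexps on the second one then match bytes, not characters.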

      Can't override utf8?
      Does the "use bytes" pragma override utf8?... um... sure hoping so.. have progs that depend on that.
        The question was about utf8::all. utf8 is only one of its many effects.
Re: Using pragma utf8::all in processing non-utf data.
by farang (Chaplain) on Sep 05, 2013 at 05:39 UTC

    If you have a file with a known encoding, for instance koi8-r, just use

    open( $fh, '< :encoding(koi8-r)', "in_file");
    and it should work fine under utf8::all.
    And i want the scripts to: a) warn me when i open a file that contains non-utf characters rather than die at it;
    I don't think it's possible to do this "when the file is opened", because the error only arises when a byte sequence that is not valid UTF-8 is read and decoded into Perl's internal representation. One way to do it is to use eval while reading the file line by line and trap the error. Here is some code which does that: it tries utf8 first and, if the data turns out not to be valid, warns and retries with koi8-r.
    use strict;
    use warnings;
    use utf8::all;

    open(my $fh, '<', "in_file") or die "cannot open in_file: $!";
    eval { process_file_by_line() };
    if ( $@ =~ /does not map to Unicode/ ) {
        warn $@;
        print "...trying encoding koi8-r instead of utf8\n\n";
        close $fh;
        open( $fh, '< :encoding(koi8-r)', "in_file") or die "cannot open in_file: $!";
        process_file_by_line();
    }
    elsif ( $@ ne '' ) {
        die $@;   # bail out on other eval errors
    }

    sub process_file_by_line {
        while ( <$fh> ) {
            print;
            # whatever else...
        }
    }
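The same try-then-retry idea can also be sketched without depending on the error text: read the file once as bytes and try a list of candidate encodings with the core Encode module. This is an illustration under assumptions, not the method used above. The helper name slurp_with_fallback and the temporary demo file are hypothetical, and the whole file is held in memory.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);
use File::Temp qw(tempfile);

# Hypothetical helper: slurp $path as bytes and return (text, encoding)
# for the first candidate encoding that decodes without error.
sub slurp_with_fallback {
    my ($path, @candidates) = @_;
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    $bytes = '' unless defined $bytes;
    for my $enc (@candidates) {
        my $text = eval { decode($enc, $bytes, FB_CROAK | LEAVE_SRC) };
        return ($text, $enc) if defined $text;
        warn "$path is not valid $enc, trying the next candidate\n";
    }
    die "$path: none of (@candidates) worked\n";
}

# Demo: a file holding the KOI8-R encoding of U+0416 (CYRILLIC CAPITAL LETTER ZHE).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} encode('koi8-r', "\x{416}\n");
close $out or die "cannot close $path: $!";

my ($text, $used) = slurp_with_fallback($path, 'UTF-8', 'koi8-r');
```

Order matters here: a single-byte encoding such as koi8-r accepts almost any byte string, so it only makes sense as the last candidate.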

Re: Using pragma utf8::all in processing non-utf data.
by remiah (Hermit) on Sep 04, 2013 at 09:01 UTC

    Hello humble and monks.

    As daxim says, your script dies because utf8::all imports its settings into your current code. In this case, "use warnings FATAL => 'utf8'" is imported from utf8::all, and that is what makes it die. The code below dies under utf8::all.

    print "before shiftjis\n";
    open( my $fh, "<", "107.shiftjis.txt") or die $!;
    print join('', <$fh>);
    close $fh;
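    If you temporarily disable "use warnings FATAL => 'utf8'", I guess it will not die. The pragma is lexically scoped, so a bare block limits the change:
    {
        use warnings NONFATAL => 'utf8';   # downgrade to a plain warning inside this block only
        print "before shiftjis\n";
        open( my $fh, "<", "107.shiftjis.txt") or die $!;
        print join('', <$fh>);
        close $fh;
    }
    # after the block, the FATAL setting imported by utf8::all applies again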
    regards

      Just unfataling or disabling the warning makes no sense. The :encoding(UTF-8) IO layer will still be active and the readline function will produce garbage input.
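To see what "garbage" can look like, here is a small sketch with the core Encode module, not with the IO layer itself. It assumes the layer's lenient handling resembles Encode's non-fatal modes, which may differ in detail. A single KOI8-R byte is decoded as UTF-8 with two non-fatal CHECK values.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_DEFAULT FB_PERLQQ LEAVE_SRC);

# One byte that is not valid UTF-8: U+0416 encoded as KOI8-R.
my $bytes = encode('koi8-r', "\x{416}");

# Lenient decoding never dies, it substitutes something for the bad byte.
my $replaced = decode('UTF-8', $bytes, FB_DEFAULT | LEAVE_SRC);  # substitution character
my $escaped  = decode('UTF-8', $bytes, FB_PERLQQ  | LEAVE_SRC);  # literal backslash-x escape

printf "replacement: U+%04X\n", ord $replaced;
print  "perlqq:      $escaped\n";
```

Either way the original letter is gone: the program keeps running, but on data that no longer matches the file.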

        Hello daxim.

        As you say, we have to detect its encoding and decode its 'characters' into Perl's internal encoding. ":pop" will give you an un-decoded $fh, which also makes no sense.

        And one more thing about utf8::all. Please see the problem described in the thread "How can I safely unescape a string.". After all, utf8::all never frees you from encoding troubles.

        regards

Approved by Corion