toohoo has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody

I have an issue with Encode::Guess. I don't know what I'm doing wrong.

I want to guess the encoding of a file.

#!/usr/bin/perl
use strict;
use warnings;

print "2utf8.pl\n";

use Encode;
use Encode::Guess;
#my @encs = Encode->encodings(":all");
#print join("--",@encs);

push( @INC, '.');
#use sniver;

our $data;

if( $ARGV[0] eq '') {
    print "no ARGV[0]?: $ARGV[0]\n";
    usage();
} elsif ($ARGV[0] ne '') {
    print "IS ARGV[0]: $ARGV[0]\n";
    $data = getfile($ARGV[0]);
    if($data){
        print length($data),"--\n";
        print "length(data)>0\n";
    }else{
        print "lenght(data)<=0\n";
        #my EIN;
        abor("getfile no success");
    }
} else {
    print "no ARGV[0]?: $ARGV[0]\n";
}

my $encodings_test = 'ascii cp1252 cp437 cp850 iso-8859-1 utf-8-strict utf8';
my $decoder = guess_encoding($data, qw/$encodings_test/);
print "decoder: $decoder\n";

sub usage {
    print "2utf8.pl - thomas hofmann (c) Apr 2020\nUSAGE: perl 2utf8.pl (file)\n";
}

sub mes {
    my $mes = shift;
    if($mes){
        print "$mes";
        if($mes !~ m/\n$/){print "\n";}
    }
}

sub abor {
    my $mes = shift;
    if($mes){
        mes( "Error: $mes" );
        if($mes !~ m/\n$/){print "\n";}
    }
    usage();
    exit(1);
}

sub getfile {
    my ($filepath, @rest) = @_;
    my $content = undef;
    my $orgein = $/;
    local (*GETFILEDAT);
    if (!open(GETFILEDAT, $filepath)) {
        return ($content);
    }
    undef ($/);
    binmode(GETFILEDAT);
    $content = <GETFILEDAT>;
    close (GETFILEDAT);
    $/ = $orgein;
    return ($content);
}

The input file is the following ABC.dcm file:

* Datensatz: XXX
* Erzeugt von: Hofmann, Thomas
* Datum: 24.05.2019 08:06:27
*
* ASE DCEnv ComponentV2.6.0.2243
KONSERVIERUNG_FORMAT 2.0
FESTWERT AAA
   LANGNAME "Querbeschleunigunge"
   EINHEIT_W "-"
   WERT 0.8000000119
END
FESTWERT BBB
   LANGNAME "Rollzentrumshoehe"
   EINHEIT_W "m"
   WERT 0.4600000083
END
FESTWERT CCC
   LANGNAME "Sportmodus"
   EINHEIT_W "m"
   WERT 0.4699999988
END
FESTWERT DDD
   LANGNAME "Aufbaumasse"
   EINHEIT_W "kg"
   WERT 1850.000000
END
FESTWERT EEE
   LANGNAME "Federsteifigkeit"
   EINHEIT_W "N/mm"
   WERT 25
END
FESTWERT FFF
   LANGNAME "Vorderachse"
   EINHEIT_W "N/mm"
   WERT 22
END
FESTWERT GGG
   LANGNAME "Momentenverteilung abhängig"
   EINHEIT_W "m/s^2"
   WERT 0.5
END
FESTWERT HHH
   LANGNAME "abhängig der Querbeschleunigung"
   EINHEIT_W "-"
   WERT 0.25
END

Thanks in advance

Replies are listed 'Best First'.
Re: issue with Encode::Guess
by Corion (Patriarch) on Apr 04, 2020 at 18:24 UTC

    So, how does your program behave? And how does that differ from what you expect?

    Also, what encoding is your input file in? Latin-1? UTF-8? ISO-8859-15?

      Hello @Corion,

      In this first step my program should behave as follows:
      it should tell me what encoding the input file is in.
      In a further step it should convert from the input encoding to UTF-8. But this is not programmed yet.

      Because my program will be called with many input files in the future, I don't know what encoding the input files will be in.
      For my example file, assume the encoding is ANSI. The special case is the German umlauts, such as the entity auml.

      Thanks

        This code:

        my $encodings_test = 'ascii cp1252 cp437 cp850 iso-8859-1 utf-8-strict utf8';
        my $decoder = guess_encoding($data, qw/$encodings_test/);
        print "decoder: $decoder\n";

        ... does not do what you think it does. qw does not interpolate variables into its list; here it yields the single literal string '$encodings_test'. You likely want:

        my @encodings_test = qw(ascii cp1252 cp437 cp850 iso-8859-1 utf-8-strict utf8);
        my $decoder = guess_encoding($data, @encodings_test);

        ... or if you want to keep your list of encodings as a string (why?!), split it into a list:

        my $encodings_test = 'ascii cp1252 cp437 cp850 iso-8859-1 utf-8-strict utf8';
        my @encodings_test = split /\s+/, $encodings_test;
        my $decoder = guess_encoding($data, @encodings_test);
        print "decoder: $decoder\n";
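
        A minimal sketch of how the result might then be checked: guess_encoding returns an encoding object on success and a plain error string otherwise, so ref() tells the two cases apart. The helper name describe_guess and the sample octets are made up for illustration; only the default suspects are used here.

        ```perl
        use strict;
        use warnings;
        use Encode::Guess;

        # describe_guess is a hypothetical helper, not part of Encode::Guess.
        # It passes any extra suspects through to guess_encoding and turns
        # the result (object or error string) into a printable message.
        sub describe_guess {
            my ($octets, @suspects) = @_;
            my $decoder = guess_encoding($octets, @suspects);
            return ref($decoder)
                ? "encoding is: " . $decoder->name
                : "cannot guess: $decoder";
        }

        # Made-up sample: the word "abhängig" as UTF-8 octets.
        print describe_guess("abh\xC3\xA4ngig\n"), "\n";
        ```

        Printing $decoder directly, as in the original script, shows either the error string or an object reference, which is why the sketch asks for the name explicitly.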
Re: issue with Encode::Guess
by 1nickt (Canon) on Apr 04, 2020 at 20:14 UTC

    Hi, welcome to the monastery!

    I took a look at your post and I have some thoughts about both the post and the code.

    First, regarding your post. The following things about it make it difficult to help you:

    • "I have an issue with Encode::Guess" -- you do not say what your issue is.
    • You do not show your expected output.
    • You do not show your actual output.
    • You posted a great deal of irrelevant code, which makes it hard to see what's going on.

    For tips on how to get better help, please see:

    With regard to your code, while there are lots of things about it that could be improved, both it and the following Short, Self-Contained, Correct Example "work" and produce the same essential output.

    use strict; use warnings;
    use Encode::Guess;
    my $encodings_test = 'ascii cp1252 cp437 cp850 iso-8859-1 utf-8-strict utf8';
    guess_encoding('some text', qw/$encodings_test/);
    Output:
    Unknown encoding: $encodings_test at /Users/1nickt/perl5/perlbrew/perls/perl-5.30.1/lib/5.30.1/darwin-2level/Encode/Guess.pm line 120.

    An even simpler SSCCE :

    use strict;
    use warnings;
    use Test::More tests => 2;

    my $str = 'foo bar baz';
    my @arr = ('foo', 'bar', 'baz');
    is_deeply( [ qw/foo bar baz/ ], [ @arr ] );
    is_deeply( [ qw/$str/ ], [ @arr ] );
    Output:
    11115052-3.pl ..
    1..2
    ok 1
    not ok 2
    #   Failed test at 11115052-3.pl line 8.
    #     Structures begin differing at:
    #          $got->[0] = '$str'
    #     $expected->[0] = 'foo'
    # Looks like you failed 1 test of 2.
    Dubious, test returned 1 (wstat 256, 0x100)
    Failed 1/2 subtests

    Test Summary Report
    -------------------
    11115052-3.pl (Wstat: 256 Tests: 2 Failed: 1)
      Failed test:  2
      Non-zero exit status: 1
    Files=1, Tests=2,  0 wallclock secs ( 0.02 usr  0.00 sys +  0.05 cusr  0.00 csys =  0.07 CPU)
    Result: FAIL
    See qw.

    A couple of other observations I had:

    • Your code looks as though you first spent time on your argument parsing, error messaging, copyright printing etc., before actually proving that your algorithm or core implementation worked. Generally it's easier to work from the middle outwards.
    • Your script, judging by its name, appears actually to be trying to decide if a file is encoded in UTF-8. Like Corion, I wonder why you wouldn't know that already. If that is what you are trying to do, you might like to take a look also at Test::utf8.

    Hope this helps!


    The way forward always starts with a minimal test.

      Hello @1nickt,

      First of all, thanks.
      As for your welcome to the monastery: I have been here for many years. But thanks nonetheless.

      Before I forget, a remark on something you mentioned near the end: Test::utf8

      <cite>This module is a collection of tests useful for dealing with utf8 strings in Perl.</cite>

      I read: "dealing with utf8 strings"
      This is not what I want. I want to deal with "non-utf8" strings. They should become utf8 strings, but I don't want to deal with it. I guess that's quite different.

      As an overall remark up front: I don't understand everything you say; I am also not a native English speaker.
      So I will try to answer the points you raise as far as I can. Something you write is irritating for me. Besides, I think you are making things bigger than they are. They are small.

      To answer your points:

      • "I have an issue with Encode::Guess" -- you do not say what your issue is.
        my issue - at this time - is to guess what encoding the input file is.
      • You do not show your expected output.
        I beg your pardon.
        My expected output is: encoding is: XYZ (with XYZ replaced by the correct encoding)
      • You do not show your actual output.
        2utf8.pl
        IS ARGV[0]: Quell-DCM-new.dcm
        862--
        length(data)>0
        Unknown encoding: $encodings_test at C:/EC-Apps/strawberry/perl/lib/Encode/Guess.pm line 119.
      • You posted a great deal of irrelevant code, which makes it hard to see what's going on.
        sorry for that

      I've had a look at SSCCE.

      <cite>An even simpler SSCCE</cite>
      I don't know what you want to tell me with this. Actually I am not able to see a relation to my case. If there is one, then sorry.

      <cite>Your code looks as though you first spent time</cite>
      I'm aware that different people have different strategies for software development. From my programming experience since 1986, I believe it is a good idea to solve problems as they emerge: everything that is solved when it comes up will not be a problem in the future.

      <cite>Your script, judging by its name, appears actually to be trying to decide if a file is encoded in UTF-8.</cite>
      No. In the future my script should convert several hundred files from any encoding to UTF-8. Because the files are the output of at least 5 different tools, there is no way to decide by hand what encoding they are in. It has to be done by guessing.
      But in this first step I only want to guess what encoding they are in. And for this I thought it would be a good idea to use a standard package. Maybe I've chosen the wrong one.

      There's no more to say. Regards.

        Hello again!

        Something you write is irritating for me.

        I apologize. I did not mean to seem condescending. I think I shared some valuable tips that will help you get better help more quickly.

        <cite>An even simpler SSCCE</cite>
        I don't know what you want to tell me with this. Actually I am not able to see a relation to my case. If there is one, then sorry.

        The bug in your code was that you were misusing qw. I showed the simplest test that would prove that. (Also, note that the error you were originally getting was quite explicit, quoting the literal string "$encodings_test" as what you passed to the function.)

        my issue - at this time - is to guess what encoding the input file is.

        I'm sorry -- and I do not think this is a question of language -- but that is not your issue. That is your objective. Your issue is (was) the thing that was causing your current code to fail. Now, since you did not know what that was, you could not state it. But you could have stated the output that you did not expect from your program.

        This is not what I want. I want to deal with "non-utf8" strings. They should become utf8 strings, but I don't want to deal with it. I guess that's quite different.

        Very true. I mentioned Test::utf8 because it contains functions for trying to verify that what you think is encoded in UTF-8 really is. If you encode something to UTF-8 based on the guessed encoding of the source, you may want to check the result. But again I have to ask -- is it not possible for you to know the encoding of the data you are working with?
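
        A rough sketch of that convert-then-check idea, using only Encode and Encode::Guess rather than Test::utf8. The helper names to_utf8 and is_valid_utf8 and the sample octets are made up for illustration, and the approach is an assumption about how such a check could look, not a tested recipe for the files in question.

        ```perl
        use strict;
        use warnings;
        use Encode qw(decode encode);
        use Encode::Guess;

        # to_utf8 and is_valid_utf8 are hypothetical helpers.

        # Guess the source encoding among the default suspects plus @suspects,
        # then re-encode as UTF-8. Returns undef when the guess fails or is
        # ambiguous (guess_encoding then returns an error string, not an object).
        sub to_utf8 {
            my ($octets, @suspects) = @_;
            my $decoder = guess_encoding($octets, @suspects);
            return unless ref $decoder;
            my $text = $decoder->decode($octets);    # octets -> Perl characters
            return encode('UTF-8', $text);           # characters -> UTF-8 octets
        }

        # Strict decoding with FB_CROAK dies on malformed UTF-8; eval turns
        # that into a yes/no answer. LEAVE_SRC keeps $octets unmodified.
        sub is_valid_utf8 {
            my ($octets) = @_;
            my $ok = eval {
                decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
                1;
            };
            return $ok ? 1 : 0;
        }

        # Made-up sample: the word "abhängig" as ISO-8859-1 octets.
        my $utf8 = to_utf8("abh\xE4ngig\n", 'iso-8859-1');
        if (defined $utf8) {
            my $verdict = is_valid_utf8($utf8) ? "valid" : "not valid";
            print "converted result is $verdict UTF-8\n";
        }
        ```

        Passing several single-byte encodings at once is likely to make the guess ambiguous, because encodings such as iso-8859-1 and cp437 both accept a byte like 0xE4; in that case to_utf8 returns undef instead of picking one.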

        Hope this helps!

