in reply to Re^6: Getting mad with CGI::Application and utf8
in thread Getting mad with CGI::Application and utf8

Let me repeat the question: I get some data from a foreign Perl module (let's say a file parser), and that module doesn't document what it returns.

But other parts of the code have to deal with text strings (for example because they query unicode properties).

What should I do? I only need to know that once, at write/debug time.

My current approach is to try to get some data with high codepoints (outside the Latin-1 range) out of the foreign module, and check with Devel::Peek or utf8::is_utf8 whether that stupid flag is set.
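A minimal sketch of that write-time check, for illustration only. The two `parse_as_*` subs are hypothetical stand-ins for the undocumented foreign module, and `describe` is an invented helper. Besides the flag, it reports `length` and `ord`, which can be compared against what the input is known to contain.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for an undocumented foreign parser: one returns
# decoded text, the other returns the same character as raw UTF-8 bytes.
sub parse_as_text  { return "\x{263A}" }       # one character, U+263A
sub parse_as_bytes { return "\xE2\x98\xBA" }   # its three UTF-8 bytes

# Report what can be observed about a value at debug time.
sub describe {
    my ($value) = @_;
    return sprintf "flag=%s length=%d first=U+%04X",
        (utf8::is_utf8($value) ? "on" : "off"), length($value), ord($value);
}

print describe(parse_as_text()),  "\n";
print describe(parse_as_bytes()), "\n";
```

If the test input is known to hold a single character above U+00FF, a length of 1 suggests decoded text and a length of 3 suggests UTF-8 bytes. This is a debugging aid for one known input, not something to branch on at run time.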

Is there a better, more reliable approach? And is that really an abuse?

Re^8: Getting mad with CGI::Application and utf8
by Juerd (Abbot) on Feb 27, 2008 at 11:09 UTC

    Let me repeat the question

    Do you expect me to repeat the answer too? :)

    If your subroutine or module specifically only handles binary strings, I'd recommend documenting it as such, and downgrading the string that you receive:

    my $copy = $foo;
    utf8::downgrade($copy, 1) or do { utf8::encode($copy); carp "Wide character in operation" };
    That's more or less what Perl does in its binary operators, like print.
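The downgrade-or-encode idea can be sketched as a small helper. The name `as_bytes` is hypothetical. The sketch relies on the documented behaviour of the core `utf8` functions: `utf8::downgrade` returns false instead of dying when its second argument (FAIL_OK) is true, and `utf8::encode` returns nothing, so it is called as a separate statement.

```perl
use strict;
use warnings;
use Carp qw(carp);

# Hypothetical helper: return a byte-string copy of the argument.
sub as_bytes {
    my ($string) = @_;    # a copy, so the caller's string is left untouched
    # A true second argument makes utf8::downgrade return false rather than die.
    unless (utf8::downgrade($string, 1)) {
        carp "Wide character in operation";
        utf8::encode($string);    # fall back to the UTF-8 encoded bytes
    }
    return $string;
}

my $latin1 = "caf\x{e9}";
utf8::upgrade($latin1);    # flag on, but every character still fits in a byte

print length(as_bytes($latin1)), "\n";       # downgrades cleanly, no warning
print length(as_bytes("\x{263A}")), "\n";    # warns, then yields UTF-8 bytes
```

The first call returns 4 bytes; the second warns and returns the 3 UTF-8 bytes of U+263A.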

    Whatever you do, though, never assume that the absence of the flag means it's not a text string!
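A small illustration of why the flag says nothing about whether a string is text. The variable names are made up for this sketch. The same four-character text string is held once without the flag and once with it, and the two compare equal.

```perl
use strict;
use warnings;

my $plain = "caf\x{e9}";     # Latin-1 range only, so Perl keeps the flag off
my $upgraded = $plain;
utf8::upgrade($upgraded);    # same characters, different internal storage

printf "plain:    flag=%s length=%d\n",
    (utf8::is_utf8($plain) ? "on" : "off"), length $plain;
printf "upgraded: flag=%s length=%d\n",
    (utf8::is_utf8($upgraded) ? "on" : "off"), length $upgraded;
print "equal as strings: ", ($plain eq $upgraded ? "yes" : "no"), "\n";
```

Both strings are four characters long and equal under `eq`; only the internal representation differs.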

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      Do you expect me to repeat the answer too?

      No. Your answer ("you can't") is probably right, but not very helpful for the situation I described. When I need to solve a problem, and I can't get the 100% complete solution, I try to approximate.

      So I suggested an approximation, and asked if there's a better way. So, is there one? Or is it already as close as I can get, without having to read a foreign module, possibly tracing strings manually through thousands of lines of code?

        Approximations, also known as heuristics, are bound to go wrong at some point. Making decisions (other than downgrading) based on the utf8 flag is bad, and there is no "better" way to do the impossible.

        In any case, document the problem. If at all possible, try to get the data source and/or its documentation to communicate the type to you.