
But how do I know if a given variable stores a unicode or ascii string?

It contains what you put in it. What did you put in it?


You have strings of (Unicode) characters and strings of bytes.

If the string contains chr(0x2660), it's obviously not a string of bytes. If the string contains chr(0x41), it could be anything: ASCII 'A', the number 65, or something completely different.

If you pass a string with chr(0x41) in it to a function, you won't learn much from the result. Instead, pass a string containing something that can't be a byte, such as chr(0x2660). If it works, you know the function expects characters.
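
A minimal sketch of that distinction, not anything from the code discussed earlier in the thread. The helper name cannot_be_bytes is hypothetical; it only checks whether the string holds a code point above 0xFF, which is the one case where the contents alone settle the question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if the string holds any code point above 0xFF.
# Such a string cannot be a string of bytes. A false result proves nothing:
# the string could still be either bytes or characters.
sub cannot_be_bytes {
    my ($s) = @_;
    return $s =~ /[^\x00-\xFF]/ ? 1 : 0;
}

my $spade  = chr(0x2660);   # BLACK SPADE SUIT, definitely a character
my $letter = chr(0x41);     # could be ASCII 'A', the number 65, ...

printf "%d\n", cannot_be_bytes($spade);    # prints 1
printf "%d\n", cannot_be_bytes($letter);   # prints 0

# Encoding the character string yields a string of bytes (3 bytes in UTF-8).
my $bytes = encode('UTF-8', $spade);
printf "%d\n", length($bytes);             # prints 3
printf "%d\n", cannot_be_bytes($bytes);    # prints 0
```

Note that the last call returns 0 even though $bytes is the encoded form of a character: once encoded, nothing in the string itself says what it used to be.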

Re^4: Problem displaying unicode for certain websites
by Anonymous Monk on Dec 12, 2009 at 11:14 UTC
    Thanks ikegami! Your code gives me -
    Oj, f\x{00e5}r vi ingen mat?!
    instead of -
    Oj, får vi ingen mat?!
    How come?
      I updated all my modules and it seems the line -
      $decoded_text = decode_entities($decoded_text);
      isn't doing anything. I'm getting the following output for "www.expressen.se" with your code -
      Spela Uno! Det klassiska kortspelet i digital form. Redo att byta jobb? H\x{00e4}r kan du s\x{00f6}ka bland m\x{00e4}ngder av annonser. L\x{00f6}rdag 12 december 2009 Tipsa Expressen
        I get
                Spela Scrabble!
            
                Bilda så bra ord som 
        
        möjligt med brickorna.
            
        
        
            
            
                Lördag 12 december 2009
                
                             Tipsa Expressen
        

        It looks like your locale isn't being detected correctly. Try calling binmode on STDOUT instead of relying on use open.
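
A rough sketch of that suggestion, assuming the terminal actually expects UTF-8 (that is an assumption; substitute whatever encoding your terminal uses). Output such as f\x{00e5}r is what one would typically expect when the output layer's encoding cannot represent the character, which would be consistent with a misdetected locale.

```perl
use strict;
use warnings;

# Assumption: the terminal expects UTF-8. This sets the output encoding
# explicitly instead of deriving it from the locale.
binmode(STDOUT, ':encoding(UTF-8)');

# A string of characters; \x{00e5} is LATIN SMALL LETTER A WITH RING ABOVE.
my $text = "Oj, f\x{00e5}r vi ingen mat?!";

# The layer encodes the characters to bytes on the way out.
print "$text\n";
```

If this prints the text correctly while the locale-based version does not, the locale settings (for example LANG or LC_ALL) are the likely place to look.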