jai_dgl has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm using HTML::TreeBuilder to convert an HTML source into an HTML tree, but for some reason the special characters in the source are converted into other codes.
Example:
Greek: Αφροδίτη της Μήλου, Aphroditē tēs Mēlou
is converted to
Greek: Αφροδίτη της Μήλου, Aphroditē tēs Mēlou
I need the same special characters to be retained.
Please help me out.

Thanks
Jey

Re: HTML::Treebuilder Special characters
by Sewi (Friar) on Sep 08, 2009 at 14:27 UTC
    Looks like you ran into a UTF-8 (character encoding) problem.

    Are the chars HTML-encoded or are they written as plain chars?
    HTML-encoded: &uuml;
    Plain: ü

    You should also check the charset setting of your HTML page.
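As a rough illustration of why the charset setting matters, here is a small sketch using only the core Encode module. The byte string and variable names are made up for the example. The same bytes turn into different characters depending on the charset used to decode them, so the parser should be given text decoded with the page's actual charset.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the UTF-8 bytes for "ü" (U+00FC), as read from a file or socket.
my $bytes = "\xC3\xBC";

# Decoded with the right charset, the two bytes become a single character.
my $as_utf8 = decode('UTF-8', $bytes);

# Decoded with the wrong charset, each byte becomes its own character.
my $as_latin1 = decode('ISO-8859-1', $bytes);

printf "UTF-8:      %d char(s), first is U+%04X\n", length($as_utf8),   ord($as_utf8);
printf "ISO-8859-1: %d char(s), first is U+%04X\n", length($as_latin1), ord($as_latin1);
```

If the second result looks like what you are seeing, the input was probably decoded with the wrong charset (or not decoded at all) before it reached the parser.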

      I get the proper content from the page, with the same look and feel,
      plain text:
      but when the HTML content is parsed using HTML::TreeBuilder, the
      plain text is converted into HTML codes.

      Thanks
      Jey
        This function helped me solve the issue:
        sub encode_entities_decimal {
            my $text = shift;
            # replace each non-ASCII character with its decimal numeric entity
            $text =~ s{([^\0-\x7f])}{sprintf("&#%d;",ord($1))}ge;
            $text;
        }
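A minimal usage sketch for the function above, assuming the input has already been decoded into a Perl character string. The sample text and variable names are only for illustration. On undecoded UTF-8 bytes, ord() would see the individual bytes and the resulting entities would be wrong.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Replace every non-ASCII character with a decimal numeric entity.
sub encode_entities_decimal {
    my $text = shift;
    $text =~ s{([^\0-\x7f])}{sprintf("&#%d;", ord($1))}ge;
    return $text;
}

# Hypothetical sample: the UTF-8 bytes for "Mēlou", decoded to characters first.
my $chars = decode('UTF-8', "M\xC4\x93lou");

# The "ē" (U+0113) is replaced by its decimal entity; ASCII is left alone.
my $encoded = encode_entities_decimal($chars);
print "$encoded\n";
```

If the aim is instead to keep the literal characters in the serialized tree, HTML::Element's as_HTML accepts an optional first argument naming the characters to encode. Passing '<>&' there should leave non-ASCII characters untouched, though that is untested against the setup described in this thread.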