robinbowes has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm stumped by this problem - I hope someone can help me.

I'm replacing "bad" characters in HTML files.

An example of the sort of characters that I'm replacing is:

Ÿ (Not sure how that will come out - it shows as a square box in my source file but seems to get converted to Ÿ here)

I've got it in my code as:

from => qq{Ÿ}, to => q{Ÿ},
However, I'd really like to be able to produce the "from" string programmatically, i.e. represent it using pure ASCII characters.

I've hex-dumped the source file and the thing between the {} appears as:

000018c0  20 20 46 72 6f 6d 20 20 3d 3e 20 71 71 7b c2 9f  |  From  => qq{..|
i.e. the character is represented by the c2 9f hex digits.

If I do a test script, and do something like:

binmode(STDOUT, ":utf8");
my $teststring = qq{Ÿ};
print "$teststring\n";
And pipe the output to hexdump, I see that $teststring comes out as 4 bytes:

c3 82 c2 9f
So, how do I produce a data structure that holds exactly the same content as "from => qq{Ÿ}" without having the binary character in my source code?

Thanks for any help. R.

--

Robin Bowes | http://robinbowes.com

Replies are listed 'Best First'.
Re: Representing "binary" character in code?
by graff (Chancellor) on Nov 05, 2006 at 05:57 UTC
    I think this is rephrasing GrandFather's question, but... What do you want as a replacement for the "bad" characters?

    When you say you "hex-dumped the source file" (and saw "c2 9f"), were you talking about the original html file (and "c2 9f" was/were the "bad" characters)? Or were you talking about your perl script? If you were talking about your perl script (which is what I'm guessing), then what do the "bad" characters in the html file look like when you hex dump that?

    Let's suppose the html file has a literal "0x9f" character ("capital letter Y with diaeresis" in the Windows CP1252 encoding). Let's also suppose that you actually want this converted to the utf8 encoding for this letter:

    use Encode;
    # ... read the html file into $html, and then:
    from_to( $html, "cp1252", "utf8" );
    # now $html contains utf8 data instead of cp1252 data
    And another way to do that, without using Encode:
    open( HTML, "<:encoding(cp1252)", $filename );
    # now text will be converted from cp1252 to utf8
    # as it is read from the file.
    If you are using a utf8 text editor to create your scripts, and you try to put literal wide characters within quoted strings in your script, you'll want to say "use utf8;" next to "use strict;", so that the perl interpreter will know that the script itself contains utf8 wide characters. That way, as your quoted strings are assigned to variables, those variables will have their "utf8 flag" set. This is important when you set an output file handle to utf8 mode: scalars with the utf8 flag will be output correctly as utf8 data.
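    A minimal sketch of that point, assuming the script file itself is saved as utf8:

```perl
use strict;
use warnings;
use utf8;                            # the source file itself is utf8-encoded
binmode STDOUT, ":encoding(UTF-8)";  # so wide characters print correctly

my $s = "Ÿ";                         # one wide character, utf8 flag set
printf "length=%d ord=%#x\n", length($s), ord($s);   # length=1 ord=0x178
```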

    If a scalar contains some bytes with the 8th bit set, but the utf8 flag is not set, printing the string to a utf8-mode file will cause those bytes to be interpreted as "Latin-1" single-byte characters, and they will be "promoted" to utf8 wide characters -- e.g. 0x9f becomes the two-byte sequence "c2 9f"; another example: the two-byte sequence "c2 9f" becomes the four-byte sequence "c3 82 c2 9f". (Look at perldoc perlunicode, and find the section titled "Unicode Encodings" to see the reasoning behind that).
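    That promotion can be reproduced with Encode; this sketch assumes a string holding the two raw bytes c2 9f with no utf8 flag:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two raw bytes with the utf8 flag NOT set; perl treats them as
# the Latin-1 characters U+00C2 and U+009F:
my $bytes = "\xc2\x9f";

# Sending them through a utf8 encode upgrades each byte, which is
# exactly the double-encoding described above:
my $upgraded = encode("UTF-8", $bytes);
printf "%v02x\n", $upgraded;         # c3.82.c2.9f
```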

      Hi all.

      Thanks for the replies.

      I guess the problem I have is understanding exactly what it is I'm trying to search for and replace.

      The context of this is not quite as simple as parsing HTML files. The text replacement is actually done by the Apache module mod_publisher; my perl script is a mod_rewrite RewriteMap script which creates the mod_publisher replacement rules dynamically.

      The problem I'm trying to solve is when web pages contain illegal MS Windows characters - I need to replace them with valid HTML chars. The "bad" chars work OK in browsers if the correct font is installed, but when proxying everything through mod_publisher, all content gets converted to utf8 format and the chars don't show up right in the proxied page.

      The only way I've got this to work so far is by pasting in the binary character I want to replace.

      I guess I'll just leave things as they are for now. Thanks, R.

      --

      Robin Bowes | http://robinbowes.com

        OK, I've had another stab at this.

        I used wget to pull down the proxied page that shows the problem.

        Looking at the html file with less, I see this where the "bad" char is:

        | <a href="/2/about.html">About Us</a> | <a href="/2/service.html">We <U+0092>re About Service</a>
        If I hexdump the file, the same fragment looks like this:

        00002110  2f 32 2f 73 65 72 76 69 63 65 2e 68 74 6d 6c 22  |/2/service.html"|
        00002120  3e 57 65 c2 92 72 65 20 0d 0a 20 20 20 20 20 20  |>We..re ..      |
        00002130  20 20 41 62 6f 75 74 20 53 65 72 76 69 63 65 3c  |  About Service<|
        Now, if I copy the <U+0092> binary character directly from my browser and paste it into the perl script, the search and replace works - it finds that character and I can replace it with the correct HTML entity - &#8217; in this case.

        Here's some more code. This is how I am generating the rules, and using the utf8 character:

        my @rules = (
            {
                From  => qq{<92>},
                To    => q{&#8217;},
                Flags => q{he},
            },
        );
        In the example above, <92> is a binary character that shows up in vim as "<92>". If I use less to display the file, it shows up as <U+0092>. What I have so far failed to do is to create that binary character programmatically, i.e. using \x{} escapes, or pack(...), or any other technique.

        It seems that if I use the utf8 character directly, perl does the right thing when I print it, but when I try to create the character indirectly I never quite get it right.
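        For what it's worth, one way to reproduce the pasted character without pasting it, assuming the script is not being read as utf8 (no "use utf8"), is a plain byte string whose \x escapes match the bytes seen in the hexdump:

```perl
use strict;
use warnings;

# The hexdump shows the character arrives as the raw bytes c2 92,
# so a byte string with \x escapes reproduces the pasted character
# exactly (this assumes the script is NOT under "use utf8"):
my @rules = (
    { From => "\xc2\x92", To => q{&#8217;}, Flags => q{he} },
);

# Check it against the fragment seen in the hexdump:
my $fragment = "We\xc2\x92re";
$fragment =~ s/\Q$rules[0]{From}\E/$rules[0]{To}/;
print "$fragment\n";                 # We&#8217;re
```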

        R.

        --

        Robin Bowes | http://robinbowes.com

Re: Representing "binary" character in code?
by ysth (Canon) on Nov 05, 2006 at 05:35 UTC
    From perlop:

    The following escape sequences are available in constructs that interpolate and in transliterations.

    ...
    \033       octal char     (ESC)
    \x1b       hex char       (ESC)
    \x{263a}   wide hex char  (SMILEY)
    \c[        control char   (ESC)
    \N{name}   named Unicode character

    So
    from => "\x9f", to => "\x{178}"
    may be what you want. Though it looks like you have some encoding issues to think about too.
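    A small sketch of those escapes in a substitution, using the same code points as above:

```perl
use strict;
use warnings;

# The single cp1252 byte 0x9f is replaced by the code point
# U+0178 (LATIN CAPITAL LETTER Y WITH DIAERESIS):
my $text = "A \x9f B";
$text =~ s/\x9f/\x{178}/g;

# $text now holds the wide character instead of the raw byte.
```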
Re: Representing "binary" character in code?
by GrandFather (Saint) on Nov 05, 2006 at 04:40 UTC

    How would you prefer to represent the character in your code? As the entity? As a numeric code point (the value in the entity representation)? In some other form?

    You may find HTML::Entities of some help.


    DWIM is Perl's answer to Gödel
Re: Representing "binary" character in code?
by Errto (Vicar) on Nov 05, 2006 at 05:07 UTC
    If you want to use UTF-8 literally in your code, then you need to include a use utf8 declaration in your source file, preferably near the top. You can also use the built-in chr function to produce a character with a certain numeric code point.
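    For example, these are three equivalent ways to produce the same character (the \N{} form needs "use charnames" on older perls):

```perl
use strict;
use warnings;
use charnames ':full';   # needed for \N{...} on perls before 5.16

# chr() builds a character from its numeric code point, so these
# three spellings all produce the same single character:
my $by_number = chr(0x178);                                    # from a number
my $by_escape = "\x{178}";                                     # hex escape
my $by_name   = "\N{LATIN CAPITAL LETTER Y WITH DIAERESIS}";   # by name

print $by_number eq $by_escape && $by_escape eq $by_name
    ? "all equal\n" : "differ\n";    # all equal
```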