PerlMonks  

High-bit ISO Latin character conversion problem.

by true (Pilgrim)
on Sep 06, 2003 at 17:16 UTC ( [id://289484]=perlquestion )

true has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to convert these slanted quotes to the numeric code for HTML output. I've gotten some great advice from jeffa and bart, but my lowly perl mind is still boggled by some of their one-lining magick. I was going to use HTML::Entities, but I need this to run on Win32 also, and ppm fails on install. I looked at the source for HTML::Entities as a guide, but the more advanced perl was clouding my understanding. I've been told my solution involves something like:
my %subst = map { chr($_) => qq|&#$_;|} 0..255;
But any attempts to put this into effect inside a script are dead ends. Any help would be appreciated.
my $TXT = <<EOM;
Those oranges are 50¢
EOM
print &convert($TXT);
sub convert{
    my $word = $_[0];
    #ARGH!
    return $word;
}
###########################
So my output should be
Those oranges are 50&#162;
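One way the `convert` stub could be filled in with the `%subst` table from the question is sketched below. It assumes the input is a string of single bytes in ISO Latin-1 (not UTF-8), and it only touches bytes 128..255; the regex range and the `\xA2` escape for the cent sign are choices made for this illustration, not something given in the post.

```perl
use strict;
use warnings;

# Lookup table: every byte value 0..255 maps to its numeric character reference.
my %subst = map { chr($_) => "&#$_;" } 0 .. 255;

# Replace each high-bit byte (128..255) with its numeric reference.
# Assumption: $text holds single-byte ISO Latin-1 data, not UTF-8.
sub convert {
    my ($text) = @_;
    $text =~ s/([\x80-\xFF])/$subst{$1}/g;
    return $text;
}

# "\xA2" is the cent sign in ISO Latin-1; it is written as an escape so the
# example does not depend on the encoding this file is saved in.
print convert("Those oranges are 50\xA2\n");
```

Plain ASCII text passes through unchanged, so the sub can be applied to whole lines or fields.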

Replies are listed 'Best First'.
Re: High-bit ISO Latin character conversion problem.
by bart (Canon) on Sep 06, 2003 at 19:18 UTC
    Let's finish what we started, shall we? Let's begin with what you got, add some more specific entities, and finally build a converter with it. You got all those elements already via the Chatterbox, but perhaps a few details got lost. The conversion table for the Windows character set comes from this file: note that it only differs from ISO-Latin-1/Unicode in the range 128-159.
    # preparation
    my %subst = (
        map({ chr($_) => "&#$_;" } 0 .. 255),
        # a few special ones
        '<' => '&lt;',
        '>' => '&gt;',
        '&' => '&amp;',
        '"' => '&quot;',
        # Windows specific
        map({ chr($_->[0]) => "&#$_->[1];" }
            [0x80 => 0x20AC], [0x82 => 0x201A], [0x83 => 0x0192],
            [0x84 => 0x201E], [0x85 => 0x2026], [0x86 => 0x2020],
            [0x87 => 0x2021], [0x88 => 0x02C6], [0x89 => 0x2030],
            [0x8A => 0x0160], [0x8B => 0x2039], [0x8C => 0x0152],
            [0x8E => 0x017D], [0x91 => 0x2018], [0x92 => 0x2019],
            [0x93 => 0x201C], [0x94 => 0x201D], [0x95 => 0x2022],
            [0x96 => 0x2013], [0x97 => 0x2014], [0x98 => 0x02DC],
            [0x99 => 0x2122], [0x9A => 0x0161], [0x9B => 0x203A],
            [0x9C => 0x0153], [0x9E => 0x017E], [0x9F => 0x0178]));

    # sample string
    $_ = "maître d'hôtel";

    # for the substitution, for each string, do:
    s/([&<>'"\177-\377])/$subst{$1}/g;
    print;
    Result:
    ma&#238;tre d&#39;h&#244;tel

    n.b. This code was developed for perl 5.005, i.e. before perl had built-in Unicode support.

    And of course I tested it with Windows-specific characters, like "€".
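For perls that do have built-in Unicode support (5.8 and later), a different route to the same output is sketched here. It swaps the hand-built table for the core Encode module: the bytes are decoded as `cp1252`, and every character outside printable ASCII becomes a numeric reference via `ord`. The sub name `cp1252_to_entities` is made up for this illustration, and unlike the table above this sketch does not escape `&`, `<`, `>` or quotes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode Windows-1252 bytes into Perl characters, then replace everything
# outside 0x00..0x7E with a decimal numeric character reference.
# Assumption: the input really is Windows-1252 (cp1252) encoded.
sub cp1252_to_entities {
    my ($bytes) = @_;
    my $chars = decode('cp1252', $bytes);
    $chars =~ s/([^\x00-\x7E])/'&#' . ord($1) . ';'/ge;
    return $chars;
}

# 0x80 is the euro sign in Windows-1252 (U+20AC); 0xEE is i-circumflex.
print cp1252_to_entities("50\x80, ma\xEEtre\n");
```

Because the decoding step maps byte 0x80 to U+20AC, no separate Windows-specific table is needed in this variant.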

      That broke the chameau's back - I'm starting a PM snips file! "Subroutines, Snips, Clues and just plain Wow"

      I've seen the technique "initialize entire range of values for hash and then selectively replace special instances as needed" before, but this beautifully emphasizes the subject data, and reads naturally in order of increasing 'specialization'.   Oh, yeah, and the code's useful, too.   (ğ)
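The "fill the whole range, then override" technique relies on list assignment to a hash: when the same key appears more than once, the last pair wins. A cut-down sketch with only two overrides (chosen here for illustration) shows the effect:

```perl
use strict;
use warnings;

my %subst = (
    (map { chr($_) => "&#$_;" } 0 .. 255),    # default for every byte value
    '&'    => '&amp;',                         # replaces the default &#38;
    "\x80" => '&#8364;',                       # replaces the default &#128;
);

# An overridden ASCII key, an overridden Windows-specific key,
# and an untouched default (0xE9).
printf "%s %s %s\n", $subst{'&'}, $subst{"\x80"}, $subst{"\xE9"};
```

The hash still has exactly 256 keys, since the overrides reuse keys that the default range already created.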

Re: High-bit ISO Latin character conversion problem.
by true (Pilgrim) on Sep 06, 2003 at 17:42 UTC
    OK, so the light is spreading, and my question needs refinement. I was hoping to make it a simple question for the readers out there:

    FileMaker Pro exports a tab-delimited file with quoted fields from my client's database. The columns use all sorts of foreign characters (accented e's, u's, leaning single and double quotes, cent signs, etc.). I want to convert the leaning quotes to regular quotes (after I split the file correctly by columns and rows). The cent sign example in my first post does not show the slanting quotes problem.
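A minimal sketch of that split-then-straighten step follows. It assumes the export uses Windows-1252 bytes, where the leaning quotes are 0x91/0x92 (single) and 0x93/0x94 (double); an export in another encoding (a Mac character set, for instance) would need different byte values. The sub names `straighten_quotes` and `parse_row` are made up for this example, and fields containing embedded tabs are not handled.

```perl
use strict;
use warnings;

# Turn Windows-1252 curly quotes into plain ASCII quotes.
# Assumption: the data is Windows-1252, so 0x91/0x92 and 0x93/0x94 are the quotes.
sub straighten_quotes {
    my ($field) = @_;
    $field =~ tr/\x91\x92\x93\x94/''""/;
    return $field;
}

# Split one tab-delimited row, strip one pair of enclosing double quotes
# from each field, then straighten any curly quotes inside it.
sub parse_row {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    for my $f (@fields) {
        $f =~ s/\A"(.*)"\z/$1/s;
        $f = straighten_quotes($f);
    }
    return @fields;
}

my @row = parse_row(qq{"\x93Hello\x94"\t"it\x92s"\n});
print join('|', @row), "\n";
```

`tr///` works character by character and involves no regex metacharacters, so none of the quote characters needs escaping.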

    My previous solution involved a flat text file with all the special characters. When I compared my text to a slanted single quote (backtick?) I got a regex error. I tried slashing it and q'ing it a couple of different ways, to no avail: I still get the regex error.
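The failing pattern is not shown, so the cause can only be guessed at. One common source of regex errors when characters are read from a file and interpolated into a pattern is that some of them, such as `(` or `[`, are regex metacharacters. Wrapping the interpolated value in `\Q...\E` makes it match literally. The sub name `count_literal` and the sample characters below are assumptions for this sketch:

```perl
use strict;
use warnings;

# Count literal occurrences of $needle in $text.
# \Q...\E quotes any regex metacharacters inside $needle.
sub count_literal {
    my ($text, $needle) = @_;
    my $count = () = $text =~ /\Q$needle\E/g;
    return $count;
}

my $text = "a \x91quoted\x92 word (maybe)";

# Without \Q...\E, the "(" entry would fail with an "Unmatched (" regex error.
for my $needle ("\x91", "\x92", '(', '`') {
    printf "0x%02X occurs %d time(s)\n", ord($needle), count_literal($text, $needle);
}
```

For a plain equality test on a single character, `eq` avoids the regex engine altogether.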

Re: High-bit ISO Latin character conversion problem.
by true (Pilgrim) on Sep 06, 2003 at 18:31 UTC
    Finally got sick of trying stuff. I downloaded the source for HTML::Entities from my Linux machine, stuck the Entities.pm file in the same directory as my perl script on my Win32 machine, and did this:
    #!/usr/bin/perl
    use HTML::Entities;
    my $input = <<EOM;
    “And all was well with the ‘ol’ world!” drüben, Straße
    EOM
    print encode_entities($input);
    exit;

Node Type: perlquestion [id://289484]
Approved by herveus