wfsp has asked for the wisdom of the Perl Monks concerning the following question:

I have a Word doc with the following line

“a”

Method 1
Using Win32::OLE to extract the line
my $encoded = encode_entities($line);
print "$encoded\n";
Outputs
“a”
Method 2
Using Win32::OLE to first save the document as UTF-8 text, then reading that file back in
my $text;
{
    # Read the first line of the UTF-8 file as decoded characters
    open my $fh, '<:utf8', 'utf8.txt' or die "Cannot open utf8.txt: $!";
    $text = <$fh>;
    close $fh;
}
my $encoded = encode_entities($text);
print "$encoded\n";
Produces:
&ldquo;a&rdquo;
I don't believe the output of method 1 conforms to HTML 4.01, but I believe the output of method 2 does.
I intend to use method 2 but is there a better way to do it?
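
One possible alternative that skips the intermediate file, offered as a minimal sketch rather than a tested Win32::OLE recipe: it assumes the line comes back from Word as Windows-1252 bytes (the sample bytes and the helper name entities_from_cp1252 are made up for illustration), decodes them to characters with Encode, and only then calls encode_entities.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Hypothetical helper: turn Windows-1252 bytes into HTML-entity-encoded text.
# Assumption: the extracted line is cp1252-encoded bytes, not yet decoded.
sub entities_from_cp1252 {
    my ($bytes) = @_;
    my $chars = decode('cp1252', $bytes);   # bytes -> Perl characters
    return encode_entities($chars);         # characters -> named entities
}

# Illustrative input: cp1252 curly quotes (0x93, 0x94) around the letter "a".
print entities_from_cp1252("\x93a\x94"), "\n";   # prints &ldquo;a&rdquo;
```

The point of the sketch is that encode_entities needs to see U+201C and U+201D as characters to pick the named entities, which is what reading the file through the :utf8 layer achieves in method 2.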

Update:

For what it's worth the text extracted by method 2 also displays correctly in a tk text widget. :-)
Update 2: Corrected the HTML spec.

Re: Extracting MS Word text and encoding HTML entities
by Joost (Canon) on Mar 13, 2005 at 14:17 UTC
Re: Extracting MS Word text and encoding HTML entities
by PodMaster (Abbot) on Mar 13, 2005 at 11:18 UTC
    I intend to use method 2 but is there a better way to do it?
    The best way is not to do it. You can read about HTML4.1 at the w3 website.
