Extracting MS Word text and encoding HTML entities

wfsp has asked for the wisdom of the Perl Monks concerning the following question:

I have a Word doc with the following line:

“a”

Method 1
Using Win32::OLE to extract the line

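use HTML::Entities;  # exports encode_entities

# $line holds the text pulled out of the document via Win32::OLE
my $encoded = encode_entities($line);
print "$encoded\n";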

Outputs

&#147;a&#148;

Method 2
Using Win32::OLE to first save as utf8


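use HTML::Entities;  # exports encode_entities

# Read the first line of the UTF-8 file that Word saved
my $text;
{
  open my $fh, '<:encoding(UTF-8)', 'utf8.txt'
    or die "Cannot open utf8.txt: $!";
  $text = <$fh>;
  close $fh;
}

my $encoded = encode_entities($text);
print "$encoded\n";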

Produces:

&ldquo;a&rdquo;

I don't believe the output of method 1 conforms to HTML 4.01, but I believe the output of method 2 does.
I intend to use method 2, but is there a better way to do it?

Update:

For what it's worth the text extracted by method 2 also displays correctly in a tk text widget. :-)
Update 2: Corrected the HTML spec.
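
One possible alternative to the intermediate file, as a minimal sketch only: the `&#147;` in the method 1 output suggests that the extracted string holds the Windows-1252 bytes 0x93 and 0x94, which is an assumption here and not something confirmed above. If that holds, decoding the string with the core Encode module before calling encode_entities should give the named entities directly. The hard-coded `$line` is a hypothetical stand-in for the value returned by Win32::OLE.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Hypothetical stand-in for the string extracted in method 1.
# Assumption: it contains Windows-1252 bytes, where 0x93 and 0x94
# are the left and right curly double quotes.
my $line = "\x93a\x94";

# Turn the cp1252 bytes into Perl characters (U+201C and U+201D).
my $chars = decode('cp1252', $line);

# encode_entities can now map the characters to named entities.
my $encoded = encode_entities($chars);
print "$encoded\n";
```

If the assumption about the code page is wrong for a given machine, the decode step would need the code page that Word actually used.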


Re: Extracting MS Word text and encoding HTML entities
by Joost (Canon) on Mar 13, 2005 at 14:17 UTC

See the UTF-8 and Unicode FAQ.



Re: Extracting MS Word text and encoding HTML entities
by PodMaster (Abbot) on Mar 13, 2005 at 11:18 UTC

I intend to use method 2 but is there a better way to do it?

HTML 4.01 at the W3C website

