in reply to Approaches to produce word docs

Do you have Word? If so, it may be possible to use Word's alleged ability to read HTML files to get what you want. I say "alleged" because Word will only read HTML files of a certain format. I don't speak HTML, so I haven't managed to get any code working to do this, but automating Word is not that difficult. I opened a Word instance and saved a blank document as HTML. That generated most of the code below, which is nearly working, i.e. it doesn't work. The problem seems (remember, I don't speak HTML) to be that the temp file ends up with two sets of head and body tags: one from the existing HTML document and one from the Word top and tail wrapped around it. Word therefore rejects the temp file when it tries to open it. If anyone knows enough about HTML to get an HTML file into a form Word will accept, this might be a way forward for you - if you have Word!

Regards,

John

use strict;
use warnings;
use File::Spec;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';

# Header Word 10 wrote when saving a blank document as HTML,
# up to and including the opening <body> tag.
my $htmltop = <<'END_TOP';
<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 10">
<meta name=Originator content="Microsoft Word 10">
<link rel=File-List href="Blank_files/filelist.xml">
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>Davies</o:Author>
  <o:LastAuthor>Davies</o:LastAuthor>
  <o:Revision>1</o:Revision>
  <o:TotalTime>1</o:TotalTime>
  <o:Created>2011-02-01T14:47:00Z</o:Created>
  <o:LastSaved>2011-02-01T14:48:00Z</o:LastSaved>
  <o:Pages>1</o:Pages>
  <o:Lines>1</o:Lines>
  <o:Paragraphs>1</o:Paragraphs>
  <o:Version>10.2625</o:Version>
 </o:DocumentProperties>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:Compatibility>
   <w:BreakWrappedTables/>
   <w:SnapToGridInCell/>
   <w:WrapTextWithPunct/>
   <w:UseAsianBreakRules/>
  </w:Compatibility>
  <w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
 </w:WordDocument>
</xml><![endif]-->
<style>
<!--
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
    {mso-style-parent:"";
    margin:0cm;
    margin-bottom:.0001pt;
    mso-pagination:widow-orphan;
    font-size:12.0pt;
    font-family:Arial;
    mso-fareast-font-family:"Times New Roman";
    mso-bidi-font-family:"Times New Roman";}
@page Section1
    {size:595.3pt 841.9pt;
    margin:72.0pt 90.0pt 72.0pt 90.0pt;
    mso-header-margin:35.4pt;
    mso-footer-margin:35.4pt;
    mso-paper-source:0;}
div.Section1
    {page:Section1;}
-->
</style>
<!--[if gte mso 10]>
<style>
 /* Style Definitions */
 table.MsoNormalTable
    {mso-style-name:"Table Normal";
    mso-tstyle-rowband-size:0;
    mso-tstyle-colband-size:0;
    mso-style-noshow:yes;
    mso-style-parent:"";
    mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
    mso-para-margin:0cm;
    mso-para-margin-bottom:.0001pt;
    mso-pagination:widow-orphan;
    font-size:10.0pt;
    font-family:"Times New Roman";}
</style>
<![endif]-->
</head>

<body lang=EN-GB style='tab-interval:36.0pt'>
END_TOP

my $htmltail = "</body>\n</html>\n";

# Word resolves relative file names against its own working directory,
# not the script's, so give it absolute paths.
my $infile = shift;
die "Usage: $0 file.html\n"
    unless defined $infile && $infile =~ /\.html?$/i;
$infile = File::Spec->rel2abs($infile);

(my $tempfile = $infile) =~ s/(\.html?)$/tmp$1/i;   # page.html -> pagetmp.html
(my $outfile  = $infile) =~ s/\.html?$/.doc/i;      # page.html -> page.doc

# Wrap the input file in the Word top and tail.
open(my $fhi, "<", $infile)   or die "Can't open input file $infile: $!";
open(my $fht, ">", $tempfile) or die "Can't open temp file $tempfile: $!";
print {$fht} $htmltop;
while (my $line = <$fhi>) {
    print {$fht} $line;
}
print {$fht} $htmltail;
close $fhi;
close $fht;

# Let Word open the wrapped HTML and save it as a .doc.
# OLE failures are reported by Win32::OLE->LastError(), not $!.
my $word = Win32::OLE->new('Word.Application')
    or die "Can't start Word: ", Win32::OLE->LastError();
my $doc = $word->Documents->Open($tempfile)
    or die "Can't open $tempfile in Word: ", Win32::OLE->LastError();
$doc->SaveAs({FileName => $outfile, FileFormat => wdFormatDocument});
$doc->Close();
$word->Quit();
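
One possible way around the doubled head and body tags mentioned above is to copy only the content between the input file's own body tags into the temp file, so that the Word top and tail supply the only html, head and body elements. The following is a rough sketch under that assumption, not a tested fix for Word's import. `extract_body` is a hypothetical helper that is not part of the script above, and the regex is deliberately crude: a real parser such as HTML::Parser would cope better with unusual markup.

```perl
use strict;
use warnings;

# Hypothetical helper: return the markup between <body ...> and </body>.
# If there is no <body> tag at all, the input is returned unchanged.
# If <body> is never closed, everything after it is returned.
sub extract_body {
    my ($html) = @_;
    if ($html =~ m{<body\b[^>]*>(.*?)(?:</body\s*>|\z)}is) {
        return $1;
    }
    return $html;
}

# Made-up sample page standing in for the existing HTML document.
my $page = <<'END_PAGE';
<html>
<head><title>Report</title></head>
<body bgcolor="white">
<h1>Report</h1>
<p>Hello, Word.</p>
</body>
</html>
END_PAGE

my $fragment = extract_body($page);
print $fragment;
```

In the script, this would mean reading the whole input file into one string (for example with `local $/;`) and printing `extract_body($contents)` between `$htmltop` and `$htmltail` instead of copying the file line by line. Anything in the input's own head, such as its title or stylesheets, is dropped by this approach.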

Re^2: Approaches to produce word docs
by LanX (Saint) on Feb 01, 2011 at 18:01 UTC
    That's comparable to PowerPoint's HTML export, which also includes MS-only information.

    Namely all these mso styles and XML blocks. They help IE call Office products in the background for rendering.

    (though with PPT it's more extreme)

    Anyway, using MS-Word to import HTML is only my last resort. :)

    Thanks anyway, I will consider doing this task from Windows...

    Cheers Rolf