elavarasan has asked for the wisdom of the Perl Monks concerning the following question:

  • Comment on How to generate HTML from Word document ?

Re: How to generate HTML from Word document ?
by marto (Cardinal) on May 29, 2013 at 17:28 UTC

    So we've been here a couple of times (How to convert Word Document to HTML ? and How to Convert Word Document to HTML) where you tried to install MSWord::ToHTML to achieve this task. On each of the previous occasions you failed to provide enough information for anyone to effectively help resolve the installation problem. I've suggested in response (and via discussions in the chatterbox) how to provide the information required to help resolve the problem. One option would be to do this and help us to help you.

    If you've given up on this module I can only suggest you attempt to automate Libre/OpenOffice to convert MS Word documents to PDF.
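A minimal shell sketch of that kind of automation, assuming LibreOffice's `soffice` binary is on PATH. The target format is a parameter, so the same call could produce PDF or the HTML this thread is after. The function name `convert_doc`, the `SOFFICE` override and the file names are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Sketch: convert a Word document with LibreOffice in headless mode.
# Assumption: `soffice` is installed and on PATH.

convert_doc() {
    format=$1        # target format, e.g. html or pdf
    input=$2         # path to the .doc/.docx file
    outdir=${3:-.}   # output directory, defaults to the current one
    # SOFFICE is a hypothetical override; SOFFICE=echo gives a dry run.
    "${SOFFICE:-soffice}" --headless --convert-to "$format" \
        --outdir "$outdir" "$input"
}

# Example (hypothetical file name):
#   convert_doc html report.doc /tmp/out
```

Wrapping the call in a function keeps the format and paths in one place, which makes it easier to loop over a directory of documents later.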

      I have already tried this module, but it will not install on my Linux system. To install it I used cpan -i MSWord::ToHTML; it returns the following output with an error. Please try to figure out what the problem is.

      cpan -i MSWord::ToHTML;
      CPAN: Storable loaded ok (v2.20)
      Going to read '/root/.cpan/Metadata'
        Database was generated on Mon, 27 May 2013 11:17:03 GMT
      CPAN: LWP::UserAgent loaded ok (v5.835)
      CPAN: Time::HiRes loaded ok (v1.9719)
      Warning: no success downloading '/root/.cpan/sources/authors/01mailrc.txt.gz.tmp2664'. Giving up on it. at /usr/share/perl/5.10/CPAN/Index.pm line 225.
      Fetching with LWP:
        http://www.perl.org/CPAN/authors/01mailrc.txt.gz
      CPAN: YAML loaded ok (v0.84)
      Going to read '/root/.cpan/sources/authors/01mailrc.txt.gz'
      CPAN: Compress::Zlib loaded ok (v2.02)
      ............................................................................DONE
      Fetching with LWP:
        http://www.perl.org/CPAN/modules/02packages.details.txt.gz
      Going to read '/root/.cpan/sources/modules/02packages.details.txt.gz'
        Database was generated on Thu, 30 May 2013 04:29:03 GMT
      ..............
        New CPAN.pm version (v2.00) available.
        [Currently running version is v1.9402]
        You might want to try
          install CPAN
          reload cpan
        to both upgrade CPAN.pm and run the new version without leaving
        the current session.
      ..............................................................DONE
      Fetching with LWP:
        http://www.perl.org/CPAN/modules/03modlist.data.gz
      Going to read '/root/.cpan/sources/modules/03modlist.data.gz'
      ............................................................................DONE
      Going to write /root/.cpan/Metadata
      Running install for module 'MSWord::ToHTML'
      Running make for A/AM/AMIRI/MSWord-ToHTML-0.006.tar.gz
      CPAN: Digest::SHA loaded ok (v5.60)
      Checksum for /root/.cpan/sources/authors/id/A/AM/AMIRI/MSWord-ToHTML-0.006.tar.gz ok
      CPAN: Archive::Tar loaded ok (v1.52)
      MSWord-ToHTML-0.006/
      MSWord-ToHTML-0.006/MANIFEST
      MSWord-ToHTML-0.006/Changes
      MSWord-ToHTML-0.006/META.yml
      MSWord-ToHTML-0.006/t/
      MSWord-ToHTML-0.006/t/constraints.t
      MSWord-ToHTML-0.006/t/sanity.t
      MSWord-ToHTML-0.006/t/data/
      MSWord-ToHTML-0.006/t/data/From Iron MInes to Iron Bars.docx
      MSWord-ToHTML-0.006/t/data/california.doc
      MSWord-ToHTML-0.006/t/data/transparent.png
      MSWord-ToHTML-0.006/t/data/Greece(Giorgos).doc
      MSWord-ToHTML-0.006/t/data/Historical Moment.doc
      MSWord-ToHTML-0.006/t/data/Turkey2.docx
      MSWord-ToHTML-0.006/t/data/InsurgentNotes(Cover).doc
      MSWord-ToHTML-0.006/t/data/Madison.doc
      MSWord-ToHTML-0.006/t/data/Turkey.doc
      MSWord-ToHTML-0.006/t/data/Henri_Crisis Final JG.doc
      MSWord-ToHTML-0.006/t/data/Greece(Giorgos).copy.doc
      MSWord-ToHTML-0.006/t/data/InsurgentNotes(Masthead).doc
      MSWord-ToHTML-0.006/t/data/Wisconsin.doc
      MSWord-ToHTML-0.006/t/data/fresh_idea.docx
      MSWord-ToHTML-0.006/t/data/Jackson_2nd rewrite (1).doc
      MSWord-ToHTML-0.006/t/data/s-artesian2.docx
      MSWord-ToHTML-0.006/t/data/Turkey2.copy*Where is it?.docx
      MSWord-ToHTML-0.006/t/convert.t
      MSWord-ToHTML-0.006/lib/
      MSWord-ToHTML-0.006/lib/MSWord/
      MSWord-ToHTML-0.006/lib/MSWord/ToHTML/
      MSWord-ToHTML-0.006/lib/MSWord/ToHTML/DocX.pm
      MSWord-ToHTML-0.006/lib/MSWord/ToHTML/Roles/
      MSWord-ToHTML-0.006/lib/MSWord/ToHTML/Roles/HasHTML.pm
      MSWord-ToHTML-0.006/lib/MSWord/ToHTML/HTML.pm
      MSWord-ToHTML-0.006/lib/MSWord/ToHTML/Doc.pm
      MSWord-ToHTML-0.006/lib/MSWord/ToHTML/Types/
      MSWord-ToHTML-0.006/lib/MSWord/ToHTML/Types/Library.pm
      MSWord-ToHTML-0.006/lib/MSWord/ToHTML.pm
      MSWord-ToHTML-0.006/Makefile.PL
      MSWord-ToHTML-0.006/README
      MSWord-ToHTML-0.006/inc/
      MSWord-ToHTML-0.006/inc/Module/
      MSWord-ToHTML-0.006/inc/Module/AutoInstall.pm
      MSWord-ToHTML-0.006/inc/Module/Install.pm
      MSWord-ToHTML-0.006/inc/Module/Install/
      MSWord-ToHTML-0.006/inc/Module/Install/AutoInstall.pm
      MSWord-ToHTML-0.006/inc/Module/Install/Makefile.pm
      MSWord-ToHTML-0.006/inc/Module/Install/External.pm
      MSWord-ToHTML-0.006/inc/Module/Install/ReadmeFromPod.pm
      MSWord-ToHTML-0.006/inc/Module/Install/Base.pm
      MSWord-ToHTML-0.006/inc/Module/Install/Repository.pm
      MSWord-ToHTML-0.006/inc/Module/Install/Metadata.pm
      MSWord-ToHTML-0.006/inc/Module/Install/WriteAll.pm
      MSWord-ToHTML-0.006/inc/Module/Install/Fetch.pm
      MSWord-ToHTML-0.006/inc/Module/Install/Win32.pm
      MSWord-ToHTML-0.006/inc/Module/Install/Can.pm
      MSWord-ToHTML-0.006/inc/Module/Install/Include.pm
      CPAN: File::Temp loaded ok (v0.22)
      CPAN.pm: Going to build A/AM/AMIRI/MSWord-ToHTML-0.006.tar.gz
      Cannot determine perl version info from lib/MSWord/ToHTML.pm
      Locating bin:abiword... found at /usr/bin/abiword.
      Locating bin:tidy... found at /usr/bin/tidy.
      *** Module::AutoInstall version 1.06
      *** Checking for Perl dependencies...
      *** Since we're running under CPAN, I'll just let it take care of the dependency's installation later.
      [Core Features]
      - Test::Most ...loaded. (0.31 >= 0.23)
      - Archive::Zip ...loaded. (1.30 >= 1.30)
      - Archive::Zip::MemberRead ...loaded. (1.30 >= 1.3)
      - CSS ...loaded. (1.09 >= 1.09)
      - Carp ...loaded. (1.26)
      - Digest::SHA1 ...loaded. (2.13 >= 2.13)
      - Encode ...loaded. (2.51 >= 2.42)
      - Encode::Guess ...loaded. (2.05 >= 2.04)
      - File::Basename ...loaded. (2.77)
      - File::Path ...loaded. (2.09 >= 2.08_01)
      - File::Spec ...loaded. (3.40)
      - HTML::Entities ...loaded. (3.69 >= 3.68)
      - HTML::HTML5::Writer ...loaded. (0.201 >= 0.102)
      - HTML::TreeBuilder ...loaded. (5.03 >= 4.2)
      - IO::All ...loaded. (0.46 >= 0.41)
      - IO::All::File ...loaded. (undef)
      - List::MoreUtils ...loaded. (0.33 >= 0.30)
      - Module::Find ...loaded. (0.11 >= 0.10)
      - Moose ...loaded. (2.0801 >= 2.0000)
      - Moose::Role ...loaded. (2.0801 >= 2.0000)
      - Moose::Util::TypeConstraints ...loaded. (2.0801 >= 2.0000)
      - MooseX::Method::Signatures ...loaded. (0.44 >= 0.36)
      - MooseX::Types ...loaded. (0.35 >= 0.25)
      - MooseX::Types::IO::All ...loaded. (0.03 >= 0.03)
      - MooseX::Types::Path::Class ...loaded. (0.06 >= 0.05)
      - MooseX::Types::Moose ...loaded. (0.35 >= 0.25)
      - Path::Class::Dir ...loaded. (0.32 >= 0.23)
      - Text::Extract::Word ...loaded. (0.02 >= 0.02)
      - Try::Tiny ...loaded. (0.12 >= 0.09)
      - XML::LibXML ...loaded. (1.70 >= 1.70)
      - XML::LibXSLT ...loaded. (1.70 >= 1.70)
      - autodie ...loaded. (2.19 >= 2.10)
      - namespace::autoclean ...loaded. (0.13 >= 0.12)
      - strictures ...loaded. (1.004004)
      *** Module::AutoInstall configuration finished.
      Checking if your kit is complete...
      Looks good
      Writing Makefile for MSWord::ToHTML
      Writing MYMETA.yml and MYMETA.json
      cp lib/MSWord/ToHTML/Doc.pm blib/lib/MSWord/ToHTML/Doc.pm
      cp lib/MSWord/ToHTML/Types/Library.pm blib/lib/MSWord/ToHTML/Types/Library.pm
      cp lib/MSWord/ToHTML/DocX.pm blib/lib/MSWord/ToHTML/DocX.pm
      cp lib/MSWord/ToHTML/HTML.pm blib/lib/MSWord/ToHTML/HTML.pm
      cp lib/MSWord/ToHTML/Roles/HasHTML.pm blib/lib/MSWord/ToHTML/Roles/HasHTML.pm
      cp lib/MSWord/ToHTML.pm blib/lib/MSWord/ToHTML.pm
      Manifying blib/man3/MSWord::ToHTML.3pm
        AMIRI/MSWord-ToHTML-0.006.tar.gz
        /usr/bin/make -- OK
      Running make test
      PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'inc', 'blib/lib', 'blib/arch')" t/constraints.t t/convert.t t/sanity.t
      t/constraints.t .. Can't locate Devel/Dwarn.pm in @INC (@INC contains: /root/.cpan/build/MSWord-ToHTML-0.006-jKHYc_/inc /root/.cpan/build/MSWord-ToHTML-0.006-jKHYc_/blib/lib /root/.cpan/build/MSWord-ToHTML-0.006-jKHYc_/blib/arch /etc/perl /usr/local/lib/perl/5.10.1 /usr/local/share/perl/5.10.1 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.10 /usr/share/perl/5.10 /usr/local/lib/site_perl .) at t/constraints.t line 7.
      BEGIN failed--compilation aborted at t/constraints.t line 7.
      # Looks like your test exited with 2 before it could output anything.
      t/constraints.t .. Dubious, test returned 2 (wstat 512, 0x200)
      No subtests run
      t/convert.t ...... Can't locate Devel/Dwarn.pm in @INC (@INC contains: /root/.cpan/build/MSWord-ToHTML-0.006-jKHYc_/inc /root/.cpan/build/MSWord-ToHTML-0.006-jKHYc_/blib/lib /root/.cpan/build/MSWord-ToHTML-0.006-jKHYc_/blib/arch /etc/perl /usr/local/lib/perl/5.10.1 /usr/local/share/perl/5.10.1 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.10 /usr/share/perl/5.10 /usr/local/lib/site_perl .) at t/convert.t line 7.
      BEGIN failed--compilation aborted at t/convert.t line 7.
      # Looks like your test exited with 2 before it could output anything.
      t/convert.t ...... Dubious, test returned 2 (wstat 512, 0x200)
      No subtests run
      t/sanity.t ....... ok

      Test Summary Report
      -------------------
      t/constraints.t (Wstat: 512 Tests: 0 Failed: 0)
        Non-zero exit status: 2
        Parse errors: No plan found in TAP output
      t/convert.t (Wstat: 512 Tests: 0 Failed: 0)
        Non-zero exit status: 2
        Parse errors: No plan found in TAP output
      Files=3, Tests=6, 3 wallclock secs ( 0.02 usr 0.02 sys + 1.24 cusr 0.18 csys = 1.46 CPU)
      Result: FAIL
      Failed 2/3 test programs. 0/6 subtests failed.
      make: *** [test_dynamic] Error 255
        AMIRI/MSWord-ToHTML-0.006.tar.gz
        /usr/bin/make test -- NOT OK
      //hint// to see the cpan-testers results for installing this module, try:
        reports AMIRI/MSWord-ToHTML-0.006.tar.gz
      Running make install
        make test had returned bad status, won't install without force

        Finally, some output ;) As choroba points out:

        Can't locate Devel/Dwarn.pm

        The module's tests require a module which is not marked as a dependency. This is why cpan doesn't install that module for you, and why you're getting the error. The solution is to install Devel::Dwarn before installing MSWord::ToHTML:

        cpan Devel::Dwarn
        cpan MSWord::ToHTML

        I appreciate that a lot of output can be generated when installing modules, but you need to get used to reading and understanding it.
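One way to cut a long build log down to the lines that matter is to filter it for perl's "Can't locate" errors. The sketch below assumes the cpan output was saved to a file first. The function name `missing_modules` and the file name `cpan.log` are hypothetical, and `grep -o` is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# Sketch: list the modules perl failed to load, given a build log on stdin.

missing_modules() {
    # Keep only the "Can't locate Foo/Bar.pm" fragments, then turn each
    # path into a module name (Foo/Bar.pm -> Foo::Bar) and de-duplicate.
    grep -o "Can't locate [^ ]*\.pm" |
        sed -e "s/^Can't locate //" -e 's/\.pm$//' -e 's|/|::|g' |
        sort -u
}

# Example (hypothetical log file):
#   cpan -i MSWord::ToHTML 2>&1 | tee cpan.log
#   missing_modules < cpan.log
```

Run against the log posted above, this kind of filter would leave just the module named in the failing test lines.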

        Update: I've submitted a patch to have Devel::Dwarn added as a prerequisite.

        Have you read the log? Have you tried installing Devel::Dwarn?
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: How to generate HTML from Word document ?
by ww (Archbishop) on May 29, 2013 at 17:04 UTC
    downvoted:
    • Lazy. This question has been asked (and better) repeatedly. cf Super Search or Google
    • "perfect" may be any one of many different possibilities... depending on data about your use case and infrastructure you didn't bother to detail. I know what I mean. Why don't you?

    If you didn't program your executable by toggling in binary, it wasn't really programming!

Re: How to generate HTML from Word document ?
by rpnoble419 (Pilgrim) on May 30, 2013 at 03:08 UTC

    How complex is your Word document? Is it simple text with styling or complex page layout with tables and embedded graphics? Can the HTML output from Word be used for your purpose? The HTML created by Word is a pig and very bloated.

    Can you convert the Word doc into RTF and then parse that? That would give you simple styling cues to then convert into HTML. If the Word document is complex, or the HTML output from Word works for your final output, you may want to try Win32::OLE to remote control Word to create the HTML for you.

      "...you may want to try Win32::OLE to remote control Word to create the HTML for you."

      This isn't an option, they're running Linux.