Milti has asked for the wisdom of the Perl Monks concerning the following question:

Where can I find a Perl program that will convert plain text to HTML? I have a routine that will do the basics, but I need one that will recognize various lists in a Word document and indent the items in the list the same as <ul><li>blah, blah</li><li>ha ha, ha ha</li></ul> does.

Any and all suggestions will be appreciated.

Re: Plain Text To HTML
by GrandFather (Saint) on Sep 18, 2024 at 21:13 UTC

    Last I looked, Word had an "export as HTML" option. It produces rubbish HTML just cram-packed with cruft, but Perl can help clean that up. You could start with HTML::Normalize (inspired by Cleaning up HTML) or with HTML::Tidy.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Plain Text To HTML
by davies (Monsignor) on Sep 18, 2024 at 21:51 UTC
Re: Plain Text To HTML
by Anonymous Monk on Sep 18, 2024 at 20:25 UTC
Re: Plain Text To HTML (docx)
by LanX (Saint) on Sep 18, 2024 at 22:09 UTC
    Plain text or docx or both?

    Anyway, docx is basically zipped XML with plain text between the tags.

    Unzip it and use an XML parser to grab the data you need.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

      How does extracting plain text from XML answer the need for converting plain text to HTML?

Re: Plain Text To HTML
by Anonymous Monk on Sep 18, 2024 at 20:32 UTC
Re: Plain Text To HTML
by Milti (Beadle) on Sep 19, 2024 at 13:01 UTC

    Let me clarify.

    I use a form to input information to a program to display a webpage. One part of that form is a text-area into which one can type input, or simply paste a Word document. I need a routine within the main program which will read the input from the text-area (call it $Description) and convert it to HTML for display in the webpage as directed. The routine I am presently using will do all of this except it does not indent the items in a bulleted or numbered list. Here is the present routine:

    my $newrecord = "";
    my $fn = $Description;
    $fn =~ s/-RET-/<br>/g;
    $fn =~ s/\n/<br>/g;
    $fn =~ s/_/((22&%&%22))/g;
    $newrecord .= "$fn";
    $Description = $newrecord;

    Suggestions as to what is missing will be greatly appreciated.
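
One possible sketch of the missing piece, not a definitive answer: it assumes that pasted list items arrive as separate lines starting with `-`, `*`, or a bullet character (for bulleted lists) or with `1.` / `1)` (for numbered lists), and that the text has already been decoded to Perl characters. The sub name `text_to_html` is made up for illustration, and the `-RET-` and underscore substitutions from the routine above are left out. Runs of matching lines are wrapped in `<ul>` or `<ol>`, which is what makes the browser indent them.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): convert plain text to HTML,
# wrapping runs of bulleted or numbered lines in <ul> / <ol>.
sub text_to_html {
    my ($text) = @_;
    my @out;
    my $open = '';    # list tag currently open: 'ul', 'ol', or '' for none

    for my $line (split /\r?\n/, $text) {
        # Escape HTML metacharacters so pasted text cannot inject markup.
        $line =~ s/&/&amp;/g;
        $line =~ s/</&lt;/g;
        $line =~ s/>/&gt;/g;

        # Classify the line. The markers below are assumptions about
        # what pasted lists look like, not something Word guarantees.
        my ($tag, $item) = ('', '');
        if ($line =~ /^\s*(?:[-*]|\x{2022})\s+(.*)$/) {
            ($tag, $item) = ('ul', $1);
        }
        elsif ($line =~ /^\s*\d+[.)]\s+(.*)$/) {
            ($tag, $item) = ('ol', $1);
        }

        # Close the open list when the kind of line changes.
        if ($open && $open ne $tag) {
            push @out, "</$open>";
            $open = '';
        }

        if ($tag) {
            if (!$open) {
                push @out, "<$tag>";
                $open = $tag;
            }
            push @out, "<li>$item</li>";
        }
        else {
            push @out, "$line<br>";
        }
    }

    # Close a list that runs to the end of the input.
    push @out, "</$open>" if $open;
    return join "\n", @out;
}

print text_to_html("Agenda:\n* coffee\n* code\n1. first\n2. second\nDone"), "\n";
```

Nested lists and items that wrap over several lines are not handled; whether that matters depends on what the pasted text really contains.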

      > simply paste a Word document.

      What does that mean?

      Others already pointed you to SSCCE; it's a good way to "clarify" input, code and expected output.

      Cheers Rolf

        > I use a form to input information to a program to display a webpage. One part of that form is a text-area into which one can type input, or simply paste a Word document.

        This suggests to me that the idea is not to read a .docx file but that a user pastes content (with or without formatting?) into a webpage, which is backed by a Perl CGI script.

      More clarification.

      The form input is sent to a program on a web server to be processed and displayed as a part of a webpage that the site user will see. The form input might be simply typed into the 'text area' or a document might be copied and pasted into the 'text area'. In either case the final product for display will require some HTML formatting. Anyone who is authorized can post info for display at our website and the method must be simple: either type info into the form or copy and paste something into the form. In either case it is unlikely there will be any HTML formatting. Consequently, we need to be able to format the input with the program that is accepting the input before it is displayed.

      Hope this makes clear what it is that I am trying to do.

        It would aid your cause greatly if you were to show sample input as pasted into the form (say 3 lines max) and the equivalent desired HTML of that input once it has been transformed. At the moment everyone is left to guess what it is that you actually want to happen during this transformation.

        When you have a moment, perhaps a read of How to ask better questions using Test::More and sample data will be of help.
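
As an illustration of that advice, here is a minimal sketch of how the question could be posed with Test::More. The sample input and the desired HTML are invented for this example, since the thread never shows real data, and the wrapper name `current_routine` is hypothetical; its body is the routine posted earlier in the thread. The second test is marked TODO because it describes the behaviour that is wanted but not yet implemented.

```perl
use strict;
use warnings;
use Test::More tests => 2;

# The routine from the question, wrapped in a sub so it can be tested.
# The name current_routine is made up for this example.
sub current_routine {
    my ($fn) = @_;
    $fn =~ s/-RET-/<br>/g;
    $fn =~ s/\n/<br>/g;
    $fn =~ s/_/((22&%&%22))/g;
    return $fn;
}

# Invented sample input: a guess at what a pasted bulleted list looks like.
my $input = "Bring:\n* pens\n* paper";

# What the routine does today.
is( current_routine($input),
    'Bring:<br>* pens<br>* paper',
    'newlines become <br>' );

# What is wanted. TODO marks this as a known, not-yet-fixed failure.
TODO: {
    local $TODO = 'list handling is the missing piece';
    is( current_routine($input),
        'Bring:<br><ul><li>pens</li><li>paper</li></ul>',
        'bulleted lines become an indented list' );
}
```

A post in this shape gives readers the exact input, the current output, and the expected output in one runnable file.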


        🦛

Re: Plain Text To HTML
by stevieb (Canon) on Sep 23, 2024 at 04:33 UTC

    Microsoft Word is one of the banes of my existence. I get Word docs sent to me all the time, and it frustrates me to no end, because on my Mac (my primary work machine), I refuse to (once again) buy a license to properly read/write their proprietary format.

    My suggestion, since you asked? Ask Microsoft to standardize on a globally recognized format. Otherwise, write a CPAN distribution that handles their format that you consistently update when, at the whim of Microsoft, their proprietary format changes. My other suggestion? Demand people not send you files in a Microsoft Word format.

      The Office_Open_XML format has been the standard since 2007, and you don't need proprietary software to work with it.

        Plus there are various open source readers like LibreOffice available which don't require a licence.

        Alternatively, there are web applications like Google Docs.

        Cheers Rolf

        > The Office_Open_XML format has been the standard since 2007, and you don't need proprietary software to work with it.

        There is a lot to unpack in that assertion, and it's all the harder since Groklaw.net is now offline. You don't need proprietary software to work with OOXML, but then OOXML is not .docx, and part of the question here is about MS Word documents, which default to the proprietary .docx series.

        Yes, OOXML aka ISO/IEC 29500 is one format standard; it was whipped up in great haste to compete with the actual universal format, OpenDocument Format aka ISO/IEC 26300. Both are technically open standards, but while OOXML weighs in at well over 6,000 pages, it is incompletely documented and no one, not even Microsoft, implements it or even can implement it. In contrast, ODF is fully documented, and fully implemented in Calligra, LibreOffice, and several others. ODF is already partially implemented in MSO, but that work appears to have stalled as MS has gone back to proprietary formats like the .docx series. Also, OOXML suffers from a tremendous amount of NIH while ODF re-uses many existing standards for components.

        As for the original question, converting from Markdown to HTML would be one option, as mentioned by nerdvana and Anonymous Monk. Markdown is rather close to plain text with minimal structure, and it is easy to convert between Markdown and HTML using Perl. However, structure is the key: one can go from more to less, but one cannot automatically produce more detail from less detail.

        Milti, could you please explain more about the task?