This comes up regularly, so here is a short script to convert a text file into HTML so that it displays correctly in a browser, i.e. looks like it does in a text editor. The output file is the original name with a .htm extension appended.

HTML special chars are escaped and tabs are rendered as 4 spaces. Sequences of two or more spaces are converted to the corresponding number of &nbsp; entities so that the whitespace formatting is retained (not required if the output is wrapped in <pre> tags, but it does not hurt). In the example, <pre> tags are wrapped around the escaped output so newlines retain their literal meaning. If you want to use other tags (say <tt>), you will need to uncomment the s/\n/<br>\n/g line to complete the escape process.

Yes, there are modules out there that will do this. For the curious, this is what they do in a nutshell. They probably won't do the [PerlMonks] escapes though :-)

#!/usr/bin/perl -w

my $text = "c:/text.pl";
open TEXT, '<', $text or die "Oops can't open $text $!";
open HTML, '>', "$text.htm" or die "Oops can't write $text.htm $!";
print HTML "<pre>\n";
while (<TEXT>) {
    $_ = escapeHTML($_);
    print HTML $_;
}
print HTML "</pre>\n";
close HTML;
close TEXT;

sub escapeHTML {
    local $_ = shift;
    # make the required escapes (& must go first so that the
    # entities added below are not escaped a second time)
    s/&/&amp;/g;
    s/"/&quot;/g;
    s/</&lt;/g;
    s/>/&gt;/g;
    # change tabs to 4 spaces
    s/\t/    /g;
    # make the whitespace escapes - not required within <pre> tags
    s/( {2,})/"&nbsp;" x length $1/eg;
    # make the browser bugfix escapes
    s/\x8b/&#139;/g;
    s/\x9b/&#155;/g;
    # make the PERL MONKS escapes (if desired)
    s/\[/&#091;/g;
    s/\]/&#093;/g;
    # change newlines to <br> if desired - not required with <pre>
    # s/\n/<br>\n/g;
    return $_;
}
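
For the non-<pre> case mentioned above, here is a minimal sketch of the same escapes with the <br> conversion switched on and the output wrapped in <tt>. The sub name escape_for_tt and the sample text are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the script above): the same escapes,
# but for output that is NOT wrapped in <pre>.
sub escape_for_tt {
    my $line = shift;
    $line =~ s/&/&amp;/g;      # must run first, or later entities get re-escaped
    $line =~ s/"/&quot;/g;
    $line =~ s/</&lt;/g;
    $line =~ s/>/&gt;/g;
    $line =~ s/\t/    /g;      # tab -> 4 spaces
    $line =~ s/( {2,})/"&nbsp;" x length($1)/eg;   # keep runs of spaces
    $line =~ s/\n/<br>\n/g;    # outside <pre>, newlines need an explicit <br>
    return $line;
}

# Made-up sample input: a three line code fragment with a tab indent
my $sample = "if (\$a < \$b && \$c) {\n\tprint \"ok\";\n}\n";
print "<tt>\n", escape_for_tt($sample), "</tt>\n";
```

Without the &nbsp; step the indentation would collapse in a <tt> block, which is why that substitution only becomes optional once <pre> is used.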

Re: Text to HTML
by shenme (Priest) on Aug 24, 2003 at 02:18 UTC
    Wandering about the universe trying to track down the reason for a change in CGI.pm and came across this node.

    Lincoln Stein changed the code in escapeHTML from using &#139;/&#155; to using &#8249;/&#8250;. The change comment was:

    • Version 2.84
      HTML escaping code now replaced 0x8b and 0x9b with unicode references &#8249; and &#8250;
    So the equivalent code would become:
    s{\x8b}{&#8249;}gso; s{\x9b}{&#8250;}gso;
    Still haven't found _why_ the change ... I guess the Unicode values were more correct than assuming a particular charset?
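
One possible reading, offered as a sketch and not as the documented reason for the CGI.pm change: in Windows-1252 the bytes 0x8B and 0x9B are the single angle quotation marks, which are U+2039 and U+203A in Unicode, and 8249 and 8250 are simply those code points in decimal. The snippet below assumes the input is Windows-1252 bytes; the sub name escape_angle_quotes is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape the two bytes with the references quoted
# in the CGI.pm 2.84 change note. Assumes Windows-1252 encoded input.
sub escape_angle_quotes {
    my $text = shift;
    $text =~ s/\x8b/&#8249;/g;   # 0x8B is U+2039 in Windows-1252
    $text =~ s/\x9b/&#8250;/g;   # 0x9B is U+203A in Windows-1252
    return $text;
}

# The decimal references are the Unicode code points written in base 10
printf "U+%04X -> &#%d;\n", $_, $_ for 0x2039, 0x203A;

print escape_angle_quotes("\x8bquoted\x9b"), "\n";
```

Numeric character references in HTML name Unicode code points, so &#139; formally points at a C1 control character and only displays as an angle quote where the browser applies a Windows-1252 fallback.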