Text to HTML

This comes up regularly, so here is a short script to convert a text file into HTML so that it displays correctly in a browser, i.e. it looks like it does in a text editor. The new file has a .htm extension added.

HTML special chars are escaped and tabs are rendered as 4 spaces. Sequences of spaces longer than 1 are converted to the corresponding number of &nbsp; entities so that the whitespace formatting is retained (not required if the output is wrapped in <pre> tags, but it does not hurt). In the example, <pre> tags are wrapped around the escaped output so newlines retain their literal meaning. If you want to use other tags (say <tt>), you will need to uncomment the s/\n/<br>\n/g line to complete the escape process.
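
To illustrate the <tt> alternative, here is a minimal sketch with the newline substitution switched on. It assumes a cut-down escape routine: escape_line is a hypothetical name used only for this example (it is not the script's escapeHTML), and the sample input line is made up.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical cut-down escape routine for output that is NOT inside <pre>.
sub escape_line {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;    # runs first so the entities added below are not re-escaped
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/\t/    /g;    # tabs to 4 spaces
    $s =~ s/( {2,})/"&nbsp;" x length $1/eg;    # keep runs of 2+ spaces
    $s =~ s/\n/<br>\n/g;  # needed because <tt> does not preserve newlines
    return $s;
}

print "<tt>", escape_line("if (a < b)  x;\n"), "</tt>\n";
# prints: <tt>if (a &lt; b)&nbsp;&nbsp;x;<br>
#         </tt>
```

Without the last substitution the browser would collapse the line breaks, since only <pre> treats newlines literally.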

Yes, there are modules out there that will do this. For the curious, this is what they do in a nutshell. They probably won't do the [PerlMonks] escapes though :-)

#!/usr/bin/perl -w

my $text = "c:/text.pl";

# three-argument open keeps odd characters in the filename from being read as a mode
open TEXT, '<', $text or die "Oops can't open $text $!";
open HTML, '>', "$text.htm" or die "Oops can't write $text.htm $!";
print HTML "<pre>\n";
while (<TEXT>) {
    $_ = escapeHTML($_);
    print HTML $_;
}
print HTML "</pre>\n";
close HTML;
close TEXT;

sub escapeHTML {
    local $_ = shift;
    # make the required escapes
    s/&/&amp;/g;
    s/"/&quot;/g;
    s/</&lt;/g;
    s/>/&gt;/g;
    # change tabs to 4 spaces
    s/\t/    /g;
    # make the whitespace escapes - not required within <pre> tags
    s/( {2,})/"&nbsp;" x length $1/eg;
    # make the browser bugfix escapes
    s/\x8b/&#139;/g;
    s/\x9b/&#155;/g;
    # make the PerlMonks escapes (if desired)
    s/\[/&#091;/g;
    s/\]/&#093;/g;
    # change newlines to <br> if desired - not required with <pre>
    # s/\n/<br>\n/g;
    return $_;
}
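
One thing to be aware of: the s/\t/    /g line swaps every tab for exactly four spaces, which only matches the editor view when each tab happens to start on a tab stop. A minimal sketch of column-aware expansion follows, assuming the core Text::Tabs module and a tab width of 4. The helper name expand_tabs is made up for illustration.

```perl
#!/usr/bin/perl -w
use strict;
use Text::Tabs;    # core module; provides expand() and $Text::Tabs::tabstop

# Hypothetical helper: pad each tab out to the next multiple of 4 columns.
sub expand_tabs {
    my ($line) = @_;
    # Text::Tabs defaults to 8; set the package variable by its full name,
    # because local on the imported alias would not reach the module.
    local $Text::Tabs::tabstop = 4;
    return expand($line);
}

print expand_tabs("ab\tc"), "\n";    # the tab fills 2 columns: "ab  c"
```

If this were used, it would run on each line before the other escapes, in place of the fixed four-space substitution.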

Replies are listed 'Best First'.
Re: Text to HTML
by shenme (Priest) on Aug 24, 2003 at 02:18 UTC

I was wandering about the universe trying to track down the reason for a change in CGI.pm and came across this node. Lincoln Stein changed the code in escapeHTML from using &#139;/&#155; to using &#8249;/&#8250;. The change comment was:

Version 2.84
HTML escaping code now replaced 0x8b and 0x9b with unicode references &#8249; and &#8250;

So the equivalent code would become:

    s{\x8b}{&#8249;}gso;
    s{\x9b}{&#8250;}gso;

I still haven't found _why_ the change was made ... I guess the Unicode values were more correct than assuming a particular charset?
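
As a runnable illustration of that replacement, here is a minimal sketch. It assumes the input is bytes in a Windows-1252-style encoding, where 0x8b and 0x9b are the single angle quotation marks (U+2039 and U+203A, i.e. decimal 8249 and 8250). The sub name escape_high_chars and the sample string are made up for this example, and the /s and /o flags are dropped because they make no difference to these patterns.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: swap the two problem bytes for numeric references
# to the Unicode code points they stand for in Windows-1252.
sub escape_high_chars {
    my ($s) = @_;
    $s =~ s/\x8b/&#8249;/g;    # U+2039, single left-pointing angle quote
    $s =~ s/\x9b/&#8250;/g;    # U+203A, single right-pointing angle quote
    return $s;
}

print escape_high_chars("a \x8b b \x9b c"), "\n";
# prints: a &#8249; b &#8250; c
```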
