About the middle of last year I started learning Perl with the intent of writing a Word HTML to TWiki converter to ease porting our Word docs to TWiki. At one point I asked "Publish or Polish" and was told "Publish early and often". But, as these things are inclined to do, I decided I needed to clean up just one more bug - and eventually got busy with something else and the whole project lapsed. Well, having had recent need to convert more docs, I've fixed the "one more bug" and a couple of others besides, so here is the "early" release :-)
Most of this code goes right back to my first lines of Perl, so expect the unexpected. It has been passed through perltidy, but don't be taken in by that :).
Most of the interesting documentation is included at the start of the code.
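In practice a conversion is just two passes over the same document, something like this (the script and document names here are only illustrative):

    perl html2wiki.pl SomeDoc.html    # first pass: writes SomeDoc.html.html2wiki
    ...edit SomeDoc.html.html2wiki...
    perl html2wiki.pl SomeDoc.html    # second pass: writes the TWiki topic files

The script picks the pass by comparing file ages: if the translation file is missing or older than the document it generates a fresh translation file, otherwise it performs the conversion.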
use warnings;
use strict;
use utf8;
use Fcntl;
use HTML::TreeBuilder;
use HTML::Entities;
use HTML::TableContentParser;

=head1 HTML to TWiki text file converter

This script takes an HTML document generated by exporting a Word document as
HTML and generates TWiki pages suitable for dropping directly into the TWiki
folder.

Two passes are required by html2wiki. The first pass generates a translation
file containing various substitution strings that control behaviour of the
conversion process in the second pass. The translation file will require
editing before the second pass if you expect good results.

Note that the entire document is processed in memory so there may be
performance issues when large documents are being processed.

Some manual preprocessing of the document to remove tables of contents, tables
of figures and indexes will yield a better result. Generally the search
facilities provided by TWiki obviate the need for such tables in any case.

Word tables can cause all sorts of grief. Watch where tables are anchored -
html2wiki gets confused at times and may either split tables or lose parts of
a table if the table is anchored to a heading paragraph. Heading lines in
tables can also cause trouble, resulting in the heading cells being put into
separate rows.

=head2 Running html2wiki

For both passes html2wiki takes the path to the HTML document to be converted
on the command line.

The first pass generates a translation file in the document's location with
the document's name and a .html2wiki extension.

The second pass generates a file or files in the document's location with
names generated using an abbreviated form of the document's name as a base and
incorporating an abbreviated form of the heading text.

During either pass html2wiki may generate warning or error messages.

=head2 Translation File

The translation file consists of a number of sections. Each section is a list
of lines containing configuration information or comprising a match key,
substitution text, substitution parameters and (optional) context text.

Any line starting with a # is a comment and is ignored by html2wiki except the
special #= section markers.

=head3 XlateParams

The translate parameters section provides general configuration information.
The parameters are commented where necessary and generally comprise the
following entries:

=head4 "WikiNameRoot"

Provides the base for the generated file names. This name must conform to
TWiki file naming conventions, which follow WikiWord conventions.

=head4 "SamePageHeader"

Provides the header level below which information is kept in the same topic
page. html2wiki will generate new topic pages (new files) for each heading it
finds down to this level.

=head4 "ParentTopicName"

is the WikiWord name of the (parent) topic which contains the link to the
first topic page of the generated files. This should always be provided. If it
is not then the topic linking information provided by TWiki when browsing the
generated pages will be broken.

=head4 "AuthorsWikiName"

is the WikiWord name of the putative author of the generated files.

=head3 Subs

The "subs" section allows substitution strings to be provided. The
substitutions are performed on the generated TWiki text.

=head3 ElementSubs

This section allows management of HTML elements by either ignoring them or
pretending that they are a different type of element.
This can be used to translate <span ...> elements into <code> elements for
example, or to translate particular paragraph styles to heading elements.

Use "-" to have an element ignored. Note that this does not generally ignore
the contents of the element, but will suppress the direct effects of the
element. For example a paragraph element with a particular set of attributes
may be ignored so that it doesn't generate a paragraph break (this may be
useful in tables).

=cut

# The following variables are initialised in either the main block
# or in LoadXlateFile. They are used in many places.
my $OriginalName;    # Name of the source (HTML) file
my @WikiFiles;       # Files generated and referenced (ie image files)
my $TranslateName;   # Name of the translation file
my %XlateParams;     # Hash of general translation parameter names and values
my %ElementSubs;     # Control processing for various elements
my %Anchors;         # Active anchors. A name and matching href are required for linking
my @Subs;            # Substitution pairs for the final pass
my %ParentPageNames; # Names that have been used for parent pages
my $Noisy = 0;       # Generate progress messages

# Abbreviate is used to generate abbreviated file names from headings for
# topic pages.
sub Abbreviate {
    my @Result;

    while ($_ = shift) {
        last if ! defined $_;

        my $Abbrev = substr $_, 0, 1, ""; # Retain first char and preserve its case

        tr/A-Z/a-z/;
        if (length ($_) > 4 and ! /[0-9]/) {
            tr/A-Za-z//cd;
            tr/aeiou//d;
            s/(.)\1+/$1/gi;
            s/ck/k/g;
            s/ptn/pn/g;
            s/tng/tg/g;
            s/thr/tr/g;
            s/vnt/vt/g;
            s/ltn/ln/g;
            s/lb/b/g;
            s/tt/t/g;
        }
        $Abbrev .= $_;
        push @Result, $Abbrev;
    }
    return wantarray ? @Result : join " ", @Result;
}

# ScanForTagSubs scans the tree to identify all the tags that are used in the
# document for inclusion in the translation file.
#
# Default actions are provided for a small number of common tags and attributes.
sub ScanForTagSubs {
    my $Tree = shift;
    my $MaxPos = 0;
    my @Tags = $Tree->look_down ("_tag", qr/./);

    foreach my $Element (@Tags) {
        my $Tag = $Element->starttag ();
        my $Action = "";

        if    ($Tag =~ /courier/i) {$Action = "<code>"}
        elsif ($Tag =~ /size=7/i)  {$Action = "<h1>"}
        elsif ($Tag =~ /size=6/i)  {$Action = "<h2>"}
        elsif ($Tag =~ /size=5/i)  {$Action = "<h3>"}
        elsif ($Tag =~ /size=4/i)  {$Action = "<h4>"}
        elsif ($Tag =~ /size=3/i)  {$Action = "<h5>"}
        elsif ($Tag =~ /size=2/i)  {$Action = "<h6>"}
        elsif ($Tag =~ /<i>/i)     {$Action = "<i>"}

        $Tag =~ s/\G(.*?)[\n\r]+/$1/gs;
        $ElementSubs{"$Tag"} = $Action;
    }
}
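# For illustration: after ScanForTagSubs has run over a typical Word export,
# %ElementSubs holds entries along these lines (the tag strings here are
# hypothetical; the real keys are whatever starttag() returns):
#
#   '<span style="font-family:Courier New">'  =>  '<code>'
#   '<font size=5>'                           =>  '<h3>'
#   '<p class=MsoNormal>'                     =>  ''
#
# These defaults can then be adjusted by hand in the #=ElementSubs section of
# the translation file before the second pass.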
# GenerateXlateFile builds the translation file that is used to control
# document conversion.
sub GenerateXlateFile ($) {
    my $Tree = shift;

    open (outFile, ">$TranslateName")
        or die "Unable to create translation file: $TranslateName ($!)\n";

    print outFile "# The first string of each line is a key and should not be altered.\n";
    print outFile "# The second string is a value that may be altered as required. Each\n";
    print outFile "# section describes permissible values.\n";
    print outFile "# Do not alter #= lines!\n";
    print outFile "\n";

    # The following return in %XlateParams
    my @Keys = (
        ["WikiNameRoot",    "# Root name for the generated wiki files"],
        ["SamePageHeader",  "# All headers below this number generate a new wiki page"],
        ["ParentTopicName", "# Name of the parent wiki page"],
        ["AuthorsWikiName", "# WikiName of user attributed with generating the pages"],
    );

    print outFile "#=XlateParams\n";
    print outFile "# WikiNameRoot must be valid as part of a file name and should be unique in the intended TWiki context.\n";
    for my $key (@Keys) {
        printf outFile "%-20s %-20s %s\n", "\"$key->[0]\",", "\"$XlateParams{$key->[0]}\"", "$key->[1]"
    }

    my $ContextStr;
    my $Action;

    # The following return in @Subs
    print outFile <<Subs;
#=Subs
# The following pairs each comprise a search string and a replace string
# The search string is not a regular expression. Sorry.
"™", "™"
"®", "®"
Subs

    ScanForTagSubs ($Tree);

    print outFile <<ElementSubs;
#=ElementSubs
# The following values must be one of:
#   \"\"          Leave unchanged
#   \"<tag...>\"  Replace tags with given tag
# Automatic linking for an href can be suppressed by deleting either
# or both of the <a name...> or <a href...> entries below.
ElementSubs

    foreach my $Key (sort keys %ElementSubs) {
        $Action = $ElementSubs{$Key};
        printf outFile "%-20s %s\n", "\"$Key\",", "\"$Action\"";
    }

    close outFile;
}

# Assumes $TranslateName has been set. Parses translation file to extract the
# document conversion parameters.
sub LoadXlateFile {
    my $SkipTags = shift || "";
    my $State = "Searching";

    open (inFile, "<$TranslateName") or return;
    while (<inFile>) {
        chomp;
        s/(^#(?!=).*|(?<!^)#[^#"]*)$//g; # Strip simple line end comments (like this one)
        s/\s+$//g;                       # Strip trailing white space
        next if $_ eq "";                # Ignore blank line

        if (/^#=/) {
            ($State) = /^#=(\w*)/g;
            $State = "Skipping" if $SkipTags =~ /$State/i;
            next;
        }

        next if /^\s*#/;                 # Ignore comment lines
        next if $State eq "Skipping";

        if ($State eq "Searching") {
            print STDERR "Skipped line (missing #=?): $_\n";
            next;
        }

        if ($State eq "XlateParams") {
            my $Key;
            my $Param;

            ($Key, $Param) = /^"(.*?)",\s+"(.*)"/g;
            $XlateParams{$Key} = $Param;
        } elsif ($State eq "Subs") {
            my $Search;
            my $Replace;

            ($Search, $Replace) = /^"(.*?)",\s+"(.*)"/g;
            push @Subs, [$Search, $Replace];
        } elsif ($State eq "ElementSubs") {
            my $Tag;
            my $Action;

            ($Tag, $Action) = /^"(.*?)",\s+"(.*)"/g;
            if ($Tag =~ /^<a\b/) {
                my ($Mode, $Link) = $Tag =~ /^<a\s+(.*?)=(.*?)>/gi;

                $Link =~ tr/a-zA-Z0-9_//dc;
                $Link =~ s/^[0-9_]*/LinK/;
                ++$Anchors{"$Link:$Mode"};
            } else {
                $ElementSubs {$Tag} = $Action;
            }
        } else {
            print STDERR "Don't know how to handle $State.\n";
            $State = "Searching";
        }
    }
    close inFile;
}

# PrintHeader prints TWiki topic page file meta headers.
#
# outFile is a file handle.
# parentTopicName is the name of the parent page to this one.
sub PrintHeader ($$) {
    my $outFile = shift;
    my $ParentTopicName = shift;
    my $AuthorsWikiName = $XlateParams {"AuthorsWikiName"};
    my $now = time;

    print $outFile "%META:TOPICINFO{author=\"$AuthorsWikiName\"".
" date=\"$now\" format=\"1.0\" version=\"1.2\"}%\n"; print $outFile "%META:TOPICPARENT{name=\"$ParentTopicName\"}%\n"; print $outFile "<noautolink>\n"; } #FinalFixup performs a final pass through the created files to fix up +anchor #links and poorly handled symbol translations. # #FinalFixup also tidies up tables by increasing the number of cells in + each row #to match the number in the row containing the greatest number of cell +s. This #causes the last cell on each such widened row to span the remaining w +idth of #the table. sub FinalFixup () { foreach my $Filename (@WikiFiles) { open (inFile, "<$Filename"); my @Lines = <inFile>; close inFile; my $LineNum = 0; my $TableStart = undef; my $TableEnd = undef; my $CellCount = 0; foreach my $Line (@Lines) { chomp $Line; my $IsRow = $Line =~ m/^\|/; $TableStart = $LineNum if $IsRow and ! $TableStart; if ($TableStart and $IsRow) {# Scan table lines my $cells = $Line =~ tr/|//; $CellCount = $cells if $cells > $CellCount; $TableEnd = $LineNum; } if ($TableStart and ! $IsRow) {# End of table foreach my $Line (@Lines [$TableStart .. $TableEnd]) { my $cells = $Line =~ tr/|//; $Line .= "|" x ($CellCount - $cells) if $cells < $CellCount; } $CellCount = 0; $TableStart = undef; } my $RefPos = index ($Line, '[[#'); if ($RefPos != -1) {# Fix up the reference my ($Ref) = $Line =~ /\[\[#(.*?)]/g; substr $Line, $RefPos + 2, length ($Ref) + 1, $Anchors{$Ref}; } # Fix up symbols foreach my $Row (@Subs) { my @Pair; @Pair [0, 1] = @$Row; $Line =~ s/\Q$Pair[0]\E/$Pair[1]/g; } } continue {++$LineNum;} open (outFile, ">~$Filename"); print outFile join "\n", @Lines; close outFile; unlink $Filename; rename "~$Filename", $Filename; } } #Handle a table element. Try to prevent table nastyness escaping to th +e #remainder of the document! sub ConvertTable { my $this = shift; return ConvertTableHTML ($this->as_HTML()); } sub ConvertTableHTML { my $tableAsHTML = shift; my $tp = HTML::TableContentParser->new; my $tableCleanHTML = ''; $tp->parse ($tableAsHTML); for my $table (@{$tp->parse($tableAsHTML)}) { for my $row (@{$table->{rows}}) { $tableCleanHTML .= '<tr>'; for my $cell (@{$row->{cells}}) { my $data = $cell->{data} || ''; $data =~ s/\n|<br>/<brr>/g if defined $data; $tableCleanHTML .= "<td>$data</td>"; } $tableCleanHTML .= "</tr>\n"; } } return Convert ($tableCleanHTML, 1); } sub BuildTree ($) { my $Tree = HTML::TreeBuilder->new (); $Tree->ignore_unknown (0); $Tree->attr_encoded (0); $Tree->parse (shift); $Tree->eof (); $Tree = $Tree->guts (); return $Tree; } #Does the HTML to TWiki conversion using the translation tables and pa +rameters #that have already been read in from the translation file. sub Convert ($) { my ($html, $tableMode) = @_; $tableMode ||= 0; my $Tree = BuildTree ($html); return '' if ! 
        ! defined $Tree;

    my $WikiText;
    my $empty_element_map = $Tree->_empty_element_map;
    my(@C) = [$Tree]; # a stack containing lists of children
    # I is a stack of indexes to current position in corresponding lists in @C
    # In each of these, 0 is the active point
    my(@I) = (-1);    # initial value must be -1 for each list
    my @Context = ""; # Contains stack of current nodes
    my @AnchorStack;
    my @QueuedAnchors; # Anchors queued for the end of a table
    my $this;          # current node
    my $content_r;     # child list of $this
    my $TagName;
    my $InOList = 0;
    my $InUList = 0;
    my $ParaCount = 0;
    my $InHeader = 0;

    # Loop over the tree
    while (@C) {
        # Post processing
        # Move to next item in this frame
        if(!defined($I[0]) or ++$I[0] >= @{$C[0]}) {
            $this = $Context [0];
            if (defined $this and ref $this) {
                my $StartTag = $this->starttag ();
                my $Action = $ElementSubs {$StartTag};

                $StartTag = $Action if defined ($Action) and $Action ne "";
                if ($StartTag =~ /^<p\b/) {
                    $WikiText .= $tableMode ? "<brr>" : "<br><br>" if ! $InOList and ! $InUList;
                } elsif ($StartTag =~ /^<h[1-6]\b/) {
                    $WikiText .= "<br><br>" if --$InHeader == 0;
                }
                elsif ($StartTag =~ /^<title\b/) {$WikiText .= "<br><br>";}
                elsif ($StartTag =~ /^<b\b/)     {$WikiText .= "</b>";}
                elsif ($StartTag =~ /^<tr\b/)    {$WikiText .= "<br>";}
                elsif ($StartTag =~ /^<td\b/)    {$WikiText .= "|";}
                elsif ($StartTag =~ /^<code\b/)  {$WikiText .= "</code>";}
                elsif ($StartTag =~ /^<li\b/)    {$WikiText .= "<br>" if ! $tableMode;}
                elsif ($StartTag =~ /^<a href\b/ and @AnchorStack) {
                    # Generate a link if there is a matching target
                    my ($Link) = $StartTag =~ /^<a\s+(?:.*?)=(.*?)>/gi;

                    $Link =~ tr/a-zA-Z0-9_//dc;
                    $Link =~ s/^[0-9_]*/LinK/;

                    my $Text = shift @AnchorStack;

                    $Text = 'here' if ! defined $Text;
                    $WikiText .= "<a $Link:$Text>" if defined $Anchors{"$Link:name"};
                } elsif ($StartTag =~ /^<a name\b/) {
                    # Generate target if there is a matching link
                    my ($Link) = $StartTag =~ /^<a\s+(?:.*?)=(.*?)>/gi;

                    $Link =~ tr/a-zA-Z0-9_//dc;
                    $Link =~ s/^[0-9_]*/LinK/;
                    if (defined $Anchors{"$Link:href"}) {
                        my $Text;

                        $Text .= "<br>" if ! ($WikiText =~ /<br>$/);
                        $Text .= "<a $Link><br>";
                        $WikiText .= $Text;
                    }
                } elsif ($StartTag =~ /^<ol\b/) {
                    $WikiText .= "<br><br>";
                    --$InOList;
                } elsif ($StartTag =~ /^<ul\b/) {
                    $WikiText .= "<br><br>";
                    --$InUList;
                }
            }

            shift @Context;
            shift @I;
            shift @C;
            next;
        }

        $this = $C[0][$I[0]];
        if (! ref $this) { # Add the text
            $this =~ s/\G(.*?)\n/$1 /gs;
            $this =~ s/(.{70,80}\s+)/$1<br>/g if ! $InOList and ! $InUList and ! $tableMode;
            chomp $this;
            $this =~ s/\xA0/ /gs;
            $this =~ s/\x09/<3sp>/gs;

            if (@AnchorStack) {
                my $Index = 0;

                while ($Index < @AnchorStack) {
                    $AnchorStack[$Index] .= $this;
                    ++$Index;
                }
            } else {
                $WikiText .= $this;
            }
        } else { # Process this element
            my $StartTag = $this->starttag ();

            $StartTag =~ s/[\r\n]*//gs;

            # Ignore elements nested in headers except anchors
            if ($InHeader and
                $StartTag !~ /^<a\b/) {
                print "In header, ignoring $StartTag\n" if $Noisy;
                $StartTag = "-";
            }

            my $Action = $ElementSubs {$StartTag};

            $StartTag = $Action if defined ($Action) and $Action ne "";
            if ($StartTag =~ /^<table\b/) {
                my $newText = ConvertTable ($this);

                $WikiText .= $newText;
                $this->delete_content ();
            } elsif ($StartTag =~ /^<h([1-6])\b/) {
                $WikiText .= "<h$1>" if ++$InHeader == 1;
            }
            elsif ($StartTag =~ /^<b>/)     {$WikiText .= "<b>"}
            elsif ($StartTag =~ /^<br\b/)   {$WikiText .= "<br>"}
            elsif ($StartTag =~ /^<brr\b/)  {$WikiText .= "<brr>"}
            elsif ($StartTag =~ /^<tr\b/)   {$WikiText .= "|"}
            elsif ($StartTag =~ /^<code>/)  {$WikiText .= "<code>"}
            elsif ($StartTag =~ /^<title>/) {$WikiText .= "<h1>"}
            elsif ($StartTag =~ /^(<li[aAiI]?)\b/) {$WikiText .= $InOList ? "$1>" : "<li>";}
            elsif ($StartTag =~ /^<a href\b/) {
                unshift @AnchorStack, "";
            } elsif ($StartTag =~ /^<ol\b/) {
                $WikiText .= "<br><br>" if ! $InOList and ! $InUList;
                ++$InOList;
            } elsif ($StartTag =~ /^<ul\b/) {
                $WikiText .= "<br><br>" if ! $InOList and ! $InUList;
                ++$InUList;
            } elsif ($StartTag =~ /^<img\b/) {
                $WikiText .= $StartTag;
            }
        }

        # Now queue up content list for the current element...
        if( ref $this and not ( # ...except for those which
            not($content_r = $this->{'_content'} and @$content_r)
                and # ...have empty content lists
            $this->{'_empty_element'} || $empty_element_map->{$this->{'_tag'} || ''}
                # ...and that don't get post-order callbacks
        ) ) {
            unshift @Context, $this;
            unshift @I, -1;
            unshift @C, $content_r || [];
        }
    }

    print "Generated marked up TWiki\n" if $Noisy;

    my $sp   = qr/(?:\s|<3sp>)/;
    my $face = qr/(?:b|i|code)/;
    my $pre  = qr/(?:<ul>|<ol>|<h[1-6]>|<a [a-zA-Z0-9_]+>)/;

    $WikiText =~ s/<br>\s+/<br>/g; # Remove leading spaces
    $WikiText =~ s/\s+<br>/<br>/g; # Remove trailing spaces
    $WikiText =~ s/(<h[1-6]>)(?:<br>)+/$1/g;
    print "Removed spurious white space at line ends\n" if $Noisy;

    my $Touched;

    do {
        $Touched = 0;
        # Remove multiple blank lines
        $Touched |= $WikiText =~ s/(?:<br>){3,}/<br><br>/g;
        # Remove empty face elements
        $Touched |= $WikiText =~ s/<($face)>((?:<br>|$sp)*)<\/\1>/$2/g;
        # Migrate various tags adjacent to text
        $WikiText =~ s/((?:<br>)+)(<a [^>]*:.*?>)/$2$1/g;
        $Touched |= $WikiText =~ s/(<$face>)((?:<br>|$sp|$pre|\W+)+)/$2$1/g;
        $Touched |= $WikiText =~ s/((?:<br>|$sp|$pre|\W+)+)<\/($face)>/<\/$2>$1/g;
    } while ($Touched);
    print "Removed various empty elements\n" if $Noisy;

    $WikiText =~ s/<br><br>($pre)/<br>$1/g;
    $WikiText =~ s/($pre)($pre)/$1<br>$2/g;
    $WikiText =~ s/(<\/?code><\/?b>|<\/?b><\/?code>)/==/g;
    $WikiText =~ s/(<\/?i><\/?b>|<\/?b><\/?i>)/__/g;
    $WikiText =~ s/<\/?b>/*/g;
    $WikiText =~ s/<\/?i>/_/g;
    $WikiText =~ s/<\/?code>/=/g;
    $WikiText =~ s/($pre|<br>)(?:$sp)+/$1/g;
    $WikiText =~ s/<br>(?:•|<ul>)/<br>   * /g;
    $WikiText =~ s/<ol>/<br>   1 /g;
    $WikiText =~ s/<ol([a|A|i|I])>/<br>   $1 /g;
    print "Inserted list line prefixes\n" if $Noisy;

    $WikiText =~ s/<3sp>/   /g;
    $WikiText =~ s/<h([1-6])>(?{"+"x$1})/<br>---$^R /g;
    print "Inserted header line prefixes\n" if $Noisy;

    $WikiText =~ s/<a ([^>]*):(.*?)>/[[#$1][$2]]/g;
    $WikiText =~ s/<a (.*?)>/#$1/g;
    print "Inserted links\n" if $Noisy;

    # Put in the line breaks
    $WikiText =~ s/<br>/\n/gs;
    $WikiText =~ s/\n   \* \n/\n/gs;
    $WikiText =~ s/^\n+//gs;
    print "Restored line breaks\n" if $Noisy;

    return $WikiText;
}

sub OutputTWiki ($) {
    my @Lines = split /\n/, shift;
    my @Files;
    my @HeaderOffset;
    my @PageTag;
    my $WikiNameRoot   = $XlateParams {"WikiNameRoot"};
    my $SamePageHeader = $XlateParams {"SamePageHeader"};
    my %ImageFiles;
    my %AddedImages;
    my $FirstHeader = 1;

    unshift
        @PageTag, $XlateParams {"ParentTopicName"};
    unshift @PageTag, "$WikiNameRoot";
    $ParentPageNames {$PageTag[0]} = 1;
    unshift @WikiFiles, ("$PageTag[0].txt");
    open ($Files [0], ">$WikiFiles[0]");
    PrintHeader ($Files [0], $PageTag[1]);
    $HeaderOffset [0] = 1;

    foreach my $Line (@Lines) {
        # Fix up cell breaks
        $Line =~ s/<brr?>\|/|/gi;
        $Line =~ s/\|<brr?>/|/gi;
        $Line =~ s/(^<brr?>\s*|\s*<brr?>$)//gi;
        $Line =~ s/<brr>/<br>/gi;

        my ($Plusses) = $Line =~ /^---(\++)/g;

        # Image file processing
        if ($Line =~ /(<img\b.*?src="(.*?)".*?>)/ && -e $2) {
            my $Filename = "$WikiNameRoot$2";

            if (defined $ImageFiles {$Filename})
                {$Filename .= "-" . ++$ImageFiles {$Filename};}
            else
                {$ImageFiles {$Filename} = 1;}

            sysopen inFile,  $2,        O_BINARY | O_RDONLY;
            sysopen outFile, $Filename, O_BINARY | O_WRONLY | O_CREAT;

            my $Buffer;
            my $Len;

            while ($Len = sysread inFile, $Buffer, 2048)
                {syswrite outFile, $Buffer, $Len;}
            close inFile;
            close outFile;

            $AddedImages {$Filename} = -s $Filename;

            my $Link = "<img src=\"%ATTACHURLPATH%/$Filename\" alt=\"$Filename\"/>";

            substr $Line, $-[0], length $1, $Link;
        }

        # Anchor processing
        if ($Line =~ /^#/) { # Anchor
            my ($Tag) = $Line =~ /^#(\w+)/;

            $Anchors {$Tag} = "$PageTag[0]#$Tag";
        }

        if (defined $Plusses) {$Line =~ s/\*(.*)\*/$1/g;}
        if (! defined $Plusses or length ($Plusses) >= $SamePageHeader) {
            # Non-header line or same page header
            print {$Files [0]} $Line."\n";
            next;
        }

        # Header processing
        $Plusses = length ($Plusses);
        while ($HeaderOffset [0] >= $Plusses and @Files > 1) { # Pop a level
            print {$Files [0]} "</noautolink>\n";
            foreach my $ImageName (sort keys %AddedImages) {
                my $MetaLine = "%META:FILEATTACHMENT{name=\"";

                $MetaLine .= $ImageName;
                $MetaLine .= "\" attr=\"h\" comment=\"\" date=\"";
                $MetaLine .= time;
                $MetaLine .= "\" path=\"";
                $MetaLine .= $ImageName;
                $MetaLine .= "\" size=\"";
                $MetaLine .= $AddedImages {$ImageName};
                $MetaLine .= "\" user=\"";
                $MetaLine .= $XlateParams {"AuthorsWikiName"};
                $MetaLine .= "\" version=\"1.1\"}%\n";
                print {$Files [0]} $MetaLine;
            }
            %AddedImages = ();
            close $Files [0];
            shift @Files;
            shift @HeaderOffset;
            shift @PageTag;
        }

        # Now on right page for header
        if ($Plusses == 1 && $FirstHeader or $Plusses >= $SamePageHeader) {
            # Stay on same page
            $FirstHeader &&= $Plusses != 1;
            print {$Files [0]} $Line."\n";
            next;
        }

        # Push a level
        my $HeaderText = (substr $Line, $Plusses + 3);
        my $Tag = $HeaderText;

        $Tag =~ tr/0-9A-Za-z //cd;
        $Tag =~ tr/a-z/A-Z/;
        $Tag = join "", Abbreviate (split " ", $Tag);

        my $PageName = $WikiNameRoot.$Tag;

        if (exists $ParentPageNames{$PageName}) {
            $PageName .= ++$ParentPageNames {$PageName};
        } else {
            $ParentPageNames {$PageName} = 1;
        }
        unshift @PageTag, $PageName;
        unshift @WikiFiles, ("$PageTag[0].txt");
        unshift @HeaderOffset, $Plusses;
        print {$Files [0]} "---" . "+" x $Plusses .
"[[$PageTag[0]][$HeaderT +ext]]\n"; unshift @Files, undef; open ($Files [0], ">$WikiFiles[0]"); PrintHeader ($Files [0], $PageTag[1]); print {$Files [0]} $Line."\n"; print "$WikiFiles[1] parent of $WikiFiles[0]\n" if $Noisy; } foreach my $ImageName (sort keys %AddedImages) { my $MetaLine = "%META:FILEATTACHMENT{name=\""; $MetaLine .= $ImageName; $MetaLine .= "\" attr=\"h\" comment=\"\" date=\""; $MetaLine .= time; $MetaLine .= "\" path=\""; $MetaLine .= $ImageName; $MetaLine .= "\" size=\""; $MetaLine .= $AddedImages {$ImageName}; $MetaLine .= "\" user=\""; $MetaLine .= $XlateParams {"AuthorsWikiName"}; $MetaLine .= "\" version=\"1.1\"}%\n"; print {$Files [0]} $MetaLine; } %AddedImages = (); do { print {$Files [0]} "</noautolink>\n"; close $Files [0]; shift @Files; } while (@Files); FinalFixup (); print "Done\n"; } sub HelpAndExit { print shift while scalar @_; print "\n"; print "html2wiki [-n] <source document>\n"; print "\n"; print " -n Turn on noisy mode. Lists character and tag/attribute subs +titutions\n"; print " as they are found in the first pass and progress in the se +cond pass.\n"; print " <source document> The document to convert from HTML to TWiki +format.\n"; exit (-1); } sub SetDefaults ($) { $XlateParams {"WikiNameRoot"} = shift; $XlateParams {"SamePageHeader"} = 4; $XlateParams {"ParentTopicName"} = "Main"; $XlateParams {"AuthorsWikiName"} = "TWikiGuest"; LoadXlateFile ("ElementSubs "); } #The main block processes the command line to retreive the file name o +f the #document to process. It then checks to see if the translation file ex +ists and #either calls GenerateXlateFile to generate it, or calls ParseXlateFil +e and #Convert to perform the conversion. my $Param = shift; HelpAndExit if ! defined $Param; if ($Param =~ /^-n$/) { $Noisy = 1; $OriginalName = shift; } else {$OriginalName = $Param;} HelpAndExit ("Document file name required\n") if ! defined $OriginalNa +me; HelpAndExit ("Document file <$OriginalName> not found\n") if ! (-e $Or +iginalName); (my $WikiNameRoot = $OriginalName) =~ s/(.*)\..*?$/$1/; $WikiNameRoot =~ tr/ \-_a-zA-Z0-9\x00-\xFF/ a-zA-Z0-9/d; # Strip "na +sty" characters my @Words; push @Words, ucfirst $_ foreach split " ", $WikiNameRoot; $WikiNameRoot = join "", Abbreviate (@Words); $TranslateName = "$OriginalName.html2wiki"; SetDefaults ($WikiNameRoot); my $html = ""; open inFile, "<$OriginalName"; while (<inFile>) { chomp; if (($html =~ /[^\s>]$/) and (/^\b/)) {$html .= " ";} s/ / /g; $html .= $_; } close inFile; $html =~ s/[\r\n]+/ /gs; $html =~ s/<!--.*?-->//g; # Delete comments print "Loaded $OriginalName\n" if $Noisy; print "Parsed html\n" if $Noisy; if (! (-e $TranslateName) || -M $TranslateName > -M $OriginalName) { my $Tree = BuildTree ($html); GenerateXlateFile ($Tree); print "Translate file has been generated as: $TranslateName"; } else { print "Converting to TWiki\n" if $Noisy; LoadXlateFile (); OutputTWiki (Convert ($html)); print "Loaded configuration information\n" if $Noisy; if (1 == @WikiFiles) { print "Conversion is complete. The following file was generated: $ +WikiFiles[0]\n"; } else { print "Conversion is complete. The following files were generated: +\n"; printf " $_\n" while $_ = pop @WikiFiles; } }