synistar has asked for the wisdom of the Perl Monks concerning the following question:

I need to convert many (over 150) HTML pages that use table layouts into something usable with a predefined set of style sheets. These pages use no paragraph or div tags. They are entirely laid out with tables and table cells.

Here is an example table. I need to remove all the table markup and place all the TDs labeled Column 1 into one div, then place all the cells labeled Column 2 into another div. I would like to do this in Perl to avoid lots of tedious hand editing of the files.

Column 1 Title                    | Column 2 Title
Column 1 content Column 1 content | Column 2 content Column 2 content
Column 1 content Column 1 content | Column 2 Footer

Simply doing a search and replace on the TD tags results in the contents of the two columns coming out intermingled.

To make things even worse, the layouts are not consistent, so the Perl code would need to recognize colspan attributes. Does anyone know of a module or script that already does something like this?

Re: Converting HTML Table Layouts (Linearizing Tables)
by Limbic~Region (Chancellor) on Mar 19, 2004 at 15:48 UTC
Re: Converting HTML Table Layouts (Linearizing Tables)
by halley (Prior) on Mar 19, 2004 at 15:46 UTC
    I guess my question is whether the table tags themselves are sufficient to decide what is "column 1" versus "column 2". That is, is there an overarching table with two cells?
    <table><tr> <td> ... all stuff in column 1 ... </td> <td> ... all stuff in column 2 ... </td> </tr></table>
    If not, then you'll have to be pretty careful about the discriminator, to make sure all the column 1 stuff is cordoned off from the column 2 stuff correctly. If you have colspans that go across the column 1 / column 2 boundary, it will be pretty tricky. Otherwise, counting colspans is not that difficult.

    In either case, you will probably end up using HTML::Parser or HTML::TreeBuilder or other generic HTML tree parsing helpers, and then playing with the parsed tag entities to collect them into separate components. Then, of course, you can output these elements in whatever way you want.
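    As a rough illustration of the approach described above, here is a minimal sketch that walks the rows with HTML::TreeBuilder, counts colspans to find where each cell starts, and sorts the cells into two buckets. The helper names split_columns and as_div, the $boundary parameter, the class names, and the sample markup are all hypothetical; it assumes HTML::TreeBuilder is installed from CPAN and that the number of grid columns belonging to "column 1" is known for the page.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module, not part of core Perl

# Hypothetical helper: sort the cells of a layout table into two buckets.
# $boundary is the number of grid columns that belong to "column 1";
# it is an assumption that has to be chosen per page.
sub split_columns {
    my ($html, $boundary) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my (@left, @right, @straddling);

    # look_down also finds the rows of nested tables; real pages may
    # need a stricter selection than this.
    for my $tr ($tree->look_down(_tag => 'tr')) {
        my $col = 0;    # grid column where the next cell starts
        my @cells = grep { ref $_ && $_->tag =~ /^t[dh]$/ } $tr->content_list;
        for my $cell (@cells) {
            my $span = $cell->attr('colspan') || 1;
            my $text = $cell->as_text;    # as_HTML would keep inner markup
            if    ($col + $span <= $boundary) { push @left,       $text }
            elsif ($col >= $boundary)         { push @right,      $text }
            else                              { push @straddling, $text }
            $col += $span;
        }
    }
    $tree->delete;    # free the parse tree
    return (\@left, \@right, \@straddling);
}

# Hypothetical output format: one div per bucket, one paragraph per cell.
# The text is not re-escaped here, which a real script would need to do.
sub as_div {
    my ($class, $cells) = @_;
    my $out = qq{<div class="$class">\n};
    $out .= "  <p>$_</p>\n" for @$cells;
    return $out . "</div>\n";
}

my $html = <<'HTML';
<table>
<tr><td colspan="2">Column 1 Title</td><td>Column 2 Title</td></tr>
<tr><td>Column 1 content</td><td>Column 1 content</td><td>Column 2 content</td></tr>
<tr><td colspan="2">Column 1 content</td><td>Column 2 Footer</td></tr>
</table>
HTML

my ($left, $right, $straddling) = split_columns($html, 2);
print as_div('column1', $left), as_div('column2', $right);
warn scalar(@$straddling), " cell(s) cross the column boundary\n" if @$straddling;
```

    Cells whose colspan crosses the boundary are collected separately rather than guessed at, since those are the ones that probably need a human decision.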

    --
    [ e d @ h a l l e y . c c ]

Re: Converting HTML Table Layouts (Linearizing Tables)
by Aristotle (Chancellor) on Mar 20, 2004 at 09:35 UTC
    Just use an HTML parser, read the contents into an array of arrays, then output it in row order. The only tricky (but not difficult) part is dealing with column spans. If you have row spans, it's gonna get painful, but not much harder.
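
    One possible reading of that, sketched in plain Perl: place each cell at its top-left slot in an array of arrays, skipping slots already claimed by a colspan or rowspan, then read the grid out one column at a time with each column's cells in row order. The input format (hashes with text, colspan and rowspan keys) and the names build_grid and linearize are assumptions for illustration; the parser step that would produce those hashes is left out.

```perl
use strict;
use warnings;

# Build a grid from rows of cells. Each cell is a hash with a text key
# and optional colspan/rowspan keys (a hypothetical format).
sub build_grid {
    my (@rows) = @_;
    my @grid;     # $grid[$r][$c] holds text at a cell's top-left slot only
    my @taken;    # $taken[$r][$c] is true where some cell occupies the slot
    for my $r (0 .. $#rows) {
        my $c = 0;
        for my $cell (@{ $rows[$r] }) {
            $c++ while $taken[$r][$c];    # skip slots claimed by a rowspan above
            my $cs = $cell->{colspan} || 1;
            my $rs = $cell->{rowspan} || 1;
            $grid[$r][$c] = $cell->{text};
            for my $dr (0 .. $rs - 1) {
                $taken[$r + $dr][$c + $_] = 1 for 0 .. $cs - 1;
            }
            $c += $cs;
        }
    }
    return \@grid;
}

# Read the grid out column by column, each column in row order.
sub linearize {
    my ($grid) = @_;
    my $width = 0;
    for my $row (@$grid) {
        next unless $row;
        $width = @$row if @$row > $width;
    }
    my @out;
    for my $c (0 .. $width - 1) {
        for my $row (@$grid) {
            push @out, $row->[$c] if $row && defined $row->[$c];
        }
    }
    return @out;
}

my @rows = (
    [ { text => 'A', rowspan => 2 }, { text => 'B' } ],
    [ { text => 'C' } ],    # lands in column 1 because A still occupies column 0
    [ { text => 'D', colspan => 2 } ],
);
print join(' ', linearize(build_grid(@rows))), "\n";    # prints: A D B C
```

    A cell that spans several columns is emitted with the first column it covers, which may or may not be what a given layout wants.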

    Makeshifts last the longest.