in reply to Using HTML::Parser to edit files in place

As you might have guessed, HTML::Parser is your friend, but there isn't a direct "in-place" system for editing. The Parser, however, does give you enough information to weasel your way around, should you require it.

As I described in an earlier post, you can ask Parser for 'offset' and 'length' information on the bits and pieces it gives you, and these are relative to the scalar you sent to the parser in the first place. With those two numbers you can substr() any piece back out of the original and re-write it, should you so desire.
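
To make the offset/length idea concrete, here is a minimal sketch that records where each start tag sits and then pulls it back out of the original scalar with substr(). The sample string and the @tags array are made up for illustration; the 'tagname,offset,length' argspec is the same one used in the code further down.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Illustrative input, not taken from the original post.
my $html = '<p>Hello <b>world</b></p>';
my @tags;    # one [tagname, offset, length] record per start tag

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub { push @tags, [@_] },
        'tagname,offset,length',
    ],
);
$p->parse($html);
$p->eof;

# substr() on the original scalar recovers each tag exactly as written.
for my $t (@tags) {
    my ($name, $offset, $length) = @$t;
    printf "%-3s at offset %2d, length %d: %s\n",
        $name, $offset, $length, substr($html, $offset, $length);
}
```

Because the offsets point into the scalar you handed to parse(), the same trick works for copying the untouched stretches between tags.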

You mentioned wanting to convert content that lives inside a table cell, such as something that is inside a TD tag. So, you will want to watch out for TD tags, and to act accordingly. This could be handled by a sub which re-writes to the desired output, given the entire contents of the TD cell, tags and all.


Here's a quick hack at it:
#!/usr/bin/perl
use strict;
use HTML::Parser;

# Convert() receives the entire contents of one TD cell,
# tags and all, and returns the replacement text.
sub Convert {
    my ($what) = @_;
    return "'$what'";
}

sub Fixerizer {
    my ($content) = @_;
    my ($content_start) = 0;
    my ($content_end)   = 0;
    my ($fixed_content) = '';

    # "Inside a <TD>" is encoded as $content_start > $content_end.
    # Outside a cell, $content_end is the first byte not yet copied.

    # &$TagStart() handles the opening of tags: <TD>
    my ($TagStart) = sub {
        my ($tagname, $attr, $offset, $length) = @_;

        # If this is a <TD> type tag...
        if ($tagname eq 'td') {
            # ...make a note of where the contents
            # of it should start.
            # First, copy any other HTML up to
            # the end of this tag
            $fixed_content .= substr(
                $content, $content_end,
                $offset + $length - $content_end
            );

            # Synchronize, stop copying...
            $content_start = $offset + $length;
            $content_end   = $content_start - 1;
        }
    };

    # &$TagEnd() handles the closing of tags: </TD>
    my ($TagEnd) = sub {
        my ($tagname, $offset, $length) = @_;

        # Check for any tag which might close out
        # the <TD>, and handle busted HTML
        # which is lazy: '<TR><TD></TR>'
        if (($tagname eq 'td' || $tagname eq 'tr' || $tagname eq 'table')
            && ($content_start > $content_end))
        {
            # Add in the modified content
            $fixed_content .= Convert(
                substr($content, $content_start, $offset - $content_start)
            );

            # And the tag itself
            $fixed_content .= substr($content, $offset, $length);

            # Synchronize, stop copying
            $content_end   = $offset + $length;
            $content_start = $content_end - 1;
        }
    };

    # Whip up a new HTML::Parser object with the
    # above-defined handlers hooked in.
    my ($hp) = HTML::Parser->new(
        api_version => 3,
        start_h => [ $TagStart, 'tagname,attr,offset,length' ],
        end_h   => [ $TagEnd,   'tagname,offset,length' ],
    );

    # Et voila!
    $hp->parse($content);
    $hp->eof;    # flush anything the parser is still buffering

    # Don't forget to catch any dangling HTML. Resume from the
    # first uncopied byte: $content_start if a <TD> was left open,
    # otherwise $content_end. (Starting at $content_end+1 would
    # drop one character whenever the last cell was closed.)
    my ($tail) = $content_start > $content_end
        ? $content_start
        : $content_end;
    $fixed_content .= substr($content, $tail)
        if ($tail < length($content));

    # Ship back the modified version.
    return $fixed_content;
}
To make this "go", you would want to use it like:

    print Fixerizer(open (TEST, "test.html") && join ('', <TEST>));

Please excuse this one line hack. A real program would be much more careful.
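
As a rough idea of what "more careful" could mean, the sketch below reads the file through a small helper that dies with a reason if the open fails, instead of silently passing a false value along. The name slurp_file() is hypothetical and not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one scalar,
# or die with the reason if that is not possible.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    local $/;                 # undef record separator: read it all at once
    my $content = <$fh>;
    close($fh) or die "Cannot close $path: $!\n";
    return defined $content ? $content : '';
}

# Usage, assuming Fixerizer() is defined in the same script:
#   print Fixerizer(slurp_file('test.html'));
```

The lexical filehandle also avoids leaving a global TEST handle open after the call.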

This will take input such as my 'test.html':
<TABLE BORDER=0>
<TR>
<TD ALIGN=left>My Friend</TD>
<TD ALIGN=up>My Other Friend</TD>
</TR>
<TR>
<TD ALIGN=left>My Friend</TD>
</TR>
</TABLE>
And return:
<TABLE BORDER=0>
<TR>
<TD ALIGN=left>'My Friend'</TD>
<TD ALIGN=up>'My Other Friend'</TD>
</TR>
<TR>
<TD ALIGN=left>'My Friend'</TD>
</TR>
</TABLE>
This simply puts single quotes around whatever is in the cell, which isn't very daring or bold. This can be customized to suit your particular application.

Enjoy.

Re: Re: Using HTML::Parser to edit files in place
by markjugg (Curate) on Mar 15, 2001 at 01:00 UTC
    I started using this as the basis of my own solution (thanks!), but I realized the problem space was more complex than simply processing each TD tag. When the file is updated from the CGI environment, I need to match up the form fields from the CGI form with the TD tags. That could be accomplished by numbering the form fields in the order they appear in the file. Not so hard.
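
One way the numbering could be sketched is a converter that counts the cells it sees and names each form field after its position. Everything here is an assumption for illustration: the "cell_N" naming scheme, the helper names, and the choice of a plain text input. The sub it returns takes the cell contents and returns the replacement, the same shape as Convert() above.

```perl
use strict;
use warnings;

# Escape the characters that would break an HTML attribute value.
sub escape_attr {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first so later entities survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Returns a converter that numbers each cell it is handed:
# cell_1, cell_2, ... in the order the cells appear in the file.
sub make_numbering_converter {
    my $count = 0;
    return sub {
        my ($what) = @_;
        $count++;
        return sprintf '<input type="text" name="cell_%d" value="%s">',
            $count, escape_attr($what);
    };
}

my $convert = make_numbering_converter();
print $convert->('My Friend'), "\n";
print $convert->('Tom & "Jerry"'), "\n";
```

When the form comes back, the same counting pass over the file would tell you which TD each cell_N value belongs to.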

    This is harder: In the old script, I had a nice trick to figure out what size to make the form input field (and whether to make it a text box or a textarea). I based the size on the largest piece of content in a particular column. To do this, I had to read through the whole table once before I processed the first TD cell.
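
The two-pass idea might look roughly like the sketch below: a first pass finds the longest content per column, and a second step picks a widget from that. The function names, the row layout (array refs of cell strings), and the 60-character cutoff are all assumptions, not details of the old script.

```perl
use strict;
use warnings;

# First pass: find the longest piece of content in each column.
# Each row is an array ref of cell strings.
sub max_column_lengths {
    my (@rows) = @_;
    my @max;
    for my $row (@rows) {
        for my $col (0 .. $#$row) {
            my $len = length $row->[$col];
            $max[$col] = $len
                if !defined $max[$col] || $len > $max[$col];
        }
    }
    return @max;
}

# Second step: pick a widget for a column.
# The 60-character cutoff is an arbitrary assumption.
sub widget_for {
    my ($max_len) = @_;
    return $max_len > 60 ? 'textarea' : 'text';
}

my @rows = (
    [ 'My Friend',       'x' x 80 ],
    [ 'My Other Friend', 'short'  ],
);
my @max = max_column_lengths(@rows);
print join(', ', map { "$_ => " . widget_for($_) } @max), "\n";
```

With a parser-based approach this means collecting every cell first and only generating the form fields once the whole table has been seen.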

    After some contemplation, I realized that all I needed to do to fix the old script was to simply remove any newline characters that appear in the form. Since I control the template file, I could already make sure that each TD appeared on a single line, that the HTML was complete enough, etc.
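
A minimal sketch of that fix, assuming the submitted values are available as plain strings. Note one deliberate difference: this replaces each run of newline characters with a single space rather than deleting it outright, so that words on adjacent lines do not run together. The name strip_newlines() is hypothetical.

```perl
use strict;
use warnings;

# Collapse any run of CR/LF characters in a submitted value into
# one space, so the value written back into the template stays
# on a single line.
sub strip_newlines {
    my ($value) = @_;
    $value =~ s/[\r\n]+/ /g;
    return $value;
}

print strip_newlines("line one\r\nline two\nline three"), "\n";
```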

    This quick fix is almost unfortunate because in many other regards, the script could use several "good style" updates, including: using CGI.pm, using 'strict', separating the code from the design with a template. It was last worked on almost 2 years ago. I've learned a lot since then. :)

    Despite its poor style, it's been a useful tool over the years. Perhaps I'll get around to genericizing it, documenting it and releasing it into the wild.

    -mark