As you might have guessed, HTML::Parser is your friend, but there isn't a direct "in-place" system for editing. The Parser, however, does give you enough information to weasel your way around, should you require it.

As I described in an earlier post, you can ask Parser for 'offset' and 'length' information on the bits and pieces it gives you, and these are relative to the scalar you sent to the parser in the first place. These will enable you to re-write parts, should you so desire.
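To make the 'offset' and 'length' idea concrete, here is a minimal sketch that records where each start tag sits in the original scalar. The sample HTML string and the @spans array are just things I made up for illustration; the argspec names 'tagname', 'offset' and 'length' are the ones HTML::Parser's version 3 API provides.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use HTML::Parser;

# Made-up sample input, purely for illustration.
my $html = '<p>Hi <b>there</b></p>';

# Each entry will be [ tagname, offset, length ].
my @spans;

my $p = HTML::Parser->new(
        api_version => 3,
        start_h =>
            [
                sub
                {
                        my ($tagname, $offset, $length) = @_;
                        push @spans, [ $tagname, $offset, $length ];
                },
                'tagname,offset,length'
            ],
);

$p->parse($html);
$p->eof;

# substr() with the reported offset and length gives back
# the tag exactly as it appeared in the source.
foreach my $span (@spans)
{
        my ($tagname, $offset, $length) = @$span;
        printf "%s: offset=%d length=%d source=%s\n",
                $tagname, $offset, $length,
                substr($html, $offset, $length);
}
```

Since substr() hands back the tag's source text verbatim, the same two numbers are all you need to copy or replace that stretch of the document.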

You mentioned wanting to convert content that lives inside a table cell, such as something that is inside a TD tag. So, you will want to watch out for TD tags, and to act accordingly. This could be handled by a sub which re-writes to the desired output, given the entire contents of the TD cell, tags and all.

Here's a quick hack at it:

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Parser;

sub Convert
{
        my ($what) = @_;

        return "'$what'";
}

sub Fixerizer
{
        my ($content) = @_;

        my ($content_start) = 0;
        my ($content_end)   = 0;

        my ($fixed_content) = '';
        my (@mods);

        # &$TagStart() handles the opening of tags: <TD>
        my ($TagStart) = sub
        {
                my ($tagname, $attr, $offset, $length) = @_;

                # If this is a <TD> type tag...
                if ($tagname eq 'td')
                {
                        # ...make a note of where the contents
                        # of it should start.

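                        # If an earlier <TD> is still open (its
                        # </TD> was omitted), this tag closes it
                        # implicitly, so convert its contents first.
                        if ($content_start > $content_end)
                        {
                                $fixed_content .=
                                        Convert(
                                            substr(
                                                $content,
                                                $content_start,
                                                $offset-$content_start
                                            )
                                        );
                                $content_end = $offset;
                        }

                        # Then copy any other HTML up to
                        # the end of this tag
                        $fixed_content .=
                                substr(
                                        $content,
                                        $content_end,
                                        $offset+$length-$content_end
                                );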

                        # Synchronize, stop copying...
                        $content_start = $offset+$length;
                        $content_end   = $content_start-1;
                }
        };

        # &$TagEnd() handles the closing of tags: </TD>
        my ($TagEnd) = sub
        {
                my ($tagname, $offset, $length) = @_;

                # Check for any tag which might close out
                # the <TD>, and handle busted HTML
                # which is lazy: '<TR><TD></TR>'
                if (($tagname eq 'td'
                  || $tagname eq 'tr'
                  || $tagname eq 'table')
                 && ($content_start > $content_end))
                {
                        # Add in the modified content
                        $fixed_content .=
                                Convert(
                                    substr(
                                        $content,
                                        $content_start,
                                        $offset-$content_start
                                    )
                                );

                        # And the tag itself
                        $fixed_content .=
                                substr(
                                        $content,
                                        $offset,
                                        $length,
                                );

                        # Synchronize, stop copying
                        $content_end   = $offset+$length;
                        $content_start = $content_end - 1;
                }
        };

        # Whip up a new HTML::Parser object with the
        # above-defined handlers hooked in.
        my ($hp) = HTML::Parser->new(
                        api_version => 3,
                        start_h =>
                            [
                                $TagStart,
                                'tagname,attr,offset,length'
                            ],
                        end_h =>
                            [
                                $TagEnd,
                                'tagname,offset,length'
                            ],
                );

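        # Et voila! Calling eof() tells the parser the document
        # is complete, so anything it has buffered gets flushed.
        $hp->parse($content);
        $hp->eof;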

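        # Don't forget to catch any dangling HTML. Whichever
        # marker is further along holds the index of the first
        # character that has not been copied yet.
        my ($resume) = ($content_start > $content_end)
                ? $content_start : $content_end;

        $fixed_content .=
                substr(
                        $content,
                        $resume,
                ) if ($resume < length($content));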

        # Ship back the modified version.
        return $fixed_content;
}

To make this "go", you would want to use it like this:

      print Fixerizer(open (TEST, "test.html") && join ('', <TEST>));

Please excuse this one-line hack. A real program would be much more careful.
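For what it's worth, a more careful take on that one-liner might look like the sketch below. slurp_file() is a name I made up for illustration. It uses a lexical filehandle and three-argument open, and it dies with the reason if the file can't be opened.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one scalar,
# or die with the reason if it cannot be opened.
sub slurp_file
{
        my ($path) = @_;

        open(my $fh, '<', $path)
                or die "Cannot open $path: $!\n";

        # Undefine the record separator so that a single
        # read returns the entire file.
        local $/;
        my $content = <$fh>;
        close($fh);

        return defined $content ? $content : '';
}

# Usage, assuming Fixerizer() from above lives in the same file:
#   print Fixerizer(slurp_file('test.html'));
```

Keeping the file handling in its own sub means Fixerizer() still only deals with a scalar, so it works just as well on HTML that came from somewhere other than a file.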

This will take input such as my 'test.html':

      <TABLE BORDER=0>
      <TR>
      <TD ALIGN=left>My Friend</TD>
      <TD ALIGN=up>My Other Friend</TD>
      </TR>
      <TR>
      <TD ALIGN=left>My Friend</TD>
      </TR>
      </TABLE>

And return:

      <TABLE BORDER=0>
      <TR>
      <TD ALIGN=left>'My Friend'</TD>
      <TD ALIGN=up>'My Other Friend'</TD>
      </TR>
      <TR>
      <TD ALIGN=left>'My Friend'</TD>
      </TR>
      </TABLE>

This simply puts single quotes around whatever is in the cell, which isn't very daring or bold. Replace the body of Convert() to suit your particular application.
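As one illustration of that kind of customization, here is a possible drop-in replacement for Convert(). It's only a sketch: it trims whitespace around the cell contents and wraps them in <B> tags, and the choice of bold is arbitrary.

```perl
use strict;
use warnings;

# Hypothetical replacement for Convert(): trim surrounding
# whitespace, then wrap the cell contents in <B> tags.
sub Convert
{
        my ($what) = @_;

        $what =~ s/^\s+//;
        $what =~ s/\s+$//;

        return "<B>$what</B>";
}
```

Because Convert() gets the entire contents of the cell, tags and all, anything fancier (say, touching only the text and leaving nested markup alone) would need to look inside $what itself.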

Enjoy.

In reply to Re: Using HTML::Parser to edit files in place by tadman
in thread Using HTML::Parser to edit files in place by markjugg
