Re: Keeping bad HTML bad

You're going to have problems with HTML parsers since, as everybody has pointed out, it's not really HTML.

If you cannot force whoever (or whatever) is producing the broken HTML to stick to standards, the easiest alternative is to treat it as a string or a sequence of tags rather than as a tree structure.
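To make the "sequence of tags" idea concrete, here is a rough sketch in plain Perl. The helper names split_tags and set_img_alt are made up for illustration, and the pattern assumes no attribute value contains a literal ">". It only shows that you can change one tag and leave the rest of the broken markup byte-for-byte as it was:

```perl
use strict;
use warnings;

# Hypothetical helper: split markup into tag and text chunks, in document order.
# The capturing group makes split() return the tags as well as the text.
# Caveat: a literal ">" inside an attribute value will confuse this pattern.
sub split_tags {
    my ($html) = @_;
    return grep { length } split /(<[^>]*>)/, $html;
}

# Hypothetical helper: set alt on the Nth <img> (0-based), touching nothing else.
sub set_img_alt {
    my ($html, $index, $alt) = @_;
    $alt =~ s/&/&amp;/g;
    $alt =~ s/"/&quot;/g;

    my @chunks = split_tags($html);
    my $seen   = 0;
    for my $chunk (@chunks) {
        next unless $chunk =~ /^<img\b/i;
        next unless $seen++ == $index;

        # Replace an existing alt attribute, or add one before the closing ">".
        my $value = qr/"[^"]*"|'[^']*'|[^\s>]+/;
        $chunk =~ s/(\s)alt\s*=\s*$value/$1alt="$alt"/i
            or $chunk =~ s{\s*(/?)>\z}{ alt="$alt"$1>};
        last;
    }
    return join '', @chunks;
}

# Deliberately broken input: the <b> is never closed.
my $html = q{<p>one<img src=a.gif><b>two<img src="b.gif" alt='old'></p>};
print set_img_alt($html, 1, 'new text'), "\n";
```

Because nothing is ever built into a tree, unclosed or misnested tags pass straight through. The price is that you have to handle attribute quoting yourself.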

I had a similar problem several years back, which I resolved by simply adding special comments around the content that the user had to edit. Something like:


some stuff
<!-- start editable foo/bar -->
some more stuff
<!-- end editable foo/bar -->
even more stuff

The "editable" stuff could then be extracted with some simple regexes.
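For what it's worth, the regexes can be as simple as the sketch below. The function names extract_editable and replace_editable are hypothetical, and it assumes region names contain no whitespace and that regions are not nested:

```perl
use strict;
use warnings;

# Hypothetical helper: return (name => content) for every marked region.
# The backreference \1 insists that the end marker names the same region.
sub extract_editable {
    my ($doc) = @_;
    my %region;
    while ($doc =~ m{<!--\s*start editable\s+(\S+)\s*-->(.*?)<!--\s*end editable\s+\1\s*-->}gs) {
        $region{$1} = $2;
    }
    return %region;
}

# Hypothetical helper: swap in new content for one named region and
# leave everything outside the two marker comments exactly as it was.
sub replace_editable {
    my ($doc, $name, $new) = @_;
    $doc =~ s{(<!--\s*start editable\s+\Q$name\E\s*-->).*?(<!--\s*end editable\s+\Q$name\E\s*-->)}{$1$new$2}s;
    return $doc;
}

my $doc = join "\n",
    'some stuff',
    '<!-- start editable foo/bar -->',
    'some more stuff',
    '<!-- end editable foo/bar -->',
    'even more stuff',
    '';

my %region = extract_editable($doc);
print "foo/bar contains: $region{'foo/bar'}";
print replace_editable($doc, 'foo/bar', "\nedited stuff\n");
```

The non-greedy .*? together with the /s flag lets a region span several lines without running on into the next region.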

Without some more info on what kind of transformations you're trying to apply to the source it's a little difficult to give more specific advice. Can you give us more of an idea of what you're trying to do?


Replies are listed 'Best First'.
Re: Re: Keeping bad HTML bad
by trs80 (Priest) on Aug 24, 2002 at 21:37 UTC

This is a good suggestion, but in my case I am very limited in what I can do for the user as far as the HTML goes, and all comments are removed from every page processed (at the client's request). I go into some specifics in one of my earlier replies, but to rephrase and recap what I am doing:

Retrieve the remote document via HTTP ( LWP::UserAgent, HTTP::Request )
Parse the document for local storage and confirm that its format isn't horribly disgusting ( HTML::TreeBuilder )
Allow editing of the title tag, meta tags, anchor tag title attribute, and img tag alt attribute.

The forms for the editing are created by relying on where each tag is located inside the element array created by HTML::TreeBuilder. That is, if a person chooses to edit alt attributes, each img tag is located using the look_down method in list context:

use CGI qw(textfield);

my @img   = $tree->look_down('_tag', 'img');
my $count = 0;
my $form  = '';
foreach my $element (@img) {
    # make a form element named "img-$count", prefilled with the current alt text
    # (textfield is only one possible choice of CGI function here)
    $form .= textfield(-name => "img-$count", -default => $element->attr('alt'));
    $count++;
}
return $form;

Then, when they submit the form, the $count is referenced and the appropriate img tag's alt content is replaced.

But this is all moot, since the issue was and is that HTML::TreeBuilder is "supposed" to handle bad HTML: it uses HTML::Parser, and one of the goals of HTML::Parser is to work with documents that are really out there. The example given should work with HTML::TreeBuilder, and in fact it does. Part of my problem was not turning off implicit_tags, as one of my other replies above states. The implicit_tags option is unique to the HTML::TreeBuilder module and attempts to correct badly formatted HTML, which 98% of the time is most likely a good thing, but at least the author designed in the ability to turn that behavior off for the 2% of the time it isn't.

I have tested my ideas and have confirmed that turning that flag off allows for the conditions I need, but results in a different anomaly, which I have contacted the author of the module about.
