Re: Munging Rendered HTML While Preserving Formatting

It isn't Perl, but I use lynx for this task:

$nicely_formatted = `lynx -dump $URL`;

Updated: never mind, I misread the question (see the reply below). To be clear, is this the sort of transform you are looking for?

fo<b>o</b>  =>  ba<b>r</b>

If so, see Re^3: Munging Rendered HTML While Preserving Formatting below.

If anyone needs me I'll be in the Angry Dome.

Re^2: Munging Rendered HTML While Preserving Formatting by Limbic~Region (Chancellor) on Jun 28, 2004 at 16:02 UTC
idsfa,
I am not sure I made the problem clear, as your response at first glance doesn't address it. The task is to change foo to bar in rendered HTML while keeping the original formatting. What it boils down to is changing foo to bar in the underlying HTML (which may involve embedded tags) so that the rendered HTML looks as if you had done s/foo/bar/g on it.

Cheers - L~R
Re^3: Munging Rendered HTML While Preserving Formatting by idsfa (Vicar) on Jun 28, 2004 at 16:52 UTC
In general, I don't think you can get there from here. Consider the transform `s/foo/fishstick/g`. How do you transform the HTML `fo<b>o</b>`? It is not obvious which of the new characters should end up inside the bold tag.

Assuming you constrain the replacement to have the same length as the original, something like this would do the job:

    use Regexp::Common;

    # Strip each tag, remembering the text offset it was removed from
    while ( $html =~ s/($RE{balanced}{-parens=>'<>'})// ) {
        $tags{$-[0]} .= $1;
    }
    $html =~ s/foo/bar/g;
    # Reinsert the tags right to left so the lower offsets stay valid
    foreach my $point ( sort { $b <=> $a } keys %tags ) {
        substr( $html, $point, 0 ) = $tags{$point};
    }

For the pathological case of a tag with an attribute containing a '>', at this point you know as well as I do that you're into a full HTML parser:

    use HTML::Parser;

    my $html = '';    # accumulated text content only
    my %tags;         # text offset => markup found at that offset

    # Remove the s///g from this one to leave tags alone.
    # Alternately, specify additional handlers to alter only
    # specific token types.
    sub tagpush { local $_ = shift; s/foo/bar/g; $tags{ length($html) } .= $_; }
    sub txtpush { $html .= "@_"; }

    my $p = HTML::Parser->new(
        unbroken_text => 1,
        text_h        => [ \&txtpush, "text" ],
        default_h     => [ \&tagpush, "text" ],
    );
    my $file = shift || usage();
    $p->parse_file($file) || die "Can't open file $file: $!\n";
    $html =~ s/foo/bar/g;
    foreach my $point ( sort { $b <=> $a } keys %tags ) {
        substr( $html, $point, 0 ) = $tags{$point};
    }

This last one also handles cases like `f<!-- -->oo`, `f<b>oo</b>` and `<img alt=">foo">` properly, which a token parser will not catch.

If anyone needs me I'll be in the Angry Dome.
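
For anyone who wants to try the same-length approach from Re^3 in isolation, here is a minimal sketch using only core Perl. The sub name `munge_same_length` and the naive tag pattern `<[^>]*>` are assumptions made for illustration and are not part of the code above; the naive pattern breaks on a '>' inside an attribute value, which is exactly the case that calls for a real parser.

```perl
use strict;
use warnings;

# Replace $from with $to in the text of $html while leaving tags in place.
# Assumes length($from) == length($to), so recorded offsets remain valid.
sub munge_same_length {
    my ( $html, $from, $to ) = @_;
    die "replacement must have the same length as the original\n"
        unless length($from) == length($to);

    my %tags;    # offset in the tag-free text => markup removed from there

    # Illustrative, naive tag pattern (an assumption, not a real HTML parser).
    # Tags are removed left to right, so $-[0] is already a text-only offset.
    while ( $html =~ s/(<[^>]*>)// ) {
        $tags{ $-[0] } .= $1;
    }

    $html =~ s/\Q$from\E/$to/g;

    # Reinsert from the highest offset down so lower offsets are not shifted.
    for my $point ( sort { $b <=> $a } keys %tags ) {
        substr( $html, $point, 0 ) = $tags{$point};
    }
    return $html;
}

print munge_same_length( 'fo<b>o</b>', 'foo', 'bar' ), "\n";   # ba<b>r</b>
```

The length check is what keeps the bookkeeping honest: once the replacement is longer or shorter than the original, the stored offsets no longer say where the tags belong.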
Re^4: Munging Rendered HTML While Preserving Formatting by Limbic~Region (Chancellor) on Jun 28, 2004 at 17:47 UTC
idsfa,
OK, so now you see what I was getting at. I don't have any experience with HTML munging, so I don't know what one should do in these cases; that's why I asked. It is obviously a hard problem, but I would have thought someone was working on it. I guess I will crawl back under my rawk now, but thanks for the additional insight.

Cheers - L~R