Once again, I have a module but no name. I come here in the hope of finding a good name that helps others find this module and put it to good use.
Let me first describe what the module does:
The module exports two functions, rewrite_html and rewrite_css. These functions rewrite all things that look like URLs to be relative to a given base URL. This is of interest when you're converting scraped HTML to self-contained static files. The usage is:
use HTML::RewriteURLs; my $html = <<HTML; <html> <head> <link rel="stylesheet" src="http://localhost:5000/css/site.css" /> </head> <body> <a href="http://perlmonks.org">Go to Perlmonks.org</a> <a href="http://localhost:5000">Go to home page/a> </body> </html> HTML my $local_html = rewrite_html( "http://localhost:5000/about", $html ); print $local_html; __END__ <html> <head> <link rel="stylesheet" src="../css/site.css" /> </head> <body> <a href="http://perlmonks.org">Go to Perlmonks.org</a> <a href="..">Go to home page/a> </body> </html>
The current name for the module is HTML::RewriteURLs, and this name is bad because the module does not allow or support arbitrary URL rewriting but only rewrites URLs relative to a given URL. The functions are also badly named, because rewrite_html doesn't rewrite the HTML but it makes URLs relative to a given base. And the HTML::RewriteURLs name is also bad/not comprehensive because the module also supports rewriting CSS.
I'm willing to stay with the HTML:: namespace because nobody really cares about CSS before caring about HTML.
I think a better name could be HTML::RelativeURLs, but I'm not sure if other people have the same association. The functions could be renamed to relative_urls_html() and relative_urls_css().
Another name could be URL::Relative or something like that, but that shifts the focus away from the documents I'm mistreating to the URLs. I'm not sure what people look for first.
Below is the ugly, ugly regular expression I use for munging the HTML. I know and accept that this regex won't handle all edge cases, but seeing that there is no HTML rewriting module on CPAN at all, I think I'll first release a simpleminded version of what I need before I cater to the edge cases. I'm not fond of using HTML::TreeParser because it will rewrite the document structure of the scraped pages and the only change I want is the change in the URL attributes.
=head2 C<< rewrite_html >> Rewrites all HTML links to be relative to the given URL. This only rewrites things that look like C<< src= >> and C<< href= >> attri +butes. Unquoted attributes will not be rewritten. This should be fixed. =cut sub rewrite_html { my($url, $html)= @_; $url = URI::URL->new( $url ); #croak "Can only rewrite relative to an absolute URL!" # unless $url->is_absolute; # Rewrite relative to absolute rewrite_html_inplace( $url, $html ); $html } sub rewrite_html_inplace { my $url = shift; $url = URI::URL->new( $url ); #croak "Can only rewrite relative to an absolute URL!" # unless $url->is_absolute; # Rewrite relative to absolute $_[0] =~ s!((?:src|href)\s*=\s*(["']))(.+?)\2!$1 . relative_url(UR +I::URL->new( $url ),"$3") . $2!ge; }
Update: Now released as HTML::Rebase, thanks for the discussion and improvements!
In reply to RFC: Name and/or API for module ("HTML::RewriteURLs") by Corion
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |