Once again, I have a module but no name. I come here in the hope of finding a good name that helps others find this module and put it to good use.

Let me first describe what the module does:

The module exports two functions, rewrite_html and rewrite_css. These functions rewrite all things that look like URLs to be relative to a given base URL. This is of interest when you're converting scraped HTML to self-contained static files. The usage is:

use HTML::RewriteURLs;
my $html = <<HTML;
<html>
<head>
<link rel="stylesheet" src="http://localhost:5000/css/site.css" />
</head>
<body>
<a href="http://perlmonks.org">Go to Perlmonks.org</a>
<a href="http://localhost:5000">Go to home page</a>
</body>
</html>
HTML

my $local_html = rewrite_html( "http://localhost:5000/about", $html );
print $local_html;
__END__
<html>
<head>
<link rel="stylesheet" src="../css/site.css" />
</head>
<body>
<a href="http://perlmonks.org">Go to Perlmonks.org</a>
<a href="..">Go to home page</a>
</body>
</html>
[download]
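To make "relative to a given base URL" concrete, here is a minimal sketch that uses the rel() method of the URI distribution. It assumes that the module's relative_url() helper behaves roughly like rel(), which may not be exactly true. The name relative_to() is made up for this illustration and is not part of the module.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper for illustration only: express $target relative to $base.
# Assumption: this approximates what the module's relative_url() does.
sub relative_to {
    my ($base, $target) = @_;
    return URI->new($target)->rel(URI->new($base))->as_string;
}

# Same scheme and host: the result is a relative path.
print relative_to('http://localhost:5000/about/',
                  'http://localhost:5000/css/site.css'), "\n";

# Different host: rel() cannot make it relative and returns it unchanged.
print relative_to('http://localhost:5000/about/',
                  'http://perlmonks.org'), "\n";
```

With URI's rel(), the first call yields ../css/site.css only because the base ends in a slash; for a base of http://localhost:5000/about it would yield css/site.css instead.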

The current name for the module is HTML::RewriteURLs. This name is bad because the module does not support arbitrary URL rewriting; it only rewrites URLs to be relative to a given URL. The functions are also badly named, because rewrite_html doesn't rewrite the HTML as such, it only makes the URLs in it relative to a given base. The HTML::RewriteURLs name is also incomplete, because the module supports rewriting CSS as well.

I'm willing to stay with the HTML:: namespace because nobody really cares about CSS before caring about HTML.

I think a better name could be HTML::RelativeURLs, but I'm not sure if other people have the same association. The functions could be renamed to relative_urls_html() and relative_urls_css().

Another name could be URL::Relative or something like that, but that shifts the focus away from the documents I'm mistreating to the URLs. I'm not sure what people look for first.

Below is the ugly, ugly regular expression I use for munging the HTML. I know and accept that this regex won't handle all edge cases, but seeing that there is no HTML rewriting module on CPAN at all, I think I'll first release a simpleminded version of what I need before I cater to the edge cases. I'm not fond of using HTML::TreeParser because it will rewrite the document structure of the scraped pages and the only change I want is the change in the URL attributes.

=head2 C<< rewrite_html >>

Rewrites all HTML links to be relative to the given URL. This
only rewrites things that look like C<< src= >> and C<< href= >> attributes.
Unquoted attributes will not be rewritten. This should be fixed.

=cut

sub rewrite_html {
    my($url, $html)= @_;
    $url = URI::URL->new( $url );
    
    #croak "Can only rewrite relative to an absolute URL!"
    #    unless $url->is_absolute;

    # Rewrite absolute URLs to relative ones (this modifies our copy of $html)
    rewrite_html_inplace( $url, $html );
    
    $html
}

sub rewrite_html_inplace {
    my $url = shift;
    $url = URI::URL->new( $url );
    
    #croak "Can only rewrite relative to an absolute URL!"
    #    unless $url->is_absolute;

    # Rewrite absolute URLs in quoted src/href attributes to relative ones
    $_[0] =~ s!((?:src|href)\s*=\s*(["']))(.+?)\2!$1 . relative_url(URI::URL->new( $url ),"$3") . $2!ge;
}
[download]
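The POD above notes that unquoted attributes are not rewritten. One possible way to cover them is sketched below. This is not the module's code: rewrite_attrs() and the callback are hypothetical names, and the callback merely stands in for relative_url(). Like the original regex, it will not handle every edge case.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module: apply $cb to every src/href
# value, whether it is double-quoted, single-quoted or unquoted.
sub rewrite_attrs {
    my ($html, $cb) = @_;
    $html =~ s{
        \b((?:src|href)\s*=\s*)   # group 1: attribute name and equals sign
        (?:
            (["'])(.*?)\2         # group 2: quote, group 3: quoted value
          |
            ([^\s"'>]+)           # group 4: unquoted value
        )
    }{
        my ($pre, $q, $val) = ($1, $2, defined $2 ? $3 : $4);
        $q = '' unless defined $q;
        $pre . $q . $cb->($val) . $q
    }gexis;
    return $html;
}

# Stand-in for relative_url(): strips a fixed prefix, for demonstration only.
my $strip = sub { my $u = shift; $u =~ s{^http://localhost:5000/}{}; $u };

print rewrite_attrs(
    q{<img src=http://localhost:5000/a.png> <a href="http://localhost:5000/b">b</a>},
    $strip,
), "\n";
```

This prints `<img src=a.png> <a href="b">b</a>`. The captures are copied into lexicals before the callback runs, so a callback that uses regular expressions itself cannot disturb them.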

Update: Now released as HTML::Rebase, thanks for the discussion and improvements!

In reply to RFC: Name and/or API for module ("HTML::RewriteURLs") by Corion
