Corion has asked for the wisdom of the Perl Monks concerning the following question:

Once again, I have a module but no name. I come here in the hope of finding a good name that helps others find this module and put it to good use.

Let me first describe what the module does:

The module exports two functions, rewrite_html and rewrite_css. These functions rewrite all things that look like URLs to be relative to a given base URL. This is of interest when you're converting scraped HTML to self-contained static files. The usage is:

    use HTML::RewriteURLs;

    my $html = <<HTML;
    <html>
    <head>
    <link rel="stylesheet" src="http://localhost:5000/css/site.css" />
    </head>
    <body>
    <a href="http://perlmonks.org">Go to Perlmonks.org</a>
    <a href="http://localhost:5000">Go to home page</a>
    </body>
    </html>
    HTML

    my $local_html = rewrite_html( "http://localhost:5000/about", $html );
    print $local_html;
    __END__
    <html>
    <head>
    <link rel="stylesheet" src="../css/site.css" />
    </head>
    <body>
    <a href="http://perlmonks.org">Go to Perlmonks.org</a>
    <a href="..">Go to home page</a>
    </body>
    </html>

The current name for the module is HTML::RewriteURLs, and this name is bad because the module does not allow or support arbitrary URL rewriting but only rewrites URLs relative to a given URL. The functions are also badly named, because rewrite_html doesn't rewrite the HTML but it makes URLs relative to a given base. And the HTML::RewriteURLs name is also bad/not comprehensive because the module also supports rewriting CSS.

I'm willing to stay with the HTML:: namespace because nobody really cares about CSS before caring about HTML.

I think a better name could be HTML::RelativeURLs, but I'm not sure if other people have the same association. The functions could be renamed to relative_urls_html() and relative_urls_css().

Another name could be URL::Relative or something like that, but that shifts the focus away from the documents I'm mistreating to the URLs. I'm not sure what people look for first.

Below is the ugly, ugly regular expression I use for munging the HTML. I know and accept that this regex won't handle all edge cases, but seeing that there is no HTML rewriting module on CPAN at all, I think I'll first release a simpleminded version of what I need before I cater to the edge cases. I'm not fond of using HTML::TreeParser because it will rewrite the document structure of the scraped pages and the only change I want is the change in the URL attributes.

    =head2 C<< rewrite_html >>

    Rewrites all HTML links to be relative to the given URL. This
    only rewrites things that look like C<< src= >> and C<< href= >> attributes.
    Unquoted attributes will not be rewritten. This should be fixed.

    =cut

    sub rewrite_html {
        my($url, $html)= @_;
        $url = URI::URL->new( $url );
        #croak "Can only rewrite relative to an absolute URL!"
        #    unless $url->is_absolute;
        # Rewrite relative to absolute
        rewrite_html_inplace( $url, $html );
        $html
    }

    sub rewrite_html_inplace {
        my $url = shift;
        $url = URI::URL->new( $url );
        #croak "Can only rewrite relative to an absolute URL!"
        #    unless $url->is_absolute;
        # Rewrite relative to absolute
        $_[0] =~ s!((?:src|href)\s*=\s*(["']))(.+?)\2!$1 . relative_url(URI::URL->new( $url ),"$3") . $2!ge;
    }
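The excerpt above calls relative_url, which the post does not show. The following is only a sketch of how such a helper might be written, assuming the URI distribution from CPAN is installed. The names sketch_relative_url and sketch_rewrite are made up for illustration and are not part of the module's API.

```perl
use strict;
use warnings;
use URI;

# Hypothetical stand-in for the relative_url helper the post does not show.
sub sketch_relative_url {
    my ($base, $link) = @_;
    # Resolve the link against the base first, so relative input works too
    my $abs = URI->new_abs($link, $base);
    # rel() returns the URL unchanged when scheme or host differ from the base
    return $abs->rel($base)->as_string;
}

# Apply the helper to every quoted src= / href= attribute value,
# using the same pattern as the excerpt above.
sub sketch_rewrite {
    my ($base, $html) = @_;
    $html =~ s!((?:src|href)\s*=\s*(["']))(.+?)\2!$1 . sketch_relative_url($base, "$3") . $2!ge;
    return $html;
}

print sketch_rewrite(
    'http://localhost:5000/about/',
    '<a href="http://localhost:5000/css/site.css">css</a>',
), "\n";
```

Resolving with new_abs before calling rel is a design choice of this sketch: it lets already-relative links pass through the same code path as absolute ones.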

Update: Now released as HTML::Rebase, thanks for the discussion and improvements!

Replies are listed 'Best First'.
Re: RFC: Name and/or API for module ("HTML::RewriteURLs")
by BrowserUk (Patriarch) on Jul 25, 2015 at 10:40 UTC

    I wouldn't use the term 'rewrite', because in my mind, in reference to things HTML, that makes me think of a dynamic process applied at the server when serving, rather than a static process of text substitution.

    I think 'ConvertUrls' is more applicable? HTML::ConvertUrls.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
Re: RFC: Name and/or API for module ("HTML::RewriteURLs")
by shmem (Chancellor) on Jul 25, 2015 at 10:18 UTC

    First, since you don't rewrite_html in a general way, but only URLs, and only those that are relative to a given base, you actually do relocate_rel_url.
    From this follows, second, that the package name is HTML::RelocateRelativeURL proper (which is as ugly as the regex you use :-), probably with the shortcut HTML::RRU (like Data::Dump::Streamer which comes optionally with the alias DDS).

    perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'

      Hmm - I'm not too fond of "RelocateRelativeURL" because it's unwieldy, but maybe just HTML::MakeRelativeURL is good enough. The function names will need a good new name but that should come once I've found the module name ;)

        Ok - next shot: HTML::RebaseURL. How's that?

        Update: Golfed down a bit: HTML::Rebase might suffice.

        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: RFC: Name and/or API for module ("HTML::RewriteURLs")
by 1nickt (Canon) on Jul 25, 2015 at 13:45 UTC

    I definitely think you should use the top level space URL:: because that's where you are operating. If there was HTML::URL:: that would be OK or maybe even better, but there isn't currently. URL::Transform:: does already exist though. For me, the best thing is to find like-purposed modules grouped together.

    Since your module does one thing to all URLs in a document, how about URL::Transform::Base ?

    While searching, I wondered if the existing URL::Transform is of any interest to you?

    The way forward always starts with a minimal test.

      For me personally, I think that it should stay in the HTML namespace, as the primary purpose is to recreate HTML docs by simply changing URLs. When I think of the URL namespace, I'd expect to have to use an HTML parser in conjunction with something from URL.

      The HTML::URL suggestion is a good one though... it makes sense. Something like HTML::URL::Localize, HTML::URL::Rebase etc. I still like HTML::LocalizeURLs / HTML::URLLocalize too though ;)

        Corion, whatever name you happen to choose, would you please keep "URL" singular? Reading the name with the plural version sounds horrendous to my ears.

Re: RFC: Name and/or API for module ("HTML::RewriteURLs")
by ww (Archbishop) on Jul 25, 2015 at 11:47 UTC

    Suggesting "HTML::RelativeURLs" occurred to me just before I saw your mention of it. But, IMHO, your suggestion for HTML::MakeRelativeURL is an improvement.

Re: RFC: Name and/or API for module ("HTML::RewriteURLs")
by LanX (Saint) on Jul 25, 2015 at 21:14 UTC
    Rebase.pm sounds good to me, not sure about the path ...

    ... but HTML::Rebase should be obvious, since there is already a 'HTML <base> Tag' for the purpose of relative URLs.

    (your code handles the base tag, right? ;)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      My code doesn't handle tags, it blindly handles attributes. But as the <base> tag has the href= attribute, that will be rewritten too. Maybe the more proper approach would be to eliminate the tag but I'll wait for bug reports to come in :)
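To illustrate the point about attributes versus tags, here is a small sketch that lists everything the attribute pattern from the original post would touch. It uses only core Perl; matched_urls is a hypothetical helper written for this illustration, not a function of the module.

```perl
use strict;
use warnings;

# Collect every value the attribute pattern from the post would rewrite.
# matched_urls is a hypothetical helper for illustration only.
sub matched_urls {
    my ($html) = @_;
    my @urls;
    while ($html =~ m!((?:src|href)\s*=\s*(["']))(.+?)\2!g) {
        push @urls, $3;
    }
    return @urls;
}

my $html = q{<base href="http://localhost:5000/"><a href='/about'>About</a><img src=logo.png>};
print "$_\n" for matched_urls($html);
```

The href of the base tag is picked up like any other quoted attribute, while the unquoted src=logo.png is skipped, which matches the limitation noted in the module's documentation.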

        Well... if you don't have options to handle base, then better forget my suggestion to call it Rebase.pm ... ;)

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        Update: wording