Corion has asked for the wisdom of the Perl Monks concerning the following question:

I'm autogenerating URLs for my image stream (github repository). I want googleable, "sane" URLs for each image. This means that I want to turn, for example, the image img_1648 with the resolution 0640 in the folder 20120419-luminale and no tags into the string img_1648_0640_20120419-luminale.jpeg. But it also means that I want to turn "arbitrary" words in tags into their romanized form, turn whitespace into underscores, and so on - for example, Grégory becomes Gregory, and Ümloud feat. ß becomes Umloud_feat._ss.

In addition, the empty string remains the empty string, and newlines and tabs also get converted to underscores. The routine is only intended for URL fragments, not complete URLs or path parts. /foo/bar/index.html will get turned into foo_bar_index.html.

I'm not so sure about the "ss" part, but that's what Text::Unidecode gives me, and for the moment I'm OK with that.
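To make the intended result concrete, here is a minimal sketch of how the pieces from the img_1648 example could be glued together; image_url is a purely hypothetical helper, and it relies on the sanitize_name routine shown just below:

sub image_url {
    my ($name, $resolution, $folder, @tags) = @_;
    # Drop empty parts, clean each remaining part, join with underscores
    my @parts = grep { length } ($name, $resolution, $folder, @tags);
    return join('_', sanitize_name(@parts)) . '.jpeg';
}

# image_url('img_1648', '0640', '20120419-luminale')
#     returns 'img_1648_0640_20120419-luminale.jpeg'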

The ridiculously simple code that does these replacements (assuming Unicode input) is:

use Text::Unidecode; # provides unidecode()

sub sanitize_name {
    # make uri-sane filenames
    # We assume Unicode on input.
    # XXX Maybe use whatever SocialText used to create titles

    # First, downgrade to ASCII chars (or transliterate if possible)
    @_ = unidecode(@_);
    for( @_ ) {
        s/['"]//gi;
        s/[^a-zA-Z0-9.-]/ /gi;
        s/\s+/_/g;
        s/_-_/-/g;
        s/^_+//g;
        s/_+$//g;
    };
    wantarray ? @_ : $_[0];
};

The test cases I currently have document my expectations best:

#!perl -w
use strict;
use Test::More;
use Data::Dumper;
use App::ImageStream::Image;
use utf8;

binmode DATA, ':utf8';

my @tests = map { s!\s+$!!g; [split /\|/] } grep {!/^\s*#/} <DATA>;
push @tests, ["String\nWith\n\nNewlines\r\nEmbedded","String_With_Newlines_Embedded"];
push @tests, ["String\tWith \t Tabs \tEmbedded","String_With_Tabs_Embedded"];
push @tests, ["","",'Empty String'];

plan tests => 1+@tests*2;

for (@tests) {
    my $name= $_->[2] || $_->[1];
    is App::ImageStream::Image::sanitize_name($_->[0]), $_->[1], $name;
    is App::ImageStream::Image::sanitize_name($_->[1]), $_->[1], "'$name' is idempotent";
};

is_deeply [App::ImageStream::Image::sanitize_name( 'Lenny', 'Motörhead' )],
    ['Lenny','Motorhead'],
    "Multiple arguments also work";

__DATA__
Grégory|Gregory
 Leading Spaces|Leading_Spaces
Trailing Space |Trailing_Space
Ævar Arnfjörð Bjarmason|AEvar_Arnfjord_Bjarmason
forward/slash|forward_slash
Ümloud feat. ß|Umloud_feat._ss
/foo/bar/index.html|foo_bar_index.html|filename with path

The thing that keeps nagging me is that all those blog engines have been doing this kind of thing for a long time already, as has Stack Overflow etc. - but I can't find a Perl module on CPAN that implements it. This is a call for critique - likely I have overlooked some edge cases that result in "ugly" URL fragments. But it is also a call to find out whether such a module already exists, or whether I should just release this snippet as a module for general consumption.

Re: A module for creating "sane" URLs from "arbitrary" titles?
by jwkrahn (Abbot) on Apr 20, 2012 at 21:25 UTC
    # First, downgrade to ASCII chars (or transliterate if possible)
    @_ = unidecode(@_);
    for( @_ ) {
        s/['"]//gi;
        s/[^a-zA-Z0-9.-]/ /gi;
        s/\s+/_/g;
        s/_-_/-/g;
        s/^_+//g;
        s/_+$//g;
    };

    You say transliterate but you don't actually use transliterate.

    s/['"]//gi;

    Since when do the ' and " characters have both upper and lower case representations?

    tr/'"//d;
    s/[^a-zA-Z0-9.-]/ /gi;

    Which characters does the /i option affect here?

    tr/a-zA-Z0-9.-/ /c;
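    Dropped into the loop from the original sub, the two tr/// forms would look something like this (a sketch; apart from losing the pointless /i, the behaviour should stay the same):

    for ( @_ ) {
        tr/'"//d;               # delete single and double quotes
        tr/a-zA-Z0-9.-/ /c;     # turn everything else into a space
        s/\s+/_/g;
        s/_-_/-/g;
        s/^_+//g;
        s/_+$//g;
    };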

      By "transliterate" I mean what Text::Unidecode does, for example for 北亰 to "Bei Jing". I assume the same will happen for Cyrillic letters.

      Good point on the /i options - they're useless indeed.

Re: A module for creating "sane" URLs from "arbitrary" titles?
by JavaFan (Canon) on Apr 20, 2012 at 17:50 UTC
    s/_-_/-/g; s/^_+//g; s/_+$//g;
    Suggestion: replace the above with:
    s/_(?:-_)+/_/g; s/^[-_]+//; s/[-_]+$//;

      Thanks for the additions.

      I'd like abc - def to become abc-def, which is why I'll be using a dash instead of an underscore as the replacement, i.e. s/_(?:-_)+/-/g, but eliminating trailing dashes and repeated dashes is a good addition indeed. Even though it makes links to C++ and C equivalent :).

      Update: I noticed that C++ and C were already converted to the same URL fragment, C, anyway.

        I meant to write s/_(?:-_)+/-/g - not something that replaces a whole run of _-_-_-_ with a single underscore.
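        To make the difference concrete, a small sketch (assuming the input has already been through the whitespace-to-underscore step):

        my $s = 'abc_-_def';                          # "abc - def" after s/\s+/_/g
        (my $with_underscore = $s) =~ s/_(?:-_)+/_/g; # gives 'abc_def'
        (my $with_dash       = $s) =~ s/_(?:-_)+/-/g; # gives 'abc-def'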
Re: A module for creating "sane" URLs from "arbitrary" titles? (Naming Things)
by Corion (Patriarch) on Apr 22, 2012 at 11:15 UTC

    Thanks for all the comments! I will release the code as a module. The one thing holding me back is that I already know I'm bad at naming things. My current favourite is to name the routine clean_fragment() (instead of sanitize_name from above), and the module Text::CleanFragment. I've rejected Text::ASCIIfy and anything else containing ASCII, because whitespace and / are part of ASCII but would be unsafe to use in the result.

    A better name would imply that the results are good to use as filenames or URL fragments.

Re: A module for creating "sane" URLs from "arbitrary" titles?
by tobyink (Canon) on Apr 20, 2012 at 23:46 UTC

    Converting "ß" to "ss" seems reasonable to me. "sz" is a possibility of course, but IIRC "ss" tends to be used when sorting "ß" alphabetically.

    I say, go for it. Publish. I'd use a module like this in WWW::DataWiki if it existed. Right now I just do something along the lines of:

    my $slug = lc $ctx->req->header('Slug');
    $slug =~ s/[^a-z0-9]/-/;
    $slug =~ s/[-]{2,}/-/g;

    while (page_exists($slug)) {
        $slug++;
    }

    $slug = sprintf('uuid-%s', lc $self->uuid_generator->create_str)
        unless $slug =~ /^[a-z][a-z0-9-]*[a-z0-9]$/;

    It would be nice if your module provided a way of passing in a "page already exists" function as a coderef.

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      I have the "URL was already created" thing on my to-do list as well, but I'm not sure that the module needs a callback. You can always do that check on the outside, and especially I'm unsure that incrementing the result string is what is desireable. For example I would like to append a counter such that img_1234 and img_1234 map to img_1234 (first) and img_1234_1 (second). Incrementing img_1234 to img_1235 will create an ugly and confusing rippling effect, as most of my images are sequentially numbered.

      On the other hand, almost every situation I can come up with has to tackle this problem - at least for the two use cases I can see: generating URL fragments, and (re)naming files according to the tags they contain.

      So indeed, both functionalities would be useful, but I don't see how to do automatic counter (or whatever) generation through an API in a way that avoids duplicate_1_1_1 but still allows for img_1234 to become img_1234_1.

      Passing in and using the duplicate detection callback is easy, but I don't see how it can be useful except if the duplicate callback returns the result to use instead. Basically, the use case with the explicit default callback would be:

      my %fragment_exists;
      sanitize_name( sub { map { $_ . "_" . $fragment_exists{ $_ }++ } @_ }, $title, ... );

      I think the following idiom will be going into the documentation, together with the hint of sorting the fragments, and potentially stripping off anything that looks like a counter, to eliminate the _1_1_1 effect:

      my %fragment_exists;
      my $fragment;
      my $duplicate = '';
      do {
          $fragment = sanitize_name( "${title}_" . $duplicate++ );
      } while( $fragment_exists{ $fragment }++ );

      As sanitize_name will strip trailing underscores, it's easy to add a counter to the end that way, even if that counter starts numbering the first duplicate with _1 instead of _2 ...
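      For the documentation, a purely hypothetical sketch of how that counter stripping could be limited so that img_1234 keeps its number while duplicate_1 does not grow into duplicate_1_1 (unique_fragment is not part of any released code):

      my %fragment_exists;

      sub unique_fragment {
          my ($title) = @_;
          my $base = sanitize_name( $title );
          # Strip a trailing counter only if the shorter fragment was
          # already handed out, so img_1234 keeps its number
          $base =~ s/_\d+$//
              if $base =~ /^(.+)_\d+$/ and $fragment_exists{$1};
          my $fragment;
          my $duplicate = '';
          do {
              $fragment = sanitize_name( $base . '_' . $duplicate++ );
          } while( $fragment_exists{ $fragment }++ );
          return $fragment;
      }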

Re: A module for creating "sane" URLs from "arbitrary" titles?
by GrandFather (Saint) on Apr 22, 2012 at 08:12 UTC

    And if you would like to abbreviate words, you may be interested in Abbreviate english words.

    True laziness is hard work