Corion has asked for the wisdom of the Perl Monks concerning the following question:

I'm autogenerating URLs for my image stream (github repository). I want googleable, "sane" URLs for each image. This means that I want to turn, for example, the image img_1648 with the resolution 0640 in the folder 20120419-luminale and no tags into the string img_1648_0640_20120419-luminale.jpeg. But it also means that I want to turn "arbitrary" words in tags into their romanized form, turn whitespace into underscores, and so on - for example, Grégory becomes Gregory, and Ümloud feat. ß becomes Umloud_feat._ss.

In addition, the empty string remains the empty string, and newlines and tabs also get converted to underscores. The routine is only intended for URL fragments, not complete URLs or path parts. /foo/bar/index.html will get turned into foo_bar_index.html.

I'm not so sure about the "ss" part, but that's what Text::Unidecode gives me, and for the moment I'm OK with that.
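To make the intended result concrete, here is a minimal sketch of how the pieces from the img_1648 example could be glued together; image_url is a purely hypothetical helper, and it relies on the sanitize_name routine shown just below:

sub image_url {
    my ($name, $resolution, $folder, @tags) = @_;
    # Drop empty parts, clean each remaining part, join with underscores
    my @parts = grep { length } ($name, $resolution, $folder, @tags);
    return join('_', sanitize_name(@parts)) . '.jpeg';
}

# image_url('img_1648', '0640', '20120419-luminale')
#     returns 'img_1648_0640_20120419-luminale.jpeg'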

The ridiculously simple code that does these replacements (assuming Unicode input) is:

use Text::Unidecode; # provides unidecode()

sub sanitize_name {
    # make uri-sane filenames
    # We assume Unicode on input.
    # XXX Maybe use whatever SocialText used to create titles

    # First, downgrade to ASCII chars (or transliterate if possible)
    @_ = unidecode(@_);
    for( @_ ) {
        s/['"]//gi;
        s/[^a-zA-Z0-9.-]/ /gi;
        s/\s+/_/g;
        s/_-_/-/g;
        s/^_+//g;
        s/_+$//g;
    };
    wantarray ? @_ : $_[0];
};

The test cases I currently have document my expectations best:

#!perl -w
use strict;
use Test::More;
use Data::Dumper;
use App::ImageStream::Image;
use utf8;

binmode DATA, ':utf8';

my @tests = map { s!\s+$!!g; [split /\|/] } grep {!/^\s*#/} <DATA>;
push @tests, ["String\nWith\n\nNewlines\r\nEmbedded","String_With_Newlines_Embedded"];
push @tests, ["String\tWith \t Tabs \tEmbedded","String_With_Tabs_Embedded"];
push @tests, ["","",'Empty String'];

plan tests => 1+@tests*2;

for (@tests) {
    my $name= $_->[2] || $_->[1];
    is App::ImageStream::Image::sanitize_name($_->[0]), $_->[1], $name;
    is App::ImageStream::Image::sanitize_name($_->[1]), $_->[1], "'$name' is idempotent";
};

is_deeply [App::ImageStream::Image::sanitize_name( 'Lenny', 'Motörhead' )],
    ['Lenny','Motorhead'],
    "Multiple arguments also work";

__DATA__
Grégory|Gregory
 Leading Spaces|Leading_Spaces
Trailing Space |Trailing_Space
Ævar Arnfjörð Bjarmason|AEvar_Arnfjord_Bjarmason
forward/slash|forward_slash
Ümloud feat. ß|Umloud_feat._ss
/foo/bar/index.html|foo_bar_index.html|filename with path

The thing that keeps nagging me is that all those blog engines have been doing this kind of thing for a long time already, as has Stack Overflow etc. - but I can't find a Perl module on CPAN that implements it. This is a call for critique - likely I have overlooked some edge cases that result in "ugly" URL fragments. But it is also a call to find out whether such a module already exists, or whether I should just release this snippet as a module for general consumption.

Re: A module for creating "sane" URLs from "arbitrary" titles?
by jwkrahn (Abbot) on Apr 20, 2012 at 21:25 UTC
    # First, downgrade to ASCII chars (or transliterate if possible)
    @_ = unidecode(@_);
    for( @_ ) {
        s/['"]//gi;
        s/[^a-zA-Z0-9.-]/ /gi;
        s/\s+/_/g;
        s/_-_/-/g;
        s/^_+//g;
        s/_+$//g;
    };

    You say transliterate but you don't actually use transliterate.

    s/['"]//gi;

    Since when do the ' and " characters have both upper and lower case representations?

    tr/'"//d;
    s/[^a-zA-Z0-9.-]/ /gi;

    Which characters does the /i option affect here?

    tr/a-zA-Z0-9.-/ /c;
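    Dropped into the loop from the original sub, the two tr/// forms would look something like this (a sketch; apart from losing the pointless /i, the behaviour should stay the same):

    for ( @_ ) {
        tr/'"//d;               # delete single and double quotes
        tr/a-zA-Z0-9.-/ /c;     # turn everything else into a space
        s/\s+/_/g;
        s/_-_/-/g;
        s/^_+//g;
        s/_+$//g;
    };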

      By "transliterate" I mean what Text::Unidecode does, for example for 北亰 to "Bei Jing". I assume the same will happen for Cyrillic letters.

      Good point on the /i options - they're useless indeed.

Re: A module for creating "sane" URLs from "arbitrary" titles?
by JavaFan (Canon) on Apr 20, 2012 at 17:50 UTC
    s/_-_/-/g; s/^_+//g; s/_+$//g;
    Suggestion: replace the above with:
    s/_(?:-_)+/_/g; s/^[-_]+//; s/[-_]+$//;

      Thanks for the additions.

      I'd like abc - def to become abc-def, which is why I'll be using a dash instead of an underscore as the replacement, i.e. s/_(?:-_)+/-/g, but eliminating trailing dashes and repeated dashes is a good addition indeed. Even though it makes links to C++ and C equivalent :).

      Update: I noticed that C++ and C were already converted to the same URL fragment, C, anyway.

        I meant to write s/_(?:-_)+/-/g - not something that replaces a whole run of _-_-_-_ with a single underscore.
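        To make the difference concrete, a small sketch (assuming the input has already been through the whitespace-to-underscore step):

        my $s = 'abc_-_def';                          # "abc - def" after s/\s+/_/g
        (my $with_underscore = $s) =~ s/_(?:-_)+/_/g; # gives 'abc_def'
        (my $with_dash       = $s) =~ s/_(?:-_)+/-/g; # gives 'abc-def'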
Re: A module for creating "sane" URLs from "arbitrary" titles? (Naming Things)
by Corion (Patriarch) on Apr 22, 2012 at 11:15 UTC

    Thanks for all the comments! I will release the code as a module. The one thing holding me back is that I already know I'm bad at naming things. My current favourite is to name the routine clean_fragment() (instead of sanitize_name from above), and the module Text::CleanFragment. I've rejected Text::ASCIIfy and anything else containing ASCII, because whitespace and / are part of ASCII but would be unsafe to use in the result.

    A better name would imply that the results are good to use as filenames or URL fragments.

Re: A module for creating "sane" URLs from "arbitrary" titles?
by tobyink (Canon) on Apr 20, 2012 at 23:46 UTC

    Converting "ß" to "ss" seems reasonable to me. "sz" is a possibility of course, but IIRC "ss" tends to be used when sorting "ß" alphabetically.

    I say, go for it. Publish. I'd use a module like this in WWW::DataWiki if it existed. Right now I just do something along the lines of:

    my $slug = lc $ctx->req->header('Slug');
    $slug =~ s/[^a-z0-9]/-/;
    $slug =~ s/[-]{2,}/-/g;

    while (page_exists($slug)) {
        $slug++;
    }

    $slug = sprintf('uuid-%s', lc $self->uuid_generator->create_str)
        unless $slug =~ /^[a-z][a-z0-9-]*[a-z0-9]$/;

    It would be nice if your module provided a way of passing in a "page already exists" function as a coderef.

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      I have the "URL was already created" thing on my to-do list as well, but I'm not sure that the module needs a callback. You can always do that check on the outside, and especially I'm unsure that incrementing the result string is what is desireable. For example I would like to append a counter such that img_1234 and img_1234 map to img_1234 (first) and img_1234_1 (second). Incrementing img_1234 to img_1235 will create an ugly and confusing rippling effect, as most of my images are sequentially numbered.

      On the other hand, almost every situation I can come up with has to tackle this problem - at least for the two use cases I can see: generating URL fragments, and (re)naming files according to the tags they contain.

      So indeed, both functionalities would be useful, but I don't see how to do automatic counter (or whatever) generation through an API in a way that avoids duplicate_1_1_1 but still allows for img_1234 to become img_1234_1.

      Passing in and using the duplicate detection callback is easy, but I don't see how it can be useful except if the duplicate callback returns the result to use instead. Basically, the use case with the explicit default callback would be:

      my %fragment_exists;
      sanitize_name( sub { map { $_ . "_" . $fragment_exists{ $_ }++ } @_ }, $title, ... );

      I think the following idiom will be going into the documentation, together with the hint of sorting the fragments, and potentially stripping off anything that looks like a counter, to eliminate the _1_1_1 effect:

      my %fragment_exists;
      my $fragment;
      my $duplicate = '';
      do {
          $fragment = sanitize_name( "${title}_" . $duplicate++ );
      } while( $fragment_exists{ $fragment }++ );

      As sanitize_name will strip trailing underscores, it's easy to add a counter to the end that way, even if that counter starts numbering the first duplicate with _1 instead of _2 ...
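      For the documentation, a purely hypothetical sketch of how that counter stripping could be limited so that img_1234 keeps its number while duplicate_1 does not grow into duplicate_1_1 (unique_fragment is not part of any released code):

      my %fragment_exists;

      sub unique_fragment {
          my ($title) = @_;
          my $base = sanitize_name( $title );
          # Strip a trailing counter only if the shorter fragment was
          # already handed out, so img_1234 keeps its number
          $base =~ s/_\d+$//
              if $base =~ /^(.+)_\d+$/ and $fragment_exists{$1};
          my $fragment;
          my $duplicate = '';
          do {
              $fragment = sanitize_name( $base . '_' . $duplicate++ );
          } while( $fragment_exists{ $fragment }++ );
          return $fragment;
      }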

Re: A module for creating "sane" URLs from "arbitrary" titles?
by GrandFather (Saint) on Apr 22, 2012 at 08:12 UTC

    And if you would like to abbreviate words, you may be interested in Abbreviate english words.

    True laziness is hard work