I'm autogenerating URLs for my image stream (GitHub repository). I want googleable, "sane" URLs for each image. This means that I want to turn, for example, the image img_1648 with the resolution 0640 in the folder 20120419-luminale and no tags into the string img_1648_0640_20120419-luminale.jpeg. But it also means that I want to turn "arbitrary" words in tags into their romanized form, turn whitespace into underscores and so on, like the following:

Motörhead into Motorhead
Ævar Arnfjörð Bjarmason into AEvar_Arnfjord_Bjarmason
```
Ümloud   feat.   ß
```
into Umloud_feat._ss

In addition, the empty string remains the empty string, and newlines and tabs also get converted to underscores. The routine is only intended for URL fragments, not complete URLs or path parts, so /foo/bar/index.html will get turned into foo_bar_index.html.

I'm not so sure about the "ss" part, but that's what Text::Unidecode gives me, and for the moment I'm OK with that.

The ridiculously simple code that does these replacements (assuming Unicode input) is:

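```perl
use Text::Unidecode;    # provides unidecode()

sub sanitize_name {
    # make uri-sane filenames
    # We assume Unicode on input.

    # XXX Maybe use whatever SocialText used to create titles

    # First, downgrade to ASCII chars (or transliterate if possible)
    @_ = unidecode(@_);

    for( @_ ) {
        s/['"]//g;               # drop quotes entirely
        s/[^a-zA-Z0-9.-]/ /g;    # anything else unsafe becomes whitespace
        s/\s+/_/g;               # whitespace runs become one underscore
        s/_-_/-/g;               # "foo - bar" becomes "foo-bar"
        s/^_+//;                 # trim leading underscores
        s/_+$//;                 # trim trailing underscores
    };
    wantarray ? @_ : $_[0];
};
```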

The test cases I currently have document my expectations best:

```perl
#!perl -w
use strict;
use Test::More;
use Data::Dumper;

use App::ImageStream::Image;
use utf8;

binmode DATA, ':utf8';
my @tests = map { s!\s+$!!g; [split /\|/] } grep {!/^\s*#/} <DATA>;

push @tests, ["String\nWith\n\nNewlines\r\nEmbedded","String_With_Newlines_Embedded"];
push @tests, ["String\tWith \t Tabs \tEmbedded","String_With_Tabs_Embedded"];
push @tests, ["","",'Empty String'];

plan tests => 1+@tests*2;

for (@tests) {
    my $name= $_->[2] || $_->[1];
    is App::ImageStream::Image::sanitize_name($_->[0]), $_->[1], $name;
    is App::ImageStream::Image::sanitize_name($_->[1]), $_->[1], "'$name' is idempotent";
};

is_deeply [App::ImageStream::Image::sanitize_name(
    'Lenny', 'Motörhead'
)], ['Lenny','Motorhead'], "Multiple arguments also work";

__DATA__
Grégory|Gregory
   Leading Spaces|Leading_Spaces
   Trailing Space|Trailing_Space
Ævar Arnfjörð Bjarmason|AEvar_Arnfjord_Bjarmason
forward/slash|forward_slash
Ümloud feat. ß|Umloud_feat._ss
/foo/bar/index.html|foo_bar_index.html|filename with path
```

The thing that keeps nagging me is that all those blog engines have been doing this kind of thing for a long time already, as have Stack Overflow etc., but I can't find a Perl module on CPAN that implements this. This is a call for critique: likely I have overlooked some edge cases that result in "ugly" URL fragments. But this is also a question of whether such a module already exists, or whether I should just release this snippet as a module for general consumption.
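As a starting point for that critique, here is a minimal sketch that feeds a few awkward titles through the same regex steps. The name sanitize_ascii is hypothetical and not part of App::ImageStream; it leaves out the Text::Unidecode call so that it runs on a stock Perl, which means it only says something about ASCII input.

```perl
use strict;
use warnings;

# Hypothetical ASCII-only stand-in for sanitize_name: the same regex steps,
# but without the Text::Unidecode call (an assumption made for portability).
sub sanitize_ascii {
    my @names = @_;
    for (@names) {
        s/['"]//g;               # drop quotes entirely
        s/[^a-zA-Z0-9.-]/ /g;    # anything else unsafe becomes whitespace
        s/\s+/_/g;               # whitespace runs become one underscore
        s/_-_/-/g;               # "foo - bar" becomes "foo-bar"
        s/^_+//;                 # trim leading underscores
        s/_+$//;                 # trim trailing underscores
    }
    return wantarray ? @names : $names[0];
}

# Inputs chosen to poke at possibly "ugly" results
for my $title ("Rock 'n' Roll", "AC/DC - Live", "???", "a - - b", "../etc/passwd") {
    printf "%-16s => '%s'\n", $title, scalar sanitize_ascii($title);
}
```

With these steps, a title made only of punctuation such as ??? ends up as the empty string, repeated dashes leave a result like a--_b, and ../etc/passwd keeps its leading dots. Whether a caller needs a fallback name or an extra trim for such cases is an open question.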

In reply to A module for creating "sane" URLs from "arbitrary" titles? by Corion
