I'm autogenerating URLs for my image stream (github repository). I want googleable, "sane" URLs for each image. This means that I want to turn, for example, the image img_1648 with the resolution 0640 in the folder 20120419-luminale and no tags into the string img_1648_0640_20120419-luminale.jpeg. But it also means that I want to turn "arbitrary" words in tags into their romanized form, turn whitespace into underscores and so on, as in the test data further down: Grégory becomes Gregory, forward/slash becomes forward_slash, and "Ümloud feat. ß" becomes Umloud_feat._ss.

In addition, the empty string remains the empty string, and newlines and tabs also get converted to underscores. The routine is only intended for URL fragments, not complete URLs or path parts. /foo/bar/index.html will get turned into foo_bar_index.html.

I'm not so sure about the "ss" part (ß becomes ss), but that's what Text::Unidecode gives me, and for the moment I'm OK with that.

The ridiculously simple code that does these replacements (assuming Unicode input) is:

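    use Text::Unidecode; # provides unidecode()

    sub sanitize_name {
        # make uri-sane filenames
        # We assume Unicode on input.
        # XXX Maybe use whatever SocialText used to create titles
        # First, downgrade to ASCII chars (or transliterate if possible)
        @_ = unidecode(@_);
        for( @_ ) {
            s/['"]//gi;             # drop quotes entirely
            s/[^a-zA-Z0-9.-]/ /gi;  # everything else unsafe becomes a space
            s/\s+/_/g;              # runs of whitespace become one underscore
            s/_-_/-/g;              # "a - b" ends up as "a-b"
            s/^_+//g;               # no leading underscores
            s/_+$//g;               # no trailing underscores
        };
        wantarray ? @_ : $_[0];
    };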
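To see what each substitution contributes, here is a minimal sketch that replays the same regex steps on input that is already ASCII, so it runs without Text::Unidecode. The helper name trace_sanitize and the step labels are made up for illustration; they are not part of App::ImageStream::Image.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original module): applies the same
# substitutions as sanitize_name, minus the unidecode() call, and records
# the string after every step.
sub trace_sanitize {
    my ($str) = @_;
    my @steps = (
        [ 'drop quotes'          => sub { $_[0] =~ s/['"]//g } ],
        [ 'other chars to space' => sub { $_[0] =~ s/[^a-zA-Z0-9.-]/ /g } ],
        [ 'whitespace runs to _' => sub { $_[0] =~ s/\s+/_/g } ],
        [ 'tighten _-_ to -'     => sub { $_[0] =~ s/_-_/-/g } ],
        [ 'strip leading _'      => sub { $_[0] =~ s/^_+// } ],
        [ 'strip trailing _'     => sub { $_[0] =~ s/_+$// } ],
    );
    my @trace;
    for my $step (@steps) {
        my ($label, $code) = @$step;
        $code->($str);               # $_[0] inside the sub aliases $str
        push @trace, [ $label, $str ];
    }
    return ($str, \@trace);
}

my ($result, $trace) = trace_sanitize(qq{/foo/bar/"index".html});
printf "%-22s %s\n", @$_ for @$trace;
print "result: $result\n";
```

The order matters: mapping unsafe characters to spaces first lets the single whitespace step collapse slashes, tabs and newlines into one underscore each, and the leading and trailing strips then clean up what a path-like input leaves behind.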

The test cases I currently have document my expectations best:

    #!perl -w
    use strict;
    use Test::More;
    use Data::Dumper;
    use App::ImageStream::Image;
    use utf8;

    binmode DATA, ':utf8';
    my @tests = map { s!\s+$!!g; [split /\|/] } grep {!/^\s*#/} <DATA>;

    push @tests, ["String\nWith\n\nNewlines\r\nEmbedded","String_With_Newlines_Embedded"];
    push @tests, ["String\tWith \t Tabs \tEmbedded","String_With_Tabs_Embedded"];
    push @tests, ["","",'Empty String'];

    plan tests => 1+@tests*2;

    for (@tests) {
        my $name= $_->[2] || $_->[1];
        is App::ImageStream::Image::sanitize_name($_->[0]), $_->[1], $name;
        is App::ImageStream::Image::sanitize_name($_->[1]), $_->[1], "'$name' is idempotent";
    };

    is_deeply [App::ImageStream::Image::sanitize_name(
        'Lenny',
        'Motörhead'
    )], ['Lenny','Motorhead'], "Multiple arguments also work";

    __DATA__
    Grégory|Gregory
    Leading Spaces|Leading_Spaces
    Trailing Space|Trailing_Space
    Ævar Arnfjörð Bjarmason|AEvar_Arnfjord_Bjarmason
    forward/slash|forward_slash
    Ümloud feat. ß|Umloud_feat._ss
    /foo/bar/index.html|foo_bar_index.html|filename with path
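Each row under __DATA__ has the shape input|expected|optional test name. The following is a minimal sketch of how the map/grep line turns such rows into test tuples, using an in-memory list instead of the DATA handle. The parse_rows name and the sample rows are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: mirrors the map/grep in the test script, but works
# on a copy of each line so the caller's data is left untouched.
sub parse_rows {
    my @lines = @_;
    return map  { (my $l = $_) =~ s/\s+$//; [ split /\|/, $l ] }
           grep { !/^\s*#/ } @lines;
}

my @rows = parse_rows(
    "# comment lines are skipped\n",
    "forward/slash|forward_slash\n",
    "/foo/bar/index.html|foo_bar_index.html|filename with path\n",
);

for my $row (@rows) {
    my ($input, $expected, $name) = @$row;
    # Fall back to the expected value when no test name is given.
    $name = $expected unless defined $name && length $name;
    print "$name: '$input' should become '$expected'\n";
}
```

Note that stripping trailing whitespace before the split means a row cannot express trailing spaces in its input through the data file alone; cases like that have to be pushed onto the list by hand, as the script does for newlines and tabs.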

The thing that keeps nagging me is that all those blog engines have been doing this kind of thing for a long time already, as have Stack Overflow etc., but I can't find a Perl module on CPAN that implements this. This is a call for critique: I have likely overlooked some edge cases that result in "ugly" URL fragments. But it is also a call to find out whether such a module already exists, or whether I should just release this snippet as a module for general consumption.


In reply to A module for creating "sane" URLs from "arbitrary" titles? by Corion
