Corion has asked for the wisdom of the Perl Monks concerning the following question:
I'm autogenerating URLs for my image stream (github repository). I want googleable, "sane" URLs for each image. For example, the image img_1648 with the resolution 0640 in the folder 20120419-luminale and no tags should become the string img_1648_0640_20120419-luminale.jpeg. It also means that I want to turn "arbitrary" words in tags into their romanized form, turn whitespace into underscores, etc., like the following:
Ümloud feat. ß into Umloud_feat._ss
In addition, the empty string remains the empty string, and newlines and tabs also get converted to underscores. The routine is only intended for URL fragments, not complete URLs or path parts. /foo/bar/index.html will get turned into foo_bar_index.html.
I'm not so sure about the "ss" part, but that's what Text::Unidecode gives me, and for the moment I'm OK with that.
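To illustrate why a transliteration table such as Text::Unidecode is doing real work here, the following sketch uses a different, simpler technique: canonical decomposition with the core module Unicode::Normalize, followed by stripping combining marks. This is not what my code does, and the helper name strip_marks is made up for this example. It handles accented letters, but characters like ß and Æ have no decomposition and stay as they are.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not part of sanitize_name): decompose each
# character and drop the combining marks, keeping the base letters.
sub strip_marks {
    my ($s) = @_;
    $s = NFD($s);          # e.g. U+00E9 becomes "e" + U+0301
    $s =~ s/\p{Mn}+//g;    # remove nonspacing (combining) marks
    return $s;
}

# \x{..} escapes are used so the example does not depend on the
# encoding of the source file.
my @words = (
    "Gr\x{e9}gory",     # Gregory with e-acute
    "Mot\x{f6}rhead",   # Motorhead with o-umlaut
    "\x{df}",           # sharp s
    "\x{c6}var",        # AE ligature + "var"
);

for my $word (@words) {
    my $out = strip_marks($word);
    print $out =~ /^[\x00-\x7f]*$/
        ? "ASCII after stripping marks: $out\n"
        : "still contains non-ASCII characters\n";
}
```

The first two words come out as plain ASCII; the last two do not, which is the gap that a full transliteration module fills.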
The ridiculously simple code that does these replacements (assuming Unicode input) is:
sub sanitize_name {
    # make uri-sane filenames
    # We assume Unicode on input.

    # XXX Maybe use whatever SocialText used to create titles

    # First, downgrade to ASCII chars (or transliterate if possible)
    @_ = unidecode(@_);
    for( @_ ) {
        s/['"]//gi;
        s/[^a-zA-Z0-9.-]/ /gi;
        s/\s+/_/g;
        s/_-_/-/g;
        s/^_+//g;
        s/_+$//g;
    };
    wantarray ? @_ : $_[0];
};
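To make the order of the substitutions easier to follow, here is a small sketch that applies the same six regexes one at a time and records the string after each step. It assumes the input is already ASCII (that is, unidecode has already run), and the helper name trace_sanitize and the sample string are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical tracing helper: same substitutions as sanitize_name,
# applied one by one to an already-ASCII string.
sub trace_sanitize {
    my ($s) = @_;
    my @steps = (
        [ 'drop quotes',               sub { $_[0] =~ s/['"]//g } ],
        [ 'other chars to space',      sub { $_[0] =~ s/[^a-zA-Z0-9.-]/ /g } ],
        [ 'whitespace to underscore',  sub { $_[0] =~ s/\s+/_/g } ],
        [ 'tighten "_-_" to "-"',      sub { $_[0] =~ s/_-_/-/g } ],
        [ 'strip leading underscore',  sub { $_[0] =~ s/^_+// } ],
        [ 'strip trailing underscore', sub { $_[0] =~ s/_+$// } ],
    );
    my @trace;
    for my $step (@steps) {
        my ($label, $code) = @$step;
        $code->($s);                   # $_[0] aliases $s, so $s is modified
        push @trace, [ $label, $s ];
    }
    return @trace;
}

for my $row (trace_sanitize(qq{  "Rock 'n' Roll" - live/2012  })) {
    printf "%-26s [%s]\n", @$row;
}
# The last line printed shows [Rock_n_Roll-live_2012]
```

The trace shows that the underscores only appear in the third step, which is why the "_-_" cleanup and the trimming of leading and trailing underscores have to come after it.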
The test cases I currently have document my expectations best:
#!perl -w
use strict;
use Test::More;
use Data::Dumper;
use App::ImageStream::Image;
use utf8;
binmode DATA, ':utf8';

my @tests = map { s!\s+$!!g; [split /\|/] } grep {!/^\s*#/} <DATA>;

push @tests, ["String\nWith\n\nNewlines\r\nEmbedded","String_With_Newlines_Embedded"];
push @tests, ["String\tWith \t Tabs \tEmbedded","String_With_Tabs_Embedded"];
push @tests, ["","",'Empty String'];

plan tests => 1+@tests*2;

for (@tests) {
    my $name= $_->[2] || $_->[1];
    is App::ImageStream::Image::sanitize_name($_->[0]), $_->[1], $name;
    is App::ImageStream::Image::sanitize_name($_->[1]), $_->[1], "'$name' is idempotent";
};

is_deeply [App::ImageStream::Image::sanitize_name(
              'Lenny', 'Motörhead'
          )],
          ['Lenny','Motorhead'],
          "Multiple arguments also work";

__DATA__
Grégory|Gregory
Leading Spaces|Leading_Spaces
Trailing Space|Trailing_Space
Ævar Arnfjörð Bjarmason|AEvar_Arnfjord_Bjarmason
forward/slash|forward_slash
Ümloud feat. ß|Umloud_feat._ss
/foo/bar/index.html|foo_bar_index.html|filename with path
The thing that keeps nagging me is that all those blog engines have been doing this kind of thing for a long time already, as have Stack Overflow etc., but I can't find a Perl module on CPAN that implements this. This is a call for critique: I have likely overlooked some edge cases that result in "ugly" URL fragments. It is also a question of whether such a module already exists, or whether I should just release this snippet as a module for general consumption.