Regex to Truncate URLs Nicely

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex to Truncate URLs Nicely by fruiture (Curate) on Oct 31, 2002 at 23:49 UTC
Check out the URI module to do it correctly. I think that's the easiest way to parse a URL and modify it wisely. -- http://fruiture.de
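A minimal sketch of what the URI-based route could look like, assuming the CPAN URI module is installed and the input is an absolute http or https URL. The helper name `shorten_url`, the 50-character threshold, and the `(...)` placeholder are illustrative choices for this sketch, not part of the module.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: shorten a URL for display only.
sub shorten_url {
    my ($url, $maxlen) = @_;
    $maxlen = 50 unless defined $maxlen;   # assumed threshold
    return $url if length($url) <= $maxlen;

    my $uri = URI->new($url);

    # $uri->path excludes the query string, so any slashes inside
    # the query can no longer be mistaken for path separators.
    my @segments = grep { length } split m{/}, $uri->path;
    my $last     = @segments ? $segments[-1] : '';
    my $base     = $uri->scheme . '://' . $uri->authority;

    return @segments > 1 ? "$base/(...)/$last" : "$base/$last";
}

print shorten_url('http://some-shop.com/dir1/dir2/buystuff.cgi?x=1&y=2&z=3'), "\n";
# prints http://some-shop.com/(...)/buystuff.cgi
```

The module does the parsing, so the only decisions left in the helper are cosmetic: how long is too long, and what to show in place of the dropped directories.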
Re: Regex to Truncate URLs Nicely by Enlil (Parson) on Nov 01, 2002 at 00:15 UTC
Since I would now guess this is an "academic" endeavour, I will give you the way I would approach it, without modules. First, I would use two regexes: the first to reduce something like http://some-shop.com/dir1/dir2/buystuff.cgi?x=1&y=2&z=3 to something like http://some-shop.com/(...)/buystuff.cgi?x=1&y=2&z=3, and the second to remove a long query string at the end, if there is one. I would only do anything if the URL is over 50 chars. (Then again, I might just use a couple of splits and some concatenation magic instead, but that would depend on what all my data looked like.) Good luck. -enlil
Re^2: Regex to Truncate URLs Nicely by Aristotle (Chancellor) on Nov 02, 2002 at 07:48 UTC
I'd do it the other way around. The query parameters may contain slashes, but the path cannot contain question marks. If you try to reduce directories first, you will have to resolve the ambiguity of slashes in the path vs slashes in the query parameters. If you remove the query parameters first, for which there is an unambiguous criterion, then the slashes suddenly are unambiguous too. Makeshifts last the longest.
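The ordering argument can be sketched in a few lines of core Perl. This is only an illustration under assumptions: the helper name `shorten`, the `(...)` placeholder, and the choice to keep just the host and the last path segment are not taken from the post above.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating "query first, directories second".
sub shorten {
    my ($url) = @_;

    # Step 1: a '?' cannot appear in the path, so the first one
    # unambiguously starts the query string. Drop it and what follows.
    $url =~ s/[?].*\z//s;

    # Step 2: every remaining '/' is a path separator, so collapsing
    # the middle directories is now safe.
    $url =~ s{\A(\w+://[^/]+)/.+/([^/]*)\z}{$1/(...)/$2}s;

    return $url;
}

print shorten('http://host.com/some/uri/whatever?some/query/string'), "\n";
# prints http://host.com/(...)/whatever
```

A URL with no intermediate directories, such as http://host.com/whatever, fails to match the second substitution and is returned with only its query removed.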
Re: Regex to Truncate URLs Nicely by Cody Pendant (Prior) on Nov 01, 2002 at 10:57 UTC
Thank you all for your help, I really appreciate it -- it never occurred to me to split on "/", which is certainly a novel approach. I'll check out all your code and post again tomorrow. -- `($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;`
Re: Regex to Truncate URLs Nicely by Revelation (Deacon) on Nov 01, 2002 at 00:40 UTC
Possible code:

    print url_parse('http://www.moo.com/moo.cgi?moo=moo');

    sub url_parse {
        my $url = shift;
        # capture scheme and host
        $url =~ m!(http://[^/]+)!is;
        my $base = $1;
        # strip the base; \Q...\E keeps the dots in the host literal
        ( my $directorystruct = $url ) =~ s!\Q$base\E!!is;
        my ( undef, @directories ) = split /\//, $directorystruct;
        my $tnum = $#directories;
        # drop the query string from the last path element
        $directories[$tnum] =~ s/(.*)\?.*/$1/is;
        return $base . '/' . $directories[$tnum] if scalar(@directories) <= 1;
        return $base . '/../' . $directories[$tnum];
    }

Gyan Kapur gyan.kapur@rhhllp.com
Re: Regex to Truncate URLs Nicely by Wonko the sane (Curate) on Nov 01, 2002 at 01:23 UTC
How about something like this? `$url =~ s!^(https?://.*?/)(?:.{20}.*)?(/[^?]*)(\?.*)*!$1(..)$2!` Works on URLs with or without args on the end; the 20 in the middle can be adjusted to fit whatever URLs you mostly encounter. Best Regards, Wonko
Re: Regex to Truncate URLs Nicely by artist (Parson) on Nov 01, 2002 at 03:16 UTC
Hi, mine is not a 100% Perl solution. If you are able to use an external service like Tiny URL, which shortens the URL itself, that can help with the underlying link. The text shown on display can be shortened to the website name, etc., or by the methods mentioned by other monks here. Appreciating the Tiny Art, Artist
Re: Regex to Truncate URLs Nicely by Aristotle (Chancellor) on Nov 02, 2002 at 08:14 UTC
Do use URI. That said:

    my $maxlen = 35;
    s![?].*$!!;   # chop query params if any
    s{^(.*)(?=/[^/]/?)}{length $1 < $maxlen ? $1 : substr($1, 0, $maxlen-3)."..."}e;

Makeshifts last the longest.
Re: Re: Regex to Truncate URLs Nicely by Cody Pendant (Prior) on Nov 02, 2002 at 11:05 UTC
Neat, thank you. One question: `s![?].*$!!;` Why is the query in brackets there? -- `($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;`
Re^3: Regex to Truncate URLs Nicely by Aristotle (Chancellor) on Nov 02, 2002 at 11:42 UTC
Because outside a character class it has to be escaped: `s!\?.*$!!;` The bracketed version occasionally looks less noisy. That's all. Makeshifts last the longest.
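A quick way to check that the two spellings behave the same, using a made-up sample URL (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $url = 'http://some-shop.com/buystuff.cgi?x=1&y=2';

(my $bracketed = $url) =~ s![?].*$!!;   # '?' as a one-character class
(my $escaped   = $url) =~ s!\?.*$!!;    # '?' escaped with a backslash

print "$bracketed\n";                   # prints http://some-shop.com/buystuff.cgi
print $bracketed eq $escaped ? "same result\n" : "different\n";
```

Inside a character class the question mark loses its meaning as a quantifier, which is why no backslash is needed there.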
use split? by cebrown (Pilgrim) on Nov 01, 2002 at 00:19 UTC
I'm just about to brave the rush hour, so can't post code, but I would suggest using `split` on "/" instead of a regex. The first few items in the split list will make up the front of the URI, and the last one can be split again on "?" to knock off the query parameters.
Re: use split? by fruiture (Curate) on Nov 01, 2002 at 11:28 UTC
Imho split() is NOT a good idea: `http://host.com/some/uri/whatever?some/query/string` -- http://fruiture.de
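To spell out what goes wrong with that URL, here is a small sketch comparing a naive split on "/" with a split that removes the query first. The variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $url = 'http://host.com/some/uri/whatever?some/query/string';

# Naive: split on '/' first, then trim the query from the last piece.
my $naive_last = (split m{/}, $url)[-1];
$naive_last =~ s/[?].*//s;

# Safer: cut the query off first, then split what is left.
my ($no_query) = split /[?]/, $url, 2;
my $safe_last  = (split m{/}, $no_query)[-1];

print "naive: $naive_last\n";   # naive: string
print "safe:  $safe_last\n";    # safe:  whatever
```

The naive version picks up the tail of the query string because the slashes inside the query are indistinguishable from path separators once the URL has been split.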
Re: Re: use split? by Enlil (Parson) on Nov 01, 2002 at 23:02 UTC
The problem with either method is that there are special cases which one might miss unless one understands exactly what a URL might look like (or, for that matter, any data you have to parse through). Personally, I would use a module if someone has already taken the time to do the legwork on what specifications a URL has to meet. I initially coded up a regex for this and then didn't post it, because I don't wish to do someone else's homework; I posted the method I took instead, and I completely neglected the special case that fruiture mentions above. But I don't see a problem with using split(s). Anyhow, on to the code (granted, no guarantees that it will work for all cases; I would use URI):

    use strict;
    use warnings;

    while ( my $url = <DATA> ) {
        chomp($url);
        my $dup_url = $url;
        if ( length($url) > 49 ) {
            $url =~ s!(?: (^https?://[^/]+/).*/(.*)\?.* )
                      |
                      (?: (^https?://[^/]+/).*/(.*) )
                     ! ($1||$3) . '(...)/' . ($2||$4) !ex;

            my $http = (split /\/\//, $dup_url)[0];
            my ($url_start, $url_end) =
                (split /\//, (split /\?/, $dup_url)[0])[2,-1];
            $dup_url = "$http//$url_start/(...)/$url_end";
        }
        print "REGEX: $url\n";
        print "SPLIT: $dup_url\n\n";
    }
    __DATA__
    http://some-shop.com/dir1/dir2/buystuff.cgi?x=1&y=2&z=3
    http://somewhere/with/a/vastly/deep/structure/virus.exe
    http://host.com/some/uri/whatever?some/query/stringthatis/here
    https://some-shop.com/dir1/dir2/buystuff.cgi?x=1&y=2&z=3
    https://somewhere/with/a/vastly/deep/structure/virus.exe
    https://host.com/some/uri/whatever?some/query/stringthatis/here
Missed pun opportunity! by cebrown (Pilgrim) on Nov 01, 2002 at 00:21 UTC
I should have said that I can't post code because I have to `split`.