Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a board where one of the main problems is people posting long URLs which stretch out the post area.

What we want is something that will stop a URL over, say, fifty chars being displayed in full, but replace it with something which is still informative.

For long directory structures, I've got as far as:

(http:\/\/[^\/]+).*

so I can replace
http://somewhere/with/a/deep/structure/
with just
http://somewhere/(...)

What people are asking for is this, though -- replacing (optionally) any long dir structures, preserving the document name on the end, but stripping any long query string info from the end as well, so that

http://somewhere/with/a/deep/structure/virus.exe
would be reduced to
http://somewhere/(...)/virus.exe
and
http://some-shop.com/dir1/dir2/buystuff.cgi?x=1&y=2&z=3
would be reduced to
http://some-shop.com/(...)/buystuff.cgi
as well.

Any pointers gratefully received.
--

($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;

Re: Regex to Truncate URLs Nicely
by fruiture (Curate) on Oct 31, 2002 at 23:49 UTC

    Check out the URI module to do it correctly; I think that's the easiest way to parse a URL and modify it wisely.

    --
    http://fruiture.de
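
A rough sketch of what the URI suggestion could look like. It assumes the CPAN URI module is installed; the helper name shorten_url and the default limit of 50 characters are made up for illustration and are not code from this thread.

```perl
use strict;
use warnings;
use URI;    # CPAN module, not part of core Perl

# Hypothetical helper: shorten a URL for display only.
sub shorten_url {
    my ( $url, $maxlen ) = @_;
    $maxlen = 50 unless defined $maxlen;    # assumed limit, taken from the question
    return $url if length($url) <= $maxlen;

    my $uri = URI->new($url);

    # Leave anything that is not http/https alone.
    return $url unless ( $uri->scheme || '' ) =~ /^https?$/;

    # path_segments returns an empty first element for absolute paths,
    # so filter out empty segments. The query string is simply not used.
    my @segments = grep { length } $uri->path_segments;
    my $prefix   = $uri->scheme . '://' . $uri->authority;

    return $prefix unless @segments;
    return @segments > 1
        ? "$prefix/(...)/$segments[-1]"
        : "$prefix/$segments[-1]";
}

print shorten_url('http://some-shop.com/dir1/dir2/buystuff.cgi?x=1&y=2&z=3'), "\n";
```

Because the module separates path and query before any slashes are counted, a query string containing slashes should not confuse this version.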
Re: Regex to Truncate URLs Nicely
by Enlil (Parson) on Nov 01, 2002 at 00:15 UTC
    Since I would now guess this is an "academic" endeavour, I will give you the way I would approach it, without modules. First, I would use two regexes. The first would reduce something like:

    http://some-shop.com/dir1/dir2/buystuff.cgi?x=1&y=2&z=3

    to something like

    http://some-shop.com/(...)/buystuff.cgi?x=1&y=2&z=3

    and the second regex would remove a long query string from the end, if there is one. Both would only do anything if the URL is over 50 chars. (Then again, I might just use a couple of splits and some concatenation magic instead, but that would depend on what all my data looked like.)

    Good Luck.

    -enlil

      I'd do it the other way around. The query parameters may contain slashes, but the path cannot contain question marks. If you try to reduce directories first, you will have to resolve the ambiguity of slashes in the path vs slashes in the query parameters. If you remove the query parameters first, for which there is an unambiguous criterion, then the slashes suddenly are unambiguous too.

      Makeshifts last the longest.
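
A minimal sketch of the ordering argued for above, using only core Perl. The helper name strip_then_split is hypothetical, and the "(...)" placeholder is borrowed from the question; this is an illustration of the idea, not a complete URL parser.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the query first, then treat slashes as path separators.
sub strip_then_split {
    my ($url) = @_;

    # 1. Drop the query string; the first "?" is an unambiguous boundary.
    ( my $no_query = $url ) =~ s/[?].*//s;

    # 2. Every remaining slash now belongs to the scheme or the path.
    my ( $base, $path ) = $no_query =~ m{^(https?://[^/]+)(/.*)?$}s
        or return $url;    # not an http(s) URL: leave it untouched

    my @parts = grep { length } split m{/}, defined $path ? $path : '';
    return $base unless @parts;
    return @parts > 1 ? "$base/(...)/$parts[-1]" : "$base/$parts[-1]";
}

my $url = 'http://host.com/some/uri/whatever?some/query/string';

# Splitting the raw URL on "/" picks a piece of the query string instead.
print 'naive last piece: ', ( split m{/}, $url )[-1], "\n";
print 'query stripped first: ', strip_then_split($url), "\n";
```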

Re: Regex to Truncate URLs Nicely
by Cody Pendant (Prior) on Nov 01, 2002 at 10:57 UTC
    Thank you all for your help, I really appreciate it -- it never occurred to me to split on "/" which is certainly a novel approach.

    I'll check out all your code and post again tomorrow.
    --

    ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;
Re: Regex to Truncate URLs Nicely
by Revelation (Deacon) on Nov 01, 2002 at 00:40 UTC
    Possible Code:
    print url_parse('http://www.moo.com/moo.cgi?moo=moo');

    sub url_parse {
        my $url = shift;
        $url =~ m!(http://[^/]+)!i or return $url;    # scheme and host
        my $base = $1;
        # Remove the base literally (\Q...\E) to leave the path and query.
        ( my $directorystruct = $url ) =~ s!\Q$base\E!!i;
        my ( undef, @directories ) = split /\//, $directorystruct;
        return $base unless @directories;             # no path at all
        my $tnum = $#directories;
        $directories[$tnum] =~ s/\?.*//s;             # drop the query string
        return $base . '/' . $directories[$tnum] if scalar(@directories) <= 1;
        return $base . '/../' . $directories[$tnum];
    }

    Gyan Kapur
    gyan.kapur@rhhllp.com
Re: Regex to Truncate URLs Nicely
by Wonko the sane (Curate) on Nov 01, 2002 at 01:23 UTC

    How about something like this?

    $url =~ s!^(https?://.*?/)(?:.{20}.*)?(/[^?]*)(\?.*)*!$1(..)$2!
    Works on urls with or without args on the end, the 20 in the middle can be adjusted to fit whatever url you mostly encounter.

    Best Regards,
    Wonko
Re: Regex to Truncate URLs Nicely
by artist (Parson) on Nov 01, 2002 at 03:16 UTC
    Hi,
    Mine is not a 100% Perl solution. If you are able to use an external service like Tiny URL, which shortens the URL itself, that can help with the underlying link. The text shown on the display can be shortened to the website name etc., or by the methods mentioned by other monks here.

    Appreciating the Tiny Art,
    Artist

Re: Regex to Truncate URLs Nicely
by Aristotle (Chancellor) on Nov 02, 2002 at 08:14 UTC
    Do use URI. That said:
    my $maxlen = 35;
    s![?].*$!!; # chop query params if any
    s{^(.*)(?=/[^/]/?)}{length $1 < $maxlen ? $1 : substr($1, 0, $maxlen-3)."..."}e;

    Makeshifts last the longest.

      Neat, thank you.

      One question:

      s![?].*$!!;

      Why is the query in brackets there?
      --

      ($_='jjjuuusssttt annootthheer pppeeerrrlll haaaccckkeer')=~y/a-z//s;print;
        Because outside a character class the question mark has to be escaped: s!\?.*$!!; The bracketed version occasionally looks less noisy. That's all.

        Makeshifts last the longest.
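
To make the equivalence concrete, here is a small sketch showing that both spellings strip the same text. The sample URL and variable names are made up for the example.

```perl
use strict;
use warnings;

my $url = 'http://some-shop.com/buystuff.cgi?x=1&y=2';

# "?" is a literal character inside a character class...
( my $with_class = $url ) =~ s![?].*$!!;

# ...and needs a backslash outside one, where it is otherwise a quantifier.
( my $with_escape = $url ) =~ s!\?.*$!!;

print "$with_class\n$with_escape\n";
print "same result\n" if $with_class eq $with_escape;
```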

use split?
by cebrown (Pilgrim) on Nov 01, 2002 at 00:19 UTC
    I'm just about to brave the rush hour, so can't post code, but I would suggest using split on "/" instead of a regex.

    The first few items in the split list will make up the front of the URI, and the last one can be split again on "?" to knock off the query parameters.

      Imho split() is NOT a good idea:

      http://host.com/some/uri/whatever?some/query/string
      --
      http://fruiture.de
        The problem with either method is that there are special cases which one might miss without understanding exactly what a URL might look like (or, for that matter, any data you have to parse through).

        Personally, I would use a module if someone has already taken the time to do the legwork of working out what specifications a URL has to meet.

        I initially coded up a regex for this and then didn't post it, because I don't wish to do someone else's homework; I posted the method I took instead, and I completely neglected the special case that fruiture mentions above. But I don't see a problem with using split(s). Anyhow, on to the code (granted, no guarantees that it will work for all cases; I would use URI):

        use strict;
        use warnings;

        while ( my $url = <DATA> ) {
            chomp($url);
            my $dup_url = $url;
            if ( length($url) > 49 ) {
                $url =~ s!(?: (^https?://[^/]+/).*/(.*)\?.* )
                         |
                          (?: (^https?://[^/]+/).*/(.*) )
                         ! ($1||$3) . '(...)/'. ($2||$4) !ex;
                my $http = (split /\/\//,$dup_url)[0];
                my ($url_start, $url_end) =
                    (split /\// ,(split /\?/,$dup_url)[0])[2,-1];
                $dup_url = "$http//$url_start/(...)/$url_end";
            }
            print "REGEX: $url\n";
            print "SPLIT: $dup_url\n\n";
        }
        __DATA__
        http://some-shop.com/dir1/dir2/buystuff.cgi?x=1&y=2&z=3
        http://somewhere/with/a/vastly/deep/structure/virus.exe
        http://host.com/some/uri/whatever?some/query/stringthatis/here
        https://some-shop.com/dir1/dir2/buystuff.cgi?x=1&y=2&z=3
        https://somewhere/with/a/vastly/deep/structure/virus.exe
        https://host.com/some/uri/whatever?some/query/stringthatis/here
      I should have said that I can't post code because I have to split.