Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've got a bunch of programs that handle a website I'm working on, one of which is a main "server" program. It basically gets a single parameter: page location (for example: http://.../serve.pl?page=/perl/index.html). This works very nicely. This script "builds" the page (that is, /perl/index.html wouldn't have title html, the navigation stuff, etc; it's just the guts of that particular page).

The catch is that for a page to display properly (title image, nav bar, etc.), every page that is viewed has to be run through this script. That means every internal link has to change from "<a href="/index.html">" to "<a href="/serve.pl?page=/index.html">".

I've got a decent version working, but I'm wondering if Perl (or heck, even UNIX/Linux for that matter) provides some way of reducing paths. By this, I mean if I get a link that points to "../programs/blah.cgi" in, say, "/computing/perl/index.html", it will be smart enough to figure out the absolute path -- in this case, "/computing/programs/blah.cgi" -- rather than a nested relative path -- "/computing/perl/../programs/blah.cgi".

It's basically a matter of beautifying HTML through Perl. If the link were a directory, I could simply run a system call like:

$dir = `cd $some_link; pwd`; chomp($dir);

(Knowing what to remove from the front of the new $dir variable, such as /usr/local/apache/htdocs/.)

Any thoughts?

Re: Minimizing paths?
by blokhead (Monsignor) on Sep 17, 2002 at 05:49 UTC
    You may want to consider using URIs instead of query strings for the arguments to your CGI script. Rewriting your links from /serve.pl?page=foo.html to /serve.pl/foo.html has the advantage that relative links in the HTML will be correctly interpreted by the browser, and you wouldn't have to do any work with directory trees. Not to mention, it's much easier to look at.

    You can easily utilize ScriptAlias directives in Apache for this, which allows you to tell the webserver that URLs like /serve.pl/arguments/go/here should be executed by the CGI script, and that serve.pl is not to be interpreted as a directory name. The stuff in the URI after serve.pl will be placed in $ENV{PATH_INFO}, so modifying your code is easy. You should be able to just change your code from using param('page') to the PATH_INFO environment variable instead, and everything should still work.
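
    A minimal sketch of that change, assuming a hypothetical helper named requested_page and a fallback of /index.html when no path is given (neither comes from the original script):

```perl
use strict;
use warnings;

# Hypothetical helper: take the page from PATH_INFO instead of param('page').
# The '/index.html' default is an assumption, not part of the original script.
sub requested_page {
    my ($env) = @_;
    my $page = $env->{PATH_INFO};
    return '/index.html' unless defined $page && length $page;
    return $page;
}

# Inside the CGI script this would be called as requested_page(\%ENV).
print requested_page({ PATH_INFO => '/perl/index.html' }), "\n";
```

    Passing the environment in as a hash reference keeps the helper easy to try out from the command line; the value still needs the same validation as any other user input.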

    Here's how you can get ScriptAliases to work:

    If you have access to the httpd.conf file, you can add a ScriptAlias for the individual script in question. It can be anywhere in the httpd.conf file and will look something like this:

    ScriptAlias /serve.pl "/usr/local/apache/htdocs/serve.pl"
    Alternatively, if you don't have access to the httpd.conf file, many ISPs configure your /cgi-bin directory to automatically use ScriptAliases, so anything in that directory will be configured properly for this.

    So now, all you have to do is rewrite your HTML parser to use the URI format instead of the query-string argument format. Now when you have a page /serve.pl/foo/bar.html, which has a link to ../perl.html in its HTML, the browser will do all the work on its own and come up with a URL that your script can accept as-is.

    The only work that really needs to be done is with absolute links. But that's trivial! Just prepend the location of your script to the beginning of all HREFs starting with a slash. /index.html becomes simply /serve.pl/index.html.
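
    As a rough sketch of that prepending step, assuming double-quoted href attributes and a script mounted at /serve.pl (the helper name prefix_absolute_links is made up for illustration, and a real HTML parser would be more robust than a regex):

```perl
use strict;
use warnings;

# Hypothetical helper: prefix root-relative HREFs with the script location.
# Assumes href values are double-quoted; relative links are left alone so
# the browser can resolve them against /serve.pl/... on its own.
sub prefix_absolute_links {
    my ($html, $script) = @_;
    # Match href="/..." but skip protocol-relative "//host/..." links
    # and links that already point at the script.
    $html =~ s{(href=")(?!//|\Q$script\E/)(/)}{$1$script$2}gi;
    return $html;
}

print prefix_absolute_links('<a href="/index.html">Home</a>', '/serve.pl'), "\n";
```

    The negative lookahead is there so that running the filter twice over the same HTML does not produce /serve.pl/serve.pl/... links.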

    Best of luck,

    blokhead

      If you are using Apache, you can also use the cgi-script handler instead of ScriptAlias. So, in general, if your serve.pl script is already running on the webserver, you can use the PATH_INFO trick to pass the URIs to the script.

      You do need to watch out for how root-relative and relative URIs resolve, though. Say you've got an HTML page with the following code:

      <img src="/pix/image.gif">
      <img src="pix/image.gif">

      When the HTML page is called as /index.html, the URIs for the images will be:

      /pix/image.gif
      /pix/image.gif

      When you call the page as /serve.pl/index.html, the URIs will be:

      /pix/image.gif
      /serve.pl/pix/image.gif

      By the way, using the BASE tag might also help.
      -- Joost downtime n. The period during which a system is error-free and immune from user input.
Re: Minimizing paths?
by perrin (Chancellor) on Sep 17, 2002 at 04:42 UTC
    Apache can do damn near anything. This would be a no-brainer with mod_perl, but it looks like you're using CGI, so try out mod_rewrite.
Re: Minimizing paths?
by swiftone (Curate) on Sep 17, 2002 at 13:17 UTC
    (Not really an answer to your question, but others have addressed that, and this is related)
    Assuming your main program doesn't do anything user-specific, why put the burden of program execution on the user? The sites I work on have their dynamic scripts as scripts, and the bulk of the HTML is generated just as yours is, except that I save the output to a branch the webserver serves up.

    That didn't sound right. Let me offer an example:

    /usr/local/apache/content holds the "content" html just as your /perl/index.html does.

    /usr/local/apache/htdocs is where the webserver serves up pages.

    My program wraps the content html in headers and footers (as well as running the whole thing through Template Toolkit), then saves the result as an .html file somewhere under /usr/local/apache/htdocs/

    Thus my users don't have any execution time for "flat" pages, only with actual dynamic content, but my unit doesn't have to maintain the navigation on every page because the build process drops it in place. Our setup is a little complex, with a database holding the relationships between pages, and the build process generating a different local navigation for each page, but you can season to taste.

    It's not a concept I can take any credit for, it's been done for a while (Laziness, after all). Template Toolkit ships with a helper program that does something like this, and it's a simple script to do it with most other template systems (I only have experience with Template::Toolkit and HTML::Template, but the concept holds true for anything with HTML output).

Re: Minimizing paths?
by twerq (Deacon) on Sep 17, 2002 at 14:02 UTC
    This technique is similar to what Joost was describing, but the way I usually get this kind of thing rolling is with Apache's surprisingly flexible MultiViews option.

    Simply throw Options +MultiViews in your virtualhost, and now Apache goes out of its way to match the best possible document from the URI.

    Which basically means you can pretend your serve.pl is a directory by using a url like
    http://.../serve/?page=/perl/index.html
    and your HREFs can look like <a href="?page=this_that_theother">Link!</a>

    Or, without MultiViews, you can simply tell Apache to use your serve.pl as the DirectoryIndex inside a directory and then just pass it your variables in the same way as above.


    --twerq
Re: Minimizing paths?
by Smylers (Pilgrim) on Sep 18, 2002 at 12:17 UTC
    I've got a bunch of programs that handle a website I'm working on, one of which is a main "server" program. It basically gets a single parameter: page location (for example: http://.../serve.pl?page=/perl/index.html). This works very nicely.

    Have you tested how ‘nicely’ your script works with pages that aren't in your document directory? You don't want things like this to work:

    • http://.../serve.pl?page=/etc/passwd
    • http://.../serve.pl?page=../../../../../../etc/passwd

    You'd improve security if you just passed in the basename of the file as the CGI parameter, with the path and extension being hardcoded in the Perl script and added there.
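
    One way that could look, as a sketch only: the helper name page_file, the allowed character set and the .html extension are all assumptions for illustration, not something from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a bare page name into a file path, or return
# undef if the name contains anything other than letters, digits, '_' or '-'.
# The directory and the '.html' extension come from the script, never from
# the user, so slashes, dots and NUL bytes in the parameter are all rejected.
sub page_file {
    my ($name, $content_dir) = @_;
    return undef unless defined $name && $name =~ /\A[A-Za-z0-9_-]+\z/;
    return "$content_dir/$name.html";
}

my $file = page_file('index', '/usr/local/apache/content');
print defined $file ? "$file\n" : "rejected\n";
```

    Allowing only a known-good character set is usually easier to reason about than trying to strip out every dangerous sequence. It does restrict pages to a single directory, so a site with subdirectories would need a different scheme.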

    But even that may not be secure. Do not put your site live without checking the vulnerabilities mentioned in this Phrack article. This still applies even if you go for URL rewriting as suggested in other answers.

    Smylers

Re: Minimizing paths?
by Anonymous Monk on Sep 17, 2002 at 16:24 UTC
    use Carp;

    sub reduce_path ($) {
        # given a path which may contain multiple instances of ..
        # return the path with them collapsed
        my @path = split /\//, shift;
        my @new_path;
        for my $part (@path) {
            if ($part eq '..') {
                # refuse to climb above the root (or above the start of a relative path)
                croak "I can't recurse past my root\n"
                    unless @new_path && length $new_path[-1];
                pop @new_path;
            }
            elsif ($part eq '.') {
                # do nothing.
            }
            else {
                push @new_path, $part;
            }
        }
        return join ('/', @new_path);
    }

    # usage:
    print reduce_path '/foo/bar/../baz', "\n";
    print reduce_path '/foo/bar/../../baz', "\n";
    print reduce_path 'foo/bar/baz/../', "\n";
Re: Minimizing paths?
by sharkey (Scribe) on Sep 17, 2002 at 18:54 UTC
    I did something similar for creating absolute paths for use in redirects. Here's the code:
    sub absolute_url ($$) {
        my ($base, $url) = @_;
        $base =~ s{/[^/]*$}{/};                   # remove file from base
        $url = $base . $url;                      # append base and url
        1 while ( $url =~ s{/[^/]*/\.\./}{/} );   # remove ..
        return $url;                              # all done!
    }