newrisedesigns has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks,

I am in the process of writing a weblog page for my website which will be driven by Perl. Currently, I have the program slurp in the weblog file (plain-text, with a record delimiter) and spit the chunks back out.

Slurping is a sin, and I wish to become enlightened. I want to rewrite the code (it's not yet live) so the weblog isn't slurped from a file. I'm considering using a 'while' loop; however, I'm not sure how I can keep track of where I am in the file. I have an "offset" modifier and a "span" modifier, which allow users to start at a point in the log and read through however many entries they requested.

Currently, I have a $count variable in the 'while' loop, and some 'if' statements that determine which entries are pulled. I am under the assumption that I can use $_ to read from <WEBLOG> and $. to get the current entry number (as I would with $weblog[$currententry]).
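
Roughly what I have now (a simplified sketch, not the live code; the record delimiter and the $offset/$span values stand in for whatever comes from the form):

    # Count entries as they are read and print only the ones that
    # fall inside the requested window.
    local $/ = "\n%%\n";    # stand-in record delimiter
    my $count = 0;
    while (<WEBLOG>) {
        $count++;
        if ($count >= $offset and $count < $offset + $span) {
            chomp;
            print;
        }
    }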

Is this the most efficient form of reading from a file? Any suggestions?

a somewhat lost John J Reiser
newrisedesigns.com

Re: Anti-slurp weblog
by Zaxo (Archbishop) on Jan 21, 2002 at 12:25 UTC

    I'll second grep's suggestion that you look into database support and other already-invented wheels. That sidesteps the question you asked, though, of how to be efficient about reading files, and what use could be made of $_ and $. in doing so.

    As it happens, you posted this right after some cb discussion of readability, maintainability, C-style operators, and precedence. My suggestion here is an example of code I consider readable and all those other good things, but which depends crucially on "esoteric" details like operator precedence. I believe it to be highly optimized in the good way, by doing as little as possible inside the loop.

    It sounds like you have most of the pieces in mind. Assuming WEBLOG is opened successfully, $sep is the record separator, and $offset and $span are as you describe:

    {
        local $/ = $sep;
        my $done = $offset + $span;
        while (<WEBLOG>) {
            chomp, print if $. == $offset .. $. == $done && last;
        }
    }
    That uses the "flip-flop" operator, .. in scalar context. It returns false until its first argument is true. Then it ignores its first argument and returns true until after its second argument becomes true. I use that so only one comparison is done for each record read. last is called to stop reading at that point, with && used because it has higher precedence than .., so last is short-circuited out until it is needed.

    chomp and print are done in sequence (comma operator) only if the flip-flop is returning true.
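
    If the flip-flop is unfamiliar, here is a tiny self-contained sketch (plain numbers, nothing to do with the weblog) of how .. behaves in scalar context:

        # Prints 3, 4 and 5 only: the flip-flop turns on when $_ == 3
        # and turns off again after $_ == 5.
        for (1 .. 10) {
            print "$_\n" if $_ == 3 .. $_ == 5;
        }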

    After Compline,
    Zaxo

Re: Anti-slurp weblog
by trs80 (Priest) on Jan 21, 2002 at 10:50 UTC
    I took the question to refer to a web diary of sorts and not the Apache web server logs. I would suggest using separate files for each entry, with a counter file to sequentially number your entries; use that number as part of the filename. I recommend that you start the counter at 100000, so you can easily search for filenames that contain six digits in the future. It will be safe to slurp the smaller files as needed, and it is easy to pull from one point to the next with something like this:
    # variable to store the log entries in
    my $weblog_data;

    # default number of entries per page (this could be dynamic)
    my $number_entries = 10;

    # starting point for the entries that need to be pulled
    my $starting_point = '100123';

    foreach ($starting_point .. $starting_point + $number_entries) {
        if (-e "$_.wl") {
            open(WEBLOG, "$_.wl") or warn $!;
            while (<WEBLOG>) {
                $weblog_data .= $_;
            }
            close(WEBLOG);
            $weblog_data .= "<br>";
        }
    }

    That is just an example of how to loop through to get them, but the code is far from complete. Using the number method will make your next and previous links easier to create as well. For me, working with several files in this manner is easier than working with one long file.
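
    A minimal sketch of how the counter file might be maintained when a new entry is posted (the counter.txt filename and the flock-based locking are assumptions, not part of the suggestion above):

        use Fcntl qw(:flock);

        # Read the current counter, bump it, and write it back.
        # Assumes counter.txt was seeded with 100000.
        open(my $fh, '+<', 'counter.txt') or die "Can't open counter: $!";
        flock($fh, LOCK_EX)               or die "Can't lock counter: $!";
        my $count = <$fh>;
        chomp $count;
        $count++;
        seek($fh, 0, 0);
        truncate($fh, 0);
        print $fh "$count\n";
        close($fh);

        # $count is now the number to use in the new entry's filename,
        # e.g. "$count.wl".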
Re: Anti-slurp weblog
by grep (Monsignor) on Jan 21, 2002 at 09:32 UTC
    I would suggest not reinventing the wheel here. Analog is an excellent open source weblog analyzer. If you want prettier pictures for pointy-haired bosses, then look to WebTrends.

    Both of these products have been around for a while and are easy (very easy) to setup in *nix and Win32.

    But to answer your question...
    I think you are hitting the practical limitations of flat files. I would suggest going to a real database of some kind. PostgreSQL is an excellent database that will meet all your needs and more, and it runs in *nix and Win32 (with some finagling). Sorry, no Mac.
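
    To make that concrete, a rough sketch of the same offset/span fetch done through DBI (the table name, columns, and connection details are invented for illustration):

        use DBI;

        # Connect to a hypothetical "weblog" database.
        my $dbh = DBI->connect('dbi:Pg:dbname=weblog', 'username', 'password',
                               { RaiseError => 1 });

        # Pull $span entries starting $offset entries in, newest first.
        my $sth = $dbh->prepare(
            'SELECT title, body FROM entries ORDER BY posted DESC LIMIT ? OFFSET ?'
        );
        $sth->execute($span, $offset);

        while (my ($title, $body) = $sth->fetchrow_array) {
            print "$title\n$body\n\n";
        }
        $dbh->disconnect;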

    Don't reinvent good wheels

    UPDATE: as trs80 pointed out, it looks like I am off base with Analog. I would still suggest looking into a database solution, or trs80's solution of a series of flat files. If you think you will be getting a lot of traffic, I would lean towards the DB.

    grep
    grep> cd pub
    grep> more beer
Re: Anti-slurp weblog
by n3dst4 (Scribe) on Jan 22, 2002 at 00:07 UTC

    There is also great merit in maintaining an index file. You can index, say, the date, title and filename, and the index file will probably remain small enough that you *can* slurp it. I mean, you're not going to run Slashdot with it, but even if you post once a day for the next three years you're still looking at a few K. And you can do a few neat tricks, like just reading off the top three records to save time.

    For great speed, use fixed-length records. Space wastage sucks, but you then get the luxury of being able to jump around the file arbitrarily.
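
    A quick sketch of that trick, assuming a hypothetical index.dat made of 80-byte records (10 bytes of date, 50 of title, 20 of filename):

        # Jump straight to record $n without reading anything before it.
        my $record_length = 80;    # fixed record size, space-padded
        my $n = 5;                 # zero-based record number to fetch

        open(my $idx, '<', 'index.dat')    or die "Can't open index: $!";
        seek($idx, $n * $record_length, 0) or die "Can't seek: $!";
        read($idx, my $record, $record_length);
        close($idx);

        # Split the fixed-width fields back out.
        my ($date, $title, $file) = unpack('A10 A50 A20', $record);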

    <lecture props='grep'>
    The thing is, once you've got a good index file system going, you probably want to add querying abilities, so you'll add in a keyword index. Then you'll break the index handler off into a daemon process so you can update asynchronously and handle collisions. Then you'll generalise the data storage to handle arbitrary data in arbitrary tables. You might even add a SQL interpreter. You get the idea: you will thank yourself for learning to use a database and DBI.
    </lecture>

    But if it's impossible, I meant what I said about an index file.

      Everyone++ for all your help. The Index file idea is great.

      The problem is, I'm virtually hosted, and I don't have MySQL or PostgreSQL set up. Although I can have my website 'upgraded' (my service provider is updating servers), I don't have the time to back everything up and move everything over. Chalk it up to plain old laziness and lack of time.

      Hopefully, though, I will be on the new server, and I will rewrite the weblog using MySQL (offered with the upgrade).

      Thanks again for all your help, it is greatly appreciated.

      John J Reiser
      newrisedesigns.com