In general, your object-oriented code will be slower than
non-OO code. But that's not the whole story: the slowdown is
usually not significant, and it is far outweighed by things
like disk access, database access, etc. So if, on the whole,
"going OO" would make things easier for you, at least
conceptually, then I'd say go for it.
You also asked:
> Assuming I can't get SQL, would I see a performance
> increase if (e.g.) I moved to DBM from the
> tab-delimited flat files?
Yes, so long as you use them correctly. By that I mean you
need to structure your data in such a way--namely, as key-value
pairs--that the benefits of using a DBM are actually realized.
For lookups, pulling data out of a DBM by a unique key is much
faster than scanning through every line of a text file trying
to find it. But you have to set your data up so that such a
key exists.
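As a minimal sketch of that keyed-lookup idea (the file and key names are invented for the example), here is a DBM tied to a plain hash using SDBM_File from the core distribution; Berkeley DB via DB_File works the same way through tie():

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Tie a hash to an on-disk DBM file; SDBM creates the files as needed.
tie my %hits, 'SDBM_File', "$dir/hits", O_RDWR | O_CREAT, 0666
    or die "Cannot tie DBM file: $!";

# Store per-host hit counts keyed by hostname.
$hits{'foo.example.de'}  = 42;
$hits{'bar.example.com'} = 7;

# A lookup is a single hash access -- no scanning through a flat file.
my $n = $hits{'foo.example.de'};
print "foo.example.de: $n\n";

untie %hits;
```

Because the key goes straight to the record, lookup cost stays flat as the file grows, which is exactly where the flat-file scan loses.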
And yes, I would recommend using Berkeley DB. Get version 2,
because I've heard about memory leaks in version 1.
In a project like this I really recommend going with a SQL
database of some type. (MySQL and Postgres are both viable
free alternatives; see
this column for a good overview of their respective strengths
and weaknesses.) Also, see
this for a good intro to some of the things you need to
be aware of, like data modeling and so forth.
If you want to do cross referencing, SQL offers a powerful
methodology for that type of thing. SQL in general makes it
very easy to do very complex tasks with very few words. I
would recommend SQL not merely for performance issues but
for the degree to which it will help you deal with some of
the complexity it sounds like you want to work into this
project.
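As an illustration of that economy of expression, a cross-referencing query joining two hypothetical tables (the schema here is made up for the example, not from the original post) might look like this:

```sql
-- Hypothetical schema: hits(host_id, url, stamp), hosts(id, hostname)
-- Count hits per URL from German hosts, most popular first.
SELECT ho.hostname, hi.url, COUNT(*) AS n
FROM   hits  hi
JOIN   hosts ho ON ho.id = hi.host_id
WHERE  ho.hostname LIKE '%.de'
GROUP  BY ho.hostname, hi.url
ORDER  BY n DESC;
```

Doing the same cross-reference over flat files would mean hand-written loops and temporary data structures; here it is one statement.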
Good Luck,
Mark
Whatever solution you choose, keep in mind that your client may well come up with new ideas of what should be reported, once your report system is working.
They will say: "Oh, you can do this now, then I want..."
Here are some things that in my experience will be asked for by the client:
- Session statistics, i.e. select all hits from cookie/IP number x that are less than 15 minutes apart.
- Different statistics for different departments, e.g. "We just realised that the ACME department should have their own stats; they have their pages in /docs/others/acme and in /cgi-bin/acmestuff/"
- If the statistics are to be presented to management, presentation is very important; the reports should look good, and contain exactly the data that they want (the data that will impress them).
Try to make a design that will make this kind of thing easy to put in once the client realises he/she wants it.
Here is a description of a design of an OO log analysis application that I made. This design is in many ways a lot more primitive than what you want, but it may give some input:
There are a number of objects in the application. The most important ones right now are the report object, the logfile object, the input object and the category object. There is also an output object.
The report object
The report object (Report.pm) is the central object in the application. It stores global settings and it has slots for all other objects. Objects "know" about each other because they:
- are stored in slots (attributes) in the report object
- each have slot for the report object
Hence obj2 ("self") can call a method in obj1 by doing:

    $self->report->slot_for_obj1->method;

I.e. it goes through the report object to access other objects.
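A stripped-down sketch of that wiring (the package and slot names are invented for the example, not taken from the actual code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Central object: holds a slot for each collaborator.
package Report;
sub new     { bless { logfile => undef }, shift }
sub logfile { $_[0]{logfile} }

# A collaborator: keeps a back-reference to the report object.
package Logfile;
sub new    { my ($class, $report) = @_; bless { report => $report }, $class }
sub report { $_[0]{report} }
sub name   { "access_log" }

# Another collaborator reaching a sibling *through* the report object.
package Category;
sub new      { my ($class, $report) = @_; bless { report => $report }, $class }
sub report   { $_[0]{report} }
sub describe { my $self = shift; "category over " . $self->report->logfile->name }

package main;
my $report = Report->new;
$report->{logfile} = Logfile->new($report);
my $cat = Category->new($report);
print $cat->describe, "\n";
```

Note that these mutual references form cycles that Perl's reference counting will not free on its own; in a long-running process you would want weaken() from Scalar::Util on the back-references.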
It also contains methods for resolving settings from CGI input with defaults. Lastly it contains high-level methods for printing to file and browser.
The logfile object
The logfile object represents the log file. It has a method next_line for returning the next line of the log file. It also parses the date of the line.
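For the next_line/date-parsing part, a minimal sketch assuming Common Log Format input (the regex is illustrative, not the module's actual code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse host, timestamp and request out of a Common Log Format line.
sub parse_line {
    my ($line) = @_;
    my ($host, $date, $request) =
        $line =~ m/^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)"/
        or return;
    return { host => $host, date => $date, request => $request };
}

my $line = '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /docs/index.html HTTP/1.0" 200 2326';
my $hit  = parse_line($line);
print "$hit->{host} asked for $hit->{request} at $hit->{date}\n";
```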
The input object
The input object represents the user's input from the HTML form. It particularly likes parameter names in three parts with hyphens between each part, i.e. "a-query-apples". It stores these parameters in a tree structure so that each object can go in and look up its own parameters.
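The three-part parameter convention could be stored roughly like this (a sketch; the real module's internals may differ):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn hyphen-separated parameter names such as "a-query-apples"
# into a nested hash, so each object can look up its own subtree.
sub param_tree {
    my (%params) = @_;
    my %tree;
    for my $name (keys %params) {
        my ($obj, $group, $item) = split /-/, $name, 3;
        $tree{$obj}{$group}{$item} = $params{$name};
    }
    return \%tree;
}

my $tree = param_tree(
    'a-query-apples'  => 'on',
    'a-query-oranges' => 'off',
    'b-display-bars'  => 'on',
);
print $tree->{a}{query}{apples}, "\n";   # the "a" object's own parameters
```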
The category object
The category object should be subclassed (and is so by the Pages.pm, Query.pm categories) but can be used right off the bat if configured properly.
It can hold a pattern and do matching on the line. It stores statistics on matches in a tree structure, called "tree". It can present itself in HTML and does so in three ways:
- As description and checkbox for user interface
- As report fragment with name, bar chart
- As Table of contents fragment with link to report fragment
The HTML_output object
The HTML_output object contains utility scripts (er, methods) for printing out stuff: Form elements, bar charts.
The code is available at SourceForge, but I doubt that it is of any practical value to anyone else as yet. It does not use a database and doesn't split lines into their items, so it doesn't fit your specs.
/jeorgen
If we're just throwing out concept ideas here: yes, a database would be great for that type of cross-referencing. The cool thing about doing something like this is that you could easily make an interface like a search engine, almost a frontend to the database; something where you could do simple SQL queries like "select hostname, url from hits where hostname like '%.de' order by url". Something like that could easily be put into a frontend CGI script with a bunch of drop-downs and such. The only real complexity is deciding how you want to lay out your tables; the actual frontend is simple DBI stuff.
or at least in my humble opinion of course
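A sketch of how such drop-down choices might be turned into a query (the table and column names are invented; the placeholder style is standard DBI, ready for $dbh->selectall_arrayref($sql, undef, @binds)):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a SELECT from a few CGI drop-down values, using DBI-style
# placeholders so user input never lands in the SQL text itself.
sub build_query {
    my (%choice) = @_;
    my (@where, @binds);
    if ($choice{tld}) {
        push @where, "hostname LIKE ?";
        push @binds, '%.' . $choice{tld};
    }
    if ($choice{url_prefix}) {
        push @where, "url LIKE ?";
        push @binds, $choice{url_prefix} . '%';
    }
    my $sql = "SELECT hostname, url FROM hits";
    $sql .= " WHERE " . join(" AND ", @where) if @where;
    $sql .= " ORDER BY url";
    return ($sql, @binds);
}

my ($sql, @binds) = build_query(tld => 'de', url_prefix => '/docs/');
print "$sql\n";
```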
Thanks for all the suggestions folks!
After posting the original question and reading all the responses,
and thinking more about the situation, I've decided to push
harder for the SQL idea (we have an Oracle server, so power
to spare!). One thing holding me back was the size of the
database: we might be limited to keeping it under a gig or so,
but people do occasionally request stats for the deep, distant
past, which currently lives in those flat files.
I want to make the interface transparent not only to the user
but to *me* as the programmer, as much as possible (except, of
course, for speed!). Now that I have seen the DBD::RAM module,
I have more hope of writing an abstraction layer: we'll store
the older data in DBM files (or stick with the current flat
files) and access it through DBD::RAM, which gives us an SQL
frontend onto that data.
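One way to sketch that abstraction layer is a small dispatch that picks a DBI data source based on how old the requested data is. The cutoff, both DSN strings, and the idea of routing archives through DBD::RAM are all assumptions for illustration, not tested against either driver:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pick a DBI data source by the age of the data requested: recent
# stats come from Oracle, older ones from flat files via DBD::RAM.
# Both DSN strings below are illustrative.
sub dsn_for_year {
    my ($year, $current_year) = @_;
    return $current_year - $year <= 1
        ? 'dbi:Oracle:weblogs'   # live tables, size kept under a gig
        : 'dbi:RAM:';            # archived flat files loaded on demand
}

print dsn_for_year(2000, 2000), "\n";
print dsn_for_year(1997, 2000), "\n";
```

The calling code then only ever sees DBI handles and SQL, which is the kind of transparency you're after.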
Is it conventional to say "omm" at this point? =)