arturo has asked for the wisdom of the Perl Monks concerning the following question:

how do. This is part Perl-specific, part "general program theory". I am not by any stretch of the imagination a gene-e-yuss hacker, and I'm a little at sea as to where to start. Basically, I'm inheriting a system of scripts that parses an Apache logfile (rotated monthly, on a site that gets 6000+ hits/day; near the end of a month we're looking at ~200MB) to generate various stats. Currently, the logfile gets split, analyzed, and various fragments of the data are stored in tab-delimited flat files. The aim is to have a representation of the server log data that will allow cross-referencing on various things (e.g. what was the most common request from UNIX clients in .de?). The usual means of access is through a CGI script. Webalizer produces reports of the right sort, but the boss-man wants to keep this a custom job so we can maintain it in-house.

I take it that putting things in an SQL database is probably the sanest way to go here (faster at any rate, and it will certainly simplify things), but assuming that's not a goer (I'll make it one if I can; my supervisor seems mildly enthusiastic), does anybody have any advice concerning program design? I think I'd like to go OO for some parts (since the script generates various reports, I could have a master Report class and have various report types inherit from it), but the reports are usually generated on the fly for easy web access. Would there be a significant performance hit from going OO? (Disk space is not super-tight, but it's far from unlimited.)
I'm having trouble looking up resources for this project; I guess in the main that's what I'm fishing for, since I don't have a CS degree and I'm not really up on various means of data representation =)

Assuming I can't get SQL, would I see a performance increase if (e.g.) I moved from the tab-delimited flat files to DBM? And if so, which DBM would you Monks recommend? (From my glance at the capabilities, I suppose Berkeley DB, but I have no experience with this.) Thanks for any help you can give!

Replies are listed 'Best First'.
Re: Dealing with large logfiles
by btrott (Parson) on Jul 20, 2000 at 03:49 UTC
    In general, your object-oriented code will be slower than non-OO code. But that's not the whole story: the slowdown is not very significant, and is far outweighed by things like disk access, database access, etc. So on the whole, if it would make things easier for you to "go OO", at least conceptually, then I'd say go for it.
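
    If you want to put a rough number on the method-call overhead yourself, a quick Benchmark sketch like this will do it (the Counter class is just a throwaway example; the numbers are whatever your machine gives you):

        use strict;
        use Benchmark qw(cmpthese);

        package Counter;
        sub new { bless { n => 0 }, shift }
        sub inc { $_[0]{n}++ }

        package main;

        my $obj  = Counter->new;
        my %hash = ( n => 0 );

        # Compare a method call against a plain hash increment; run each
        # for about two CPU seconds and print a comparison table.
        cmpthese( -2, {
            oo    => sub { $obj->inc },
            plain => sub { $hash{n}++ },
        });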

    You also asked:

    > Assuming I can't get SQL, would I see a performance increase if (e.g.) I moved to DBM from the tab-delimited flat files?
    Yes, so long as you use them correctly. By that I mean, you need to structure your data in such a way--namely, as key-value pairs--that the benefits of using a DBM actually come through. For doing lookups, fetching data out of a DBM by a unique key is much faster than searching through each line of a text file trying to find it. But you have to get your data set up in such a way that this is the case.

    And yes, I would recommend using Berkeley DB. Get version 2, because I've heard about memory leaks in 1.
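
    To give you an idea of what that looks like in practice, here's a rough sketch using DB_File, the Berkeley DB interface that ships with Perl. The file name and the key scheme (request path to hit count) are just made up for illustration:

        use strict;
        use Fcntl qw(O_RDWR O_CREAT);
        use DB_File;                     # Perl's interface to Berkeley DB

        # Tie a hash to an on-disk B-tree.  Key scheme is made up for
        # illustration: request path => hit count.
        my %hits;
        tie %hits, 'DB_File', 'hits.db', O_RDWR | O_CREAT, 0644, $DB_BTREE
            or die "Cannot tie hits.db: $!";

        # One pass over the logfile to build (or update) the DBM.
        while (my $line = <>) {
            my ($path) = $line =~ m{"(?:GET|POST|HEAD) (\S+)};
            $hits{$path}++ if defined $path;
        }

        # Later, a report is a single keyed fetch, not a scan of a 200MB file.
        print "hits on /index.html: ", $hits{'/index.html'} || 0, "\n";

        untie %hits;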

Re: Dealing with large logfiles
by young perlhopper (Scribe) on Jul 20, 2000 at 05:29 UTC
    In a project like this I really recommend going with a SQL database of some type. (MySQL and Postgres are both viable free alternatives; see this column for a good overview of each one's strengths and weaknesses.) Also, see this for a good intro to some of the things you need to be aware of, like data modeling and so forth.

    If you want to do cross referencing, SQL offers a powerful methodology for that type of thing. SQL in general makes it very easy to do very complex tasks with very few words. I would recommend SQL not merely for performance issues but for the degree to which it will help you deal with some of the complexity it sounds like you want to work into this project.
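
    For a concrete taste of that, here's a rough DBI sketch of the ".de UNIX clients" example from your post. The DSN, credentials and the hits table are placeholders for whatever schema you end up loading the logs into:

        use strict;
        use DBI;

        # DSN, credentials and the hits(host, os, path) table are all
        # placeholders -- whatever schema you load the log data into.
        my $dbh = DBI->connect( 'dbi:mysql:weblog', 'user', 'password',
                                { RaiseError => 1 } );

        # "Most common request from UNIX clients in .de", as in the question.
        my $sth = $dbh->prepare(q{
            SELECT   path, COUNT(*) AS hits
            FROM     hits
            WHERE    host LIKE '%.de'
              AND    os = 'UNIX'
            GROUP BY path
            ORDER BY hits DESC
            LIMIT    10
        });
        $sth->execute;
        while ( my ($path, $count) = $sth->fetchrow_array ) {
            print "$path\t$count\n";
        }
        $dbh->disconnect;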

    Good Luck,
    Mark

Re: Dealing with large logfiles
by jeorgen (Pilgrim) on Jul 20, 2000 at 15:33 UTC
    Whatever solution you choose, keep in mind that your client may well come up with new ideas of what should be reported, once your report system is working.

    They will say: "Oh, you can do this now, then I want..."

    Here are some things that, in my experience, will be asked for by the client:

    • Session statistics, i.e. select all hits from cookie/IP number x that are less than 15 minutes apart (one way to do this is sketched below).
    • Different statistics for different departments, e.g. "We just realised that the ACME department should have their own stats; they have their pages in /docs/others/acme and in /cgi-bin/acmestuff/"
    • If the statistics are to be presented to management, presentation is very important; the reports should look good, and contain exactly the data that they want (the data that will impress them).

    Try to make a design that will make these kinds of things easy to put in once the client realises he/she wants them.
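
    For instance, the session statistics in the first item boil down to a fairly small piece of Perl once the hits are sorted by time. This is only a sketch with made-up field names:

        use strict;

        # One way to group hits into sessions: a new session starts whenever
        # the same client (cookie or IP) has been quiet for more than 15 minutes.
        # Assumes the hits are already sorted by time; field names are made up.
        sub count_sessions {
            my @hits = @_;            # each hit: [ $client_id, $epoch_seconds ]
            my (%last_seen, %sessions);
            for my $hit (@hits) {
                my ($client, $time) = @$hit;
                if ( !exists $last_seen{$client}
                     or $time - $last_seen{$client} > 15 * 60 ) {
                    $sessions{$client}++;
                }
                $last_seen{$client} = $time;
            }
            return \%sessions;        # client id => number of sessions
        }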

    Here is a description of the design of an OO log analysis application that I made. This design is in many ways a lot more primitive than what you want, but it may give you some ideas:

    There are a number of objects in the application. The most important ones right now are the report object, the logfile object, the input object and the category object. There is also an output object.

    The report object
    The report object (Report.pm) is the central object in the application. It stores global settings and it has slots for all other objects. Objects "know" about each other because they:

    • are stored in slots (attributes) in the report object
    • each have a slot for the report object
    Hence obj2 ("self") can call a method in obj1 by doing:

        $self->report->slot_for_obj1->method;

    I.e. go through the report object to access other objects.

    It also contains methods for resolving settings from CGI input with defaults. Lastly it contains high-level methods for printing to file and browser.
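
    A stripped-down sketch of that slot arrangement (the names here are simplified, not the actual Report.pm code) looks like this:

        package Report;
        use strict;

        sub new { bless {}, shift }

        # Put an object in a named slot and give it a back reference, so that
        # later $self->report->logfile->next_line works from any other object.
        sub set_slot {
            my ($self, $slot, $obj) = @_;
            $self->{$slot} = $obj;
            $obj->{report} = $self;
            return $obj;
        }
        sub logfile  { $_[0]->{logfile}  }
        sub input    { $_[0]->{input}    }
        sub category { $_[0]->{category} }

        package SomeObject;                 # stand-in for any of the other objects
        sub new    { bless {}, shift }
        sub report { $_[0]->{report} }      # the slot for the report object

        package main;
        my $report = Report->new;
        $report->set_slot( logfile => SomeObject->new );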

    The logfile object
    The logfile object represents the log file. It has a method next_line for returning the next line of the log file. It also parses the date of the line.
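
    A bare-bones version of such an object, assuming the Apache common log format (this is a sketch, not the real Logfile code), could be:

        package Logfile;
        use strict;

        sub new {
            my ($class, $path) = @_;
            open my $fh, '<', $path or die "Cannot open $path: $!";
            return bless { fh => $fh }, $class;
        }

        # Return the next line parsed into a hashref, or undef at end of file.
        # Field names are my own, not the real ones.
        sub next_line {
            my $self = shift;
            defined( my $line = readline $self->{fh} ) or return undef;

            # common log format: host ident user [date] "request" status bytes
            my ($host, $date, $request, $status, $bytes) =
                $line =~ m{^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)};
            return { host => $host, date => $date, request => $request,
                     status => $status, bytes => $bytes };
        }

        1;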

    The input object
    The input object represents the user's input from the HTML form. It particularly likes parameter names in three parts with hyphens between each part, i.e. "a-query-apples". It stores these parameters in a tree structure so that each object can go in and look up its own parameters.
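
    One way to build such a tree (again a sketch, not the actual Input code) is simply to split each parameter name on the hyphens:

        use strict;
        use CGI;

        # Turn parameters named like "a-query-apples" into a nested hash,
        # e.g. $tree->{a}{query}{apples} = $value, so each object can look
        # up its own branch.
        sub param_tree {
            my $q = shift;                     # a CGI.pm object
            my %tree;
            for my $name ( $q->param ) {
                my ($top, $mid, $leaf) = split /-/, $name, 3;
                next unless defined $leaf;     # ignore anything not in three parts
                $tree{$top}{$mid}{$leaf} = $q->param($name);
            }
            return \%tree;
        }

        my $tree = param_tree( CGI->new );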

    The category object
    The category object should be subclassed (and is so by the Pages.pm, Query.pm categories) but can be used right off the bat if configured properly. It can hold a pattern and do matching on the line. It stores statistics on matches in a tree structure, called "tree". It can present itself in HTML and does so in three ways:

    • As a description and checkbox for the user interface
    • As a report fragment with name and bar chart
    • As a table-of-contents fragment with a link to the report fragment
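
    A stripped-down sketch of the base class (the names are simplified and the HTML side is left out; subclasses like Pages.pm would supply their own pattern or matching logic):

        package Category;
        use strict;

        sub new {
            my ($class, %args) = @_;
            return bless {
                name    => $args{name},
                pattern => $args{pattern},   # a qr// regex with one capture
                tree    => {},               # captured value => hit count
            }, $class;
        }

        # Match one log line; record the hit under whatever the pattern captured.
        sub match {
            my ($self, $line) = @_;
            if ( my ($key) = $line =~ $self->{pattern} ) {
                $self->{tree}{$key}++;
            }
        }

        sub tree { $_[0]->{tree} }

        package main;
        # Example use: count requests per top-level path component.
        my $dirs = Category->new( name => 'Top dirs', pattern => qr{"GET (/[^/" ]*)} );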

    The HTML_output object
    The HTML_output object contains utility scripts (er, methods) for printing out stuff: Form elements, bar charts.
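
    For what it's worth, a bar chart helper of the usual stretched-image kind can be as small as this (a sketch, not the actual HTML_output code; "bar.gif" is a hypothetical one-pixel coloured image):

        use strict;

        # One row per item, bar width scaled to the largest count.
        sub bar_chart_html {
            my %counts = @_;
            my ($max)  = sort { $b <=> $a } values %counts;
            $max ||= 1;
            my $html = "<table>\n";
            for my $label ( sort { $counts{$b} <=> $counts{$a} } keys %counts ) {
                my $width = int( 400 * $counts{$label} / $max );
                $html .= qq{<tr><td>$label</td>}
                       . qq{<td><img src="bar.gif" height="10" width="$width"></td>}
                       . qq{<td>$counts{$label}</td></tr>\n};
            }
            return $html . "</table>\n";
        }

        print bar_chart_html( '.de' => 120, '.com' => 340, '.se' => 75 );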

    The full application code is available at SourceForge, but I doubt it is of any practical value to anyone else as yet. It does not use a database and doesn't split lines into their fields, so it doesn't fit your specs.

    /jeorgen

Re: Dealing with large logfiles
by cleen (Pilgrim) on Jul 20, 2000 at 07:10 UTC
    If we're just throwing out concept ideas here: yes, a database would be great for that type of cross-referencing. The cool thing about doing something like this is that you could easily make an interface like a search engine, almost a frontend to the database -- something where you could run simple SQL queries like "SELECT hostname, url FROM some_table WHERE hostname LIKE '%.de' ORDER BY url". Something like that could easily be put into a frontend CGI script with a bunch of drop-downs and so on. The only complexity is deciding how you want to lay out your tables. The actual frontend is simple DBI stuff.

    or at least in my humble opinion of course
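
    A skeletal version of that kind of frontend could look like this; the table and column names are invented, and the drop-downs just map onto a parameterised query:

        #!/usr/bin/perl
        use strict;
        use CGI;
        use DBI;

        my $q   = CGI->new;
        my $dbh = DBI->connect( 'dbi:mysql:weblog', 'user', 'password',
                                { RaiseError => 1 } );

        # Hypothetical schema: hits(host, path).  A drop-down supplies the
        # domain suffix (bound as a placeholder) and the sort column (checked
        # against a whitelist before it goes anywhere near the SQL).
        my %ok_sort = map { $_ => 1 } qw(path hits);
        my $sort    = $ok_sort{ $q->param('sort') || '' } ? $q->param('sort') : 'hits';
        my $domain  = '%' . ( $q->param('domain') || '.de' );

        my $sth = $dbh->prepare(qq{
            SELECT   path, COUNT(*) AS hits
            FROM     hits
            WHERE    host LIKE ?
            GROUP BY path
            ORDER BY $sort DESC
        });
        $sth->execute($domain);

        print $q->header('text/html'), "<table>\n";
        while ( my ($path, $hits) = $sth->fetchrow_array ) {
            print "<tr><td>$path</td><td>$hits</td></tr>\n";
        }
        print "</table>\n";
        $dbh->disconnect;
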
Re: Dealing with large logfiles
by arturo (Vicar) on Jul 20, 2000 at 18:24 UTC
    Thanks for all the suggestions, folks! After posting the original question, reading all the responses, and thinking more about the situation, I've decided to push harder for the SQL idea (we have an Oracle server, so power to spare!). One thing holding me back was the size of the database; we might be limited to keeping it under a gig or so, but people do occasionally request stats for the deep distant past, which currently lives in those flat files.

    I want to make the interface transparent not only to the user but to *me* as the programmer, as much as possible (except, of course, for speed!). Now that I have seen the DBD::RAM module, I have more hope of writing an abstraction layer: we'll store the older data in DBM files (or stick with the current flat files) and access it through DBD::RAM, which gives us an SQL frontend into that data.
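
    Roughly what I have in mind is a routine that hands back a DBI handle and hides which driver is behind it, something like the sketch below. The DSNs, paths and the hits table are placeholders, and I've put DBD::CSV in for the flat-file side here just to illustrate, since I haven't worked out the DBD::RAM syntax yet:

        use strict;
        use DBI;

        # Hand back a handle for the requested period: recent months live in
        # Oracle, older months are still tab-delimited flat files read through
        # DBD::CSV.  DSNs, directory and credentials are placeholders.
        sub dbh_for {
            my $period = shift;
            if ( $period eq 'recent' ) {
                return DBI->connect( 'dbi:Oracle:weblog', 'user', 'password',
                                     { RaiseError => 1 } );
            }
            return DBI->connect( 'dbi:CSV:', undef, undef,
                                 { f_dir        => '/data/old_logs',
                                   csv_sep_char => "\t",
                                   RaiseError   => 1 } );
        }

        # The report code issues the same SQL either way ("hits" is a
        # placeholder table name).
        my $dbh  = dbh_for('recent');
        my $rows = $dbh->selectall_arrayref(
            q{SELECT path, COUNT(*) FROM hits GROUP BY path} );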

    Is it conventional to say "omm" at this point? =)