Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

Question about properly laying out a database

by Stamp_Guy (Monk)
on Dec 12, 2001 at 02:31 UTC ( [id://131083]=perlquestion: print w/replies, xml ) Need Help??

Stamp_Guy has asked for the wisdom of the Perl Monks concerning the following question:

Hello Fellow Monks: I have a question about a database I'm trying to lay out. It will be using the Berkley database through the DB_File module. It will be accessed through the web via a form with four options - each of which narrow down the results. Since it will be for an auto dealership, the limiting fields will be: Make, Model, Year, and Price Range. I could simply have a unique key for each car and then iterate over the list each time to narrow it down, but I have a hunch there is a better way to do it. There shouldn't be more than about 500 entries in the database at any time, so it's pretty small. I would appreciate all the suggestions I can get. In the mean time, I'll be combing perl monks for more ideas. :-) Thanks!

Replies are listed 'Best First'.
Re: Question about properly laying out a database
by joealba (Hermit) on Dec 12, 2001 at 03:09 UTC
    I'd suggest that you have functions to perform each search, returning an array of matches. Then, just do a quick join of all your keys that match the search. The real question is "How do you make optimal search functions?"

    To get the fastest, most foolproof solution, you should normalize the database. Rather than having vehicle entries like this:
    my %cars = ( 1001 => { make => "Chevrolet", model => "Camaro", year => 1972, price => 6000, }, ...
    You should have this:
    my %cars = ( 1002 => { make_id => 10, model_id => 27, year => 2001, price => 24000, }, ); my %makes = { 1 = "Chevrolet", 10 = "Acura", }; my %models = { 12 = "Camaro", 27 = "RSX", };
    Then, your "make" and "model" search functions are doing fast, numerical comparisons, and not string matches. And you're guaranteed to have consistent spelling of Chevrolet! You never have to match /chev(y|rolet)/i!

    You'll probably want other fields, like "description", "series", "miles", etc.

    You'll also want things like "show_price" and "show_miles". If that 1972 Camaro has 217,000 miles, no one is gonna look at it at $6,000, even if it's puff! So, just don't show the price -- and get the customer in there to see how well it's been restored.

    Other thoughts: Might Storable be a better way to go than DB_File, monks?

    There's no way for you to go with MySQL, huh? :)

    I've got just a *little* experience with exactly this stuff <dripping sarcasm> - 1 2. /msg me if you need more help.
      No, I don't think Storable would be a better option for this. With Storable, you would have to load the entire database into memory just to search it. DB_File creates fast, indexed access to records without actually loading them into RAM.

      However, your suggestion of getting an array of matches for each criterion and then finding the overlapping set is a good one. You can make one DB_File database for each criterion (make.db, model.db, etc.) and then the content for each record could be a list of car IDs (like unique object IDs) that you use to look up the car data in a separate content database (also a DB_File, with data serialized using Storable in each record).

      However, it would be much easier to just use MySQL.

        But if he's going to search all the records to match on some set of criteria, won't he have to load the whole thing in memory anyway?

        In this application, it's not very often that you call up one record by its key id. Searches are more common. So, you'll have to evaluate just how much data each record will hold, how much RAM will be used up, and how quickly you can load all that data into memory.

      Given joelba's excellent suggestion, it may also help to create some indexes to speed up subsequent searches. Recall, in "Re: Question about properly laying out a database," above:

      my %cars = ( 1002 => { make_id => 10, model_id => 27, year => 2001, price => 24000, }, ); my %makes = { 1 = "Chevrolet", 10 = "Acura", }; my %models = { 12 = "Camaro", 27 = "RSX", };

      For example, indexing Makes might look like:

      my %makes_idx = ( 1 => [1001, ... ], # cars that are Chevys 10 => [1002, ...], # cars that are Acuras ... );

      Or, indexing Years might look like:

      my %years_idx = ( 2001 => [ 1001, 1002, ... ], # cars that are 2001 models ... );

      You get the idea.

      Then searching by any single attribute that is indexed becomes nearly instantaneous. Doing a "JOIN" on two (or more) indexed attributes looks something like this:

      # For example, finding all 2001 or newer Chevys my %temp_hash = reverse %makes; # gives the id for each make my $make_id = $temp_hash{'Chevrolet'); # get the id for Chevy my %make_set = map { $_=>1 } @{$makes_idx{$make_id}}; # set of cars t +hat are Chevys # search above for 2001 or newer models my @temp_years = grep { $_ >= 2001 } keys %years_idx; my @result_ids = grep { exists $make_set{$_} } @years_idx{@temp_years} +;
      Update: Changed search example to look for a range rather than a specific year.

      dmm

      You can give a man a fish and feed him for a day ...
      Or, you can
      teach him to fish and feed him for a lifetime
        Hmm.. My gut reaction at first was "Well, it's a relatively small amount of data. And, you'll wind up recalculating the indexes for each search anyway. Great idea, but is it worth the effort here?"

        Then, I got a message from Stamp_Guy saying that he cannot use DB_File because complex data structures aren't supported. (C'mon.. how complex is a HoH?) So, now I am thoroughly convinced that Storable is a Good Idea, as long as the number of records doesn't get out of hand. And if you use Storable, the indexes can be stored objects too -- so the overhead of generating the indexes is gone from the search program. Sweet!

        So, here's the quick and dirty:
        %cars - Storable HoH as illustrated above, with any additional vehicle info added too

        %makes - Storable hash of make_id => make. This should probably be a complete list of makes, which you can find on cars.com.

        %models - Storable HoH of
        model_id => { model => 'modelname', make_id => make_id (key from %makes) }
        . You can build this make_id hash as your inventory grows. Be sure to leave off the model details/series. Ex. model => 330 should be in this hash to cover all BMW 330's. You don't necessarily need to have one entry for 330xi, 330ci, etc. Lexus is another pain in the rear when it comes to this model naming scheme.

        • Then, your indexes: %years_idx, %make_idx, %model_idx. Obviously you will need to recreate your indexes each time you make a change to the dealer's inventory. A minor race condition exists here if someone searches while you are updating the inventory, but I'll leave that for you to solve. :)
Re: Question about properly laying out a database
by chip (Curate) on Dec 12, 2001 at 02:42 UTC
    I suggest you use a real database like PostgresQL or at least MySQL. They're not that hard to set up, DBI is wonderful, and you can query on any field with great simplicity.

    Failing that, you can keep your database in a flat file and use DB_File to create multiple indices into that flat file.

        -- Chip Salzenberg, Free-Floating Agent of Chaos

      I suggest you use a real database like PostgresQL or at least MySQL. They're not that hard to set up, DBI is wonderful, and you can query on any field with great simplicity.

      This would be my suggestion, too. It also gives you the benefit of scalability - if your database is so designed.

      To help you on your way, jeffa has written an article (Migrating a 1NF table to multiple 2NF tables) that gives very good pointers on db design. There are also good links at the end of that article and it's replies so you might want to check them out.

      --
      sligi

        My initial reaction as well was to suggest a DB like MySQL, but on further consideration... I've worked on a website with over 8000 entries where each entry had about 15 pieces of metadata, all of which was stored in DBMs. It's worked reliably for over five years.

        Question to the questioner: is this the first brush at this sort of system that your employer has attempted? If so, don't try to get it perfect, just get something out there that works, even if you end up rewriting the whole thing again in 6 months. It's unavoidable that the requirements will change, and it's futile to spend time now implementing a heavyweight architecture where the problem domain isn't clear.

Re: Question about properly laying out a database
by dws (Chancellor) on Dec 12, 2001 at 02:54 UTC
    You don't mention it, but I assume that there's additional per-automobile information in each database record. I also assume that the "price range" is an enumeration. (I've been car shopping lately, and most sites seem set up that way.)

    Conceptually, you'll be reading each of the 500 records, applying the search criteria, and then collecting up the matching records for display. Easy stuff, with many implementation options.

    At an implementation level, you probably want to avoid reading the full record until you need to (assuming there's lots more stuff in the full record). This suggests splitting your data into two tables. The first would hold search criteria and the unique key. The second would hold the unique key and the remaining data.

    You could put each table into a separate DBM file, or, if performance is a big concern, you could leave the "search criteria" file as a flat file. It would only be ~24K, assuming an average record size of 50 bytes and a 500 entry limit. It's going to be faster to suck a file of that size into memory and search it in-place than it is to read the same data record-by-record from a DBM.

Re: Question about properly laying out a database
by traveler (Parson) on Dec 12, 2001 at 04:18 UTC
    joealba made the valuable suggestion that you "normalize" your data. If you are not a DB guru this may not mean much. Here is a paper describing the concepts in a fairly readable way. The link takes you into the middle of an article, starting with a section "Why Normalize". You may want to go back to the beginning if you are not familiar with the definitions. There are lots of good articles on the web, but this may help get you started on why and how to normalize. There is also a good collection of info on About. I found the normalization series exceptionally clear.

    Regardless of the tool you use, normalization does help. In general, database practice is pretty well evolved. Some techniques may seem like overkill (and some may be), but they may make your overall design and programming task easier.

    HTH, --traveler

Re: Question about properly laying out a database
by ask (Pilgrim) on Dec 12, 2001 at 13:43 UTC
    You guys are just too much work! :-)

    While I also would use MySQL or Postgresql for something like this it really doesn't matter. How to organize the data doesn't matter either.

    Stamp_Guy, I would recommend just focusing on organizing the data so it's most convenient for you to implement the search features you need. If the site gets less than a few hits per second, then with only 500 entries in the database it'll be super fast no matter how you do it. How big is the whole database? If you run mod_perl you could actually use Storable and just load the whole thing into memory every time the file gets updated (although with DB_File and BerkeleyDB you can get either shared memory BerkeleyDB foo and/or the filesystem cache).

    But I digress; the point is that all that doesn't matter. Save the fancy design for when you need it. Focus on getting the thing done and implementing the needed functionality.

     - ask

    -- 
    ask bjoern hansen, http://ask.netcetera.dk/   !try; do();
    
      The design doesn't have to be fancy.. but it does have to be correct. Organizing the data in a normalized way will make it faster, easier to implement, and MUCH easier to extend later.

      Always think ahead... *especially* when you are dealing with people in sales. :) Out of 10 projects developed for sales purposes, all 10 of them will want extra functionality after the project is completed. I absolutely guarantee it (or your money back). hehehe

      Update: Thought I'd clarify on the "easier to implement" bit. Okay, you'll have to do an extra lookup to match a few keys to their values -- like matching the make_id to the make text string. But, when it comes time to write search code, it's faster and easier to develop a search that does a numerical match, rather than some kind of regexp match on the string!
Re: Question about properly laying out a database
by bmccoy (Beadle) on Dec 12, 2001 at 08:07 UTC
    Why not take the extra leap and use an RDBMS like PostgreSQL or MySQL? DB_File may be fine for something small, but projects (especially web-based ones) always grow beyond their specifications and end up requiring more maintenance and robustness. :-)

    -- Brett

    Go not to the Elves for counsel, for they will say both no and yes

Re: Question about properly laying out a database
by Anonymous Monk on Dec 13, 2001 at 01:10 UTC
    BerkeleyDB, you've got exactly one index per file. You could fake extra indexes, but you'll have come up with "one to many" relationships - you may find "chevy" in the "brands" index, but it can't refer to any single other record - it has to refer to multiple other records - which is ok with a RDBMS, but a pain with .db files. If you make a linear search through the .db file every time, performance should be ok for no more then a hit a second, but will be slower then doing a linear search through a flat file. With a hashed file, you access the records randomly when you do a "linear" search, but with a flat file, a linear search really is linear. I wouldn't have any reservations at all about doing... EACHREC: foreach my $rec (keys %tiedhash) { foreach my $criteria (split / /, @searchkeys) { next EACHREC unless $rec =~ m/$criteria/; } print qq{
  • $rec $tiedhash{$rec}\n}; } or something thereabouts...
Re: Question about properly laying out a database
by Stamp_Guy (Monk) on Dec 13, 2001 at 07:22 UTC
    I've kinda got a mix of you guy's suggestions. This is totally a test script here. I'll be using HTML::Template for all the html and will be cleaning up a few things, but to give you a general idea, here's what I made. It seems to work perfectly for my purposes. I tested it with as many as 3000 entries, even though it isn't likely that it will EVER have more than 500 (300 is what we expect). Comments are welcome!

    #!c:/perl/bin/perl.exe -w use strict; use CGI::Carp qw(fatalsToBrowser); ##################################################################### # CONFIGURATION ##################################################################### my %OPTIONS = ( databaseDir => "c:/", ); my %makes = ( 1 => 'all', 2 => 'Chevrolet', 3 => 'Ford', 4 => 'Pontiac', 5 => 'Honda', ); my %models = ( 1 => 'all', 2 => 'Car', 3 => 'Truck', 4 => 'Van', 5 => 'Mini Van', ); my %years = ( 1 => 'all', 2 => '1980-1985', 3 => '1986-1990', 4 => '1991-1995', 5 => '1996-2000', 6 => '2000-Present', ); my %sortBy = ( # The numbers here correspond to the place in the output hash string. Make => '0', Model => '1', Price => '2', Year => '3', ); ##################################################################### # END OF CONFIGURATION ##################################################################### use DB_File; use CGI; my $query = new CGI; my $action = $query->param('action'); my $id = $query->param('id'); my $sortBy = $query->param('sortBy'); my $make = $query->param('make'); my $model = $query->param('model'); my $year = $query->param('year'); my $price = $query->param('price'); my %OUTPUT; print $query->header; my %ACTIONS = ( search => \&search, display => \&display, ); if ($action) { # If an action is given, active the appropriate subroutine $ACTIONS{$action}->(); } else { # If no action is given, display the update page $ACTIONS{'search'}->(); } ##################################################################### # SUBROUTINES ##################################################################### sub check { if($_[0] == 1) { return(1); } elsif($_[0] == $_[1]) { return(1); } else { return(0); } } sub checkPrice { if($_[0] <= $price) { return(1); } else { return(0); } } sub search { print "<html><head><title>Output Page</title><style type=\"text/cs +s\"><!-- .header { font: bold 10pt verdana; color: white; background: + orange; } .caption { font: 8pt verdana; color:black; } --></style></ +head><body><TABLE width=600 cellpadding=0 cellspacing=0>"; my %OUTPUT=(); my $counter = 0; tie (my %INDEX, "DB_File", "$OPTIONS{databaseDir}/index.db", O_RDW +R|O_CREAT, 0755, $DB_HASH) || die "Cannot open database: $!\n"; tie (my %DATABASE, "DB_File", "$OPTIONS{databaseDir}/database.db", + O_RDWR|O_CREAT, 0755, $DB_HASH) || die "Cannot open database: $!\n"; foreach my $key (keys %INDEX) { my @key = split(/\|/, $INDEX{$key}); if(&check("$make","$key[0]")) { if(&check("$model","$key[1]")) { if(&check("$year","$key[2]")) { if(&checkPrice("$key[3]")) { # Form of output string: Make | Model | Price +| Details (from database. in this case, year|description) $OUTPUT{$key} = "$makes{$key[0]}|$models{$key[ +1]}|$key[3]|$DATABASE{$key}"; $counter++; } } } } } if($counter == 0) { print "<td>No cars matched your search. Please try again.</td +>"; } else { print "<tr><td colspan=4 align=right class=caption>Click on th +e header to sort by that category</td></tr>"; print "<tr> <td class=\"header\"><a href=\"carsearch.cgi?sortBy=Make\& +make=$make\&model=$model\&price=$price\&year=$year\">Make</a></td> <td class=\"header\"><a href=\"carsearch.cgi?sortBy=Model\ +&make=$make\&model=$model\&price=$price\&year=$year\">Model</a></td> <td class=\"header\"><a href=\"carsearch.cgi?sortBy=Year\& +make=$make\&model=$model\&price=$price\&year=$year\">Year</a></td> <td class=\"header\"><a href=\"carsearch.cgi?sortBy=Price\ +&make=$make\&model=$model\&price=$price\&year=$year\">Price</a></td> </tr>"; foreach my $key (sort { my @a = split(/\|/, $OUTPUT{$a}); my @ +b = split(/\|/, $OUTPUT{$b}); $a[$sortBy{$sortBy}] cmp $b[$sortBy{$so +rtBy}]; } (keys %OUTPUT)) { my @details = split(/\|/, $OUTPUT{$key}); print "<tr><td><a href=\"carsearch.cgi?action=display\&id= +$key\">$details[0]</a></td>"; print "<td>$details[1]</td>"; print "<td>$details[3]"; print " <td>\$$details[2]</td></tr>\n "; } } print "</TABLE></body></html>"; untie (%DATABASE); untie (%INDEX); } sub display { tie (my %INDEX, "DB_File", "$OPTIONS{databaseDir}/index.db", O_RDW +R|O_CREAT, 0755, $DB_HASH) || die "Cannot open database: $!\n"; tie (my %DATABASE, "DB_File", "$OPTIONS{databaseDir}/database.db", + O_RDWR|O_CREAT, 0755, $DB_HASH) || die "Cannot open database: $!\n"; my @key = split(/\|/, $INDEX{$id}); my @details = split(/\|/, $DATABASE{$id}); print "Make: $makes{$key[0]}<br>"; print "Model: $models{$key[1]}<br>"; print "Year: $details[0]<br>"; print "Price: \$$key[3]<p>"; print "Description: $details[1]"; untie (%DATABASE); untie (%INDEX); }
Re: Question about properly laying out a database
by bct (Initiate) on Dec 13, 2001 at 01:57 UTC
    More of the same: It seems to me that fussing with
    BerkeleyDB and DB_File is the long way.

    To get everything that others have offered (indexing,
    normalization) and have a fast, scalable system,
    think about using MySQL and DBI.

    bct

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://131083]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others surveying the Monastery: (6)
As of 2024-03-29 11:43 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found