an efficient program?

by Angharad (Pilgrim)
on May 25, 2006 at 20:52 UTC ( [id://551702] )

Angharad has asked for the wisdom of the Perl Monks concerning the following question:

Hi there
I have a MySQL database with around 80,000 entries. Every entry carries a code like wey134. These codes represent certain 'groupings'; a significant number of entries can belong to the same grouping, and those entries all share the same code.
Each entry also has its own unique ID. For every entry, I want to do some analysis: comparing the first entry, for example, with every other entry in the database. But I only want to run the analysis against entries that are NOT in the same grouping.
As it's a large database, I want to make sure I write an efficient program. I want to go through each entry in the database, compare it with all the other entries, and pull out those that are not in the same grouping as the entry of interest.
So I thought of using fetchall_arrayref: go through each entry in the array one at a time with a for loop, and inside that loop run a second for loop over all the other entries, comparing 'group codes' and pulling out those that differ from the initial entry.
I'm concerned, though, that this would make for a particularly slow program, and I was wondering if anyone had suggestions for making the process faster.
Thanks a lot.

Replies are listed 'Best First'.
Re: an efficient program?
by Errto (Vicar) on May 25, 2006 at 21:19 UTC
    Normally I would suggest doing as much of this sort of processing as possible in SQL, and only the rest on the client side. If you really need to compare every entry with every other entry that does not share its code value, you can use (God help me) a Cartesian product:
        SELECT a.entry_id, a.some_other_col, b.some_other_col
          FROM my_table a, my_table b
         WHERE a.entry_code != b.entry_code
         ORDER BY a.entry_id
    This way you only need to do one query against the DB. If you have enough memory on the machine, you can also select the data in fairly large chunks using fetchall_arrayref with an appropriate max_rows parameter. But bear in mind what you're trying to do will be inherently fairly resource-consuming. For example, let's say you have 50 groupings and 1600 rows per grouping. That means you'll have a total of (80000 x (80000 - 1600)) = 6,272,000,000 rows of data to deal with, which is a lot.
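    To keep memory bounded while still issuing only one query, the fetchall_arrayref chunking might look something like this (a sketch, untested; the connection details and the my_table/entry_code names are placeholders carried over from the query above):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'password',
                               { RaiseError => 1 });
        my $sth = $dbh->prepare(
            'SELECT a.entry_id, a.some_other_col, b.some_other_col
               FROM my_table a, my_table b
              WHERE a.entry_code != b.entry_code
              ORDER BY a.entry_id'
        );
        $sth->execute;

        # Fetch in chunks of 10_000 rows so the whole Cartesian
        # product is never held in memory at once.
        while (my $chunk = $sth->fetchall_arrayref(undef, 10_000)) {
            last unless @$chunk;
            for my $row (@$chunk) {
                my ($id, $col_a, $col_b) = @$row;
                # ... per-pair analysis goes here ...
            }
        }
        $sth->finish;
        $dbh->disconnect;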
Re: an efficient program?
by ptum (Priest) on May 25, 2006 at 21:21 UTC

    Depending on your database, it might be more efficient to push as much of the grouping and comparison work as possible onto the database engine. I can't tell exactly what could be done, because your question is pretty vague about the nature of the 'analysis' you are performing.

    If the number of groupings is fairly small, you might consider building result-set hashes or arrays for each grouping, so that (for a particular entry with a particular grouping code) you visit only the result sets that do NOT correspond to that entry's grouping code, as you specified.

    Instinctively, I flinch when I see an O(n^2) algorithm; it seems like there might be a better way to skin this cat. But again, I can't give you concrete advice at this level of abstraction, except to say that if you find yourself executing the same database query over and over again, you can often save quite a bit of time by building a lookup hash once with a slightly more complex query, and then referring to that hash instead of going back to the database each time.

    Of course, you'll want to prepare statement handles with placeholders for any repetitive query.
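
    A rough sketch of the lookup-hash idea (untested; it assumes the my_table layout Angharad describes further down the thread, with columns id, someotherdata, and groupcode): one query buckets every row by group code, after which no further database round-trips are needed:

        # Assumes $dbh is an open DBI handle with RaiseError set.
        my %by_group;
        my $sth = $dbh->prepare(
            'SELECT id, someotherdata, groupcode FROM my_table');
        $sth->execute;
        while (my @row = $sth->fetchrow_array) {
            my ($id, $data, $code) = @row;
            push @{ $by_group{$code} }, [ $id, $data ];
        }

        # Every entry in group $code only needs to be compared
        # against the buckets for all *other* group codes.
        for my $code (sort keys %by_group) {
            for my $other (sort keys %by_group) {
                next if $other eq $code;
                for my $entry (@{ $by_group{$code} }) {
                    for my $candidate (@{ $by_group{$other} }) {
                        # ... comparison/analysis goes here ...
                    }
                }
            }
        }

    The comparisons are still O(n^2) in the worst case, but the database is queried exactly once.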


    No good deed goes unpunished. -- (attributed to) Oscar Wilde
Re: an efficient program?
by MonkE (Hermit) on May 25, 2006 at 21:34 UTC
    Despite the large amount of information you provided, it is still not clear how your data is laid out or what type of analysis you intend to do. For some tips on how to present your problem see [How (Not) To Ask A Question].

    Depending on the type of analysis you need, you should strongly consider using the capabilities of the underlying MySQL database and minimize the amount of client-side code that does database processing: do as much work as possible on the database side (for the sake of speed and bandwidth). Once you have your results, you can create a pretty report with Perl.
Re: an efficient program?
by SamCG (Hermit) on May 25, 2006 at 21:16 UTC
    I don't think I have a clear idea of your database structure. Is this all in one table? If so, can you give a slightly better idea what this table looks like?

    Usually I try to put as much of the logic in my query as possible. Can you conceive of a way to do most of this work in your SQL?



    -----------------
    s''limp';@p=split '!','n!h!p!';s,m,s,;$s=y;$c=slice @p1;so brutally;d;$n=reverse;$c=$s**$#p;print(''.$c^chop($n))while($c/=$#p)>=1;
Re: an efficient program?
by dsheroh (Monsignor) on May 25, 2006 at 21:46 UTC
    You briefly mentioned that every entry has a unique ID, then went straight into wanting to compare each entry with all entries outside its grouping, without explicitly stating why the unique ID is relevant. If you meant that you're looking for cases in which the (supposedly) unique ID is duplicated in another group, then this becomes fairly simple: either in an SQL query or a Perl hash, look for IDs that appear more than once, then check each such ID to see whether it appears in multiple groupings or just multiple times within a single grouping.
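
    If that reading is right, the duplicate check is one GROUP BY away. A sketch (untested; assumes an open DBI handle $dbh and the my_table/groupcode names that appear later in the thread):

        my $dups = $dbh->selectall_arrayref(
            'SELECT id, COUNT(*), COUNT(DISTINCT groupcode)
               FROM my_table
              GROUP BY id
             HAVING COUNT(*) > 1'
        );
        for my $row (@$dups) {
            my ($id, $times, $groups) = @$row;
            print $groups > 1
                ? "$id occurs $times times across $groups groupings\n"
                : "$id occurs $times times within a single grouping\n";
        }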

    But, then, I'm making assumptions here about what I think you may have implied based on which statements are in proximity to each other. You may very well be trying to do something completely different...

Re: an efficient program?
by badaiaqrandista (Pilgrim) on May 26, 2006 at 00:30 UTC

    I don't think you want to load all the data into memory at once and then do the analysis. I would analyse the data one row at a time, build temporary results, and then work on those results. In my experience, this is a better way of working with large data sets.
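
    One way to read this advice in DBI terms (a sketch, untested; it assumes an open handle $dbh, the table layout shown in a later reply, and that the eventual analysis can work from per-group summaries rather than raw row pairs):

        my $sth = $dbh->prepare(
            'SELECT id, someotherdata, groupcode FROM my_table');
        $sth->execute;

        my %summary;    # temporary per-group results
        while (my $row = $sth->fetchrow_hashref) {
            my $s = $summary{ $row->{groupcode} } ||= { count => 0 };
            $s->{count}++;
            # ... accumulate whatever per-group statistics the
            #     later analysis needs ...
        }

        # Second stage: work on the (much smaller) summaries.
        for my $code (sort keys %summary) {
            printf "%s: %d entries\n", $code, $summary{$code}{count};
        }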

    -cheepy-
Re: an efficient program?
by Angharad (Pilgrim) on May 25, 2006 at 23:05 UTC
    Hi again
    Apologies for not making my problem clear. I shall try again.
    The database table looks something like this
    id   someotherdata   groupcode
    --   -------------   ---------
    1    blah            wek134
    2    foo             htw456
    3    baa             twe546
    4    soc             wek134
    In a nutshell, I want to create a list that, for id 1, contains all other entries without the groupcode wek134; then, for id 2, all other entries without the groupcode htw456; and so on.
    The actual analysis comes later. This is just to get the data in the correct format.
    Thanks a lot
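
    One observation that may help here: every id in a given group gets exactly the same "everything outside my group" list, so only one list per distinct groupcode is needed, not one per id. A sketch (untested; assumes an open DBI handle $dbh):

        my $sth = $dbh->prepare(
            'SELECT g.groupcode, t.id, t.someotherdata
               FROM (SELECT DISTINCT groupcode FROM my_table) AS g
               JOIN my_table AS t ON t.groupcode <> g.groupcode
              ORDER BY g.groupcode, t.id'
        );
        $sth->execute;
        while (my ($code, $id, $data) = $sth->fetchrow_array) {
            # Rows for group $code form the comparison list shared
            # by every entry whose groupcode is $code.
            print "group $code: candidate $id ($data)\n";
        }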
      Ouch. Just... ouch.

      If that's really what you want/need to do, then Errto's pretty well nailed it - you're describing a slightly-filtered Cartesian join. You'll get the least-bad performance by doing it in the database with SQL similar to what he suggested, but I'd advise you to first take a hard look at whether there is any other way to accomplish your actual goal, because, if you have a lot of groupings and you combine each record with all records in every other grouping, then you're looking at having to process billions of combinations.

      Angharad:

      A Cartesian product like this may generate a huge (read: unreadable and unusable) report if you have many rows. If you don't have many rows, then it's not a tough problem anyway ;^).

      Perhaps an alternative report structure might be nicer. Instead of the id-based Cartesian product, you might take a Cartesian product of the group codes and display it as a table. Then, for each cell of that table, you can have a separate Cartesian product as another page in your report. You could glue it together with some Perl, something like (untested, incomplete):

      #####
      # Header page
      #####
      # ($DB is assumed to be an open DBI handle.)
      print "Group vs. Group summary\n"
          . "-----------------------\n\n"
          . "Grp  Grp  Stat1  Stat2  ...\n"
          . "---- ---- ------ ------ ------\n";

      # SQL query to generate the cross-tab for group pairs:
      my $S = $DB->prepare(
          "SELECT a.groupcode, b.groupcode /* , computed stats */ "
         ."FROM my_table a, my_table b "
         ."WHERE a.groupcode != b.groupcode "
         ."GROUP BY a.groupcode, b.groupcode "
         ."ORDER BY a.groupcode, b.groupcode"
      ) or die $DB->errstr;
      $S->execute or die $S->errstr;

      my @grp_pairs;
      while (my $ar = $S->fetchrow_arrayref) {
          # print the summary row for this pair here...
          # Store the pair of group codes. (I'm assuming the stats
          # are symmetric, so push each pair only once -- otherwise
          # we'd get two detail pages per pair.)
          push @grp_pairs, [ $ar->[0], $ar->[1] ] if $ar->[0] lt $ar->[1];
      }

      #####
      # Detail view pages
      #####
      $S = $DB->prepare(
          "SELECT a.id, a.someotherdata, b.id, b.someotherdata "
         ."FROM my_table a, my_table b "
         ."WHERE a.groupcode = ? AND b.groupcode = ? "
         ."ORDER BY a.id, b.id"
      ) or die $DB->errstr;

      my $FormFeed = "\f";
      while (my $pair = pop @grp_pairs) {
          my ($grp1, $grp2) = @$pair;
          print $FormFeed
              . "DETAILS for $grp1 vs. $grp2\n\n"
              . "A.ID A.Col B.ID B.Col Stat1 Stat2 ...\n"
              . "---- ----- ---- ----- ----- ----- ------\n";

          # Generate the group vs. group details:
          $S->execute($grp1, $grp2) or die $S->errstr;
          while (my $ar = $S->fetchrow_arrayref) {
              # print the detail row here...
          }
      }
      --roboticus
Re: an efficient program?
by Angharad (Pilgrim) on May 26, 2006 at 08:32 UTC
    Thanks a lot. I really appreciate everyone's input. I think I need another solution to my problem :)

      I hate doing this but...

      If you can, and I realise that it may not be possible without having to explain the entire background of your problem, describe how you are going to use the results of the query, and what the purpose is of that use. You may find that someone here has a different perspective on solving that problem that you haven't thought of.

      There is something about what you are describing that reminds me of a problem I had to tackle a few years ago, but it is rather complex to describe without going into reams of background--which would be a waste of time if I'm misinterpreting your goals.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
