Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

I'm writing some Perl code to do some statistical analysis on two data files which contain records that may or may not be related. I have some code that creates a scoring system to rank matches bewtween records - all records from file a are matched with those records from file b that are possible matches. Each possible match is assigned a score. I have approx. 10 million possible combinations.

My data is stored like this:

$rMatch->{$File1RecordID}->{$File2RecordID} = $ranking;

My first step in matching the data is to find all those records in file 2 that are a top ranked match with only one record in file 1. I consider those "strong matches" and want to eliminate them from the pool of records before matching the remaining data.

The second step is to create combinations of probable matches all remaining records from file 1 and all records from file 2. The lowest overall "score" for that match will be the recordset alignment that I go with for the remaining records.

My problem? The process of identifying the "strong matches" is too CPU intensive, running at 100% usage for a long time. My CPU is actually overheating and my machine is shutting down. I can mitigate that with a sleep(1) statement, but that's not very elegant...it slows this massive task down...and I suspect the problem is the number of times that I sort my hash in the program to identify strong matches. In other words, inefficiency in my code ;) Is there a better way to do this? This sort is the only way I know to find the lowest value in my hash.

In the code below, id1 is my id from the first file, id2 is the id from the second file, and $rC holds the ranking for the combination of id1 and id2. Basically, I want to check all other combinations of records for the existence of id2 as a top rank. If I do not find any other combinations, I'm reasonably confident that those 2 records should be matched.

sub IsStrongMatch { # Return true if id2 is only top ranked match for id1 my $id1 = shift; my $id2 = shift; my $rC = shift; for my $i1 ( keys %{$rC} ) { next if $i1 == $id1; foreach my $i2 ( sort { $rC->{$i1}->{$a} <=> $rC->{$i1}->{$b} +} keys %{$rC->{$i1}} ) { if ( $id2 == $i2 ) { return 0; } last; } } return 1; }

2006-08-31 Retitled by planetscape, as per Monastery guidelines: one-word (or module-only) titles hinder site navigation

( keep:0 edit:14 reap:0 )

Original title: 'Efficiency'


In reply to How can I improve the efficiency of this very intensive code? by clearcache

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others admiring the Monastery: (9)
As of 2024-03-28 09:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found