in reply to Code efficiency / algorithm

dws hit the fastest solution on the head. Given your IDs are discrete rather than continuous, if the number and size of your ranges are within the bounds of sanity, building a hash lookup for every ID in the cumulative ranges is almost trivial to code and doesn't take long to build.

Of course, this makes the lookup very fast (O(n) I think).

Code

#! perl -slw
use strict;
use Inline::Files;
use Data::Dumper;

my %dates;
while (<DATAFILE1>) {
    chomp;
    my $name = $_;
    <DATAFILE1>;
    while (my $line = <DATAFILE1>) {
        last if $line =~ /^\s*$/;
        my ($start, $end) = $line =~ /^\d+\s+(\d+)\s+(\d+)/;
        my $data = "$name\t$line";
        push @{ $dates{$_} }, \$data for $start .. $end;
    }
}

#! print scalar keys %dates; <>;
#! Dump the %dates hash. (Data::Dumper blows the stack on a structure this large:()
#! print "$_:\n @{[ map{ $$_ } @{$dates{$_}} ]}" for keys %dates;

<DATAFILE2>;    #! Skip the header line
while (<DATAFILE2>) {
    my ($ID, $text) = /(\d+),(.*$)/;
    print "$ID '$text' matches with";
    print qq[ @{[ map{ "\t" . $$_ } @{$dates{$ID}} ]} ];
}
__END__
__DATAFILE1__
ABC Corp.

1 200210014 200210105 some text description
2 200211011 200212053 some text description
3 200323021 200331234 some text description

XYZ Ltd.

1 200210014 200210105 some text description
2 200211011 200212053 some text description

CDC Inc.

1 200110014 200110325 some text description
2 200534011 200577234 some text description
3 200212344 200232399 some text description
4 199989987 199999991 some text description

__DATAFILE2__
ID,Address,MoreData
200110100,some text here,etc
200918943,some text here,etc
200211015,some text here,etc
199212395,some text here,etc
200110100,some text here,etc
200210100,some text here,etc

Output

C:\test>226719-2
200110100 'some text here,etc' matches with
        CDC Inc.        1 200110014 200110325 some text description
200918943 'some text here,etc' matches with

200211015 'some text here,etc' matches with
        ABC Corp.       2 200211011 200212053 some text description
        XYZ Ltd.        2 200211011 200212053 some text description
199212395 'some text here,etc' matches with

200110100 'some text here,etc' matches with
        CDC Inc.        1 200110014 200110325 some text description
200210100 'some text here,etc' matches with
        ABC Corp.       1 200210014 200210105 some text description
        XYZ Ltd.        1 200210014 200210105 some text description

C:\test>

Whether this is a good solution for your application will depend on the range of the IDs you have. Like dws, I had been labouring under the assumption that these were dates covering--in your sample dataset--6 years from 1999 to 2005, for a range size of just over 2000 discrete values. However, on closer inspection I realised that your IDs have an extra digit (what is that?), and the size of the range turns out to be close on 90,000.

Using your sample data, this is manageable, but if your full dataset extends this in a big way, you might have to think about the interval tree Abigail-II talked about.

I did a little research on interval trees; they appear to be an extension of binary trees by way of Red-Black trees. They don't look too hard to implement, and I might have a go at it if I can find a better description/sample code than I have yet seen. Essentially, I have seen the intervals problem come up here a few times now and have never seen a good solution to it, and I am always more interested in the how of solving a problem than actually solving it.
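For the curious, here is a minimal sketch of the idea: an augmented BST keyed on each interval's low endpoint, where every node also carries the maximum high endpoint in its subtree, which lets a stabbing query skip whole subtrees. This is just a prototype knocked up from the descriptions above (a real interval tree would add Red-Black balancing), and the sample intervals are lifted from the dataset in the code above.

```perl
#!/usr/bin/perl
# Minimal augmented-BST interval tree sketch -- not production code.
# Each node holds one [low, high] interval plus the maximum high
# endpoint found anywhere in its subtree.
use strict;
use warnings;

sub insert {
    my ($node, $low, $high, $data) = @_;
    return { low => $low, high => $high, max => $high,
             data => $data, left => undef, right => undef }
        unless defined $node;
    if ($low < $node->{low}) {
        $node->{left}  = insert($node->{left},  $low, $high, $data);
    } else {
        $node->{right} = insert($node->{right}, $low, $high, $data);
    }
    $node->{max} = $high if $high > $node->{max};
    return $node;
}

# Return the data of every stored interval that contains $point.
sub stab {
    my ($node, $point) = @_;
    # Prune: nothing in this subtree can reach $point.
    return () unless defined $node and $point <= $node->{max};
    my @hits = stab($node->{left}, $point);
    push @hits, $node->{data}
        if $node->{low} <= $point and $point <= $node->{high};
    # Right subtree only holds intervals starting at or after this low.
    push @hits, stab($node->{right}, $point)
        if $point >= $node->{low};
    return @hits;
}

my $tree;
$tree = insert($tree, 200210014, 200210105, 'ABC Corp. 1');
$tree = insert($tree, 200211011, 200212053, 'ABC Corp. 2');
$tree = insert($tree, 200110014, 200110325, 'CDC Inc. 1');

print join(' ', stab($tree, 200110100)), "\n";   # CDC Inc. 1
print join(' ', stab($tree, 200210100)), "\n";   # ABC Corp. 1
```

Insertion and stabbing are O(log n + k) on a balanced tree (k being the number of hits), and the tree costs one node per interval rather than one hash entry per covered ID, which is what makes it attractive when the ranges get wide.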

The great thing about Perl from my perspective is that for most problems and algorithms, Perl's immediacy makes prototyping and comparing different solutions so easy that I just can't resist 'having a go' :^)

The only other language that made coding so immediately rewarding that I have encountered goes way back to DEC Basic* Plus II.

(*Before too many people groan "Ug! BASIC!". Basic Plus was an exceptional version, and LW cites it as one of the influences that went into the original design of Perl)


Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Replies are listed 'Best First'.
Re: Re: Code efficiency / algorithm - O(1)
by dws (Chancellor) on Jan 15, 2003 at 05:57 UTC
    Given your IDs are discrete rather than continuous, if the number and size of your ranges are within the bounds of sanity, building a hash lookup for every ID in the cumulative ranges is almost trivial to code and doesn't take long to build.

    That'll give you O(log2 n) queries, but there's a simple way to get to O(1). I had it mostly coded up when you posted. The code is nearly identical to yours (sick minds think alike?), so I'll provide the delta.

    When reading the first datafile and preparing the data structure, instead of using a hash, use an array.

    for ( date2offset($start) .. date2offset($end) ) {
        push @{ $effective[$_] }, \$data;
    }
    The probe then becomes
    my $results = $effective[ date2offset($date) ];    # O(1)
    if ( defined($results) ) {
        # @$results is an array of all matching $data
    }
    I'll leave date2offset() as an exercise, other than noting that there's a non-obvious trick you can pull with it. Instead of a 1-to-1 mapping, you can do something not-quite right, like
    sub date2offset {
        my ($yyyymmdd) = @_;
        my $year  = int($yyyymmdd / 10000) - 1970;
        my $mmdd  = $yyyymmdd % 10000;
        my $month = int($mmdd / 100) - 1;
        my $day   = $mmdd % 100 - 1;
        return $year * (12 * 31) + $month * 31 + $day;
    }
    This is a classic space-for-time tradeoff, simplifying the calculation by assuming all months are 31 days long, resulting in 6-7 bogus offsets per year. Note that it has no effect on probe time.
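    Putting the delta together, a self-contained run looks roughly like this. It assumes the IDs really are yyyymmdd dates (the thread notes the sample IDs carry an extra digit, so treat the sample interval below as illustrative only):

```perl
#!/usr/bin/perl
# Sketch of dws's array-of-offsets approach, assuming yyyymmdd IDs.
# All months are treated as 31 days long: a few wasted slots per
# year in exchange for a trivially simple offset calculation.
use strict;
use warnings;

sub date2offset {
    my ($yyyymmdd) = @_;
    my $year  = int($yyyymmdd / 10000) - 1970;
    my $mmdd  = $yyyymmdd % 10000;
    my $month = int($mmdd / 100) - 1;
    my $day   = $mmdd % 100 - 1;
    return $year * (12 * 31) + $month * 31 + $day;
}

my @effective;

# Build: file one interval's data under every offset it covers.
my ($start, $end, $data) = (20021001, 20021010, "ABC Corp.\t1");
for ( date2offset($start) .. date2offset($end) ) {
    push @{ $effective[$_] }, \$data;
}

# Probe: a single array index per query -- O(1).
my $results = $effective[ date2offset(20021005) ];
if ( defined $results ) {
    print map { $$_ . "\n" } @$results;
}
```

    The bogus 31-day months never hurt correctness because both the build loop and the probe go through the same date2offset() mapping; the unused offsets simply stay undef.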

      Re: O(log2) and O(1). 8^?.

      That's a confused smiley. I resolve to go away and find a good reference on Big-O notation. I seem to recall not having a problem with it when I learnt it 25 or so years ago, but I definitely need a refresher course now.

      Did I miss something? How do you derive an 8-digit date from a 9-digit number?

      I like the bogus days trick though.

      Re: (sick minds think alike?)

      How dare you imply that I think?

      Update: Now why do you suppose someone felt the need to downvote this node?

      • Because I didn't understand big-O?
      • Because I admitted I didn't understand big-O?
      • Because I resolved to re-learn big-O?
      • Because I questioned dws?

        (I know it wasn't dws, cos we had further discussion on the subject.)

      • Because I like the "bogus days" trick?
      • Because I made a joke at my own expense?

      You can bet your bottom dollar whoever downvoted it won't have the chutzpah to explain his or her decision.

