dws hit the fastest solution on the head. Given your id's are discrete rather than continuous, if the number and size of your ranges are within the bounds of sanity, building a hash lookup for every ID in the cumulative ranges it almost trivial to code and doesn't take long to build.

Of course, this makes the lookup very fast (O(n) I think).

Code

#! perl -slw use strict; use Inline::Files; use Data::Dumper; my %dates; while (<DATAFILE1>) { chomp; my $name = $_; <DATAFILE1>; while(my $line = <DATAFILE1>) { last if $line =~ /^\s*$/; my ($start, $end) = $line =~ /^\d+\s+(\d+)\s+(\d+)/; my $data = "$name\t$line"; push @{ $dates{$_} }, \$data for $start .. $end; } } #! print scalar keys %dates; <>; #! Dump the %dates hash. (Data::Dumper blows the stack on structure th +is large:() #! print "$_:\n @{[ map{ $$_ } @{$dates{$_}} ]}" for keys %dates; <DATAFILE2>; #! Skip the header line while( <DATAFILE2> ) { my ($ID, $text) = /(\d+),(.*$)/; print "$ID '$text' matches with"; print qq[ @{[ map{ "\t" . $$_ } @{$dates{$ID}} ]} ]; } __END__ __DATAFILE1__ ABC Corp. 1 200210014 200210105 some text description 2 200211011 200212053 some text description 3 200323021 200331234 some text description XYZ Ltd. 1 200210014 200210105 some text description 2 200211011 200212053 some text description CDC Inc. 1 200110014 200110325 some text description 2 200534011 200577234 some text description 3 200212344 200232399 some text description 4 199989987 199999991 some text description __DATAFILE2__ ID,Address,MoreData 200110100,some text here,etc 200918943,some text here,etc 200211015,some text here,etc 199212395,some text here,etc 200110100,some text here,etc 200210100,some text here,etc

Output

C:\test>226719-2 200110100 'some text here,etc' matches with CDC Inc. 1 200110014 200110325 some text description 200918943 'some text here,etc' matches with 200211015 'some text here,etc' matches with ABC Corp. 2 200211011 200212053 some text description XYZ Ltd. 2 200211011 200212053 some text description 199212395 'some text here,etc' matches with 200110100 'some text here,etc' matches with CDC Inc. 1 200110014 200110325 some text description 200210100 'some text here,etc' matches with ABC Corp. 1 200210014 200210105 some text description XYZ Ltd. 1 200210014 200210105 some text description C:\test>

The criteria as to whether this is a good solution for your application will depend on the range of the ID's your have. Like dws, I had been labouring under the assumption that these were dates covering--in your sample dataset--6 years from 1999 to 2005 for a range size of just over 2000 discrete values. However, on closer inspection I realised that your ids have an extra digit (what is that?), and the size of the range turns out to be close on 90,000.

Using your sample data, this is managable, but if your full dataset extends this in a big way, you might have to think about the interval tree Abigail-II talked about.

I did a little research on interval trees, they appear to be some kind of extension to binary trees by way of Red-Black trees. They don't look to hard to implement and I might have a go at it if I can find a better description/sample code than I have yet seen. Essentially, I have seen the intervals problem come up here a few times now and had never seen a good solution to it, and I am always more interested in the how of solving a problem, than actually solving it.

The great thing about Perl from my perspective is that for most problems and algorithms, Perl's immediacy makes prototyping and comparing different solutions so easy that I just can't resist 'having a go' :^)

The only other language that made coding so immediately rewarding that I have encountered goes way back to DEC Basic* Plus II.

(*Before too many people groan "Ug! BASIC!". Basic Plus was an exceptional version, and LW cites it as one of the influences that went into the original design of Perl)


Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.


In reply to Re: Code efficiency / algorithm by BrowserUk
in thread Code efficiency / algorithm by dave8775

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.