dave8775 has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone,

I have 2 data files that I need to do some matching with and then print a results file with the matches found and some metadata info (like counters, etc).

The first data file contains company names with date ranges and the other data file contains multiple date values (one per line). I need to match each of these values to the range in the other file and then create the results file.

__datafile1__
ABC Corp.
1 200210014 200210105 some text description
2 200211011 200212053 some text description
3 200323021 200331234 some text description
XYZ Ltd.
1 200210014 200210105 some text description
2 200211011 200212053 some text description
CDC Inc.
1 200110014 200110325 some text description
2 200534011 200577234 some text description
3 200212344 200232399 some text description
4 199989987 199999991 some text description

__datafile2__
ID,Address,MoreData
200110100,some text here,etc
200918943,some text here,etc
200211015,some text here,etc
199212395,some text here,etc
200110100,some text here,etc
200210100,some text here,etc
...

datafile2 is 80+MB!

I have now been able to successfully load datafile1 into a record structure (with some help from a monk ;)) and I can get the matches to work... what I am wondering is whether there is a better way to do this (both code-wise and algorithm-wise).

The logic I used was to load the data to a record: For example...

$VAR1 = {
    'ABC Corp.' => {
        '20021' => { 'sid' => '200220014', 'eid' => '200221011' },
        '20022' => { 'sid' => '200211011', 'eid' => '200212053' },
        '20023' => { 'sid' => '200323021', 'eid' => '200331234' }
    },
    etc...
};

... then for each of the date ids in datafile2 I have to search the record to find a match. Notice that I used a hash key that concatenates the year and the range id, so I can check the year first and then look closer only if need be.

Is there a better way? I can paste some code later if you need clarification on what I did.

Thanks!

David

Replies are listed 'Best First'.
Re: Code efficiency / algorithm
by Abigail-II (Bishop) on Jan 14, 2003 at 11:05 UTC
    This is a known problem in the literature. It's called a stabbing query into line segments. The entries in datafile1 can be considered as line segments, while the entries in datafile2 are the points you query with. So, if you are looking for an efficient algorithm, you'd need to build a segment tree (or an interval tree) from the data in datafile1, and perform queries with the data in datafile2.

    IIRC, you can build a segment tree in O (n log2 n) time, while queries take O (log2 n + k) time. So, if you have N entries in datafile1 and M entries in datafile2, your total running time would be O ((N + M) log2 N + K), where K is the number of matches. It might be that it's actually a factor of O (log N) less. This surely beats the O (N * M) approach of trying to match everything with everything.

    Unfortunately, I don't think there's a module available that implements segment trees.
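
    In the absence of such a module, a rough stand-in (only a sketch, and not a segment tree: it answers all the queries in one batch rather than one at a time) is to sort the interval endpoints together with the query points and sweep through them once, keeping the set of currently open intervals. That is roughly O((N + M) log (N + M) + K) overall, which is in the same spirit. The data below is a hypothetical cut-down of the sample files:

    use strict;
    use warnings;

    # Ranges from datafile1 as [sid, eid, label]; queries are datafile2 ids.
    my @ranges = (
        [ 200210014, 200210105, 'ABC Corp. 1' ],
        [ 200110014, 200110325, 'CDC Inc. 1'  ],
    );
    my @queries = ( 200110100, 200210100, 200918943 );

    # One event list: range starts, query points, range ends.  At equal
    # positions starts sort before queries and queries before ends, so the
    # inclusive endpoints are handled correctly.
    my @events;
    push @events, [ $_->[0], 0, $_ ], [ $_->[1], 2, $_ ] for @ranges;
    push @events, [ $_, 1 ] for @queries;

    my %open;    # ranges whose sid <= current position <= eid
    for my $ev ( sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @events ) {
        my ( $pos, $type, $range ) = @$ev;
        if    ( $type == 0 ) { $open{$range} = $range }    # a range opens
        elsif ( $type == 2 ) { delete $open{$range} }      # a range closes
        else {                                             # a query point
            print "$pos matches $_->[2]\n" for values %open;
        }
    }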

    Abigail

      Hope you don't mind me asking this, but could you give an assessment of how the algorithm in Re: Code efficiency / algorithm rates, big-O wise? I never really mastered that notation.


      Examine what is said, not who speaks.

      The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

        Well, for each entry in datafile2, you consider all the elements in datafile1. So your algorithm takes at least Omega (N * M) time.

        Abigail

      IIRC, you can build a segment tree in O (n log2 n) time, while queries take O (log2 n + k) time.

      I wonder if there isn't a way to get O(1) query time by exploiting the nature of the keys and building a different data structure.

      The keys here are dates. There aren't that many distinct dates in a year. By converting all dates to an offset from some given starting date, you can build an array indexable by date offset. Each element of this array, if present, holds an array of references to records that are "effective" on that date.

        Great idea, dws. One modification, though. Since the date information is sparse, a hash of the distinct dates (or date offsets) would be better than an array.
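
        For illustration, a minimal sketch of that hash-keyed variant (my own sketch: @ranges and the record strings are placeholders, and date2offset() is assumed to be something like the sub dws sketches further down the thread):

        my %effective;
        for my $range (@ranges) {                       # [sid, eid, record] triples
            my ( $sid, $eid, $record ) = @$range;
            push @{ $effective{$_} }, $record
                for date2offset($sid) .. date2offset($eid);
        }

        # Probing is still a single hash lookup per datafile2 line, but only
        # dates that actually fall inside some range take up memory.
        while (<FILE2>) {
            my ($date) = split /,/;
            my $hits = $effective{ date2offset($date) } or next;
            print "$date matches: @$hits\n";
        }
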
        :( Does this mean my problem cannot be fixed with code? Do I need to actually implement a different algorithm? I will go check up on that error and see what I can do to mitigate it. If I can't, can you offer any tips on how to convert to the tree algorithm? Is it an easy conversion/coding job, or am I looking at something pretty sophisticated here?

        Thanks for the help!!!

        David =8-)
        Of course your problem can be fixed with code. It's not that you are asking for something that's not solvable on a Turing machine. As for whether you need to implement a different algorithm, that's up to you. I described an efficient algorithm. If you find an implementation of a naive algorithm (matching everything from one file with everything from the other) fast enough, then stick with it. But I got the impression you had some very large datasets. And very large times very large is big.

        As for the algorithm being sophisticated, that depends on what you find sophisticated. I don't find the algorithm itself sophisticated, but then, I did research in this field for a few years. And I wouldn't do the implementation in an evening either. But I'm a slow coder.

        Abigail

        It looks like Set::Infinite and Array::IntSpan could be helpful starting points.

        If all the companies had mutually-exclusive ranges, then you could simply place them in an Array::IntSpan and look them up. But since your ranges intersect, you will need to intersect them with each other (using the operations of Set::Infinite like intersects and contains). That's basically what a tree algorithm would do.
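
        For the mutually-exclusive case, a minimal sketch (assuming, from memory, Array::IntSpan's set_range/lookup interface; the labels are invented):

        use Array::IntSpan;

        # Only valid while no two ranges overlap: each id maps to at most one range.
        my $span = Array::IntSpan->new;
        $span->set_range( 200110014, 200110325, 'CDC Inc. 1'  );
        $span->set_range( 200210014, 200210105, 'ABC Corp. 1' );

        my $hit = $span->lookup(200110100);    # 'CDC Inc. 1'
        print "match: $hit\n" if defined $hit;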

Re: Code efficiency / algorithm
by dws (Chancellor) on Jan 14, 2003 at 06:01 UTC
    I have 2 data files that I need to do some matching with and then print a results file with the matches found and some metadata info (like counters, etc). ... Is there a better way?

    Have you considered using a relational database? This is the kind of data manipulation that they're good for. With data moved into tables, and an appropriate index, you could do something like

    SELECT Datafile1.Company, Datafile2.ID, Datafile1.Description,
           Datafile2.Address, Datafile2.MoreData
      FROM Datafile1, Datafile2
     WHERE Datafile2.ID >= Datafile1.SID
       AND Datafile2.ID <= Datafile1.EID
    to pull out the data you want.

      mmm... you definitely make a good point there. I would like to experiment with this proposed solution :) I guess my question is, how easy will it be for me to get the input data files loaded into a db? (which db should I use... simple .db files or mySQL?)

      Cheers,

      David

        I would recommend looking into SQLite (http://www.sqlite.org/), or more specifically, DBD::SQLite by Matt Sergeant (available on CPAN). This is a self-contained relational database in a single db file (not to be confused with DB_File - they are in no way related).

        A link to it was posted here by merlyn a few days ago, and I'm now seriously considering using it in my next project. It is much more manageable and, with the correct indexes, probably faster than rolling your own databases with something like DB_File.

        At the very least, the module may be useful for prototyping the concept of storing the info in a relational database, before moving onto a fully fledged RDBMS if you discover that it is necessary.
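
        A minimal prototyping sketch along those lines (the file name, table names and column names are all invented here; only DBI and DBD::SQLite are assumed):

        use strict;
        use warnings;
        use DBI;

        # Throwaway prototype database.
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=match.db', '', '',
                                { RaiseError => 1 } );

        $dbh->do('CREATE TABLE ranges  (company TEXT, sid INTEGER, eid INTEGER, descr TEXT)');
        $dbh->do('CREATE TABLE entries (id INTEGER, address TEXT, moredata TEXT)');

        # Bulk-load datafile2 inside one transaction -- much faster for 80+MB.
        open my $fh, '<', 'datafile2' or die "datafile2: $!";
        my $ins = $dbh->prepare('INSERT INTO entries VALUES (?, ?, ?)');
        <$fh>;    # skip the ID,Address,MoreData header line
        $dbh->begin_work;
        while (<$fh>) {
            chomp;
            $ins->execute( split /,/, $_, 3 );
        }
        $dbh->commit;

        # ... load datafile1 into "ranges" the same way, then the join is one query:
        my $rows = $dbh->selectall_arrayref(
            'SELECT r.company, e.id, e.address
               FROM ranges r, entries e
              WHERE e.id BETWEEN r.sid AND r.eid'
        );
        print "@$_\n" for @$rows;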

        Cheers,

        -- Dave :-)


        $q=[split+qr,,,q,~swmi,.$,],+s.$.Em~w^,,.,s,.,$&&$$q[pos],eg,print
        Go for mySQL. Once you have your data files normalized into flat tables, loading them into mySQL will be really simple.

        But take your time in working out the table design, in order to make it fully normalized -- e.g. although you have just two data files, it looks like you may want at least three tables: one for companies, one for the "datafile1" records that relate to those companies, and the third for the "datafile2" records. The nature of the "other text content" in your data files may suggest other relations that could benefit from being indexed/related via one or more additional tables, apart from these three.
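
        As a sketch of that three-table layout (MySQL-flavoured DDL; every name here is invented purely for illustration, and $dbh is assumed to be an already-connected DBI handle):

        # company       : one row per company name in datafile1
        # company_range : one row per numbered range line in datafile1
        # entry         : one row per datafile2 line
        my @ddl = (
            'CREATE TABLE company       (company_id INTEGER PRIMARY KEY, name VARCHAR(80))',
            'CREATE TABLE company_range (company_id INTEGER, seq INTEGER, sid INTEGER,
                                         eid INTEGER, descr VARCHAR(255), INDEX (sid, eid))',
            'CREATE TABLE entry         (id INTEGER, address VARCHAR(255),
                                         moredata VARCHAR(255), INDEX (id))',
        );
        $dbh->do($_) for @ddl;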

        Personally, unless datafile2 is static, or you can arrange to have whatever is writing that file write directly to the database, I think that using an RDBMS is likely to involve more trouble than it's worth. The code involved to normalise and insert 80MB+ into the RDBMS is likely to take longer to write and longer to run than processing the data with perl. If the datafile is static, that may be mitigable, but if you have to reprocess or keep munging large volumes of flat file in order to put it into the database, then the overhead doesn't seem worth it to me.

        Further, performing a join on tables where one of them is that size, and where every record has to be joined with one or more records from the smaller table, is going to need quite a substantial box and is not going to be quick.

        I would be of a different opinion if you could arrange to write the big-file data directly to the RDBMS, or if you needed to perform one or more additional relational processings of the big-file data on a regular basis; then the conversion would start to make sense.

        As the popular disclaimer goes, just my 2 cents.


        Examine what is said, not who speaks.

        The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Code efficiency / algorithm
by graff (Chancellor) on Jan 14, 2003 at 05:57 UTC
    The way you describe it, it looks like each record in datafile2 could (and often will) match more than one record in datafile1. Do you intend for your results file to show all the "file1" matches for a given "file2" record?

    Apart from that, it looks like the nine-digit strings that are labeled "sid" and "eid" in the file1 (%VAR1) data structure are supposed to be the cue for deciding whether a given record in "file2" is a match, based on its first field -- that is, the first line in your "datafile2" example, which starts with "200110100", ought to be a match for the first date range in all three company records from "datafile1". Have I got that right? (the post is a bit confusing, because the Data::Dumper-like output content doesn't match the sample file excerpt)

    If so, then I think my first inclination would be to make the "join" data the outer-most layer of the "file1" data structure, and make it as easy as possible to identify the matches -- something like this (based on the data in your example "file1" excerpt):

    $VAR1 = {
        '200210014 200210105' => [
            "ABC Corp. / 1 / some text description",
            "XYZ Ltd. / 1 / some text description",
            "CDC Inc. / 1 / some text description",
        ],
        '200211011 200212053' => [
            "ABC Corp. / 2 / some text description",
            "XYZ Ltd. / 2 / some text description",
        ],
        '200323021 200331234' => [
            "ABC Corp. / 3 / some text description",
        ],
        etc...
    }
    In other words, file1 fills a hash of arrays, where the hash keys are "start_id end_id" for each date range found in file1; each of these hash elements holds an array of one or more company records, where each record is potentially just a single structured string, holding whatever is relevant for your results file.

    With this sort of data structure from file1, you can now read file2 and use the first field of each line to jump directly to the relevant file1 data (untested code, naturally):

    while (<FILE2>) {
        my ($key2, $data) = split(/,/, $_, 2);

        # use grep to do the "join":
        my @match_keys = grep {
            my ($sid, $eid) = split(/ /, $_);
            $key2 >= $sid and $key2 <= $eid
        } keys %$VAR1;

        foreach my $matched_range ( @match_keys ) {
            my @matched_data = @{ $VAR1->{$matched_range} };
            # do something with @matched_data
        }
    }

      I posted some clarifications regarding what I am trying to do in my answer to the previous reply. Basically I am trying to match each number (e.g. 200210201) in datafile2 to any of the given ranges in datafile1. If the number in datafile2 is within a range in datafile1, then it is a match. I used the concatenation of the year and the range id from datafile1 to establish a unique 'rangeid' to be used as the hash key. In other words, '2 200534011 200577234 some text description' has a hash id of 20052. I then read each line of datafile2 and check through the record set to see if I find a hash key that matches (has the same first 4 digits, i.e. 2005). If so, then I look closer at the sid and eid.

      I am just trying to find out better, more efficient ways of doing this. :)

      Thanks!

      David

Re: Code efficiency / algorithm
by BrowserUk (Patriarch) on Jan 14, 2003 at 07:47 UTC

    Provided the size of your datafile1 isn't too large, you might be able to trade memory for speed and build your hash deeper, which should reduce the number of keys at each level and speed the processing by reducing the number of iterations at the inner levels.

    The structure the code below builds looks like this.

    You could probably simplify %corps somewhat at the inner levels, depending on what information you need available once you have matched a record.
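
    Purely as an illustration of the shape meant (my own sketch, not the dump referred to above; the sid/eid field names follow the earlier posts):

    # Keyed company => year => sequence, so a probe only ever has to walk the
    # handful of ranges that share the query's year.
    my %corps = (
        'CDC Inc.' => {
            2001 => { 1 => { sid => 200110014, eid => 200110325, desc => 'some text description' } },
            2005 => { 2 => { sid => 200534011, eid => 200577234, desc => 'some text description' } },
        },
        # ... and so on for the other companies
    );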

    As pointed out above, you don't say what you want in your output file, so I've just dumped everything that matches each input record.

    The output

    C:\test>226719
    200110100,some text here,etc matches with:
            CDC Inc. 1 -some text description
    200918943,some text here,etc matches with:
    200211015,some text here,etc matches with:
            XYZ Ltd. 2 -some text description
            ABC Corp. 2 -some text description
    199212395,some text here,etc matches with:
    200110100,some text here,etc matches with:
            CDC Inc. 1 -some text description
    200210100,some text here,etc matches with:
            XYZ Ltd. 1 -some text description
            ABC Corp. 1 -some text description

    C:\test>

    The code


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Code efficiency / algorithm
by BrowserUk (Patriarch) on Jan 15, 2003 at 05:09 UTC

    dws hit the fastest solution on the head. Given your ids are discrete rather than continuous, if the number and size of your ranges are within the bounds of sanity, building a hash lookup for every ID in the cumulative ranges is almost trivial to code and doesn't take long to build.

    Of course, this makes the lookup very fast (O(n) I think).
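
    Not the code below, just a rough sketch of the per-ID idea (it assumes the datafile1 ranges have been collected as [sid, eid, label] triples):

    # Expand every range into its individual ids; a lookup is then one hash probe.
    my %lookup;
    for my $range (@ranges) {
        my ( $sid, $eid, $label ) = @$range;
        push @{ $lookup{$_} }, $label for $sid .. $eid;
    }

    while (<FILE2>) {
        chomp;
        my ($id) = split /,/;
        print "$_ matches: @{ $lookup{$id} }\n" if $lookup{$id};
    }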

    Code

    Output

    Whether this is a good solution for your application will depend on the range of the IDs you have. Like dws, I had been labouring under the assumption that these were dates covering--in your sample dataset--6 years from 1999 to 2005, for a range size of just over 2000 discrete values. However, on closer inspection I realised that your ids have an extra digit (what is that?), and the size of the range turns out to be close on 90,000.

    Using your sample data, this is manageable, but if your full dataset extends this in a big way, you might have to think about the interval tree Abigail-II talked about.

    I did a little research on interval trees; they appear to be some kind of extension to binary trees by way of Red-Black trees. They don't look too hard to implement, and I might have a go at it if I can find a better description/sample code than I have yet seen. Essentially, I have seen the intervals problem come up here a few times now and have never seen a good solution to it, and I am always more interested in the how of solving a problem than in actually solving it.

    The great thing about Perl from my perspective is that for most problems and algorithms, Perl's immediacy makes prototyping and comparing different solutions so easy that I just can't resist 'having a go' :^)

    The only other language I have encountered that made coding so immediately rewarding goes way back to DEC Basic* Plus II.

    (*Before too many people groan "Ug! BASIC!". Basic Plus was an exceptional version, and LW cites it as one of the influences that went into the original design of Perl)


    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

      Given your ids are discrete rather than continuous, if the number and size of your ranges are within the bounds of sanity, building a hash lookup for every ID in the cumulative ranges is almost trivial to code and doesn't take long to build.

      That'll give you O(log2 n) queries, but there's a simple way to get to O(1). I had it mostly coded up when you posted. The code is nearly identical to yours (sick minds think alike?), so I'll provide the delta.

      When reading the first datafile and preparing the data structure, instead of using a hash, use an array.

      for ( date2offset($start) .. date2offset($end) ) {
          push @{ $effective[$_] }, \$data;
      }
      The probe then becomes
      my $results = $effective[ date2offset($date) ];    # O(1)
      if ( defined($results) ) {
          # @$results is an array of all matching $data
      }
      I'll leave date2offset() as an exercise, other than noting that there's a non-obvious trick you can pull with it. Instead of a 1-to-1 mapping, you can do something not-quite right, like
      sub date2offset {
          my ($yyyymmdd) = @_;
          my $year  = int($yyyymmdd / 10000) - 1970;
          my $mmdd  = $yyyymmdd % 10000;
          my $month = int($mmdd / 100) - 1;
          my $day   = $mmdd % 100 - 1;
          return $year * (12 * 31) + $month * 31 + $day;
      }
      This is a classic space-for-time tradeoff, simplifying the calculation by assuming all months are 31 days long, resulting in 6-7 bogus offsets per year. Note that it has no effect on probe time.

        Re: O(log2) and O(1). 8^?.

        That's a confused smiley. I resolve to go away and find a good reference on Big-O notation. I seem to recall not having a problem with it when I learnt it 25 or so years ago, but I definitely need a refresher course now.

        Did I miss something? How do you derive an 8-digit date from a 9-digit number?

        I like the bogus days trick though.

        Re: (sick minds think alike?)

        How dare you imply that I think?

        Update: Now why do you suppose someone felt the need to downvote this node?

        • Because I didn't understand big-O?
        • Because I admitted I didn't understand big-O?
        • Because I resolved to re-learn big-O?
        • Because I questioned dws?

          (I know it wasn't dws, cos we had further discussion on the subject.)

        • Because I like the "bogus days" trick?
        • Because I made a joke at my own expense?

        You can bet your bottom dollar whoever downvoted it won't have the hutspar (sp?) to explain his or her decision.


        Examine what is said, not who speaks.

        The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Code efficiency / algorithm
by waswas-fng (Curate) on Jan 14, 2003 at 05:36 UTC
    What does your output need to be? What is considered a match?

    -Waswas

      A match is when the number (date) in datafile2 is within the range values in datafile1. So, for example, 200110100 in datafile2 is a match because it is within CDC Inc.'s 200110014 - 200110325 range. So the output file would include the '200110100,some text here,etc' line.

      I am still debating what the output should look like. One option is to break the output into 2 files--(1) a matched results file that includes the header of datafile2 and all the matched lines, and (2) a report summary file that displays some metadata... such as how many total records were processed, how many of those records belong to each company, and how many to each range. The other alternative is to display something like this:

      #ABC Corp.
      200210100,some text here,etc
      #1 record found for the 20021 range id.
      200211015,some text here,etc
      ...
      #x records found for the 20022 range id:
      #x total records found for ABC Corp.
      #CDC Inc.
      ...
      #57 total records found

      Let me know what you think.

      Thanks!

      David

Re: Code efficiency / algorithm
by dave8775 (Novice) on Jan 14, 2003 at 09:40 UTC

    I am getting this strange error:

    Use of uninitialized value at parse.pl line 48, <datafile2> chunk 14742.

    Line 48 is if ($_ eq $year) { in the code below

    It seems to work fine, and then all of a sudden I keep getting this error displayed on the screen *shrug*.

    The code that it is complaining about is:

    while (<datafile2>) {
        chomp;
        my $line = $_;
        next if /^$/;
        next if /^\D.*$/;
        my ($date, $year) = /^((\d{4})\d+)/;
        foreach my $name (sort keys %corps) {
            for my $id (sort keys %{ $corps{$name} }) {
                $_ = $id;
                s/(\d{4}).*/$1/;
                if ($_ eq $year) {    # <-- line 48
                    if (($date gt $corps{$name}{$id}{bid}) && ($date lt $corps{$name}{$id}{eid})) {
                        print "Line $. => CorpName: $name Date=$date is in RangeId: $id -> $corps{$name}{$id}{bid} - $corps{$name}{$id}{eid}";
                        print OUT "$.,$line";
                    }
                }
            }
        }
    }

    Did I reach some limit? datafile2 is 80+MB... 367,000 or so lines...

    Cheers,

    David