Re: Code efficiency / algorithm
by Abigail-II (Bishop) on Jan 14, 2003 at 11:05 UTC
This is a known problem in the literature. It's called a
stabbing query into line segments. The entries in datafile1
can be considered as line segments, while the entries in
datafile2 are the points you query with. So, if you are looking
for an efficient algorithm, you'd need to build a segment tree
(or an interval tree) from the data in datafile1, and perform
queries with the data in datafile2.
IIRC, you can build a segment tree in O (n log2 n) time, while
queries take O (log2 n + k) time. So, if you have N entries
in datafile1 and M entries in datafile2, your total running
time would be O ((N + M) log2 N + K), where K is the number
of matches. It might be that it's actually a factor of O (log N) less.
This surely beats the O (N * M) approach of trying to match
everything with everything.
Unfortunately, I don't think there's a module available that
implements segment trees.
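If you want to experiment, here is a rough, untested sketch of the centered
interval tree variant in Perl. All the names are mine, and the grep over a
node's intervals is a simplification of the sorted endpoint lists you would
need to actually get the query bound quoted above:

sub build_tree {
    my @iv = @_;                      # each interval is [ $sid, $eid, $payload ]
    return undef unless @iv;
    my @pts    = sort { $a <=> $b } map { ( $_->[0], $_->[1] ) } @iv;
    my $center = $pts[ int( @pts / 2 ) ];    # median endpoint
    my ( @left, @here, @right );
    for my $i (@iv) {
        if    ( $i->[1] < $center ) { push @left,  $i }
        elsif ( $i->[0] > $center ) { push @right, $i }
        else                        { push @here,  $i }
    }
    return { center => $center,
             here   => \@here,
             left   => build_tree(@left),
             right  => build_tree(@right) };
}

sub stab {                            # all intervals containing the point $q
    my ( $node, $q ) = @_;
    return () unless $node;
    my @hits = grep { $_->[0] <= $q and $q <= $_->[1] } @{ $node->{here} };
    push @hits, stab( $node->{left},  $q ) if $q < $node->{center};
    push @hits, stab( $node->{right}, $q ) if $q > $node->{center};
    return @hits;
}

# build once from datafile1, then stab with each id from datafile2:
# my $tree    = build_tree(@intervals);
# my @matches = stab($tree, $id);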
Abigail
Well, for each entry in datafile2, you consider all the
elements in datafile1. So your algorithm takes at least
Omega (N * M) time.
Abigail
IIRC, you can build a segment tree in O (n log2 n) time, while queries take O (log2 n + k) time.
I wonder if there isn't a way to get O(1) query time by exploiting the nature of the keys and building a different data structure.
The keys here are dates. There aren't that many distinct dates in a year. By converting all dates to an offset from some given starting date, you can build an array indexable by date offset. Each element of this array, if present, holds an array of references to records that are "effective" on that date.
Great idea, dws. One modification, though. Since the date information is sparse, a hash of the distinct dates (or date offsets) would be better than an array.
:( Does this mean my problem cannot be fixed with code? I need to actually implement a different algorithm? I will go check up on that error and see what I can do to mitigate it. If I can't, can you offer any tips on how to convert to the tree algorithm? Is it an easy conversion/coding job, or am I looking at something pretty sophisticated here? Thanks for the help!!! David =8-)
Of course your problem can be fixed with code. It's not that
you are asking for something that's not solvable on a Turing
machine. As for whether you need to implement a different
algorithm, that's up to you. I described an efficient algorithm.
If you find an implementation of a naive algorithm (matching
everything from one file with everything from the other)
fast enough, then stick with it. But I got the impression
you had some very large datasets. And very large times very
large is big.
As for the algorithm being sophisticated, that depends on what
you find sophisticated. I don't find the algorithm itself
sophisticated, but then, I did research in this field for a
few years. And I wouldn't do the implementation in an evening
either. But I'm a slow coder.
Abigail
It looks like Set::Infinite and Array::IntSpan could be helpful starting points.
If all the companies had mutually-exclusive ranges, then you could simply place them in an Array::IntSpan and look them up. But since your ranges intersect, you will need to intersect them with each other (using the operations of Set::Infinite like intersects and contains). That's basically what a tree algorithm would do.
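For the mutually-exclusive case, the Array::IntSpan approach might look something like this (untested; I'm going from the module's documented new/lookup interface as I recall it, and the range data is just the sample from elsewhere in this thread):

use Array::IntSpan;

my $span = Array::IntSpan->new(
    [ 200210014, 200210105, 'ABC Corp. / 1 / some text description' ],
    [ 200211011, 200212053, 'ABC Corp. / 2 / some text description' ],
);

my $hit = $span->lookup(200210100);      # undef if no range covers the id
print "$hit\n" if defined $hit;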
Re: Code efficiency / algorithm
by dws (Chancellor) on Jan 14, 2003 at 06:01 UTC
I have 2 data files that I need to do some matching with and then print a results file with the matches found and some metadata info (like counters, etc). ... Is there a better way?
Have you considered using a relational database? This is the kind of data manipulation that they're good for. With data moved into tables, and an appropriate index, you could do something like
SELECT
    DataFile1.Company,
    DataFile2.ID,
    DataFile1.Description,
    DataFile2.Address,
    DataFile2.MoreData
FROM
    DataFile1, DataFile2
WHERE
    DataFile2.ID >= DataFile1.SID
    AND DataFile2.ID <= DataFile1.EID
to pull out the data you want.
I would recommend looking into SQLite (http://www.sqlite.org/), or more specifically, DBD::SQLite by Matt Sergeant (available on CPAN). This is a self-contained relational database in a single db file (not to be confused with DB_File - they are in no way related).
A link to it was posted here by merlyn a few days ago, and I'm now seriously considering using it in my next project. It is much more manageable and, with the correct indexes, probably faster than rolling your own databases with something like DB_File.
At the very least, the module may be useful for prototyping the concept of storing the info in a relational database, before moving onto a fully fledged RDBMS if you discover that it is necessary.
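A minimal sketch of what such a prototype could look like with DBI and DBD::SQLite (untested; the table and column names are just placeholders, not anything from the original post):

use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=match.db', '', '',
                       { RaiseError => 1 });

my $sth = $dbh->prepare(q{
    SELECT d1.company, d2.id, d1.description
    FROM   datafile1 d1, datafile2 d2
    WHERE  d2.id BETWEEN d1.sid AND d1.eid
});
$sth->execute;
while ( my @row = $sth->fetchrow_array ) {
    print join(',', @row), "\n";
}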
Cheers, -- Dave :-)
$q=[split+qr,,,q,~swmi,.$,],+s.$.Em~w^,,.,s,.,$&&$$q[pos],eg,print
Go for MySQL. Once you have your data files normalized into
flat tables, loading them into MySQL will be really simple.
But take your time in working out the table design, in order
to make it fully normalized -- e.g. although you have just
two data files, it looks like you may want at least three
tables: one for companies, one for the "datafile1" records
that relate to those companies, and the third for the
"datafile2" records. The nature of the "other text content"
in your data files may suggest other relations that could
benefit from being indexed/related via one or more additional
tables, apart from these three.
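As a rough illustration of that three-table layout (all table and column names here are invented, and the DDL is untested):

use DBI;

my $dbh = DBI->connect('dbi:mysql:database=matchdb', 'user', 'password',
                       { RaiseError => 1 });

$dbh->do($_) for (
    'CREATE TABLE company (company_id INT PRIMARY KEY, name VARCHAR(80))',
    'CREATE TABLE ranges  (company_id INT, sid INT, eid INT, descr TEXT,
                           INDEX (sid, eid))',
    'CREATE TABLE records (id INT, other TEXT, INDEX (id))',
);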
Personally, unless datafile2 is static, or you can arrange to have whatever is writing that file write directly to the database, I think that using an RDBMS is likely to involve more trouble than it's worth. The code involved to normalise and insert 80MB+ into the RDBMS is likely to take longer to write and longer to run than processing the data with perl. If the datafile is static that may be mitigated, but if you have to reprocess or keep munging large volumes of flat file in order to put it into the database, then the overhead doesn't seem worth it to me.
Further, performing a join on tables where one of them is that size, and every record has to be joined with one or more records from the smaller table, is going to need quite a substantial box and is not going to be quick.
I would be of a different opinion if you could arrange to write the bigfile data directly to the RDBMS and you needed to perform one or more additional processings of the big-file data in some relational manner on a regular basis; then the process of conversion would start to make sense.
As the popular disclaimer goes, just my 2 cents.
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
Re: Code efficiency / algorithm
by graff (Chancellor) on Jan 14, 2003 at 05:57 UTC
The way you describe it, it looks like each record in
datafile2 could (and often will) match more than one record
in datafile1. Do you intend for your results file to show
all the "file1" matches for a given "file2" record?
Apart from that, it looks like the nine-digit strings that
are labeled "sid" and "eid" in the file1 (%VAR1) data structure are
supposed to be the cue for deciding whether a given
record in "file2" is a match, based on its first field -- that is, the first
line in your "datafile2" example, which starts with
"200110100", ought to be a match for the first date range in all
three company records from "datafile1". Have I got that right?
(the post is a bit confusing, because the Data::Dumper-like
output content doesn't match the sample file excerpt)
If so, then I think my first inclination would be to make
the "join" data the outer-most layer of the "file1" data
structure, and make it as easy as possible to identify the
matches -- something like this (based on the data in your
example "file1" excerpt):
$VAR1 =
  {
    '200210014 200210105' =>
      [
        "ABC Corp. / 1 / some text description",
        "XYZ Ltd. / 1 / some text description",
        "CDC Inc. / 1 / some text description",
      ],
    '200211011 200212053' =>
      [
        "ABC Corp. / 2 / some text description",
        "XYZ Ltd. / 2 / some text description",
      ],
    '200323021 200331234' =>
      [
        "ABC Corp. / 3 / some text description",
      ],
    etc...
  }
In other words, file1 fills a hash of arrays, where the
hash keys are "start_id end_id" for each date range found
in file1; each of these hash elements holds an array of
one or more company records, where each record is
potentially just a single structured string, holding whatever
is relevant for your results file.
With this sort of data structure from file1, you can now
read file2 and use the first field of each line to
jump directly to the relevant file1 data (untested code,
naturally):
while (<FILE2>)
{
    my ($key2, $data) = split(/,/, $_, 2);
    # use grep to do the "join":
    my @match_keys = grep { my ($sid, $eid) = split(/ /, $_);
                            $key2 >= $sid and $key2 <= $eid } keys %$VAR1;
    foreach my $matched_range ( @match_keys ) {
        my @matched_data = @{ $VAR1->{$matched_range} };
        # do something with @matched_data
    }
}
I posted some clarifications regarding what I am trying to do in my answer to the previous reply. Basically I am trying to match each number (e.g. 200210201) in datafile2 to any of the given ranges in datafile1. If the number in datafile2 is within a range in datafile1 then it is a match. I used the concatenation of the two numbers in datafile1 to establish a unique 'rangeid' to be used as the hash key. In other words, '2 200534011 200577234 some text description' has a hash id of 20052. I then read each line of datafile2 and check through the record set to see if I find a hash key that matches (has the same first four digits--i.e. 2005). If so then I look closer at the sid and eid.
I am just trying to find out better, more efficient ways of doing this. :)
Thanks!
David
Re: Code efficiency / algorithm
by BrowserUk (Patriarch) on Jan 14, 2003 at 07:47 UTC
Provided the size of your datafile1 isn't too large, you might be able to trade memory for speed and build your hash deeper, which should reduce the number of keys at each level and speed the processing by reducing the number of iterations at the inner levels.
The structure the code below builds looks like this.
You could probably simplify %corps somewhat at the inner levels, depending on what information you need available once you have matched a record.
As pointed out above, you don't say what you want in your output file, so I've just dumped everything that matches each input record.
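Roughly speaking (this is only a hypothetical illustration, not the actual dump from the collapsed code below), the extra level of nesting splits the ranges on their leading four digits first:

my %corps = (
    2001 => {
        '200110014 200110325' => { 'CDC Inc.'  => 'some text description' },
    },
    2002 => {
        '200210014 200210105' => { 'ABC Corp.' => 'some text description',
                                   'XYZ Ltd.'  => 'some text description' },
    },
);
# a record from datafile2 then only has to be compared against the ranges
# that share its first four digits, not against every range in the file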
The output
C:\test>226719
200110100,some text here,etc matches with:
CDC Inc. 1 -some text description
200918943,some text here,etc matches with:
200211015,some text here,etc matches with:
XYZ Ltd. 2 -some text description
ABC Corp. 2 -some text description
199212395,some text here,etc matches with:
200110100,some text here,etc matches with:
CDC Inc. 1 -some text description
200210100,some text here,etc matches with:
XYZ Ltd. 1 -some text description
ABC Corp. 1 -some text description
C:\test>
The code
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
Re: Code efficiency / algorithm
by BrowserUk (Patriarch) on Jan 15, 2003 at 05:09 UTC
dws hit the fastest solution on the head. Given your id's are discrete rather than continuous, if the number and size of your ranges are within the bounds of sanity, building a hash lookup for every ID in the cumulative ranges is almost trivial to code and doesn't take long to build.
Of course, this makes the lookup very fast (O(n) I think).
Code
Output
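Since the Code and Output sections above are collapsed, here is a rough sketch of the per-ID hash idea (untested, and the datafile1 layout assumed by the regex is only a guess):

my %lookup;                           # one entry per ID inside every range
while (<DATAFILE1>) {
    chomp;
    my ($company, $rangeno, $sid, $eid, $desc) =
        /^(.+?)\s+(\d+)\s+(\d{9})\s+(\d{9})\s+(.*)$/ or next;
    push @{ $lookup{$_} }, "$company $rangeno - $desc" for $sid .. $eid;
}

while (<DATAFILE2>) {
    chomp;
    my ($id) = /^(\d{9})/ or next;
    print "$_ matches: @{ $lookup{$id} }\n" if exists $lookup{$id};
}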
The criteria as to whether this is a good solution for your application will depend on the range of the IDs you have. Like dws, I had been labouring under the assumption that these were dates covering--in your sample dataset--6 years from 1999 to 2005, for a range size of just over 2000 discrete values. However, on closer inspection I realised that your ids have an extra digit (what is that?), and the size of the range turns out to be close on 90,000.
Using your sample data, this is manageable, but if your full dataset extends this in a big way, you might have to think about the interval tree Abigail-II talked about.
I did a little research on interval trees; they appear to be some kind of extension to binary trees by way of Red-Black trees. They don't look too hard to implement, and I might have a go at it if I can find a better description/sample code than I have yet seen. Essentially, I have seen the intervals problem come up here a few times now and have never seen a good solution to it, and I am always more interested in the how of solving a problem than actually solving it.
The great thing about Perl from my perspective is that for most problems and algorithms, Perl's immediacy makes prototyping and comparing different solutions so easy that I just can't resist 'having a go' :^)
The only other language that made coding so immediately rewarding that I have encountered goes way back to DEC Basic* Plus II.
(*Before too many people groan "Ug! BASIC!". Basic Plus was an exceptional version, and LW cites it as one of the influences that went into the original design of Perl)
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
Given your id's are discrete rather than continuous, if the number and size of your ranges are within the bounds of sanity, building a hash lookup for every ID in the cumulative ranges is almost trivial to code and doesn't take long to build.
That'll give you O(log2 n) queries, but there's a simple way to get to O(1). I had it mostly coded up when you posted. The code is nearly identical to yours (sick minds think alike?), so I'll provide the delta.
When reading the first datafile and preparing the data structure, instead of using a hash, use an array.
for ( date2offset($start) .. date2offset($end) ) {
    push @{$effective[$_]}, \$data;
}
The probe then becomes
my $results = $effective[date2offset($date)];   # O(1)
if ( defined($results) ) {
    # @$results is an array of all matching $data
}
I'll leave date2offset() as an exercise, other than noting that there's a non-obvious trick you can pull with it. Instead of a 1-to-1 mapping, you can do something not-quite right, like
sub date2offset {
    my ($yyyymmdd) = @_;
    my $year  = int($yyyymmdd / 10000) - 1970;
    my $mmdd  = $yyyymmdd % 10000;
    my $month = int($mmdd / 100) - 1;
    my $day   = $mmdd % 100 - 1;
    return $year * (12 * 31) + $month * 31 + $day;
}
This is a classic time-for-space tradeoff, simplifying the calculation by assuming all months are 31 days long, resulting in 6-7 bogus offsets per year. Note that it has no effect on probe time.
Re: O(log2) and O(1). 8^?.
That's a confused smiley. I resolve to go away and find a good reference on Big-O notation. I seem to recall not having a problem with it when I learnt it 25 or so years ago, but I definitely need a refresher course now.
Did I miss something? How do you derive an 8-digit date from a 9-digit number?
I like the bogus days trick though.
Re: (sick minds think alike?)
How dare you imply that I think?
Update: Now why do you suppose someone felt the need to downvote this node?
You can bet your bottom dollar whoever downvoted it won't have the chutzpah to explain his or her decision.
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
Re: Code efficiency / algorithm
by waswas-fng (Curate) on Jan 14, 2003 at 05:36 UTC
What does your output need to be? What is considered a match?
-Waswas
A match is when the number (date) in datafile2 is within the range values in datafile1. So, for example, 200110100 in datafile2 is a match because it is within CDC Inc.'s 200110014 - 200110325 range. So the output file would include the '200110100,some text here,etc' line.
I am still debating what the output should look like. One option is to break the output into 2 files--(1) a matched results file that includes the header of datafile2 and all the matched lines, and (2) a reports summary file that displays some metadata... such as how many total records were processed, how many of those records belong to each company, and how many fall in each range. The other alternative is to display something like this:
#ABC Corp.
200210100,some text here,etc
#1 record found for the 20021 range id.
200211015,some text here,etc
...
#x records found for the 20022 range id:
#x total records found for ABC Corp.
#CDC Inc.
...
#57 total records found
Let me know what you think.
Thanks!
David
Re: Code efficiency / algorithm
by dave8775 (Novice) on Jan 14, 2003 at 09:40 UTC
Use of uninitialized value at parse.pl line 48, <datafile2> chunk 14742.
Line 48 is if ($_ eq $year) { in the code below
It seems to work fine and all of a sudden I keep getting this error displayed on the screen *shrug*.
The code that it is complaining about is:
while (<datafile2>) {
    chomp;
    my $line = $_;
    next if /^$/;
    next if /^\D.*$/;
    my ($date, $year) = /^((\d{4})\d+)/;
    foreach my $name (sort keys %corps) {
        for my $id (sort keys %{ $corps{$name} }) {
            $_ = $id;
            s/(\d{4}).*/$1/;
            if ($_ eq $year) {
                if (($date gt $corps{$name}{$id}{bid}) && ($date lt $corps{$name}{$id}{eid})) {
                    print "Line $. => CorpName: $name Date=$date is in RangeId: $id -> $corps{$name}{$id}{bid} - $corps{$name}{$id}{eid}";
                    print OUT "$.,$line";
                }
            }
        }
    }
}
Did I reach some limit? datafile2 is 80+MB... 367,000 or so lines...
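For what it's worth, one likely cause (just a guess from the code above): any line whose leading field gives the /^((\d{4})\d+)/ capture fewer than five digits leaves $year undefined, and the later $_ eq $year comparison then warns. A guard right after the capture, something like this untested fragment, would skip such lines:

my ($date, $year) = /^((\d{4})\d+)/;
next unless defined $year;     # line didn't match the expected id format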
Cheers,
David