In reply to "How to process two files of over a million lines for changes"

So what do I do? How can you compare two files for changes when both are so large that you either run out of memory, or it takes so long to process them that the information in the files is out of date?

I would consider other options. Here's one that might work for you: use a separate logical database for each vendor, and use UNION queries that span the separate logical databases. (MySQL 4.0 supports UNION queries.) Such a query looks like:

SELECT stuff FROM db1.t WHERE stuff LIKE 'foo%'
UNION
SELECT stuff FROM db2.t WHERE stuff LIKE 'foo%'
UNION
SELECT stuff FROM db3.t WHERE stuff LIKE 'foo%';

In such a scheme, you would build (and possibly cache) the query at runtime from the set of currently "complete" databases. When new data for a vendor arrived, you would create a new database, bulk load the new data into it (without having to worry about deleting expired product records), and then switch that database to be current by arranging for new queries to use it. Since some queries may still be active at the time of the switch, you may need to introduce a delay before recycling (dropping) the old database for that vendor.
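To make the cycle concrete, here is a sketch of one refresh in plain SQL. The names db2_new and /data/vendor2_full.txt, and the column list, are placeholders I've assumed; all the scheme really needs is a naming convention that distinguishes the incoming database from the current one.

    CREATE DATABASE db2_new;

    CREATE TABLE db2_new.t (
      stuff VARCHAR(255),
      -- ...the rest of the schema, identical to the current db2.t...
      INDEX (stuff)
    );

    -- A straight bulk insert into an empty table: no DELETEs of expired
    -- records and no row-by-row comparison against the old data.
    LOAD DATA INFILE '/data/vendor2_full.txt' INTO TABLE db2_new.t;

    -- The application now builds its UNION queries with db2_new in
    -- place of db2. Once queries that were already running against db2
    -- have drained:
    DROP DATABASE db2;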

The beauty of this scheme is that you never compare the two files at all: the new data is a straight bulk load into an empty database, expired records simply disappear when the old database is dropped, and queries keep running against the old data right up until the switch, so nothing goes stale while the load runs.

The (small) downside is that you can't embed a static query (or set of queries) in your applications. Instead, you have to either construct new queries dynamically or read cached ones.
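For the dynamic construction, one approach (my assumption; the original scheme doesn't spell this out) is a small catalog table recording which database is current for each vendor. The application reads it, splices each name into the UNION template above, and caches the assembled query text until the next switch:

    CREATE DATABASE catalog;

    CREATE TABLE catalog.current_dbs (
      vendor  VARCHAR(32) NOT NULL PRIMARY KEY,
      db_name VARCHAR(64) NOT NULL
    );

    -- Updated as part of each switch, e.g. when db2_new replaces db2:
    REPLACE INTO catalog.current_dbs VALUES ('vendor2', 'db2_new');

    -- The application reads the current names and builds its UNION
    -- query from them:
    SELECT db_name FROM catalog.current_dbs;

With that in place, a switch is just two steps: update the catalog row, then drop the old database once in-flight queries have drained.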