richard5mith has asked for the wisdom of the Perl Monks concerning the following question:

I have a MySQL database with 8 million rows which is constantly being searched on a series of public websites. It is made up of products from 15 different companies, who provide updates throughout the day which have to be loaded into the database, replacing whatever is there already.

The number of products each company has varies, but some of them have upwards of a million. So for example, two or three times a day company A provides me with a file of 1.2 million lines in CSV format and this needs to be put into the database.

At the moment, when I get each new file, I delete all the current products belonging to that company from the database and do a LOAD DATA INFILE on the new one. It's an InnoDB table, so it's all still searchable while the new data is being loaded, but as you can imagine, it takes a long time.

Normal database optimisation techniques focus on a database that is either queried often or updated often, and sadly ours has to be both, so we're not really getting anywhere with a solution to this problem. It seems that whatever solution I come up with (such as more than one db server with replication), we'll still hit the problem of updating all 8 million rows in the table three times a day.

Anyway, enough database talk. Perl talk.

Each day's file won't vary much from the day before. There will be changes (to the price column), deletions (around 10,000 products may be deleted) and additions (maybe 10,000 new products added), so the majority of the 1.2 million lines from the day before will be the same.

Since the files are largely the same each day, what I'd like to be able to do is compare yesterday's file with today's file and work out which lines have a different price, which are new and which have been deleted. Even just working out what was added and deleted would be good enough, leaving me with around 10,000 deletes/inserts to do rather than 1.2 million.

There are of course many methods for comparing lists here on the site, and the List::Compare module provides easy access to them. However, when you're dealing with 1.2 million rows in each file, all these techniques just fall apart. I can't load both files into an array or hash at the same time to do comparisons because of memory constraints, and an attempt to use Tie::File so that the array was tied to a file on disk didn't get me anywhere either: it used a lot less memory, but took about 10 hours before eventually outputting nothing.

So what do I do? How can you compare two files for changes when both files are so large that you either run out of memory, or it takes so long to process them that the information in the file is out of date?


Re: How to process two files of over a million lines for changes
by Abigail-II (Bishop) on May 22, 2004 at 22:49 UTC
    If the lines in the files have a fixed order, it's easy - you never need more than 2 lines in memory. Assume each file consists of two columns, product name and price, and is ordered on the product name. Pseudo-algorithm:
    1. Read product name (pn.o) and price (p.o) from the old file. Read product name (pn.n) and price (p.n) from the new file.
    2. If pn.o eq pn.n, goto 5.
    3. If pn.o lt pn.n, then pn.o was deleted. If the old file is exhausted, goto 8, else read the next line of the old file into pn.o and p.o and goto 2.
    4. (pn.o gt pn.n) This means pn.n is a new product. If the new file is exhausted, goto 9, else read the next line of the new file into pn.n and p.n and goto 2.
    5. If p.o != p.n, the price was modified. Else there was no change in the product.
    6. If the old file is exhausted, all unread entries in the new file are new products; read them, adjust your database, and end the program. Else read the next line of the old file into pn.o and p.o.
    7. If the new file is exhausted, goto 9, else read the next line of the new file into pn.n and p.n and goto 2.
    8. pn.n is a new product, and so are all other unread entries in the new file. Read them, adjust your database, and end the program.
    9. pn.o is a deleted product, and all other unread entries in the old file were deleted as well. Read them, adjust your database, and end the program.
    Now, if the entries aren't sorted, you may be able to sort them using the sort program - it shouldn't have any difficulties sorting a few million lines.
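    A minimal Perl sketch of this walk, assuming both files are already sorted on the product name and hold simple two-column CSV lines of the form name,price (the file names, the column layout and numeric prices are assumptions, not part of the original question):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Lock-step walk over two files sorted on product name.
    # Assumed layout: "name,price" per line.
    open my $old_fh, '<', 'yesterday.csv' or die "yesterday.csv: $!";
    open my $new_fh, '<', 'today.csv'     or die "today.csv: $!";

    my ($o, $n) = (read_row($old_fh), read_row($new_fh));

    while (defined $o or defined $n) {
        if (!defined $n or (defined $o and $o->[0] lt $n->[0])) {
            print "DELETE $o->[0]\n";                 # only in the old file
            $o = read_row($old_fh);
        }
        elsif (!defined $o or $o->[0] gt $n->[0]) {
            print "INSERT $n->[0],$n->[1]\n";         # only in the new file
            $n = read_row($new_fh);
        }
        else {                                        # same product in both
            print "UPDATE $n->[0],$n->[1]\n" if $o->[1] != $n->[1];
            $o = read_row($old_fh);
            $n = read_row($new_fh);
        }
    }

    sub read_row {
        my $fh   = shift;
        my $line = <$fh>;
        return undef unless defined $line;
        chomp $line;
        return [ split /,/, $line, 2 ];               # [ name, price ]
    }

    The printed DELETE/INSERT/UPDATE lines could just as easily become the corresponding SQL statements, so only the roughly 10,000 changed rows ever touch the database.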

    Abigail

      I second Abigail's suggestion. To reiterate -- first sort, then do the comparison.

      A while back I was twiddling around during some "downtime" and I cooked up the following little Perl exercise: remove duplicate email addresses from a file where each line was an address. So, I applied the usual, build an array, sort, take out the unique addresses, write them to a new file. It took just under 2 minutes on my fast Windows box at work, and about 3.5 minutes on my relatively pokey iBook at home. Both machines had half a GB of RAM.

      The file was 300+ MB in size and had about 145 million rows in it.

      ;-)

        I applied the usual, build an array, sort, take out the unique addresses, write them to a new file.
        For me, "the usual" would be:
        $ sort -u file > file.$$ && mv file.$$ file

        Would even work without much RAM.

        Abigail

Re: How to process two files of over a million lines for changes
by jepri (Parson) on May 22, 2004 at 22:43 UTC
    Use the Unix program "diff". It's designed for this sort of thing.

    It does depend on the files being in the same order each time, so you might have to sort them first.

    Fortunately there is a Unix command called "sort" which can probably do that for you.

    ____________________
    Jeremy
    I didn't believe in evil until I dated it.

Re: How to process two files of over a million lines for changes
by QM (Parson) on May 23, 2004 at 00:18 UTC
    BTW, Abigail-II's excellent outline appears to be a modified merge-sort.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: How to process two files of over a million lines for changes
by samtregar (Abbot) on May 23, 2004 at 20:29 UTC
    You say you can't load both files into memory at the same time, but I wonder if you can load something important about every record from both files into memory at the same time.

    How about an array of MD5 signatures for each line? If you've got two files with 1.2 million rows each and an MD5 signature is 16 bytes long then you should be able to index them both in around 400MB of memory. If Perl adds too much overhead to the arrays (and it might) then use Tie::IntegerArray or just program it in C (heresy!).

    If the rows have a primary key you might be able to use Bit::Vector to set up bitmaps to test for insertions and deletions. That would likely use less memory than an array of 16-byte MD5s, depending on how sparse your key-space is.
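    A sketch in the same spirit, assuming the first CSV column is a unique product id and keeping one 16-byte binary MD5 per line in a hash keyed on that id (the file names are made up, and the hash trades some of the memory savings of a flat array for simpler bookkeeping):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Remember only a 16-byte digest of each of yesterday's lines,
    # keyed on the (assumed unique) first CSV column.
    my %old;
    open my $y, '<', 'yesterday.csv' or die "yesterday.csv: $!";
    while (my $line = <$y>) {
        my ($id) = split /,/, $line, 2;
        $old{$id} = md5($line);
    }
    close $y;

    open my $t, '<', 'today.csv' or die "today.csv: $!";
    while (my $line = <$t>) {
        my ($id) = split /,/, $line, 2;
        if (!exists $old{$id}) {
            print "INSERT $id\n";             # brand new product
        }
        elsif ($old{$id} ne md5($line)) {
            print "UPDATE $id\n";             # line changed (e.g. the price)
        }
        delete $old{$id};                     # seen, so not a delete
    }
    close $t;

    print "DELETE $_\n" for keys %old;        # only in yesterday's file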

    -sam

Re: How to process two files of over a million lines for changes
by dws (Chancellor) on May 24, 2004 at 04:54 UTC

    So what do I do? How can you compare two files for changes when both files are so large that you either run out of memory, or it takes so long to process them that the information in the file is out of date?

    I would consider other options. Here's one that might work for you: use separate logical databases for each vendor, and use UNION queries that span the separate logical databases (MySQL 4.0 supports UNION queries). This looks like:

    SELECT stuff FROM db1.t WHERE stuff LIKE 'foo%'
    UNION
    SELECT stuff FROM db2.t WHERE stuff LIKE 'foo%'
    UNION
    SELECT stuff FROM db3.t WHERE stuff LIKE 'foo%'

    In such a scheme, you would build (and possibly cache) the query at runtime based on the current "complete" databases. When new data for a vendor arrived, you would create a new database, bulk load the new data into it (without having to worry about deleting expired product records), then switch that database to be the current one and arrange for new queries to use it. Since there may be queries active at the time of the switch, you may need to introduce some delay before recycling (deleting) the old database for that vendor.

    The beauty of this scheme is that

    • it saves you from having to compute deltas,
    • it allows you to bulk load data without having to worry about transactions (bulk loading can be a lot faster), and
    • you don't incur transaction overhead when you're inserting while querying.

    The (small) downside is that you can't embed a static query (or set of queries) in your applications. Instead, you have to either construct new queries dynamically, or read the cached ones.
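    A hedged Perl/DBI sketch of what constructing such a query dynamically might look like (the database names, the products table, the columns and the connection details are all illustrative assumptions):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Hypothetical list of each vendor's currently "live" database,
    # looked up from wherever the switch-over is recorded.
    my @current_dbs = qw(vendor_a_current vendor_b_current vendor_c_current);

    my $dbh = DBI->connect('dbi:mysql:host=localhost', 'user', 'password',
                           { RaiseError => 1 });

    # One SELECT per vendor database, glued together with UNION.
    my $sql = join "\nUNION\n",
        map { "SELECT name, price FROM $_.products WHERE name LIKE ?" }
        @current_dbs;

    my $sth = $dbh->prepare($sql);
    $sth->execute(('foo%') x @current_dbs);   # one placeholder per SELECT

    while (my ($name, $price) = $sth->fetchrow_array) {
        print "$name\t$price\n";
    }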

Re: How to process two files of over a million lines for changes
by davidj (Priest) on May 24, 2004 at 13:46 UTC
    Unix has a nice utility called "comm" that is perfectly suited for your needs.
    It takes 2 sorted input files and returns the output in 3 columns:

    column 1: lines unique to input file 1
    column 2: lines unique to input file 2
    column 3: lines common to both files

    You can read the manpage for it, but this is how you would use it:

    comm -3 yesterday.file today.file > difs.txt

    The -3 switch turns off the output of column 3 (which you don't need).
    The file difs.txt now contains 2 columns of data: the first column holds the records unique to yesterday, and the second column holds the records unique to today. This file can easily be parsed to separate the 2 columns, because comm indents the lines unique to the second file with a single tab (see the sketch below).

    The records in column 1 are the ones that need to be deleted from the database.
    The records in column 2 are the ones that need to be added to the database.
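    For instance, a minimal Perl sketch of that split (difs.txt, deletes.txt and inserts.txt are just illustrative file names):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # comm indents lines unique to the second file (today) with one tab,
    # so the leading tab tells us which side each record came from.
    open my $difs, '<', 'difs.txt'    or die "difs.txt: $!";
    open my $del,  '>', 'deletes.txt' or die "deletes.txt: $!";
    open my $ins,  '>', 'inserts.txt' or die "inserts.txt: $!";

    while (my $line = <$difs>) {
        if ($line =~ s/^\t//) {
            print $ins $line;    # unique to today.file     -> insert
        }
        else {
            print $del $line;    # unique to yesterday.file -> delete
        }
    }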

    hope this helps,
    davidj

Re: How to process two files of over a million lines for changes
by jkeenan1 (Deacon) on Jun 04, 2004 at 14:27 UTC
    Re: richard5mith on List::Compare and large data sets

    Probably true. In my paying job I'm only dealing with data sets on the order of 10**3 or 10**4. I simply don't have sets on the order of 10**6 or larger to test against. I would welcome a suggestion for a publicly available data set of that magnitude against which I could test List::Compare and try to refine it.

    Jim Keenan (author of List::Compare)

Re: How to process two files of over a million lines for changes
by ggg (Scribe) on May 23, 2004 at 20:55 UTC
    I know nothing of databases and little about Perl, but would it work out to have two files, one live and one for updating? When an update is done, either swap names with the live file or toggle some reference pointer to which one is 'live' and which is for 'updating'.