I have a MySQL database with 8 million rows that is constantly being searched from a series of public websites. It holds products from 15 different companies, each of which provides updates throughout the day; these have to be loaded into the database, replacing whatever is there already.

The number of products each company has varies, but some of them have upwards of a million. So, for example, two or three times a day company A provides me with a CSV file of 1.2 million lines, and this needs to go into the database.

At the moment, when I get each new file, I delete all the current products belonging to that company from the database and do a LOAD DATA INFILE on the new one. It's an InnoDB table, so it's all still searchable while the new data is being loaded, but as you can imagine, it takes a long time.
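For reference, the load step currently looks roughly like this simplified DBI sketch; the table name, company_id column, file path and connection details are placeholders, not the real schema:

    use strict;
    use warnings;
    use DBI;

    # Simplified sketch of the current replace-everything load. Table name,
    # company_id column, file path and connection details are placeholders.
    my $dbh = DBI->connect('DBI:mysql:database=products', 'user', 'pass',
                           { RaiseError => 1 });

    my $company_id = 1;
    my $csv_file   = '/data/feeds/company_a.csv';

    # Drop the company's current rows, then bulk-load the new file.
    $dbh->do('DELETE FROM products WHERE company_id = ?', undef, $company_id);
    $dbh->do(qq{
        LOAD DATA LOCAL INFILE '$csv_file'
        INTO TABLE products
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\\n'
    });
    $dbh->disconnect;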

Normal database optimisation techniques focus on a database that is either queried often or updated often; sadly, ours has to be both, so we're not really getting anywhere with a solution to this problem. It seems that whatever solution I come up with (such as more than one db server with replication), we'll still hit the problem of replacing all 8 million rows in the table three times a day.

Anyway, enough database talk. Perl talk.

Each day's file won't vary much from the day before: there will be changes (to the price column), deletions (around 10,000 products may be deleted) and additions (maybe 10,000 new products), so the majority of the 1.2 million lines will be the same as the day before.

Since the files are largely the same each day, what I'd like to be able to do is compare yesterday's file with today's and work out which lines have a different price, which are new and which have been deleted. Even working out just what was added and deleted would be good enough, leaving me with around 10,000 deletes/inserts to do rather than 1.2 million.
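If both files were sorted by a unique product ID (a pass through the system sort would arrange that), I imagine something like this sequential merge would do it, holding only one line from each file in memory at a time. This is just a rough sketch; the file names, and the assumption that the ID is the first CSV field, are made up for illustration:

    use strict;
    use warnings;

    # Rough sketch of a sequential merge-compare. Assumes both files are
    # sorted by a unique product ID in the first CSV field; only one line
    # from each file is held in memory at a time.
    open my $old_fh, '<', 'yesterday.csv' or die "yesterday.csv: $!";
    open my $new_fh, '<', 'today.csv'     or die "today.csv: $!";

    sub key { my ($line) = @_; return (split /,/, $line, 2)[0] }

    my $old = <$old_fh>;
    my $new = <$new_fh>;

    while (defined $old or defined $new) {
        if (!defined $new or (defined $old and key($old) lt key($new))) {
            print "DELETE: $old";                  # only in yesterday's file
            $old = <$old_fh>;
        }
        elsif (!defined $old or key($old) gt key($new)) {
            print "INSERT: $new";                  # only in today's file
            $new = <$new_fh>;
        }
        else {
            print "UPDATE: $new" if $old ne $new;  # same ID, line changed
            $old = <$old_fh>;
            $new = <$new_fh>;
        }
    }

Since each file is read exactly once, the comparison itself is a single linear pass, and its output would be just the 10,000-odd statements instead of a full reload.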

There are of course many methods for comparing lists here on the site, and the List::Compare module provides easy access to them. However, when you're dealing with 1.2 million rows in each file, all these techniques just fall apart. I can't load both files into arrays or hashes at the same time to do comparisons because of memory constraints, and an attempt to use Tie::File, so the arrays were tied to files on disk, didn't get me anywhere either: it used a lot less memory, but ran for about 10 hours before eventually outputting nothing.
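Another direction I've wondered about, in case sorting the files first is impractical: keep only a 16-byte MD5 digest of each of yesterday's lines in a hash keyed by product ID, so memory holds one modest hash of digests rather than two whole files. Again just a sketch, under the same assumed field layout:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Digest-based compare for unsorted files: store an MD5 of each of
    # yesterday's lines keyed by product ID (assumed to be the first CSV
    # field), then stream today's file past the hash.
    my %yesterday;
    open my $old_fh, '<', 'yesterday.csv' or die "yesterday.csv: $!";
    while (my $line = <$old_fh>) {
        my ($id) = split /,/, $line, 2;
        $yesterday{$id} = md5($line);
    }
    close $old_fh;

    open my $new_fh, '<', 'today.csv' or die "today.csv: $!";
    while (my $line = <$new_fh>) {
        my ($id) = split /,/, $line, 2;
        if (!exists $yesterday{$id}) {
            print "INSERT: $line";
        }
        elsif ($yesterday{$id} ne md5($line)) {
            print "UPDATE: $line";
        }
        delete $yesterday{$id};    # whatever's left over was deleted today
    }
    close $new_fh;

    print "DELETE: $_\n" for keys %yesterday;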

So what do I do? How can you compare two files for changes when both are so large that you either run out of memory, or processing takes so long that the information in the files is out of date?

