in reply to How to process two files of over a million lines for changes

If the lines in the files have a fixed order, it's easy - you never need more than two lines in memory. Assume each file consists of two columns, product name and price, and the lines are ordered on the product name. Pseudo-algorithm (a Perl sketch of this merge follows the list):
  1. Read product name (pn.o) and price (p.o) from the old file. Read product name (pn.n) and price (p.n) from the new file.
  2. If pn.o eq pn.n, goto 5.
  3. If pn.o lt pn.n, then pn.o was deleted. If the old file is exhausted, goto 8, else read the next line of the old file into pn.o and p.o and goto 2.
  4. (pn.o gt pn.n) This means pn.n is a new product. If the new file is exhausted, goto 9, else read the next line of the new file into pn.n and p.n and goto 2.
  5. If p.o != p.n, the price was modified. Else there was no change in the product.
  6. If the old file is exhausted, goto 8, else read the next line of the old file into pn.o and p.o.
  7. If the new file is exhausted, goto 9, else read the next line of the new file into pn.n and p.n and goto 2.
  8. pn.n is a new product (unless it was just matched in step 5), and so are all other unread entries in the new file. Read them, adjust your database, and end the program.
  9. pn.o is a deleted product, and all other unread entries in the old file were deleted as well. Read them, adjust your database, and end the program.
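Here is a minimal Perl sketch of that merge, assuming hypothetical file names old.sorted and new.sorted, tab-separated product/price columns, and printing the differences instead of adjusting a database:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Each line is assumed to be "product<TAB>price", pre-sorted on product name.
    open my $old_fh, '<', 'old.sorted' or die "old.sorted: $!";
    open my $new_fh, '<', 'new.sorted' or die "new.sorted: $!";

    my ($pn_o, $p_o) = read_rec($old_fh);
    my ($pn_n, $p_n) = read_rec($new_fh);

    while (defined $pn_o and defined $pn_n) {
        if ($pn_o lt $pn_n) {                     # step 3: product vanished
            print "DELETED  $pn_o\n";
            ($pn_o, $p_o) = read_rec($old_fh);
        }
        elsif ($pn_o gt $pn_n) {                  # step 4: product appeared
            print "NEW      $pn_n ($p_n)\n";
            ($pn_n, $p_n) = read_rec($new_fh);
        }
        else {                                    # step 5: same product
            print "CHANGED  $pn_o: $p_o -> $p_n\n" if $p_o ne $p_n;
            ($pn_o, $p_o) = read_rec($old_fh);    # steps 6-7: advance both files
            ($pn_n, $p_n) = read_rec($new_fh);
        }
    }

    # step 8: whatever is left unread in the new file is a new product
    while (defined $pn_n) {
        print "NEW      $pn_n ($p_n)\n";
        ($pn_n, $p_n) = read_rec($new_fh);
    }

    # step 9: whatever is left unread in the old file was deleted
    while (defined $pn_o) {
        print "DELETED  $pn_o\n";
        ($pn_o, $p_o) = read_rec($old_fh);
    }

    # Return (name, price) from the next line, or the empty list at EOF.
    sub read_rec {
        my $fh = shift;
        defined(my $line = <$fh>) or return;
        chomp $line;
        return split /\t/, $line, 2;
    }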
Now, if the entries aren't sorted, you may be able to sort them using the sort program - it shouldn't have any difficulties sorting a few million lines.
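For example (assuming the product name is a single whitespace-free field, and the same hypothetical file names as above):

    $ sort -k1,1 old.txt > old.sorted
    $ sort -k1,1 new.txt > new.sorted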

Abigail


Re: Re: How to process two files of over a million lines for changes
by punkish (Priest) on May 23, 2004 at 02:56 UTC
    I second Abigail's suggestion. To reiterate: first sort, then do the comparison.

    A while back I was twiddling around during some "downtime" and cooked up the following little Perl exercise: remove duplicate email addresses from a file where each line was an address. So I applied the usual approach: build an array, sort it, take out the unique addresses, and write them to a new file. It took just under 2 minutes on my fast Windows box at work, and about 3.5 minutes on my relatively pokey iBook at home. Both machines had half a GB of RAM.
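    A minimal sketch of that approach, with hypothetical file names:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Slurp the addresses, sort them, and write each one out only once.
        open my $in, '<', 'addresses.txt' or die "addresses.txt: $!";
        chomp(my @addr = <$in>);
        close $in;

        open my $out, '>', 'addresses.uniq' or die "addresses.uniq: $!";
        my $prev;
        for my $a (sort @addr) {
            next if defined $prev and $a eq $prev;   # skip duplicates
            print {$out} "$a\n";
            $prev = $a;
        }
        close $out;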

    The file was 300+ MB in size and had about 145 million rows in it.

    ;-)

      I applied the usual approach: build an array, sort it, take out the unique addresses, and write them to a new file.
      For me, "the usual" would be:
      $ sort -u file > file.$$ && mv file.$$ file

      It would even work without much RAM.
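      (With GNU sort you can even cap the in-memory buffer and point it at a scratch directory; it spills to temporary files as needed:)

          $ sort -u -S 64M -T /tmp file > file.$$ && mv file.$$ file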

      Abigail

        Ya, but I am still at the "learning from the Cookbook" stage. ;-)

        Now, thanks to you, I went to the terminal and learned all sorts of nice things about 'sort.' No more using Perl for this problem.

        Gives meaning to the old saying, "Give a newbie some Perl knowledge, and all problems start looking like Perl problems."