baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I need some monks' wisdom with this one. The question is not so much technical as it is puzzle-like. The problem: I have a TSV file with two columns of numbers; the numbers are randomly associated:

123424  42312
2122    656735
2122    345335
2122    345
234254  32121
32121   23543
324243  2354345
The upper limit of all numbers is 100000000000, so no number is higher than that. There are around 100000000 entries (lines) in that table. Numbers in the first column can be repeated (all numbers are integers).

Now I have another TSV file with random integers in the first column associated with some large strings in the second column:

12423     string_of_characters_numbers_and_all_kind_of_stuff...,mminouiguifbgvufbbfvbiuserubrfui54895t3gh54gb3t789b54g 45g73b54g54bg754
23423453  string_of_characters_numbers_and_all_kind_of_stuff...,mminouiguifbgvufbbfvbiuserubrfui54895t3gh54gb3t789b54g 45g73b54g54bg754
What I am looking for is which numbers in the first column of the first TSV file can be associated with the first column of the second TSV file. The key is:

first_tsv.second_column == second_tsv.first_column
The second TSV has some 100000000 lines, but keep in mind that the lines are large enough that hashing the file takes up some 2 GB of RAM plus 7 GB of swap (I terminated the process at that point).

As you can see, it is a typical association problem. What I have done so far: first I tried to hash everything, but it simply exceeds my memory (2 GB). The second thing I tried was some on-disk hashing, but that is not going to work either (SQLite does a much, much better job). So, as I said, I imported it into a database (SQLite) and tried to associate them through a simple statement:

insert into newtable (col1, col2, col3)
select tsv1.col1, tsv1.col2, tsv2.col2
from tsv1
inner join tsv2 on tsv1.col2 = tsv2.col1;
Everything is indexed by the book, but it again runs out of memory. The SELECT statement is killing me.

Then it came to me: since I'm already dealing with numbers that act as a unique key of some sort, maybe I could use that. Maybe someone knows some kind of trick to reduce the memory needed to hash the first TSV file; then I could deal with the rest.

Thank you

baxy

Replies are listed 'Best First'.
Re: associations and sorting
by almut (Canon) on Nov 28, 2009 at 17:26 UTC

    Maybe you could try BerkeleyDB (a binding to the C library of the same name). As you essentially have only simple key/value associations, and it is tailored exactly to this kind of problem, it might be more memory-efficient than SQLite. The current version of Berkeley DB supports table sizes of up to 256 TB.
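
    For example, loading the first file into an on-disk B-tree might look roughly like this (an untested sketch; the filenames, and the assumption that tsv1 is a plain "col1<TAB>col2" file, are mine, not from the original post):

    use strict;
    use warnings;
    use BerkeleyDB;

    # on-disk B-tree keyed by column 2 of the first file, so only the
    # pages currently being touched have to sit in RAM
    tie my %by_key, 'BerkeleyDB::Btree',
        -Filename => 'tsv1_col2.db',
        -Flags    => DB_CREATE
        or die "cannot open tsv1_col2.db: $BerkeleyDB::Error";

    open my $fh, '<', 'tsv1.tsv' or die "tsv1.tsv: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($col1, $col2) = split /\t/, $line;
        $by_key{$col2} = $col1;    # col2 values may repeat; append rather
                                   # than overwrite if you need all of them
    }
    close $fh;

    # second pass: stream tsv2 and look each first-column number up in %by_key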

Re: associations and sorting
by Marshall (Canon) on Nov 28, 2009 at 18:52 UTC
    What I am looking for is which numbers in the first column of the first TSV file can be associated with the first column of the second TSV file.

    Sounds to me like your sorting idea will work. The command-line sort can handle huge files. Sort both files. Now that you have them in predictable ascending order, open both files and do a "jagged walk" down both of them. You are looking for equality in two ascending streams, e.g. (1,2,4,6,8) vs. (5,6,7,8). Start with 1 and 5. Then read lines from the first file until you see 5 or something bigger; in this case 6 shows up. Then read lines from the second file until you see 6 or something bigger. In this case 6 appears in both. Then advance to the next numbers. You will have to do something extra if a number appears more than once.
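
    Something along these lines (an untested sketch; the filenames are made up, and it assumes both files were pre-sorted numerically on their key column with sort -n, on a perl whose integers go beyond 32 bits):

    use strict;
    use warnings;

    open my $f1, '<', 'file1.sorted' or die "file1.sorted: $!";
    open my $f2, '<', 'file2.sorted' or die "file2.sorted: $!";

    my $line1 = <$f1>;
    my $line2 = <$f2>;
    while (defined $line1 and defined $line2) {
        my (undef, $k1) = split /\t/, $line1;    # key is column 2 of file1
        my ($k2)        = split /\t/, $line2;    # key is column 1 of file2
        chomp $k1;                               # last field keeps its newline
        if    ($k1 < $k2) { $line1 = <$f1> }
        elsif ($k1 > $k2) { $line2 = <$f2> }
        else {                                   # equal keys: a match
            print "$k1\n";
            $line1 = <$f1>;                      # extend here if a key can
            $line2 = <$f2>;                      # repeat within one file
        }
    }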

    If you made a pass that added leading zeroes so that all numbers of interest contain the same number of digits, then you could use string comparison (lt, gt, eq) instead of numeric comparison. There is also the Math::BigInt module for integers that don't fit in 32 bits.

    Update: I don't know how slow the sort will be, but since you are only interested in the key column of each file, you could combine the two files, sort the result by that column, and then make one pass through the sorted file looking for adjacent lines with the same key. The format differs between the two files (a number vs. a string in the second column), so you can tell which file each line came from. In this case numeric vs. ASCII sort wouldn't matter, since you are only looking for lines that share the same first column. I don't know how these sorting approaches would compare performance-wise with a DB-based solution.
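
    A rough sketch of that one-pass scan (untested; it assumes the key is column 1 of every line in the combined file, e.g. after swapping file1's columns, and that the file was sorted with something like sort -k1,1 combined.tsv > combined.sorted):

    use strict;
    use warnings;

    open my $fh, '<', 'combined.sorted' or die "combined.sorted: $!";

    my $prev_key = '';
    my @group;                      # all lines sharing the current key

    sub flush_group {
        # a key on more than one line is a candidate match; the second
        # column (number vs. string) tells you which file a line came from
        print "$_\n" for @group;
    }

    while (my $line = <$fh>) {
        chomp $line;
        my ($key) = split /\t/, $line;
        if ($key ne $prev_key) {
            flush_group() if @group > 1;
            @group    = ();
            $prev_key = $key;
        }
        push @group, $line;
    }
    flush_group() if @group > 1;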

Re: associations and sorting
by NiJo (Friar) on Nov 28, 2009 at 19:55 UTC
    I'd split off the large strings of tsv2 very early in the process and substitute line numbers for them. The large strings seem to be just dead weight, but your unstated final goal might be to compute links from tsv1.col1 to tsv2.col2; that additional lookup is the price you pay for the reduced memory requirements.
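
    A rough sketch of that first step (untested; the filenames are made up, and I'm using byte offsets rather than line numbers so a later seek can fetch the full line directly):

    use strict;
    use warnings;

    open my $in,  '<', 'tsv2.tsv'  or die "tsv2.tsv: $!";
    open my $out, '>', 'tsv2.slim' or die "tsv2.slim: $!";

    while (1) {
        my $offset = tell $in;      # position of the line about to be read
        my $line   = <$in>;
        last unless defined $line;
        my ($key) = split /\t/, $line;
        print $out "$key\t$offset\n";
    }
    close $out;

    # later, to recover the string for a matched key:
    #   seek $in, $offset, 0;
    #   my $full_line = <$in>;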

    Stating the file sizes would make your problem more comprehensible.

Re: associations and sorting
by ReturnOfThelonious (Beadle) on Nov 30, 2009 at 18:32 UTC
    Marshall's idea is sound. On Unix-ish systems (or with a Unix-ish shell on Windows), the whole task is:
    sort -t$'\t' -k 2,2 file1 > file1a
    sort -t$'\t' file2 > file2a
    join -t$'\t' -1 2 file1a file2a > file3
    At least in Bash, $'\t' produces a tab; I'm not sure about other shells, but you can always use a literal tab in quotes. The output will have the join field first, followed by the first field of file1, then all the fields (except the first) of file2, sorted lexically by the join field. This will handle files of any size as long as you have enough disk space.