baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:
i need some monks wisdom with this one. the question is not so technical as it is more puzzle like. so the problem is, i have a tsv file with two columns of numbers. numbers are randomly associated
the upper limit of all numbers is 100000000000, so there is no number that is higher then that one. there are around 100000000 entries(lines) in that table. in the first column numbers can be repeated (all numbers are integers).123424 42312 2122 656735 2122 345335 2122 345 234254 32121 32121 23543 324243 2354345
now i have another tsv file with some rand numbers (int) associated with some large strings in the second column.
so what i am looking for is to find which numbers of the first column of the first tsv file can be associated with the first column of the second tsv file. the key is :12423 string_of_characters_numbers_and_all_kind_of_stuff...,mminouigu +ifbgvufbbfvbiuserubrfui54895t3gh54gb3t789b54g 45g73b54g54bg754 23423453 string_of_characters_numbers_and_all_kind_of_stuff...,mminou +iguifbgvufbbfvbiuserubrfui54895t3gh54gb3t789b54g 45g73b54g54bg754
the second tsv has some 100000000 lines but keep in mind that lines are large enough that the hashing the file takes up some 2GB of ram + 7GB of swap (i terminated the process there)first_tsv.sec_column == second_tsv.first_column
so as you can see it is a typical association problem. what i have done so far is tried to hash everything but it just goes out of scope of my memory (2GB). second thing i tried is to do some on-disc hashing but this is not going to work(SQLite does much much much better job). so as i said i tried to import it to db (SQLite) and then through a simple statement
associate them (everything is indexed by the book), but it again goes out of scope of my memory. SELECT statement is killing me.insert into newtable(col1,col2,col3) select tsv1.col1 tsv1.col2 tsv2.c +ol2 from tsv1 inner join tsv2 on tsv1.col2 = tsv2.col2"
so i it came to me since i'm already dealing with numbers that act as a unique key of some sort , maybe i could use that. maybe someone knows some kind of trick or something to reduce the hashing of the first tsv file - then i could deal with that.
thank you
baxy
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: associations and sorting
by almut (Canon) on Nov 28, 2009 at 17:26 UTC | |
|
Re: associations and sorting
by Marshall (Canon) on Nov 28, 2009 at 18:52 UTC | |
|
Re: associations and sorting
by NiJo (Friar) on Nov 28, 2009 at 19:55 UTC | |
|
Re: associations and sorting
by ReturnOfThelonious (Beadle) on Nov 30, 2009 at 18:32 UTC |