comment on

hello,

i need some monks wisdom with this one. the question is not so technical as it is more puzzle like. so the problem is, i have a tsv file with two columns of numbers. numbers are randomly associated

123424  42312
2122  656735
2122  345335
2122  345
234254  32121
32121  23543  
324243  2354345
[download]

the upper limit of all numbers is 100000000000, so there is no number that is higher then that one. there are around 100000000 entries(lines) in that table. in the first column numbers can be repeated (all numbers are integers).

now i have another tsv file with some rand numbers (int) associated with some large strings in the second column.

12423  string_of_characters_numbers_and_all_kind_of_stuff...,mminouigu
+ifbgvufbbfvbiuserubrfui54895t3gh54gb3t789b54g 45g73b54g54bg754
23423453  string_of_characters_numbers_and_all_kind_of_stuff...,mminou
+iguifbgvufbbfvbiuserubrfui54895t3gh54gb3t789b54g 45g73b54g54bg754
[download]

so what i am looking for is to find which numbers of the first column of the first tsv file can be associated with the first column of the second tsv file. the key is :

first_tsv.sec_column ==  second_tsv.first_column
[download]

the second tsv has some 100000000 lines but keep in mind that lines are large enough that the hashing the file takes up some 2GB of ram + 7GB of swap (i terminated the process there)

so as you can see it is a typical association problem. what i have done so far is tried to hash everything but it just goes out of scope of my memory (2GB). second thing i tried is to do some on-disc hashing but this is not going to work(SQLite does much much much better job). so as i said i tried to import it to db (SQLite) and then through a simple statement

insert into newtable(col1,col2,col3) select tsv1.col1 tsv1.col2 tsv2.c
+ol2 from tsv1 inner join tsv2 on tsv1.col2 = tsv2.col2"
[download]

associate them (everything is indexed by the book), but it again goes out of scope of my memory. SELECT statement is killing me.

so i it came to me since i'm already dealing with numbers that act as a unique key of some sort , maybe i could use that. maybe someone knows some kind of trick or something to reduce the hashing of the first tsv file - then i could deal with that.

thank you

baxy

In reply to associations and sorting by baxy77bax

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.