So what I am looking for is to find which numbers in the first column of the first TSV file can be associated with the first column of the second TSV file.

Sounds to me like your sorting idea will work. The command-line sort can handle huge files. Sort both files on the first column, using the same ordering for both. Now that you have them in a predictable ascending order, open both files and do a "jagged walk" down both of them. You are looking for equality in two ascending streams, e.g. (1,2,4,6,8) vs (5,6,7,8). Start with 1 and 5. Read lines from the first file until you see either 5 or something bigger; in this case 6 shows up. Then read lines from the second file until you see 6 or something bigger; in this case 6 appears in both, so that is a match. Then advance both files to their next numbers and repeat. You will have to decide what to do if a number appears more than once in a file, for example skipping the repeats or reporting every pairing.
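
Here is a minimal sketch of that jagged walk. It is written in Python rather than Perl for brevity, and it works on in-memory lists instead of file handles, so treat it as an illustration of the idea rather than a finished tool. The function name `sorted_intersection` is made up for this example, and the handling of repeated numbers (report each shared value once) is just one possible choice.

```python
def sorted_intersection(first, second):
    """Yield values present in both ascending iterables (the "jagged walk")."""
    it_a, it_b = iter(first), iter(second)
    a = next(it_a, None)  # None marks an exhausted stream
    b = next(it_b, None)
    while a is not None and b is not None:
        if a < b:
            a = next(it_a, None)  # first stream is behind: advance it
        elif a > b:
            b = next(it_b, None)  # second stream is behind: advance it
        else:
            match = a
            yield match
            # Assumption: report each shared value once, so skip repeats
            # of the matched value in both streams.
            while a == match:
                a = next(it_a, None)
            while b == match:
                b = next(it_b, None)


print(list(sorted_intersection([1, 2, 4, 6, 8], [5, 6, 7, 8])))  # [6, 8]
```

Because each stream is only ever advanced, never rewound, the walk reads every line at most once and needs only the current value from each file in memory.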

If you made a pass that added leading zeroes so that all numbers of interest contain the same number of digits, then you could use string comparison (gt, lt, eq) instead of numeric comparison, and a plain string sort would put the lines in numeric order. There is also the Math::BigInt module, which handles integers larger than 32 bits.
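
A small sketch of why the padding helps, again in Python and assuming the numbers are non-negative decimal strings. The helper name `pad` and the sample values are invented for this example.

```python
def pad(number_text, width):
    """Left-pad a non-negative decimal string with zeroes to a fixed width."""
    return number_text.zfill(width)


values = ["9", "10", "123456789012345678901", "42"]
width = max(len(v) for v in values)  # pad everything to the longest number

plain_order = sorted(values)  # string order without padding
padded_order = sorted(pad(v, width) for v in values)
numeric_order = sorted(values, key=int)

# Without padding, string order puts "10" ahead of "9".
print(plain_order.index("10") < plain_order.index("9"))  # True
# With padding, string order agrees with numeric order.
print([p.lstrip("0") for p in padded_order] == numeric_order)  # True
```

Once every key has the same width, the comparison never needs to convert to a number, so the size of the integers stops mattering.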

Update: I don't know how slow the sort will be, but since you are only interested in column one of both files, you could combine the two files and sort the result by the first column. Then you could make one pass through the sorted file and look for adjacent lines that share the same column-one number. The format differs between the two files (the second column is a number in one and a string in the other), so you could tell which file each line came from. In this case, numeric vs ASCII sort wouldn't matter, as you would just be looking for lines that have the same column one, and equal keys end up next to each other either way. I don't know how these sorting approaches would compare performance-wise with a DB-based solution.
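
A rough sketch of the combined-file pass, in Python. To keep it self-contained it sorts in memory, standing in for the external sort you would use on huge files. It also tags each row with an explicit "A" or "B" label instead of inspecting the second column. Both of those are assumptions of this example, as are the function name `keys_in_both` and the sample rows.

```python
from itertools import groupby


def keys_in_both(rows_a, rows_b):
    """Return column-one keys that occur in both collections of TSV rows."""
    # Tag each key with its origin; the "A"/"B" labels are hypothetical
    # stand-ins for telling the files apart by their second column.
    combined = [(row.split("\t")[0], "A") for row in rows_a]
    combined += [(row.split("\t")[0], "B") for row in rows_b]
    combined.sort()  # stand-in for an external sort on column one

    shared = []
    # After sorting, equal keys are adjacent, so one pass is enough.
    for key, group in groupby(combined, key=lambda pair: pair[0]):
        sources = {source for _, source in group}
        if sources == {"A", "B"}:
            shared.append(key)
    return shared


file_a = ["6\t100", "1\t200", "8\t300", "6\t400"]
file_b = ["8\tfoo", "5\tbar", "6\tbaz"]
print(keys_in_both(file_a, file_b))  # ['6', '8']
```

A key that repeats within only one file is not reported, because the set of sources for that group contains a single label.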


In reply to Re: associations and sorting by Marshall
in thread associations and sorting by baxy77bax
