antonn,
I am sorry if your first post at the Monastery didn't go as well as you might have hoped or expected. I encourage you to stick around but also to read How (Not) To Ask A Question and How do I post a question effectively?.

Since you already have a working solution, I am going to give you far more than you wanted to know about how this problem might be approached as a research project. First consider the most straight forward implementation. Read 1 file line by line while repeatedly opening/reading the lines of the second file. As a rule of thumb, don't reach for a regex when eq or index will do. This solution will work but is suboptimal because it does not scale well O(N * M) and IO is notoriously slow.

Assuming you have enough memory, you could improve this solution by completely reading 1 file into an in-memory array. This still leaves an O(N * M) algorithm but you have dramatically reduced the IO. If instead we use an in-memory hash, we can reduce IO and effectively change the algorithm to O(N) or O(N + M) if we don't ignore the time to construct the hash is based on the size of the file.

You may decide using an in-memory hash of the filter file is great until you run into a situation where the filter file will no longer fit into memory. An easy solution to this problem is to sort both of the files. You can then read them both at the same time. Essentially, you read one line from each file and compare them. Which ever is less than the other, you continue to read lines from that file comparing it until you find a line that is equal to or greater than your other line. I assume the rest of the algorithm is obvious, but if you need more help let me know. If you don't look too closely, this algorithm is O(N + M).

Again, this may look like a fine solution. You can even use existing solutions like comm. What happens when the output needs to appear in the same order as in the original file? In this case, you need to find a way to have our cake and eat it too. One solution may be to load the filter file into a database with an index. This effectively takes us back to the in-memory hash solution but on disk. You are now doing extra IO but it is minimized because of the efficiency in finding the data you are looking for. Since an entire DB is overkill, you might instead use DB_File.

If you have read my Necessity is the mother of invention meditation, perhaps you might want to further your skill level by trying to solve this problem without the aid of a DB as described above. Since we are contriving this situation, let's suppose environmental conditions prevent you. You could read the filter file line by line and generate a Digest::MD5 hash and write them out to a new file. Next, we can sort the file numerically - remember the hash digest can be treated as 128 bit integer. Now we read the file to be filtered line by line, generate a MD5 hash and then perform a binary search of the filter file. The binary search is possible because we know how many lines are in the file (wc for instance) and each line is fixed width. This way we can calculate where we need to seek to. I haven't conducted rigorous algorithm analysis but this would appear to be O((N * Log M) + M).

As you can see, something you thought of simple question can turn into an expedition into learning. I hope this helps.

Cheers - L~R


In reply to Re: Filtering lines from one file from another by Limbic~Region
in thread Filtering lines from one file from another by antonn

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.