Since you already have a working solution, I am going to give you far more than you wanted to know about how this problem might be approached as a research project. First, consider the most straightforward implementation: read one file line by line and, for each line, re-open and read through the second file looking for a match. As a rule of thumb, don't reach for a regex when eq or index will do. This solution works, but it is suboptimal because it does not scale: it is O(N * M), and IO is notoriously slow.
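A minimal sketch of that naive approach (the filenames input.txt and filter.txt are made up for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # O(N * M): rescan the filter file for every input line
    open my $in, '<', 'input.txt' or die "input.txt: $!";
    LINE: while ( my $line = <$in> ) {
        open my $filter, '<', 'filter.txt' or die "filter.txt: $!";
        while ( my $bad = <$filter> ) {
            next LINE if $line eq $bad;    # eq, not a regex
        }
        print $line;
    }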
Assuming you have enough memory, you could improve this solution by reading one file completely into an in-memory array. This is still an O(N * M) algorithm, but you have dramatically reduced the IO. If you use an in-memory hash instead, you reduce the IO and effectively change the algorithm to O(N), or O(N + M) once you account for the time to build the hash, which is proportional to the size of the filter file.
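The hash version might look like this (same made-up filenames):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read the filter file once into a hash: O(M) to build
    my %filter;
    open my $flt, '<', 'filter.txt' or die "filter.txt: $!";
    while ( my $line = <$flt> ) {
        chomp $line;
        $filter{$line} = 1;
    }
    close $flt;

    # Stream the input file: O(N) constant-time lookups
    open my $in, '<', 'input.txt' or die "input.txt: $!";
    while ( my $line = <$in> ) {
        chomp( my $key = $line );
        print $line unless exists $filter{$key};
    }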
You may decide an in-memory hash of the filter file is great, until you run into a situation where the filter file no longer fits into memory. An easy solution to this problem is to sort both files. You can then read them at the same time: read one line from each file and compare them. Whichever is less than the other, keep reading lines from that file, comparing as you go, until you find a line that is equal to or greater than the line from the other file. I assume the rest of the algorithm is obvious (there is a sketch below), but if you need more help let me know. If you don't look too closely - that is, ignoring the cost of the sorts - this algorithm is O(N + M).
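A sketch of the merge, assuming both files have already been sorted (with sort(1), say) so that plain string comparison agrees with the sort order; again, the filenames are hypothetical:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $in,  '<', 'input.sorted'  or die "input.sorted: $!";
    open my $flt, '<', 'filter.sorted' or die "filter.sorted: $!";

    my $bad = <$flt>;
    while ( defined( my $line = <$in> ) ) {
        # Advance the filter while its line sorts before the input line
        $bad = <$flt> while defined $bad && $bad lt $line;
        print $line unless defined $bad && $bad eq $line;
    }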
Again, this may look like a fine solution; you can even use existing tools like comm. But what happens when the output needs to appear in the same order as in the original file? In that case, you need to find a way to have your cake and eat it too. One solution is to load the filter file into a database with an index. This effectively takes us back to the in-memory hash solution, but on disk. You are now doing extra IO, but it is minimized by the efficiency of the indexed lookups. Since an entire DB is overkill, you might instead use DB_File.
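A rough cut with DB_File, tying the filter set to an on-disk hash so it no longer has to fit in memory (hypothetical filenames again, and the load step only needs to run once):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie the filter set to a DB file on disk
    my %filter;
    tie %filter, 'DB_File', 'filter.db', O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "filter.db: $!";

    # One-time load of the filter file into the DB
    open my $flt, '<', 'filter.txt' or die "filter.txt: $!";
    while ( my $line = <$flt> ) {
        chomp $line;
        $filter{$line} = 1;
    }
    close $flt;

    # Filter the input in its original order
    open my $in, '<', 'input.txt' or die "input.txt: $!";
    while ( my $line = <$in> ) {
        chomp( my $key = $line );
        print $line unless exists $filter{$key};
    }
    untie %filter;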
If you have read my "Necessity is the mother of invention" meditation, perhaps you might want to further your skill level by trying to solve this problem without the aid of a DB as described above. Since we are contriving this situation, let's suppose environmental conditions prevent you from using one. You could read the filter file line by line, generate a Digest::MD5 digest of each line, and write the digests out to a new file. Next, we sort that file - remember, the digest can be treated as a 128-bit integer, and since the hex digests are fixed width, a lexical sort orders them numerically. Now we read the file to be filtered line by line, generate an MD5 digest of each line, and perform a binary search of the sorted digest file. The binary search is possible because we know how many lines are in the file (wc, for instance) and each line is fixed width, so we can calculate exactly where to seek. I haven't conducted rigorous algorithm analysis, but this would appear to be roughly O(N log M) for the lookups, plus the O(M log M) cost of building and sorting the digest file.
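Putting the digest idea together might look something like this, with the sort done externally between the two steps (filenames are made up; each record is 32 hex digits plus a newline, so 33 bytes):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Step 1 (one time): digest the filter file into fixed-width records,
    # then sort it externally, e.g.:  sort digests.txt > digests.sorted
    open my $flt, '<', 'filter.txt'  or die "filter.txt: $!";
    open my $out, '>', 'digests.txt' or die "digests.txt: $!";
    while ( my $line = <$flt> ) {
        chomp $line;
        print {$out} md5_hex($line), "\n";
    }
    close $_ for $flt, $out;

    # Step 2: binary search the sorted, fixed-width digest file
    my $RECLEN = 33;    # 32 hex chars + "\n"
    open my $dig, '<', 'digests.sorted' or die "digests.sorted: $!";
    my $records = ( -s $dig ) / $RECLEN;

    sub seen {
        my $want = shift;
        my ( $lo, $hi ) = ( 0, $records - 1 );
        while ( $lo <= $hi ) {
            my $mid = int( ( $lo + $hi ) / 2 );
            seek $dig, $mid * $RECLEN, 0 or die "seek: $!";
            read $dig, my $rec, 32;
            if    ( $rec lt $want ) { $lo = $mid + 1 }
            elsif ( $rec gt $want ) { $hi = $mid - 1 }
            else                    { return 1 }
        }
        return 0;
    }

    open my $in, '<', 'input.txt' or die "input.txt: $!";
    while ( my $line = <$in> ) {
        chomp( my $key = $line );
        print $line unless seen( md5_hex($key) );
    }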
As you can see, something you thought was a simple question can turn into an expedition in learning. I hope this helps.
Cheers - L~R