billie_t has asked for the wisdom of the Perl Monks concerning the following question:

This is a completely clueless Perl newbie question, but at this point I just want some ideas of where to start.

I have a list of usernames that is 6000+ lines long. I need to parse each line of this file and compare it against another, much larger file (tens of thousands of lines). If I find one of the usernames from the first file embedded in a line of text in the larger file, I need to remove that matched line and all subsequent lines (the number varies!) up to and including the next matched line.

I'm particularly wondering if Tie::File would be appropriate for this task. Other suggestions, methods and pointers most welcome.


Replies are listed 'Best First'.
Re: Suggestions re parsing one large file with elements of another large file
by davido (Cardinal) on Jan 13, 2004 at 05:42 UTC
    While the files seem pretty big if you look at the number of lines, you can take some comfort in knowing that even at 80 characters per line plus 50% overhead, that ten-thousand-line file will still only consume 1.2MB of RAM if you slurp the whole thing into memory. Add to that the 6,000-line file with the same 80 characters and 50% overhead (another 720KB), and you're consuming a whopping 1.92MB of RAM if you slurp both. If the ultimate in quick performance is necessary, and it's not possible to move the data sources into a database, slurp, but do so realizing that you'll eventually run out of memory if those files keep growing.
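
    For a newcomer, "slurping" just means reading the whole file into memory at once; a minimal sketch, with an invented filename:

        # Read every line of the username file into an array in one go.
        open my $fh, '<', 'usernames.txt' or die "Can't open usernames.txt: $!";
        my @usernames = <$fh>;    # one array element per line
        close $fh;
        chomp @usernames;         # strip the trailing newlines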

    If, on the other hand, the files may end up growing substantially larger, Tie::File is a good way to go. The POD for Tie::File states that no slurping goes on. (The actual text says, "The file is not loaded into memory, so this will work even for gigantic files.")
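
    To make that concrete, here is a minimal Tie::File sketch (the filename is invented for illustration):

        use Tie::File;

        # Tie the array to the file; records are fetched from disk on demand
        # rather than being held in memory all at once.
        tie my @lines, 'Tie::File', 'bigfile.txt'
            or die "Can't tie bigfile.txt: $!";

        print scalar(@lines), " lines\n";   # @lines behaves like a normal array
        untie @lines;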

    Another possible solution involves changing the design of the data sources. If the two files could be converted to a couple of database tables, scalability would no longer be a concern.
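
    A rough sketch of that idea, assuming DBD::SQLite is available (the table and file names are invented):

        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=users.db', '', '',
                               { RaiseError => 1 });
        $dbh->do('CREATE TABLE IF NOT EXISTS usernames (name TEXT PRIMARY KEY)');

        my $sth = $dbh->prepare('INSERT OR IGNORE INTO usernames (name) VALUES (?)');
        open my $fh, '<', 'usernames.txt' or die "Can't open usernames.txt: $!";
        while (my $name = <$fh>) {
            chomp $name;
            $sth->execute($name) if length $name;
        }
        close $fh;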


    Dave

Re: Suggestions re parsing one large file with elements of another large file
by neuroball (Pilgrim) on Jan 13, 2004 at 06:07 UTC

    Hm... why go high-tech when low-tech will do it for you? Just read the username file in line by line and add the names to a hash. Why a hash? Because you can check hash entries for existence.

    while (<FH>) {
        next if /^#/;            # skip comment lines
        chomp;
        if (not exists $usernames{$_}) {
            $usernames{$_} = 1;
        }
        else {
            die "found username duplicate in first file, script halted!\n";
        }
    }

    Now you can open the other file, read it line by line (with or without buffering it into an array), parse each line into its words, and then compare each word against the usernames hash (exists $usernames{$word}).

    If no match comes up, write the line to an output file. If you find a match, keep reading until you find the next match, then start again from the beginning (of this paragraph, not the file). :)
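
    Here is a sketch of what that second pass might look like, assuming %usernames was filled by the loop above and that splitting on non-word characters is a reasonable way to pull the usernames out of a line (the file names are invented):

        open my $in,  '<', 'bigfile.txt'  or die "Can't read bigfile.txt: $!";
        open my $out, '>', 'filtered.txt' or die "Can't write filtered.txt: $!";

        my $skipping = 0;
        while (my $line = <$in>) {
            # True if any word on this line is a known username.
            my $has_username = grep { exists $usernames{$_} } split /\W+/, $line;

            if ($skipping) {
                $skipping = 0 if $has_username;   # drop this closing line too, then resume
                next;
            }
            if ($has_username) {
                $skipping = 1;                    # drop this line and the ones that follow
                next;
            }
            print $out $line;                     # no match: keep the line
        }
        close $in;
        close $out;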

    /oliver/

      Thanks for the suggestions, guys. It's true that there is probably enough memory to read both of them at once; I can open both files in poxy Notepad, so they must fit into memory.

      And I'll see how far I can get with your suggestion, oliver; at least it has syntax I can understand. The big problem with outputting non-matches into another file is that there is a variable number of lines between matching the <username> the first time and the next (last) time, and all of those lines need to be omitted. I suppose you match once, throw away all subsequent lines until the next match, and then start writing lines to your output again. I can see the concept, but not quite how to execute it...

      Thanks again for the food for thought.