
You basically have two possible strategies. Which one to choose depends on several factors, mainly the relative sizes of the two lists and how cleanly the names in the first list appear in the second.

Suppose your list of names is very short and your document quite large. For example, the document is the King James Bible and the list has only four names: God, David, Mary, Jesus. You will probably want to read the document line by line and print each line that matches a regular expression. Something like this:

# ...
while (<$INPUT>) {
     print {$OUT} $_ if /God/ or /David/ or /Mary/ or /Jesus/;
     # could also be written: print {$OUT} $_ if /God|David|Mary|Jesus/;
}

The first form seems to be slightly faster than the alternation in the commented-out line, but the difference hardly matters because both are very fast (about 0.1 second with the edition of the Bible that I used).
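
When the names come from a list rather than being typed into the pattern, the alternation can be built once and reused. The following is only a minimal sketch: the helper names build_name_regex and matching_lines and the sample lines are hypothetical, and the \b anchors are an added assumption so that "God" does not also match "Godly".

```perl
use strict;
use warnings;

# Hypothetical helper: build one precompiled alternation from a list of names.
# quotemeta protects names containing regex metacharacters; the \b anchors
# are an assumption, added so that "God" does not also match "Godly".
sub build_name_regex {
    my @names = @_;
    my $alternation = join '|', map { quotemeta($_) } @names;
    return qr/\b(?:$alternation)\b/;
}

# Return the lines that mention at least one name.
sub matching_lines {
    my ($regex, @lines) = @_;
    return grep { /$regex/ } @lines;
}

my $regex  = build_name_regex(qw(God David Mary Jesus));
my @sample = (
    "In the beginning God created the heaven and the earth.\n",
    "And the evening and the morning were the first day.\n",
    "And David said to Saul, Let no man's heart fail.\n",
);
print matching_lines($regex, @sample);
```

Dropping the \b anchors gives the same loose substring matching as the snippet above; which behaviour you want depends on how the names appear in your document.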

The opposite case is when your name list is very large (say, 10,000 words or more) and the document quite small. In that case, it is probably better to load the name list into a hash first, then read the document line by line, split each line into words, and check whether each word exists in the hash. Something like this (untested):

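# %name_hash is assumed to be filled already, with one key per name
IN: while (<$INPUT>) {
     my @words = split /\b/, $_;   # word and non-word chunks alike
     foreach my $word (@words) {
          if (exists $name_hash{$word}) {
               print $_;
               next IN;            # one match is enough for this line
          }
     }
}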
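
The loop above takes %name_hash for granted. Here is a minimal sketch of one way to populate it, assuming the name list holds one name per line. The helpers load_names and lines_with_names are illustrative names, and an in-memory filehandle stands in for the real name file.

```perl
use strict;
use warnings;

# Hypothetical helper: read one name per line from a filehandle into a hash.
sub load_names {
    my ($fh) = @_;
    my %name_hash;
    while (my $name = <$fh>) {
        chomp $name;
        next unless length $name;    # skip blank lines
        $name_hash{$name} = 1;
    }
    return \%name_hash;
}

# Return the lines that contain at least one word found in the hash.
sub lines_with_names {
    my ($name_hash, @lines) = @_;
    my @hits;
    LINE: for my $line (@lines) {
        for my $word (split /\W+/, $line) {
            if (exists $name_hash->{$word}) {
                push @hits, $line;
                next LINE;           # one hit per line is enough
            }
        }
    }
    return @hits;
}

# An in-memory filehandle stands in for the real name list file.
my $name_data = "God\nDavid\nMary\nJesus\n";
open my $names_fh, '<', \$name_data or die "Cannot open name list: $!";
my $names = load_names($names_fh);
close $names_fh;

print lines_with_names($names, "And Mary said, Behold.\n", "It was the third hour.\n");
```

Note that the hash lookup is exact and case-sensitive, so a name made of several words, or one written with different capitalisation, would need extra handling.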

With the same small list and the same document as above, execution takes at least 15 times longer (about 1.5 sec). In many cases I would not care: 0.1 sec versus 1.5 sec is often an irrelevant difference. But if the name list has a few hundred words or more, or if the document is significantly shorter, this second solution is likely to be the better one.

Quite possibly you don't even care about speed because both are so fast anyway; in that case, choose the easiest algorithm (probably the first one).
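
Timings like these depend on the machine, the Perl version and the input, so it may be worth measuring on your own data. Here is a rough sketch using the core Benchmark module. The sample lines and the sub names are made up for illustration, and no particular result is implied.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data; substitute your real document and name list.
my @names = qw(God David Mary Jesus);
my @lines = ("In the beginning God created the heaven and the earth.\n",
             "And the earth was without form, and void.\n") x 5_000;

my $regex     = qr/God|David|Mary|Jesus/;
my %name_hash = map { $_ => 1 } @names;

# Strategy 1: one regex match per line.
sub count_by_regex {
    return scalar(grep { /$regex/ } @lines);
}

# Strategy 2: split each line into words and look each word up in the hash.
sub count_by_hash {
    my $count = 0;
    LINE: for my $line (@lines) {
        for my $word (split /\b/, $line) {
            if (exists $name_hash{$word}) { $count++; next LINE }
        }
    }
    return $count;
}

# Run each strategy for at least one CPU second and print a comparison table.
cmpthese(-1, { regex => \&count_by_regex, hash => \&count_by_hash });
```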

In reply to Re: Help from the Perliest monks by Laurent_R
in thread Help from the Perliest monks by perlmonknoob
