You basically have two possible strategies. Which one to choose depends on several factors, the main ones being the relative sizes of the two lists (the name list and the document) and how cleanly the names from the first list appear in the second one.
Suppose your list of names is very short and your document quite large. For example, the document is the King James Bible and the list has only four names: God, David, Mary, Jesus. You will probably want to read the document line by line and print each line that matches a regular expression. Something like this:
# ...
while (<$INPUT>) {
    # $_ is spelled out so that $OUT is taken as the filehandle, not as the thing to print
    print $OUT $_ if /God/ or /David/ or /Mary/ or /Jesus/;
    # could also be written: print $OUT $_ if /God|David|Mary|Jesus/;
}
The first form seems to be slightly faster than the one in the commented-out line, but the difference is essentially irrelevant because both are really fast anyway (about 0.1 second with the edition of the Bible that I used).
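If the names are not hard-coded, the alternation can be built from the list itself. The following is only a sketch under some assumptions: the helper names build_name_regex and matching_lines are made up for illustration, the sample lines are invented, and the \b anchors are my addition to deal with names that are not "well defined" in the document (so that Mary does not match inside Maryland).

```perl
use strict;
use warnings;

# Hypothetical helper: build one compiled alternation from a short name list.
sub build_name_regex {
    my @names = @_;
    # quotemeta protects names that contain regex metacharacters
    my $alternation = join '|', map { quotemeta($_) } @names;
    # \b on both sides: match whole words only (an assumption, drop it if
    # you do want substring matches)
    return qr/\b(?:$alternation)\b/;
}

# Return the lines that mention at least one of the names.
sub matching_lines {
    my ($regex, @lines) = @_;
    return grep { $_ =~ $regex } @lines;
}

# Invented sample data standing in for the real document
my @sample = (
    "David met Mary.\n",
    "Nothing to see here.\n",
    "Maryland is a state.\n",
);

my $regex = build_name_regex(qw(God David Mary Jesus));
print for matching_lines($regex, @sample);
```

With the sample data above, only the first line is printed; the Maryland line is rejected by the word-boundary anchors.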
The opposite case is when your name list is very large (say, 10,000 words or more) and the document is quite small. In this case, it is probably better to first load your name list into a hash, then read the document line by line, split each line into words, and check whether each word exists in the hash. Something like this (untested):
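# %name_hash is assumed to be already loaded, with one key per name
IN: while (<$INPUT>) {
    my @words = split /\b/, $_;
    foreach my $word (@words) {
        # print the line once, then move on to the next line
        print $_ and next IN if exists $name_hash{$word};
    }
}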
With the same small list as above and the same document, execution time is at least 15 times longer (about 1.5 sec). In many cases I would not care: 0.1 sec. versus 1.5 sec. is often an irrelevant difference. But if the name list has a few hundred words or more, or if the document is significantly shorter, this second solution is likely to be the better one.
Quite possibly you don't even care about speed because either approach is fast enough anyway; in that case, choose the easiest algorithm (probably the first one).
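The snippet above does not show how the hash gets populated. Here is a minimal, self-contained sketch of the whole hash approach. The helper names build_name_hash and lines_with_known_names are hypothetical, the sample lines are invented, and I extract words with /\w+/g instead of split /\b/ (a small substitution that skips the whitespace and punctuation tokens).

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of names into a lookup hash.
# In real use the names would be read from the name-list file.
sub build_name_hash {
    my %name_hash = map { $_ => 1 } @_;
    return \%name_hash;
}

# Return the lines containing at least one word that is a key of the hash.
sub lines_with_known_names {
    my ($name_hash, @lines) = @_;
    my @hits;
    LINE: for my $line (@lines) {
        # /\w+/g in list context yields only the word tokens of the line
        for my $word ($line =~ /\w+/g) {
            if (exists $name_hash->{$word}) {
                push @hits, $line;
                next LINE;    # one hit is enough for this line
            }
        }
    }
    return @hits;
}

# Invented sample data standing in for the real document
my @sample = (
    "David met Mary.\n",
    "Nothing to see here.\n",
    "Maryland is a state.\n",
);

my $names = build_name_hash(qw(God David Mary Jesus));
print for lines_with_known_names($names, @sample);
```

Because the lookup compares whole words, Maryland is not reported as a match for Mary. The cost per word is a single hash lookup, regardless of how many names are in the list.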