First, apologies for taking so long to get back to you
on this. (And it took even longer, as my browser decided
to nose dive after I wrote up this response the first
time. Sigh.)
Looking at this code, though, I again have to say that
it is almost a complete non sequitur to me. I think I
understand your overall goals, but your actual program
construction seems to be very haphazard. I can't figure
out your larger issues without grasping your entire
system (which I'm not equipped to do!), but here are a
few comments on your "programming in the small":
open (SIDX, "$data_dir/search.idx");
open (SIDX2, "$data_dir/search2.idx");
my @sidx = <SIDX>;
my @sidx2 = <SIDX2>;
Assuming that these are what you're actually searching
through, you probably don't want to load them into
memory up front like this. Build up your comparison /
scoring function, then apply it to each line of input,
keeping only those that match to a certain level. (This
all depends on what portion of the index you expect to
return; in a typical situation, I would assume that only
a small part of the index is relevant to any given query,
so I would want to have only those values in memory.)
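As a rough sketch of what I mean by filtering as you read (the sub name matching_lines is made up for illustration, and I'm assuming the index holds one record per line):

```perl
use strict;
use warnings;

# Hypothetical helper: return only the lines of a file that match a
# precompiled pattern, reading one line at a time instead of slurping
# the whole file into an array first.
sub matching_lines {
    my ($path, $pattern) = @_;

    open my $fh, '<', $path or die "Cannot open $path: $!";

    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        push @hits, $line if $line =~ $pattern;
    }
    close $fh;

    return @hits;
}
```

You would call it as something like matching_lines("$data_dir/search.idx", $regex), once per index file. Note that it also checks the result of open, which your original code does not.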
(@skeyw) = split(/ /,$fields{'keywords'}); # SPLITTING SIDX FILE
(@premiumskeyw) = split(/ /,$fields{'keywords'}); ## SPLITTING SIDX2 FILE
At the very least, the comments are grossly inaccurate.
Also, you're splitting the same value into two different
arrays. Finally, you are using a pattern which can
result in the null strings that you later have to
filter out; a better split pattern can fix that. The
most trivial change is that simple hash keys don't need
to be in quotes. All together, we can just say:
my @keywords = split ' ', $fields{keywords};
The use of ' ' has special meaning to
split: break it up on one or more whitespace
characters, and ignore leading whitespace as well. This
should guarantee that you have no null keywords in the
array, making your checks unnecessary.
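If you want to see the difference for yourself, here is a small comparison. The input string is invented; I'm only assuming your form data can contain leading or repeated spaces:

```perl
use strict;
use warnings;

my $input = '  perl   regex  search ';   # made-up form value

# A literal single-space pattern produces an empty string for each
# leading space and for each extra space inside a run.
my @naive = split / /, $input;

# The special ' ' pattern skips leading whitespace and treats any run
# of whitespace as a single separator.
my @clean = split ' ', $input;

my $naive_empties = grep { $_ eq '' } @naive;
printf "naive: %d fields (%d empty), clean: %d fields\n",
    scalar @naive, $naive_empties, scalar @clean;
```

The second form leaves you with just the three keywords, so there is nothing left to filter out afterwards.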
(Side note: don't skimp on variable names. Use an editor
that does expansions if you have to, but I personally
find @keywords more evocative and
self-descriptive than @skeyw. Same
comment applies to filehandle names, and pretty much
every other name in your program...)
$nrkeywords = push(@skeyw);
$premiumnrkeywords = push(@premiumskeyw);
Assuming you just want to get the number of elements
in those arrays (which, as pointed out above, are the
same array, so you don't need to do it twice!), you
can simply evaluate the array in scalar context. The
slightly long way of saying that is:
my $n_keywords = scalar @keywords;
I say this is the long way, because the
scalar is unnecessary. You're assigning
something to a scalar, and that puts that thing
into a scalar context. So, we could have written:
my $n_keywords = @keywords;
Either way, you certainly don't need push.
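A quick demonstration of the context rules, using made-up sample data. The third assignment is the one trap to watch for: parentheses on the left side make it a list assignment.

```perl
use strict;
use warnings;

my @keywords = qw(perl regex search);   # made-up sample data

my $explicit = scalar @keywords;   # scalar context, spelled out
my $implicit = @keywords;          # assigning to a scalar does the same
my ($first)  = @keywords;          # list assignment: takes the first
                                   # element, not the count

print "$explicit $implicit $first\n";
```

The first two variables both hold the element count; the third holds the first keyword.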
Finally, now that you know there is only one set of
keywords, you can construct the regex out of them. One
point to consider is whether these keywords are themselves
regexes, or if they are just plain text. If the latter,
we want to make sure we neutralize any characters that
are special to the regex engine. The built-in function
quotemeta is just the ticket. Putting it
all together, we have:
my $regex = join '|', map { quotemeta $_ } @keywords;
$regex = qr/$regex/;
I addressed this in more detail in my original response,
but now you can basically do:
my @std_hits = grep { m/$regex/ } <STD_IDX>;
my @prem_hits = grep { m/$regex/ } <PREM_IDX>;
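To show why quotemeta matters, here is a self-contained version that runs against an in-memory list instead of your index files. The keywords and sample lines are invented, and I'm assuming your keywords are plain text rather than regexes:

```perl
use strict;
use warnings;

# Made-up keywords that contain regex metacharacters.
my @keywords = ('c++', 'a.b');

# Escape each keyword so it matches literally, then join them into
# a single alternation and compile it once.
my $alternation = join '|', map { quotemeta $_ } @keywords;
my $regex       = qr/$alternation/;

# Made-up stand-in for the lines of an index file.
my @lines = (
    'learning c++ today',
    'axb is not a.b',
    'axb only',
    'plain c',
);

my @hits = grep { $_ =~ $regex } @lines;
print "$_\n" for @hits;
```

Without the quotemeta step, the + and . characters would be treated as regex syntax, so 'axb only' would match the a.b keyword even though it does not contain that text.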
Hopefully this has helped you further. I do worry that
there are architectural issues that I am unable to
address, and until you get the chance to rethink what
you are doing (as opposed to just making random changes
and hoping for the best), you're not going to have much
luck.
Regardless, I hope it works out for you.