Re: Re: Separating multiple keyword search input



Re^3: Separating multiple keyword search input
by tkil (Monk) on May 13, 2004 at 07:10 UTC

First, apologies for taking so long to get back to you on this. (And it took even longer, as my browser decided to nose dive after I wrote up this response the first time. Sigh.)

Looking at this code, though, I again have to say that it is almost a complete non sequitur to me. I think I understand your overall goals, but your actual program construction seems to be very haphazard. I can't figure out your larger issues without grasping your entire system (which I'm not equipped to do!), but here are a few comments on your "programming in the small":

open (SIDX, "$data_dir/search.idx");
open (SIDX2, "$data_dir/search2.idx");

my @sidx = <SIDX>;
my @sidx2 = <SIDX2>;

Assuming that these are what you're actually searching through, you probably don't want to load them into memory up front like this. Build up your comparison / scoring function, then apply it to each line of input, keeping only those that match to a certain level. (This all depends on what portion of the index you expect to return; in a typical situation, I would assume that only a small part of the index is relevant to any given query, so I would want to have only those values in memory.)
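
A minimal sketch of that line-at-a-time approach follows. Everything in it is an assumption made for illustration: the `score_line` and `matching_lines` helpers, the "count how many keywords appear" scoring rule, and the in-memory sample standing in for your index file. The point is only that nothing except the matching lines is kept around.

```perl
use strict;
use warnings;

# Hypothetical scoring function: counts how many of the keywords
# appear (as plain, case-sensitive substrings) in one index line.
sub score_line {
    my ($line, @keywords) = @_;
    return scalar grep { index($line, $_) >= 0 } @keywords;
}

# Read a filehandle one line at a time and keep only the lines whose
# score reaches $min_score; non-matching lines are never stored.
sub matching_lines {
    my ($fh, $min_score, @keywords) = @_;
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        push @hits, $line if score_line($line, @keywords) >= $min_score;
    }
    return @hits;
}

# Stand-in for an index file: an in-memory filehandle with made-up records.
my $sample = "perl regex tutorial\ncooking with garlic\nregex cookbook for perl\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @hits = matching_lines($fh, 2, qw(perl regex));
close $fh;

print "$_\n" for @hits;
```

With a real index you would open the file itself instead of `\$sample`; the loop stays the same.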

(@skeyw) = split(/ /,$fields{'keywords'}); # SPLITTING SIDX FILE
(@premiumskeyw) = split(/ /,$fields{'keywords'}); ## SPLITTING SIDX2 FILE

At the very least, the comments are grossly inaccurate. Also, you're splitting the same value into two different arrays. Finally, you are using a pattern which can result in the null strings that you later have to filter out; a better split pattern can fix that. The most trivial change is that simple hash keys don't need to be in quotes. All together, we can just say:

  my @keywords = split ' ', $fields{keywords};

The use of ' ' has special meaning to split: break it up on one or more whitespace characters, and ignore leading whitespace as well. This should guarantee that you have no null keywords in the array, making your checks unnecessary.
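
To see the difference for yourself, here is a small comparison of the two split patterns. The input string is made up; it just has the kind of stray spaces a user might type into a search box.

```perl
use strict;
use warnings;

# Made-up user input with leading, doubled, and trailing spaces.
my $input = "  perl   regex search ";

# A single-space pattern yields an empty field for every extra space.
my @naive = split / /, $input;

# The special ' ' pattern skips leading whitespace and splits on runs of it.
my @clean = split ' ', $input;

printf "naive: %d fields, clean: %d fields\n", scalar @naive, scalar @clean;
print "clean keywords: @clean\n";
```

The first form is the one that forces you to filter out empty strings afterwards; the second hands back only the actual words.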

(Side note: don't skimp on variable names. Use an editor that does expansions if you have to, but I personally find @keywords more evocative and self-descriptive than @skeyw. Same comment applies to filehandle names, and pretty much every other name in your program...)

$nrkeywords = push(@skeyw);
$premiumnrkeywords = push(@premiumskeyw);

Assuming you just want to get the number of elements in those arrays (which, as pointed out above, are the same array, so you don't need to do it twice!), you can simply evaluate the array in scalar context. The slightly long way of saying that is:

  my $n_keywords = scalar @keywords;

I say this is the long way, because the scalar is unnecessary. You're assigning something into a scalar, and that puts that thing into a scalar context. So, we could have written:

  my $n_keywords = @keywords;

Either way, you certainly don't need push.
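
One thing worth watching when you rely on context like this: parentheses on the left-hand side change the meaning. A quick sketch, with a made-up keyword list:

```perl
use strict;
use warnings;

my @keywords = qw(perl regex search);

# Assigning an array to a plain scalar imposes scalar context: you get the count.
my $n_keywords = @keywords;

# Parentheses on the left make it a list assignment: you get the first element.
my ($first) = @keywords;

# A numeric comparison also puts the array in scalar context.
print "no keywords given\n" if @keywords == 0;
print "$n_keywords keywords, first is '$first'\n";
```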

Finally, now that you know there is only one set of keywords, you can construct the regex out of them. One point to consider is whether these keywords are themselves regexes, or if they are just plain text. If the latter, we want to make sure we neutralize any characters that are special to the regex engine. The built-in function quotemeta is just the ticket. Putting it all together, we have:

  my $regex = join '|', map { quotemeta $_ } @keywords;
  $regex = qr/$regex/;
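
To illustrate why the quotemeta step matters, here is the same construction run against a couple of made-up keywords that contain regex metacharacters. The keywords and sample lines are my own inventions, not anything from your index.

```perl
use strict;
use warnings;

# Made-up keywords; "c++" and "a.b" both contain regex metacharacters.
my @keywords = ('c++', 'a.b');

# Escape each keyword so it matches literally, then join as alternatives.
my $pattern = join '|', map { quotemeta $_ } @keywords;
my $regex   = qr/$pattern/;

# Made-up index lines; the middle one must NOT match a literal "a.b".
my @lines = ('learning c++ today', 'axb is not a match', 'file a.b found');
my @hits  = grep { /$regex/ } @lines;

print "$_\n" for @hits;
```

Without the escaping, `a.b` would also match "axb", and `c++` would not even compile as a regex.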

I addressed this in more detail in my original response, but now you can basically do:

  my @std_hits  = grep { m/$regex/ } <STD_IDX>;
  my @prem_hits = grep { m/$regex/ } <PREM_IDX>;

Hopefully this has helped you further. I do worry that there are architectural issues that I am unable to address, and until you get the chance to rethink what you are doing (as opposed to just making random changes and hoping for the best), you're not going to have much luck.

Regardless, I hope it works out for you.
