bvulnerbility has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone!
I'm new here and pretty much new to Perl. Sorry if my question sounds lame, but I really need your help.
I have a text file containing queries, one per row, which I read into an array. What I want to do next is search a whole second text file, a sort of dictionary with one entry per row formatted like this: word1]definition, for occurrences of the words contained in the array.
My code is the following:
    open(HH, "<$PATHDATA/query.txt");
    open(XX, "<$PATHDATA/dictionary.txt");
    open(DD, ">$PATHDATA/query_results.txt");
    my $query;
    my $line="";
    $i=0;
    my @query;
    while(<HH>){
        chomp;
        push @query,$_;
    }
    foreach $query (@query) {
        while($line=<XX>){
            if ($line =~ m/^($query)\].*/i) {
                print DD "MATCH:$line\n";
                if($debug){
                    print GREEN, "QUERY:$query\n";
                    print MAGENTA, "MATCH:$line\n";
                }
            }
        }
    }
I'd like to have something like this as an output:
QUERY:facebook
MATCH:facebook]social network....
The code I posted gets me only the first result (because facebook is actually the first element in my array). How can I get it to print every match for every possible query?
Thanks for your help,
Giu

Replies are listed 'Best First'.
Re: Sort of basic search engine/pattern matching problems
by moritz (Cardinal) on Apr 28, 2010 at 16:46 UTC
    If I understood your problem, the best solution is to read the dictionary file into a hash once, and then read the query file line by line, looking into the dictionary hash each time.
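
A minimal sketch of that idea, assuming each word appears only once in the dictionary and that an exact, case-insensitive match on the word before the "]" is what is wanted (the original regex treats each query as a pattern, which a hash lookup does not). The sample data and the name %entry_for are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for dictionary.txt and query.txt.
my @dictionary_lines = (
    "facebook]social network",
    "perl]programming language",
    "monk]member of a religious order",
);
my @queries = ( "Facebook", "perl", "python" );

# Build the lookup table once: lower-cased word => full dictionary line.
my %entry_for;
for my $line (@dictionary_lines) {
    my ($word) = split /\]/, $line, 2;
    $entry_for{ lc $word } = $line;
}

# Each query is now one hash lookup instead of a scan of the whole file.
my @results;
for my $query (@queries) {
    my $match = $entry_for{ lc $query };
    next unless defined $match;    # no dictionary entry for this query
    push @results, "QUERY:$query\nMATCH:$match\n";
}

print @results;
```

If a word can have several entries, the hash values would need to be array references rather than single lines.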

    There are plenty of examples for doing stuff like that on this site; I hope Super Search finds some of them.

    Perl 6 - links to (nearly) everything that is Perl 6.
Re: Sort of basic search engine/pattern matching problems
by toolic (Bishop) on Apr 28, 2010 at 16:47 UTC
    You should move your foreach loop inside your while loop. The problem is that your first query loops through all the lines of the dictionary file, which leaves the filehandle at end of file. Your second query therefore does not loop through any lines of the file. Something like:
        while ($line = <XX>) {
            for my $query (@query) {
            }
        }
      Thanks everyone for your help. I'm really glad to have found you.
      I don't have the code to test with me now, but I'll let you know if I make it work tomorrow morning.
      Thanks again..
      Giu
Re: Sort of basic search engine/pattern matching problems
by AR (Friar) on Apr 28, 2010 at 16:41 UTC

    Hello Giu,

    I believe the answer to your question is that the <XX> calls aren't going back to the beginning of the file when you start a new query. My recommendation is that you should read a line and then run every query against it.
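
To illustrate the point about the filehandle not going back to the beginning, here is a small sketch using an in-memory filehandle with made-up data in place of the dictionary file. The helper count_remaining_lines is hypothetical. It shows that a second read loop sees nothing until the handle is rewound with seek:

```perl
use strict;
use warnings;

# In-memory filehandle standing in for dictionary.txt (hypothetical data).
my $data = "facebook]social network\nperl]programming language\n";
open( my $fh, "<", \$data ) or die "Cannot open in-memory file: $!\n";

# Count the lines that can still be read from the handle's current position.
sub count_remaining_lines {
    my ($handle) = @_;
    my $count = 0;
    while ( defined( my $line = <$handle> ) ) {
        $count++;
    }
    return $count;
}

my $first_pass  = count_remaining_lines($fh);    # reads to end of file
my $second_pass = count_remaining_lines($fh);    # handle is still at EOF

seek( $fh, 0, 0 ) or die "Cannot rewind: $!\n";  # back to the start
my $third_pass = count_remaining_lines($fh);     # lines are readable again

print "first=$first_pass second=$second_pass third=$third_pass\n";
```

Rewinding for every query would work, but it rereads the whole file once per query, which is why reading each line once and testing all queries against it is the better structure.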

    I also heavily recommend that you use warnings; and use strict; in everything you write. As you said, you're just starting out, and it will help you down the road.

    My other suggestion is that you use my when you need it and not at the top of your program. In your script, that would be:

    foreach my $query (@query) { while ( my $line = <XX> ) {

    This prevents lingering queries and lines after you have exited your loops which might cause unexpected things to happen if you add to this code.

Re: Sort of basic search engine/pattern matching problems
by johngg (Canon) on Apr 28, 2010 at 22:50 UTC

    Other Monks have given you good advice regarding the structure of your code. I would just like to point out a more succinct way of reading your queries into an array. Reading a filehandle in list context with an array on the LHS will read the whole file, one line per array element. If passed an array of lines, chomp will very kindly operate on each and every line, so your

    my @query; while(<HH>){ chomp; push @query,$_; }

    could be written as

    my @query = <HH>; chomp @query;

    and it could even be shortened to a single line

    chomp( my @query = <HH> );

    I hope this is helpful.

    Cheers,

    JohnGG

      Hi everyone!
      I tried what you all suggested and now my code looks like this:
      open(HH, "<$PATHDATA/query04.txt");
      chomp( my @query_arr = <HH> );
      close HH;
      open(XX, "<$PATHDATA/multisearch_final_sorted_3.txt");
      open(DD, ">$PATHDATA/query_results.txt");
      my $count=0;
      while ( my $line = <XX> ) {
          if ($debug) {
              print MAGENTA, "LINEA:$line\n";
          }
          for my $query_el (@query_arr) {
              if ($debug) {
                  print GREEN, "QUERY:$query_el\n";
              }
              if ($line =~ /^$query_el\]\[.*/i) {
                  if ($debug) {
                      print GREEN, "QUERY:$query_el\n";
                      print MAGENTA, "MATCH:$line\n";
                  }
                  print DD "QUERY:$query_el\n";
                  print DD "MATCH:$line\n";
                  print DD "#################";
                  $count ++;
              }
          }
      }
      print DD "Entrate trovate: $count";
      close XX;
      close DD;
      exit 0;
      Unfortunately, due to the large amount of data I have, I cannot understand if it's working, since it's not printing anything and it seems to take forever to finish.
      Is there a way I can make it a little faster?
      Every suggestion is really appreciated.
      Thanks again,
      Giu
        You said:

        Unfortunately, due to the large amount of data I have, I cannot understand if it's working...

        How many lines in "query04.txt"? How many lines in "multisearch_final_sorted_3.txt"? If you create a test version of each file, containing just a few lines that should produce some output, does the script work correctly on those test files? (Hint: Allowing file names to be provided as command line args can help with testing.)

        One way to try speeding things up is to create a single regex from your query file, by joining the lines with "|":

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $PATHDATA = "."; # (you didn't say how this was being set)
        my ( $query_list_file, $file_to_search ) = ( @ARGV == 2 ) ? @ARGV
            : ( "$PATHDATA/query04.txt", "$PATHDATA/multisearch_final_sorted_3.txt" );

        open( HH, "<", "$PATHDATA/$query_list_file") or die "$PATHDATA/$query_list_file: $!\n";
        chomp( my @query_arr = <HH> );
        close HH;

        my $query_regex = join( '|', @query_arr );

        open(XX, "<", "$PATHDATA/$file_to_search") or die "$PATHDATA/$file_to_search: $!\n";
        open(DD, ">", "$PATHDATA/query_results.txt") or die "$PATHDATA/query_results.txt: $!\n";

        my $count=0;
        while ( <XX> ) {
            if ( /^($query_regex)\]/ ) {
                print DD "############\nQUERY: $1\nMATCH: $_\n";
                $count++;
            }
        }
        print DD "Entrate trovate: $count\n";
        (In addition to allowing for other input files and using a single regex to check all matches, I also left out the "debug" stuff, rearranged the output format a little, and changed the "open" statements to use the 3-arg style.) UPDATED to add "or die ..." on each of the "open" statements -- that should be a habit.

        If you still have a problem when using some small test files, post a complete and runnable script (like the one shown here) with the test data.

        I'm not a master of regex's (yet?) but I think I see the problem...
        if ($line =~ /^$query_el\]\[.*/i) {
        Remember that $ has special meaning inside a regex (end of line). I *think* that you need to use the qr operator before the regex to evaluate the contents of $query_el before you perform the regex. Someone please correct me if I'm wrong here.
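
For what it's worth, a quick check with made-up sample strings (not the original data) suggests the variable is not the problem. Perl interpolates $query_el into the pattern before matching, so qr is optional here. What can go wrong is a query that contains regex metacharacters, and \Q...\E (or quotemeta) is the usual way to guard against that:

```perl
use strict;
use warnings;

# Hypothetical line and query, in the "word][..." shape the regex expects.
my $line     = "facebook][social network";
my $query_el = "facebook";

# $query_el is interpolated into the pattern before the match is attempted.
my $plain_match = ( $line =~ /^$query_el\]\[.*/i ) ? 1 : 0;

# qr// precompiles an equivalent pattern; it is a convenience, not a requirement.
my $compiled       = qr/^$query_el\]\[/i;
my $compiled_match = ( $line =~ $compiled ) ? 1 : 0;

# A query with metacharacters behaves as a pattern unless it is quoted.
my $dotted_query   = "a.b";
my $other_line     = "axb][not the same word";
my $unquoted_match = ( $other_line =~ /^$dotted_query\]\[/ )     ? 1 : 0;
my $quoted_match   = ( $other_line =~ /^\Q$dotted_query\E\]\[/ ) ? 1 : 0;

print "plain=$plain_match compiled=$compiled_match ",
      "unquoted=$unquoted_match quoted=$quoted_match\n";
```

Here the unquoted "a.b" matches "axb" because "." matches any character, while the \Q...\E version only matches a literal "a.b".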