Sort of basic search engine/pattern matching problems

bvulnerbility has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Sort of basic search engine/pattern matching problems by moritz (Cardinal) on Apr 28, 2010 at 16:46 UTC
If I understood your problem, the best solution is to read the dictionary file into a hash once, and then read the query file line by line, looking into the dictionary hash each time. There are plenty of examples for doing stuff like that on this site; I hope Super Search finds some of them.
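A minimal sketch of the hash approach described above. It assumes each dictionary line begins with a key followed by `][` (a guess based on the pattern the original poster matches later in the thread) and uses in-memory sample data in place of the real files; all names and data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the dictionary and query files.
my @dictionary_lines = (
    "apple][a fruit\n",
    "perl][a programming language\n",
    "apple][a computer company\n",
);
my @queries = ( 'perl', 'banana', 'apple' );

# Build the lookup table once: key => list of full dictionary lines.
my %entries_for;
for my $line (@dictionary_lines) {
    chomp( my $entry = $line );
    # Assumed format: the key is everything before the first "][".
    my ($key) = split /\]\[/, $entry, 2;
    push @{ $entries_for{ lc $key } }, $entry;
}

# Each query is now a single hash lookup instead of a scan of the whole file.
my @results;
for my $query (@queries) {
    my $matches = $entries_for{ lc $query } or next;
    push @results, "QUERY:$query MATCH:$_" for @$matches;
}

print "$_\n" for @results;
```

The trade-off is memory: the whole dictionary lives in the hash, which is usually acceptable unless the file is far larger than available RAM.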
Re: Sort of basic search engine/pattern matching problems by toolic (Bishop) on Apr 28, 2010 at 16:47 UTC
You should move your foreach loop inside your while loop. The problem is that your first query loops through all the lines of the dictionary file. Your second query does not loop through any lines of the file. Something like: `while ($line = <XX>) { for my $query (@query) { } }`
Re^2: Sort of basic search engine/pattern matching problems by bvulnerbility (Novice) on Apr 28, 2010 at 19:44 UTC
Thanks everyone for your help. I'm really glad to have found you. I don't have the code to test with me now, but I'll let you know tomorrow morning if I make it work. Thanks again, Giu
Re: Sort of basic search engine/pattern matching problems by AR (Friar) on Apr 28, 2010 at 16:41 UTC
Hello Giu, I believe the answer to your question is that the `<XX>` calls aren't going back to the beginning of the file when you start a new query. My recommendation is that you should read a line and then run every query against it. I also heavily recommend that you `use warnings;` and `use strict;` in everything you write. As you said, you're just starting out, and it will help you down the road. My other suggestion is that you use `my` where you need it and not at the top of your program. In your script, that would be: `foreach my $query (@query) { while ( my $line = <XX> ) {` This prevents queries and lines from lingering after you have exited your loops, which might cause unexpected things to happen if you add to this code.
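To illustrate why the original nesting only works for the first query, here is a small sketch using an in-memory filehandle (the sample data and variable names are made up). Once the inner loop has read to end-of-file, later passes see nothing unless the handle is rewound with `seek`; reading each line once and testing every query against it avoids the problem altogether.

```perl
use strict;
use warnings;

my $data    = "alpha\nbeta\ngamma\n";    # stand-in for the dictionary file
my @queries = ( 'alpha', 'gamma' );

# Query loop outside, file loop inside: the handle is exhausted after the first query.
open( my $fh, '<', \$data ) or die "cannot open in-memory file: $!";
my @lines_seen;
for my $query (@queries) {
    my $seen = 0;
    while ( defined( my $line = <$fh> ) ) {
        $seen++;
    }
    push @lines_seen, $seen;    # number of lines this query got to look at
}
close $fh;

# File loop outside, query loop inside: one pass, every query tested per line.
open( $fh, '<', \$data ) or die "cannot open in-memory file: $!";
my @found;
while ( my $line = <$fh> ) {
    chomp $line;
    for my $query (@queries) {
        push @found, $query if $line eq $query;
    }
}
close $fh;

print "lines seen per query: @lines_seen\n";
print "found: @found\n";
```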
Re: Sort of basic search engine/pattern matching problems by johngg (Canon) on Apr 28, 2010 at 22:50 UTC
Other Monks have given you good advice regarding the structure of your code. I would just like to point out a more succinct way of reading your queries into an array. Reading a filehandle in list context with an array on the LHS will read the whole file, one line per array element, and if passed an array of lines, chomp will very kindly operate on each and every line. So your `my @query; while(<HH>){ chomp; push @query,$_; }` could be written as `my @query = <HH>; chomp @query;` and it could even be shortened to a single line: `chomp( my @query = <HH> );` I hope this is helpful. Cheers, JohnGG
Re^2: Sort of basic search engine/pattern matching problems by bvulnerbility (Novice) on Apr 29, 2010 at 09:35 UTC
Hi everyone! I tried what you all suggested and now my code looks like this:

    open(HH, "<$PATHDATA/query04.txt");
    chomp( my @query_arr = <HH> );
    close HH;

    open(XX, "<$PATHDATA/multisearch_final_sorted_3.txt");
    open(DD, ">$PATHDATA/query_results.txt");

    my $count=0;
    while ( my $line = <XX> ) {
        if ($debug) { print MAGENTA, "LINEA:$line\n"; }
        for my $query_el (@query_arr) {
            if ($debug) { print GREEN, "QUERY:$query_el\n"; }
            if ($line =~ /^$query_el\]\[.*/i) {
                if ($debug) {
                    print GREEN, "QUERY:$query_el\n";
                    print MAGENTA, "MATCH:$line\n";
                }
                print DD "QUERY:$query_el\n";
                print DD "MATCH:$line\n";
                print DD "#################";
                $count ++;
            }
        }
    }
    print DD "Entrate trovate: $count";
    close XX;
    close DD;
    exit 0;

Unfortunately, due to the large amount of data I have, I cannot tell whether it's working, since it's not printing anything and it seems to take forever to finish. Is there a way I can make it a little faster? Every suggestion is really appreciated. Thanks again, Giu
Re^3: Sort of basic search engine/pattern matching problems by graff (Chancellor) on Apr 30, 2010 at 09:42 UTC
You said: "Unfortunately, due to the large amount of data I have, I cannot understand if it's working..."

How many lines in "query04.txt"? How many lines in "multisearch_final_sorted_3.txt"? If you create a test version of each file, containing just a few lines that should produce some output, does the script work correctly on those test files? (Hint: Allowing file names to be provided as command line args can help with testing.)

One way to try speeding things up is to create a single regex from your query file, by joining the lines with "|":

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $PATHDATA = ".";  # (you didn't say how this was being set)

    # File names are taken relative to $PATHDATA; two command line args override the defaults.
    my ( $query_list_file, $file_to_search ) = ( @ARGV == 2 )
        ? @ARGV
        : ( "query04.txt", "multisearch_final_sorted_3.txt" );

    open( HH, "<", "$PATHDATA/$query_list_file" )
        or die "$PATHDATA/$query_list_file: $!\n";
    chomp( my @query_arr = <HH> );
    close HH;

    # An unescaped "|" is regex alternation, so any one of the queries can match.
    my $query_regex = join( '|', @query_arr );

    open( XX, "<", "$PATHDATA/$file_to_search" )
        or die "$PATHDATA/$file_to_search: $!\n";
    open( DD, ">", "$PATHDATA/query_results.txt" )
        or die "$PATHDATA/query_results.txt: $!\n";

    my $count = 0;
    while ( <XX> ) {
        if ( /^($query_regex)\]/ ) {
            print DD "############\nQUERY: $1\nMATCH: $_\n";
            $count++;
        }
    }
    print DD "Entrate trovate: $count\n";

(In addition to allowing for other input files and using a single regex to check all matches, I also left out the "debug" stuff, rearranged the output format a little, and changed the "open" statements to use the 3-arg style.)

UPDATED to add "or die ..." on each of the "open" statements -- that should be a habit.

If you still have a problem when using some small test files, post a complete and runnable script (like the one shown here) with the test data.
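A small sketch of the single-regex idea, assuming the queries are literal strings rather than patterns. It escapes each query with `quotemeta` and precompiles the alternation with `qr//`, so the pattern is built only once; the sample data and names are hypothetical.

```perl
use strict;
use warnings;

my @queries = ( 'foo', 'a.b', 'bar' );    # hypothetical query strings
my @lines   = ( "foo][first\n", "axb][second\n", "a.b][third\n", "baz][fourth\n" );

# Escape metacharacters so "a.b" matches only a literal dot, then join with "|".
my $alternation = join '|', map { quotemeta($_) } @queries;
my $query_regex = qr/^($alternation)\]\[/i;

my @matched;
for my $line (@lines) {
    # $1 holds whichever query matched at the start of the line.
    push @matched, $1 if $line =~ $query_regex;
}

print "matched: @matched\n";
```

Without the `quotemeta` step, the query `a.b` would also match the line starting with `axb`, because an unescaped dot matches any character.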
Re^3: Sort of basic search engine/pattern matching problems by sierpinski (Chaplain) on Apr 30, 2010 at 03:24 UTC
I'm not a master of regexes (yet?) but I think I see the problem... `if ($line =~ /^$query_el\]\[./i) {` Remember that $ has special meaning inside a regex (end of line). I think that you need to use the qr operator before the regex to evaluate the contents of $query_el before you perform the regex. Someone please correct me if I'm wrong here.
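As a quick check on the question raised above, the following sketch (with made-up sample values) suggests that a scalar such as `$query_el` is interpolated into the pattern before matching, so `qr` is not required for that; `$` only acts as an end-of-line anchor where it cannot be read as the start of a variable name. What can still go wrong is regex metacharacters inside the query text, which `\Q...\E` neutralizes.

```perl
use strict;
use warnings;

my $line     = "abc][some entry\n";    # hypothetical dictionary line
my $query_el = 'a.c';                  # hypothetical query containing a metacharacter

# Interpolated as-is: the pattern becomes /^a.c\]\[/ and "." matches any character.
my $unquoted = ( $line =~ /^$query_el\]\[/i ) ? 1 : 0;

# With \Q...\E the dot is escaped, so only a literal "a.c" would match.
my $quoted = ( $line =~ /^\Q$query_el\E\]\[/i ) ? 1 : 0;

# A precompiled qr// object behaves the same as the inline pattern.
my $compiled     = qr/^$query_el\]\[/i;
my $via_compiled = ( $line =~ $compiled ) ? 1 : 0;

print "unquoted=$unquoted quoted=$quoted via_compiled=$via_compiled\n";
```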
Re^4: Sort of basic search engine/pattern matching problems by graff (Chancellor) on Apr 30, 2010 at 09:18 UTC