in reply to Re: Sort of basic search engine/pattern matching problems
in thread Sort of basic search engine/pattern matching problems

Hi everyone!
I tried what you all suggested and now my code looks like this:
open(HH, "<$PATHDATA/query04.txt");
chomp( my @query_arr = <HH> );
close HH;

open(XX, "<$PATHDATA/multisearch_final_sorted_3.txt");
open(DD, ">$PATHDATA/query_results.txt");
my $count=0;
while ( my $line = <XX> ) {
    if ($debug) {
        print MAGENTA, "LINEA:$line\n";
    }
    for my $query_el (@query_arr) {
        if ($debug) {
            print GREEN, "QUERY:$query_el\n";
        }
        if ($line =~ /^$query_el\]\[.*/i) {
            if ($debug) {
                print GREEN, "QUERY:$query_el\n";
                print MAGENTA, "MATCH:$line\n";
            }
            print DD "QUERY:$query_el\n";
            print DD "MATCH:$line\n";
            print DD "#################";
            $count ++;
        }
    }
}
print DD "Entrate trovate: $count";
close XX;
close DD;
exit 0;
Unfortunately, because of the large amount of data I have, I can't tell whether it's working: it isn't printing anything and it seems to take forever to finish.
Is there a way I can make it a little faster?
Any suggestion is really appreciated.
Thanks again,
Giu

Re^3: Sort of basic search engine/pattern matching problems
by graff (Chancellor) on Apr 30, 2010 at 09:42 UTC
    You said:

    Unfortunately, due to the large amount of data I have, I cannot understand if it's working...

    How many lines in "query04.txt"? How many lines in "multisearch_final_sorted_3.txt"? If you create a test version of each file, containing just a few lines that should produce some output, does the script work correctly on those test files? (Hint: Allowing file names to be provided as command line args can help with testing.)
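
One way to set up such a test, as a minimal sketch: the script below writes two tiny files into a temporary directory and runs the same nested-loop match on them. The file names, the sample terms, and the "key][rest" line layout are assumptions inferred from the regex in the question, since the real data was not posted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical sample data: the real file formats were not posted, so the
# "key][rest" layout below is only inferred from the regex in the question.
my $dir = tempdir( CLEANUP => 1 );

# Write each list element to $path as one line.
sub write_lines {
    my ( $path, @lines ) = @_;
    open( my $fh, ">", $path ) or die "$path: $!\n";
    print $fh "$_\n" for @lines;
    close $fh or die "$path: $!\n";
}

write_lines( "$dir/query_test.txt", "apple", "cherry" );
write_lines( "$dir/data_test.txt",
    "apple][red fruit", "banana][yellow fruit", "Cherry][small fruit" );

# Same nested-loop matching as in the question, run on the tiny files.
open( my $qh, "<", "$dir/query_test.txt" ) or die "query_test.txt: $!\n";
chomp( my @query_arr = <$qh> );
close $qh;

my @matches;
open( my $dh, "<", "$dir/data_test.txt" ) or die "data_test.txt: $!\n";
while ( my $line = <$dh> ) {
    chomp $line;
    for my $query_el (@query_arr) {
        push @matches, "$query_el => $line" if $line =~ /^$query_el\]\[/i;
    }
}
close $dh;

print "$_\n" for @matches;
print "Matches found: ", scalar(@matches), "\n";
```

With this sample data, two of the three lines should match ("apple" and, because of the /i flag, "Cherry"). If the count differs, the problem is in the matching logic rather than in the size of the real files.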

    One way to try speeding things up is to create a single regex from your query file, by joining the lines with "|":

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $PATHDATA = ".";  # (you didn't say how this was being set)

    my ( $query_list_file, $file_to_search ) = ( @ARGV == 2 )
        ? @ARGV
        : ( "$PATHDATA/query04.txt", "$PATHDATA/multisearch_final_sorted_3.txt" );

    # The defaults above already include $PATHDATA, so open the names as given.
    open( HH, "<", $query_list_file ) or die "$query_list_file: $!\n";
    chomp( my @query_arr = <HH> );
    close HH;

    # One alternation of all query lines, checked once per input line.
    my $query_regex = join( '|', @query_arr );

    open( XX, "<", $file_to_search ) or die "$file_to_search: $!\n";
    open( DD, ">", "$PATHDATA/query_results.txt" ) or die "$PATHDATA/query_results.txt: $!\n";

    my $count = 0;
    while ( <XX> ) {
        if ( /^($query_regex)\]/ ) {
            print DD "############\nQUERY: $1\nMATCH: $_\n";
            $count++;
        }
    }
    print DD "Entrate trovate: $count\n";
    (In addition to allowing for other input files and using a single regex to check all matches, I also left out the "debug" stuff, rearranged the output format a little, and changed the "open" statements to use the 3-arg style.) UPDATED to add "or die ..." on each of the "open" statements -- that should be a habit.
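
A possible refinement of the single-regex idea, offered as a sketch rather than as part of the reply above: if the query lines can contain regex metacharacters such as "." or "+", escaping each one with quotemeta keeps the match literal, and qr// compiles the combined pattern once. The query terms and data lines below are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical query terms; in the real script these come from the query file.
my @query_arr = ( "a.b", "c+d", "plain" );

# quotemeta escapes regex metacharacters so each term matches literally.
my $alternation = join( '|', map { quotemeta($_) } @query_arr );

# qr// compiles the pattern once, outside the loop over the data lines.
my $query_regex = qr/^($alternation)\]/;

# Hypothetical data lines in the "key][rest" layout.
my @lines = (
    "a.b][first", "axb][second", "c+d][third", "ccd][fourth", "plain][fifth",
);

my @hits;
for my $line (@lines) {
    push @hits, $1 if $line =~ $query_regex;
}
print "Matched keys: @hits\n";
```

Without quotemeta, "a.b" would also match "axb" and "c+d" would also match "ccd". With it, only the three literal keys are reported.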

    If you still have a problem when using some small test files, post a complete and runnable script (like the one shown here) with the test data.

Re^3: Sort of basic search engine/pattern matching problems
by sierpinski (Chaplain) on Apr 30, 2010 at 03:24 UTC
    I'm not a master of regexes (yet?) but I think I see the problem...
    if ($line =~ /^$query_el\]\[.*/i) {
    Remember that $ has special meaning inside a regex (end of line). I *think* that you need to use the qr operator before the regex to evaluate the contents of $query_el before you perform the regex. Someone please correct me if I'm wrong here.
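      When you're not sure, you should try it yourself and find out, rather than posting a guess. As it happens, you are wrong in this case: perl interpolates the variable $query_el into the regex before matching. The $ means "end of line" only where no variable name follows it, for example at the very end of the pattern.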
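
A quick way to check this for yourself, as a minimal sketch with a hypothetical query term and data line:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $query_el = "apple";             # hypothetical query term
my $line     = "apple][red fruit";  # hypothetical data line

# $query_el is interpolated first, so the pattern is /^apple\]\[/i.
my $interpolated = ( $line =~ /^$query_el\]\[/i ) ? 1 : 0;

# The same pattern precompiled with qr// behaves identically.
my $re          = qr/^$query_el\]\[/i;
my $precompiled = ( $line =~ $re ) ? 1 : 0;

# A "$" only means "end of line" where no variable name follows it.
my $anchored = ( "apple" =~ /^$query_el$/ ) ? 1 : 0;

print "interpolated=$interpolated precompiled=$precompiled anchored=$anchored\n";
```

All three flags should come out as 1: the plain match and the qr// version agree, so qr// is not needed for correctness here, only (optionally) to avoid recompiling the pattern.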