in reply to Sort of basic search engine/pattern matching problems

Other Monks have given you good advice regarding the structure of your code. I would just like to point out a more succinct way of reading your queries into an array. Reading a filehandle in list context, with an array on the LHS, reads the whole file, one line per array element. If passed an array of lines, chomp will very kindly operate on each and every line, so your

my @query;
while (<HH>) {
    chomp;
    push @query, $_;
}

could be written as

my @query = <HH>;
chomp @query;

and it could even be shortened to a single line

chomp( my @query = <HH> );
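
As a quick check that the loop and the one-liner really produce the same array, here is a minimal sketch. It reads from an in-memory filehandle rather than HH, and the sample data is made up purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the query file.
my $data = "alpha\nbeta\ngamma\n";

# Long form: read line by line, chomp each line, push it onto the array.
open( my $fh, '<', \$data ) or die "in-memory open failed: $!";
my @loop;
while (<$fh>) {
    chomp;
    push @loop, $_;
}
close $fh;

# Short form: read everything in list context, then chomp the whole array.
open( $fh, '<', \$data ) or die "in-memory open failed: $!";
chomp( my @slurp = <$fh> );
close $fh;

print "same\n" if "@loop" eq "@slurp";
```

Both arrays end up holding the three lines without their trailing newlines.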

I hope this is helpful.

Cheers,

JohnGG

Re^2: Sort of basic search engine/pattern matching problems
by bvulnerbility (Novice) on Apr 29, 2010 at 09:35 UTC
    Hi everyone!
    I tried what you all suggested and now my code looks like this:
    open(HH, "<$PATHDATA/query04.txt");
    chomp( my @query_arr = <HH> );
    close HH;
    open(XX, "<$PATHDATA/multisearch_final_sorted_3.txt");
    open(DD, ">$PATHDATA/query_results.txt");
    my $count=0;
    while ( my $line = <XX> ) {
        if ($debug) { print MAGENTA, "LINEA:$line\n"; }
        for my $query_el (@query_arr) {
            if ($debug) { print GREEN, "QUERY:$query_el\n"; }
            if ($line =~ /^$query_el\]\[.*/i) {
                if ($debug) {
                    print GREEN, "QUERY:$query_el\n";
                    print MAGENTA, "MATCH:$line\n";
                }
                print DD "QUERY:$query_el\n";
                print DD "MATCH:$line\n";
                print DD "#################";
                $count ++;
            }
        }
    }
    print DD "Entrate trovate: $count";
    close XX;
    close DD;
    exit 0;
    Unfortunately, due to the large amount of data I have, I cannot understand if it's working, since it's not printing anything and it seems to take forever to finish.
    Is there a way I can make it a little faster?
    Every suggestion is really appreciated.
    Thanks again,
    Giu
      You said:

      Unfortunately, due to the large amount of data I have, I cannot understand if it's working...

      How many lines in "query04.txt"? How many lines in "multisearch_final_sorted_3.txt"? If you create a test version of each file, containing just a few lines that should produce some output, does the script work correctly on those test files? (Hint: Allowing file names to be provided as command line args can help with testing.)

      One way to try speeding things up is to create a single regex from your query file, by joining the lines with "|":

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $PATHDATA = ".";  # (you didn't say how this was being set)

      # File names given on the command line are used as-is;
      # otherwise fall back to the defaults under $PATHDATA.
      my ( $query_list_file, $file_to_search ) = ( @ARGV == 2 )
          ? @ARGV
          : ( "$PATHDATA/query04.txt", "$PATHDATA/multisearch_final_sorted_3.txt" );

      open( HH, "<", $query_list_file ) or die "$query_list_file: $!\n";
      chomp( my @query_arr = <HH> );
      close HH;

      # One alternation covering every line of the query file.
      my $query_regex = join( '|', @query_arr );

      open( XX, "<", $file_to_search ) or die "$file_to_search: $!\n";
      open( DD, ">", "$PATHDATA/query_results.txt" ) or die "$PATHDATA/query_results.txt: $!\n";

      my $count = 0;
      while ( <XX> ) {
          if ( /^($query_regex)\]/ ) {
              print DD "############\nQUERY: $1\nMATCH: $_\n";
              $count++;
          }
      }
      print DD "Entrate trovate: $count\n";
      (In addition to allowing for other input files and using a single regex to check all matches, I also left out the "debug" stuff, rearranged the output format a little, and changed the "open" statements to use the 3-arg style.) UPDATED to add "or die ..." on each of the "open" statements -- that should be a habit.
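
One caveat with joining raw query lines into an alternation: if a query contains regex metacharacters such as "." or "+", it will be treated as a pattern rather than as literal text. Whether that matters depends on what is in the query file, which the thread does not show. The sketch below assumes literal matching is wanted, escapes each query with quotemeta, and compiles the alternation once with qr//. The query strings and data lines are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical query strings; some contain regex metacharacters.
my @query_arr = ( 'foo', 'a.b', 'x+y' );

# quotemeta escapes metacharacters so each query matches literally;
# qr// compiles the combined pattern once, up front.
my $alternation = join '|', map { quotemeta($_) } @query_arr;
my $query_regex = qr/^($alternation)\]/;

# Hypothetical data lines in the "key][rest" shape used in the thread.
my @lines = ( 'foo][one', 'aXb][two', 'a.b][three', 'x+y][four', 'nope][five' );

my @matched;
for my $line (@lines) {
    push @matched, $1 if $line =~ $query_regex;
}
print "@matched\n";
```

Here "aXb][two" is not matched, because the "." in the query "a.b" has been escaped and no longer means "any character".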

      If you still have a problem when using some small test files, post a complete and runnable script (like the one shown here) with the test data.

      I'm not a master of regexes (yet?) but I think I see the problem...
      if ($line =~ /^$query_el\]\[.*/i) {
      Remember that $ has special meaning inside a regex (end of line). I *think* that you need to use the qr operator before the regex to evaluate the contents of $query_el before you perform the regex. Someone please correct me if I'm wrong here.
        When you're not sure, you should try it yourself and find out, rather than posting a guess. As it happens, you are wrong in this case: Perl interpolates the variable $query_el into the regex before matching, so qr is not needed here.
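
A small test along these lines settles the interpolation question. The values of $query_el and $line are made up for the example; the pattern is the one from the original script.

```perl
use strict;
use warnings;

# Hypothetical values for illustration.
my $query_el = 'apple';
my $line     = 'Apple][some data';

# $query_el is interpolated before the pattern is compiled, so this
# tests the line against /^apple\]\[.*/i.
my $matched = ( $line =~ /^$query_el\]\[.*/i ) ? 1 : 0;
print "matched: $matched\n";
```

A "$" is only treated as an end-of-line anchor where it cannot be read as the start of a variable name, for example at the very end of the pattern.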