Re^2: Sort of basic search engine/pattern matching problems

Hi everyone!
I tried what you all suggested and now my code looks like this:

open(HH, "<$PATHDATA/query04.txt");
chomp( my @query_arr = <HH> );
close HH;

open(XX, "<$PATHDATA/multisearch_final_sorted_3.txt");
open(DD, ">$PATHDATA/query_results.txt");
my $count=0;
while ( my $line = <XX> ) {
    if ($debug) {
        print MAGENTA, "LINEA:$line\n";
    }
    for my $query_el (@query_arr) {
        if ($debug) {
            print GREEN, "QUERY:$query_el\n";
        }
        if ($line =~ /^$query_el\]\[.*/i) {
            if ($debug) {
                print GREEN, "QUERY:$query_el\n";
                print MAGENTA, "MATCH:$line\n";
            }
            print DD "QUERY:$query_el\n";
            print DD "MATCH:$line\n";
            print DD "#################";
            $count ++;
        }
    }
}
print DD "Entrate trovate: $count";

close XX;
close DD;

exit 0;

Unfortunately, due to the large amount of data I have, I cannot understand if it's working, since it's not printing anything and it seems to take forever to finish.
Is there a way I can make it a little faster?
Every suggestion is really appreciated.
Thanks again,
Giu

Re^3: Sort of basic search engine/pattern matching problems by graff (Chancellor) on Apr 30, 2010 at 09:42 UTC
You said: "Unfortunately, due to the large amount of data I have, I cannot understand if it's working..."

How many lines in "query04.txt"? How many lines in "multisearch_final_sorted_3.txt"? If you create a test version of each file, containing just a few lines that should produce some output, does the script work correctly on those test files? (Hint: Allowing file names to be provided as command line args can help with testing.)

One way to try speeding things up is to create a single regex from your query file, by joining the lines with "|":

#!/usr/bin/perl
use strict;
use warnings;

my $PATHDATA = ".";  # (you didn't say how this was being set)
my ( $query_list_file, $file_to_search ) = ( @ARGV == 2 ) ? @ARGV :
    ( "$PATHDATA/query04.txt", "$PATHDATA/multisearch_final_sorted_3.txt" );

# The file names already carry any directory prefix, so open them as given.
open( HH, "<", $query_list_file ) or die "$query_list_file: $!\n";
chomp( my @query_arr = <HH> );
close HH;
# Plain "|" is regex alternation; an escaped "\|" would only match a literal pipe.
my $query_regex = join( '|', @query_arr );

open( XX, "<", $file_to_search ) or die "$file_to_search: $!\n";
open( DD, ">", "$PATHDATA/query_results.txt" ) or die "$PATHDATA/query_results.txt: $!\n";
my $count = 0;
while ( <XX> ) {
    if ( /^($query_regex)\]/ ) {
        print DD "############\nQUERY: $1\nMATCH: $_\n";
        $count++;
    }
}
print DD "Entrate trovate: $count\n";

(In addition to allowing for other input files and using a single regex to check all matches, I also left out the "debug" stuff, rearranged the output format a little, and changed the "open" statements to use the 3-arg style.)

UPDATED to add "or die ..." on each of the "open" statements -- that should be a habit.

If you still have a problem when using some small test files, post a complete and runnable script (like the one shown here) with the test data.
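The single-regex idea above can be tried on in-memory data before pointing it at the real files. This is only a minimal sketch, assuming the query lines are literal strings (hence `quotemeta`) and that each data line starts with the query followed by `][`, as in the original pattern. The sample data and the helper name `build_query_regex` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: combine literal query strings into one compiled regex.
sub build_query_regex {
    my @queries = @_;
    # quotemeta escapes regex metacharacters so each query matches literally;
    # drop it if the query file is meant to contain regex patterns.
    my $alternation = join '|', map { quotemeta } @queries;
    return qr/^($alternation)\]\[/i;
}

# Made-up sample data standing in for the two input files.
my @query_arr = ( 'apple', 'a.b' );
my @lines     = ( 'apple][red', 'axb][no match', 'A.B][dotted', 'pear][green' );

my $query_regex = build_query_regex(@query_arr);

my @matches;
for my $line (@lines) {
    if ( $line =~ $query_regex ) {
        push @matches, [ $1, $line ];    # keep the matched query and the line
        print "QUERY: $1\nMATCH: $line\n";
    }
}
print "Entrate trovate: ", scalar(@matches), "\n";
```

Compiling the pattern once with `qr//` means each data line is tested against one regex instead of looping over every query, which is where most of the time goes in the nested-loop version.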
Re^3: Sort of basic search engine/pattern matching problems by sierpinski (Chaplain) on Apr 30, 2010 at 03:24 UTC
I'm not a master of regexes (yet?) but I think I see the problem... `if ($line =~ /^$query_el\]\[.*/i) {` Remember that $ has special meaning inside a regex (end of line). I think that you need to use the qr operator before the regex to evaluate the contents of $query_el before you perform the regex. Someone please correct me if I'm wrong here.
Re^4: Sort of basic search engine/pattern matching problems by graff (Chancellor) on Apr 30, 2010 at 09:18 UTC
When you're not sure, you should try it yourself and find out, rather than posting a guess. As it happens, you are wrong in this case: Perl interpolates the variable $query_el into the regex.
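The interpolation point is easy to check with a few lines. A small sketch with made-up strings: a `$` followed by a variable name is interpolated before matching, and `\Q...\E` can be used when the interpolated text should be taken literally.

```perl
use strict;
use warnings;

my $query_el = 'foo';
my $line     = 'foo][bar';

# $query_el is interpolated, so this pattern is effectively /^foo\]\[/i.
my $interpolated = ( $line =~ /^$query_el\]\[/i ) ? 1 : 0;

# Metacharacters inside the variable stay active unless quoted.
my $dotted     = 'f.o';
my $as_pattern = ( 'fxo][bar' =~ /^$dotted\]\[/ )     ? 1 : 0;    # '.' matches any character
my $as_literal = ( 'fxo][bar' =~ /^\Q$dotted\E\]\[/ ) ? 1 : 0;    # '.' must be a literal dot

print "interpolated: $interpolated\n";
print "as pattern:   $as_pattern\n";
print "as literal:   $as_literal\n";
```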