
Dear Perl Monks, I have a tab-delimited text file with three columns. The first column is an ID, and the next two columns are phrases (descriptions). I am trying to write code that compares the two phrases to see whether they are "similar". My approach is to split the first phrase into individual words, skip any word shorter than 3 characters, and then check whether each remaining word occurs in the second phrase. For output, I print every match. Here is the code I have written so far:

 
my @data;
while(<>) {
    push @data, $_;
}

foreach my $line (@data) {
    my @temp_array = split "\t", $line; # Split columns into an array
    $temp_array[1] =~ tr/\"\-\/,/ /; #Change all potential word endings to a single space
    $temp_array[1] =~ tr/\(\)//d; # Remove parentheses to avoid mishaps during pattern matching
    $temp_array[2] =~ tr/\"\-\/,/ /; #Same as above
    $temp_array[2] =~ tr/\(\)//d; #Same as above
    my @words = split " ", $temp_array[1]; # Split first phrase into individual words
    for(my $i = 0; $i < @words; $i++) {
        my $match_count = 1;
        if(length ($words[$i]) < 3) { next; }
        elsif(length ($words[$i]) < 5) {
            if($words[$i] =~ /$temp_array[2]/i) {
                    print "Match $match_count (probable): $words[$i]\n";
                    $match_count++;
            }
            else { next; }
        }
        else {
            if($words[$i] =~ /$temp_array[2]/i) {
                    print "Match $match_count: $words[$i] \n";
                    $match_count++;
            }
            else { next; }
        }
    }
}

Running this code produces no output, only the warning "Unmatched parenthesis in regex", even though I am removing all parentheses from the text. All my debugging and testing points to the pattern matching as the source of the error. Is there another way to achieve what I want (that is, case-insensitive substring matching)? Or, even better, has someone already written code like this? Here are the first five lines of my input for your reference:


MIP_00001     Chromosomal replication initiator protein dnaA      chromosomal replication initiationprotein
MIP_00002     DNA polymerase III subunit beta      DNA polymerase III subunit beta
MIP_00003     DNA replication and repair protein recF      recombination protein F
MIP_00004     Hypothetical protein      hypothetical protein Rv0004
MIP_00006     DNA gyrase subunit B      DNA gyrase subunit B

Kindly help me out. Thanks!

TEJ

Edit: Changed the code as suggested by BrowserUk

Edit No.2: Got it to work, guys! I just had to swap the two variables on either side of the pattern match, so the phrase is what gets searched and the word is the pattern. Stupid mistake, LOL :P Thanks for all the support :)
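
For anyone who lands here later, this is a minimal sketch of what the swapped match might look like. It is not the exact code I ended up with: the helper name `matching_words` is made up for illustration, and wrapping the word in `\Q...\E` is an extra assumption on my part, added so that any regex metacharacters left in a word are matched literally instead of being interpreted as part of a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code): return the words of
# $phrase_a that are at least 3 characters long and occur,
# case-insensitively, somewhere in $phrase_b.
sub matching_words {
    my ($phrase_a, $phrase_b) = @_;

    # Same clean-up as in the original code, applied to both phrases.
    for ($phrase_a, $phrase_b) {
        tr/"\-\/,/ /;   # quotes, hyphens, slashes, commas become spaces
        tr/()//d;       # parentheses are removed
    }

    my @matches;
    for my $word (split ' ', $phrase_a) {
        next if length($word) < 3;
        # The phrase being searched goes on the left of =~ and the
        # word goes inside the pattern. \Q...\E quotes metacharacters.
        push @matches, $word if $phrase_b =~ /\Q$word\E/i;
    }
    return @matches;
}

# Third sample line from the input above.
my @found = matching_words(
    'DNA replication and repair protein recF',
    'recombination protein F',
);
print "Match: $_\n" for @found;
```

Note that this is still plain substring matching, so a short word can match inside a longer one. If that turns out to be too loose, anchoring the pattern with `\b` on both sides is one possible way to tighten it.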

In reply to help with comparing two arrays of phrases by sdtej
