Dear Perl Monks, I have a tab-delimited text file with three columns. The first column is an ID, and the next two columns are phrases (descriptions). I am trying to write code that compares the two phrases to see if they are "similar". My approach is to split the first phrase into individual words, skip any word shorter than 3 characters, and then check whether each remaining word occurs in the second phrase. As output, I print every match. Here is the code I have written so far:

my @data;
while(<>) {
    push @data, $_;
}

foreach my $line (@data) {
    my @temp_array = split "\t", $line;     # Split columns into an array
    $temp_array[1] =~ tr/\"\-\/,/ /;        # Change all potential word endings to a single space
    $temp_array[1] =~ tr/\(\)//d;           # Remove parentheses to avoid mishaps during pattern matching
    $temp_array[2] =~ tr/\"\-\/,/ /;        # Same as above
    $temp_array[2] =~ tr/\(\)//d;           # Same as above
    my @words = split " ", $temp_array[1];  # Split first phrase into individual words
    for(my $i = 0; $i < @words; $i++) {
        my $match_count = 1;
        if(length ($words[$i]) < 3) {
            next;
        }
        elsif(length ($words[$i]) < 5) {
            if($words[$i] =~ /$temp_array[2]/i) {
                print "Match $match_count (probable): $words[$i]\n";
                $match_count++;
            }
            else {
                next;
            }
        }
        else {
            if($words[$i] =~ /$temp_array[2]/i) {
                print "Match $match_count: $words[$i] \n";
                $match_count++;
            }
            else {
                next;
            }
        }
    }
}

Running this code produces no output and the warning "Unmatched parenthesis in regex", even though I am removing all parentheses from the text. All my debugging and testing points to the pattern matching as the source of the error. Is there any other way to achieve what I want (that is, case-insensitive substring matching)? Or, even better, has someone else already written such code? Here are the first five lines of my input for your reference:

MIP_00001	Chromosomal replication initiator protein dnaA	chromosomal replication initiationprotein
MIP_00002	DNA polymerase III subunit beta	DNA polymerase III subunit beta
MIP_00003	DNA replication and repair protein recF	recombination protein F
MIP_00004	Hypothetical protein	hypothetical protein Rv0004
MIP_00006	DNA gyrase subunit B	DNA gyrase subunit B

Kindly help me out. Thanks!

TEJ

Edit: Changed the code as suggested by BrowserUk

Edit No.2: Got it to work, guys! I just had to swap the two variables on either side of the =~ in my pattern matching. Stupid mistake, LOL :P Thanks for all the support :)
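For anyone who lands here with the same problem, here is a minimal sketch of what the swapped match might look like. The helper name matching_words and its $min_len parameter are hypothetical (they are not in the code above). The sketch assumes the phrase goes on the left of =~ and the word is the pattern, and it wraps the word in \Q...\E so that any regex metacharacters in the data are treated literally.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): returns the words of
# $phrase1 that are at least $min_len characters long and that occur,
# case-insensitively, as a substring of $phrase2.
sub matching_words {
    my ($phrase1, $phrase2, $min_len) = @_;
    $min_len = 3 unless defined $min_len;

    my @matches;
    for my $word (split ' ', $phrase1) {
        next if length($word) < $min_len;    # skip short words
        # The phrase is the string being searched; the word is the pattern.
        # \Q...\E quotes metacharacters such as ( ) + in the word.
        push @matches, $word if $phrase2 =~ /\Q$word\E/i;
    }
    return @matches;
}

my @found = matching_words(
    'DNA replication and repair protein recF',
    'recombination protein F',
);
print "Match: $_\n" for @found;    # prints "Match: protein"
```

With the quoting in place, stripping parentheses beforehand should no longer be needed just to keep the regex engine happy, though the tr/// calls may still be useful for splitting on punctuation.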


In reply to help with comparing two arrays of phrases by sdtej
