Good evening wise monks,

I wrote this Perl script to filter the raw output of a PubMed article reader (called ppaxe, by Sergio Castillo). Basically, ppaxe reads thousands of PubMed articles for me and searches for possible interactions between proteins/genes. The results include lines whose verb does not actually indicate an interaction, and lines with multiple verbs of which only some indicate one.

My Perl script needs to filter out any line that does not have at least one verb that indicates an interaction. I have a file of approved verbs, a file of discarded verbs and my ppaxe results file. I put my verb lists into arrays and used the index function instead of exists for matching. I am not allowed to use regexes, so that whoever takes over the program later can understand it more easily.

When I run my Perl program, it just prints the whole data file without actually filtering anything. Can anyone help me correct my program and explain what I am doing wrong?

Thanks so much,

#!/usr/bin/perl
# discard_lines_by_verbs.pl
use strict;
use warnings;

die "Please use suitable files" if (@ARGV != 3);

my $dis_verbs = shift @ARGV;
my $apr_verbs = shift @ARGV;
my $ppaxe     = shift @ARGV;

open(my $in1, "<", "$dis_verbs") or die "error reading $dis_verbs. $!";
open(my $in2, "<", "$apr_verbs") or die "error reading $apr_verbs. $!";
open(my $in3, "<", "$ppaxe")     or die "error reading $ppaxe. $!";

my @dis_dic;
my @apr_dic;

while (my $f1_line = <$in1>) {
    chomp($f1_line);
    @dis_dic = $f1_line;
}

while (my $f2_line = <$in2>) {
    chomp($f2_line);
    @apr_dic = $f2_line;
}

while (my $f3_line = <$in3>) {
    chomp($f3_line);
    if ( index($f3_line, @apr_dic) != -1 ) {
        print "$f3_line\n";
    }
    elsif ( index($f3_line, @apr_dic && @dis_dic) != -1 ) {
        print "$f3_line\n";
    }
    else {
        next;
    }
}

close($in1);
close($in2);
close($in3);
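A note on the two index calls in the script: index expects scalar arguments, so an array passed as the substring is evaluated in scalar context and turns into its element count. The short sketch below illustrates this with a made-up two-element array (the variable names are only for illustration and are not from the script). In the script itself, @apr_dic = $f2_line; replaces the array on every pass, so the array ends up holding one element and the search is effectively for the character "1", which most of the data lines contain somewhere in their numeric fields.

```perl
use strict;
use warnings;

# Illustrative data only: a two-element verb list and one ppaxe-style line.
my @verbs = ('BINDS', 'ALTERS');
my $line  = 'RPSA AKT1 18628488 0.634 BINDS,ALTERS';

# index() takes scalars, so @verbs is evaluated in scalar context here
# and becomes the number of elements (2). The search is for "2", not for
# any of the verbs.
my $pos_array = index($line, @verbs);

# The same search written out explicitly.
my $pos_count = index($line, scalar @verbs);

print "same result\n" if $pos_array == $pos_count;
```

To test each verb, the array has to be walked one element at a time, and the verb files have to be loaded with push @apr_dic, $f2_line; so that every verb is kept.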

These files are small test versions:

approved_verbs_test:

ACTIVATES
ADPRIBOSYLATED
ALTERS
ARGINYLATED
ASSOCIATES
BINDS

discarded_verbs_test:

ARE
ASK
ASSESS
BASED
BECAME
IS

sample_ppaxe_data:

RPSA AKT1 18628488 0.634 BINDS,ALTERS
RUNX2 DKK1 22960397 0.746 ADPRIBOSYLATED,ALTERS
ARHGAP31 RASA1 17158447 0.56 ASSOCIATES
ARHGAP31 RNASE1 17158447 0.602 BECOME
RASA1 RNASE1 17158447 0.554 BASED
NOS1 NOS3 19799911 0.628 ARGINYLATED,BASED
VTN PRAP1 27189837 0.582 IS
MAPK8 RHOD 11414711 0.698 ARGINYLATED,BINDS
IL2 SETBP1 8398987 0.556 BINDS
S100A8 S100A9 20105291 0.596 ASSESS

Desired outcome:

RPSA AKT1 18628488 0.634 BINDS,ALTERS
RUNX2 DKK1 22960397 0.746 ADPRIBOSYLATED,ALTERS
ARHGAP31 RASA1 17158447 0.56 ASSOCIATES
NOS1 NOS3 19799911 0.628 ARGINYLATED,BASED
MAPK8 RHOD 11414711 0.698 ARGINYLATED,BINDS
IL2 SETBP1 8398987 0.556 BINDS
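For reference, here is a minimal regex-free sketch of the filtering rule that the desired outcome implies: keep a line if at least one of its verbs is on the approved list. It is not a drop-in replacement for the script. The helper name has_approved_verb is hypothetical, the data is inlined instead of read from files, and it assumes the fields are separated by single spaces with the comma-separated verbs in the last field.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the last space-separated field of
# $line (assumed to be the comma-separated verb list) contains at least
# one verb from the array referenced by $approved, otherwise 0.
sub has_approved_verb {
    my ($line, $approved) = @_;

    # Take everything after the last space. If there is no space,
    # rindex returns -1 and substr returns the whole line.
    my $verbs = substr($line, rindex($line, ' ') + 1);

    for my $verb (@$approved) {
        next unless length $verb;    # ignore empty entries
        # Wrap both strings in commas so only whole verbs match
        # (BINDS must not match inside a longer word).
        return 1 if index(",$verbs,", ",$verb,") != -1;
    }
    return 0;
}

# Inlined test data taken from the small sample files above.
my @approved = qw(ACTIVATES ADPRIBOSYLATED ALTERS ARGINYLATED ASSOCIATES BINDS);
my @sample = (
    'RPSA AKT1 18628488 0.634 BINDS,ALTERS',
    'ARHGAP31 RNASE1 17158447 0.602 BECOME',
    'NOS1 NOS3 19799911 0.628 ARGINYLATED,BASED',
    'VTN PRAP1 27189837 0.582 IS',
);

for my $line (@sample) {
    print "$line\n" if has_approved_verb($line, \@approved);
}
```

With this sample, the BINDS,ALTERS line and the ARGINYLATED,BASED line are printed and the other two are skipped. Under this rule the discarded-verbs file is not needed for the decision, since a line is kept as soon as one approved verb is found.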

In reply to Matching multiple substrings of a string to arrays and printing those that match by rarenas
