
Good evening wise monks,

I wrote this Perl script to filter the raw output of a PubMed article reader (ppaxe, by Sergio Castillo). ppaxe reads thousands of PubMed articles for me and searches for possible interactions between proteins/genes. The output includes lines whose verbs do not actually indicate an interaction, as well as lines with multiple verbs, of which some indicate an interaction and others do not.

My script needs to filter out any line that does not contain at least one verb indicating an interaction. I have a file of approved verbs, a file of discarded verbs, and my ppaxe results file. I read the verb lists into arrays and used index for matching rather than a hash with exists. I am not allowed to use regexes, so that the next generation that takes over can understand the program more easily.

When I run the program, it prints the whole data file without filtering anything. Can anyone help me correct it and explain what I am doing wrong?

Thanks so much,

#!/usr/bin/perl
# discard_lines_by_verbs.pl
use strict;
use warnings;

die "Please use suitable files" if (@ARGV != 3);
my $dis_verbs = shift @ARGV;
my $apr_verbs = shift @ARGV;
my $ppaxe = shift @ARGV;

open(my $in1, "<", "$dis_verbs")
  or die "error reading $dis_verbs. $!";
open(my $in2, "<", "$apr_verbs")
  or die "error reading $apr_verbs. $!";
open(my $in3, "<", "$ppaxe")
  or die "error reading $ppaxe. $!";

my @dis_dic;
my @apr_dic;

while (my $f1_line = <$in1>) {
  chomp($f1_line);
  @dis_dic = $f1_line;
}

while (my $f2_line = <$in2>) {
  chomp($f2_line);
  @apr_dic = $f2_line;
}

while (my $f3_line = <$in3>) {
  chomp($f3_line);
  if ( index($f3_line, @apr_dic) != -1 ) {
    print "$f3_line\n";
  }
  elsif ( index($f3_line, @apr_dic && @dis_dic) != -1 ) {
    print "$f3_line\n";
  }
  else {
    next;
  }
}
close($in1);
close($in2);
close($in3);
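Two things in the script above look like the likely cause. This is a small sketch, assuming each verb file holds one verb per line; the variable names are illustrative only. First, `@dis_dic = $f1_line` replaces the array on every pass, so only the last verb survives. Second, index expects a string as its second argument, so an array there is evaluated in scalar context and becomes its element count.

```perl
use strict;
use warnings;

# Assigning a scalar to an array replaces the whole array each time.
my @verbs;
for my $verb (qw(ACTIVATES ALTERS BINDS)) {
    @verbs = $verb;    # overwrites; push @verbs, $verb would append instead
}
print scalar(@verbs), "\n";    # prints 1: only BINDS is left

# index() takes its arguments in scalar context, so @verbs becomes
# its element count and the search is really index($line, 1).
my $line = "VTN\tPRAP1\t27189837\t0.582\tIS";
print "matched\n" if index($line, @verbs) != -1;
```

With the sample files each array ends up with one element, so the script searches every line for the digit 1. Every sample line contains a 1 somewhere, which would explain why nothing is filtered.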

These files are small test versions:

approved_verbs_test:

ACTIVATES
ADPRIBOSYLATED
ALTERS
ARGINYLATED
ASSOCIATES
BINDS

discarded_verbs_test:

ARE
ASK
ASSESS
BASED
BECAME
IS

sample_ppaxe_data:

RPSA    AKT1    18628488    0.634    BINDS,ALTERS
RUNX2    DKK1    22960397    0.746    ADPRIBOSYLATED,ALTERS
ARHGAP31    RASA1    17158447    0.56    ASSOCIATES
ARHGAP31    RNASE1    17158447    0.602    BECOME
RASA1    RNASE1    17158447    0.554    BASED
NOS1    NOS3    19799911    0.628    ARGINYLATED,BASED
VTN    PRAP1    27189837    0.582    IS
MAPK8    RHOD    11414711    0.698    ARGINYLATED,BINDS
IL2    SETBP1    8398987    0.556    BINDS
S100A8    S100A9    20105291    0.596    ASSESS

Desired outcome:

RPSA    AKT1    18628488    0.634    BINDS,ALTERS
RUNX2    DKK1    22960397    0.746    ADPRIBOSYLATED,ALTERS
ARHGAP31    RASA1    17158447    0.56    ASSOCIATES
NOS1    NOS3    19799911    0.628    ARGINYLATED,BASED
MAPK8    RHOD    11414711    0.698    ARGINYLATED,BINDS
IL2    SETBP1    8398987    0.556    BINDS
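One regex-free way to get the desired outcome is sketched below. It rests on some assumptions: the verbs sit in the last whitespace-separated column, they are joined by commas, and a line is kept when at least one of its verbs is approved (which matches the sample, where BECOME is in neither list and is dropped). Under that rule the discarded list is not needed. The helper names read_verbs, last_field and has_approved_verb are made up for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one verb per line into a list. push appends to the array.
sub read_verbs {
    my ($path) = @_;
    open(my $fh, "<", $path) or die "error reading $path. $!";
    my @verbs;
    while (my $verb = <$fh>) {
        chomp($verb);
        push @verbs, $verb if length $verb;    # skip blank lines
    }
    close($fh);
    return @verbs;
}

# Return the text after the last tab or space, assumed to be the verb column.
sub last_field {
    my ($line) = @_;
    my $tab   = rindex($line, "\t");
    my $space = rindex($line, " ");
    my $cut   = $tab > $space ? $tab : $space;
    return substr($line, $cut + 1);
}

# True if at least one verb in the comma-separated column is approved.
# Wrapping both sides in commas makes index match whole verbs only.
sub has_approved_verb {
    my ($line, @approved) = @_;
    my $field = "," . last_field($line) . ",";
    for my $verb (@approved) {
        return 1 if index($field, ",$verb,") != -1;
    }
    return 0;
}

# Same three arguments as the original script; the discarded list is ignored.
if (@ARGV == 3) {
    my (undef, $apr_verbs, $ppaxe) = @ARGV;
    my @apr_dic = read_verbs($apr_verbs);
    open(my $in, "<", $ppaxe) or die "error reading $ppaxe. $!";
    while (my $line = <$in>) {
        chomp($line);
        print "$line\n" if has_approved_verb($line, @apr_dic);
    }
    close($in);
}
```

It would be run the same way as the original, for example `perl discard_lines_by_verbs.pl discarded_verbs_test approved_verbs_test sample_ppaxe_data`. Looping over the approved verbs and calling index once per verb keeps the original arrays-and-index approach; a hash with exists would also work, but would need the verb column split into individual verbs first.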

In reply to Matching multiple substrings of a string to arrays and printing those that match by rarenas
