Hi guys. I am totally new to Perl; I started working with it just a week ago, so please excuse any beginner mistakes. Glad to be a part of this monastery. I'm trying to build a simple regex text parser. I have three input files. The first, GENE, contains a list of gene names. The second, TARGET, contains a list of protein locations inside the cell. The third, IF, is my full input text, which I have converted into a single line. While parsing, I look for a GENE entry, a verb, and a TARGET entry within a single line of the text. If all three are present, I print them out.

#!/usr/bin/perl
use strict;
use warnings;

# opening the input lexicons

open (GENE,"/home/stanley/Desktop/Gene.txt");
open (TARGET, "/home/stanley/Desktop/Target.txt");

my $target;
my $gene;

# opening fulltext

open (IF, "/home/stanley/Desktop/18048320.txt");
my $text = <IF>;
my @splittext = split (/[.] [A-Z]/, $text);
close (IF);

# opening output file

open (OF, ">/home/stanley/Desktop/Local.txt");

# Parsing Text

for $gene (<GENE>) {

chomp $gene;
    while ($target = <TARGET>) {
    chomp $target;

            foreach my $line (@splittext) {
                if ($line =~ /.+?($gene).*(localizes to|held|located in|localization|translocated to|targets|reaches|exported|export).*($target).+?/ig) {
                print OF $1."\t".$2."\t".$3."\n";
                }
            }
    
    }
} 
close (OF);
close (GENE);
close (TARGET);

____DATA___

gene.txt

pfrom1
pfama1
ama1
ha-pfrom1

target.txt

apicoplast
mitochondrion
rhoptry
rhoptries
golgi
dense granules
parasitophorous vacuole
micronemes
food vacuole
secretory vesicle
host cell

input txt
ompartmentalization of proteins into subcellular organelles in eukaryotic cells is a fundamental mechanism of regulating complex cellular functions. Many proteins of Plasmodium falciparum merozoites involved in invasion are compartmentalized into apical organelles. We have identified a new merozoite organelle that contains P. falciparum rhomboid-1 (PfROM1), a protease that cleaves the transmembrane regions of proteins involved in invasion. By immunoconfocal microscopy, PfROM1 was localized to a single, thread-like structure on one side of the merozoites that appears to be in close proximity to the subpellicular microtubules. Using antibodies to the merozoite surface protein-1 (MSP1), a protein that is located in the merozoite plasma membrane (Fig. 3A), we demonstrated that HA-PfROM1 staining is intracellular, not colocalizing with the plasma membrane. In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1. Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the Golgi, secretory vesicles, and in micronemes (10; L. D. Sibley, unpublished data). For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes. For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes.

This is just a small part of the input file. What happens now is that the parser checks only the first GENE entry and then quits the loop. What I want is that, for each GENE entry, it goes through all the TARGET possibilities one by one and checks for the pattern in each line of the text.
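
A likely explanation, offered as a sketch rather than a definitive fix: the inner `while` reads the TARGET filehandle to end-of-file while the first gene is being processed, so for every later gene `<TARGET>` returns nothing straight away. One way around this is to load both lexicons into arrays once and loop over the arrays. In the sketch below, the sub name `find_hits` and the inline sample data are hypothetical, used only for illustration; `\Q...\E` is an added assumption so that lexicon entries are matched literally, and the leading `.+?` and trailing `.+?` of the original pattern are left out.

```perl
use strict;
use warnings;

# Verb phrases taken from the pattern in the original script.
my $verbs = 'localizes to|held|located in|localization|translocated to|targets|reaches|exported|export';

# find_hits is a hypothetical helper: it takes array refs of gene names,
# target names, and lines of text, and returns one [gene, verb, target]
# triple for every line in which all three appear in that order.
sub find_hits {
    my ($genes, $targets, $lines) = @_;
    my @hits;
    for my $gene (@$genes) {
        for my $target (@$targets) {    # an array can be walked again for every gene
            for my $line (@$lines) {
                if ($line =~ /(\Q$gene\E).*($verbs).*(\Q$target\E)/i) {
                    push @hits, [$1, $2, $3];
                }
            }
        }
    }
    return @hits;
}

# Inline sample data standing in for Gene.txt, Target.txt and the full text.
my @genes   = ('pfama1', 'ama1');
my @targets = ('micronemes', 'golgi');
my @lines   = (
    'In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1',
    'Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the Golgi',
);

# Print each hit as a tab-separated line, as the original script does.
print join("\t", @$_), "\n" for find_hits(\@genes, \@targets, \@lines);
```

With real files, the arrays could be filled with something like `chomp(my @targets = <TARGET>);` before the loops. Keeping the filehandle loop and calling `seek(TARGET, 0, 0)` before each pass over TARGET would be another option, at the cost of re-reading the file for every gene.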

In reply to Simple RegEX text parser by I-Box
