Hi guys. I am totally new to Perl; I started working with it only a week ago, so please excuse any stupidity along the way. Glad to be a part of this monastery. I'm trying to build a simple regex text parser. I have three input files. The first, GENE, contains a list of gene names. The second, TARGET, contains a list of protein locations inside the cell. The third, IF, is my input full text, which I have converted into a single line. While parsing, I look for a GENE entry, a verb, and a TARGET entry within a single line of the text. If all three are present, I print them out.
```perl
#!/usr/bin/perl
use strict;
use warnings;

# opening the input lexicons
open (GENE, "/home/stanley/Desktop/Gene.txt");
open (TARGET, "/home/stanley/Desktop/Target.txt");
my $target;
my $gene;

# opening fulltext
open (IF, "/home/stanley/Desktop/18048320.txt");
my $text = <IF>;
my @splittext = split (/[.] [A-Z]/, $text);
close (IF);

# opening output file
open (OF, ">/home/stanley/Desktop/Local.txt");

# Parsing Text
for $gene (<GENE>) {
    chomp $gene;
    while ($target = <TARGET>) {
        chomp $target;
        foreach my $line (@splittext) {
            if ($line =~ /.+?($gene).*(localizes to|held|located in|localization|translocated to|targets|reaches|exported|export).*($target).+?/ig) {
                print OF $1."\t".$2."\t".$3."\n";
            }
        }
    }
}
close (OF);
close (GENE);
close (TARGET);
```
__DATA__

gene.txt
pfrom1
pfama1
ama1
ha-pfrom1

target.txt
apicoplast
mitochondrion
rhoptry
rhoptries
golgi
dense granules
parasitophorous vacuole
micronemes
food vacuole
secretory vesicle
host cell

input txt
Compartmentalization of proteins into subcellular organelles in eukaryotic cells is a fundamental mechanism of regulating complex cellular functions. Many proteins of Plasmodium falciparum merozoites involved in invasion are compartmentalized into apical organelles. We have identified a new merozoite organelle that contains P. falciparum rhomboid-1 (PfROM1), a protease that cleaves the transmembrane regions of proteins involved in invasion. By immunoconfocal microscopy, PfROM1 was localized to a single, thread-like structure on one side of the merozoites that appears to be in close proximity to the subpellicular microtubules. Using antibodies to the merozoite surface protein-1 (MSP1), a protein that is located in the merozoite plasma membrane (Fig. 3A), we demonstrated that HA-PfROM1 staining is intracellular, not colocalizing with the plasma membrane. In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1. Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the Golgi, secretory vesicles, and in micronemes (10; L. D. Sibley, unpublished data). For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes. For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes.
This is just a small part of the input file. What happens is that the parser checks only the first GENE entry and then quits the loop. What I want is that, for each GENE entry, it takes all the TARGET possibilities one by one and checks for the pattern in each line of the text.
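For reference, the behaviour wanted here (every GENE entry tried against every TARGET entry on every line) could be sketched roughly as below. This is only a sketch under assumptions, not a drop-in fix. It assumes the loop stops after the first gene because `while ($target = <TARGET>)` reads the TARGET filehandle to end-of-file on the first pass, so later genes find nothing left to read; holding the entries in arrays avoids that. The inline `@genes`, `@targets` and `@lines` are hypothetical stand-ins for the three files, and `find_triples` is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical inline stand-ins for Gene.txt, Target.txt and the split full text.
my @genes   = ('pfrom1', 'pfama1', 'ama1');
my @targets = ('micronemes', 'golgi');
my @lines   = (
    'In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1',
    'Toxoplasma gondii ROM1 is located in the Golgi',
);

# Verb alternation copied from the original pattern.
my $verbs = qr/localizes to|held|located in|localization|translocated to|targets|reaches|exported|export/i;

# Return one [gene, verb, target] triple for each (gene, target, line) that matches.
sub find_triples {
    my ($genes, $targets, $lines) = @_;
    my @hits;
    for my $gene (@$genes) {
        # Arrays can be walked again for every gene; a filehandle cannot
        # without being reopened or rewound.
        for my $target (@$targets) {
            for my $line (@$lines) {
                # \Q...\E escapes regex metacharacters in lexicon entries.
                # No /g here: in scalar context it would keep pos() on $line
                # between tests and could skip matches.
                if ($line =~ /(\Q$gene\E).*?($verbs).*?(\Q$target\E)/i) {
                    push @hits, [$1, $2, $3];
                }
            }
        }
    }
    return @hits;
}

# Print tab-separated triples, as the original script writes to its output file.
print join("\t", @$_), "\n" for find_triples(\@genes, \@targets, \@lines);
```

With real files, the arrays could be filled once up front, for example with `chomp(my @targets = <TARGET>);` after a checked `open`, and the same nested loops then run over them.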
In reply to Simple RegEX text parser by I-Box