Hi guys. I am totally new to Perl; I started working with it a week ago, so please excuse any beginner mistakes. Glad to be a part of this monastery. I'm trying to build a simple regex text parser. I will have three input files. The first, GENE, contains a list of gene names. The second, TARGET, contains a list of protein locations inside the cell. The third, IF, is my full input text, which I have collapsed into a single line. While parsing, I look for a GENE entry, a verb, and a TARGET entry in a single line of the text. If all three are present, I print them out.

#!/usr/bin/perl
use strict;
use warnings;

# opening the input lexicons
open (GENE, "/home/stanley/Desktop/Gene.txt");
open (TARGET, "/home/stanley/Desktop/Target.txt");
my $target;
my $gene;

# opening fulltext
open (IF, "/home/stanley/Desktop/18048320.txt");
my $text = <IF>;
my @splittext = split (/[.] [A-Z]/, $text);
close (IF);

# opening output file
open (OF, ">/home/stanley/Desktop/Local.txt");

# Parsing Text
for $gene (<GENE>) {
    chomp $gene;
    while ($target = <TARGET>) {
        chomp $target;
        foreach my $line (@splittext) {
            if ($line =~ /.+?($gene).*(localizes to|held|located in|localization|translocated to|targets|reaches|exported|export).*($target).+?/ig) {
                print OF $1."\t".$2."\t".$3."\n";
            }
        }
    }
}
close (OF);
close (GENE);
close (TARGET);

__DATA__

gene.txt
pfrom1
pfama1
ama1
ha-pfrom1

target.txt
apicoplast
mitochondrion
rhoptry
rhoptries
golgi
dense granules
parasitophorous vacuole
micronemes
food vacuole
secretory vesicle
host cell

input txt
Compartmentalization of proteins into subcellular organelles in eukaryotic cells is a fundamental mechanism of regulating complex cellular functions. Many proteins of Plasmodium falciparum merozoites involved in invasion are compartmentalized into apical organelles. We have identified a new merozoite organelle that contains P. falciparum rhomboid-1 (PfROM1), a protease that cleaves the transmembrane regions of proteins involved in invasion. By immunoconfocal microscopy, PfROM1 was localized to a single, thread-like structure on one side of the merozoites that appears to be in close proximity to the subpellicular microtubules. Using antibodies to the merozoite surface protein-1 (MSP1), a protein that is located in the merozoite plasma membrane (Fig. 3A), we demonstrated that HA-PfROM1 staining is intracellular, not colocalizing with the plasma membrane. In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1. Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the Golgi, secretory vesicles, and in micronemes (10; L. D. Sibley, unpublished data). For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes. For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes.

This is just a small part of the input file. What happens now is that the parser checks only the first GENE entry and then leaves the loop. What I want is that, for each gene entry, it takes every target one by one and checks for the pattern in each line of the text.
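A likely cause of that behaviour is that the inner `while` reads the TARGET filehandle to end-of-file during the first gene, so for every later gene there is nothing left to read. A minimal sketch of one way around it, assuming both lexicons are small enough to hold in memory: load the targets into an array once and loop over the array. The in-memory lists, the `find_hits` helper, and the `$verbs` variable below are hypothetical stand-ins for illustration, not part of the original script. The sketch also drops `/g` (which can make a repeated match on the same string resume from an earlier position), quotes the lexicon entries with `\Q...\E`, and splits sentences with lookarounds so the capital letter is not consumed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for Gene.txt, Target.txt and the full text.
# With real files, read each lexicon once, e.g.: chomp(my @targets = <$target_fh>);
my @genes   = ('pfrom1', 'pfama1', 'ama1', 'ha-pfrom1');
my @targets = ('apicoplast', 'micronemes', 'golgi');
my $text    = 'In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1. '
            . 'Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes.';

# Split at whitespace that follows a full stop and precedes a capital letter.
# The lookbehind/lookahead keep both the full stop and the capital letter.
my @sentences = split /(?<=[.])\s+(?=[A-Z])/, $text;

# Verb list copied from the original pattern.
my $verbs = qr/localizes to|held|located in|localization|translocated to|targets|reaches|exported|export/i;

# Return one tab-separated "gene verb target" string per matching sentence.
sub find_hits {
    my ($genes, $targets, $sentences) = @_;
    my @hits;
    for my $gene (@$genes) {
        for my $target (@$targets) {        # an array can be walked again for every gene
            for my $sentence (@$sentences) {
                # \Q...\E quotes regex metacharacters in lexicon entries.
                # No /g, so each sentence is always searched from its start.
                if ($sentence =~ /(\Q$gene\E).*?($verbs).*?(\Q$target\E)/i) {
                    push @hits, join("\t", $1, $2, $3);
                }
            }
        }
    }
    return @hits;
}

print "$_\n" for find_hits(\@genes, \@targets, \@sentences);
```

Note that `ama1` also matches inside `PfAMA1` here, because the pattern has no word boundaries; wrapping the gene in `\b...\b` would be one way to restrict that, if whole-word matches are what is wanted.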


In reply to Simple RegEX text parser by I-Box
