Simple RegEX text parser

I-Box has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys. I am totally new to Perl; I just started working with it a week ago, so please excuse any stupidity along the way. Glad to be a part of this monastery. I'm trying to build a simple regex text parser. I will have three input files. The first one, GENE, contains a list of gene names. The second one, TARGET, contains a list of protein locations inside the cell. The third one, IF, is my input full text, which I have turned into a single line. While parsing, I look for a GENE entry, a verb, and a TARGET entry in a single line of the text. If all three are present, I print them out.

#!/usr/bin/perl
use strict;
use warnings;

# opening the input lexicons

open (GENE,"/home/stanley/Desktop/Gene.txt");
open (TARGET, "/home/stanley/Desktop/Target.txt");

my $target;
my $gene;

# opening fulltext

open (IF, "/home/stanley/Desktop/18048320.txt");
my $text = <IF>;
my @splittext = split (/[.] [A-Z]/, $text);
close (IF);

# opening output file

open (OF, ">/home/stanley/Desktop/Local.txt");

# Parsing Text

for $gene (<GENE>) {

chomp $gene;
    while ($target = <TARGET>) {
    chomp $target;

            foreach my $line (@splittext) {
                if ($line =~ /.+?($gene).*(localizes to|held|located in|localization|translocated to|targets|reaches|exported|export).*($target).+?/ig) {
                print OF $1."\t".$2."\t".$3."\n";
                }
            }
    
    }
} 
close (OF);
close (GENE);
close (TARGET);

____DATA___

gene.txt

pfrom1
pfama1
ama1
ha-pfrom1

target.txt

apicoplast
mitochondrion
rhoptry
rhoptries
golgi
dense granules
parasitophorous vacuole
micronemes
food vacuole
secretory vesicle
host cell

input txt
Compartmentalization of proteins into subcellular organelles in eukaryotic cells is a fundamental mechanism of regulating complex cellular functions. Many proteins of Plasmodium falciparum merozoites involved in invasion are compartmentalized into apical organelles. We have identified a new merozoite organelle that contains P. falciparum rhomboid-1 (PfROM1), a protease that cleaves the transmembrane regions of proteins involved in invasion. By immunoconfocal microscopy, PfROM1 was localized to a single, thread-like structure on one side of the merozoites that appears to be in close proximity to the subpellicular microtubules. Using antibodies to the merozoite surface protein-1 (MSP1), a protein that is located in the merozoite plasma membrane (Fig. 3A), we demonstrated that HA-PfROM1 staining is intracellular, not colocalizing with the plasma membrane. In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1. Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the Golgi, secretory vesicles, and in micronemes (10; L. D. Sibley, unpublished data). For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes. For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes.

This is just a small part of the input file. What happens now is that the parser checks only the first GENE entry and then quits the loop. What I want is that, for each GENE entry, it takes up all the TARGET possibilities one by one and checks for the pattern in each line of the text.


Replies are listed 'Best First'.
Re: Simple RegEX text parser by almut (Canon) on Dec 30, 2008 at 10:58 UTC
You need to reset the TARGET file pointer for every gene, otherwise you'll be at end-of-file after the first time through the file. Try `... seek(TARGET, 0, 0); while ($target = <TARGET>) { ...` or read the TARGET data into memory, as you've done with the IF data.
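A minimal sketch of the second option almut mentions: read the lexicons into arrays once, so no file pointer needs rewinding. The sub name `find_hits`, the shortened verb list, and the sample data are illustrative assumptions, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: test every gene/target pair against every sentence.
# The lexicons live in arrays, so the inner loops can be repeated freely.
sub find_hits {
    my ($genes, $targets, $sentences) = @_;
    my $verbs = 'located in|held|localized';   # shortened list for the sketch
    my @hits;
    for my $gene (@$genes) {
        for my $target (@$targets) {
            for my $sentence (@$sentences) {
                # \Q...\E quotes any regex metacharacters in lexicon entries
                if ($sentence =~ /(\Q$gene\E).*($verbs).*(\Q$target\E)/i) {
                    push @hits, join "\t", $1, $2, $3;
                }
            }
        }
    }
    return @hits;
}

my @genes     = ('pfama1', 'ama1');
my @targets   = ('micronemes', 'golgi');
my @sentences = ('In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1');
print "$_\n" for find_hits(\@genes, \@targets, \@sentences);
```

In the real script the arrays would be filled from the open handles, for example with `chomp(my @targets = <TARGET>);`, before the loops start.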
Re^2: Simple RegEX text parser by I-Box (Acolyte) on Dec 30, 2008 at 11:16 UTC
Thanks a lot, almut. The seek command helped.
Re: Simple RegEX text parser by linuxer (Curate) on Dec 30, 2008 at 11:00 UTC
In your lines 15-18 you only read the first line of text.

open (IF, "/home/stanley/Desktop/18048320.txt");
my $text = <IF>;
my @splittext = split (/[.] [A-Z]/, $text);
close (IF);

Better something like this:

open my $if, '<', '/path/to/file' or die "/path/to/file: $!";
# complete file content in one scalar
my $text = do { local $/; <$if> };
close $if;

Then, instead of using `for` to iterate over a filehandle, use `while`:

#for my $gene ( <HANDLE> ) {
while ( my $gene = <HANDLE> ) {
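To make the difference visible, here is a small sketch contrasting a single scalar-context read with the slurp idiom. The helper name `slurp` and the in-memory filehandle are assumptions made so the example runs without any file on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: return everything behind a filehandle as one string.
sub slurp {
    my ($fh) = @_;
    local $/;              # undefined record separator: read to end of file
    return scalar <$fh>;
}

# An in-memory filehandle stands in for the article file.
my $data = "First line.\nSecond line.\n";
open my $fh, '<', \$data or die "in-memory open failed: $!";
my $first = <$fh>;         # scalar context: one line only
seek $fh, 0, 0;            # rewind before slurping
my $all = slurp($fh);
close $fh;

print "one read: $first";
print "slurped:  ", length($all), " characters\n";
```

If the full text really is a single line, as in the original question, both reads return the same string; the slurp only matters once the input contains newlines.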
Re: Simple RegEX text parser by oko1 (Deacon) on Dec 30, 2008 at 15:48 UTC
Since your data set is fairly small, you might want to consider building the regex out of the pieces you're looking for:

#!/usr/bin/perl -w
use strict;

our ($Gene, $Target, $Input) = qw/gene.txt target.txt input.txt/;
open Gene or die "$Gene: $!\n";
open Target or die "$Target: $!\n";
open Input or die "$Input: $!\n";

my $gene = join "|", grep {chomp} <Gene>;
my $target = join "|", grep {chomp} <Target>;
chomp(my $input = <Input>);
close $_ for qw/Gene Target Input/;

my $verbs = 'localizes to|held|located in|localization|translocated to|targets|reaches|exported|export';

# Note corrected 'split' regex
for my $sentence (split /\. [A-Z]/, $input){
    my $found;
    for ($sentence =~ /($gene).*?($verbs).*?($target)/ig){
        print "$_\t";
        $found++;
    }
    print "\n" if $found;
}

Output:

PfAMA1	located in	micronemes
PfROM1	located in	Golgi
AMA1	held	micronemes
AMA1	held	micronemes

-- "Language shapes the way we think, and determines what we can think about." -- B. L. Whorf
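When the alternation is built from lexicon files, two details may be worth handling: entries can contain regex metacharacters, and an entry that is a prefix of another (such as `export` and `exported`) can shadow the longer one. The sketch below is one way to address both; the sub name `alternation` and the sample sentence are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a word list into one alternation, longest entry
# first, with regex metacharacters quoted.
sub alternation {
    my @words = sort { length($b) <=> length($a) || $a cmp $b } @_;
    return join '|', map { quotemeta } @words;
}

my $gene   = alternation(qw/pfrom1 pfama1 ama1 ha-pfrom1/);
my $target = alternation('micronemes', 'golgi', 'secretory vesicle');
my $verbs  = alternation('located in', 'held');

my $sentence = 'HA-PfROM1 is located in the Golgi';
if ($sentence =~ /($gene).*?($verbs).*?($target)/i) {
    print join("\t", $1, $2, $3), "\n";
}
```

For this sentence the three captures are `HA-PfROM1`, `located in`, and `Golgi`.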
Re^2: Simple RegEX text parser by planetscape (Chancellor) on Dec 30, 2008 at 16:02 UTC
oko1 wrote: "you might want to consider building the regex out of the pieces you're looking for"

grinder's most excellent Regexp::Assemble can certainly help with this sort of thing, identifying bits common to multiple words. Take a look at "Why machine-generated solutions will never cease to amaze me" for a sample of what this module can do; you'll be impressed. Oh, and grinder's scratchpad too...

HTH,
planetscape
Re^3: Simple RegEX text parser by ikegami (Patriarch) on Dec 30, 2008 at 19:02 UTC
Regexp::Assemble is to join regexps. Regexp::List is to join strings into a regexp. They're in the same distribution.
Re^4: Simple RegEX text parser by bart (Canon) on Jan 03, 2009 at 19:19 UTC
Re^3: Simple RegEX text parser by oko1 (Deacon) on Dec 31, 2008 at 02:06 UTC
As noted by ikegami, Regexp::Assemble is slightly off the mark... but ++ nevertheless! Thank you for introducing me to a very fun, very useful module. The "Why machine-generated solutions will never cease to amaze me" link is also great. Much appreciated!
Re^2: Simple RegEX text parser by I-Box (Acolyte) on Dec 30, 2008 at 16:42 UTC
Thanks, oko1. Your code was really handy. I always wanted my code to be short and smart.
Re: Simple RegEX text parser by I-Box (Acolyte) on Jan 02, 2009 at 19:19 UTC
Hi guys. I have dropped the plan of embedding a Java application for sentence breaking. Guess what: we have the best in Perl itself, the Lingua::EN::Sentence module. It is a brilliant piece of work, and it works fine for me. Now please have a look at my latest code. I still haven't been able to fix my regex part: the current regex returns the fewest matches. I would really appreciate some help in getting it fixed.

#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_sentences add_acronyms );

# opening the input lexicons

open (GENE,"Gene.txt") || die "Cannot open Gene.txt !!";
open (TARGET, "Target.txt") || die " Cannot open Target.txt !!";

my $target;
my $gene;

# opening fulltext and sentence breaking

open (IF, "Input.txt") || die " Cannot open Fulltext !!";
my $text = <IF>;
my $sentences=get_sentences($text);
close (IF);

# opening output file

open (OF, ">results.txt");

# Parsing Text

my $verbs = "localized|held|located in|localization|translocated to|targets|reaches|exported|export";

while ($gene = <GENE>) {
    chomp $gene;
    seek (TARGET,0,0);
    while ($target = <TARGET>) {
        chomp $target;
        foreach my $sentence (@$sentences) {
            if ($sentence =~ /($gene).+($verbs).+($target)/ig) {
                print OF $1."\t".$2."\t".$3."\t\t".$sentence."\n";
            }
        }
    }
}
close (OF);
close (GENE);
close (TARGET);

As oko1 suggested before, I have built up a regex with "|", but the results were far fewer than those from the initial code posted at the top of this node. I also tried using Regexp::List, but wasn't able to work out a solution. It would be nice if someone could give me a start with a small code involving Regexp::List.

_______MY RESULTS____

AMA1	held	micronemes		For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes.
AMA1	located in	micronemes		In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1.
AMA1	translocated to	subpellicular microtubules		The protein AMA1 is then translocated to the food vacuole, apicoplast, subpellicular microtubules.
PfROM1	located in	micronemes		Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the secretory vesicles, Golgi, and in micronemes (10; L. D. Sibley, unpublished data).
PfROM1	located in	micronemes		PfROM1 was also thought to be located in micronemes (13), based on data localizing a PfROM1 construct that was missing two 5′ exons which encode one of the transmembrane domains of PfROM1 (SI Figs.
AMA1	held	micronemes		For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes.
AMA1	located in	micronemes		In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1.
AMA1	translocated to	subpellicular microtubules		The protein AMA1 is then translocated to the food vacuole, apicoplast, subpellicular microtubules.
PfROM1	located in	micronemes		Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the secretory vesicles, Golgi, and in micronemes (10; L. D. Sibley, unpublished data).
PfROM1	located in	micronemes		PfROM1 was also thought to be located in micronemes (13), based on data localizing a PfROM1 construct that was missing two 5′ exons which encode one of the transmembrane domains of PfROM1 (SI Figs.
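One plausible reason a combined "|" regex reports fewer hits than nested loops: a single match returns at most one gene/verb/target triple per sentence, while per-pair patterns report every target that is present. The sketch below illustrates this with a hard-coded gene and verb; the sentence and the two-entry target list are assumptions chosen for the demonstration, and the thread itself does not confirm that this is the cause.

```perl
use strict;
use warnings;

my $sentence = 'The protein AMA1 is then translocated to the food vacuole, apicoplast';
my @targets  = ('food vacuole', 'apicoplast');

# One combined alternation: at most one triple per sentence.
my $alt = join '|', map { quotemeta } @targets;
my @combined;
push @combined, $3 if $sentence =~ /(ama1).+(translocated to).+($alt)/i;

# One pattern per target: every target present in the sentence is reported.
my @per_target = grep { $sentence =~ /(ama1).+(translocated to).+\Q$_\E/i } @targets;

print "combined:   @combined\n";
print "per target: @per_target\n";
```

Here the combined pattern reports only `apicoplast` (the greedy `.+` backtracks from the end of the sentence), whereas the per-target patterns report both locations.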
Re^2: Simple RegEX text parser by planetscape (Chancellor) on Jan 03, 2009 at 18:40 UTC
As requested, a "small code involving Regexp::List":

#!/usr/bin/perl -w
use strict;
use warnings;
use Regexp::List;

my $l = Regexp::List->new;
my $re = $l->list2re(qw/localized held located localization translocated targets reaches exported export/);
print "$re\n";

Output:

(?-xism:(?=[ehlrt])(?:loca(?:liz(?:ed|ation)|ted)|t(?:ranslocated|argets)|export(?:ed)?|held|reaches))

Note that I removed a few prepositions since Regexp::List works on a list of words.

HTH,
planetscape
Re^2: Simple RegEX text parser by I-Box (Acolyte) on Jan 03, 2009 at 08:31 UTC
Here's another piece of code which I wrote for the same task. This works great, but it takes more resources, which is why I prefer the first code: my input files will contain many entries (~10000 or more), and I have to use this parser on thousands of articles too. Both of these codes are supposed to give the same results for the articles. Try adding more toy sentences to the article and you shall see.

#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_sentences add_acronyms );

# opening the input lexicons

open (GENE,"Gene.txt") || die "Cannot open Gene.txt !!";
open (TARGET, "Target.txt") || die " Cannot open Target.txt !!";

my $target;
my $gene;

# opening fulltext and sentence breaking

open (IF, "Input.txt") || die " Cannot open Fulltext !!";
my $text = <IF>;
my $sentences=get_sentences($text);
close (IF);

# opening output file

open (OF, ">results.txt");

# Parsing Text

my $verbs = "localized|held|located in|localization|translocated to|targets|reaches|exported|export";

while ($gene = <GENE>) {
    chomp $gene;
    seek (TARGET,0,0);
    while ($target = <TARGET>) {
        chomp $target;
        foreach my $sentence (@$sentences) {
            if ($sentence =~ /($gene).+($verbs).+($target)/ig) {
                print OF $1."\t".$2."\t".$3."\t\t".$sentence."\n";
            }
        }
    }
}
close (OF);
close (GENE);
close (TARGET);

This code gave me 26 hits for my trial text, and the code given in my previous post gave only 5 hits. I can't figure out what went wrong. Please help me out.

____MY RESULTS____

PfROM1	localized	subpellicular microtubules		By immunoconfocal microscopy, PfROM1 was localized to a single, thread-like structure on one side of the merozoites that appears to be in close proximity to the subpellicular microtubules.
PfROM1	localized	subpellicular microtubules		HA-PfROM1 was observed to be localized in close proximity to longitudinal subpellicular microtubules of the merozoite (Fig.
PfROM1	localized	rhoptries		Thus, these results indicate that HA-PfROM1 is localized in a subcellular compartment distinct from the micronemes, rhoptries, and dense granules.
PfROM1	localized	rhoptries		HA-PfROM1 is not localized to known apical secretory organelles: rhoptries, micronemes, and dense granules.
PfROM1	located in	Golgi		Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the secretory vesicles, Golgi, and in micronemes (10; L. D. Sibley, unpublished data).
PfROM1	localized	dense granules		Thus, these results indicate that HA-PfROM1 is localized in a subcellular compartment distinct from the micronemes, rhoptries, and dense granules.
PfROM1	localized	dense granules		HA-PfROM1 is not localized to known apical secretory organelles: rhoptries, micronemes, and dense granules.
PfROM1	localized	micronemes		Thus, these results indicate that HA-PfROM1 is localized in a subcellular compartment distinct from the micronemes, rhoptries, and dense granules.
PfROM1	localized	micronemes		HA-PfROM1 is not localized to known apical secretory organelles: rhoptries, micronemes, and dense granules.
PfROM1	localized	micronemes		HA-PfROM1 staining appeared to be colocalized, in part, with the PfAMA1 staining that translocates from micronemes to the parasite surface on release of micronemal contents during invasion (Fig.
PfROM1	located in	micronemes		Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the secretory vesicles, Golgi, and in micronemes (10; L. D. Sibley, unpublished data).
PfROM1	located in	micronemes		PfROM1 was also thought to be located in micronemes (13), based on data localizing a PfROM1 construct that was missing two 5′ exons which encode one of the transmembrane domains of PfROM1 (SI Figs.
PfROM1	located in	secretory vesicle		Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the secretory vesicles, Golgi, and in micronemes (10; L. D. Sibley, unpublished data).
HA-PfROM1	localized	subpellicular microtubules		HA-PfROM1 was observed to be localized in close proximity to longitudinal subpellicular microtubules of the merozoite (Fig.
HA-PfROM1	localized	rhoptries		Thus, these results indicate that HA-PfROM1 is localized in a subcellular compartment distinct from the micronemes, rhoptries, and dense granules.
HA-PfROM1	localized	rhoptries		HA-PfROM1 is not localized to known apical secretory organelles: rhoptries, micronemes, and dense granules.
HA-PfROM1	localized	dense granules		Thus, these results indicate that HA-PfROM1 is localized in a subcellular compartment distinct from the micronemes, rhoptries, and dense granules.
HA-PfROM1	localized	dense granules		HA-PfROM1 is not localized to known apical secretory organelles: rhoptries, micronemes, and dense granules.
HA-PfROM1	localized	micronemes		Thus, these results indicate that HA-PfROM1 is localized in a subcellular compartment distinct from the micronemes, rhoptries, and dense granules.
HA-PfROM1	localized	micronemes		HA-PfROM1 is not localized to known apical secretory organelles: rhoptries, micronemes, and dense granules.
HA-PfROM1	localized	micronemes		HA-PfROM1 staining appeared to be colocalized, in part, with the PfAMA1 staining that translocates from micronemes to the parasite surface on release of micronemal contents during invasion (Fig.
AMA1	translocated to	apicoplast		The protein AMA1 is then translocated to the food vacuole, apicoplast, subpellicular microtubules.
AMA1	translocated to	subpellicular microtubules		The protein AMA1 is then translocated to the food vacuole, apicoplast, subpellicular microtubules.
AMA1	held	micronemes		For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes.
AMA1	located in	micronemes		In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1.
AMA1	translocated to	food vacuole		The protein AMA1 is then translocated to the food vacuole, apicoplast, subpellicular microtubules.
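Independently of which script produced which result, one detail in the posted loops can silently drop hits: `m//g` in scalar context (as in `if ($sentence =~ /.../ig)`) remembers `pos($sentence)` after a successful match, and the next `/g` test on the same string resumes from there instead of from the start. The sketch below shows the effect; the sub name `count_matches` and the sample patterns are illustrative assumptions, and whether this explains the 26-versus-5 difference is not established in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many patterns match a sentence.
# With $use_g true, each test uses m//g in scalar context, as in the posted loops.
sub count_matches {
    my ($sentence, $use_g, @patterns) = @_;
    my $count = 0;
    for my $pat (@patterns) {
        if ($use_g) {
            $count++ if $sentence =~ /$pat/ig;   # resumes from pos($sentence)
        } else {
            $count++ if $sentence =~ /$pat/i;    # always starts at offset 0
        }
    }
    return $count;
}

my $sentence = 'AMA1 is then translocated to the food vacuole, apicoplast';
my @patterns = ('ama1.+translocated to.+apicoplast',
                'ama1.+translocated to.+food vacuole');

print 'with /g:    ', count_matches($sentence, 1, @patterns), "\n";
print 'without /g: ', count_matches($sentence, 0, @patterns), "\n";
```

With `/g` only the first pattern is counted, because the second test starts at the end of the previous match; without `/g` both patterns match. Dropping the `g` flag from a plain yes/no test avoids the problem.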
Re: Simple RegEX text parser by I-Box (Acolyte) on Jan 02, 2009 at 09:33 UTC
Hello guys. I have got a new problem. As you can see in my code above, the text-splitting part was not working well. Now I would like to know more about Perl modules for text breaking/splitting. It would be really helpful if you could give me suggestions. There is also a Java application known as SentParBreaker for sentence breaking. It works great. I have got the .jar file from the site, but I don't know how to call a Java function from Perl. I have read about Inline::Java, but I think that for that module to work I need the source code of my application, which of course I don't have. Could anyone please tell me how to invoke a Java application from Perl and use the input?
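Part of the splitting trouble comes from the pattern `/[.] [A-Z]/` itself: `split` discards whatever the pattern matches, so the full stop and the first capital letter of the next sentence are lost. A naive sketch using lookarounds keeps both; the sub name `split_sentences` is an assumption, and this is not a substitute for a proper sentence splitter.

```perl
use strict;
use warnings;

# Hypothetical helper: naive sentence splitter using lookarounds, so neither
# the full stop nor the following capital letter is consumed.
sub split_sentences {
    my ($text) = @_;
    return split /(?<=\.)\s+(?=[A-Z])/, $text;
}

my $text = 'PfAMA1 is located in micronemes. AMA1 is held in the micronemes.';
print "[$_]\n" for split_sentences($text);

# The original pattern eats the capital letter that starts each later sentence.
print "[$_]\n" for split /[.] [A-Z]/, $text;
```

This still breaks on initials such as "L. D. Sibley", which is the kind of case a dedicated module is meant to handle.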
Re^2: Simple RegEX text parser by ikegami (Patriarch) on Jan 02, 2009 at 10:18 UTC
I-Box wrote: "I have read about Inline::Java, but I think that for that module to work I need the source code of my application."

Nope, see the section called "STUDYING". By the way, the language is called Perl (not PERL).
Re^3: Simple RegEX text parser by I-Box (Acolyte) on Jan 02, 2009 at 10:45 UTC
Thanks, ikegami, for pointing out my mistake. I'm sorry for that. I have read the "STUDYING" section, and yes, it should help me. But I am totally new to Java, so it will take me some time to understand the terms. Meanwhile, I want my code to be fully Perl-oriented, i.e. to minimise the use of applications in other languages. So it would be a great help if you could suggest some Perl modules for the same task.
Re^2: Simple RegEX text parser by kevk (Initiate) on Aug 13, 2009 at 10:52 UTC
Hi I-Box, do you still have the SentParBreaker .jar file? I would like to try it out, but the only download link I can find for it is broken. Any chance you could YSI it for me or something? Many thanks in advance!