I-Box has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys. I am totally new to Perl; I started working with it a week ago, so please excuse any stupidity along the way. Glad to be a part of this monastery. I'm trying to build a simple regex text parser. I will have three input files. The first, GENE, contains a list of gene names. The second, TARGET, contains a list of protein locations inside the cell. The third, IF, is my input full text, which I have turned into a single line. While parsing, I look for a GENE entry, a verb, and a TARGET entry in a single line of the text. If all three are present, I print them out.

#! usr/bin/perl

use strict;
use warnings;

# opening the input lexicons

open (GENE,"/home/stanley/Desktop/Gene.txt");
open (TARGET, "/home/stanley/Desktop/Target.txt");
my $target;
my $gene;

# opening fulltext

open (IF, "/home/stanley/Desktop/18048320.txt");
my $text = <IF>;
my @splittext = split (/[.] [A-Z]/, $text);
close (IF);

# opening output file

open (OF, ">/home/stanley/Desktop/Local.txt");

# Parsing Text

for $gene (<GENE>) {
    chomp $gene;
    while ($target = <TARGET>) {
        chomp $target;
        foreach my $line (@splittext) {
            if ($line =~ /.+?($gene).*(localizes to|held|located in|localization|translocated to|targets|reaches|exported|export).*($target).+?/ig) {
                print OF $1."\t".$2."\t".$3."\n";
            }
        }
    }
}

close (OF);
close (GENE);
close (TARGET);

____DATA___

gene.txt

pfrom1
pfama1
ama1
ha-pfrom1

target.txt

apicoplast
mitochondrion
rhoptry
rhoptries
golgi
dense granules
parasitophorous vacuole
micronemes
food vacuole
secretory vesicle
host cell

input txt

Compartmentalization of proteins into subcellular organelles in eukaryotic cells is a fundamental mechanism of regulating complex cellular functions. Many proteins of Plasmodium falciparum merozoites involved in invasion are compartmentalized into apical organelles. We have identified a new merozoite organelle that contains P. falciparum rhomboid-1 (PfROM1), a protease that cleaves the transmembrane regions of proteins involved in invasion. By immunoconfocal microscopy, PfROM1 was localized to a single, thread-like structure on one side of the merozoites that appears to be in close proximity to the subpellicular microtubules. Using antibodies to the merozoite surface protein-1 (MSP1), a protein that is located in the merozoite plasma membrane (Fig. 3A), we demonstrated that HA-PfROM1 staining is intracellular, not colocalizing with the plasma membrane. In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1. Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the Golgi, secretory vesicles, and in micronemes (10; L. D. Sibley, unpublished data). For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes. For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes.

This is just a small part of the input file. What happens is that the parser checks only the first GENE entry and then quits the loop. What I want is that, for each GENE entry, it takes all the TARGET possibilities one by one and checks for the pattern in each line of the text.

Replies are listed 'Best First'.
Re: Simple RegEX text parser
by almut (Canon) on Dec 30, 2008 at 10:58 UTC

    You need to reset the TARGET file pointer for every gene, otherwise you'll be at end-of-file after the first time through the file. Try

    ...
    seek(TARGET, 0, 0);
    while ($target = <TARGET>) {
    ...

    or read the TARGET data into memory, as you've done with the IF data.
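
The second suggestion above, reading the TARGET data into memory, could look roughly like the sketch below. It is only an illustration: read_lexicon and each_pair are hypothetical helper names that do not appear in the original code, and the file names in the usage comment are placeholders.

```perl
use strict;
use warnings;

# Read every non-empty line of a lexicon file into a list, once.
sub read_lexicon {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    chomp(my @entries = <$fh>);
    close $fh;
    return grep { length $_ } @entries;
}

# Visit every (gene, target) pair. No seek is needed, because an
# array can be iterated any number of times.
sub each_pair {
    my ($genes, $targets, $callback) = @_;
    for my $gene (@$genes) {
        for my $target (@$targets) {
            $callback->($gene, $target);
        }
    }
}

# Usage (file names are placeholders):
#   my @genes   = read_lexicon('Gene.txt');
#   my @targets = read_lexicon('Target.txt');
#   each_pair(\@genes, \@targets, sub { print "$_[0]\t$_[1]\n" });
```

Compared with seek, this reads each lexicon from disk only once, at the cost of holding both lists in memory.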

      Thanks a lot almut. The seek command helped...

Re: Simple RegEX text parser
by linuxer (Curate) on Dec 30, 2008 at 11:00 UTC

    In your lines 15-18 you only read the first line of text.

    open (IF, "/home/stanley/Desktop/18048320.txt");
    my $text = <IF>;
    my @splittext = split (/[.] [A-Z]/, $text);
    close (IF);

    Better something like this:

    open my $if, '<', '/path/to/file' or die "/path/to/file: $!";
    # complete file content in one scalar
    my $text = do { local $/; <$if> };
    close $if;

    Then, instead of using for to iterate over a filehandle, use while:

    #for my $gene ( <HANDLE> ) {
    while ( my $gene = <HANDLE> ) {

Re: Simple RegEX text parser
by oko1 (Deacon) on Dec 30, 2008 at 15:48 UTC

    Since your data set is fairly small, you might want to consider building the regex out of the pieces you're looking for:

    #!/usr/bin/perl -w
    use strict;

    our ($Gene, $Target, $Input) = qw/gene.txt target.txt input.txt/;

    open Gene or die "$Gene: $!\n";
    open Target or die "$Target: $!\n";
    open Input or die "$Input: $!\n";

    my $gene = join "|", grep {chomp} <Gene>;
    my $target = join "|", grep {chomp} <Target>;
    chomp(my $input = <Input>);

    close $_ for qw/Gene Target Input/;

    my $verbs = 'localizes to|held|located in|localization|translocated to|targets|reaches|exported|export';

    # Note corrected 'split' regex
    for my $sentence (split /\. [A-Z]/, $input){
        my $found;
        for ($sentence =~ /($gene).*?($verbs).*?($target)/ig){
            print "$_\t";
            $found++;
        }
        print "\n" if $found;
    }

    Output:

    PfAMA1	located in	micronemes
    PfROM1	located in	Golgi
    AMA1	held	micronemes
    AMA1	held	micronemes

    --
    "Language shapes the way we think, and determines what we can think about."
    -- B. L. Whorf
        Regexp::Assemble is to join regexps. Regexp::List is to join strings into a regexp. They're in the same distribution.
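
For small lists, joining strings into a regexp can also be sketched with core Perl only. The helper below is a hypothetical illustration (alternation is not a function from either module): it escapes each string with quotemeta and puts longer strings first, so that at any given position "exported" is tried before its prefix "export".

```perl
use strict;
use warnings;

# Build one case-insensitive alternation from a list of literal strings.
# quotemeta escapes characters such as '-' and ' ', and sorting longest
# first lets 'exported' win over its prefix 'export'.
sub alternation {
    my @words  = sort { length($b) <=> length($a) || $a cmp $b } @_;
    my $joined = join '|', map { quotemeta($_) } @words;
    return qr/(?:$joined)/i;
}

my $verb_re  = alternation(qw(export exported held));
my $sentence = 'AMA1 is exported to the surface';
my ($hit)    = $sentence =~ /($verb_re)/;
print "$hit\n";    # prints "exported"
```

Unlike a plain join "|", this also keeps multi-word entries such as "located in" and names such as "ha-pfrom1" from being read as regex syntax.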

      Thanks oko1, your code was really handy. I always wanted my code to be short and smart.

Re: Simple RegEX text parser
by I-Box (Acolyte) on Jan 02, 2009 at 19:19 UTC

    Hi guys. I have dropped the plan of embedding a Java application for sentence breaking. Guess what: we have the best in Perl itself, the Lingua::EN::Sentence module. It is brilliant work, and it works fine for me. Now please have a look at my latest code. I still haven't been able to fix my regex part: the current regex returns the fewest matches. I would really appreciate some help in getting it fixed.

    #! usr/bin/perl

    use strict;
    use warnings;
    use Lingua::EN::Sentence qw( get_sentences add_acronyms );

    # opening the input lexicons

    open (GENE,"Gene.txt") || die "Cannot open Gene.txt !!";
    open (TARGET, "Target.txt") || die " Cannot open Target.txt !!";
    my $target;
    my $gene;

    # opening fulltext and sentence breaking

    open (IF, "Input.txt") || die " Cannot open Fulltext !!";
    my $text = <IF>;
    my $sentences=get_sentences($text);
    close (IF);

    # opening output file

    open (OF, ">results.txt");

    # Parsing Text

    my $verbs = "localized|held|located in|localization|translocated to|targets|reaches|exported|export";

    while ($gene = <GENE>) {
        chomp $gene;
        seek (TARGET,0,0);
        while ($target = <TARGET>) {
            chomp $target;
            foreach my $sentence (@$sentences) {
                if ($sentence =~ /($gene).+($verbs).+($target)/ig) {
                    print OF $1."\t".$2."\t".$3."\t\t".$sentence."\n";
                }
            }
        }
    }

    close (OF);
    close (GENE);
    close (TARGET);

    As oko1 suggested earlier, I built a regex with "|", but it returned far fewer results than the initial code posted at the top of this node. I also tried Regexp::List but wasn't able to work out a solution. It would be nice if someone could give me a start with a small piece of code involving Regexp::List.

    _______MY RESULTS____

    AMA1	held	micronemes		For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes.
    AMA1	located in	micronemes		In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1.
    AMA1	translocated to	subpellicular microtubules		The protein AMA1 is then translocated to the food vacuole, apicoplast, subpellicular microtubules.
    PfROM1	located in	micronemes		Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the secretory vesicles, Golgi, and in micronemes (10; L. D. Sibley, unpublished data).
    PfROM1	located in	micronemes		PfROM1 was also thought to be located in micronemes (13), based on data localizing a PfROM1 construct that was missing two 5′ exons which encode one of the transmembrane domains of PfROM1 (SI Figs.

      As requested, a "small code involving Regexp::List":

      #!/usr/bin/perl -w
      use strict;
      use warnings;
      use Regexp::List;

      my $l = Regexp::List->new;
      my $re = $l->list2re(qw/localized held located localization translocated targets reaches exported export/);
      print "$re\n";

      Output:

      (?-xism:(?=[ehlrt])(?:loca(?:liz(?:ed|ation)|ted)|t(?:ranslocated|argets)|export(?:ed)?|held|reaches))

      Note that I removed a few prepositions since Regexp::List works on a list of words.

      HTH,

      planetscape

      Here's another piece of code which I wrote for the same task. It works great but takes more resources, which is why I prefer the first code. My input files will contain many entries (~10000 or more), and I have to use this parser on thousands of articles too.

      Both of these programs are supposed to give the same results for the articles. Try adding more toy sentences to the article and you will see.

      #! usr/bin/perl

      use strict;
      use warnings;
      use Lingua::EN::Sentence qw( get_sentences add_acronyms );

      # opening the input lexicons

      open (GENE,"Gene.txt") || die "Cannot open Gene.txt !!";
      open (TARGET, "Target.txt") || die " Cannot open Target.txt !!";
      my $target;
      my $gene;

      # opening fulltext and sentence breaking

      open (IF, "Input.txt") || die " Cannot open Fulltext !!";
      my $text = <IF>;
      my $sentences=get_sentences($text);
      close (IF);

      # opening output file

      open (OF, ">results.txt");

      # Parsing Text

      my $verbs = "localized|held|located in|localization|translocated to|targets|reaches|exported|export";

      while ($gene = <GENE>) {
          chomp $gene;
          seek (TARGET,0,0);
          while ($target = <TARGET>) {
              chomp $target;
              foreach my $sentence (@$sentences) {
                  if ($sentence =~ /($gene).+($verbs).+($target)/ig) {
                      print OF $1."\t".$2."\t".$3."\t\t".$sentence."\n";
                  }
              }
          }
      }

      close (OF);
      close (GENE);
      close (TARGET);

      This code gave me 26 hits for my trial text, and the code given in the previous post gave only 5 hits. I can't figure out what went wrong. Please help me out.

      ____MY RESULTS____

      PfROM1	localized	subpellicular microtubules		By immunoconfocal microscopy, PfROM1 was localized to a single, thread-like structure on one side of the merozoites that appears to be in close proximity to the subpellicular microtubules.
      PfROM1	localized	subpellicular microtubules		HA-PfROM1 was observed to be localized in close proximity to longitudinal subpellicular microtubules of the merozoite (Fig.
      PfROM1	localized	rhoptries		Thus, these results indicate that HA-PfROM1 is localized in a subcellular compartment distinct from the micronemes, rhoptries, and dense granules.
      PfROM1	localized	rhoptries		HA-PfROM1 is not localized to known apical secretory organelles: rhoptries, micronemes, and dense granules.
      PfROM1	located in	Golgi		Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the secretory vesicles, Golgi, and in micronemes (10; L. D. Sibley, unpublished data).
      PfROM1	localized	dense granules		Thus, these results indicate that HA-PfROM1 is localized in a subcellular compartment distinct from the micronemes, rhoptries, and dense granules.
      PfROM1	localized	dense granules		HA-PfROM1 is not localized to known apical secretory organelles: rhoptries, micronemes, and dense granules.
      PfROM1	localized	micronemes		Thus, these results indicate that HA-PfROM1 is localized in a subcellular compartment distinct from the micronemes, rhoptries, and dense granules.
      PfROM1	localized	micronemes		HA-PfROM1 is not localized to known apical secretory organelles: rhoptries, micronemes, and dense granules.
      PfROM1	localized	micronemes		HA-PfROM1 staining appeared to be colocalized, in part, with the PfAMA1 staining that translocates from micronemes to the parasite surface on release of micronemal contents during invasion (Fig.
      PfROM1	located in	micronemes		Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the secretory vesicles, Golgi, and in micronemes (10; L. D. Sibley, unpublished data).
      PfROM1	located in	micronemes		PfROM1 was also thought to be located in micronemes (13), based on data localizing a PfROM1 construct that was missing two 5′ exons which encode one of the transmembrane domains of PfROM1 (SI Figs.
      PfROM1	located in	secretory vesicle		Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the secretory vesicles, Golgi, and in micronemes (10; L. D. Sibley, unpublished data).
      HA-PfROM1	localized	subpellicular microtubules		HA-PfROM1 was observed to be localized in close proximity to longitudinal subpellicular microtubules of the merozoite (Fig.
      HA-PfROM1	localized	rhoptries		Thus, these results indicate that HA-PfROM1 is localized in a subcellular compartment distinct from the micronemes, rhoptries, and dense granules.
      HA-PfROM1	localized	rhoptries		HA-PfROM1 is not localized to known apical secretory organelles: rhoptries, micronemes, and dense granules.
      HA-PfROM1	localized	dense granules		Thus, these results indicate that HA-PfROM1 is localized in a subcellular compartment distinct from the micronemes, rhoptries, and dense granules.
      HA-PfROM1	localized	dense granules		HA-PfROM1 is not localized to known apical secretory organelles: rhoptries, micronemes, and dense granules.
      HA-PfROM1	localized	micronemes		Thus, these results indicate that HA-PfROM1 is localized in a subcellular compartment distinct from the micronemes, rhoptries, and dense granules.
      HA-PfROM1	localized	micronemes		HA-PfROM1 is not localized to known apical secretory organelles: rhoptries, micronemes, and dense granules.
      HA-PfROM1	localized	micronemes		HA-PfROM1 staining appeared to be colocalized, in part, with the PfAMA1 staining that translocates from micronemes to the parasite surface on release of micronemal contents during invasion (Fig.
      AMA1	translocated to	apicoplast		The protein AMA1 is then translocated to the food vacuole, apicoplast, subpellicular microtubules.
      AMA1	translocated to	subpellicular microtubules		The protein AMA1 is then translocated to the food vacuole, apicoplast, subpellicular microtubules.
      AMA1	held	micronemes		For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes.
      AMA1	located in	micronemes		In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1.
      AMA1	translocated to	food vacuole		The protein AMA1 is then translocated to the food vacuole, apicoplast, subpellicular microtubules.

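
One possible reason a single combined pattern reports fewer hits than the nested loops: matches found with /g do not overlap, so a sentence that names several targets after one verb yields only one triple, while the nested loops test every gene/target pair separately. The sketch below is one way to keep the combined alternations and still list every pair. It is an assumption-laden illustration, not the code from this thread: triples is a hypothetical helper, and the three patterns are shortened versions of the lexicons above.

```perl
use strict;
use warnings;

# Return every [gene, verb, target] triple in a sentence where a gene
# appears before the verb and a target appears after it.
sub triples {
    my ($sentence, $gene_re, $verb_re, $target_re) = @_;
    my @found;
    while ($sentence =~ /($verb_re)/g) {
        my $verb = $1;
        # Save the text on each side of the verb before running further matches.
        my $before = substr($sentence, 0, $-[0]);
        my $after  = substr($sentence, $+[0]);
        my @genes   = $before =~ /($gene_re)/g;
        my @targets = $after  =~ /($target_re)/g;
        for my $gene (@genes) {
            push @found, [$gene, $verb, $_] for @targets;
        }
    }
    return @found;
}

my $sentence = 'The protein AMA1 is then translocated to the food vacuole, apicoplast, subpellicular microtubules.';
my @hits = triples(
    $sentence,
    qr/(?:ha-pfrom1|pfama1|pfrom1|ama1)/i,
    qr/(?:translocated to|located in|held)/i,
    qr/(?:food vacuole|apicoplast|micronemes)/i,
);
print join("\t", @$_), "\n" for @hits;    # two lines: food vacuole, then apicoplast
```

Each sentence is scanned once per verb occurrence rather than once per gene/target pair. Overlapping names such as PfROM1 inside HA-PfROM1 are reported once here, so the hit count can still differ from the nested-loop version.
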
Re: Simple RegEX text parser
by I-Box (Acolyte) on Jan 02, 2009 at 09:33 UTC

    Hello guys. I have got a new problem. As you can see in my code above, the text-splitting part was not working well. I would like to know about any Perl modules for text breaking/splitting. It would be really helpful if you could give me suggestions.

    There is also a Java application known as SentParBreaker for sentence breaking. It works great. I have got the .jar file from the site, but I don't know how to call a Java function from Perl.

    I have read about Inline::Java, but I think that for that module to work I need the source code of my application, which of course I don't have. Could anyone please tell me how to invoke a Java application from Perl and use the input?
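
Independently of Inline::Java, a .jar can usually be run as a separate process and its output read back through a pipe. The sketch below shows that general technique only. The helper names are hypothetical, and so are the jar file name and its arguments in the usage comment: SentParBreaker's actual command line is not documented in this thread.

```perl
use strict;
use warnings;

# Build the argument list for running a jar as a separate process.
# Everything after the jar path is passed to the Java program untouched.
sub java_command {
    my ($jar, @args) = @_;
    return ('java', '-jar', $jar, @args);
}

# Run a command and return its standard output as a list of lines.
# The list form of open bypasses the shell, so paths with spaces are safe.
sub run_and_capture {
    my @command = @_;
    open my $out, '-|', @command or die "Cannot run $command[0]: $!";
    chomp(my @lines = <$out>);
    close $out or die "$command[0] exited with status $?";
    return @lines;
}

# Hypothetical usage; the jar name and its arguments are assumptions:
#   my @sentences = run_and_capture(java_command('SentParBreaker.jar', 'Input.txt'));
```

This needs no Java source code, only a java executable on the PATH, but every call pays the start-up cost of a new JVM.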

      I have read about Inline::Java, but I think that for that module to work I need the source code of my application.

      Nope, see the section called "STUDYING".

      By the way, the language is called Perl (not PERL).

        Thanks Ikegami for pointing out my mistake. I'm sorry for that. I have read the "STUDYING" section, and yes, it should help me. But I am totally new to Java, so I will need some time to understand the terms. Meanwhile, I want my code to be fully Perl-oriented, i.e. to minimise the use of applications in other languages. So it would be a great help if you could suggest some Perl modules for the same task.
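
As a pure-Perl starting point, the splitting problem in the original split /[.] [A-Z]/ can be sketched with lookarounds: that pattern consumes the capital letter that begins the next sentence, while a lookahead only checks for it. split_sentences below is a hypothetical helper and a deliberately naive one. It will still break on abbreviations such as "L. D. Sibley", which a dedicated module like Lingua::EN::Sentence is designed to handle.

```perl
use strict;
use warnings;

# Naive sentence splitter: break after '.', '!' or '?' when whitespace and
# a capital letter follow. The lookarounds match positions only, so no
# character of either sentence is consumed.
sub split_sentences {
    my ($text) = @_;
    return split /(?<=[.!?])\s+(?=[A-Z])/, $text;
}

my $text = 'PfROM1 was localized to a single structure. '
         . 'In merozoites, PfAMA1 is located in micronemes.';
print "$_\n" for split_sentences($text);
```

With the original pattern, the second sentence would come back as "n merozoites, ..." because the "I" is part of the delimiter.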

      Hi I-Box, do you still have the SentParBreaker .jar file? I would like to try it out, but the only download link I can find for it is broken. Any chance you could YSI it for me or something? Many thanks in advance!