Hi guys I have dropped the plan of embedding a java application for sentence breaking. Guess what..we have the best in Perl itself...Lingua::EN:Sentence module..a brilliant work..it works fine for me..now please have a look on my latest code. I still havent been able to fix my regex part...the current regex returns with least no: of matches...I would really appreciate some help in getting it fixed..

#! usr/bin/perl use strict; use warnings; use Lingua::EN::Sentence qw( get_sentences add_acronyms ); # opening the input lexicons open (GENE,"Gene.txt") || die "Cannot open Gene.txt !!"; open (TARGET, "Target.txt") || die " Cannot open Target.txt !!"; my $target; my $gene; # opening fulltext and sentence breaking open (IF, "Input.txt") || die " Cannot open Fulltext !!"; my $text = <IF>; my $sentences=get_sentences($text); close (IF); # opening output file open (OF, ">results.txt"); # Parsing Text my $verbs = "localized|held|located in|localization|translocated to|ta +rgets|reaches|exported|export"; while ($gene = <GENE>) { chomp $gene; seek (TARGET,0,0); while ($target = <TARGET>) { chomp $target; foreach my $sentence (@$sentences) { if ($sentence =~ /($gene).+($verbs).+($target)/ig) { print OF $1."\t".$2."\t".$3."\t\t".$sentence."\n"; } } } } close (OF); close (GENE); close (TARGET);

as Oko1 has suggested before ...I have built up a regex with "|" ...but the results were much less than those by my intial code posted in my first thread in this node. I also tried using Regexp::List..but wasnt able to work out a solution...would be nice if someone could give me start with a small code involving Regexp::List

_______MY RESULTS____ AMA1 held micronemes For example, Plasmodium falciparum a +pical membrane antigen 1 (AMA1) is held in the micronemes in merozoit +es inside of erythrocytes. AMA1 located in micronemes In merozoites, PfAMA1 is locat +ed in micronemes and thus separated from PfROM1. AMA1 translocated to subpellicular microtubules The prote +in AMA1 is then translocated to the food vacuole, apicoplast, subpell +icular microtubules. PfROM1 located in micronemes Toxoplasma gondii ROM1, the +orthologue of PfROM1, is located in the secretory vesicles, Golgi, an +d in micronemes (10; L. D. Sibley, unpublished data). PfROM1 located in micronemes PfROM1 was also thought to b +e located in micronemes (13), based on data localizing a PfROM1 const +ruct that was missing two 5&#8242; exons which encode one of the tran +smembrane domains of PfROM1 (SI Figs. AMA1 held micronemes For example, Plasmodium falciparum a +pical membrane antigen 1 (AMA1) is held in the micronemes in merozoit +es inside of erythrocytes. AMA1 located in micronemes In merozoites, PfAMA1 is locat +ed in micronemes and thus separated from PfROM1. AMA1 translocated to subpellicular microtubules The prote +in AMA1 is then translocated to the food vacuole, apicoplast, subpell +icular microtubules. PfROM1 located in micronemes Toxoplasma gondii ROM1, the +orthologue of PfROM1, is located in the secretory vesicles, Golgi, an +d in micronemes (10; L. D. Sibley, unpublished data). PfROM1 located in micronemes PfROM1 was also thought to b +e located in micronemes (13), based on data localizing a PfROM1 const +ruct that was missing two 5&#8242; exons which encode one of the tran +smembrane domains of PfROM1 (SI Figs.

In reply to Re: Simple RegEX text parser by I-Box
in thread Simple RegEX text parser by I-Box

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.