
Hi guys, I have dropped the plan of embedding a Java application for sentence breaking. Guess what: we have the best in Perl itself, the Lingua::EN::Sentence module. It is brilliant work and it works fine for me. Now please have a look at my latest code. I still haven't been able to fix the regex part: the current regex returns the fewest matches so far. I would really appreciate some help getting it fixed.

#!/usr/bin/perl
use strict;
use warnings;
use Lingua::EN::Sentence qw( get_sentences add_acronyms );

# opening the input lexicons
open (GENE, "Gene.txt") || die "Cannot open Gene.txt: $!";
open (TARGET, "Target.txt") || die "Cannot open Target.txt: $!";

my $target;
my $gene;

# opening fulltext and sentence breaking
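open (IF, "Input.txt") || die "Cannot open Input.txt: $!";
# Slurp the whole file; a plain <IF> in scalar context reads only the first line
my $text = do { local $/; <IF> };
my $sentences = get_sentences($text);

close (IF);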

# opening output file

open (OF, ">results.txt") || die "Cannot open results.txt: $!";

# Parsing Text
my $verbs = "localized|held|located in|localization|translocated to|targets|reaches|exported|export";
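while ($gene = <GENE>) {
    chomp $gene;
    seek (TARGET,0,0);
    while ($target = <TARGET>) {
        chomp $target;

        foreach my $sentence (@$sentences) {
            # No /g here: in scalar context it leaves pos() set on the sentence,
            # so the next gene/target pair would start matching mid-sentence.
            # \Q...\E treats the lexicon entries as literal text.
            if ($sentence =~ /(\Q$gene\E).+($verbs).+(\Q$target\E)/i) {
                print OF $1."\t".$2."\t".$3."\t\t".$sentence."\n";
            }
        }
    }
}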
close (OF);
close (GENE);
close (TARGET);
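One possible cause of the missing matches in the loop above is the /g modifier inside the if condition. In scalar context, a successful /g match leaves pos() set on the sentence, so the next gene/target pair tried against the same sentence starts searching from that offset instead of from the beginning. The following is a minimal sketch of that effect; the sentence and the target list are made up for illustration and are not taken from the real input files.

```perl
use strict;
use warnings;

# Hypothetical data, standing in for one sentence and two Target.txt entries
my @sentences = ('AMA1 is held in the micronemes and later exported to the rhoptries.');
my @targets   = ('rhoptries', 'micronemes');

my (@with_g, @without_g);

for my $target (@targets) {
    for my $sentence (@sentences) {
        # With /g: a successful match leaves pos() set on $sentences[0],
        # so the next target is searched for only after that position.
        push @with_g, $target
            if $sentence =~ /(AMA1).+(held|exported).+(\Q$target\E)/ig;
    }
}

for my $target (@targets) {
    for my $sentence (@sentences) {
        # Without /g: every match starts at the beginning of the sentence.
        push @without_g, $target
            if $sentence =~ /(AMA1).+(held|exported).+(\Q$target\E)/i;
    }
}

print "with /g:    @with_g\n";      # with /g:    rhoptries
print "without /g: @without_g\n";   # without /g: rhoptries micronemes
```

Here the second target is lost only because the first match moved pos() to the end of the sentence. Dropping /g from the condition avoids this.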

As Oko1 suggested earlier, I have built a regex using "|" alternation, but it returns far fewer results than the initial code posted in my first thread in this node. I also tried Regexp::List but wasn't able to work out a solution. It would be nice if someone could give me a start with a small piece of code using Regexp::List.
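For reference, here is a rough sketch of the "|" approach using only core Perl rather than Regexp::List. It builds one alternation per lexicon (escaped with quotemeta, longer entries first) and matches each sentence once. The lists, the sentences, and the helper name alternation() are hypothetical stand-ins, not the real Gene.txt and Target.txt contents. Note that a single match per sentence returns at most one gene/verb/target triple, which is one reason this style can report fewer rows than the nested loops.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for Gene.txt, Target.txt and the verb list
my @genes   = ('AMA1', 'PfROM1');
my @targets = ('micronemes', 'food vacuole', 'Golgi');
my @verbs   = ('localized', 'held', 'located in', 'translocated to', 'exported', 'export');

# Build a case-insensitive alternation: escape metacharacters and
# put longer entries first so "exported" is tried before "export".
sub alternation {
    my @words  = sort { length($b) <=> length($a) } @_;
    my $joined = join '|', map { quotemeta($_) } @words;
    return qr/$joined/i;
}

my $gene_re   = alternation(@genes);
my $verb_re   = alternation(@verbs);
my $target_re = alternation(@targets);

# Hypothetical sentences
my @sentences = (
    'In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1.',
    'PfROM1 was exported to the Golgi.',
    'Nothing relevant is mentioned here.',
);

my @rows;
for my $sentence (@sentences) {
    # Non-greedy gaps pick the nearest verb and target after the gene
    if ($sentence =~ /($gene_re).+?($verb_re).+?($target_re)/) {
        push @rows, [$1, $2, $3];
    }
}

print join("\t", @$_), "\n" for @rows;
```

To get every gene/target pair per sentence with this style, the match would have to be repeated per sentence (for example in a while loop with /g), which is a different trade-off from the nested loops in the script above.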


_______MY RESULTS____

AMA1    held    micronemes        For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes.
AMA1    located in    micronemes        In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1.
AMA1    translocated to    subpellicular microtubules        The protein AMA1 is then translocated to the food vacuole, apicoplast, subpellicular microtubules.
PfROM1    located in    micronemes        Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the secretory vesicles, Golgi, and in micronemes (10; L. D. Sibley, unpublished data).
PfROM1    located in    micronemes        PfROM1 was also thought to be located in micronemes (13), based on data localizing a PfROM1 construct that was missing two 5′ exons which encode one of the transmembrane domains of PfROM1 (SI Figs.
AMA1    held    micronemes        For example, Plasmodium falciparum apical membrane antigen 1 (AMA1) is held in the micronemes in merozoites inside of erythrocytes.
AMA1    located in    micronemes        In merozoites, PfAMA1 is located in micronemes and thus separated from PfROM1.
AMA1    translocated to    subpellicular microtubules        The protein AMA1 is then translocated to the food vacuole, apicoplast, subpellicular microtubules.
PfROM1    located in    micronemes        Toxoplasma gondii ROM1, the orthologue of PfROM1, is located in the secretory vesicles, Golgi, and in micronemes (10; L. D. Sibley, unpublished data).
PfROM1    located in    micronemes        PfROM1 was also thought to be located in micronemes (13), based on data localizing a PfROM1 construct that was missing two 5′ exons which encode one of the transmembrane domains of PfROM1 (SI Figs.

In reply to Re: Simple RegEX text parser by I-Box
in thread Simple RegEX text parser by I-Box
