pdahal has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to list the protein names that appear in PubMed abstracts. I made a file containing the list of proteins I want to search for. The code runs, but I have a problem. For example, I have the protein name "NOV" in the keyword list. Even if the abstract doesn't contain NOV, it is reported as a match whenever the abstract contains words like "novel". How can I solve this problem?

Re: searching keywords
by Corion (Patriarch) on Feb 28, 2017 at 08:31 UTC

    You don't show us the relevant code or a Short Self-Contained Code Example, so I can only guess.

    Most likely you are doing a simple substring match instead of looking for word boundaries. See perlre on word boundaries, \b.

    my $protein = 'NOV';
    my @keywords = (qw(novice niveau november paramonov nov supernova), "he said 'nov'");
    print "Left-matches /$protein/: $_\n" for grep { $_ =~ /\b$protein/i } @keywords;
    print "Right-matches /$protein/: $_\n" for grep { $_ =~ /$protein\b/i } @keywords;
    print "Contains /$protein/: $_\n" for grep { $_ =~ /$protein/i } @keywords;
    print "Contains /$protein/ as word: $_\n" for grep { $_ =~ /\b$protein\b/i } @keywords;
Re: searching keywords
by haukex (Archbishop) on Feb 28, 2017 at 08:13 UTC

    Welcome to the Monastery, pdahal. To get started, have a look at the following; the better the information you give us, the faster and better we can provide help: How do I post a question effectively? and Short, Self-Contained, Correct Example.

    But even if the abstract doesn't contain NOV, it lists NOV if there are words like novel.

    Since you haven't shown any code or sample input, I can only guess. Are you using a regular expression like /NOV/i? If so, then perhaps adding a "word boundary" anchor \b (see perlretut) will help: /\bNOV\b/i.
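
    As a minimal sketch of that suggestion, assuming the keyword arrives in a variable rather than being written into the pattern: the helper name contains_word and the sample sentences below are made up for illustration and are not taken from the original program. \Q...\E is added so that any regex metacharacters in a protein name are matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): returns 1 if $keyword
# occurs in $text as a whole word, ignoring case, and 0 otherwise.
sub contains_word {
    my ($text, $keyword) = @_;
    # \b anchors the match at word boundaries; \Q...\E quotes metacharacters.
    return $text =~ /\b\Q$keyword\E\b/i ? 1 : 0;
}

# Made-up sample sentences, used only to show the difference.
my @samples = (
    'We describe a role for NOV in cell adhesion.',
    'A novel approach to protein folding.',
);

for my $sample (@samples) {
    my $verdict = contains_word($sample, 'NOV') ? 'match' : 'no match';
    print "$sample => $verdict\n";
}
```

    The first sentence should report a match and the second should not, because "novel" continues with a word character after "nov". One caveat: \b only helps when the keyword itself begins and ends with a word character, so names with leading or trailing punctuation would need a different boundary test.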

    Update: More is not better...

      Here I have attached my code.

      use warnings;
      use XML::Simple;
      use LWP::UserAgent;
      use HTTP::Request::Common;
      use URI::Escape;
      use Data::Dumper;
      use Text::CSV;

      my @keywords;
      my $file ="proteinlist.csv";
      my $ua = LWP::UserAgent->new;
      my $csv = Text::CSV->new({ sep_char => ',' });

      #Open result CSV file.
      open(my $fh, ">", "result1.csv");
      print $fh "Pubmed ID, Drug name, Keyword(s) that matches, List of proteins in the abstract\n";

      #Open the CSV file containing list of PubMed IDs
      open(my $data, '<', "pmid.csv");
      while (my $line = <$data>) {
          chomp $line;
          if ($csv->parse($line)) {
              #Skip first line
              next if ($. == 1);
              my @fields = $csv->fields();
              #Replace (-) with (,)
              $fields[0] =~ tr/-/,/;
              $fields[1] =~ tr/-/,/;
              #Split alt name
              my @id = split /[+]/, $fields[1];
              for (my $i = 0; $i < scalar @id; $i++){
                  #Initialize http request
                  my $args = "db=pubmed&id=$id[$i]&retmode=text&rettype=abstract";
                  my $req = new HTTP::Request POST => 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';
                  $req->content_type('application/x-www-form-urlencoded');
                  $req->content($args);
                  #Get response
                  my $response = $ua->request($req);
                  my $content = $response->content;
                  $fields[0] =~ tr/,/-/;
                  my $keystr = "";
                  #open csv file containing the protein list and compare with the content of abstract
                  open(my $data, "<", $file) or die "Could not open '$file' $!\n";
                  while (my $readinline = <$data>) {
                      chomp $readinline;
                      #initialize the first data of csv as the first keyword
                      my @fields = split "," , $readinline;
                      $keywords[$i] = $fields[0];
                      if (regex(lc $content,lc $keywords[$i]) != -1) {
                          if ($keystr eq ""){
                              $keystr = $keywords[$i];
                          }else{
                              $keystr = $keystr . "+$keywords[$i]";
                          }
                      }
                  }
                  if ($keystr ne ""){
                      print $fh "$id[$i],$fields[0],$keystr,Yes\n";
                      print "$id[$i],$fields[0],$keystr,Yes\n";
                  }else{
                      print $fh "$id[$i],$fields[0],No keyword matches,No\n";
                      print "$id[$i],$fields[0],No keyword matches,No\n";
                  }
              }
          }
      }
      close($fh);

        Where is the "regex()" function declared and what does it do?

        if (regex(lc $content,lc $keywords[$i]) != -1) {

        Also, what is the input data to your program?

        Is the XML part and the download part necessary for your problem or could you maybe show just the relevant data and include that data in the program directly?
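
        For illustration, a reduced test case might look something like the sketch below, with the data included directly in the program. The protein list, the abstract text and the matching_keywords helper are all invented stand-ins for proteinlist.csv, the downloaded abstract and the undeclared regex() function; the whole-word match with \b is one assumed way to do the comparison, not the poster's actual code.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for proteinlist.csv and one fetched abstract.
my @proteins = qw(NOV TP53 BRCA1);
my $abstract = 'A novel interaction between TP53 and BRCA1 was observed.';

# Hypothetical helper: returns the keywords that occur in $text as whole
# words, ignoring case, in the order they were given.
sub matching_keywords {
    my ($text, @keywords) = @_;
    return grep { $text =~ /\b\Q$_\E\b/i } @keywords;
}

my @found = matching_keywords($abstract, @proteins);

# Join the hits with "+" as in the posted program, or report no match.
my $keystr = @found ? join('+', @found) : 'No keyword matches';
print "$keystr\n";
```

        With this sample data the script prints TP53+BRCA1, and NOV is not listed even though the abstract contains "novel". A script of this size needs no network access or input files, which makes the behaviour easy for others to reproduce.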