 
PerlMonks  

regexp and substitute operator problem

by dannoura (Pilgrim)
on Jul 08, 2003 at 05:59 UTC (#272204, perlquestion)

dannoura has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I'm processing the text below, trying to extract all mentions of chromosomal bands (they're the ones with the numbers and p's and q's). Using this code:

    $c='[\d\-\.pqxy]';                    #regexp
    (@chroms)=($text=~/\s$c*?\s/sig);     #extract all
    for ($i=0; $i<@chroms; $i++) {
        splice(@chroms, $i, 1) if (!($chroms[$i]=~/[pqxy]/i));
    }                                     #eliminate pure numbers
    print "$_\n" foreach (@chroms);

The print statement yields:

Xq27-q28 q11q12 22q121 1q422q43 19q 17p11 1q25 13q123 11p112 1q25 1q2331 8p22 8p22 7q1123 7p22 19q12q1311 52 15 19q12q1311

Two questions: why doesn't the splice statement get rid of the "52" and "15" elements? Why do none of the dots or hyphens appear for the captured elements?

Whole-genome scan studies recently identified a locus on Xq27-q28 Xq11-q12 22q12.1 1q42.2-q43 19q 17p11 1q25 13q12.3 11p11.2 10q25 10q23.31 8p22 8p22 7q11.23 7p22 chromosome segments 19q12-q13.11 linked to prostate tumor aggressiveness by use of the Gleason score as a quantitative trait. We have now completed finer-scale linkage mapping across this region that confirmed and narrowed the candidate region to 2 cM, with a peak between markers D19S875 and D19S433. We also performed allelic imbalance (AI) studies across this region in primary prostate tumors from 52 patients unselected for family history or disease status. A high level of AI was observed, with the highest rates at markers D19S875 (56%) and D19S433 (60%). Furthermore, these two markers defined a smallest common region of AI of 0.8 Mb, with 15 (29%) prostate tumors displaying interstitial AI involving one or both markers. In addition, we noted a positive association between AI at marker D19S875 and extension of tumor beyond the margin (P = 0.02) as well as a higher Gleason score (P = 0.06). These data provide strong evidence that we have mapped a prostate tumor aggressiveness locus to chromosome segments 19q12-q13.11 that may play a role in both familial and non-familial forms of prostate cancer.

Replies are listed 'Best First'.
Re: regexp and substitute operator problem
by blokhead (Monsignor) on Jul 08, 2003 at 06:18 UTC
    There's a problem in the loop that removes items according to a regex. Whenever an item is removed, the items to the right are shifted over, but $i is still incremented. See:
    my @arr = qw/ foo foozle bar bar2 foo bar foo /;
    for (my $i=0; $i<@arr; $i++) {
        splice(@arr, $i, 1) if ($arr[$i] =~ /bar/);
    }
    print "$_\n" for (@arr);
    ## output: bar2 is NOT removed
    It will fail to check the next item after each match. It's much simpler (and correct-er) to write:
    @chroms = grep { /[pqxy]/i } @chroms;
    Although, if you're dealing with a huge amount of genetic data, grep builds a second list in memory. A quick fix to the existing loop would be this slight change:
    for (my $i=0; $i<@chroms; $i++) {
        splice(@chroms, $i--, 1) if not $chroms[$i] =~ /[pqxy]/i;
    }
    This still may not be optimal, since each splice shifts the remaining elements, but it works in place and avoids building a second list.
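Another way to keep an in-place loop safe, sketched here as an alternative to the `$i--` adjustment, is to walk the array from the end. Removing an element then only shifts items that have already been checked. The sample list below is made up for illustration; it mixes band names with the kind of pure numbers the original loop left behind.

```perl
use strict;
use warnings;

# Hypothetical sample: band names mixed with pure numbers.
my @chroms = qw/ Xq27-q28 2 52 0.8 15 19q12-q13.11 /;

# Iterate from the last index down to 0, so a splice never
# moves an unchecked element into a position we have passed.
for (my $i = $#chroms; $i >= 0; $i--) {
    splice(@chroms, $i, 1) unless $chroms[$i] =~ /[pqxy]/i;
}

print "$_\n" for @chroms;
```

With a forward loop and no index adjustment, "52" and "15" would survive here, because each sits immediately after a removed element and is skipped.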

    About your second question... Off the top of my head, I'm not sure what's up with your regex... It's late for me!

    blokhead

      Yep, you're right, that was a pretty stupid mistake. Thanks for your help. This:

      @chroms = grep {  /[pqxy]/i } @chroms;

      solves it.

        As for your second question, just copy/pasting your code and the example text does output the hyphens and dots here. What are you displaying it on?

        Also, you might at least want to put some () around the central part of your regex, as in: /\s($c*?)\s/ so you won't capture the whitespace around the matches as you do now. Or maybe that is what you wanted.
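A small sketch of the capture-group suggestion, with one further assumption worth checking against your own data: because the pattern consumes the trailing whitespace, the next token has no leading whitespace left to match, so every second band in a run of adjacent bands may be skipped. A lookahead, `(?=\s)`, tests for the whitespace without consuming it. The short `$text` below is an illustrative excerpt, not the full abstract.

```perl
use strict;
use warnings;

# Illustrative excerpt; leading and trailing spaces stand in for surrounding text.
my $text = " Xq27-q28 Xq11-q12 22q12.1 1q42.2-q43 ";
my $c    = '[\d\-.pqxy]';

# The trailing \s is consumed, so the following token cannot match.
my @consuming = $text =~ /\s($c+?)\s/ig;

# The lookahead leaves the whitespace in place for the next match.
my @lookahead = $text =~ /\s($c+)(?=\s)/ig;

print "consuming: @consuming\n";
print "lookahead: @lookahead\n";
```

In both cases the parentheses keep the surrounding whitespace out of the captured values.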


        You have moved into a dark place.
        It is pitch black. You are likely to be eaten by a grue.
Re: regexp and substitute operator problem
by BrowserUk (Patriarch) on Jul 08, 2003 at 06:51 UTC

    When I run your code, I get a quite different set of results from those you posted.

    P:\test>test
    Xq27-q28 22q12.1 19q 1q25 11p11.2 10q23.31 8p22 7p22 19q12-q13.11 52 15 19q12-

    Which makes it difficult to try and answer your two questions.

    In any case, it is probably better to specify the regex more accurately so that you don't capture the unwanted values in the first place. This seems to do the trick on the sample you supplied.

    #! perl -w
    use strict;

    my $text = <DATA>;

    # one band: optional chromosome, arm (p or q), band number, optional sub-band
    my $re_1chrom = '[\dXY]*[pq]\d+(?:\.\d+)?';
    # a single band, or a range of two bands joined by a hyphen
    my $re = qr[\b$re_1chrom(?:-$re_1chrom)?\b]i;

    my @chroms = ( $text =~ /$re/g );

    print $_, $/ for @chroms;
    __END__
    P:\test>test
    Xq27-q28 Xq11-q12 22q12.1 1q42.2-q43 17p11 1q25 13q12.3 11p11.2 10q25 10q23.31 8p22 8p22 7q11.23 7p22 19q12-q13.11 19q12-q13.11

    The sequential double reference to 8p22 seems odd in context, but eliminating that and the other duplicate(s) is a simple step.
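One common way to drop the duplicates, sketched here with a hash that records what has already been seen. The short list is a hand-picked subset of the output above, and `%seen` and `@unique` are names chosen for the example.

```perl
use strict;
use warnings;

# Subset of the extracted bands, including the repeated entries.
my @chroms = qw/ 8p22 8p22 7q11.23 19q12-q13.11 19q12-q13.11 /;

# $seen{$_}++ is 0 (false) the first time a value appears,
# so grep keeps only the first occurrence and preserves order.
my %seen;
my @unique = grep { !$seen{$_}++ } @chroms;

print "$_\n" for @unique;
```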


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: regexp and substitute operator problem
by allolex (Curate) on Jul 08, 2003 at 10:36 UTC

    Well, I tried my own code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @bands;
    while (<DATA>) {
        chomp;
        my @elements = split(' ',$_);
        foreach my $element (@elements) {
            next unless ($element =~ /[pq]/i);
            next unless ($element =~ /[0-9]/);
            # put them into an array just in case you
            # want to do something other than print them
            push @bands, $element;
            print $element . "\n";
        }
    }
    __DATA__
    Whole-genome scan studies recently identified a locus on Xq27-q28 Xq11-q12 22q12.1 1q42.2-q43 19q 17p11 1q25 13q12.3 11p11.2 10q25 10q23.31 8p22 8p22 7q11.23 7p22 chromosome segments 19q12-q13.11 linked to prostate tumor aggressiveness by use of the Gleason score as a quantitative trait. We have now completed finer-scale linkage mapping across this region that confirmed and narrowed the candidate region to 2 cM, with a peak between markers D19S875 and D19S433. We also performed allelic imbalance (AI) studies across this region in primary prostate tumors from 52 patients unselected for family history or disease status. A high level of AI was observed, with the highest rates at markers D19S875 (56%) and D19S433 (60%). Furthermore, these two markers defined a smallest common region of AI of 0.8 Mb, with 15 (29%) prostate tumors displaying interstitial AI involving one or both markers. In addition, we noted a positive association between AI at marker D19S875 and extension of tumor beyond the margin (P = 0.02) as well as a higher Gleason score (P = 0.06). These data provide strong evidence that we have mapped a prostate tumor aggressiveness locus to chromosome segments 19q12-q13.11 that may play a role in both familial and non-familial forms of prostate cancer.

    And got this output:

    Xq27-q28 Xq11-q12 22q12.1 1q42.2-q43 19q 17p11 1q25 13q12.3 11p11.2 10q25 10q23.31 8p22 8p22 7q11.23 7p22 19q12-q13.11 19q12-q13.11

    I hope that helps...

    --
    Allolex

Re: regexp and substitute operator problem
by aquarium (Curate) on Jul 08, 2003 at 06:18 UTC
    what is $text to start with??
