biobee07 has asked for the wisdom of the Perl Monks concerning the following question:

I want to extract some information from the following input line using a Perl regular expression. I would appreciate any help in doing so.

input line: hg19_ensGene_ENST00000237247 range=chr1:67208779-67210057 5'pad=0 3'pad=0 strand=+ repeatMasking=none

info to be extracted: chr1:67208779-67210057:+

The Perl code I use so far, which works successfully, is:

    while(<LOC>){
        chomp;
        if(/^>(\w+)\s\w+\=(chr\w+)\:(\d+)\-(\d+)/)
            $loc{$1} = "$2:$3:$4:$5";
            print $loc{$1}."\n";
        }
    }

It extracts the following info: chr1:67208779-67210057

However, I am unable to extract the + from the input line above.

Replies are listed 'Best First'.
Re: Using regular expression to extract info from input line
by toolic (Bishop) on Mar 19, 2010 at 00:18 UTC
    I fixed a syntax error, added a leading '>' character to your input string, used perltidy to neaten up the indentation, added the strictures and used the DATA handle to create a self-contained, runnable example.
    use strict;
    use warnings;

    my %loc;
    while (<DATA>) {
        chomp;
        if (/^>(\w+)\s\w+\=(chr\w+)\:(\d+)\-(\d+).*\+(.*)/) {
            $loc{$1} = "$2:$3:$4:$5";
            print $loc{$1} . "\n";
        }
    }
    __DATA__
    >hg19_ensGene_ENST00000237247 range=chr1:67208779-67210057 5'pad=0 3'pad=0 strand=+ repeatMasking=none
    Prints:
    chr1:67208779:67210057: repeatMasking=none
    Also, there is no reason to back-whack = or : or - in your regex. This will also work:
    if (/^>(\w+)\s\w+=(chr\w+):(\d+)-(\d+).*\+(.*)/) {
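    A minimal sketch of that point, assuming the sample line from the question with the leading '>' added: outside a character class, =, : and - are ordinary characters in a Perl regex, so both patterns below should capture the same four fields. The variable names are only illustrative.

```perl
use strict;
use warnings;

# Sample line from the question, with the leading '>' added.
my $line = ">hg19_ensGene_ENST00000237247 range=chr1:67208779-67210057 5'pad=0 3'pad=0 strand=+ repeatMasking=none";

# Same pattern twice: once with back-whacked punctuation, once without.
my @escaped = $line =~ /^>(\w+)\s\w+\=(chr\w+)\:(\d+)\-(\d+)/;
my @plain   = $line =~ /^>(\w+)\s\w+=(chr\w+):(\d+)-(\d+)/;

# Both lists hold the same four captures.
print "escaped: @escaped\n";
print "plain:   @plain\n";
```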
      Hi toolic, thanks for making the code look neat and self-contained. I used your corrected code with a small modification based on jethro's advice:
      if (/^>(\w+)\s\w+\=(chr\w+)\:(\d+)\-(\d+).*strand=(.)/)
      and it works perfectly. Thanks a lot.
Re: Using regular expression to extract info from input line
by ww (Archbishop) on Mar 19, 2010 at 00:37 UTC
    And neither will you extract what you claim with the code posted.

    Among the other obvious defects, you have four sets of capturing parens and purport to print a fifth match. Not today. Moreover, the ">" just after the caret (anchor) means the regex can't match the data you supplied. Please be careful to post code that does what you say, to avoid wasting the Monks' time and -- all too often -- sending folks off on a wild goose ^H^H^H^H^H camel chase.
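    To illustrate the capture count, here is a small sketch (a '>' is added to the sample line so that the original pattern can match at all). In list context a successful match returns one value per capturing group, so the original pattern yields four values and leaves $5 undefined.

```perl
use strict;
use warnings;

# Sample line with a leading '>' added so the original pattern can match.
my $line = ">hg19_ensGene_ENST00000237247 range=chr1:67208779-67210057 5'pad=0 3'pad=0 strand=+ repeatMasking=none";

# In list context a successful match returns one value per capturing group.
my @captures = $line =~ /^>(\w+)\s\w+\=(chr\w+)\:(\d+)\-(\d+)/;

printf "%d captures\n", scalar @captures;
print defined($5) ? "\$5 is set\n" : "\$5 is undefined\n";
```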

    Did you perhaps mean this?

    #!/usr/bin/perl
    use strict;
    use warnings;
    # 829503
    # info to be extracted: chr1:67208779-67210057:+

    my %loc;

    =head original with captures highlighted

    while(<LOC>){
        chomp;
        if(/^>(\w+)\s\w+\=(chr\w+)\:(\d+)\-(\d+)/)
        -------------$1------------$2------$3-----$4-
            $loc{$1} = "$2:$3:$4:$5";
            print $loc{$1}."\n";
        }
    }

    =cut

    while(<DATA>) {
        chomp;
        if(/^(.+)\s.+=(chr\d):(\d+)\-(\d+).+(?:=\+\s)(.*)/) {
            print "\t$1 | $2 | $3 | $4 | $5 |\n\n";
            $loc{$1} = "$1:$2:$3:$4:$5";
            print $loc{$1}."\n";
        } else {
            print "No matches\n";
        }
    }
    __DATA__
    hg19_ensGene_ENST00000237247 range=chr1:67208779-67210057 5'pad=0 3'pad=0 strand=+ repeatMasking=none

    The above produces the following:

    ww@GIG:~/pl_test$ perl 829503.pl
            hg19_ensGene_ENST00000237247 | chr1 | 67208779 | 67210057 | repeatMasking=none |

    hg19_ensGene_ENST00000237247:chr1:67208779:67210057:repeatMasking=none
    ww@GIG:~/pl_test$

    Update: s/ahchor/anchor/

      Apologies for the inconvenience.
Re: Using regular expression to extract info from input line
by jethro (Monsignor) on Mar 19, 2010 at 00:33 UTC

    Is it always "strand=" before the character you are looking for? Is it always exactly one character, or could it be more or fewer (e.g. an empty string)?

    In case it is always a single character, simply have something like .*strand=(.) at the end of your regex.
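    A hedged sketch of that suggestion, assuming exactly one character always follows "strand=". The helper name extract_location and the optional leading '>' are illustrative assumptions, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "chr:start-end:strand", or undef if the line does not match.
sub extract_location {
    my ($line) = @_;
    # '>?' tolerates input with or without the leading '>'.
    # Assumes exactly one character follows "strand=".
    if ($line =~ /^>?\w+\s\w+=(chr\w+):(\d+)-(\d+).*strand=(.)/) {
        # Keep the dash between start and end, as in the requested output.
        return "$1:$2-$3:$4";
    }
    return;
}

my $line = "hg19_ensGene_ENST00000237247 range=chr1:67208779-67210057 5'pad=0 3'pad=0 strand=+ repeatMasking=none";
my $location = extract_location($line);
print defined($location) ? "$location\n" : "No match\n";
```

    If the strand field could be empty or longer than one character, the final group would need a different pattern, for example ([+-]?) for an optional sign.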

      Thanks a lot. It worked like a charm! Yes, my input lines have the same format in all cases, so I got the result I wanted.