Using regular expression to extract info from input line

biobee07 has asked for the wisdom of the Perl Monks concerning the following question:

I want to extract some info from the following input line using a Perl regular expression. I would appreciate any help in doing so.

input line: hg19_ensGene_ENST00000237247 range=chr1:67208779-67210057 5'pad=0 3'pad=0 strand=+ repeatMasking=none

info to be extracted: chr1:67208779-67210057:+

The Perl code I have so far, which works, is:

while(<LOC>){
        chomp;
     if(/^>(\w+)\s\w+\=(chr\w+)\:(\d+)\-(\d+)/)    
               $loc{$1} = "$2:$3:$4:$5"; 
        print $loc{$1}."\n";
        }
}

It extracts the following info: chr1:67208779-67210057

However, I am unable to extract the + from the input line above.

Re: Using regular expression to extract info from input line by toolic (Bishop) on Mar 19, 2010 at 00:18 UTC
I fixed a syntax error, added a leading '>' character to your input string, used perltidy to neaten up the indentation, added the strictures and used the DATA handle to create a self-contained, runnable example.

use strict;
use warnings;

my %loc;
while (<DATA>) {
    chomp;
    if (/^>(\w+)\s\w+\=(chr\w+)\:(\d+)\-(\d+).*\+(.*)/) {
        $loc{$1} = "$2:$3:$4:$5";
        print $loc{$1} . "\n";
    }
}
__DATA__
>hg19_ensGene_ENST00000237247 range=chr1:67208779-67210057 5'pad=0 3'pad=0 strand=+ repeatMasking=none

Prints:

chr1:67208779:67210057: repeatMasking=none

Also, there is no reason to back-whack `=` or `:` or `-` in your regex. This will also work:

if (/^>(\w+)\s\w+=(chr\w+):(\d+)-(\d+).*\+(.*)/) {
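To illustrate the point about back-whacking, here is a minimal sketch that runs the escaped and unescaped forms of the question's pattern against the same line and compares the captures. The variable names are illustrative only, and the input carries the leading '>' added above.

```perl
use strict;
use warnings;

# Input line from the question, with the leading '>' added.
my $line = ">hg19_ensGene_ENST00000237247 range=chr1:67208779-67210057 "
         . "5'pad=0 3'pad=0 strand=+ repeatMasking=none";

# Outside a character class, '=', ':' and '-' are ordinary characters,
# so escaping them changes nothing.
my $escaped   = qr/^>(\w+)\s\w+\=(chr\w+)\:(\d+)\-(\d+)/;
my $unescaped = qr/^>(\w+)\s\w+=(chr\w+):(\d+)-(\d+)/;

# In list context a match returns the captured groups.
my @a = $line =~ $escaped;
my @b = $line =~ $unescaped;

print "same captures\n" if "@a" eq "@b";
print join( ' | ', @b ), "\n";
```

The one place a hyphen needs care is inside a character class such as `[a-z]`, where it denotes a range.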
Re^2: Using regular expression to extract info from input line by biobee07 (Novice) on Mar 19, 2010 at 01:23 UTC
Hi Toolic, thanks for making the code look neat and self-contained. I used your corrected code, made a small modification based on jethro's advice, `if (/^>(\w+)\s\w+\=(chr\w+)\:(\d+)\-(\d+).*strand=(.)/)`, and it works perfectly. Thanks a lot.
Re: Using regular expression to extract info from input line by ww (Archbishop) on Mar 19, 2010 at 00:37 UTC
And neither will you extract what you claim with the code posted. Among the other obvious defects, you have four sets of capturing parens and purport to print a fifth match. Not today. Moreover, the ">" just after the caret (anchor) means the regex can't match the data you supplied.

Please be careful to post code that does what you say, to avoid wasting the Monks' time and -- all too often -- sending folks off on a wild goose ^H^H^H^H^H camel chase.

Did you perhaps mean this?

#!/usr/bin/perl
use strict;
use warnings;
# 829503

# info to be extracted: chr1:67208779-67210057:+

my %loc;

=head original with captures highlighted

while(<LOC>){
        chomp;
     if(/^>(\w+)\s\w+\=(chr\w+)\:(\d+)\-(\d+)/)
     -------------$1------------$2------$3-----$4-
               $loc{$1} = "$2:$3:$4:$5";
        print $loc{$1}."\n";
        }
}

=cut

while(<DATA>) {
    chomp;
    if(/^(.+)\s.+=(chr\d):(\d+)\-(\d+).+(?:=\+\s)(.*)/) {
        print "\t$1 | $2 | $3 | $4 | $5 |\n\n";
        $loc{$1} = "$1:$2:$3:$4:$5";
        print $loc{$1}."\n";
    } else {
        print "No matches\n";
    }
}

__DATA__
hg19_ensGene_ENST00000237247 range=chr1:67208779-67210057 5'pad=0 3'pad=0 strand=+ repeatMasking=none

The above produces the following:

ww@GIG:~/pl_test$ perl 829503.pl
	hg19_ensGene_ENST00000237247 | chr1 | 67208779 | 67210057 | repeatMasking=none |

hg19_ensGene_ENST00000237247:chr1:67208779:67210057:repeatMasking=none
ww@GIG:~/pl_test$

Update: s/ahchor/anchor/
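The capture-count point can be checked directly. The following is a rough sketch, not code from the thread: it applies the question's pattern unchanged to a line with a leading '>' (an assumption, since the posted data has none), counts the groups returned in list context, and then repeats the match with a fifth group for the strand.

```perl
use strict;
use warnings;

# Assumed input: the question's line with a leading '>' so the pattern can match.
my $line = ">hg19_ensGene_ENST00000237247 range=chr1:67208779-67210057 "
         . "5'pad=0 3'pad=0 strand=+ repeatMasking=none";

# The question's pattern, unchanged: it contains four capture groups.
my @caps = $line =~ /^>(\w+)\s\w+\=(chr\w+)\:(\d+)\-(\d+)/;
printf "%d groups captured\n", scalar @caps;

# With only four groups in the pattern, $5 is left undefined.
print defined $5 ? "\$5 = $5\n" : "\$5 is undefined\n";

# A fifth group is needed before $5 can hold the strand.
my @with_strand = $line =~ /^>(\w+)\s\w+=(chr\w+):(\d+)-(\d+).*strand=(.)/;
printf "%d groups captured, strand = %s\n", scalar @with_strand, $with_strand[4];
```

Under `use warnings`, interpolating the undefined `$5` into a string would emit a "Use of uninitialized value" warning, which is a quick way to spot this kind of mismatch.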
Re^2: Using regular expression to extract info from input line by biobee07 (Novice) on Mar 19, 2010 at 01:40 UTC
Apologies for the inconvenience.
Re: Using regular expression to extract info from input line by jethro (Monsignor) on Mar 19, 2010 at 00:33 UTC
Is it always "strand=" before the character you are looking for? Is it always exactly one character, or could it be more or fewer (i.e. an empty string)? If it is always a single character, simply add something like `.*strand=(.)` to the end of your regex.
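Putting that suggestion together with the rest of the pattern, a minimal sketch might look like the following. It assumes the strand is always a single `+` or `-` directly after `strand=` and that the field before the coordinates is always `range=`; the helper name `extract_location` is hypothetical and not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "chr:start-end:strand", or undef when the
# line does not have the expected layout.
sub extract_location {
    my ($line) = @_;

    # The leading '>' is optional, so both forms of the input line match.
    # Assumption: the strand is a single '+' or '-' after "strand=".
    if ( $line =~ /^>?\w+\s+range=(chr\w+):(\d+)-(\d+).*\bstrand=([+-])/ ) {
        return "$1:$2-$3:$4";
    }
    return;
}

my $line = "hg19_ensGene_ENST00000237247 range=chr1:67208779-67210057 "
         . "5'pad=0 3'pad=0 strand=+ repeatMasking=none";

my $loc = extract_location($line);
print defined $loc ? "$loc\n" : "no match\n";    # prints chr1:67208779-67210057:+
```

Restricting the last group to `[+-]` rather than `.` means a line with an unexpected strand value is rejected instead of silently captured.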
Re^2: Using regular expression to extract info from input line by biobee07 (Novice) on Mar 19, 2010 at 01:20 UTC
Thanks a lot. It worked like a charm! Yes, my input lines have the same format in every case, so I got the result I wanted.