Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am relatively new to Perl and I have what should be a simple question for the Perl Monks. My problem is that I have to match each base position to the region that contains it on the corresponding chromosome. I have two input files. The first holds the base positions and looks like this:

pos.txt

chr1 415 0 0 +

chr1 600 0 0 +

chr3 205 0 0 +

chr4 681 0 0 +

chr7 110 0 0 +

chr7 350 0 0 +

where the 1st column is the chromosome and the 2nd column is the position. The columns are separated by tabs.

The second file is reg.txt:

chr1 400 500 0 0 +

chr1 600 700 0 0 +

chr3 200 225 0 0 +

chr4 650 700 0 0 +

chr7 100 120 0 0 +

chr7 300 400 0 0 +

where the 1st column is the chromosome, the 2nd column is the start of the region, and the 3rd column is the end of the region. These columns are separated by tabs as well.

Essentially, I have to find the region in reg.txt that contains each position in pos.txt.

Here is the code I'm working on:

#!/usr/bin/perl
use warnings;
use strict;

my $region   = 'testReg.txt';
my $position = 'testPos.txt';
my $writeOut = '>>testOut.txt';

open(R, $region) or die "error reading file";
open(OUT, $writeOut) or die "error writing to the file ";
open(P, $position) or die "error reading file ";

my $rline;
my $pline;

while ($rline = <R>) {
    chomp($rline);
    my @r_arr = split("\t", $rline);
    chomp($r_arr[0]);
    my @rID = split("r", $r_arr[0]);
    $r_arr[0] = $rID[1];
    #this removes the "chr" portion of the first element and leaves number
    #i.e. instead of [0] -> "chr24"; [0] -> "24"

    while ($pline = <P>) {
        if (!$rline) {
            last;
        } #end if
        chomp($pline);
        my @p_arr = split("\t", $pline);
        chomp($p_arr[0]);
        my @pID = split("r", $p_arr[0]);
        $p_arr[0] = $pID[1];

        if ($p_arr[1] > $r_arr[2]) {
            $rline = <R>;
            redo;
        } #end if
        else {
            if ($p_arr[0] == $r_arr[0] && $p_arr[1] >= $r_arr[1] && $p_arr[1] <= $r_arr[2]) {
                #NOTE: [0] element in each array now corresponds to chr number
                # r[1] is start of region and r[2] is end of region
                # p[1] is the position of the base pair
                shift(@p_arr);
                print (OUT "chr$r_arr[0]\t$r_arr[1]\t$r_arr[2]\t$r_arr[3]\t");
                print OUT join ("\t", @p_arr), "\n";
                #essentially I'm joining the two files with matching lines
                #w/ columns separated by tab
            } #end if
        } #end else
    } # end while <P>
} #end while <R>

close R;
close P;
close OUT;

I have working code that can find a region for a position in one chromosome, but for multiple chromosomes I am having a difficult time getting an answer. My problem seems to be the loops as the code only outputs answers for the first chromosome (i.e. chr1). Any help would be appreciated, a217

Re: Help with locating bp region in chromosome
by Anonymous Monk on Jun 23, 2011 at 02:27 UTC
    Good start :) You should also put the data in <code></code> aka <c></c> tags

    Or better yet, make it part of your program with in-memory filehandles.

    The essence of your program and your problem is unchanged, except now it's confined in sub RegPosOut910992.
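A minimal sketch of the in-memory filehandle idea, assuming a few lines of the sample data from the question. The variable names ($reg_data, $pos_data, $out_data) are made up for illustration, and the line counting is only a stand-in for the real matching logic:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data copied from the question; names are illustrative only.
my $reg_data = <<'END_REG';
chr1 400 500 0 0 +
chr1 600 700 0 0 +
chr3 200 225 0 0 +
END_REG

my $pos_data = <<'END_POS';
chr1 415 0 0 +
chr1 600 0 0 +
chr3 205 0 0 +
END_POS

my $out_data = '';

# Opening a reference to a scalar gives a filehandle that reads from
# (or writes to) that string instead of a file on disk.
open my $reg_fh, '<', \$reg_data or die "Cannot open in-memory regions: $!";
open my $pos_fh, '<', \$pos_data or die "Cannot open in-memory positions: $!";
open my $out_fh, '>', \$out_data or die "Cannot open in-memory output: $!";

# Stand-in for the matching logic: read every line and report the counts.
my @reg_lines = <$reg_fh>;
my @pos_lines = <$pos_fh>;
print {$out_fh} 'regions=', scalar(@reg_lines), ' positions=', scalar(@pos_lines), "\n";

close $_ for $reg_fh, $pos_fh, $out_fh;

# Whatever was "written" to the output handle is now in $out_data.
print $out_data;
```

Anyone can then run the post as-is, with no extra files to create, and the output ends up in a string that is easy to inspect.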

    The next step you should take is to replace @r_arr and @p_arr with meaningful variable names, say $Chromosome, $StartOfRegion, $EndOfRegion...
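For instance, a list assignment can give each field a name. This is only a sketch: the names below are suggestions, and $position is a hypothetical value chosen for the demonstration:

```perl
use strict;
use warnings;

# One region record, already split into fields (values from the question).
my @r_arr = ('chr1', 400, 500, 0, 0, '+');

# List assignment names the first three fields; @rest collects the others.
my ($chromosome, $start_of_region, $end_of_region, @rest) = @r_arr;

my $position = 415;   # hypothetical base position to test

# The range check now reads like the problem statement.
my $inside = ($position >= $start_of_region && $position <= $end_of_region) ? 1 : 0;

print "$chromosome:$start_of_region-$end_of_region contains $position\n" if $inside;
```

Compared with $r_arr[1] and $r_arr[2], the named version makes it much harder to mix up the start and end columns.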

    Also, the data you presented contains no tabs, so split on whitespace
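A small sketch of the difference, assuming a line separated by single spaces as in the posted sample:

```perl
use strict;
use warnings;

my $line = "chr1 400 500 0 0 +";   # space-separated, as posted

# No tab in the line, so this yields a single field: the whole line.
my @by_tab = split "\t", $line;

# The special pattern ' ' splits on any run of whitespace (spaces or tabs)
# and ignores leading whitespace.
my @by_whitespace = split ' ', $line;

printf "tab: %d field(s), whitespace: %d field(s)\n",
    scalar @by_tab, scalar @by_whitespace;
```

Splitting on ' ' handles both the tab-separated files and the space-separated sample, so the same code works on either.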

    Next problem: filehandles are iterators. Once you advance the iterator and reach the end, you stay at the end unless you rewind it. You can rewind filehandles with seek.
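A rough sketch of rewinding with seek, assuming in-memory filehandles and a trimmed version of the sample data (only the columns needed for matching). This is one possible way to restructure the loops, not necessarily the best one:

```perl
use strict;
use warnings;

# Trimmed sample data: chromosome, start, end / chromosome, position.
my $reg_data = "chr1 400 500\nchr3 200 225\nchr7 100 120\n";
my $pos_data = "chr1 415\nchr3 205\nchr7 110\nchr7 350\n";

open my $reg_fh, '<', \$reg_data or die "Cannot open regions: $!";
open my $pos_fh, '<', \$pos_data or die "Cannot open positions: $!";

my @matches;
while (my $rline = <$reg_fh>) {
    my ($r_chr, $start, $end) = split ' ', $rline;

    # Rewind the positions handle so the inner loop starts from the top
    # for every region; without this it is exhausted after the first pass.
    seek $pos_fh, 0, 0 or die "Cannot rewind positions: $!";

    while (my $pline = <$pos_fh>) {
        my ($p_chr, $pos) = split ' ', $pline;
        next unless $p_chr eq $r_chr && $pos >= $start && $pos <= $end;
        push @matches, "$r_chr\t$start\t$end\t$pos";
    }
}

print "$_\n" for @matches;
```

Rereading the positions for every region costs time proportional to regions times positions, which should be fine for small files. For large inputs, reading the regions into memory once would probably be a better design.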

      Sorry about that, but here is the data input as well as the output for my code.

      Below is pos.txt, or where the position of the bp is located (1st col chromosome, 2nd col position).

      chr1 104 104 0 0 +
      chr1 145 145 0 0 +
      chr1 205 205 0 0 +
      chr1 600 600 0 0 +
      chr3 500 500 0 0 +
      chr4 150 150 0 0 +
      chr4 175 175 0 0 +
      chr7 400 400 0 0 +
      chr7 550 550 0 0 +
      chr9 100 100 0 0 +
      chr11 680 680 0 0 +
      chr11 681 681 0 0 +
      chr22 105 105 0 0 +
      chr22 110 110 0 0 +
      chr22 350 350 0 0 +

      Below is reg.txt, or where the region is located (1st col chromosome, 2nd col start of region, 3rd col end of region).

      chr1 100 159 0
      chr1 200 260 0
      chr1 500 750 0
      chr3 450 700 0
      chr4 100 300 0
      chr7 350 600 0
      chr9 100 125 0
      chr11 679 687 0
      chr22 100 200 0
      chr22 300 400 0

      Below is my output, where first 4 col are from reg.txt and last 5 are from pos.txt. As you can see, my code only correctly outputs answers for part of the first chromosome, and it does not continue past that. This is the main problem I face, to understand how I can get a loop to cover all cases.

      chr1 100 159 0 104 104 0 0 +
      chr1 100 159 0 145 145 0 0 +
        The code you posted produces no output for me (try downloading the code you posted yourself).

        You should also post the exact output you expect to get.