perllearner007 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, So I know many monks here aren't biologist but I have posted my question on bioinfo forums aswell but haven't been able to solve the issue. Since it involves perl I thought maybe you all will be able to point out something in the code. My question, I am trying to find overlap between CNV(a variant basically) which has genomic region(start and stop basepair position), eg: 23487 72875 and genes which also have this (start and stop bp position) eg: BRCA1 41196312 41277500. Overlap can be in different ways: i). either there wont be any overlap i.e the gene region will not fall in the CNV region or ii). it will fall completely within the region or iii) it may fall partially i.e gene region might overlap with the CNV start position or gene region overlapping with the stop postion of the CNV. So if there will be an overlap it will return 1 and if not then it will return 0 and the output file finally will have all the overlap and the gene name with which this cnv region is overlapping. So I am reading in the main CNV file and two other gene list files. Three tab delim files I have are
file1=cnv file start stop size(in bp) 5389769 98256008 3567 7452344 871875466 64547
file2=genelist1 name start stop BRCA1 41196312 41277500 TP53 7571720 7590863
file3=genelist3 name start stop OMG 29621668 29624380 NR3C1 142657496..142815077
note: I cannot merge the two gene lists as they are different sets and need to be kept separate. Am I doing this correctly? Any suggestions?
!/usr/bin/perl #Open the cnv file with start, stop my $input_gene_file = "path to file"; die "Cannot open $input_gene_file\n" unless (open(IN, $input_gene_ +file)); my %size_CNV_start_CNV_stop; while (chomp($line = <IN>)) { my (@columns) = split /\s+/, $line; my $size = $columns[3]; my $CNV_start = $columns[1]; my $CNV_stop = $columns[2]; $size_and_CNV_start{$size} = $CNV_start; $size_and_CNV_stop{$size} = $CNV_stop; } close(IN); #Now open files for genelist1 and genelist2 with respective start +and stops open (MYFILE, "/Users/pathtofile"); @file1 = <MYFILE>; close MYFILE; open (MYFILE2, "/Users/pathtofile"); @file2 = <MYFILE2>; close MYFILE2; #Open output file and write the locations/position of each symbol die "output.txt" unless(open( OUT,"> output.txt")); while(defined(my $line1 = <$file1>) and defined(my $line2 = <$file2>)){ sub overlap { my ($CNV_start, $CNV_stop, $Gene_start, $Gene_stop) = @_; if ($Gene_stop < $CNV_start || $Gene_start < $CNV_stop) { return 0; if (Gene_stop > $CNV_start || $Gene_stop > $CNV_stop){ } return 1; } while (defined($symbol = <IN>)) { chomp($symbol); { my $CNV_start = $size_and_CNV_start{$size}; my $CNV_stop = $size_and_CNV_stop{$size}; print OUT " $name\ $size\ $CNV_start\ $CNV_stop \n"; } } } close(OUT);
I get a blank output file. There is no error or warning that I am getting so I am not sure what is going on and what I should be correcting? My output file should be if overlapping regions found
gene name size start stop xyz 5464 83654272 6735353 ... ...

Replies are listed 'Best First'.
Re: overlapping regions
by brx (Pilgrim) on Apr 24, 2012 at 15:15 UTC
    Strange things:
    • You define a sub overlap { ... } inside a while () loop.
    • You don't call this sub (overlap).
    • You use @file1 = <MYFILE> then while(defined(my $line1 = <$file1>.
    Add use strict; use warnings; in the begining of your script will help.
      Thanks brx...I've been revising the code a bit..Will be posting any update soon.
Re: overlapping regions
by Cristoforo (Curate) on Apr 25, 2012 at 00:31 UTC
    I thought that I would post a possible solution (using the module Set::IntSpan). It eliminates making the logic to detect overlap by testing for set intersection between the spans in the cnv file and the spans in the genelist* files. For each line in the genelist file, it checks to see if it intersects with any/all lines in the cnv file.

    However, this may not be what you were looking for as far as output/results. :-(

    #!/usr/bin/perl use strict; use warnings; use Set::IntSpan; my $cnv = <<EOF; start stop size(in bp) 5389769 98256008 3567 7452344 871875466 64547 EOF my $genelist1 = <<EOF; name start stop BRCA1 41196312 41277500 TP53 7571720 7590863 EOF my $genelist3 = <<EOF; name start stop OMG 29621668 29624380 NR3C1 142657496..142815077 EOF my @cnv; open my $fh, "<", \$cnv; <$fh>; # toss header while (<$fh>) { chomp; my ($start, $stop, $size) = split /\t/; push @cnv, { span => Set::IntSpan->new("$start-$stop"), size => $size, }; } close $fh or die $!; printf "%-10s%10s%10s%10s\n", ('gene name', qw/ size start stop /); for (\$genelist1, \$genelist3) { open $fh, "<", $_; <$fh>; # toss header while (<$fh>) { chomp; my ($name, $start, $stop) = split /\t|(?<=\d)\.\.(?=\d)/; my $span = Set::IntSpan->new("$start-$stop"); for my $href (@cnv) { my $intersect = $span->intersect( $href->{span} ); if ($intersect) { printf "%-10s%10s%10s%10s\n", $name, $href->{size}, $intersect->min, $intersect- +>max; } } } close $fh or die $!; }
    And produced this output
    gene name size start stop BRCA1 3567 41196312 41277500 BRCA1 64547 41196312 41277500 TP53 3567 7571720 7590863 TP53 64547 7571720 7590863 OMG 3567 29621668 29624380 OMG 64547 29621668 29624380 NR3C1 64547 142657496 142815077

    Hoping this is something like you were asking. The same genename overlapped with both spans in the cnv file for all of the genenames except for the last one, NR3C1. It overlapped with the span in the cnv file with size=64547 but not with the span with size=3567

    Chris

      Thanks Chris. This also makes a bit of sense. Its great if the same gene overlaps with both spans in the cnv file. Let me try and run this with my gene list which is much larger and see if this works or not. As for my previous code.. I fixed a few things but I think it is something to do with my sub function. I am working on fixing that.