Re: Use of Uninitialized in Concatenation or String Error?
by kcott (Archbishop) on Aug 09, 2013 at 09:23 UTC
G'day ccelt09,
There are some fundamental flaws in your code.
"I then receive a continuous
Use of uninitialized value within @SNPs in concatenation (.) or string at temp_file_test.pl line 36, <CG> line 1112424."
You're in an infinite loop!
As soon as you enter "while ($switch == 1) {", you perform a test: "(($position < $end) && ($position >= $start))".
Obviously, that test must have been TRUE on the first iteration or you would never reach line 36.
The only way to exit the loop is to set the value of $switch to something other than 1;
you only do this if the aforesaid condition is FALSE, but the variables involved in that condition ($start, $end and $position) never change within the loop, so the loop never ends!
At line 36 you have: "print OUT "$SNPs[$placeholder]\n";".
The variable $placeholder is incremented (infinitely) a few lines below this: "$placeholder++;".
At some point, the index exceeds the number of assigned elements in @SNPs and you start getting "uninitialized value" warnings.
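A minimal sketch of a loop that does terminate, assuming the lines are already in an array and the locus is the fourth tab-separated field (index 3), as in the posted code. The sub name lines_in_window and the sample lines are made up for illustration. The two points it shows are that $position is recomputed on every pass and that the index can never run past the end of the array.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the lines of @$snps whose locus (field index 3)
# lies in [$start, $end), scanning from index $from and stopping at the first
# line outside the window. Returns the hits and the index where scanning stopped.
sub lines_in_window {
    my ($snps, $from, $start, $end) = @_;
    my @hits;
    my $i = $from;
    while ($i < @$snps) {                               # bounded by array size
        my $position = (split /\t/, $snps->[$i])[3];    # recomputed each pass
        last unless $position >= $start && $position < $end;
        push @hits, $snps->[$i];
        $i++;
    }
    return (\@hits, $i);
}

# Made-up sample lines, locus in the fourth field.
my @SNPs = ("a\tb\tc\t5\tx\n", "a\tb\tc\t9\tx\n", "a\tb\tc\t15\tx\n");
my ($hits, $next) = lines_in_window(\@SNPs, 0, 1, 10);
print scalar(@$hits), " lines in window, next index $next\n";
```

The returned index can be passed back in as $from for the next window, so the input is only walked once (this assumes the input is sorted by locus).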
"I print out the correct first temp_2.txt file but no lines are printed to it"
In your infinite loop, you open an output file like this: "open(OUT, ">$output_file");".
Every time you call this, you delete $output_file and create a new one (see open, taking particular note of the 3-argument form shown in the examples).
When you kill the script, $output_file will probably have a size of 1 and contain a single newline.
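As a sketch of the alternative: open the output file once, before the loop, using the 3-argument form and reporting $! on failure. The temporary directory and file name here are assumptions for illustration only.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical output location, used only for this illustration.
my $dir         = tempdir(CLEANUP => 1);
my $output_file = "$dir/temp_1.txt";

# Open once, before the loop: '>' truncates the file every time it is opened.
open(my $out, '>', $output_file)
    or die "Can't open '$output_file' for writing: $!";

for my $line ("first\n", "second\n") {
    print {$out} $line;
}

close($out) or die "Can't close '$output_file': $!";

# Read the file back to confirm that both lines survived.
open(my $in, '<', $output_file)
    or die "Can't open '$output_file' for reading: $!";
my @lines = <$in>;
close($in);
print scalar(@lines), " lines written\n";
```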
There are other discrepancies between your description and code but I'm not sure which is correct.
You talk about the locus being the fifth element but nowhere do you access the fifth element of anything.
Your initial code (but not description) has a "chrX_1Mbwindow_nonoverlapping.interval" file;
I see you mention this elsewhere in the thread but you show (what should be unique) ranges as: 1000001-2000001, 2000001-3000001.
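One way to produce windows that do not share an endpoint, as a sketch only. The sub name make_windows and the closed-interval convention (both ends inclusive) are assumptions, not something taken from the posted code.

```perl
use strict;
use warnings;

# Hypothetical helper: build closed, non-overlapping windows [start, end]
# of $size positions each, covering positions 1 .. $last.
sub make_windows {
    my ($size, $last) = @_;
    my @windows;
    for (my $start = 1; $start <= $last; $start += $size) {
        my $end = $start + $size - 1;
        $end = $last if $end > $last;    # clip the final window
        push @windows, [$start, $end];
    }
    return @windows;
}

# Print each window as "start-end".
print join('-', @$_), "\n" for make_windows(1_000_000, 3_000_000);
```

With these arguments the windows are 1-1000000, 1000001-2000000 and 2000001-3000000, so no position belongs to two windows.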
So, fix those problems first (both code and data) and maybe other issues (such as "spaces at the end of my input file" — whatever the significance of that might be) will resolve themselves.
When you read the open documentation, take note of $! and see perlvar if you don't know what it is.
You'd also do well to take a look at perlstyle for some tips on code layout: what you've posted is messy, not that easy to read and a potential source of errors.
Thank you for the well-thought-out and thorough feedback; this helped me immensely! I worked back through the program paying special attention to layout. I also noticed I said the 5th element when in fact it was the fourth element (array index 3) that corresponded to chromosome locus.
Currently the revised program prints the correct number of output files and increments the position and placeholder variables correctly :). All the files are blank because, as you said, the start and end values don't change, so the test is always FALSE. This is because the starting locus in my data sits at the 2.6 million mark, while $start = 1 and $end = 1,000,001.
It seems unnecessarily convoluted, but also within the scope of my Perl knowledge, to assign two more place-holding variables to $start and $end respectively, incrementing them within the else block when the test is FALSE. I think this will work, but I have to think about how to accomplish it. Thanks again, Ken.
Re: Use of Uninitialized in Concatenation or String Error?
by Loops (Curate) on Aug 08, 2013 at 23:52 UTC
Hi ccelt09,
In truth I couldn't quite work out what your code was trying to do. But your description of what you wanted to accomplish seemed clear. Below is a different take on how to sort your input into separate files. It could be easily done with fewer CPAN modules, but I reached for them anyway. So to use this code you'd have to install the following from CPAN:
Text::CSV_XS DBI DBD::CSV
If that's not a problem, then the following code should work well for you:
use strict;
use warnings;
use Text::CSV_XS;
use DBI;

# input filename, and output file template with %d for interval #
my $input_filename  = 'td.data';
my $output_filename = 'split_%d.data';

# Divide loci into groups of one million per output file
sub calculate_interval { return int((shift) / 1000000) }

my $dbh = DBI->connect ("dbi:CSV:", undef, undef, {
    csv_eol      => "\n",
    csv_sep_char => "\t",
    csv_class    => "Text::CSV_XS",
    csv_null     => 1,
    csv_tables   => { genetics => {
        f_file    => $input_filename,
        col_names => [qw(a b c d locus f g h i j k l m n o)],
    }},
    RaiseError => 1,
    PrintError => 1,
}) or die $DBI::errstr;

# Magic
my $sth = $dbh->prepare("select * from genetics order by locus");
$sth->execute;

# Grunt work to output into separate files
$, = "\t";
my $output;
my $output_interval = -1;
while (my @row = $sth->fetchrow_array) {
    my $interval = calculate_interval($row[4]);
    # Rows arrive sorted by locus, so a new interval means a new file
    if ($interval != $output_interval) {
        $output_interval = $interval;
        my $name = sprintf($output_filename, $interval);
        open $output, '>', $name
            or die "$name: $!";
    }
    print $output @row, "\n";
}
With this input data in a file named td.data:
0 50 4 46 723430 0 2 1 2 1 1 1 1 3 1
0 50 4 46 5533723430 0 2 1 2 1 1 1 1 3 1
0 50 4 46 33723430 0 2 1 2 1 1 1 1 3 1
0 50 2 48 654732 0 1 1 1 0 2 3 2 1 3
This was the result:
split_0.data:0 50 2 48 654732 0 1 1 1 0 2 3 2 1 3
split_0.data:0 50 4 46 723430 0 2 1 2 1 1 1 1 3 1
split_33.data:0 50 4 46 33723430 0 2 1 2 1 1 1 1 3 1
split_5533.data:0 50 4 46 5533723430 0 2 1 2 1 1 1 1 3 1
This looks fantastic but as a novice perl user a good portion of it is over my head, I don't know that I can correctly interpret it. Many thanks for your input though, I will continue to study this and experiment with it!
Re: Use of Uninitialized in Concatenation or String Error?
by 2teez (Vicar) on Aug 09, 2013 at 02:32 UTC
Hi ccelt09,
From your description of what you want to achieve, I suppose it all comes down to sorting your data by the 5th column and then printing the lines to different files (please correct me if I am wrong) according to the range that column's value falls in (1 to 1,000,000, and so on, with 1 inclusive, making the range size 1,000,000).
So, if my assumption about what you want to do is correct, modifying a Schwartzian transform a bit should work like so:
use warnings;
use strict;
use Data::Dumper;

# Read bottom-up: pair each line with its locus, sort by locus,
# then replace the locus with its interval number.
my @array = map { [ int( $_->[1] / 1_000_000 ), $_->[0] ] }
    sort { $a->[1] <=> $b->[1] }
    map { [ $_, ( split /\s+/, $_ )[4] ] } <DATA>;
print Dumper \@array;
__DATA__
0 50 4 46 723430 0 2 1 2 1 1 1 1 3 1
0 50 4 46 5533723430 0 2 1 2 1 1 1 1 3 1
0 50 4 46 33723430 0 2 1 2 1 1 1 1 3 1
0 50 2 48 654732 0 1 1 1 0 2 3 2 1 3
Produces ...
So, printing to different files is just a "function" of placement: the first element of each entry in the array of arrays is the file to save to.
BUT bear in mind that I don't know how large your data is. I am using the data as posted by Loops, and in fact his solution might be better.
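A sketch of that last step, assuming pairs of the form [interval, line] like the ones built above. The sub name write_by_interval, the split_<interval>.data naming (borrowed from Loops' post) and the temporary directory are illustrative assumptions.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write each [interval, line] pair to
# "$dir/split_<interval>.data", opening each file only once.
sub write_by_interval {
    my ($dir, @pairs) = @_;
    my %fh;
    for my $pair (@pairs) {
        my ($interval, $line) = @$pair;
        unless ($fh{$interval}) {
            my $name = "$dir/split_$interval.data";
            open($fh{$interval}, '>', $name)
                or die "Can't open '$name': $!";
        }
        print { $fh{$interval} } $line;
    }
    close($_) for values %fh;
    return sort { $a <=> $b } keys %fh;    # interval numbers written
}

my $dir = tempdir(CLEANUP => 1);
my @written = write_by_interval($dir,
    [0, "line A\n"], [0, "line B\n"], [33, "line C\n"]);
print "intervals written: @written\n";
```

This keeps one filehandle open per interval until the end, so for very many intervals the per-process limit on open files could become a concern.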
If you tell me, I'll forget.
If you show me, I'll remember.
If you involve me, I'll understand.
--- Author unknown to me
Thank you for the consideration, I truly appreciate it! My data spans 155,000,000 nucleotides. In my original code I opened the file: INTERVALS which stores the windows of varying sizes I'd like to use for sorting,
chrX 1 1000000
chrX 1000001 2000001
chrX 2000001 3000001
...etc.
the largest being 1Mb (1,000,000), and read each line into a scalar variable $interval. I chomp this scalar, split it into an array, and set $start and $end from the array elements at indices 1 and 2 respectively. I don't wish to hard-code my range because I have multiple windows to work with:
open (INTERVAL, "/Users/logancurtis-whitchurch/Dropbox/thesis_folder/galaxy_chrX_data/chrX_1Mbwindow_nonoverlapping.interval") or die "can't open file\n";
while (my $interval = <INTERVAL>){
    chomp($interval);
    my @find_interval = split(/\t/, $interval);
    my $start = $find_interval[1];
    my $end = $find_interval[2];
From here I used an arbitrary switch variable to control when printing to a given output file should stop and printing to a new file should begin.
My condition: while $switch == 1, I open my data input file, then specify and open my output file:
my $switch = 1;
while ($switch == 1) {
    open (CG, "/Users/logancurtis-whitchurch/Dropbox/thesis_folder/CompleteGenomics/28_males_inAll/CGS.inall.28.chr.23.txt") or die "can't open CG file\n";
    my $output_file = "/Users/logancurtis-whitchurch/Desktop/temp_$count.txt";
    open(OUT, ">$output_file");
Then with these 3 lines I create an array for the whole input file (@SNPs), make an array from one line of the input file (@get_SNP), and create a variable for the position ($position), which is meant to advance as the data is read via my placeholder variable ($placeholder):
my @SNPs = <CG>;
my @get_SNP = split(/\t/, $SNPs[$placeholder]);
my $position = $get_SNP[3];
I then use an if statement to say: if my position is less than or equal to the end and greater than the start, print the $SNPs[$placeholder] string corresponding to one data line, then increment the $placeholder value, repeating the loop until the if statement is false. The else branch then sets $switch = 0, ending the while loop. Once this is done I increment a global variable $count, which tells the open function to create a new output file, since I interpolated $count into the output file name earlier:
my $switch = 1;
while ($switch == 1) {
    if (($position < $end) && ($position >= $start)) {
        print OUT "$SNPs[$placeholder]\n";
        $placeholder++;
    }
    else {
        $switch = 0;
        $count++;
    }
}
}
Re: Use of Uninitialized in Concatenation or String Error?
by Laurent_R (Canon) on Aug 09, 2013 at 09:27 UTC
Hi,
I am not sure that it will work on every system, but I was able to create on my system an array of 155 filehandles. Then I only need to read the input data, find out which range each line belongs to, and print it to the right filehandle. Thus, there is no need to sort the data.
use strict;
use warnings;
use integer;

my @FH;
for my $i (0..154) {
    my $file = "file$i" . ".txt";
    open $FH[$i], ">", $file or die "could not open $file $!\n";
    my $fh = $FH[$i];
    print $fh "File number $i \n";
}
while (<DATA>) {
    # split on any whitespace, so tab- or space-separated input both work
    my $locus = (split ' ', $_)[4];
    my $range_nr = $locus / 1e6;    # integer division under "use integer"
    warn "out of range: $_" and next if $range_nr > 154;
    my $fh = $FH[$range_nr];
    print $fh $_;
}
# close every handle, including the one for file0.txt
close $_ for @FH;
__DATA__
0 50 2 48 654732 0 1 1 1 0 2 3 2 1 3
0 50 4 46 1723430 0 2 1 2 1 1 1 1 3 1
0 50 2 48 14654732 0 1 1 1 0 2 3 2 1 3
0 50 4 46 7723430 0 2 1 2 1 1 1 1 3 1
0 50 2 48 2654732 0 1 1 1 0 2 3 2 1 3
0 50 2 48 2654733 0 1 1 1 0 2 3 2 1 3
0 50 2 48 2654734 0 1 1 1 0 2 3 2 1 3
0 50 2 48 2654735 0 1 1 1 0 2 3 2 1 3
0 50 2 48 2654736 0 1 1 1 0 2 3 2 1 3
0 50 4 46 6723430 0 2 1 2 1 1 1 1 3 1
Content of file2.txt:
File number 2
0 50 2 48 2654732 0 1 1 1 0 2 3 2 1 3
0 50 2 48 2654733 0 1 1 1 0 2 3 2 1 3
0 50 2 48 2654734 0 1 1 1 0 2 3 2 1 3
0 50 2 48 2654735 0 1 1 1 0 2 3 2 1 3
0 50 2 48 2654736 0 1 1 1 0 2 3 2 1 3
This looks like an interesting solution. I'll have to put in some time to fully understand what you did, but I very much appreciate the help. It's great to see how others approach a problem differently and, certainly here, more eloquently.
Just to explain a bit more what the program does: in the first for loop, the program opens 155 files, named file0.txt to file154.txt, and writes a header line "File number #" to each. The header is not really needed; it was just a debugging aid to check that the program actually writes correctly to all 155 files.
Then I read the input data file sequentially; for each line, I look up the locus and divide its value by 1000000 (integer division), which gives a range number between 0 and 154. This range number is then used to choose which output file to print the current line to.
Then there is a final for loop to close all the files.
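The range calculation on its own can be sketched as follows. The sub name range_number and its parameters are made up for illustration, and it uses int() instead of "use integer" so it is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: map a locus to a window number by integer division,
# or return undef when the result is above the highest window $max.
sub range_number {
    my ($locus, $size, $max) = @_;
    my $nr = int($locus / $size);
    return $nr > $max ? undef : $nr;
}

# Show the window number for a few loci.
printf "%d -> %s\n", $_, range_number($_, 1_000_000, 154) // 'out of range'
    for 654_732, 2_654_732, 155_000_000;
```

Note that with this mapping position 1,000,000 falls into window 1, not window 0. If the windows are meant to be 1-1000000, 1000001-2000000 and so on, dividing ($locus - 1) instead would keep it in window 0; which one is right depends on the window convention wanted.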
Re: Use of Uninitialized in Concatenation or String Error?
by Laurent_R (Canon) on Aug 09, 2013 at 06:22 UTC
How many ranges do you expect to have? My question is really: can you open all your output files at the same time? If yes, then the solution can be even simpler and not require any sorting.
1 1000001
1000001 2000001
2000001 3000001
3000001 4000001
...etc until 155000000 is reached
I planned to use one range per program; each program would print as many output files as the total number of sites divided by the window size. So the 1Mb range would be 155 million / 1 million = 155 output files. I'm not quite sure how to obtain the output files needed without sorting the original file.