iangibson has asked for the wisdom of the Perl Monks concerning the following question:

I am still tring to successfully pass my data to a BioPerl module (as discussed here: Population of HoAoA based on file contents), but I now think that the problem that's making my program crash is the undef values in the data structure.

I've written some code to attempt to replace each undef with some fake data that flags a window as undef, but what I've written doesn't seem to be modifying the output. Perhaps someone can tell me how to fix this:

while (<$in_file>) { chomp; if (/^SAMPLE/) { my ( $placeholder, @coords ) = split /,/; foreach my $coord (@coords) { push @snp_bins, int( $coord / 100_000 ); } } else { my ( $id, @snps ) = split /,/; foreach my $snp (@snps) { # (@snps[0..$#snps-1] $snp =~ s/$snp/$snp,/ # put commas back in (preserve cs +v format) } foreach my $index ( 0 .. $#snp_bins ) { if ( $snps[$index] ) { push( @{ $data{$id}[ $snp_bins[$index] ] }, $snps[$index] +); # see note [1] } else { # replace 'undef' elements with f +lag data my @ones_array; foreach ( 1 .. 100) { push @ones_array, "1 1,"; } push( @{ $data{$id}[ $snp_bins[$index] ] }, @ones_arra +y ) } } } }

I am assuming that the problem lies with the line

if ( $snps[$index] )

Is this the case, and if so how can I fix it (I have examined my structure using Data::Printer, and all the 'undef's are still there)?

Input file sample (the file actually has hundreds of thousands of columns and hundreds of rows (obviously not shown)):

SAMPLE,16287215,16287226,16287365,16287649,16287784,16287851,16287912 HG00553,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00554,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00637,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00638,0 0,0 0,0 0,0 0,0 0,1 1,0 0 HG00640,0 0,0 0,0 0,0 0,0 0,1 1,0 0

Sample of HoAoA contents (HG00553 is the first individual, then we have a series of 100kb windows, the first 162 of which are undef, then the 163rd window has 7 entries, etc. There are many more individuals later in the input file. You can see from my code above that I'm trying to replace each undef with 100 '1 1,' entries, which would flag these windows as undef but (hopefully) allow the program to successfully pass the data to the module):

{ HG00553 [ [0] undef, [1] undef, <<< SNIP>>> [159] undef, [160] undef, [161] undef, [162] [ [0] "0 0,", [1] "0 0,", [2] "0 0,", [3] "0 0,", [4] "0 0,", [5] "0 0,", [6] "0 0," ], [163] undef, [164] [ [0] "0 0," ], [165] undef, [166] undef, [167] undef, [168] undef, [169] undef, [170] [ [0] "0 0,", [1] "0 0,", [2] "0 0,", [3] "0 0,", [4] "0 0,", [5] "1 1,", [6] "0 0,", [7] "0 0,", [8] "1 1,", [9] "0 0,", [10] "0 0,", [11] "0 0,", [12] "0 0," ], [171] undef, [172] [ [0] "0 0,", [1] "0 0,", [2] "0 0,", [3] "0 0,", [4] "0 0,", [5] "0 0,", [6] "0 0,", [7] "0 0,", [8] "0 0,", [9] "0 0,", [10] "0 0,", [11] "0 0,", [12] "0 0," ], [173] undef, [174] [ [0] "0 0,", [1] "0 0,", [2] "0 0,", [3] "0 0,", [4] "0 0,", [5] "0 0,", [6] "0 0,", [7] "0 0,", [8] "0 0,", [9] "0 0,", [10] "0 0,", [11] "0 0,", [12] "0 0,", [13] "1 1,", [14] "0 0,", [15] "0 0,", [16] "0 0,", [17] "0 0,", [18] "0 0,", [19] "0 0,", [20] "0 0,", [21] "0 0,", [22] "0 0,", [23] "0 0,", [24] "0 0,", [25] "0 0,", [26] "0 0,", [27] "0 0,", [28] "0 0,", [29] "0 0,", [30] "0 0,", [31] "0 0,", [32] "0 0,", [33] "0 0,", [34] "0 0,", [35] "0 0,", [36] "0 0,", [37] "0 0,", [38] "0 0,", [39] "0 0,", [40] "0 0,", [41] "0 0,", [42] "0 0,", [43] "0 1,", [44] "0 1,", [45] "0 0,", [46] "0 0,", [47] "0 0,", [48] "0 0,", [49] "0 0,", [50] "0 0,", [51] "0 0,", [52] "0 0,", [53] "0 0,", [54] "0 0," ], [175] [ [0] "1 0,", <<< SNIP >>>

Replies are listed 'Best First'.
Re: Trying to edit HoAoA entries
by Don Coyote (Hermit) on Jun 16, 2012 at 11:58 UTC

    Hello, I have been looking at this, and I believe I have a solution.

    The main problem I faced was that the auto-vivified hashes/arrays initialise undefined. So I could not determine to fill out an undefined array as all would be undefined. I gather this is why you are iterating through the snps arrays first then trying to map the indexes of those to the indexes of the snp_bins.

    Not being able to determine by whether an array was defined or not, I slept on it and in the morning decided that using your method to initially map the indexed array but then retro filling the undefined arrays with '1 1,' using an auto decrement unless/until loop may work.

    I had to hammer at the loops for a bit, and the flow control assumes that the SAMPLE line is always line '1' of the data file. But it does what I think you need it to do.

    Updated with do until loops and code works with tests printing at the end. The tests use the second slightly updated dnata file to include discontigous oasis coordinates - as per akme ordnance survey chart special duties to the pirates under the governance of silverbeard the undocumented. addenda: this scroll is only to be seen by the scribe on penalty of plank.

    dnata file

    SAMPLE,16287215,16287226,16287365,16287649,16287784,16287851,16287912 HG00553,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00554,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00637,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00638,0 0,0 0,0 0,0 0,0 0,1 1,0 0 HG00640,0 0,0 0,0 0,0 0,0 0,1 1,0 0
    SAMPLE,16287215,16287226,16287365,16287649,16287784,16887851,17187912 HG00553,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00554,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00637,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00638,0 0,0 0,0 0,0 0,0 0,1 1,0 0 HG00640,0 0,0 0,0 0,0 0,0 0,1 1,0 0

    solution

    #!/usr/bin/perl -T use strict; use warnings; my (@snp_bins, %data); open my $in_file, "<", "./dnata"; while (<$in_file>) { chomp; if ($. == 1 ) { # line number my ( $placeholder, @coords ) = split /,/; @snp_bins = map int( $_ / 100_000 ), @coords; next; } if ($. >= 2){ my ( $id, @snpspairs ) = split /,/; foreach my $oasis (@snp_bins){ my $os = $oasis; @{ $data{$id}[$os] } = @snpspairs; $os--; unless ( defined( @{ $data{$id}[$os] } ) ){ do { @{ $data{$id}[$os] } = map (q(1 1), 0..99); # do 0..9 for readability on output!! $os--; } until( defined( @{ $data{$id}[$os] } ) ) } } } } foreach my $k (sort keys %data){ print $k," Ref: ",@{ $data{$k} }->[0],$/; print $k," 0: ",@{ $data{$k}->[0] },$/; print $k," 161: ",join (',',@{ $data{$k}->[161]}),$/; print $k," 161:5 ",@{ $data{$k}->[161]}[5],$/; print $k," 162: ",join (',',@{ $data{$k}->[162]}),$/; print $k," 162:5 ",@{ $data{$k}->[162]}[5],$/; print $k," 165: ",join (',',@{ $data{$k}->[165]}),$/; print $k," 165:5 ",@{ $data{$k}->[165]}[5],$/; print $k," 168: ",join (',',@{ $data{$k}->[168]}),$/; print $k," 168:5 ",@{ $data{$shark}->[168]}[5],$/; print $k," 170: ",join (',',@{ $data{$k}->[170]}),$/; print $k," 170:7 ",@{ $data{$k}->[170]}[7],$/; print $k," 171: ",join (',',@{ $data{$k}->[171]}),$/; print $k," 171:7 ",@{ $data{$k}->[171]}[7],$/; }

    There may be better ways to control the flow re the line numbers and of course the flow depends on if /^SAMPLE/ lines occur more than once throughout the file. Also I expect my loop exits could be improved upon. However this should get the ball rolling. For a start there may be a bit of tweakery required when retro filling from say coord 175 back down to 162 so that the loop does not continue through 162 to 0, but the until(defined) condition I'm hoping should catch that.

    Coyote

    Updates: Discovered do until loops !!! Now this is working like a charm. Note, the output is controlling the comma placement for csv compatability, hence the join function in the tests. fixed: The '1 1' x 100 method was replaced with a map as the x 100 method was returning one very long string stored in index 0, rather than 100 index array each containing '1 1'. I also attempted to remove the oasis scalar but this broke the code so the island stays. :p

      Thanks for your reply. However, I think I'm going to have to rethink my approach, as with the size of files I'm dealing with I'm using way too much memory building the entire data structure before working with it. So I think I'm going to build and process one window at a time, and see how that works.

        Yes, I cannot imagine that any module of the nature you are dealing with is not expecting the kind of data you have. However the problem was an interesting one to me the longer I studied it. Once I understood the problem, I could not really make any suggestions as such, as I was too busy trying to figure out the solution too. Hence providing a whole solution rather than some pointers.

        Did your solution of submitting the full array indices actually work though?

        One thought which occured to me and may or not be relevant is that maybe the windows do not need to be 'full', but the array indices just may need to not be empty (or undefined). Without understanding the full layout of the incoming data and the way the data is expecting to be recieved once processed, it is impractical to start guessing what may or not be required though.

        The only assumption I did make is that the data you provided is extended a few (hundred?) thousand times. There was a bit of wondering if this was the case or not, or whether you had simplified the data to highlight the problem.

        Must the data really be output in the way you have detailed? Or at another angle, is the recipient of the data really expecting the data in the way you are outputing it? You did mention that you are attempting to get a certain module to accept the data. What is it about the data that is not being accepted by the module? What is the expected input/output of that module?

        I think you may be able to tell I have become rather inquisitive about this. Such being the nature of scientific research I suppose.

        further thought: taking a reference to the @onesarray would likely reduce the overall memory required considerably

        my $refonesarr = map '1 1', 0..99; for (0..170000){ push( @{ $data{$k}}->[$_]}, $refonesarray ) } print $k," 161: ",join (',',@${ $data{$k}->[161]}),$/;

        untested code.

        Also, I have just now glanced through the previous thread from where you have arrived at this point.

        Instead of using the indexes of the hashed array to hold the relevant window why not just enter the 'index' as the first value in the array? a classless class argument held as an array, if you will.

        This would reduce the HoAoA to an HoA and would be a lot easier to handle. When you need to extract particular windows on individuals you could then just run a grep search on the classless class value in each individuals array to compile the relevant window arrays for use. This would also eradicate any empty array difficulties. But would this be ok with the differing chromosome coords you mentioned momentarily?

        The crass argument may even be better represented as a HoHoA where the keys of the second Hash are string literals representing each numeric window to the array values.

        Coyote