in reply to Trying to edit HoAoA entries

Hello, I have been looking at this, and I believe I have a solution.

The main problem I faced was that the auto-vivified hashes/arrays initialise undefined. So I could not determine to fill out an undefined array as all would be undefined. I gather this is why you are iterating through the snps arrays first then trying to map the indexes of those to the indexes of the snp_bins.

Not being able to determine by whether an array was defined or not, I slept on it and in the morning decided that using your method to initially map the indexed array but then retro filling the undefined arrays with '1 1,' using an auto decrement unless/until loop may work.

I had to hammer at the loops for a bit, and the flow control assumes that the SAMPLE line is always line '1' of the data file. But it does what I think you need it to do.

Updated with do until loops and code works with tests printing at the end. The tests use the second slightly updated dnata file to include discontigous oasis coordinates - as per akme ordnance survey chart special duties to the pirates under the governance of silverbeard the undocumented. addenda: this scroll is only to be seen by the scribe on penalty of plank.

dnata file

SAMPLE,16287215,16287226,16287365,16287649,16287784,16287851,16287912 HG00553,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00554,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00637,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00638,0 0,0 0,0 0,0 0,0 0,1 1,0 0 HG00640,0 0,0 0,0 0,0 0,0 0,1 1,0 0
SAMPLE,16287215,16287226,16287365,16287649,16287784,16887851,17187912 HG00553,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00554,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00637,0 0,0 0,0 0,0 0,0 0,0 0,0 0 HG00638,0 0,0 0,0 0,0 0,0 0,1 1,0 0 HG00640,0 0,0 0,0 0,0 0,0 0,1 1,0 0

solution

#!/usr/bin/perl -T use strict; use warnings; my (@snp_bins, %data); open my $in_file, "<", "./dnata"; while (<$in_file>) { chomp; if ($. == 1 ) { # line number my ( $placeholder, @coords ) = split /,/; @snp_bins = map int( $_ / 100_000 ), @coords; next; } if ($. >= 2){ my ( $id, @snpspairs ) = split /,/; foreach my $oasis (@snp_bins){ my $os = $oasis; @{ $data{$id}[$os] } = @snpspairs; $os--; unless ( defined( @{ $data{$id}[$os] } ) ){ do { @{ $data{$id}[$os] } = map (q(1 1), 0..99); # do 0..9 for readability on output!! $os--; } until( defined( @{ $data{$id}[$os] } ) ) } } } } foreach my $k (sort keys %data){ print $k," Ref: ",@{ $data{$k} }->[0],$/; print $k," 0: ",@{ $data{$k}->[0] },$/; print $k," 161: ",join (',',@{ $data{$k}->[161]}),$/; print $k," 161:5 ",@{ $data{$k}->[161]}[5],$/; print $k," 162: ",join (',',@{ $data{$k}->[162]}),$/; print $k," 162:5 ",@{ $data{$k}->[162]}[5],$/; print $k," 165: ",join (',',@{ $data{$k}->[165]}),$/; print $k," 165:5 ",@{ $data{$k}->[165]}[5],$/; print $k," 168: ",join (',',@{ $data{$k}->[168]}),$/; print $k," 168:5 ",@{ $data{$shark}->[168]}[5],$/; print $k," 170: ",join (',',@{ $data{$k}->[170]}),$/; print $k," 170:7 ",@{ $data{$k}->[170]}[7],$/; print $k," 171: ",join (',',@{ $data{$k}->[171]}),$/; print $k," 171:7 ",@{ $data{$k}->[171]}[7],$/; }

There may be better ways to control the flow re the line numbers and of course the flow depends on if /^SAMPLE/ lines occur more than once throughout the file. Also I expect my loop exits could be improved upon. However this should get the ball rolling. For a start there may be a bit of tweakery required when retro filling from say coord 175 back down to 162 so that the loop does not continue through 162 to 0, but the until(defined) condition I'm hoping should catch that.

Coyote

Updates: Discovered do until loops !!! Now this is working like a charm. Note, the output is controlling the comma placement for csv compatability, hence the join function in the tests. fixed: The '1 1' x 100 method was replaced with a map as the x 100 method was returning one very long string stored in index 0, rather than 100 index array each containing '1 1'. I also attempted to remove the oasis scalar but this broke the code so the island stays. :p

Replies are listed 'Best First'.
Re^2: Trying to edit HoAoA entries
by iangibson (Scribe) on Jun 19, 2012 at 15:28 UTC

    Thanks for your reply. However, I think I'm going to have to rethink my approach, as with the size of files I'm dealing with I'm using way too much memory building the entire data structure before working with it. So I think I'm going to build and process one window at a time, and see how that works.

      Yes, I cannot imagine that any module of the nature you are dealing with is not expecting the kind of data you have. However the problem was an interesting one to me the longer I studied it. Once I understood the problem, I could not really make any suggestions as such, as I was too busy trying to figure out the solution too. Hence providing a whole solution rather than some pointers.

      Did your solution of submitting the full array indices actually work though?

      One thought which occured to me and may or not be relevant is that maybe the windows do not need to be 'full', but the array indices just may need to not be empty (or undefined). Without understanding the full layout of the incoming data and the way the data is expecting to be recieved once processed, it is impractical to start guessing what may or not be required though.

      The only assumption I did make is that the data you provided is extended a few (hundred?) thousand times. There was a bit of wondering if this was the case or not, or whether you had simplified the data to highlight the problem.

      Must the data really be output in the way you have detailed? Or at another angle, is the recipient of the data really expecting the data in the way you are outputing it? You did mention that you are attempting to get a certain module to accept the data. What is it about the data that is not being accepted by the module? What is the expected input/output of that module?

      I think you may be able to tell I have become rather inquisitive about this. Such being the nature of scientific research I suppose.

      further thought: taking a reference to the @onesarray would likely reduce the overall memory required considerably

      my $refonesarr = map '1 1', 0..99; for (0..170000){ push( @{ $data{$k}}->[$_]}, $refonesarray ) } print $k," 161: ",join (',',@${ $data{$k}->[161]}),$/;

      untested code.

      Also, I have just now glanced through the previous thread from where you have arrived at this point.

      Instead of using the indexes of the hashed array to hold the relevant window why not just enter the 'index' as the first value in the array? a classless class argument held as an array, if you will.

      This would reduce the HoAoA to an HoA and would be a lot easier to handle. When you need to extract particular windows on individuals you could then just run a grep search on the classless class value in each individuals array to compile the relevant window arrays for use. This would also eradicate any empty array difficulties. But would this be ok with the differing chromosome coords you mentioned momentarily?

      The crass argument may even be better represented as a HoHoA where the keys of the second Hash are string literals representing each numeric window to the array values.

      Coyote