
I'd like to request help from a Monk on how to build a complex data structure based on a file's contents. I am new to this and, despite having studied various tutorials on the topic, I cannot figure out the correct procedure.

I am trying to read the data from a file into a Hash of Arrays of Arrays (HoAoA). Each individual has its own line in the file and becomes a hash key. The data for each individual (comma-separated pairs of zeros and ones) should be split into 100 kilobase (kb) windows, based on the genomic coordinates in the header line of each file. Each 100 kb window therefore gets its own array inside a larger array that holds all of the data for that individual.

The code I have so far is listed below, followed by a sample from one of the files I need to work with. In this sample, the first several interior arrays would be empty: every column shown falls into window number 162. For other files this would not be the case. For each individual (beginning on line 2 of the input file), I want all data from columns whose header coordinates lie between 1 and 99,999 (if any) to go into the first array, data from 100,000-199,999 into the second array, and so on.
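The window arithmetic described above can be sketched as follows. This is only an illustration: `window_index` is a hypothetical helper name, and it assumes windows are counted from zero, so that the "first array" is index 0.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): map a genomic
# coordinate to a 0-based window index. The window size defaults to 100 kb.
sub window_index {
    my ( $coord, $window_size ) = @_;
    $window_size //= 100_000;
    return int( $coord / $window_size );
}

print window_index(99_999),     "\n";    # 0   (first window)
print window_index(100_000),    "\n";    # 1   (second window)
print window_index(16_287_215), "\n";    # 162 (first coordinate in the sample header)
```

Under this assumption, coordinates 1-99,999 map to index 0 and 100,000-199,999 map to index 1, which matches the layout described in the paragraph above.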

Help on this would be most appreciated.

#!/usr/bin/perl
use warnings;
use strict;
use v5.14;

die "need two arguments (i.e. chr cont) at invocation" unless @ARGV == 2;

chomp( my $chr_num = shift );
chomp( my $cont    = shift );

open my $out_file, ">", "chr${chr_num}_exome_snps_processed_${cont}_STATS"
  or die "Can't open output file: $!\n";

# Get a list of individuals (will be hash keys later):
open my $in_file, "<", "chr${chr_num}_exome_snps_processed_$cont"
  or die "Can't open input file: $!\n";

my @individuals;
my %data;

while (<$in_file>) {
    chomp;
    my @snp_bins;
    if (/^SAMPLE/) {
        my ( $placeholder, @coords ) = split /,/;
        foreach my $coord (@coords) {
            push @snp_bins, int( $coord / 100_000 );
        }
    }
    else {
        my ( $id, @snps ) = split /,/;
        push @individuals, $id;
        foreach my $individual (@individuals) {
            foreach my $snp (@snps) {
                $data{$individual}[ [ shift @snp_bins ] ] = $snp;
            }

        }

    }
}

close $in_file;

## Sample of data file. Each file has hundreds of thousands of columns and hundreds of rows
SAMPLE,16287215,16287226,16287365,16287649,16287784,16287851,16287912
HG00553,0 0,0 0,0 0,0 0,0 0,0 0,0 0
HG00554,0 0,0 0,0 0,0 0,0 0,0 0,0 0
HG00637,0 0,0 0,0 0,0 0,0 0,0 0,0 0
HG00638,0 0,0 0,0 0,0 0,0 0,1 1,0 0
HG00640,0 0,0 0,0 0,0 0,0 0,1 1,0 0
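One possible way to populate the structure is sketched below, using two rows of the sample data inline rather than reading from a file. It is a sketch under stated assumptions, not a definitive solution: the `@lines` array and the `$WINDOW` constant are introduced here for illustration, and the layout `$data{$id}[$window]` holding a list of genotype strings is an assumption about what the HoAoA should contain. It differs from the posted code in two ways: `@snp_bins` is declared outside the loop so it survives past the header line, and each column is looked up by position instead of shifting values off `@snp_bins`.

```perl
use strict;
use warnings;

my $WINDOW = 100_000;    # window size in bases (100 kb); assumed constant name

# Inline copy of the header and two sample rows (illustration only);
# a real run would read these lines from the input file handle.
my @lines = (
    'SAMPLE,16287215,16287226,16287365,16287649,16287784,16287851,16287912',
    'HG00553,0 0,0 0,0 0,0 0,0 0,0 0,0 0',
    'HG00638,0 0,0 0,0 0,0 0,0 0,1 1,0 0',
);

my @snp_bins;    # window index of each column, filled from the header line
my %data;        # $data{$id}[$window] = [ genotype strings in that window ]

for my $line (@lines) {
    my ( $first, @fields ) = split /,/, $line;

    if ( $first eq 'SAMPLE' ) {
        # Header line: convert every coordinate to its window index once.
        @snp_bins = map { int( $_ / $WINDOW ) } @fields;
        next;
    }

    # Data line: $first is the individual's ID. Look up each column's
    # window by position so @snp_bins stays intact for the next row.
    for my $i ( 0 .. $#fields ) {
        push @{ $data{$first}[ $snp_bins[$i] ] }, $fields[$i];
    }
}

print scalar @{ $data{HG00638}[162] }, "\n";    # 7
print $data{HG00638}[162][5], "\n";             # 1 1
```

Note that windows with no data (indices 0 to 161 here) are left as `undef` rather than empty arrays, so code that walks the structure may want to write `@{ $data{$id}[$w] // [] }` when reading a window.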

In reply to Population of HoAoA based on file contents by iangibson
