Re: Creating a binary matrix

You failed to mention what part of the problem you are having trouble with. I'm going to make the assumption that you already know enough Perl to open files, and so my solution assumes that you've already got the files in an array of some sort. Because of this assumption (a consequence of your lack of specifying sufficient detail), you will have to adapt this solution to your needs.

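use strict;
use warnings;

# Master list of genes; one matrix row per gene.
my @genes = qw( Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 );

# Stand-in for the file contents: one string of whitespace-separated genes per file.
my @raw_files = (
  "Gene1 Gene2 Gene3",
  "Gene2 Gene3 Gene4",
  "Gene3 Gene4 Gene5",
);

# One hash per file, keyed by the genes that file contains.
my @gene_in_files = map {
  my %content;
  @content{ split " ", $_ } = ();
  \%content;
} @raw_files;

# One row per gene, one column per file.
# exists returns 1 or the empty string; ~~ numifies that to 1 or 0.
my @gene_matrix = map {
  my $gene = $_;
  [ map { ~~exists $_->{$gene} } @gene_in_files ]
} @genes;

# Label each row with the actual gene name rather than a reconstructed one.
print "$genes[$_] @{$gene_matrix[$_]}\n" for 0 .. $#gene_matrix;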

This solution puts the contents of each file into a hash so that it can be quickly determined whether, say, Gene1 can be found in File1. Then it just iterates over the genes, and tests each file to see if the gene is found in the file. If so, it sets that cell of the gene matrix to 1. Otherwise, it sets the cell to 0.

If your requirement is that you use actual bits rather than an array of 1's and 0's, that too is pretty simple, but I'm going to assume that you know how to read the documentation for vec, and are able to adapt the solution to fit that need.
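For what it's worth, here is a minimal sketch of what a vec-based variant might look like, assuming one bit string per gene with bit N standing for file N. The names %bits_for and @rows are just illustrative, and this is only one of several reasonable layouts (one bit string per file would work too).

```perl
use strict;
use warnings;

my @genes = qw( Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 );

my @raw_files = (
  "Gene1 Gene2 Gene3",
  "Gene2 Gene3 Gene4",
  "Gene3 Gene4 Gene5",
);

# One lookup hash per file, as in the array-based solution.
my @gene_in_files = map { +{ map { $_ => 1 } split " ", $_ } } @raw_files;

# One bit string per gene; bit N is set when the gene occurs in file N.
# (%bits_for is an illustrative name, not part of the original solution.)
my %bits_for;
for my $gene (@genes) {
  my $bits = '';
  for my $file_idx ( 0 .. $#gene_in_files ) {
    vec( $bits, $file_idx, 1 ) = exists $gene_in_files[$file_idx]{$gene} ? 1 : 0;
  }
  $bits_for{$gene} = $bits;
}

# Read the bits back out with vec, one column per file.
my @rows;
for my $gene (@genes) {
  push @rows, join " ", $gene,
    map { vec( $bits_for{$gene}, $_, 1 ) } 0 .. $#gene_in_files;
}
print "$_\n" for @rows;
```

With only three files each gene fits in a single byte, so the savings here are trivial; packing bits starts to matter when there are many files or many genes.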

Here is the output from my example script:

Gene1 1 0 0
Gene2 1 1 0
Gene3 1 1 1
Gene4 0 1 1
Gene5 0 0 1
Gene6 0 0 0

Also, I suggest that when you're trying to show us tabular input and output, you simply wrap it in <code></code> tags; it's easier to maintain fixed column widths when you don't have to worry about how HTML gobbles up duplicated whitespace, and you won't have to put <br /> after each line of tabular data. See Writeup Formatting Tips. By way of example, when I posted my sample output, I did this:

<code>
[shift-insert, to paste output from my terminal]
</code>

Update: Simplified the solution by eliminating temp variables holding various stages of the data transform.

Update2: And here's my "just for fun" version:

my @genes = qw( Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 );

my @raw_files
  = ( "Gene1 Gene2 Gene3", "Gene2 Gene3 Gene4", "Gene3 Gene4 Gene5" );

my $gene_num = 1;
print "Gene", $gene_num++, " @{$_}\n" for sub {
  my @in_file = map { +{ map { $_ => 0 } split " ", $_ } } @{+shift};
  map {
    my $gene = $_;
    [ map { ~~exists $_->{$gene} } @in_file ]
  } @{+shift};
}->( \@raw_files, \@genes );

Dave

Re^2: Creating a binary matrix by perl_user123 (Initiate) on Mar 21, 2014 at 06:23 UTC
Dear Dave,

Thanks for your response. I had read the master list into an array, and was then reading each of the smaller lists into a separate array iteratively and comparing the two arrays (the master array and the smaller array). But I got stuck while creating/printing the matrix.

A question about your code: do I have to read all my files into a single array called @raw_files? These 7 files have a single column of gene IDs (~5000 in number). How do I read all these files into a single array? I could, though, create one single file by "pasting" the individual files as separate columns.

Thanks in advance for any help.

Regards, Anupam
Re^3: Creating a binary matrix by davido (Cardinal) on Mar 21, 2014 at 14:32 UTC
In short, yes, you do. Here's why: Loading files into individual arrays such as "@array1", "@array2" ... "@arrayn" logically leads you to the situation where you are hard-coding the array names into your script. Iterating over them means something like this:

foreach my $aref ( \@array1, \@array2, ... \@arrayn ) {...

So what if "n" changes from three to four? You're stuck; you have to go back into your source code and add a new variable for that new file. It's unmaintainable, and someone using your script would suffer the impatience of knowing that the computer could be doing more for them, if only the programmer had been more lazy (in the positive sort of way).

You might then start to wonder, is it possible to automatically iterate over arrays named @array1, @array2, and so on? ...and if they are set up as package global variables, in fact, it is. But doing so crosses the barrier that "strict 'refs'" explicitly prohibits, because it's generally unsafe, or as some people put it "it's stupid to use a variable as a variable name"... and most of all, because there's usually a better solution; years of CS have moved us mostly away from symbolic style references.

So the solution is to store your files all in an array, or possibly an array of arrays. Let's say you have your list of filenames in an array named "@filenames"; one filename per array element. In that case, you could:

use File::Slurp 'read_file';

my @raw_files;
foreach my $filename ( @filenames ) {
  push @raw_files, scalar read_file( $filename, chomp => 1 );
}

...and then you should have a data structure that is already appropriate to feed to my script.

Later on, if you're interested in making the code more elegant, you would probably refactor such that as you are reading the file into memory you are handling all the other processing on it, such as splitting on whitespace, and even setting up its elements as hash keys. But that's not strictly necessary in this step, and doing so would require refactoring the rest of my solution as well.

Dave
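As a rough sketch of the refactoring mentioned above, the files could be read with core Perl straight into one lookup hash each, skipping the intermediate @raw_files strings. The temporary directory, the list1.txt style filenames, and the sample contents are all hypothetical stand-ins for the real gene lists; a real script would probably take @filenames from @ARGV or a glob.

```perl
use strict;
use warnings;
use File::Temp 'tempdir';

# Hypothetical sample data standing in for the real gene lists:
# one gene ID per line, one file per list.
my $dir    = tempdir( CLEANUP => 1 );
my %sample = (
  'list1.txt' => "Gene1\nGene2\nGene3\n",
  'list2.txt' => "Gene2\nGene3\nGene4\n",
  'list3.txt' => "Gene3\nGene4\nGene5\n",
);
for my $name ( sort keys %sample ) {
  open my $out, '>', "$dir/$name" or die "Cannot write $dir/$name: $!";
  print {$out} $sample{$name};
  close $out or die "Cannot close $dir/$name: $!";
}

# All filenames live in one array, so adding a file needs no new variable.
my @filenames = map "$dir/$_", sort keys %sample;

# Read each file directly into a lookup hash keyed by gene ID.
my @gene_in_files;
for my $filename (@filenames) {
  open my $in, '<', $filename or die "Cannot open $filename: $!";
  my %content;
  while ( my $line = <$in> ) {
    $content{$_} = 1 for split " ", $line;
  }
  close $in;
  push @gene_in_files, \%content;
}

# Build and print the matrix: one row per gene, one column per file.
my @genes = qw( Gene1 Gene2 Gene3 Gene4 Gene5 Gene6 );
my @rows  = map {
  my $gene = $_;
  join " ", $gene, map { exists $_->{$gene} ? 1 : 0 } @gene_in_files;
} @genes;
print "$_\n" for @rows;
```

Because the columns follow the order of @filenames, going from three files to seven only means a longer filename list; nothing else in the script changes.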