Re: Merging/Rearranging Tables
by liverpole (Monsignor) on Feb 09, 2007 at 18:36 UTC
Hi homeveg,
I think you will want to do this with a couple of passes.
In the first pass, read each table, keeping track of how many columns it has and saving the individual values for each ID.
In the second pass, for each ID, collect the corresponding values from each table, or fill in the appropriate number of "n.a." placeholders (if the ID isn't defined for that table).
For example:
# Strict
use strict;
use warnings;

# Libraries
use Data::Dumper;

# User-defined

# Define tables, each one a separate value in the hash "$ptables"
my $ptables = {
    'tab1' => [
        "ID column | column 1",
        "gene 1 | value 1.1",
        "gene 2 | value 2.1",
        "gene 4 | value 4.1",
        "gene 8 | value 8.1",
    ],
    'tab2' => [
        "ID column | column 1 | column2",
        "gene 1 | value 1.1 | value 1.2",
        "gene 3 | value 3.1 | value 3.2",
        "gene 4 | value 4.1 | value 4.2",
    ],
};

# Globals
my %output;
my %ncolumns;
my %values;
my @tables = (sort keys %$ptables);                # Get all table names

# Main program
# First pass -- parse each table to fetch all the IDs
print "=== Pass 1 ===\n";
foreach my $table (@tables) {
    my $ptab = $ptables->{$table};                 # Assign to table
    my @rows = split(/\s*\|\s*/, shift @$ptab);    # Get column headings
    shift @rows;                                   # Discard "ID column"
    my $ncols = @rows;                             # Find number of columns
    $ncolumns{$table} = $ncols;                    # Save # of columns
    print "Reading $table; $ncols col(s)...\n";    # Announce table name
    foreach my $line (@$ptab) {
        my ($id, @vals) = split(/\s*\|\s*/, $line);    # Get ID and values
        $output{$id} ||= [ ];                          # Placeholder for ID
        $values{$table}{$id} = [ @vals ];              # Save values for table/ID
    }
}

# Second pass -- process each ID, adding values from each table
my @ids = (sort keys %output);
print "=== Pass 2 ===\n";
foreach my $id (@ids) {
    print "Processing ID $id\n";
    my $pout = $output{$id};                   # Get current ID list
    foreach my $table (@tables) {
        my $ncols   = $ncolumns{$table};       # Get number of columns
        my $pvalues = $values{$table}{$id};    # Get values for table/ID
        if (defined($pvalues)) {
            push @$pout, @$pvalues;            # Save values
        } else {
            push @$pout, ( "n.a." ) x $ncols;  # Missing value = N/A
        }
    }
}

# Verify results
print "=== Verify results ===\n";
foreach my $id (@ids) {
    my $pvalues = $output{$id};
    printf "%12.12s | %s\n", $id, join(" | ", @$pvalues);
}
The output of which is:
gene 1 | value 1.1 | value 1.1 | value 1.2
gene 2 | value 2.1 | n.a. | n.a.
gene 3 | n.a. | value 3.1 | value 3.2
gene 4 | value 4.1 | value 4.1 | value 4.2
gene 8 | value 8.1 | n.a. | n.a.
Of course, you can always add more tables to the master hash $ptables.
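For instance, a hypothetical third table (the name and values below are made up purely for illustration) could be added to the same structure before the passes run, and the rest of the code would pick it up automatically:

# Hypothetical extra table, following the same "ID column | ..." layout
$ptables->{'tab3'} = [
    "ID column | column 1",
    "gene 2 | value 2.3",
    "gene 5 | value 5.3",
];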
Thank you a lot! It looks like a ready-made solution for me!
I still have some problems with hashes (but I am learning...), so I still have a question:
-----------
How do I fill the HoA with the values? Will the following code work for me?
# define files containing tables (tab-delimited text files)
my @files = ['tab1.txt','tab2.txt'];

# Define master hash "%ptables"
my %ptables;

# read files and add data to the HoA:
foreach my $file (@files) {
    my @string_array = Read_File($file);        # define arrays of strings
    my $hash_key =~ s/\.txt//;                  # generate hash key
    $ptables{$hash_key} = [ @string_array ];    # save all strings to the hash
}
"Thank you a lot!"
You're welcome a lot!
"Will the following code work for me?"
Well, I'm tempted to ask -- "what happens when you try?". The best way to learn is, after all, by trying.
I do see that you're trying to initialize @files from a list reference ['tab1.txt','tab2.txt'], which likely won't do what you're expecting (you'll get a single item in @files, which is itself a reference to the 2-item list).
Better to declare it like this:
my @files = ('tab1.txt','tab2.txt');
# Or, using the "quote-word" function "qw",
# which lets you omit the quotes and the comma:
my @files = qw( tab1.txt tab2.txt );
Furthermore, you're creating a variable $hash_key which you're never assigning to, but rather trying to perform a regex substitution on with:
my $hash_key =~ s/\.txt//; # generate hash key
That's why, when I run it with use strict; and use warnings; I get the error:
Use of uninitialized value in substitution (s///) at merge.pl line 16.
So I'm assuming what you want instead is to assign to the filename $file, and then perform the substitution to get the resulting hash key:
(my $hash_key = $file) =~ s/\.txt//; # generate hash key
A final thought: make liberal use of Data::Dumper to see what data a given data structure contains at any time. For example, to see the entire contents of $ptables after making each assignment:
$ptables{$hash_key} = [ @string_array ];    # save all strings to the hash

print Dumper(\%ptables);                    # use "\" to pass reference of hash
Update: fixed typo (thanks johngg).
Re: Merging/Rearranging Tables
by talexb (Chancellor) on Feb 09, 2007 at 17:22 UTC
It sounds like a HoA (hash of arrays) would do the job fine.
If this doesn't make sense to you, I'm sure some of the other monks will be able to jump in with sample code.
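For concreteness, here is a minimal sketch of the kind of hash-of-arrays layout meant here (the keys and values are placeholders, not tied to any particular table):

use strict;
use warnings;

# Hash of arrays: each gene ID maps to the list of values collected for it.
my %merged = (
    'gene 1' => [ 'value 1.1', 'value 1.2' ],
    'gene 2' => [ 'value 2.1', 'n.a.'      ],
);

# Append another value to an existing ID's list:
push @{ $merged{'gene 1'} }, 'value 1.3';

# Print one merged row per ID:
for my $id (sort keys %merged) {
    print join(' | ', $id, @{ $merged{$id} }), "\n";
}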
Alex / talexb / Toronto
"Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds
Thanks for the suggestion!
Well, I was looking at the HoA, but I had some problems with proper hash organisation; that's why I was trying hard with pure arrays.
Anyway, below I see the code answering my question.
Re: Merging/Rearranging Tables
by dsheroh (Monsignor) on Feb 09, 2007 at 18:12 UTC
If you're on a *nix system and don't absolutely have to do this with perl, the join command would take care of this quite simply... Dump the tables to files (sorted on Field1, as join does require input files to be sorted), then something like join -a1 -a2 -t\| file1 file2 should do it. The main disadvantage of this method is that join only works on two files at a time.
To do it in perl, your first idea is on the right track, and is one of the fastest/most efficient ways to do this, with a couple of minor optimizations:
- Make a single list of all unique ID values from all tables. As already suggested, a hash would be ideal for this, since it will eliminate any duplicates.
- Sort this list and sort all tables by ID.
- Walk the list of unique IDs and grab the matching record (if any) from each table. Since everything's sorted, you can just run a single query against each table and only advance a record when a match is found.
This gives you a total of two queries against each table: one to collect all IDs at the start, then one to grab the data by ID after that list is built.
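A rough Perl sketch of that sorted walk, assuming each table has already been read into an array of [id, values...] rows sorted by ID (all the variable names and sample data here are illustrative, not part of the original suggestion):

use strict;
use warnings;

# Illustrative input: one sorted array of [id, values...] rows per table.
my @tables = (
    [ [ 'gene 1', 'value 1.1' ],              [ 'gene 2', 'value 2.1' ]              ],
    [ [ 'gene 1', 'value 1.2', 'value 1.3' ], [ 'gene 3', 'value 3.2', 'value 3.3' ] ],
);
my @widths = (1, 2);    # number of value columns in each table

# Query 1: collect the sorted list of unique IDs across all tables.
my %seen = map { $_->[0] => 1 } map { @$_ } @tables;
my @ids  = sort keys %seen;

# Query 2: walk the IDs with one cursor per table, advancing a cursor
# only when its current row matches the ID being processed.
my @pos = (0) x @tables;
for my $id (@ids) {
    my @row = ($id);
    for my $t (0 .. $#tables) {
        my $current = $tables[$t][ $pos[$t] ];
        if ($current && $current->[0] eq $id) {
            push @row, @{$current}[ 1 .. $#$current ];
            $pos[$t]++;
        }
        else {
            push @row, ('n.a.') x $widths[$t];
        }
    }
    print join(' | ', @row), "\n";
}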
Unfortunately, I can only use a Windows-based PC. That's why I chose Perl.
My prototype script works more or less as you described, except that I had not thought before about making a single list of unique IDs. I first thought about it while I was writing my question here :)
And I was not clever enough to organize the final arrangement in a single query.
Thanks a lot!
Re: Merging/Rearranging Tables
by johngg (Canon) on Feb 09, 2007 at 17:58 UTC
"add additional string, corresponding to missing IDs to each table"
This step makes me wonder if you have a defined and ordered list of values that a gene ID might have. This would have a significant bearing on how to approach the problem.
Perhaps you could clarify this point. Cheers, JohnGG
My idea was to bring the tables to the same size, at minimum to the same number of rows. Then I can sort them and join them together without worrying about differences in the table dimensions.
Re: Merging/Rearranging Tables
by Cristoforo (Curate) on Feb 13, 2007 at 02:31 UTC
A little late, but I wanted to show a slightly different way. Really, much of it follows liverpole's ideas, but it saves the data as a string instead of an array of values.
Chris
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use Sort::Naturally;
my %data;
my $table;
my %col_cnt;
@ARGV = glob "tab*.txt";
while (<>) {
    next if /^ID/;
    chomp;
    ($table = fileparse $ARGV) =~ s/\.txt$// unless $table;
    $col_cnt{$table} ||= tr/|//;
    my ($gene, $cols) = split /\s*\|\s*/, $_, 2;
    $data{$gene}{$table} = $cols;
    $table = '' if eof;
}

my @tables = nsort keys %col_cnt;

for my $gene (nsort keys %data) {
    print $gene;
    for my $table (@tables) {
        if ($data{$gene}{$table}) {
            print ' | ' . $data{$gene}{$table};
        }
        else {
            print ' | ' . join " | ", ("n.a.") x $col_cnt{$table};
        }
    }
    print "\n";
}
Update: Changed from storing the output in a string, waiting until the entire output was built, and then printing, to printing as the program proceeds.
Update: Changed from the built-in sort to 'nsort' (use Sort::Naturally). If the fields being sorted contain a number exceeding 9 as part of their name, (gene1 gene2 gene9 gene10) or (tab1 tab2 ... tab11 tab12), they won't sort properly using the default sort.
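For anyone unsure why that matters, here is a tiny standalone comparison (the IDs are made up just to show the ordering):

use strict;
use warnings;
use Sort::Naturally;

my @ids = qw( gene10 gene2 gene1 gene9 );

print join(' ', sort  @ids), "\n";    # gene1 gene10 gene2 gene9  (lexical)
print join(' ', nsort @ids), "\n";    # gene1 gene2 gene9 gene10  (natural)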
Hi,
thanks for your solution. It is very interesting! Can you comment it a bit?
The only minor modification I would make is to keep the header line:
while (<>) {
    # next if /^ID/;
    chomp;
    ($table = fileparse $ARGV) =~ s/\.txt$// unless $table;
    $col_cnt{$table} ||= tr/|//;
    my ($gene, $cols) = split /\s*\|\s*/, $_, 2;
    $data{$gene}{$table} = $cols;
    $table = '' if eof;
}
while (<>) {
    next if /^ID/;
    chomp;

    # If $ARGV = 'tab1.txt', $table will = 'tab1',
    # unless $table is already initialized.
    # $table is uninitialized on the first read
    # and after end of file for every file
    # (see below: $table = '' if eof;)
    ($table = fileparse $ARGV) =~ s/\.txt$// unless $table;

    # If the column count for this table is not already
    # stored, then use the transliteration operator to count
    # the number of pipes (# of columns) and store it.
    $col_cnt{$table} ||= tr/|//;

    my ($gene, $cols) = split /\s*\|\s*/, $_, 2;

    # Here columns are not stored in an array, but in a string.
    # If the data is:
    #   gene 3 | value 3.1 | value 3.2
    # then $cols would be:
    #   'value 3.1 | value 3.2'
    $data{$gene}{$table} = $cols;

    $table = '' if eof;
}
Does this explain it better? I suspect that part of your question is about $ARGV, where it comes from, and about the input (angle) operator (while (<>)) reading files from @ARGV.
Chris
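In case it helps, here is a tiny standalone sketch of how while (<>), @ARGV and $ARGV fit together (the file names are only examples):

use strict;
use warnings;

# The diamond operator <> reads every file named in @ARGV, line by line,
# moving to the next file automatically.  $ARGV holds the name of the file
# the current line came from, and eof (no parentheses) is true on the last
# line of each individual file.
@ARGV = ('tab1.txt', 'tab2.txt');    # example file names

while (<>) {
    chomp;
    print "$ARGV: $_\n";
    print "-- end of $ARGV --\n" if eof;
}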