A vertical (+/- random) split of a file

Hi,

Today I had to split a file vertically. I didn't find this in the Q&A section, so I wrote something myself and am posting it here because I think it could be useful to others. My apologies if it already exists on the site.

I have tab-delimited text files where the first column holds identifiers and the following columns contain data. The aim was to split the file vertically into n files. As a bonus, I wanted to shuffle the data columns (i.e. not preserve the order of the columns in the destination files). For my purpose, I had to repeat the original first column in every destination file, but this can easily be adapted to your needs.

This is a first try and I'm quite new to Perl, so comments and improvements are welcome.

#!/usr/bin/perl -w

# Split a tab-delimited text file into n files, repeating
# the first column and shuffling the data columns.
# Julien Textoris

use strict;
use warnings;

open(IN, '<', $ARGV[0]) or die "Cannot open '$ARGV[0]': $!";

my $n = $ARGV[1];

#Thanks to Q&A for this !
sub melange_fy {
    my $tab = shift;
    my $i;
    for($i = @$tab; --$i;) {
        my $j = int rand($i+1);
        next if ($i == $j);
        @$tab[$i,$j] = @$tab[$j,$i];
    }
}

my $perm =0 ;
my (@ind);

while ( my $ligne = <IN>) {
    $ligne =~ s/[\r\n]//g;
    
    my ($id,@elmt) = split(/\t/, $ligne);
    # make the permutation of tab indices only the first time
    unless($perm++) {
        @ind = (0..@elmt-1);
        melange_fy(\@ind);
    }

    
    my $w = int(scalar(@elmt)/$n+0.5);
    
    for(my $i = 0; $i < $n; $i++) {
        open (OUT, '>>', "$ARGV[0].$i") or die "Cannot open '$ARGV[0].$i': $!";
        my $d = $i*$w;
        my($f);
        if($i == $n-1) {
            $f = scalar(@elmt)-1;
        }
        else {
            $f = $d+$w-1;
        }
        
        print OUT join("\t",$id,@elmt[@ind[$d..$f]])."\n";
        
        
        close(OUT);
        
    }
}

An example input file is:

Genes    setA    setB    setC    setD    setE    setF    setG
g1    1    2    3    4    5    6    7
g2    1    2    3    4    5    6    7
g3    1    2    3    4    5    6    7

The script is launched like this:
./vSplit <fileToVSplit> <integer>

<integer> is the number of parts you want to split the file into.
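
To make the split arithmetic easier to follow, here is a minimal sketch of how the script above assigns column ranges to output files. The helper name `chunk_ranges` is hypothetical (it does not appear in the original script); it only reproduces the same rounding rule, with the last file taking whatever remains.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns the
# [first, last] index range of shuffled columns written to each of the
# $n output files, using the same rounding as the script.
sub chunk_ranges {
    my ($ncols, $n) = @_;
    my $w = int($ncols / $n + 0.5);    # nominal number of columns per file
    my @ranges;
    for my $i (0 .. $n - 1) {
        my $d = $i * $w;                                    # first index
        my $f = $i == $n - 1 ? $ncols - 1 : $d + $w - 1;    # last file takes the rest
        push @ranges, [$d, $f];
    }
    return @ranges;
}

# 7 data columns (setA..setG) split into 3 files
print join(' ', map { sprintf "%d-%d", @$_ } chunk_ranges(7, 3)), "\n";
# prints "0-1 2-3 4-6"
```

Because the width is rounded, some combinations give ranges that run past the last column (for example 9 columns into 6 files yields 8-9 for the fifth file, although the last valid index is 8), so it is probably worth validating n against the column count before splitting.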

Hope this helps.

Marsel


Replies are listed 'Best First'.
Re: A vertical (+/- random) split of a file by Hue-Bond (Priest) on Jul 21, 2006 at 20:26 UTC
I have here a little thing called `swap_row_col.pl`. It's very old (my filesystem says I haven't touched it in more than 4 years) so I just refactored it now. Once swapped in this way, the data can be List::Util::shuffle'd and swap_row_col'ed again. This is tested:

use warnings;
use strict;
use List::Util qw/shuffle/;

sub swap_row_col {
    my @swapped;
    foreach (@_) {
        chomp; ## just in case
        my @elems = split /;/;
        for (my $i = 0; $i < @elems; $i++) {
            exists $swapped[$i] and $swapped[$i] .= ';';
            $swapped[$i] .= $elems[$i];
        }
    }
    return @swapped;
}

my @vshuffled = swap_row_col shuffle swap_row_col <DATA>;

__DATA__
one;two;three;four
foo;bar;baz;qux
yellow;red;blue;green

The output from Data::Dumper is something like this:

$VAR1 = [
          'one;three;four;two',
          'foo;baz;qux;bar',
          'yellow;blue;green;red'
        ];

-- David Serrano
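
As a rough illustration of how the transpose, shuffle, transpose idea could be applied to the tab-delimited files from the original post, here is a sketch that keeps the identifier column in place. The function names `transpose` and `shuffle_data_columns` and the sample rows are assumptions made for this example, not code from the thread.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Turn a list of rows (array refs) into a list of columns (array refs).
sub transpose {
    my @rows = @_;
    my @cols;
    for my $r (0 .. $#rows) {
        for my $c (0 .. $#{ $rows[$r] }) {
            $cols[$c][$r] = $rows[$r][$c];
        }
    }
    return @cols;
}

# Shuffle the data columns as whole units, leaving the first
# (identifier) column where it is.
sub shuffle_data_columns {
    my @rows = @_;
    my ($ids, @data) = transpose(@rows);
    return transpose($ids, shuffle(@data));
}

# Hypothetical sample input, in the shape of the example file above.
my @rows = map { [ split /\t/ ] }
    ("Genes\tsetA\tsetB\tsetC", "g1\t1\t2\t3", "g2\t1\t2\t3");
print join("\t", @$_), "\n" for shuffle_data_columns(@rows);
```

This holds the whole table in memory, so it trades the line-by-line streaming of the original script for simpler code; that may or may not be acceptable for large files.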
Re^2: A vertical (+/- random) split of a file by Marsel (Sexton) on Jul 21, 2006 at 20:32 UTC
That's great. Of course, that way you don't have to open and close the output files for each line (which will save time, won't it?). Thanks! marsel
Re: A vertical (+/- random) split of a file by jwkrahn (Abbot) on Jul 22, 2006 at 01:09 UTC
"comments and improvements are welcomed."

I added some error checking and moved the slice calculation so it is only performed once instead of for every line of the input file:

#!/usr/bin/perl

# Split a tab-delimited text files into n files, repeating
# first column and shuffling columns.
# Julien Textoris

use strict;
use warnings;

#Thanks to Q&A for this !
sub melange_fy {
    my $tab = shift;
    for ( my $i = @$tab; --$i; ) {
        my $j = int rand( $i + 1 );
        next if $i == $j;
        @$tab[ $i, $j ] = @$tab[ $j, $i ];
    }
}

@ARGV == 2 or die "usage: $0 file N\n\n\tN is the number of new files to create.\n\n";

my ( $file, $n ) = @ARGV;

open IN, '<', $file or die "open '$file' $!";

my @fhs = map {
    open my $out, '>', "$file.$_" or die "open '$file.$_' $!";
    $out;
} 0 .. $n - 1;

my @ind;

while ( <IN> ) {
    tr/\r\n//d;
    my ( $id, @elmt ) = split /\t/;

    #make the permutation of tab indices only the first time
    if ( $. == 1 ) {
        die "N is too large, N must be less than " . ( @elmt + 1 ) . ".\n" if $n > @elmt;
        my @temp = 0 .. $#elmt;
        melange_fy( \@temp );
        @ind = map [ $_ == $n ? @temp : splice @temp, 0, int( @elmt / $n + 0.5 ) ], 1 .. $n;
    }

    for my $i ( 0 .. $#fhs ) {
        print { $fhs[ $i ] } join( "\t", $id, @elmt[ @{ $ind[ $i ] } ] ), "\n";
    }
}

__END__

HTH
Re^2: A vertical (+/- random) split of a file by Marsel (Sexton) on Jul 22, 2006 at 05:33 UTC
Thanks! That's great. But just to be sure, my block `unless($perm++) { #Do the permutation }` was executed only once, wasn't it? On the first line, $perm is 0, so $perm++ evaluates to false (and $perm is incremented), so the permutation is done; after that, $perm++ is true, so the block isn't executed. Is this wrong? I'm quite sure it worked, because each column was shuffled as a whole. But your script is really cool, and I didn't know I could write the output like that! Marsel
Re^3: A vertical (+/- random) split of a file by jwkrahn (Abbot) on Jul 22, 2006 at 06:56 UTC
Yes, your use of `$perm` worked correctly; it's just that I am more used to using `$.`.
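
For readers comparing the two "first line only" idioms discussed here, the following small sketch counts how often each branch fires. The function `count_first_line_hits` and the in-memory input are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Count how many times the "first line only" branch runs for each idiom.
sub count_first_line_hits {
    my ($text) = @_;
    # Read from an in-memory string so the example needs no real file.
    open my $in, '<', \$text or die "cannot open in-memory file: $!";
    my ($perm, $by_flag, $by_lineno) = (0, 0, 0);
    while (my $line = <$in>) {
        $by_flag++   unless $perm++;    # post-increment returns the old value, 0 only once
        $by_lineno++ if $. == 1;        # $. is the line number of the last-read handle
    }
    close $in;
    return ($by_flag, $by_lineno);
}

my ($flag, $lineno) = count_first_line_hits("a\nb\nc\n");
print "flag idiom: $flag, line-number idiom: $lineno\n";
# prints "flag idiom: 1, line-number idiom: 1"
```

Both fire exactly once here. One difference to keep in mind is that `$.` follows whichever filehandle was read most recently, so the counter variable may be the safer choice if another file is read inside the loop.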
Re: A vertical (+/- random) split of a file by ysth (Canon) on Jul 21, 2006 at 20:59 UTC
You can just open each file once (untested):

my @outfiles;
open $outfiles[$_], "> $ARGV[0].$_" or die "nope: $!" for 0..$n-1;
...
print {$outfiles[$i]} join("\t",$id,@elmt[@ind[$d..$f]])."\n";