Hi,

Today I had to split a file vertically. I didn't find anything for this in the Q&A section, so I wrote something myself, and I am posting it here because I think it could be useful to others. My apologies if it already exists on the site.

I have tab-delimited text files where the first column holds identifiers and the following columns contain data. The aim was to split the file vertically into n files. As a bonus, I wanted to shuffle the data columns (i.e. not preserve the order of the columns in the destination files). For my purpose, I had to repeat the original first column in every destination file, but this can easily be adapted to your needs.

This is a first try, and I'm quite new to Perl, so comments and improvements are welcomed.
#!/usr/bin/perl -w
# Split a tab-delimited text files into n files, repeating
# first column and shuffling columns.
# Julien Textoris
use strict;
use warnings;

open(IN, "$ARGV[0]");
my $n = $ARGV[1];

#Thanks to Q&A for this !
# Fisher-Yates shuffle, in place, of the array referenced by $tab
sub melange_fy {
    my $tab = shift;
    my $i;
    for($i = @$tab; --$i;) {
        my $j = int rand($i+1);
        next if ($i == $j);
        @$tab[$i,$j] = @$tab[$j,$i];
    }
}

my $perm =0 ;
my (@ind);
while ( my $ligne = <IN>) {
    $ligne =~ s/[\r\n]//g;
    my ($id,@elmt) = split(/\t/, $ligne);
    #make the permutation of tab indices only the first time
    unless($perm++) {
        @ind = (0..@elmt-1);
        melange_fy(\@ind);
    }
    my $w = int(scalar(@elmt)/$n+0.5);
    for(my $i = 0; $i < $n; $i++) {
        open (OUT, ">>$ARGV[0].$i");
        my $d = $i*$w;
        my($f);
        if($i == $n-1) { $f = scalar(@elmt)-1; }
        else { $f = $d+$w-1; }
        print OUT join("\t",$id,@elmt[@ind[$d..$f]])."\n";
        close(OUT);
    }
}
An example input file is:
Genes	setA	setB	setC	setD	setE	setF	setG
g1	1	2	3	4	5	6	7
g2	1	2	3	4	5	6	7
g3	1	2	3	4	5	6	7
The script is launched like this:
./vSplit <fileToVSplit> <integer>

<integer> is the number of parts you want to split the file into.
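
To see how many columns each destination file receives, the slice arithmetic from the script ($w, $d and $f) can be run on its own. This is only a minimal sketch: the helper name chunk_bounds is hypothetical and not part of the script, and it leaves out the shuffling and the file I/O.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns the
# [first, last] index pair that each output file gets, using the same
# rounding as the script: width = int(columns / n + 0.5), with the
# last file taking whatever remains.
sub chunk_bounds {
    my ($ncols, $n) = @_;
    my $w = int($ncols / $n + 0.5);
    my @bounds;
    for my $i (0 .. $n - 1) {
        my $d = $i * $w;
        my $f = $i == $n - 1 ? $ncols - 1 : $d + $w - 1;
        push @bounds, [$d, $f];
    }
    return @bounds;
}

# The example file has 7 data columns; split it into 3 files.
print join(' ', map { "$_->[0]..$_->[1]" } chunk_bounds(7, 3)), "\n";
# prints: 0..1 2..3 4..6
```

With 7 data columns and 3 files, the files get 2, 2 and 3 columns. Because of the rounding, some combinations leave the last file short: 6 columns into 4 files gives a width of 2, so the fourth range is empty (6..5) and that file would hold only the identifier column.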

Hope it helps.

Marsel

Re: A vertical (+/- random) split of a file
by Hue-Bond (Priest) on Jul 21, 2006 at 20:26 UTC

    I have here a little thing called swap_row_col.pl. It's very old (my filesystem says I haven't touched it in more than 4 years) so I just refactored it now. Once swapped in this way, the data can be List::Util::shuffle'd and swap_row_col'ed again. This is tested:

    use warnings;
    use strict;
    use List::Util qw/shuffle/;

    sub swap_row_col {
        my @swapped;
        foreach (@_) {
            chomp;    ## just in case
            my @elems = split /;/;
            for (my $i = 0; $i < @elems; $i++) {
                exists $swapped[$i] and $swapped[$i] .= ';';
                $swapped[$i] .= $elems[$i];
            }
        }
        return @swapped;
    }

    my @vshuffled = swap_row_col shuffle swap_row_col <DATA>;

    __DATA__
    one;two;three;four
    foo;bar;baz;qux
    yellow;red;blue;green

    The output from Data::Dumper is something like this:

    $VAR1 = [
              'one;three;four;two',
              'foo;baz;qux;bar',
              'yellow;blue;green;red'
            ];

    --
    David Serrano

      That's great.
      Of course, that way you don't have to open and close the output files for every line (which will save time, won't it?).

      Thanks !

      marsel
Re: A vertical (+/- random) split of a file
by jwkrahn (Abbot) on Jul 22, 2006 at 01:09 UTC
    comments and improvements are welcomed.
    I added some error checking and moved the slice calculation so it is only performed once instead of for every line of the input file:
    #!/usr/bin/perl
    # Split a tab-delimited text files into n files, repeating
    # first column and shuffling columns.
    # Julien Textoris
    use strict;
    use warnings;

    #Thanks to Q&A for this !
    sub melange_fy {
        my $tab = shift;
        for ( my $i = @$tab; --$i; ) {
            my $j = int rand( $i + 1 );
            next if $i == $j;
            @$tab[ $i, $j ] = @$tab[ $j, $i ];
        }
    }

    @ARGV == 2 or die "usage: $0 file N\n\n\tN is the number of new files to create.\n\n";
    my ( $file, $n ) = @ARGV;

    open IN, '<', $file or die "open '$file' $!";

    my @fhs = map {
        open my $out, '>', "$file.$_" or die "open '$file.$_' $!";
        $out;
    } 0 .. $n - 1;

    my @ind;
    while ( <IN> ) {
        tr/\r\n//d;
        my ( $id, @elmt ) = split /\t/;
        #make the permutation of tab indices only the first time
        if ( $. == 1 ) {
            die "N is too large, N must be less than " . ( @elmt + 1 ) . ".\n" if $n > @elmt;
            my @temp = 0 .. $#elmt;
            melange_fy( \@temp );
            @ind = map [ $_ == $n ? @temp : splice @temp, 0, int( @elmt / $n + 0.5 ) ], 1 .. $n;
        }
        for my $i ( 0 .. $#fhs ) {
            print { $fhs[ $i ] } join( "\t", $id, @elmt[ @{ $ind[ $i ] } ] ), "\n";
        }
    }
    __END__
    HTH
      Thanks !
      That's great. But just to be sure, my loop
      unless($perm++) { #Do the permutation }
      was done only once, wasn't it ?

      Because on the first line $perm is 0, so $perm++ evaluates to false (and $perm is incremented) and the permutation is done; after that, $perm++ is true, so the block isn't executed. Is this wrong? I'm fairly sure it worked, because each column was shuffled as a whole.

      But your script is really cool, and I didn't know I could write the output like that!

      Marsel
        Yes, your use of $perm worked correctly; it's just that I am more used to using $. (the current input line number).
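
The post-increment behaviour discussed in this sub-thread can be checked in isolation. The following is a minimal sketch, not taken from either script; the counter $runs is only there for illustration. It relies on $perm++ returning the old value before incrementing, so the guarded block should run on the first pass only.

```perl
use strict;
use warnings;

my $perm = 0;
my $runs = 0;    # illustration only: counts how often the block runs

for ( 1 .. 5 ) {
    # $perm++ yields the value before the increment, so the
    # condition sees 0 (false) on the first pass and 1, 2, ... after.
    unless ( $perm++ ) {
        $runs++;
    }
}

print "block ran $runs time(s), \$perm is now $perm\n";
# prints: block ran 1 time(s), $perm is now 5
```

The same holds if $perm starts out undefined, since a post-increment of an undefined value returns 0.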
Re: A vertical (+/- random) split of a file
by ysth (Canon) on Jul 21, 2006 at 20:59 UTC
    You can just open each file once: (untested)
    my @outfiles;
    open $outfiles[$_], "> $ARGV[0].$_" or die "nope: $!" for 0..$n-1;
    ...
    print {$outfiles[$i]} join("\t",$id,@elmt[@ind[$d..$f]])."\n";
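
As a rough, self-contained illustration of the "open each file once" idea, the sketch below keeps the handles in an array and writes a few rows through them. The temporary directory, the demo.txt base name and the row contents are assumptions made for the demo; it uses the three-argument form of open rather than the two-argument form above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Assumption for the demo: write into a throwaway directory that is
# removed when the program exits.
my $dir  = tempdir( CLEANUP => 1 );
my $base = "$dir/demo.txt";
my $n    = 3;

# Open every output file once and keep the handles in an array.
my @outfiles;
for my $i ( 0 .. $n - 1 ) {
    open $outfiles[$i], '>', "$base.$i" or die "open '$base.$i': $!";
}

# Each row is then written with print { $handle } LIST; no file is
# reopened inside the loop.
for my $row (qw(g1 g2 g3)) {
    print { $outfiles[$_] } "$row\tpart$_\n" for 0 .. $n - 1;
}

close $_ or die "close: $!" for @outfiles;
```

The braces around the handle expression are what allow an array element to be used as the filehandle in print.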