tschelineli has asked for the wisdom of the Perl Monks concerning the following question:

Hi my dear perl-friends, I'm trying to extract data from a big datafile to create many small ones, each containing only the data of one person. My datafile is 3.4 GB, so I have to proceed wisely. The format is like this:
- 4 variables, which should find their way into every new file
- 9 columns per person
My goal is to end up with x files (where x = number of people), each containing 13 columns. I just can't quite get my head around how to code it so that Perl opens the "motherfile" and prints columns 0-4 and 5-13 into one file; for the next file it would be columns 0-4 again and 14-22, and so on. If anyone of you wise guys out there has a solution, I would be veeeeeery grateful to hear about it. Thanks in advance

Replies are listed 'Best First'.
Re: Spliting Table
by Laurent_R (Canon) on Aug 20, 2015 at 09:18 UTC
    You should at least supply the data separators (field separator and record separator, if any), or perhaps better provide a small sample of your file.

    And also explain how you intend to extract columns 14-22 from a file with 9 columns per person. But it is probably me not understanding your input format. Please explain.

    Also please note that with a 3.4 GB input file, you might end up with millions of output files, I am not sure that all file systems can cope with that and, even if yours does, it might not be very practical.

      So I don't know whether it really helps, but here's part of my "motherfile". There are more columns for more VPs and more lines for more TargetIDs. So the motherfile contains all the data, and per VP I want to extract one file with the first four columns and the corresponding VP*.*** columns. Huh, I'm sorry that I'm not really capable of explaining it well.
      open(RAUS, '>', $outfile);
      while (<REIN>) {
          chomp();
          @we = split(/\t/);
          $fix  = "$we[0]\t$we[1]\t$we[2]\t$we[3]";
          $out1 = "$we[4]\t$we[5]\t$we[6]\t$we[7]\t$we[8]\t$we[9]\t$we[10]\t$we[11]\t$we[12]";
          print RAUS $fix . "\t" . $out1 . "\n";
      }
      close(RAUS);
      close(REIN);
       This is part of my very nasty code for printing the right stuff for one VP, but it is highly useless for the entire file.. =)
      Here's my data:
      
      Index	TargetID	ProbeID_A	ProbeID_B	VP1.AVG_Beta	VP1.Intensity	VP1.Avg_NBEADS_A	VP1.Avg_NBEADS_B	VP1.BEAD_STDERR_A	VP1.BEAD_STDERR_B	VP1.Signal_A	VP1.Signal_B	VP1.Detection Pval	VP2.AVG_Beta	VP2.Intensity	VP2.Avg_NBEADS_A	VP2.Avg_NBEADS_B	VP2.BEAD_STDERR_A	VP2.BEAD_STDERR_B	VP2.Signal_A	VP2.Signal_B	VP2.Detection Pval
      1	cg00000029	14782418	14782418	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
      2	cg00000108	12709357	12709357	0,9218118	3609	12	12	44,74464	155,0186	190	3419	0,00	0,9602971	7305	11	11	35,27683	130,8559	194	7111	0,00	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
      3	cg00000109	59755374	59755374	0,650519	767	4	4	51,5	151,5	203	564	0,00	0,8245264	1906	10	10	24,03331	136,6104	252	1654	0,00	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
      4	cg00000165	12637463	12637463	0,3073516	1029	20	20	59,47941	28,39806	682	347	0,00	0,2899073	1842	17	17	80,52183	51,41755	1279	563	0,00	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
      5	cg00000236	12649348	12649348	0,8236473	1397	14	14	18,17377	105,0337	164	1233	0,00	0,8691943	3065	13	13	42,71191	160,031	314	2751	0,00	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
      6	cg00000289	18766346	18766346	0,4625375	901	14	14	48,37429	46,23619	438	463	0,00	0,590708	1256	11	11	78,69446	89,85038	455	801	0,00	0,7469755	2793	15	15	33,82405	72,03749	632	2161	0,00	0,6678689	2950	18	18	96,40222	126,8078	913	2037	0,00
        Please, put your data inside <code>...</code> tags as well. It makes it easier and less error prone to download it.

        Are the data fields separated by a "tab" character and the records separated by an EOL code?

        It seems like you have a data-file with records of 22 fields each according to the header, but with 40 fields according to the data-records. How is that possible?

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        My blog: Imperial Deltronics
        This is part of my very nasty code for printing the right stuff for one VP, but it is highly useless for the entire file
        No, you almost have it (at least if I understood correctly what you want to do). Your code is reading the file line by line, which is what you need. For each line, you just need to pickup the relevant fields and print them out.

        You only need to open a new output file for each input record.

        You did not say how you want to name your files, so I will pick up the first field as it seems to be a line number.

        while (<REIN>) {
            chomp();
            my @we = split(/\t/);
            my $out_file_name = "out_file_nr_$we[0]";
            open my $RAUS, ">", $out_file_name
                or die "could not open $out_file_name $!";
            my $fix  = "$we[0]\t$we[1]\t$we[2]\t$we[3]";
            my $out1 = "$we[4]\t$we[5]\t$we[6]\t$we[7]\t$we[8]\t$we[9]\t$we[10]\t$we[11]\t$we[12]";
            print $RAUS $fix . "\t" . $out1 . "\n";
            close $RAUS;
        }
        close(REIN);
        I do not understand, though, why you are splitting the data into fields and then join them back into the original format before printing it. So this could be simply:
        while (<$REIN>) {
            my $out_file_name = "out_file_nr_[$.]";    # $. is the line number on the last read file handle
            open my $RAUS, ">", $out_file_name
                or die "could not open $out_file_name $!";
            print $RAUS $_;
            close $RAUS;
        }
        close($REIN);
        Or did I miss something in your requirement?
Re: Spliting Table (a solution)
by Anonymous Monk on Aug 20, 2015 at 07:22 UTC

    If anyone of you wise guys out there has a solution, I would be veeeeeery grateful to hear about it.

    Here is a solution

    shortInfile.txt
    person 1 2 3 4 5 6 7 8 9
    abe 1 2 3 4 5 6 7 8 9
    lincoln 1 2 3 4 5 6 7 8 9

    Becomes

    person.txt
    1 2 3 4 something something

    abe.txt
    1 2 3 4 something something

    lincoln.txt
    1 2 3 4 something something

    The solution is to describe the desired output based on a very short, representative input. After you do that, you can wrap your head around it and write code to do it.

      Thank you for your help. Your described solution is actually not my real problem, as I do know what it should look like; I want the script to run automatically through the whole input file, as there are over 100 people and I do not want to name every file myself. My main problem is the part where it automatically counts which columns to take for the next file/person:
      1st file: column 1-4 and 5-13;
      2nd file: column 1-4 and 14-22;
      3rd file: column 1-4 and 23-31;
      4th file: ...
      Giving every new file a name will be the next problem.. =)
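      The counting described above can be reduced to arithmetic on the split indices. A minimal sketch (note that after split, Perl's array is zero-based, so "columns 1-4" are indices 0-3 and person number $p, counting from 0, owns the 9 indices starting at 4 + 9 * $p):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# After split /\t/, the fixed fields are always indices 0-3, and
# person $p (counting from 0) owns the 9 indices starting at 4 + 9*$p.
for my $p (0 .. 2) {
    my $start = 4 + 9 * $p;    # first data index for this person
    my $end   = $start + 8;    # last data index for this person
    printf "file %d: indices 0-3 plus %d-%d\n", $p + 1, $start, $end;
}
```

      The same two expressions give the array slice for any number of people, so nothing person-specific needs to be hard-coded.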

        my main problem is the part where it automatically counts which column to take for the next file/person:

        Think about it, how would you as a person do the counting?

        Now write down the description and post it here.

        Over and out

        #!/usr/bin/perl --
        ## tinking.pl
        ## 2015-08-20-02:28:04
        ##
        ## perltidy -olq -csc -csci=10 -cscl="sub : BEGIN END if while for " -otr -opr -ce -nibc -i=4 -pt=0 "-nsak=*"
        ## perltidy -olq -csc -csci=3 -cscl="sub : BEGIN END " -otr -opr -ce -nibc -i=4 -pt=0 "-nsak=*"
        use strict;
        use warnings;
        use autodie qw/ open close /;
        use Data::Dump qw/ dd /;

        my $infile = \q{
        abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz
        bcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyza
        cdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzab
        };

        Roger( $infile );
        exit( 0 );

        sub Roger {
            my( $infile ) = @_;
            open my( $infh ), '<', $infile;    ## "or die..." by autodie
            my $count = 5;
            while( <$infh> ) {
                next if !/\S/;
                my @what = split //, $_;
                #~ dd( \@what );
                my $right = $count + 8;
                @what = grep defined, @what[ 0 .. 4, $count .. $right ];
                #~ dd( \@what );
                Dodger( \@what );
                $count = $right;
            } ## end while( <$infh> )
        } ## end sub Roger

        sub Dodger {
            my( $arref ) = @_;
            print join ',', @$arref, "\n";
        } ## end sub Dodger
Re: Spliting Table
by jf1 (Beadle) on Aug 26, 2015 at 19:23 UTC
    It seems that you have genome-wide methylation data measured using Illumina bead chips, and your file has been generated as a GenomeStudio final report, correct? In that case I'd recommend considering the following points:
    1. You may be better off not starting from the report file but from the idat files instead (the original files that went into the GenomeStudio project) and using the appropriate R packages (methylumi, minfi, etc.) from the Bioconductor repository to process the data.
    2. Of course this requires sufficient memory, so the machine should have at least 8 GB of RAM available (better 16).
    3. There are some packages that claim to be able to load GenomeStudio reports (methylumi), but I'm not sure about that.
    4. If you still want to split the file using Perl, the script should work for any number of columns per individual. So avoid hard-coding this number in the script: you may discover that you are missing some information and have to regenerate the report with different settings and a different number of columns per subject.
    5. Unfortunately, many available programs for data processing (in particular some R functions in several packages) do not obey i18n rules and simply use "." as the decimal separator, or they apply the encoding inconsistently. To ensure data integrity it seems best to change your language settings to English and to change all "," in the data file to ".". So one of the first things to do for each line after reading it in would be "tr/,/./;", if "," is your regular decimal separator - as in your provided data file.
    6. Provided enough file handles are available, it is possible to generate all output files in one pass (just as poj's solution assumes); e.g. on newer Windows platforms this number is practically unlimited (nominally about 16 million per process). If this is not the case, you'd have to do it in several passes.
    Does poj's suggestion suit your needs?
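    The one-pass approach from point 6 might be sketched like this. This is only a sketch under assumptions not confirmed in the thread: tab-separated input, a header line whose person columns start with "VPn." as in the sample above, and 9 data columns per person. For the demo the per-person handles write to in-memory buffers via scalar filehandles, and the rows are an inlined stand-in for the motherfile; a real script would read the file and open "$name.txt" instead:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One open handle per person, single pass over the data.
# Demo rows stand in for the motherfile; the VP1/VP2 column names
# and "$name.txt" naming are illustrative assumptions.
my $cols_per_person = 9;

my @rows = (
    [ 'Index', 'TargetID', 'ProbeID_A', 'ProbeID_B',
      ( map { "VP1.c$_" } 1 .. 9 ), ( map { "VP2.c$_" } 1 .. 9 ) ],
    [ 1, 'cg00000029', 14782418, 14782418, 101 .. 109, 201 .. 209 ],
    [ 2, 'cg00000108', 12709357, 12709357, 111 .. 119, 211 .. 219 ],
);

my @header = @{ $rows[0] };
my $people = int( (@header - 4) / $cols_per_person );

my ( %buffer, @out, @slice );
for my $p ( 0 .. $people - 1 ) {
    my $start = 4 + $cols_per_person * $p;
    $slice[$p] = [ 0 .. 3, $start .. $start + $cols_per_person - 1 ];
    my ($name) = $header[$start] =~ /^([^.]+)\./;    # e.g. "VP1"
    open $out[$p], '>', \$buffer{$name}              # real use: '>', "$name.txt"
        or die "cannot open output for $name: $!";
}

for my $row (@rows) {
    for my $p ( 0 .. $people - 1 ) {
        print { $out[$p] } join( "\t", @{$row}[ @{ $slice[$p] } ] ), "\n";
    }
}
close $_ for @out;
```

    Opening all handles up front means the 3.4 GB file is read only once; whether that is feasible depends on the OS's open-file limit, as point 6 notes.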