split of files

boby has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: split of files by GrandFather (Saint) on Jun 22, 2007 at 09:18 UTC
Let's suppose for a moment that you wanted to split the data at the lines containing "INPUT SEQUENCE=" and use the number following the = as part of the file name for the output file, then you could:

```perl
use strict;
use warnings;

$/ = 'INPUT SEQUENCE=';

while (<DATA>) {
    chomp;
    next unless length;
    next unless s/^(\d+)\n//;
    print "Start of file seq$1.dat\n";
    print "$_\n";
}

__DATA__
```

(The OP's sample data, about 5 kB, follows `__DATA__`.)

Prints:

```
Start of file seq6618.dat
>P40757|ALN_RANCA Allantoinase, mitochondrial precursor - Rana catesbeiana (Bull frog)
MALKSKPGIMNITPGSKISVIRSKRVIQANTISSCDIIISDGKISSVLAWGKHVTSGAKLLDVGDLVVMA
GIIDPHVHVNEPGRTDWEGYRTATLAAAAGGITAIVDMPLNSLPPTTSVTNFHTKLQAAKRQCYVDVAFW
GGVIPDNQVELIPMLQAGVAGFKCFLINSGVPEFPHVSVTDLHTAMSELQGTNSVLLFHAELEIAKPAPE
IGDSTLYQTFLDSRPDDMEIAAVQLVADLCQQYKVRCHIVHLSSAQSLTIIRKAKEAGAPLTVETTHHYL
SLSSEHIPPGATYFKCCPPVRGHRNKEALWNALLQGHIDMVVSDHSPCTPDLKLLKEGDYMKAWGGISSL
QFGLPLFWTSARTRGFSLTDVSQLLSSNTAKLCGLGIVKEPLKWVMMLIWSSGILTKSFRCKKMIFITRI
SSPHIWDSFFKEKSWLLLFEGLLFISKGSMLPNQLENLFLYTLWSLVKPVHPVHPIIRKNLPHI

Total Number of residues in the sequence =484
---------------------------------------------------------------------------------------------------------
Number of residues in the repeat = 5
AAAAG      96 to 100
AAAGG      97 to 101
---------------------------------------------------------------------------------------------------------
Number of residues in the repeat = 7
AGVAGFK    157 to 163
GGISSLQ    345 to 351
---------------------------------------------------------------------------------------------------------
Minimum number of amino-acids present in the distant repeat is = 5
Maximum number of amino-acids present in the distant repeat is = 7
Total number of distant repeats found = 62
_________________________________________________________________________________________________________

Start of file seq6619.dat
>Q9RKU5|ALN_STRCO Probable allantoinase - Streptomyces coelicolor
MSEAELVLRSTRVITPEGTRAASVAVTGEKITAVLPYDAPVPAGARLEDVGDHVVLPGLVDTHVHVNDPG
RTEWEGFWTATRAAAAGGITTLVDMPLNSIPPTTTVDNLRTKREVAADKAHIDVGFWGGALPDNVKDLRP
LHEAGVFGFKAFLSPSGVDEFPHLDQEQLARSLAEIAAFDGLLIVHAEDPHHLAAAPQQGGPKYTHFLAS
RPRDAEDTAIATLLAQAKRFNARVHVLHLSSSDALPLIAEARADGVRVTVETCPHYLTLTAEEVPDGASE
FKCCPPIREAANQDLLWQALADGTIDCVVTDHSPSTADLKTDDFATAWGGIAGLQLSLPAMWTAARGRGL
GLEDVVRWMSERTAALVGLDARKGAIAPGHDADFAVLAPDETFTVDPAALQHRNRVTAYAGKTLYGVVKS
TWLRGERIVADGAFTDPKGQLLDRA

Total Number of residues in the sequence =445
---------------------------------------------------------------------------------------------------------
Number of residues in the repeat = 5
AAAAG      83 to 87
AAAGG      84 to 88
PSTAD      314 to 318
---------------------------------------------------------------------------------------------------------
Number of residues in the repeat = 6
AAAGGI     84 to 89
SSSDAL     240 to 245
---------------------------------------------------------------------------------------------------------
Minimum number of amino-acids present in the distant repeat is = 5
Maximum number of amino-acids present in the distant repeat is = 8
Total number of distant repeats found = 94
_________________________________________________________________________________________________________
```

which may or may not be anything at all like what you had in mind, but then you haven't actually told us that so guessing is all we can do.

DWIM is Perl's answer to Gödel
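The snippet above only prints a "Start of file" marker. A minimal sketch of the same record-splitting idea that actually writes one file per record might look like the following. The sub name `split_by_sequence`, the made-up in-memory sample and the temporary output directory are assumptions for illustration only, not something from the original post.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: read records separated by "INPUT SEQUENCE=" from
# $in_fh and write each one to seq<N>.dat under $out_dir.
# Returns the sequence numbers that were written, in input order.
sub split_by_sequence {
    my ($in_fh, $out_dir) = @_;
    local $/ = 'INPUT SEQUENCE=';              # one record per read
    my @written;
    while (my $record = <$in_fh>) {
        chomp $record;                         # strip the trailing separator
        next unless length $record;            # skip the empty chunk before the first separator
        next unless $record =~ s/^(\d+)\n//;   # peel off the sequence number
        my $number = $1;
        my $path = File::Spec->catfile($out_dir, "seq$number.dat");
        open my $out_fh, '>', $path or die "Cannot write '$path': $!\n";
        print {$out_fh} $record;
        close $out_fh or die "Cannot close '$path': $!\n";
        push @written, $number;
    }
    return @written;
}

# Demonstration on a small in-memory sample (made-up data).
my $sample = "INPUT SEQUENCE=6618\n>first\nAAAA\n"
           . "INPUT SEQUENCE=6619\n>second\nCCCC\n";
open my $in_fh, '<', \$sample or die "Cannot open in-memory sample: $!\n";
my $out_dir = tempdir(CLEANUP => 1);
my @numbers = split_by_sequence($in_fh, $out_dir);
print "Wrote seq$_.dat\n" for @numbers;
```

For a real run you would open the input file instead of the in-memory string and pass a directory of your choosing.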
Re: split of files by shmem (Chancellor) on Jun 22, 2007 at 09:40 UTC
So you want to set the file boundary to just before "INPUT SEQUENCE"? There Is More Than One Way To Do It - one bizarre way is

```perl
#!/usr/bin/perl
{
    local $/ = "INPUT SEQUENCE=";
    <>;    # discard first chunk
    while (<>) {
        chomp;
        # the match returns the captured number, which becomes the file name
        open O, '>', /^(\d+)/ and $1 or die "$!\n";
        print O $/, $_;
    }
}
```

update - oh. You want bunches of 1000 records in separate files?

```perl
#!/usr/bin/perl
{
    my $file = 'File00000';
    local $/ = "INPUT SEQUENCE=";
    <>;    # discard first chunk
    while (<>) {
        chomp;
        # parentheses needed: % binds tighter than -
        unless (($. - 2) % 1000) {
            open O, '>', ++$file or die "can't write '$file': $!\n";
        }
        print O $/, $_;
    }
}
```

update - a bit of explanation: setting the input record separator `$/` (see perlvar) to the token right after the file boundary lets the diamond operator `<>` (or readline) read up to and including that token as a single line into `$_`. With chomp we remove `$/` from the end; it is added up front when outputting.

The `++$file` is a string increment; doing that we get the next file name (File00001, File00002, ...).

Since the first "line" (consisting of the record separator only) isn't interesting, we do a `<>` before the loop. The next line is then number 2 (`$.` - see perlvar), and `($. - 2) % 1000` (modulo 1000) is 0 at line 2, 1002, 2002, ... The parentheses are required because `%` binds tighter than `-`; without them the expression is just `$. - 2`, which is 0 only at line 2. So we (re-)open the output filehandle at those line counts, which does an implicit close. See open.

--shmem

_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
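To see how `$/`, `$.` and the modulo test interact, here is a small sketch that reads made-up records from an in-memory string and reports the values of `$.` at which a new output file would be started. The sub name `file_start_lines` and the sample data are assumptions for illustration. One detail worth checking when adapting the 1000-record loop: `%` binds more tightly than `-`, so the subtraction needs its own parentheses.

```perl
use strict;
use warnings;

# Hypothetical helper: return the values of $. at which a new output file
# would be opened when records are grouped $per_file at a time.
sub file_start_lines {
    my ($data, $per_file) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory data: $!\n";
    local $/ = "INPUT SEQUENCE=";
    my @starts;
    <$fh>;    # discard the chunk before the first separator; $. is now 1
    while (my $record = <$fh>) {
        # the first real record has $. == 2, hence the offset
        push @starts, $. unless ($. - 2) % $per_file;
    }
    close $fh;
    return @starts;
}

# Seven made-up records, three per output file.
my $data = join '', map { "INPUT SEQUENCE=$_\nbody $_\n" } 1 .. 7;
my @starts = file_start_lines($data, 3);
print "New file at \$. = @starts\n";

# Precedence check: without parentheses the modulo applies to the 2 only.
my $without_parens = 1002 - 2 % 1000;      # parsed as 1002 - (2 % 1000)
my $with_parens    = (1002 - 2) % 1000;
print "without parentheses: $without_parens, with parentheses: $with_parens\n";
```

With three records per file, new files start at the second, fifth and eighth reads, which is the same pattern as 2, 1002, 2002 for groups of 1000.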
Re: split of files by citromatik (Curate) on Jun 22, 2007 at 09:24 UTC
If I understand correctly, you have a file with a number of records, each one beginning with a line:

`INPUT SEQUENCE=XXX`

and ending with:

`_________________________________________________________________________________________________________`

And you want to split the whole file into smaller files containing N of these records.

The easiest way I can imagine doing this is using Tie::File. Try this script:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Tie::File;

my ($file, $recs_X_file) = @ARGV;
die "Usage: $0 <input_file> <recs x file>" if (@ARGV != 2);

# Each array element is one logical record, terminated by the
# 105-underscore delimiter line; autochomp => 0 keeps the delimiter.
tie my @arr, 'Tie::File', $file, recsep => '_' x 105, autochomp => 0;

my $from = 0;
my $to   = $recs_X_file - 1;
while ($from <= $#arr) {    # <= so a single trailing record is not dropped
    my $ofile = "file.$from-$to";
    open F, ">", $ofile or die $!;
    print "printing records $from to $to in $ofile\n";
    # clamp the slice so the last file does not print undefined elements
    my $last = $to > $#arr ? $#arr : $to;
    print F @arr[$from .. $last];
    $from = $to + 1;
    $to   = $from + $recs_X_file - 1;
}
```

This script treats the file as an array in which each element corresponds to one logical record in the file. It then walks the array (i.e. the records in the file) N records at a time and writes each group to its own sub-file.

For example, if you call the script "split_records.pl" you can invoke it with:

`perl split_records.pl inputfile 10`

Outputs:

```
printing records 0 to 9 in file.0-9
printing records 10 to 19 in file.10-19
printing records 20 to 29 in file.20-29
... and so on (depending on the number of records in the original file)
```

Of course, the files containing the records are created too.

Hope this helps!

citromatik
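The index arithmetic in the loop is easy to get wrong at the end of the array, so here is a minimal sketch of just that part, separated from the file handling. The sub name `chunk_ranges` is hypothetical; it computes the `[from, to]` pairs for a given last index and chunk size and clamps the final pair to the last record.

```perl
use strict;
use warnings;

# Hypothetical helper: split indices 0 .. $last_index into consecutive
# [from, to] ranges of at most $per_file entries, clamping the final range.
sub chunk_ranges {
    my ($last_index, $per_file) = @_;
    die "per_file must be a positive integer\n"
        unless defined $per_file && $per_file > 0;
    my @ranges;
    for (my $from = 0; $from <= $last_index; $from += $per_file) {
        my $to = $from + $per_file - 1;
        $to = $last_index if $to > $last_index;    # never run past the last record
        push @ranges, [$from, $to];
    }
    return @ranges;
}

# 25 records (indices 0 .. 24), 10 per file.
my @ranges = chunk_ranges(24, 10);
print join(', ', map { "$_->[0]-$_->[1]" } @ranges), "\n";
```

With a tied array, `chunk_ranges($#arr, $recs_X_file)` would give the slices to print, one output file per pair.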
Re: split of files by andreas1234567 (Vicar) on Jun 22, 2007 at 07:30 UTC
Boby, it is not quite clear to me what you want to do. Try to explain the algorithm that decides what should go into each file. There is helpful advice in How (Not) To Ask A Question.

--
print map{chr}unpack(q{A3}x24,q{074117115116032097110111116104101114032080101114108032104097099107101114})
Re: split of files by Moron (Curate) on Jun 22, 2007 at 09:53 UTC
If you mean a file split like Unix split does, but on pattern-matched boundaries instead of byte counts, I rather imagine something like the code below.

(update: where split supports -p to split on a regexp (that is a BSD split option; on Linux, GNU csplit does pattern-based splitting), the Perl script only has to shell out to that and clean up afterwards as follows: glob for the per-sequence files, cat each 1000 files at a time together into some second naming convention, and remove each 1000 per iteration. On second thoughts I prefer what follows after all!)

```perl
use strict;
use warnings;

my $suffix   = 'z';     # string increment turns this into 'aa', 'ab', ...
my $sequence = 0;
my $maxseq   = 1000;    # records per output file
my $input    = shift @ARGV or die "usage: $0 <input_file>\n";
open my $ifh, '<', $input or die "$!: $input\n";
my $ofh;
while ( <$ifh> ) {
    /\AINPUT\sSEQUENCE/
        and SwitchFile( $input, \$ofh, \$suffix, \$sequence, $maxseq );
    $ofh or die "Unexpected prelude: $_";
    print $ofh $_;
}
close $ofh if $ofh;

# Open the next output file once the current one holds $max records.
sub SwitchFile {
    my ( $input, $oref, $sref, $qref, $max ) = @_;
    if ( defined( $$oref ) ) {
        ( ++$$qref < $max ) and return;
        $$qref = 0;
        close $$oref;
    }
    my $newfile = "$input." . ++$$sref;
    open my $new_fh, '>', $newfile or die "$!: $newfile";
    $$oref = $new_fh;
}
```

This would create the 270 files with suffixes .aa through .kj.

Free your mind!
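The suffix scheme relies on Perl's string increment, which carries from `z` to `aa` much like a decimal counter carries from 9 to 10. A small sketch, assuming a hypothetical helper named `suffixes`, lists the names that a starting value of `'z'` would produce:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first $count suffixes obtained by
# repeatedly applying Perl's string increment to $start.
sub suffixes {
    my ($start, $count) = @_;
    my $suffix = $start;
    my @out;
    for (1 .. $count) {
        $suffix++;             # magic string increment: 'z' -> 'aa', 'az' -> 'ba'
        push @out, $suffix;    # push stores a copy of the current value
    }
    return @out;
}

my @all = suffixes('z', 270);
print "first: $all[0], 26th: $all[25], 27th: $all[26], 270th: $all[-1]\n";
```

Each leading letter covers 26 files, so `aa` through `jz` accounts for 260 names and the 270th lands on `kj`.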