Hello, here is a small function (split_file) which splits a file into a given number of parts. The code is quite simple, but it eases dispatching a list of references to be handled by several jobs running in parallel.
Of course, I am not a Perl expert, and any advice on doing it in a fancier or more efficient way is welcome.
The function is called with two arguments: the name of the file and the number of files to create. The countLines subroutine is called to count the number of lines in the file.
I am also quite new to PerlMonks, so if you find that this does not belong here, just tell me.
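    # Count the lines of a file by counting newlines in 64 KiB chunks.
    # A final line without a trailing newline is not counted.
    sub countLines {
        my $filename = shift(@_);
        die("amaUtils:countLines:invalid filename")
            if !defined($filename) || length($filename) == 0;
        open(my $in, '<', $filename)
            or die("amaUtils:countLines unable to open file $filename: $!");
        my ($nb, $buf) = (0, '');
        $nb += ($buf =~ tr/\n//) while sysread($in, $buf, 2 ** 16);
        close($in);
        return $nb;
    }

    # Split $filename into parts named base_01.ext, base_02.ext, ...,
    # aiming for $nbFiles parts. Assumes a file name of the form base.ext.
    sub split_file {
        my $filename = shift(@_);
        my $nbFiles  = shift(@_);
        die("amaUtils:split_files invalid number")
            if !defined($nbFiles) || $nbFiles !~ /^\d+$/ || $nbFiles < 1;
        my $curNb      = 1;
        my $totalCount = countLines($filename);
        # Lines per part, rounded up when the division is not exact.
        my $nbLinesPerFile = int($totalCount / $nbFiles);
        $nbLinesPerFile++ if ($nbLinesPerFile * $nbFiles) != $totalCount;
        my $currentCount = 0;
        open(my $orig, '<', $filename)
            or die("amaUtils:split_files unable to open file $filename: $!");
        my ($newfile, $ext) = split(/\./, $filename);
        my $destName = sprintf("%s_%02d.%s", $newfile, $curNb, $ext);
        open(my $dest, '>', $destName)
            or die("amaUtils:split_files unable to create $destName: $!");
        while (<$orig>) {
            if ($currentCount == $nbLinesPerFile) {
                # Current part is full: start the next one.
                close($dest);
                $curNb++;
                $currentCount = 0;
                $destName = sprintf("%s_%02d.%s", $newfile, $curNb, $ext);
                open($dest, '>', $destName)
                    or die("amaUtils:split_files unable to create $destName: $!");
            }
            print $dest $_;
            $currentCount++;
        }
        close($dest);
        close($orig);
    }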
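To make the rounding and the naming in split_file easier to follow, here is a small sketch that only computes what a split would produce, without touching any file. The helpers lines_per_part, part_name and plan_split are hypothetical names introduced for this illustration (they are not part of the posted code), and refs.txt is just an example file name of the form base.ext.

```perl
use strict;
use warnings;

# Hypothetical helper: the same round-up rule that split_file applies.
sub lines_per_part {
    my ($total, $parts) = @_;
    my $n = int($total / $parts);
    $n++ if $n * $parts != $total;
    return $n;
}

# Hypothetical helper: name of part number $i, built the same way as in
# split_file (assumes a file name of the form base.ext).
sub part_name {
    my ($filename, $i) = @_;
    my ($base, $ext) = split(/\./, $filename);
    return sprintf("%s_%02d.%s", $base, $i, $ext);
}

# Hypothetical helper: [name, line count] pairs for a non-empty file
# of $total lines split with the rule above.
sub plan_split {
    my ($filename, $total, $parts) = @_;
    my $per = lines_per_part($total, $parts);
    my @plan;
    my $i = 1;
    while ($total > 0) {
        my $n = $total < $per ? $total : $per;
        push @plan, [part_name($filename, $i), $n];
        $total -= $n;
        $i++;
    }
    return @plan;
}

for my $case ([10, 4], [5, 4]) {
    my ($total, $parts) = @$case;
    print "$total lines, $parts parts requested:\n";
    printf("  %s: %d lines\n", @$_) for plan_split('refs.txt', $total, $parts);
}
```

With 10 lines and 4 parts the plan is 3, 3, 3 and 1 lines. With 5 lines and 4 parts the rounded-up size is 2, so only three files come out (2, 2 and 1 lines): because of the round-up, the number of files written can be smaller than the number requested.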
Re: split file in N part
by jdporter (Paladin) on Mar 10, 2008 at 14:18 UTC
by BrowserUk (Patriarch) on Mar 10, 2008 at 15:27 UTC
by jdporter (Paladin) on Mar 10, 2008 at 18:06 UTC
by BrowserUk (Patriarch) on Mar 10, 2008 at 18:24 UTC
by jeepj (Scribe) on Mar 10, 2008 at 14:29 UTC

Re: split file in N part
by zentara (Cardinal) on Mar 10, 2008 at 14:08 UTC