splitting up data...

chinamox has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I am looking for a way of parsing a very large list of words that is currently passed in via @ARGV. I would like to split a file into x new files, which are then placed in the current directory so that the main program can work on each of the new files in turn. x should be a number given on the command line.


some sample data:

__DATA__

apple
apex
beat
carrot
date
endzone
flyer
grandslam
hoplite
indigo

__END__

I am trying to parse up data for an online assignment. If you could point me towards some useful documentation or examples, I would be most thankful.

Danke,

mox

Re: splitting up data... by kwaping (Priest) on Oct 20, 2006 at 15:01 UTC
Check out File::Split.

--- It's all fine and dandy until someone has to look at the code.
Re: splitting up data... by Melly (Chaplain) on Oct 20, 2006 at 15:19 UTC
If it's an assignment, they might not want you to rely too much on a specific module.

It sounds like you want to do the following:

1. Open a file and count how many lines it has.
2. Divide the number of lines by an integer (to find out how many lines to write to each of your output files).
3. Output n lines to each file, basing the filename on the value of n.

Here's some (untested) code to help you with the first 2 steps:

```perl
# Arguments: FILE NUMBER_OF_OUTPUT_FILES
open(my $in, '<', $ARGV[0]) || die "Cannot open $ARGV[0]: $!\n";
# Require a positive integer, which also rules out division by zero below.
die "No number of output files\n" unless defined $ARGV[1] && $ARGV[1] =~ /^[1-9]\d*$/;
my $lines = 0;
$lines++ while (<$in>);    # count the lines
close $in;
# int() rounds down, so any remainder lines still need to go somewhere.
print "lines per output file = " . int($lines / $ARGV[1]) . "\n";
```

Tom Melly, tom@tomandlu.co.uk
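
The third step in the list above (writing the lines out) could be sketched roughly as follows. This is only one way to do it, not the original poster's code: the sub name `split_by_count` and the output naming scheme (`FILE.1`, `FILE.2`, ...) are assumptions made for illustration. Instead of dropping the remainder, the first few output files each take one extra line.

```perl
use strict;
use warnings;

# Split $file into $parts files named "$file.1", "$file.2", ...
# (hypothetical naming scheme). Returns the names of the files written.
sub split_by_count {
    my ($file, $parts) = @_;
    die "Number of parts must be a positive integer\n"
        unless defined $parts && $parts =~ /^[1-9]\d*$/;

    # Pass 1: count the lines.
    open(my $in, '<', $file) or die "Cannot open $file: $!\n";
    my $lines = 0;
    $lines++ while <$in>;

    # Every file gets $base lines; the first $extra files get one more.
    my $base  = int($lines / $parts);
    my $extra = $lines % $parts;

    # Pass 2: rewind and copy each file's quota of lines.
    seek($in, 0, 0) or die "Cannot rewind $file: $!\n";
    my @written;
    for my $i (1 .. $parts) {
        my $quota = $base + ($i <= $extra ? 1 : 0);
        my $name  = "$file.$i";
        open(my $out, '>', $name) or die "Cannot write $name: $!\n";
        for (1 .. $quota) {
            my $line = <$in>;
            last unless defined $line;
            print {$out} $line;
        }
        close $out or die "Cannot close $name: $!\n";
        push @written, $name;
    }
    close $in;
    return @written;
}

# Only runs when called as: perl script.pl FILE NUMBER_OF_OUTPUT_FILES
if (@ARGV == 2) {
    print "$_\n" for split_by_count(@ARGV);
}
```

With the ten sample words and 4 parts, this arrangement gives files of 3, 3, 2 and 2 lines. If the input has fewer lines than parts, some output files will simply be empty.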
Re^2: splitting up data... by chinamox (Scribe) on Oct 22, 2006 at 01:56 UTC
Thank you for the pointer; it was exactly what I was looking for! -mox
Re: splitting up data... by Fletch (Bishop) on Oct 20, 2006 at 15:11 UTC
See also the manual page for your system's `split` command (presuming some form of POSIX-y-ness).
Re: splitting up data... by Fendaria (Beadle) on Oct 20, 2006 at 17:03 UTC
If you only need a rough estimate of where to split, you can use the input file's size together with tell to work out how far into the input file you are and when to start the next output file. This should save you from needing to read through the file twice.

Another option is to open all of the output files at the start and cycle through them, writing one input line to each output file in turn. This should also save you from needing to read through the file twice.

Finally, you mention @ARGV, which has me a little confused. If the list of words is coming in via @ARGV, I believe most operating systems have a limit on how many arguments you can pass to a process, which is generally pretty low. I'm assuming you're changing from passing in words via @ARGV to using a file.

Fendaria
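
As an illustration of the second option above (open every output file first, then deal the lines out in turn), here is a minimal sketch. The sub name `split_round_robin` and the `FILE.rrN` naming are assumptions made for the example, not anything taken from the thread.

```perl
use strict;
use warnings;

# Deal the lines of $file out to $parts output files in turn,
# reading the input only once. Returns the names of the files written.
sub split_round_robin {
    my ($file, $parts) = @_;
    die "Number of parts must be a positive integer\n"
        unless defined $parts && $parts =~ /^[1-9]\d*$/;

    open(my $in, '<', $file) or die "Cannot open $file: $!\n";

    # Open all output files up front ("$file.rr1", ... is a hypothetical scheme).
    my (@names, @outs);
    for my $i (1 .. $parts) {
        my $name = "$file.rr$i";
        open(my $out, '>', $name) or die "Cannot write $name: $!\n";
        push @names, $name;
        push @outs,  $out;
    }

    # Line 1 goes to file 1, line 2 to file 2, ..., then wrap around.
    my $n = 0;
    while (my $line = <$in>) {
        print { $outs[$n++ % $parts] } $line;
    }

    close $_ for ($in, @outs);
    return @names;
}

# Only runs when called as: perl script.pl FILE NUMBER_OF_OUTPUT_FILES
if (@ARGV == 2) {
    print "$_\n" for split_round_robin(@ARGV);
}
```

Two trade-offs are worth keeping in mind: neighbouring words end up in different files, so each output file is no longer a contiguous chunk of the input, and the number of files that can be open at once is limited by the operating system.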
Re^2: splitting up data... by chinamox (Scribe) on Oct 22, 2006 at 06:07 UTC
Thank you for your response. I am planning on using @ARGV to pass a large list of words from a file. The command line would look something like this:

`username $: perl thisprogram.pl -4 /myfiles/lists/names_data`

Thus @ARGV would be passing the list of names contained in the file named names_data. I am then looking to divide the file equally by a number given on the command line (4 in this case) and print the resulting files into my current directory. I was just looking for a simple way of dividing the files. Sorry for any confusion.

-mox
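
For a command line shaped like the one above, @ARGV holds just two strings, `-4` and the path; the names themselves only become available once the program opens the file. A minimal sketch of picking those two arguments apart, assuming the `-N FILE` form shown (the sub name `parse_args` is hypothetical):

```perl
use strict;
use warnings;

# Parse arguments of the form: -N FILE
# Returns (N, FILE) or dies with a short message.
sub parse_args {
    my @args = @_;
    die "Usage: $0 -N FILE\n" unless @args == 2;
    my ($parts) = $args[0] =~ /^-([1-9]\d*)$/
        or die "First argument must look like -4\n";
    return ($parts, $args[1]);
}

# Only runs when arguments are supplied on the command line.
if (@ARGV) {
    my ($parts, $file) = parse_args(@ARGV);

    # The words come from reading the file, not from @ARGV itself.
    open(my $in, '<', $file) or die "Cannot open $file: $!\n";
    chomp(my @words = <$in>);
    close $in;

    printf "%d words to divide among %d files\n", scalar @words, $parts;
}
```

Keeping the count as a leading `-N` mirrors the example invocation; a plain positional number, as in the earlier reply, would work just as well.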