chinamox has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I am looking for a way of splitting up a very large list of words that is currently passed in via @ARGV. I would like to split the file into x new files, which are then placed in the current directory so that the main program can work on each of the new files in turn. x should be a number given on the command line.

some sample data:

    __DATA__
    apple
    apex
    beat
    carrot
    date
    endzone
    flyer
    grandslam
    hoplite
    indigo
    __END__

I am trying to parse up data for an online assignment. If you could point me towards some useful documentation or examples, I would be most thankful.

Danke,

mox

Replies are listed 'Best First'.
Re: splitting up data...
by kwaping (Priest) on Oct 20, 2006 at 15:01 UTC
    Check out File::Split.

    ---
    It's all fine and dandy until someone has to look at the code.
Re: splitting up data...
by Melly (Chaplain) on Oct 20, 2006 at 15:19 UTC

    If it's an assignment, they might not want you to rely too much on a specific module.

    It *sounds* like you want to do the following:

    1. Open a file and count how many lines it has
    2. Divide the number of lines by an integer (to find out how many lines to write to each of your output files)
    3. Output n lines to each file, basing the filename on the value of n

    Here's some (untested) code to help you with the first 2 steps:

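        use strict;
        use warnings;

        # $ARGV[0] is the input file, $ARGV[1] is the number of output files
        open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!\n";
        die "No number of output files\n"
            unless defined $ARGV[1] && $ARGV[1] =~ /^[1-9]\d*$/;

        # Count the lines in the input file
        my $lines = 0;
        $lines++ while <$in>;
        close $in;

        # int() rounds down, so a few leftover lines may remain
        print "lines per output file = " . int($lines / $ARGV[1]) . "\n";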
    Tom Melly, tom@tomandlu.co.uk
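
For the third step in the list above, a minimal sketch could look like the following. It is not part of the original reply. The sub name `split_into_chunks` and the output names (`<input>.part1`, `<input>.part2`, ...) are assumptions made for illustration, and the argument order (file first, then count) follows the snippet above. It rounds the lines-per-file figure up, so it writes at most the requested number of files and the last one may be shorter.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Split $in_path into at most $parts files of consecutive lines.
# Outputs go to the current directory as "<input>.part1", "<input>.part2", ...
# (the naming scheme is an assumption, not something the thread specifies).
sub split_into_chunks {
    my ($in_path, $parts) = @_;
    die "Number of parts must be a positive integer\n"
        unless defined $parts && $parts =~ /^[1-9]\d*$/;

    # Pass 1: count the lines.
    open(my $in, '<', $in_path) or die "Cannot open $in_path: $!\n";
    my $lines = 0;
    $lines++ while <$in>;

    # Round up so the remainder does not need an extra file.
    my $per_file = int(($lines + $parts - 1) / $parts);

    # Pass 2: rewind and copy $per_file lines to each output in turn.
    seek($in, 0, 0) or die "Cannot rewind $in_path: $!\n";
    my $base = basename($in_path);
    my ($out, @written);
    my $count = 0;
    while (my $line = <$in>) {
        if ($count % $per_file == 0) {
            close $out if $out;
            my $name = "$base.part" . (@written + 1);
            open($out, '>', $name) or die "Cannot write $name: $!\n";
            push @written, $name;
        }
        print {$out} $line;
        $count++;
    }
    close $out if $out;
    close $in;
    return @written;
}

# Usage: perl thisscript.pl INPUT_FILE NUMBER_OF_PARTS
if (@ARGV) {
    print "$_\n" for split_into_chunks(@ARGV);
}
```

With the ten sample words and 4 parts, this writes three lines to each of the first three files and one line to the last.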
      Thank you for the pointer, it was exactly what I was looking for!

      -mox
Re: splitting up data...
by Fletch (Bishop) on Oct 20, 2006 at 15:11 UTC

    See also the manual page for your system's split command (presuming some form of POSIX-y-ness).

Re: splitting up data...
by Fendaria (Beadle) on Oct 20, 2006 at 17:03 UTC
    If the split files only need to be roughly equal in size, you can compare the input file's size with the position reported by tell to work out where you are in the input file and when to start the next output file. This should save you from needing to read through the file twice.
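
A rough single-pass sketch of the size-based idea, under some assumptions: the sub name `split_by_size` and the `<input>.partN` output names are hypothetical, and output files are balanced by bytes rather than by line count, so the number of lines per file can vary.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Split $in_path into at most $parts files of roughly equal byte size,
# reading the input only once. Output names are an illustrative assumption.
sub split_by_size {
    my ($in_path, $parts) = @_;
    die "Number of parts must be a positive integer\n"
        unless defined $parts && $parts =~ /^[1-9]\d*$/;

    open(my $in, '<', $in_path) or die "Cannot open $in_path: $!\n";
    my $size = (stat $in)[7];          # total size of the input in bytes
    my $base = basename($in_path);

    my ($out, @written);
    while (my $line = <$in>) {
        # Open the next output file if none is currently open.
        unless ($out) {
            my $name = "$base.part" . (@written + 1);
            open($out, '>', $name) or die "Cannot write $name: $!\n";
            push @written, $name;
        }
        print {$out} $line;

        # tell() is the byte offset just past the line that was read.
        # Output k is finished once we have passed k/$parts of the input.
        if (tell($in) * $parts >= $size * @written) {
            close($out) or die "Cannot close output: $!\n";
            undef $out;
        }
    }
    close $out if $out;
    close $in;
    return @written;
}

# Usage: perl thisscript.pl INPUT_FILE NUMBER_OF_PARTS
if (@ARGV) {
    print "$_\n" for split_by_size(@ARGV);
}
```

Because a file boundary is only checked after a whole line has been written, no line is ever cut in half. A very long line can cause fewer than the requested number of files to be produced.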

    Another option is to open up all output files at the start and just cycle through writing one input line to each output file in turn. This should also save you from needing to read through the file twice.
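
The round-robin option can be sketched roughly as follows. Again this is only an illustration: the sub name `split_round_robin` and the `<input>.partN` output names are assumptions. Note that the order of words within each output file is preserved, but neighbouring input lines end up in different files.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Deal the lines of $in_path out to $parts files, one line to each in turn.
# Output names are an illustrative assumption.
sub split_round_robin {
    my ($in_path, $parts) = @_;
    die "Number of parts must be a positive integer\n"
        unless defined $parts && $parts =~ /^[1-9]\d*$/;

    open(my $in, '<', $in_path) or die "Cannot open $in_path: $!\n";
    my $base = basename($in_path);

    # Open every output file up front.
    my (@names, @outs);
    for my $n (1 .. $parts) {
        my $name = "$base.part$n";
        open(my $fh, '>', $name) or die "Cannot write $name: $!\n";
        push @names, $name;
        push @outs,  $fh;
    }

    # Single pass: line 0 goes to file 0, line 1 to file 1, and so on,
    # wrapping around after the last file.
    my $i = 0;
    while (my $line = <$in>) {
        print { $outs[$i % $parts] } $line;
        $i++;
    }

    for my $fh (@outs) {
        close($fh) or die "Cannot close output: $!\n";
    }
    close $in;
    return @names;
}

# Usage: perl thisscript.pl INPUT_FILE NUMBER_OF_PARTS
if (@ARGV) {
    print "$_\n" for split_round_robin(@ARGV);
}
```

This keeps one filehandle open per output file, so it suits a small number of parts; the line counts of the outputs never differ by more than one.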

    Finally, you mention @ARGV, which has me a little confused. If the list of words is coming in via @ARGV, I believe most operating systems have a limit on how many arguments you can pass to a process, which is generally pretty low. I'm assuming you're changing from passing in words via @ARGV to using a file.

    Fendaria

      Thank you for your response

      I am planning on using @ARGV to pass a large list of words from a file. The command line would look something like this:

      username $: perl thisprogram.pl -4 /myfiles/lists/names_data

      Thus @ARGV would be passing the list of names contained in the file named names_data. I am then looking to divide the file equally by a number given on the command line (4 in this case) and print the resulting files into my current directory. I was just looking for a simple way of dividing the files.

      Sorry for any confusion.

      -mox
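
For the command line shown above (`perl thisprogram.pl -4 /myfiles/lists/names_data`), @ARGV holds just two strings, `-4` and the file name, not the contents of the file. A small sketch of picking those two values apart, with `parse_args` as a hypothetical helper name:

```perl
use strict;
use warnings;

# Expect exactly two arguments: "-N" (number of parts) and a file name.
# Returns (N, file name) or dies with a usage message.
sub parse_args {
    my @args = @_;
    my $usage = "Usage: $0 -NUMBER_OF_PARTS INPUT_FILE\n";
    die $usage unless @args == 2;

    my ($count_arg, $path) = @args;
    # Accept a dash followed by a positive integer, e.g. "-4".
    $count_arg =~ /^-([1-9]\d*)$/ or die $usage;
    my $parts = $1;

    return ($parts, $path);
}

if (@ARGV) {
    my ($parts, $path) = parse_args(@ARGV);
    print "Splitting $path into $parts files\n";
}
```

The words themselves still have to be read by opening the named file; the two returned values can then be handed to whichever splitting approach is chosen.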