Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks
I have a big text file (200 MB) with the following structure: it contains 100 reviews, each review has 5 formulas, and each formula (f) is iterated 19200 times according to the values I pass in; the same values repeat for review 2 and so on.
Review 1

f: 1 total: 30 values: 50: 20: 0: 0 ...
f: 1 total: 30 values: 50: 25: 4: 4 ...
f: 1 total: 30 values: 55: 45: 8: 6 ...
... (19200 f1 total for review 1)

f: 4 total: 250 values: 50: 20: 0: 0 ...
f: 4 total: 320 values: 50: 25: 4: 4 ...
f: 4 total: 330 values: 55: 45: 8: 6 ...
... (19200 f4 total for review 1)

Review 2

f: 1 total: 20 values: 50: 20: 0: 0 ...
f: 1 total: 30 values: 50: 25: 4: 4 ...
f: 1 total: 45 values: 55: 45: 8: 6 ...
... (19200 f1 total for review 2)

f: 3 total: 250 values: 50: 20: 0: 0 ...
f: 3 total: 320 values: 50: 25: 4: 4 ...
f: 3 total: 330 values: 55: 45: 8: 6 ...
... (19200 f3 total for review 2)

My question is: how can I split the file into 19200 text files in one directory, each file with the same 100 reviews and 5 formulas, but each file referring to one specific set of values used in the formula (f)?
Example: File00001.txt, with the formulas using values 50: 20: 0: 0
Review 1
f:1 total: 30 values: 50: 20: 0: 0 ...
f:2 ... f:4 total: 250 values: 50: 20: 0: 0 ...
Review 2
f:1 total: 20 values: 50: 20: 0: 0 ...
f:2 ... f:3 total: 250 values: 50: 20: 0: 0 ...

File00201.txt:
Review 1
f:1 total: 30 values: 50: 25: 4: 4 ...
f:2 ... f:4 total: 320 values: 50: 25: 4: 4 ...
Review 2
f:1 total: 30 values: 50: 25: 4: 4 ...
f:2 ... f:3 total: 320 values: 50: 25: 4: 4 ...
f:4 ...

File00301.txt:
Review 1
f: 1 total: 30 values: 55: 45: 8: 6 ...
f: 4 total: 330 values: 55: 45: 8: 6 ...
Review 2
f: 1 total: 45 values: 55: 45: 8: 6 ...
f: 3 total: 330 values: 55: 45: 8: 6 ...
Will a regular expression be enough? Should I delete the blank lines, iterate 19200 times (plus the titles), and split?
Thank you in advance for your help

Re: How split big file depending of number of iterations
by dragonchild (Archbishop) on Aug 11, 2003 at 14:28 UTC
    Create a hash with 19200 entries. The key will be the values you want to have stored in that file and the value will be the filehandle. Then, iterate through your file, parsing each line. Find the filehandle associated with that value and print the line to that filehandle. Something like:
    use IO::File;

    my %filename_for_value = associate_filename_with_values();

    my %filehandle;
    while (my ($filename, $value) = each %filename_for_value) {
        $filehandle{$value} = IO::File->new(">$filename")
            || die "Cannot open '$filename' for writing\n";
    }

    my $data_fh = IO::File->new($datafile)
        || die "Cannot open '$datafile' for reading\n";

    while (<$data_fh>) {
        chomp;
        next unless length $_;
        my @line  = split /\s+/, $_;
        my $value = $line[SOME_INDEX];
        $filehandle{$value}->print("$_\n");
    }

    $data_fh->close;
    $_->close for values %filehandle;
    That code will need some work, especially in associating filename with value, but that should give you a headstart.
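    One (untested) sketch of that association step: make a first pass over the data file, numbering each distinct "values" string in order of first appearance. The `values:` regex and the File#####.txt naming scheme are my assumptions about the poster's format, not anything dragonchild specified:

```perl
use strict;
use warnings;

# First pass over the data: map each distinct "values" string to a
# File#####.txt name, numbered in order of first appearance.
sub associate_filename_with_values {
    my ($datafile) = @_;
    my %filename_for;   # "50: 20: 0: 0" => "File00001.txt"
    my $n = 0;
    open my $in, '<', $datafile or die "Cannot open '$datafile': $!";
    while (my $line = <$in>) {
        next unless $line =~ /values:\s*(.+?)\s*$/;   # assumed line layout
        my $key = $1;
        # ||= short-circuits, so $n only advances for new keys
        $filename_for{$key} ||= sprintf "File%05d.txt", ++$n;
    }
    close $in;
    return reverse %filename_for;   # (filename => values) pairs
}
```

    With 19200 distinct value sets this yields a 19200-entry hash for the filehandle loop to consume.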

    Of course, if you want to pay me to write it for you, email me at rkinyon@columbus.rr.com - I charge decent rates for bioinformatics work.

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      I'm not aware of any UNIX that will allow you to have 19200 files open simultaneously; it's possible the situation is different under Windows, but I doubt it. In a quick test, I can open 1020 files before my script dies with a Too many open files error.
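      For what it's worth, you can ask the system for its per-process limit before choosing a strategy. A quick check from Perl, assuming a POSIX shell with ulimit is available (Perl's core has no portable getrlimit; BSD::Resource would be the CPAN route):

```perl
# Ask the shell for the soft limit on open file descriptors.
my $limit = `sh -c 'ulimit -n'`;
chomp $limit;
print "This process may open about $limit files at once\n";
```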

      Workarounds include: storing the filename in the hash, opening the file just before each write and closing it afterwards; doing the same, but caching open filehandles and, upon receiving a Too many open files error, closing the least-recently-used filehandle and jotting down somewhere that it needs to be re-opened; or appending the data to the hash entry itself, then going through all hash entries and printing their contents to the appropriate files.
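      The last workaround (buffering everything in a hash and writing each output file exactly once) might look roughly like this. The "values:" regex and the File#####.txt naming are my guesses at the poster's format, and the 200 MB file must comfortably fit in RAM for this to be viable:

```perl
use strict;
use warnings;

# Accumulate lines in memory, keyed by the "values" field, then write
# each group out in one shot.  Only one output handle is ever open, so
# the per-process file-descriptor limit never comes into play.
my %buffer;   # "50: 20: 0: 0" => [ lines... ]
my @order;    # distinct value sets, in order of first appearance

while (my $line = <>) {
    chomp $line;
    next unless length $line;
    next unless $line =~ /values:\s*(.+)$/;   # assumed line layout
    my $key = $1;
    push @order, $key unless exists $buffer{$key};
    push @{ $buffer{$key} }, $line;
    # the "Review N" headers would need separate handling here
}

my $n = 0;
for my $key (@order) {
    my $name = sprintf "File%05d.txt", ++$n;
    open my $out, '>', $name or die "Cannot open '$name': $!";
    print {$out} "$_\n" for @{ $buffer{$key} };
    close $out or die "Cannot close '$name': $!";
}
```

      The trade-off is memory for descriptors: one pass over the input, one open/write/close per output file.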

      The code above looks like it might accomplish what you asked, but I have a sneaking suspicion that you really want to do something else with that data, once you've created the large number of smaller files.

      If so, you'll probably find the current design to be quite expensive, due to a lot of IO.

      If I'm right, would you consider describing the purpose of this routine? We might be able to come up with a better process -- or not. ;-)

      If I'm wrong, and you really just want to create a bunch of small files, I apologize for my errant hunch.
      Thank you, dragonchild, for your fast reply; as you said, it is a good headstart, and I will work on it. I am a student doing an internship in this lab, so I cannot afford to pay any amount, even a decent one ;).
      I asked for help because I am running out of time (deadline) and at this moment my mind is blocked!
      Thanks again