mndoci has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks,

I have a file which has the following format

245
record #1
record #2
.
.
record #245
====
165
record #1
record #2
.
.
record #165
====

As you can make out, the first line is the number of records in that set (up to the "===="). There are 20 such sets. Instead of regexing, I used the following:

#initialize some variables
my $records;     # No. of records
my $last;        # The last line
my $ini = 1;     # set initial line
my $count = 1;   # counter

for (my $count = 1; $count <= 20; $count++){
    open (OLD, "< $old_file.txt") || die $!;          # open file
    # Now we need to split up the file into 20 files
    open (NEW, "> $old_file$count.txt") || die $!;    #open outputfile
    while (<OLD>){
        if($. == $ini){
            chomp;
            $records = $_;
        }
    }
    close OLD;
    $last = $ini + $records + 1;
    open (OLD, "<$old_file.txt") || die $!;
    # I am more comfortable with re-opening the file like this
    while (<OLD>){
        next unless $. >= $ini;
        print NEW $_;
        last if $. >= $last;
    }
    close OLD;
    close NEW;
    $ini = $last + 1;
}

What I wanted to know was whether, in a case like this, it would be better practice to use a regex (since there are well-defined patterns) or to use what, to me at least, is a simpler approach.

mndoci

"What you do in this world is a matter of no consequence. The question is, what can you make people believe that you have done?"-Sherlock Holmes in 'A study in scarlet'

Replies are listed 'Best First'.
(jeffa) Re: To regex or not to regex
by jeffa (Bishop) on Mar 11, 2002 at 05:35 UTC
    Thanks for sharing. Here is what i came up with. Instead of setting a limit at 20 records, this version will keep adding new files until the input file is exhausted.

    I did use one regex to read the line that tells how many records to expect. I also took the liberty of changing the names of the output files with sprintf and a different prefix than the original file (just so i could delete my mistakes easier :D).

    use strict;

    open(IN,'old_file.txt') or die;
    my $i = 1;

    while (<IN>) {
        if (/^(\d+)\s*$/) {
            open(OUT,'>new_file'.sprintf("%02d",$i++).'.txt') or die;
            do { $_ = <IN>; print OUT; } for (1..$1);
        }
    }

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Here is a more newbie-friendly version of jeffa's code. Consider it a learning tool.
      my $file_num = 1;
      while (<IN>) {
          if (/^(\d+)\s*$/) {
              my $count = $1;
              my $fname = sprintf("new_file%02d.txt", $file_num);
              open(OUT,">$fname") or die "Cannot open file $fname: $!\n";
              for(1..$count){
                  my $rec = <IN>;
                  print OUT $rec;
              }
              close OUT;
              $file_num++;
          }
          else {
              # ignore the file separator =======
          }
      }

      Thank you very much. Much more compact than my blurb. There is one thing missing though, and I am sure I can fix that (I have not tried) without too much fuss. I want to include the first line (number of records), and the last line (====) as well. Otherwise, works like a charm.

      mndoci
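
One way the missing count line and "====" line could be added to the output is sketched below. This is only an illustration, not code from the thread: the helper name split_sets and its $prefix argument are hypothetical, and it assumes each set is terminated by a line holding only "====" (trailing whitespace tolerated). A line after the records that does not look like "====" is read and dropped.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split the sets read from $in
# into numbered files, keeping the count line and the "====" line.
# $prefix is an assumed output-name prefix.
sub split_sets {
    my ($in, $prefix) = @_;
    my $file_num = 1;
    my @written;

    while (my $line = <$in>) {
        next unless $line =~ /^(\d+)\s*$/;    # skip anything that is not a count line
        my $count = $1;

        my $fname = sprintf("%s%02d.txt", $prefix, $file_num++);
        open(my $out, '>', $fname) or die "Cannot open file $fname: $!\n";

        print {$out} $line;                   # the count line itself
        for (1 .. $count) {
            my $rec = <$in>;
            last unless defined $rec;         # input ended early
            print {$out} $rec;
        }

        # The line after the last record should be the "====" terminator.
        # Anything else found here is read and silently dropped.
        my $sep = <$in>;
        print {$out} $sep if defined $sep && $sep =~ /^====\s*$/;

        close($out) or die "Cannot close file $fname: $!\n";
        push @written, $fname;
    }
    return @written;    # names of the files that were created
}

# Possible usage (file names are assumptions):
#   open(my $in, '<', 'old_file.txt') or die "Cannot open old_file.txt: $!\n";
#   split_sets($in, 'new_file');
```

Taking a filehandle rather than a file name keeps the sketch easy to try on an in-memory string before pointing it at real data.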


Re: To regex or not to regex
by Anonymous Monk on Mar 11, 2002 at 06:00 UTC
    {
        local $/="====\n";
        while( <> ){
            open (NEW, "> $old_file$..txt") || die $!;    #open output
            chomp;
            print NEW;
        }
    }
      This code is slightly broken because it ignores a requirement that was specified by the problem (every section starts with a line containing a number which is the number of records) and relies on an assumption instead (every section is terminated by ==== followed by "\n"). There are at least three cases where this would break, and they do happen more often than not.
      • When there is a record ending with "====", this code breaks the section into more files than it should.
      • When there is a space after the "====", then it won't match the input record separator, thus 2 sections will be merged in one file. This space would not be visible to the eye so debugging this problem is not easy.
      • When there is an empty line after the last "====", you will have an additional empty file.
      So, the wisdom behind this story: follow the specs carefully and don't golf when you don't need to.

      Hope this helps,,,

      Aziz,,,
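
The three failure cases listed above can be reproduced with a small experiment. The sketch below is an illustration only: count_chunks is a hypothetical helper, and the sample strings are made up. It counts how many chunks Perl returns when "====\n" is the input record separator, using in-memory filehandles so that no files are needed.

```perl
use strict;
use warnings;

# Hypothetical helper: count the chunks read from $text when "====\n"
# is used as the input record separator.
sub count_chunks {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!\n";
    local $/ = "====\n";
    my $n = 0;
    while (defined(my $chunk = <$fh>)) {
        $n++;
    }
    close($fh);
    return $n;
}

# Made-up sample data: two well-formed sets of one record each.
my $clean    = "1\nrecord #1\n====\n1\nrecord #1\n====\n";
# One set whose only record happens to end with "====".
my $embedded = "1\nrecord ====\n====\n";
# Same as $clean, but with a space after the first "====".
my $trailing = "1\nrecord #1\n==== \n1\nrecord #1\n====\n";
# Same as $clean, but with an empty line after the last "====".
my $blank    = $clean . "\n";

printf "clean (2 sets):           %d chunks\n", count_chunks($clean);
printf "record ending in ====:    %d chunks\n", count_chunks($embedded);
printf "space after ====:         %d chunks\n", count_chunks($trailing);
printf "empty line after last:    %d chunks\n", count_chunks($blank);
```

With this sample data the counts come out as 2, 2, 1 and 3: the single set with an embedded "====" is cut in two, the two sets with the stray space are merged into one, and the trailing empty line yields an extra chunk.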

        Just realized that your post was targeted at Anonymous' post and not mine. DUH!!!!
        That said, I almost used '====' as the separator, but my first test went into a weird loop, so I did not continue along that path.

        mndoci


        Yes, it does rely on that assumption. I did have lines in my code that removed leading whitespace, but I didn't think about trailing whitespace. As for records between the # and '====', they will not have any '====', so I guess I am safe, but I agree, my solution is not general. Thanks for the warning!!

        mndoci


Re: To regex or not to regex
by jryan (Vicar) on Mar 11, 2002 at 18:32 UTC

    Algorithmically golfed: (golfed without employing space-saving techniques, otherwise it would look scary :P)

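    open (OLD, "<$old_file.txt") || die $!;
    my $i = 1;    # start at 1: with $i undefined, the first output would be "$old_file.txt", the input file itself
    foreach (split (/====\s*\n(?=\d+)/, do{local$/;<OLD>}))
    {
        open (NEW,">$old_file$i.txt") || die $!;
        print NEW $_;
        close (NEW);
        $i++;
    }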
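
Note that a split on the "====" line removes that line from every chunk except the last. If the separators should stay with their sets, a global match is one alternative. The sketch below is an assumption-laden illustration, not jryan's method: extract_sets is a hypothetical name, and it assumes every set starts with a line of digits and ends with a newline-terminated line holding only "====" (trailing spaces or tabs tolerated).

```perl
use strict;
use warnings;

# Hypothetical helper: return each set as one string, including its count
# line and its "====" line. Call it in list context.
sub extract_sets {
    my ($text) = @_;
    # /m lets ^ match at the start of each line, /s lets . cross newlines,
    # /g returns every match; .*? stops at the first "====" line.
    return $text =~ /^(\d+[ \t]*\n.*?^====[ \t]*\n)/msg;
}

# Possible usage on a slurped file (file name is an assumption):
#   open(my $in, '<', 'old_file.txt') or die "Cannot open old_file.txt: $!\n";
#   my @sets = extract_sets(do { local $/; <$in> });
```

Each element of the returned list can then be printed to its own numbered file, as in the loop above.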