mndoci has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks,

I have a file which has the following format:

245
record #1
record #2
.
.
record #245
====
165
record #1
record #2
.
.
record #165
====

As you can make out, the first line of each set is the number of records in that set (up to the "===="). There are 20 such sets. Instead of regexing, I used the following:

#initialize some variables
my $records;     # No. of records
my $last;        # The last line
my $ini = 1;     # set initial line
my $count = 1;   # counter

for (my $count = 1; $count <= 20; $count++){
    open (OLD, "< $old_file.txt") || die $!;         # open file
    # Now we need to split up the file into 20 files
    open (NEW, "> $old_file$count.txt") || die $!;   #open outputfile
    while (<OLD>){
        if($. == $ini){
            chomp;
            $records = $_;
        }
    }
    close OLD;
    $last = $ini + $records + 1;
    open (OLD, "<$old_file.txt") || die $!;   # I am more comfortable with re-opening the file like this
    while (<OLD>){
        next unless $. >= $ini;
        print NEW $_;
        last if $. >= $last;
    }
    close OLD;
    close NEW;
    $ini = $last + 1;
}

What I wanted to know is whether, in a case like this, it would be better practice to use a regex (since there are well-defined patterns) or to stick with what is, to me at least, a simpler approach.

mndoci

"What you do in this world is a matter of no consequence. The question is, what can you make people believe that you have done?"-Sherlock Holmes in 'A study in scarlet'

Replies are listed 'Best First'.
(jeffa) Re: To regex or not to regex
by jeffa (Bishop) on Mar 11, 2002 at 05:35 UTC
    Thanks for sharing. Here is what i came up with. Instead of setting a limit at 20 records, this version will keep adding new files until the input file is exhausted.

    I did use one regex to read the line that tells how many records to expect. I also took the liberty of changing the names of the output files with sprintf and a different prefix than the original file (just so i could delete my mistakes easier :D).

    use strict;
    open(IN,'old_file.txt') or die;
    my $i = 1;
    while (<IN>) {
        if (/^(\d+)\s*$/) {
            open(OUT,'>new_file'.sprintf("%02d",$i++).'.txt') or die;
            do { $_ = <IN>; print OUT; } for (1..$1);
        }
    }

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Here is a more newbie-friendly version of jeffa's code. Consider it a learning tool.
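      use strict;
      # open the input file, as in jeffa's version
      open(IN, 'old_file.txt') or die "Cannot open file old_file.txt: $!\n";
      my $file_num = 1;
      while (<IN>) {
          if (/^(\d+)\s*$/) {
              my $count = $1;   # number of records in this set
              my $fname = sprintf("new_file%02d.txt", $file_num);
              open(OUT,">$fname") or die "Cannot open file $fname: $!\n";
              for(1..$count){
                  my $rec = <IN>;
                  print OUT $rec;
              }
              close OUT;
              $file_num++;
          } else {
              # ignore the file separator =======
          }
      }
      close IN;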

      Thank you very much. Much more compact than my blurb. There is one thing missing though, and I am sure I can fix that (I have not tried) without too much fuss. I want to include the first line (number of records), and the last line (====) as well. Otherwise, works like a charm.

      mndoci
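
      A minimal sketch of the change mndoci describes, assuming every set really is followed by a "====" line. The helper name split_sets and the output prefix are hypothetical; the rest follows the newbie-friendly version above, with the count line and the terminator copied to the output as well.

```perl
use strict;
use warnings;

# Hypothetical helper: split the sets in $in_path into numbered files named
# "<prefix>NN.txt", keeping the count line and the "====" terminator.
sub split_sets {
    my ($in_path, $prefix) = @_;
    open(my $in, '<', $in_path) or die "Cannot open $in_path: $!\n";
    my @written;
    my $file_num = 1;
    while (my $line = <$in>) {
        next unless $line =~ /^(\d+)\s*$/;    # skip anything that is not a count line
        my $count = $1;
        my $fname = sprintf("%s%02d.txt", $prefix, $file_num++);
        open(my $out, '>', $fname) or die "Cannot open $fname: $!\n";
        print {$out} $line;                   # the count line itself
        for (1 .. $count) {
            my $rec = <$in>;
            last unless defined $rec;         # input ended early
            print {$out} $rec;
        }
        my $sep = <$in>;                      # the line after the last record
        # Assumption: that line is the "====" terminator; copy it only if it is.
        print {$out} $sep if defined $sep && $sep =~ /^====\s*$/;
        close($out) or die "Cannot close $fname: $!\n";
        push @written, $fname;
    }
    close($in);
    return @written;                          # names of the files created
}
```

      Called as split_sets('old_file.txt', 'new_file'), this would write new_file01.txt, new_file02.txt, and so on. If the line after the last record is not "====", it is read but not written, so malformed input may lose a line.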

Re: To regex or not to regex
by Anonymous Monk on Mar 11, 2002 at 06:00 UTC
    {
        local $/="====\n";
        while( <> ){
            open (NEW, "> $old_file$..txt") || die $!;   #open output
            chomp;
            print NEW;
        }
    }
      This code is slightly broken because it ignores a requirement that was specified by the problem (every section starts with a line containing a number, which is the number of records) and relies on an assumption instead (every section is terminated by ==== followed by "\n"). There are at least three cases where this would break, and they happen more often than you might expect.
      • When there is a record ending with "====", this code breaks the section into more files than it should.
      • When there is a space after the "====", it won't match the input record separator, so two sections will be merged into one file. The space is not visible to the eye, so debugging this problem is not easy.
      • When there is an empty line after the last "====", you will have an additional empty file.
      So, the wisdom behind this story: follow the specs carefully and don't golf when you don't need to.

      Hope this helps,,,

      Aziz,,,
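
      The three failure cases listed above can be reproduced without touching any files. This is only a sketch: the helper name chunks_by_separator and the sample strings are made up for illustration, and the text is read through an in-memory filehandle with the same "====\n" record separator as the code being discussed.

```perl
use strict;
use warnings;

# Hypothetical helper: return the chunks that a "====\n" record separator yields.
sub chunks_by_separator {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "Cannot read in-memory text: $!\n";
    local $/ = "====\n";      # same separator as the code under discussion
    my @chunks = <$fh>;       # list context: one element per "record"
    close($fh);
    return @chunks;
}

my @cases = (
    # one set in, but the record's own "====" splits it: 2 chunks
    [ 'record ending in ====', "1\nrecord ends with ====\n====\n" ],
    # two sets in, but "==== \n" is not the separator: 1 chunk
    [ 'space after ====',      "1\na\n==== \n1\nb\n====\n" ],
    # one set in, but the trailing blank line becomes an extra chunk: 2 chunks
    [ 'empty line after last', "1\na\n====\n\n" ],
);

for my $case (@cases) {
    my ($label, $text) = @$case;
    my @chunks = chunks_by_separator($text);
    printf "%-24s %d chunk(s)\n", $label, scalar @chunks;
}
```

      Reading the count line and then exactly that many records, as in the replies above, sidesteps all three cases because the separator is never used to decide where a set ends.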

        Yes, it does rely on that assumption. I did have lines in my code that removed leading whitespace, but I didn't think about trailing whitespace. As for records between the # and '====', they will not have any '====', so I guess I am safe, but I agree, my solution is not general. Thanks for the warning!!

        mndoci

        Just realized that your post was targeted at Anonymous' post and not mine. DUH!!!!
        That said, I almost used '====' as the separator, but my first test went into a weird loop, so I did not continue along that path.

        mndoci

Re: To regex or not to regex
by jryan (Vicar) on Mar 11, 2002 at 18:32 UTC

    Algorithmically golfed: (golfed without employing space-saving techniques, otherwise it would look scary :P)

    open (OLD, "<$old_file.txt") || die $!;
    my $i = 1;   # start at 1; an undefined $i would make the first output name "$old_file.txt", the input file
    foreach (split (/====\s*\n(?=\d+)/, do{local$/;<OLD>})) {
        open (NEW,">$old_file$i.txt") || die $!;
        print NEW $_;
        close (NEW);
        $i++;
    }
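
    To see what that split pattern keeps and drops, here is a small sketch on an in-memory string. The helper name split_on_separator and the sample text are hypothetical; the pattern is the one used above.

```perl
use strict;
use warnings;

# Hypothetical helper: split slurped text into sets using the pattern above.
sub split_on_separator {
    my ($text) = @_;
    # A "====" line is a split point only when a digit (the next count) follows it.
    return split /====\s*\n(?=\d+)/, $text;
}

my $sample = "2\na\nb\n====\n1\nc\n====\n";
my @sets = split_on_separator($sample);

# The first chunk loses its "====" (split removes the separator);
# the last chunk keeps it, because no count line follows.
for my $n (0 .. $#sets) {
    print "--- set ", $n + 1, " ---\n", $sets[$n];
}
```

    So with this approach the separator is dropped from every output except the last one, which may or may not matter depending on whether the "====" lines are wanted in the output.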