To regex or not to regex

mndoci has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks,

I have a file which has the following format

245
record #1
record #2
.
.
record #245
====
165
record #1
record #2
.
.
record #165
====

As you can make out, the first line of each set is the number of records in that set (up to the "===="). There are 20 such sets. Instead of using a regex, I used the following:

#initialize some variables
my $records; # No. of records
my $last;  # The last line
my $ini = 1; # set initial line
my $count = 1; # counter


for (my $count = 1; $count <= 20; $count++){
open (OLD, "< $old_file.txt") || die $!; # open file
# Now we need to split up the file into 20 files

open (NEW, "> $old_file$count.txt") || die $!; #open outputfile

while (<OLD>){
    if($. == $ini){
        chomp;
        $records = $_;
    }
}
close OLD;

$last = $ini + $records + 1;

open (OLD, "<$old_file.txt") || die $!; # I am more comfortable with re-opening the file like this
while (<OLD>){
    next unless $. >= $ini;
    print NEW $_;
    last if $. >= $last;
}
close OLD;
close NEW;
$ini = $last + 1;
}

What I wanted to know is whether, in a case like this, it would be better practice to use a regex (since there are well-defined patterns) or to use what is, to me at least, a simpler approach.

mndoci

"What you do in this world is a matter of no consequence. The question is, what can you make people believe that you have done?"-Sherlock Holmes in 'A study in scarlet'

(jeffa) Re: To regex or not to regex by jeffa (Bishop) on Mar 11, 2002 at 05:35 UTC
Thanks for sharing. Here is what i came up with. Instead of setting a limit at 20 records, this version will keep adding new files until the input file is exhausted. I did use one regex to read the line that tells how many records to expect. I also took the liberty of changing the names of the output files with sprintf and a different prefix than the original file (just so i could delete my mistakes easier :D).

use strict;
open(IN,'old_file.txt') or die;
my $i = 1;
while (<IN>) {
    if (/^(\d+)\s*$/) {
        open(OUT,'>new_file'.sprintf("%02d",$i++).'.txt') or die;
        do { $_ = <IN>; print OUT; } for (1..$1);
    }
}

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
Re: (jeffa) Re: To regex or not to regex by abstracts (Hermit) on Mar 11, 2002 at 06:07 UTC
Here is a more newbie-friendly version of jeffa's code. Consider it a learning tool.

# open the input file, as in jeffa's version
open(IN, 'old_file.txt') or die "Cannot open file old_file.txt: $!\n";
my $file_num = 1;
while (<IN>) {
    if (/^(\d+)\s*$/) {
        # a line holding only a number starts a new set
        my $count = $1;
        my $fname = sprintf("new_file%02d.txt", $file_num);
        open(OUT,">$fname") or die "Cannot open file $fname: $!\n";
        # copy exactly $count records to the output file
        for(1..$count){
            my $rec = <IN>;
            print OUT $rec;
        }
        close OUT;
        $file_num++;
    } else {
        # ignore the file separator ====
    }
}
Re: (jeffa) Re: To regex or not to regex by mndoci (Scribe) on Mar 11, 2002 at 06:11 UTC
Thank you very much. Much more compact than my blurb. There is one thing missing though, and I am sure I can fix that (I have not tried) without too much fuss. I want to include the first line (number of records), and the last line (====) as well. Otherwise, works like a charm.

mndoci
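One way to keep the count line and the "====" line in each output is to read the count, then copy that many records plus one more line. The following is only a sketch under assumptions: the sub name `read_sets` and the sample data are made up for illustration, the separator is assumed to follow the last record immediately, and the sets are printed to STDOUT rather than written to files so that the example runs on its own.

```perl
use strict;
use warnings;

# Read count-prefixed sets from a filehandle. Each returned string holds
# the count line, the records, and the trailing "====" line.
# (read_sets is a hypothetical helper, not part of the code in this thread.)
sub read_sets {
    my ($in) = @_;
    my @sets;
    while (defined(my $line = <$in>)) {
        next unless $line =~ /^(\d+)\s*$/;   # a count line starts a set
        my $count = $1;
        my $set   = $line;                   # keep the count line itself
        # Copy $count records plus one more line, assumed to be the separator.
        for (1 .. $count + 1) {
            my $rec = <$in>;
            last unless defined $rec;        # input ended early
            $set .= $rec;
        }
        push @sets, $set;
    }
    return @sets;
}

# Demo on an in-memory file (made-up sample data).
my $data = "2\nrecord #1\nrecord #2\n====\n1\nrecord #1\n====\n";
open(my $in, '<', \$data) or die "Cannot open in-memory file: $!";
my @sets = read_sets($in);
close $in;

my $n = 1;
for my $set (@sets) {
    my $fname = sprintf("new_file%02d.txt", $n++);
    print "--- $fname ---\n", $set;
}
```

To write real files, the final loop would open `$fname` for writing and print `$set` to that handle instead of to STDOUT.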
Re: To regex or not to regex by Anonymous Monk on Mar 11, 2002 at 06:00 UTC
{
    local $/="====\n";
    while( <> ){
        open (NEW, "> $old_file$..txt") || die $!; #open output
        chomp;
        print NEW;
    }
}
Re: Re: To regex or not to regex by abstracts (Hermit) on Mar 11, 2002 at 06:46 UTC
This code is slightly broken because it ignores a requirement that was specified by the problem (every section starts with a line containing a number which is the number of records) and relies on an assumption instead (every section is terminated by ==== followed by "\n"). There are at least three cases where this would break, and they do happen more often than not.

1. When there is a record ending with "====", this code breaks the section into more files than it should.
2. When there is a space after the "====", then it won't match the input record separator, thus 2 sections will be merged in one file. This space would not be visible to the eye so debugging this problem is not easy.
3. When there is an empty line after the last "====", you will have an additional empty file.

So, the wisdom behind this story: follow the specs carefully and don't golf when you don't need to.

Hope this helps,,,
Aziz,,,
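The three cases can be reproduced on in-memory strings. This is a minimal sketch, assuming the same "====\n" input record separator as the code being criticised; the sub name `chunks_by_separator` and the sample strings are invented for the demonstration.

```perl
use strict;
use warnings;

# Split text the way the $/ = "====\n" approach does and return the chunks.
# (chunks_by_separator is a hypothetical helper used only for this demo.)
sub chunks_by_separator {
    my ($text) = @_;
    open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";
    local $/ = "====\n";
    my @chunks = <$in>;
    close $in;
    return @chunks;
}

# Made-up inputs: two well-formed sets, then one input per failure case.
my %inputs = (
    'clean'            => "1\nrecord #1\n====\n1\nrecord #1\n====\n",
    'record with ====' => "2\nrecord ====\nrecord #2\n====\n",
    'trailing space'   => "1\nrecord #1\n==== \n1\nrecord #1\n====\n",
    'blank last line'  => "1\nrecord #1\n====\n1\nrecord #1\n====\n\n",
);

for my $name (sort keys %inputs) {
    my @chunks = chunks_by_separator($inputs{$name});
    printf "%-18s %d chunk(s)\n", $name, scalar @chunks;
}
```

Comparing the chunk counts against the number of sets in each input shows the over-splitting, the merging, and the extra trailing chunk described in the list.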
Re: Re: Re: To regex or not to regex by mndoci (Scribe) on Mar 11, 2002 at 07:11 UTC
Yes it does rely on that assumption. I did have lines in my code that removed leading whitespace, but I didn't think about trailing whitespace. As for records between the # and '====', they will not have any '====', so I guess I am safe, but I agree, my solution is not general. Thanks for the warning!!

mndoci
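Trimming both ends of a line before comparing it with the separator sidesteps the trailing-whitespace problem. A small sketch, where `trim` and `is_separator` are hypothetical helper names and the sample lines are made up:

```perl
use strict;
use warnings;

# Remove leading and trailing whitespace (including the newline).
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# True if the line is the "====" separator, ignoring surrounding whitespace.
sub is_separator {
    my ($line) = @_;
    return trim($line) eq '====';
}

# Made-up sample lines: three separator variants and one ordinary record.
for my $line ("====\n", "==== \n", "  ====\t\n", "record ====\n") {
    printf "%-14s %s\n", "'" . trim($line) . "'",
        is_separator($line) ? 'separator' : 'record';
}
```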
Re: Re: Re: To regex or not to regex by mndoci (Scribe) on Mar 11, 2002 at 07:14 UTC
Just realized that your post was targeted at Anonymous' post and not mine. DUH!!!! That said, I almost used '====' as the separator, but my first test went into a weird loop, so I did not continue along that path.

mndoci
Re: To regex or not to regex by jryan (Vicar) on Mar 11, 2002 at 18:32 UTC
Algorithmically golfed: (golfed without employing space-saving techniques, otherwise it would look scary :P)

open (OLD, "<$old_file.txt") || die $!;
my $i = 1; # start at 1 so the first output file does not overwrite $old_file.txt
# slurp the whole file, then split on each "====" line that is followed by a count
foreach (split (/====\s*\n(?=\d+)/, do{local$/;<OLD>})) {
    open (NEW,">$old_file$i.txt") || die $!;
    print NEW $_;
    close (NEW);
    $i++;
}