The ^L is the FORM FEED control character. It's used to separate pages ("records") of the report.
You can probably split on the FORM FEED character rather than on the text of the report header. Better yet, don't slurp the entire "large report" into memory, but instead process each report page one at a time by setting $/ ($INPUT_RECORD_SEPARATOR) to the FORM FEED character "\f".
    #!/usr/bin/perl

    use strict;
    use warnings;
    use autodie qw( open close );
    use English qw( -no_match_vars );

    # Report pages are separated by FORM FEED control characters
    local $INPUT_RECORD_SEPARATOR = "\f";

    open my $report, '<', 'QISC001';

    while (my $page = <$report>) {
        # Parse and transform each report page here...
    }

    close $report;

    exit 0;
Jim
UPDATE: You mentioned you're splitting the report into separate "stores." I presume this means you're carving the report into individual files, one per page. This script is untested, but it illustrates some general ideas you might find useful.
    #!/usr/bin/perl

    use strict;
    use warnings;
    use autodie qw( open close );
    use English qw( -no_match_vars );

    @ARGV == 1 or die "Usage: perl $PROGRAM_NAME <report file>\n";

    # Report pages are separated by FORM FEED control characters
    local $INPUT_RECORD_SEPARATOR = "\f";

    my $report_file = shift @ARGV;

    open my $report_fh, '<', $report_file;

    while (my $page = <$report_fh>) {
        my ($page_number, $store_number, $post_date) = $page =~ m{
            PAGE:\s+(\d+)
            .+?
            STORE:\s+(\d+)
            .+?
            POST\s+DATE:\s+(\d\d/\d\d/\d\d\d\d)
        }msx;

        # Skip pages whose header doesn't match, rather than
        # building a file name from undefined values
        if (!defined $post_date) {
            warn "Skipping a page with no PAGE/STORE/POST DATE header\n";
            next;
        }

        # For example, 07/14/2011 => 20110714
        $post_date =~ s{(\d\d)/(\d\d)/(\d\d\d\d)}{$3$1$2};

        # For example, 20110714-001-001.rpt
        my $page_file = sprintf "%s-%03d-%03d.rpt",
            $post_date, $store_number, $page_number;

        open my $page_fh, '>', $page_file;
        print {$page_fh} $page;
        close $page_fh;
    }

    close $report_fh;

    exit 0;
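Since the script above is untested, here is a minimal sketch for trying the page splitting and file naming without touching the disk. It reads an in-memory string instead of a file. The two-page sample text and the page_file_names helper are made up for illustration; your real report header layout may differ, so adjust the regex to match it.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical two-page report; the real QISC001 layout may differ.
my $report
    = "PAGE: 1  STORE: 12  POST DATE: 07/14/2011\nfirst page body\n\f"
    . "PAGE: 2  STORE: 12  POST DATE: 07/14/2011\nsecond page body\n";

# Hypothetical helper: returns the file name each page would be written to.
sub page_file_names {
    my ($text) = @_;

    # Read one FORM FEED-delimited page per record
    local $/ = "\f";

    # Open a read-only filehandle on the in-memory string
    open my $fh, '<', \$text or die "Cannot open in-memory report: $!";

    my @names;

    while (my $page = <$fh>) {
        chomp $page;    # removes the trailing "\f", if any

        my ($page_number, $store_number, $mm, $dd, $yyyy) = $page =~ m{
            PAGE:\s+(\d+)
            .+?
            STORE:\s+(\d+)
            .+?
            POST\s+DATE:\s+(\d\d)/(\d\d)/(\d\d\d\d)
        }msx or next;    # skip pages without a recognizable header

        push @names, sprintf '%s%s%s-%03d-%03d.rpt',
            $yyyy, $mm, $dd, $store_number, $page_number;
    }

    close $fh;

    return @names;
}

print "$_\n" for page_file_names($report);
```

Once the names look right for a few real pages pasted into $report, the same regex can go back into the file-writing loop.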