in reply to Extract ranges of lines from a file, saving each range to a separate file

G'day perlato,

Welcome to the Monastery.

It looks like you were doing fine up to the flip-flop conditional (if (/TRANSACTION STARTED/ .. /TRANSACTION END/) {...}) and then got a bit lost.

You can do all the remaining processing within that if block. Here you'll want to do one of three things:

  1. If /TRANSACTION STARTED/ is TRUE, open a new file for writing. (Don't output the line.)
  2. If /TRANSACTION END/ is TRUE, close the filehandle. (Don't output the line.)
  3. Output all lines that don't match either condition in 1 or 2.

The coding required is very straightforward:

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

my $filename_prefix = 'pm_1155986_out_';
my $filename_suffix = '.txt';
my $filename_number = 0;
my $out_fh;

my ($start_re, $end_re) = (qr{TRANSACTION STARTED}, qr{TRANSACTION END});

open my $in_fh, '<', 'pm_1155986_in.txt';

while (<$in_fh>) {
    if (/$start_re/ .. /$end_re/) {
        # Start line: open the next numbered output file; don't output the line.
        if (/$start_re/) {
            open $out_fh, '>', $filename_prefix . $filename_number++ . $filename_suffix;
            next;
        }

        # End line: close the current output file; don't output the line.
        if (/$end_re/) {
            close $out_fh;
            next;
        }

        # Any other line inside the range goes to the current output file.
        print $out_fh $_;
    }
}

[Note I've used the autodie pragma. This avoids having to hand-craft ... or die "..." messages for all the I/O operations: a tedious and error-prone activity (which Perl will do for you if you ask it nicely).]
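To illustrate what the pragma saves you, here's a minimal sketch comparing a hand-written check with autodie. The path in $missing is hypothetical and chosen only so that the open fails; the variable names are mine, not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical path, assumed not to exist, so that open fails.
my $missing = '/nonexistent_dir_pm_example/in.txt';

# Without autodie: the failure check has to be written by hand.
my $manual_error = '';
open(my $fh1, '<', $missing)
    or $manual_error = "Can't open '$missing': $!";

# With autodie (lexically scoped to this block): a failed open
# throws an exception, which is caught here only to display it.
my $autodie_error = '';
{
    use autodie;
    eval { open(my $fh2, '<', $missing); 1 }
        or $autodie_error = "$@";
}

print "manual:  $manual_error\n";
print "autodie: $autodie_error\n";
```

In a real script you would normally let the autodie exception propagate rather than trap it with eval; the eval is only there so both messages can be printed side by side.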

Here's all the input and output data:

$ cat pm_1155986_in.txt
*TRANSACTION STARTED* [020t CARD INSERTED [020tCARD: *************5845 DATE 01-02-16 TIME 05:45:52 05:46:26 GENAC 1 : ARQC EXTERNAL AUTHENTICATE: NO ARPC 05:46:30 GENAC 2 : AAC 01 FEB 2016 05:47:41 10160021 WITHDRAW FROM XXXXXXXX ? INVALID TRANSCATION ON TERMINAL. ----------------------- [020t 05:47:05 CARD TAKEN [020t 05:47:07 TRANSACTION END [0r(1)2[000p[040qe1w3h162[020t*742*02/01/2016*05:47* *TRANSACTION STARTED* [020t CARD INSERTED [020tCARD: *************2584 DATE 01-02-16 TIME 05:47:27 05:48:00 GENAC 1 : ARQC 05:48:05 GENAC 2 : TC [020t 05:48:16 CARD TAKEN [020t 05:48:22 NOTES PRESENTED 0,0,2,0 01 FEB 2016 05:48:52 10160021 WITHDRAW FROM XXXXXXXX ? AMT GHC40.00 [020t 05:48:31 TRANSACTION END [0r(1)2[000p[040qe1w3h162[020t*743*02/01/2016*05:57* *TRANSACTION STARTED* [020t CARD INSERTED [020tCARD: *************3862 DATE 01-02-16 TIME 05:57:28 01 FEB 2016 05:58:33 10160021 INQUIRY FROM XXXXXXXX90018 AVAIL GHC1260.20 LEDGER GHC1260.20 [020t 05:58:06 CARD TAKEN [020t 05:58:11 TRANSACTION END [0r(1)2[000p[040qe1w3h162[020t*744*02/01/2016*06:43* *TRANSACTION STARTED* [020t CARD INSERTED [020tCARD: *************1972 DATE 01-02-16 TIME 06:43:53 01 FEB 2016 06:44:56 10160021 5029110111271972 4490 4490 INQUIRY FROM XXXXXXXX23013 AVAIL GHC14.28 LEDGER GHC14.28 [020t 06:44:25 CARD TAKEN [020t 06:44:29 TRANSACTION END [0r(1)2[000p[040qe1w3h162[020t*745*02/01/2016*06:56*[/CODE]
$ ls -l pm_1155986_out_*
-rw-r--r-- 1 ken staff 322 24 Feb 21:20 pm_1155986_out_0.txt
-rw-r--r-- 1 ken staff 287 24 Feb 21:20 pm_1155986_out_1.txt
-rw-r--r-- 1 ken staff 240 24 Feb 21:20 pm_1155986_out_2.txt
-rw-r--r-- 1 ken staff 261 24 Feb 21:20 pm_1155986_out_3.txt
$ for i in {0..3}; do echo ">>>>> pm_1155986_out_${i}.txt"; cat pm_1155986_out_${i}.txt; echo; done
>>>>> pm_1155986_out_0.txt
[020t CARD INSERTED [020tCARD: *************5845 DATE 01-02-16 TIME 05:45:52 05:46:26 GENAC 1 : ARQC EXTERNAL AUTHENTICATE: NO ARPC 05:46:30 GENAC 2 : AAC 01 FEB 2016 05:47:41 10160021 WITHDRAW FROM XXXXXXXX ? INVALID TRANSCATION ON TERMINAL. ----------------------- [020t 05:47:05 CARD TAKEN

>>>>> pm_1155986_out_1.txt
[020t CARD INSERTED [020tCARD: *************2584 DATE 01-02-16 TIME 05:47:27 05:48:00 GENAC 1 : ARQC 05:48:05 GENAC 2 : TC [020t 05:48:16 CARD TAKEN [020t 05:48:22 NOTES PRESENTED 0,0,2,0 01 FEB 2016 05:48:52 10160021 WITHDRAW FROM XXXXXXXX ? AMT GHC40.00

>>>>> pm_1155986_out_2.txt
[020t CARD INSERTED [020tCARD: *************3862 DATE 01-02-16 TIME 05:57:28 01 FEB 2016 05:58:33 10160021 INQUIRY FROM XXXXXXXX90018 AVAIL GHC1260.20 LEDGER GHC1260.20 [020t 05:58:06 CARD TAKEN

>>>>> pm_1155986_out_3.txt
[020t CARD INSERTED [020tCARD: *************1972 DATE 01-02-16 TIME 06:43:53 01 FEB 2016 06:44:56 10160021 5029110111271972 4490 4490 INQUIRY FROM XXXXXXXX23013 AVAIL GHC14.28 LEDGER GHC14.28 [020t 06:44:25 CARD TAKEN

— Ken

Re^2: Extract ranges of lines from a file, saving each range to a separate file
by RonW (Parson) on Feb 27, 2016 at 00:47 UTC

    Just FYI, FWIW, the .. operator has a couple of features that can replace the duplicated regex matching.

    First, the value of .. isn't just FALSE or TRUE: when true, it's a sequence number counted from the start of the range. Outside the range, the value is the empty string (FALSE). When the start of the range is matched, the value is 1. This number increments on each line until the end of the range. So, you can:

    my $rln = /$start_re/ .. /$end_re/;
    next unless $rln;        # false: outside any range
    if ($rln == 1) {
        # open output file
        next;
    }
    if ($rln > 1) {          # also true on the end line of the range
        print $out_fh $_;
    }
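To see the values for yourself, here's a small sketch that records what .. returns for each line. The sample lines and the START/END markers are hypothetical stand-ins for the real data, and range_values is just an illustrative helper name.

```perl
use strict;
use warnings;

# Hypothetical sample lines; START and END stand in for the real markers.
my @lines = ('noise', 'START', 'a', 'b', 'END', 'noise');

# Collect the scalar value of the flip-flop for every line.
sub range_values {
    my @values;
    for (@lines) {
        my $rln = /START/ .. /END/;
        push @values, "$rln";    # keep the string form, including any 'E0'
    }
    return @values;
}

my @values = range_values();

# Brackets make the empty (false) values visible.
print join(',', map { "[$_]" } @values), "\n";
# prints: [],[1],[2],[3],[4E0],[]
```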

    Second, when the range ends, the number has 'E0' appended. So, you can:

    if (rindex($rln, 'E0') != -1) {    # rindex returns -1 when 'E0' is absent
        close $out_fh;
        next;
    }

    rindex is a simple string search that works backwards, so has much less overhead than another regex match. And appending 'E0' to a string of digits is still a valid number - numerically equal to the number without the 'E0'.
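Putting both features together, a minimal sketch of the whole loop might look like this. To keep it self-contained, it assumes a hypothetical in-memory @input and collects each transaction into an array instead of writing files; @transactions and the sample lines are my own illustration, not the original data.

```perl
use strict;
use warnings;

# Hypothetical in-memory input; a real script would read lines from a file.
my @input = (
    'junk before',
    '*TRANSACTION STARTED*',
    'CARD INSERTED',
    'CARD TAKEN',
    'TRANSACTION END',
    'junk between',
    '*TRANSACTION STARTED*',
    'INQUIRY',
    'TRANSACTION END',
);

my ($start_re, $end_re) = (qr{TRANSACTION STARTED}, qr{TRANSACTION END});

my @transactions;    # one array ref of lines per transaction

for (@input) {
    my $rln = /$start_re/ .. /$end_re/;
    next unless $rln;                    # outside any range

    if ($rln == 1) {                     # start line: begin a new group
        push @transactions, [];
        next;
    }
    next if rindex($rln, 'E0') != -1;    # end line: nothing to store

    push @{ $transactions[-1] }, $_;     # any other line inside the range
}

printf "%d: %s\n", $_, join(' | ', @{ $transactions[$_] })
    for 0 .. $#transactions;
```

Each regex is evaluated only once per line here, by the .. operator itself; the start and end lines are recognised from the returned value alone.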