Extract table from a block of text

reaper9187 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, I have a text file which contains a lot of information, most of it unimportant. However, there is a block of text that I need to parse which contains information in a tabular format as shown below.

INFO START

STIME      ETIME     COLUMN3    COLUMN4   COLUMN5
aaaa1      bbb1        ccc1       ddd1      eee1
aaaa2      bbb2        ccc2       ddd2      eee2
aaaa3      bbb3        ccc3       ddd3      eee3
aaaa4      bbb4        ccc4       ddd4      eee4

END

The sample output should be a table (a hash, maybe?) that maps each element to its corresponding row as follows:

aaaa1:bbb1      ccc1
aaaa1:bbb1      ddd1
aaaa1:bbb1      eee1

aaaa2:bbb2      ccc2
aaaa2:bbb2      ddd2
aaaa2:bbb2      eee2

aaaa3:bbb3      ccc3
aaaa3:bbb3      ddd3
aaaa3:bbb3      eee3


Re: Extract table from a block of text by choroba (Cardinal) on Sep 21, 2014 at 07:21 UTC
The flip-flop operator can tell you whether you're between the given lines. No need to hash anything as the output depends on the current line only.

    #!/usr/bin/perl
    use warnings;
    use strict;

    while (<DATA>) {
        if (my $line = /^INFO START$/ .. /^END$/) {
            next if /^$/            # Skip empty lines.
                 or $line =~ /E/    # Skip the END line.
                 or 1 == $line      # Skip the START line.
                 or /^STIME/;       # Skip the header.
            my ($stime, $etime, @cols) = split;
            print "$stime:$etime\t$_\n" for @cols;
            print "\n";
        }
    }

    __DATA__
    ... ignore ...
    INFO START
    STIME      ETIME     COLUMN3    COLUMN4   COLUMN5
    aaaa1      bbb1        ccc1       ddd1      eee1
    aaaa2      bbb2        ccc2       ddd2      eee2
    aaaa3      bbb3        ccc3       ddd3      eee3
    aaaa4      bbb4        ccc4       ddd4      eee4
    END
    ... ignore again ...

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
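If a lookup structure is wanted after all, as the question hinted, the same flip-flop loop can fill a hash of arrays instead of printing. This is only a sketch and not part of the reply above: the `%table` name, the `@lines` array standing in for the file, and the shortened sample data are assumptions made for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical stand-in for the lines of the input file.
my @lines = (
    '... ignore ...',
    'INFO START',
    'STIME      ETIME     COLUMN3    COLUMN4   COLUMN5',
    'aaaa1      bbb1        ccc1       ddd1      eee1',
    'aaaa2      bbb2        ccc2       ddd2      eee2',
    'END',
    '... ignore again ...',
);

my %table;    # "stime:etime" => [ remaining columns ]
for (@lines) {
    if (my $seq = /^INFO START$/ .. /^END$/) {
        next if /^$/              # Skip empty lines.
             or $seq =~ /E0$/     # Skip the last line of the range (END).
             or $seq == 1         # Skip the START line.
             or /^STIME/;         # Skip the header.
        my ($stime, $etime, @cols) = split;
        $table{"$stime:$etime"} = \@cols;
    }
}

# Print in the layout asked for in the question, sorted by key.
for my $key (sort keys %table) {
    print "$key\t$_\n" for @{ $table{$key} };
    print "\n";
}
```

Keying on "stime:etime" assumes each such pair occurs only once in the block; repeated pairs would overwrite each other.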
Re^2: Extract table from a block of text (updated) by LanX (Saint) on Sep 21, 2014 at 11:07 UTC
Hi Choroba,

As a side note: instead of parsing the sequence number $line, you could apply the technique described in Re^4: grep trouble (body of flip-flop range) to skip the boundaries of the flip and the flop. :)

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)

UPDATE

In hindsight it's a bad idea to use something like:

    #!/usr/bin/perl
    use warnings;
    use strict;

    while (<DATA>) {
        if (/^INFO START$/ .. /^END$/ and not //) {
            print "$_";
        }
    }

    __DATA__
    ... ignore ...
    INFO START
    STIME      ETIME     COLUMN3    COLUMN4   COLUMN5
    aaaa1      bbb1        ccc1       ddd1      eee1
    aaaa2      bbb2        ccc2       ddd2      eee2
    aaaa3      bbb3        ccc3       ddd3      eee3
    aaaa4      bbb4        ccc4       ddd4      eee4
    END
    ... ignore again ...

While it does only print the inner range ...

    STIME      ETIME     COLUMN3    COLUMN4   COLUMN5
    aaaa1      bbb1        ccc1       ddd1      eee1
    aaaa2      bbb2        ccc2       ddd2      eee2
    aaaa3      bbb3        ccc3       ddd3      eee3
    aaaa4      bbb4        ccc4       ddd4      eee4

... it's vulnerable: any other regex happening within the if-branch can mess up the empty match `//` (i.e. match again the last successfully matched regular expression). :-/

The usual trap of global dependencies!
Re^2: Extract table from a block of text by Laurent_R (Canon) on Sep 21, 2014 at 10:48 UTC
This works perfectly with the dummy data provided in the original post, but the regex to skip the END line might be a bit dangerous because real data might contain an 'E'. In addition, if the file is large, it might be better to do a `last`, rather than a `next`, when the line with the END tag is met.
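The `last` suggestion could look roughly like the sketch below. It assumes the file holds only one INFO START ... END block, since nothing after END is read. The in-memory `$text`, the `@pairs` array and the shortened sample data are illustrative assumptions; a real script would open the actual file.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical input held in a string so the sketch is self-contained.
my $text = <<'END_TEXT';
... ignore ...
INFO START

STIME      ETIME     COLUMN3    COLUMN4   COLUMN5
aaaa1      bbb1        ccc1       ddd1      eee1
aaaa2      bbb2        ccc2       ddd2      eee2

END
this line is never read
END_TEXT

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @pairs;    # collected "stime:etime<TAB>value" strings
while (<$fh>) {
    if (my $seq = /^INFO START$/ .. /^END$/) {
        last if $seq =~ /E0$/;    # Last line of the range (END): stop reading.
        next if $seq == 1         # Skip the START line.
             or /^$/              # Skip empty lines.
             or /^STIME/;         # Skip the header.
        my ($stime, $etime, @cols) = split;
        push @pairs, "$stime:$etime\t$_" for @cols;
    }
}
close $fh;

print "$_\n" for @pairs;
```

Testing the sequence number for a trailing `E0` rather than a bare `E` also makes it more obvious that the check is on the flip-flop's return value and not on the data line.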
Re^3: Extract table from a block of text by LanX (Saint) on Sep 21, 2014 at 10:57 UTC
> but the regex to skip the END line might be a bit dangerous because real data might contain a `E`

That's a misunderstanding: $line holds a sequence number, which comes in exponential notation (like 7E0) iff the flip-flop terminates. It has nothing to do with the END marker! :)

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
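To see the sequence numbers for yourself, a small sketch like the following prints what the flip-flop returns for each line. The sample lines are made up for illustration; one of them deliberately contains a capital E to show that the data does not influence the returned value.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical sample lines, not taken from the thread's data.
my @lines = ('junk', 'INFO START', 'Every Entry', 'END', 'junk');

my @seen;
for (@lines) {
    # In scalar context the range operator returns a sequence number,
    # or the empty string while it is "off".
    my $seq = /^INFO START$/ .. /^END$/;
    push @seen, $seq;
    printf "%-12s => '%s'\n", $_, $seq;
}

# Prints:
# junk         => ''
# INFO START   => '1'
# Every Entry  => '2'
# END          => '3E0'
# junk         => ''
```

Only the final line of the range gets the `E0` suffix, which is why matching the sequence number against /E/ singles out the END line regardless of what the data rows contain.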
Re^4: Extract table from a block of text by Laurent_R (Canon) on Sep 21, 2014 at 12:23 UTC

UPDATE