Extracting formatted text block

punkish has asked for the wisdom of the Perl Monks concerning the following question:

/regexp/ monks,

I have the following

__DATA__
blah blah blah

blah blah blah
blah blah blah
          blah blah blah

INTERESTING CODE--
  CODE     NAME                    CODE     NAME                      
  -------- ----------------------- -------- -----------------------
  ABC      NAME ONE                RST      NAME EIGHT ...   
  DEF      NAME TWO THREE          WXY      NAME NINE - TEN
  GHIJK    NAME FOUR ...           ZAB      NAME ELEVEN
  LMN      NAME FIVE - SIX         CDE      NAME TWELVE
  OPQ      NAME SEVEN 

more blah
blah blah
[download]

I want the following

#########################
print $interesting_code;

# output
"NAME ONE, NAME TWO THREE, NAME FOUR ..., NAME FIVE - SIX, NAME SEVEN,
+ NAME EIGHT ..., NAME NINE - TEN, NAME ELEVEN, NAME TWELVE"
# you get the picture... a csv list of NAMEs
#########################
[download]

I have almost solved the above, but my code is so juvenile that I put it forth here with great embarassment. It is also not producing the correct output, and is certainly not going to hold for variations in the above. While there could be variations in the INTERESTING CODE block, it would still be formatted as above. A two-column set of two columns, the CODE followed by the NAME

my $stuff;
my @stuff;
while (<DATA>) {
  if ($stuff) {
    push(@stuff, $_) 
      if (!($_ =~ /^\s*CODE\s*NAME/) && !($_ =~ /^\s*-+/));
  } else {
    $stuff = 1 if ($_ =~ /^INTERESTING CODE-+/);
  }
}
my @interesting_code;
for (@stuff) {
  my @code = split(/\s{2,}/, $_);
  my $i = 0;
  for (@code) {
    push(@interesting_code, $_) if !($i % 2);
    $i++;
  }
}
$interesting_code = join(', ', @interesting_code);
[download]

I seek edification; and better, and more dependable code. All advice welcomed with gratitude.

Update: Reformatted so one wouldn't get a headache from reading it.

--

when small people start casting long shadows, it is time to go to bed

Comment on Extracting formatted text block Select or Download Code

Replies are listed 'Best First'.
Re: Extracting formatted text block by ikegami (Patriarch) on Mar 16, 2005 at 23:08 UTC
My way (documented): use strict; use warnings; sub rtrim { local $_ = (@_ ? $_[0] : $_); s/\s+$//; return $_; } { local $_; # Skip junk at top. do { defined($_ = <DATA>) or die("Bad data.\n"); } while (!/^\s-+\s+(-+)\s+-+\s+(-+)/); # Find field positions and lengths. # Could be hardcoded. my $match_start1 = $-[1]; my $match_length1 = $+[1] - $match_start1; my $match_start2 = $-[2]; my $match_length2 = $+[2] - $match_start2; # Extract the names. my @names1; my @names2; while (defined($_ = <DATA>) && !/^\s$/) { push(@names1, rtrim(substr($_, $match_start1, $match_length1))) if length()-1 >= $match_start1; push(@names2, rtrim(substr($_, $match_start2, $match_length2))) if length()-1 >= $match_start2; } my @names = (@names1, @names2); # Display results. print(join(', ', @names), $/); } [download] Update: Didn't notice there were two columns of names. Fixed.	[reply] [d/l]
Re: Extracting formatted text block by Zaxo (Archbishop) on Mar 16, 2005 at 23:07 UTC
Those are evidently fixed-width fields, so unpack is a likely choice. `while (<DATA>) { # seek interesting stuff . . . while (<DATA>) { last if /^\n/; my (@codes,@names); ($codes[0],$names[0],$codes[1],$names[1]) = unpack 'xx A8 x A23 xx A8 xxx A22', $_; push @interesting_stuff, grep {$_} @names; } }` [download] The unpack format may not be exactly right for your real data, but it matches what I think I see here (I'd expect the two records in a line to have the same format). After Compline, Zaxo	[reply] [d/l]
Re^2: Extracting formatted text block by ikegami (Patriarch) on Mar 16, 2005 at 23:33 UTC
Two problems: 1) Your solution doesn't take account lines that only have one column: `x outside of string at !.pl line 28, <DATA> line 14.` 2) Your output not ordered correctly (if I fudge the data to get around the first problem): `NAME ONE, AME EIGHT ..., NAME TWO THREE, AME NINE - TEN, NAME FOUR ..., AME ELEVEN, NAME FIVE - SIX, AME TWELVE, NAME SEVEN, XXX XXXXXX`	[reply] [d/l] [select]
Re: Extracting formatted text block by jdporter (Paladin) on Mar 16, 2005 at 23:32 UTC
Here, I turn a line such as `" ---- "` [download] into a pattern (regex) that looks like `/ (....) /` [download] The transformation is accomplished via a set of very simple regexes. Once I have that, I use it to parse the fields out of the lines of data. I also use it to parse the header line. I remember the columns that the 'NAME' header appeared in, and only take those columns of data. Since there can be multiple sets of columns, I have to do a little kludgery to keep the first 'NAME' column of data separate from the second, and so on. (Update:* Added a* `__DATA__` section and changed some variable names.) use strict; use warnings; my $gold_field = 'NAME'; my( @gold_aoa, @col, $pre,; $pat, $pad ); while (<DATA>) { chomp; if ( /^[- ]+$/ ) { # this line is a picture; turn it into a pattern $pad = ' ' x length($_); # overkill, I know. s/ -/ (-/g; s/^-/(-/g; s/- /-) /g; s/-$/-)/g; y/-/./; $pat = $_; # start looking for data. # prev contains header my @hdr = $prev =~ /^$pat/; s/\s+$// for @hdr; @col = grep { $hdr[$_] eq $gold_field } 0 .. $#hdr; next; } if ( $pat ) { $_ .= $pad; if ( my @dat = /^$pat/ ) { @dat = @dat[@col]; for my $i ( 0 .. $#dat ) { $dat[$i] =~ s/\s+$//; length($dat[$i]) and push @{ $gold_aoa[$i] }, $dat[$i]; } } else { # issue report local $, = ','; local $\ = "\n"; print map @$_, @gold_aoa; @gold_aoa = (); # stop looking for data undef $pat; } } $prev = $_; } __DATA__ Here's the original sample data: CODE NAME CODE NAME -------- ----------------------- -------- ----------------------- ABC NAME ONE RST NAME EIGHT ... DEF NAME TWO THREE WXY NAME NINE - TEN GHIJK NAME FOUR ... ZAB NAME ELEVEN LMN NAME FIVE - SIX CDE NAME TWELVE OPQ NAME SEVEN And here's another bunch of data. It all still works! CODE NAME AGE CODE NAME AGE CODE NAME AGE ---- ----- --- ---- --------- --- ---- ----------- --- ABC ONE 1 RST EIGHT 19 RS0 VEGA 39 DEF TWO 2 WXY NINE 23 WX0 SHELIAK 23 DEJ THREE 3 WXZ TEN 29 WY0 SULAFAT 29 GHI FOUR 9 ZAB ELEVEN 31 ZA0 AL ADFAR 31 LMN FIVE 10 CDE TWELVE 37 CD0 AL ATHFAR 37 LMS SIX 13 OPQ SEVEN 15 [download] (Update:* Added the following commentary.)* This solution was designed to be flexible (i.e. robust) in the face of variable numbers of column sets (i.e. your example showed two, but I wanted to allow for any), and variable column widths and gutter widths. Some of the other proposed solutions hard-code these parameters. I think solutions that key off of the 'INTERESTING CODE' line are particularly non-robust.	[reply] [d/l] [select]
Re: Extracting formatted text block by BrowserUk (Patriarch) on Mar 17, 2005 at 03:25 UTC
`#! perl -slw use strict; my @names; my $fmt; my( @col1, @col2 ); while( <DATA> ) { if( m[INTERESTING CODE--] .. m[^\s$] ) { if( m[^\s$] ) { print join', ', grep defined, @col1, @col2; @names = @col1 = @col2 = (); undef $fmt; next; } if( m[^[\s-]+$] ) { $fmt .= "\@$-[-1] A${ \( $+[-1]-$-[-1]) }" while m[\s+([-] ++)]g; next; } next unless $fmt; $_ .= chr(0)x100; ( undef, $col1[@col1], undef, $col2[@col2] ) = unpack $fmt, $_ +; } }` [download] Input and output Read more... (2 kB)	[reply] [d/l] [select]
Re: Extracting formatted text block by TedPride (Priest) on Mar 17, 2005 at 00:28 UTC
This looks simple, but it's not. You can't use split to get the data, since the boundaries may be as small as one space and the data itself may contain spaces. Unpack gives errors if you try to access a position outside the string, so the last line is a major problem too. About the best you can do is the following, and it requires loading the whole set of data into memory at once to be even this neat. EDIT: I updated my code for the wrapping text and for reading line by line instead of all at once. use strict; use warnings; my (@col1, @col2); do {} until index(<DATA>, 'INTERESTING CODE') != -1; <DATA> for (0..1); while (<DATA>) { $_ .= ' ' x (67 - length $_); @_ = unpack 'x2A8x1A23x1A8x1A23', $_; push @col1, $_[1] if $_[0]; push @col2, $_[3] if $_[2]; } print '"' . join(', ', @col1, @col2) . '"'; __DATA__ blah blah blah blah blah blah blah blah blah blah blah blah INTERESTING CODE-- CODE NAME CODE NAME -------- ----------------------- -------- ----------------------- ABC NAME ONE RST NAME EIGHT ... DEF NAME TWO THREE WXY NAME NINE - TEN GHIJK NAME FOUR ... ZAB NAME ELEVEN LMN NAME FIVE - SIX CDE NAME TWELVE OPQ NAME SEVEN more blah blah blah [download]	[reply] [d/l]
Re^2: Extracting formatted text block by punkish (Priest) on Mar 17, 2005 at 01:30 UTC
a'ite TedPride. You have to explain what in Larry's name you are doing here. I am not familiar with the following two idioms `do {} until index(<DATA>, 'INTERESTING CODE') != -1; <DATA> for (0..1);` [download] -- when small people start casting long shadows, it is time to go to bed	[reply] [d/l]
Re: Extracting formatted text block by punkish (Priest) on Mar 17, 2005 at 02:06 UTC
Thanks to the monks you responded. I have the following -- _TedPride(); # the above seems to go into an infinite loop. I have to # CTRL-C to exit. I also don't understand the idioms used _jdporter(); # seems neat, but... # the above exits immediately without printing anything _ikegami(); # the only code that thus far prints correctly. # I have to run it against several variations to # confirm its robustness, Btw, I do not understand # $-[1] and $+[1] bit. At first glance I thought it # was a typo! _Zaxo(); # as mentioned by ikegami, exits with # 'x' outside of string in unpack at ... line ... _punkish(); # works, but skips a value or two # Disclaimer: # I tried to be very careful, but I may have copied the code # given above incorrectly [download] -- when small people start casting long shadows, it is time to go to bed	[reply] [d/l]
Re^2: Extracting formatted text block by jdporter (Paladin) on Mar 17, 2005 at 05:56 UTC
`_jdporter(); # seems neat, but... # the above exits immediately without printing anything` [download] That's because it's reading from `<DATA>`, but there's no __DATA__ section. I assumed you would append an appropriate __DATA__ section, as you showed in the original post.	[reply] [d/l] [select]
Re: Extracting formatted text block by sh1tn (Priest) on Mar 17, 2005 at 02:29 UTC
`while( <DATA> ){ if( /INTERESTING CODE/ .. /^\n/ ){ if( /(?!<NAME.+)NAME\s(\w+(\s\W?\s?\w+))/ ){ push @col_1, $1 } if( /(\w+\W+){2,}NAME\s(\w+(\s\W?\s?\w+))/ ){ push @col_2, $2 } } } print join ', ', (map{ $_ = 'NAME '.$_}@col_1, @col_2) __END__ STDOUT: NAME ONE, NAME TWO THREE, NAME FOUR, NAME FIVE - SIX, NAME SEVEN, NAME EIGHT, NAME NINE - TEN, NAME ELEVEN, NAME TWELVE` [download]	[reply] [d/l]