punkish has asked for the wisdom of the Perl Monks concerning the following question:

/regexp/ monks,

I have the following

__DATA__ blah blah blah blah blah blah blah blah blah blah blah blah INTERESTING CODE-- CODE NAME CODE NAME -------- ----------------------- -------- ----------------------- ABC NAME ONE RST NAME EIGHT ... DEF NAME TWO THREE WXY NAME NINE - TEN GHIJK NAME FOUR ... ZAB NAME ELEVEN LMN NAME FIVE - SIX CDE NAME TWELVE OPQ NAME SEVEN more blah blah blah

I want the following

######################### print $interesting_code; # output "NAME ONE, NAME TWO THREE, NAME FOUR ..., NAME FIVE - SIX, NAME SEVEN, + NAME EIGHT ..., NAME NINE - TEN, NAME ELEVEN, NAME TWELVE" # you get the picture... a csv list of NAMEs #########################

I have almost solved the above, but my code is so juvenile that I put it forth here with great embarassment. It is also not producing the correct output, and is certainly not going to hold for variations in the above. While there could be variations in the INTERESTING CODE block, it would still be formatted as above. A two-column set of two columns, the CODE followed by the NAME

my $stuff; my @stuff; while (<DATA>) { if ($stuff) { push(@stuff, $_) if (!($_ =~ /^\s*CODE\s*NAME/) && !($_ =~ /^\s*-+/)); } else { $stuff = 1 if ($_ =~ /^INTERESTING CODE-+/); } } my @interesting_code; for (@stuff) { my @code = split(/\s{2,}/, $_); my $i = 0; for (@code) { push(@interesting_code, $_) if !($i % 2); $i++; } } $interesting_code = join(', ', @interesting_code);

I seek edification; and better, and more dependable code. All advice welcomed with gratitude.

Update: Reformatted so one wouldn't get a headache from reading it.

--

when small people start casting long shadows, it is time to go to bed

Replies are listed 'Best First'.
Re: Extracting formatted text block
by ikegami (Patriarch) on Mar 16, 2005 at 23:08 UTC

    My way (documented):

    use strict; use warnings; sub rtrim { local $_ = (@_ ? $_[0] : $_); s/\s+$//; return $_; } { local $_; # Skip junk at top. do { defined($_ = <DATA>) or die("Bad data.\n"); } while (!/^\s*-+\s+(-+)\s+-+\s+(-+)/); # Find field positions and lengths. # Could be hardcoded. my $match_start1 = $-[1]; my $match_length1 = $+[1] - $match_start1; my $match_start2 = $-[2]; my $match_length2 = $+[2] - $match_start2; # Extract the names. my @names1; my @names2; while (defined($_ = <DATA>) && !/^\s*$/) { push(@names1, rtrim(substr($_, $match_start1, $match_length1))) if length()-1 >= $match_start1; push(@names2, rtrim(substr($_, $match_start2, $match_length2))) if length()-1 >= $match_start2; } my @names = (@names1, @names2); # Display results. print(join(', ', @names), $/); }

    Update: Didn't notice there were two columns of names. Fixed.

Re: Extracting formatted text block
by Zaxo (Archbishop) on Mar 16, 2005 at 23:07 UTC

    Those are evidently fixed-width fields, so unpack is a likely choice.

    while (<DATA>) { # seek interesting stuff . . . while (<DATA>) { last if /^\n/; my (@codes,@names); ($codes[0],$names[0],$codes[1],$names[1]) = unpack 'xx A8 x A23 xx A8 xxx A22', $_; push @interesting_stuff, grep {$_} @names; } }
    The unpack format may not be exactly right for your real data, but it matches what I think I see here (I'd expect the two records in a line to have the same format).

    After Compline,
    Zaxo

      Two problems:

      1) Your solution doesn't take account lines that only have one column:
      x outside of string at !.pl line 28, <DATA> line 14.

      2) Your output not ordered correctly (if I fudge the data to get around the first problem):
      NAME ONE, AME EIGHT ..., NAME TWO THREE, AME NINE - TEN, NAME FOUR ..., AME ELEVEN, NAME FIVE - SIX, AME TWELVE, NAME SEVEN, XXX XXXXXX

Re: Extracting formatted text block
by jdporter (Paladin) on Mar 16, 2005 at 23:32 UTC

    Here, I turn a line such as

    " ---- "
    into a pattern (regex) that looks like
    / (....) /
    The transformation is accomplished via a set of very simple regexes. Once I have that, I use it to parse the fields out of the lines of data. I also use it to parse the header line. I remember the columns that the 'NAME' header appeared in, and only take those columns of data. Since there can be multiple sets of columns, I have to do a little kludgery to keep the first 'NAME' column of data separate from the second, and so on.

    (Update: Added a __DATA__ section and changed some variable names.)

    use strict; use warnings; my $gold_field = 'NAME'; my( @gold_aoa, @col, $pre,; $pat, $pad ); while (<DATA>) { chomp; if ( /^[- ]+$/ ) { # this line is a picture; turn it into a pattern $pad = ' ' x length($_); # overkill, I know. s/ -/ (-/g; s/^-/(-/g; s/- /-) /g; s/-$/-)/g; y/-/./; $pat = $_; # start looking for data. # prev contains header my @hdr = $prev =~ /^$pat/; s/\s+$// for @hdr; @col = grep { $hdr[$_] eq $gold_field } 0 .. $#hdr; next; } if ( $pat ) { $_ .= $pad; if ( my @dat = /^$pat/ ) { @dat = @dat[@col]; for my $i ( 0 .. $#dat ) { $dat[$i] =~ s/\s+$//; length($dat[$i]) and push @{ $gold_aoa[$i] }, $dat[$i]; } } else { # issue report local $, = ','; local $\ = "\n"; print map @$_, @gold_aoa; @gold_aoa = (); # stop looking for data undef $pat; } } $prev = $_; } __DATA__ Here's the original sample data: CODE NAME CODE NAME -------- ----------------------- -------- ----------------------- ABC NAME ONE RST NAME EIGHT ... DEF NAME TWO THREE WXY NAME NINE - TEN GHIJK NAME FOUR ... ZAB NAME ELEVEN LMN NAME FIVE - SIX CDE NAME TWELVE OPQ NAME SEVEN And here's another bunch of data. It all still works! CODE NAME AGE CODE NAME AGE CODE NAME AGE ---- ----- --- ---- --------- --- ---- ----------- --- ABC ONE 1 RST EIGHT 19 RS0 VEGA 39 DEF TWO 2 WXY NINE 23 WX0 SHELIAK 23 DEJ THREE 3 WXZ TEN 29 WY0 SULAFAT 29 GHI FOUR 9 ZAB ELEVEN 31 ZA0 AL ADFAR 31 LMN FIVE 10 CDE TWELVE 37 CD0 AL ATHFAR 37 LMS SIX 13 OPQ SEVEN 15

    (Update: Added the following commentary.)

    This solution was designed to be flexible (i.e. robust) in the face of variable numbers of column sets (i.e. your example showed two, but I wanted to allow for any), and variable column widths and gutter widths. Some of the other proposed solutions hard-code these parameters. I think solutions that key off of the 'INTERESTING CODE' line are particularly non-robust.

Re: Extracting formatted text block
by BrowserUk (Patriarch) on Mar 17, 2005 at 03:25 UTC

    #! perl -slw use strict; my @names; my $fmt; my( @col1, @col2 ); while( <DATA> ) { if( m[INTERESTING CODE--] .. m[^\s*$] ) { if( m[^\s*$] ) { print join', ', grep defined, @col1, @col2; @names = @col1 = @col2 = (); undef $fmt; next; } if( m[^[\s-]+$] ) { $fmt .= "\@$-[-1] A${ \( $+[-1]-$-[-1]) }" while m[\s+([-] ++)]g; next; } next unless $fmt; $_ .= chr(0)x100; ( undef, $col1[@col1], undef, $col2[@col2] ) = unpack $fmt, $_ +; } }

    Input and output

Re: Extracting formatted text block
by TedPride (Priest) on Mar 17, 2005 at 00:28 UTC
    This looks simple, but it's not. You can't use split to get the data, since the boundaries may be as small as one space and the data itself may contain spaces. Unpack gives errors if you try to access a position outside the string, so the last line is a major problem too. About the best you can do is the following, and it requires loading the whole set of data into memory at once to be even this neat.

    EDIT: I updated my code for the wrapping text and for reading line by line instead of all at once.

    use strict; use warnings; my (@col1, @col2); do {} until index(<DATA>, 'INTERESTING CODE') != -1; <DATA> for (0..1); while (<DATA>) { $_ .= ' ' x (67 - length $_); @_ = unpack 'x2A8x1A23x1A8x1A23', $_; push @col1, $_[1] if $_[0]; push @col2, $_[3] if $_[2]; } print '"' . join(', ', @col1, @col2) . '"'; __DATA__ blah blah blah blah blah blah blah blah blah blah blah blah INTERESTING CODE-- CODE NAME CODE NAME -------- ----------------------- -------- ----------------------- ABC NAME ONE RST NAME EIGHT ... DEF NAME TWO THREE WXY NAME NINE - TEN GHIJK NAME FOUR ... ZAB NAME ELEVEN LMN NAME FIVE - SIX CDE NAME TWELVE OPQ NAME SEVEN more blah blah blah
      a'ite TedPride. You have to explain what in Larry's name you are doing here. I am not familiar with the following two idioms

      do {} until index(<DATA>, 'INTERESTING CODE') != -1; <DATA> for (0..1);
      --

      when small people start casting long shadows, it is time to go to bed
Re: Extracting formatted text block
by punkish (Priest) on Mar 17, 2005 at 02:06 UTC
    Thanks to the monks you responded. I have the following --
    _TedPride(); # the above seems to go into an infinite loop. I have to # CTRL-C to exit. I also don't understand the idioms used _jdporter(); # seems neat, but... # the above exits immediately without printing anything _ikegami(); # the only code that thus far prints correctly. # I have to run it against several variations to # confirm its robustness, Btw, I do not understand # $-[1] and $+[1] bit. At first glance I thought it # was a typo! _Zaxo(); # as mentioned by ikegami, exits with # 'x' outside of string in unpack at ... line ... _punkish(); # works, but skips a value or two # Disclaimer: # I tried to be very careful, but I may have copied the code # given above incorrectly
    --

    when small people start casting long shadows, it is time to go to bed
      _jdporter(); # seems neat, but... # the above exits immediately without printing anything
      That's because it's reading from <DATA>, but there's no __DATA__ section. I assumed you would append an appropriate __DATA__ section, as you showed in the original post.
Re: Extracting formatted text block
by sh1tn (Priest) on Mar 17, 2005 at 02:29 UTC
    while( <DATA> ){ if( /INTERESTING CODE/ .. /^\n/ ){ if( /(?!<NAME.+)NAME\s(\w+(\s\W?\s?\w+)*)/ ){ push @col_1, $1 } if( /(\w+\W+){2,}NAME\s(\w+(\s\W?\s?\w+)*)/ ){ push @col_2, $2 } } } print join ', ', (map{ $_ = 'NAME '.$_}@col_1, @col_2) __END__ STDOUT: NAME ONE, NAME TWO THREE, NAME FOUR, NAME FIVE - SIX, NAME SEVEN, NAME EIGHT, NAME NINE - TEN, NAME ELEVEN, NAME TWELVE