RFC: Is there a better way to use Text::Balanced?

Lady_Aleena has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I have been writing a function to help me parse my lines. I had gotten help on this previously, however, I could not get it to work in Perl 5.8.8. I gave Text::Balanced a try and got it to work on my system. (I haven't uploaded the new code to my web host yet. I am trying to manage my expectations.) Before I get too excited about it, I ask for someone to look it over and let me know where I may encounter problems, if any. Have a cookie.

My system

A<a string>
ABBR<a string>
ACRONYM<a string>
B<a string>
BIG<a string>
CITE<a string>
CODE<a string>
DFN<a string>
EM<a string>
I<a string>
KBD<a string>
SAMP<a string>
SMALL<a string>
SPAN<a string>
STRONG<a string>
SUB<a string>
SUP<a string>
TT<a string>
VAR<a string>
[download]

Any of those could have attributes, so I denote them like...

A<link|href="url">
I<italic text|style="color:blue">
[download]

And they can be nested like...

I<A<link|href="url">|style="color:blue">
[download]

My code

package Base::HTML::Inline;
use strict;
use warnings;
use Exporter qw(import);
our @EXPORT_OK = qw(inline);

use Text::Balanced qw(extract_bracketed extract_multiple);

use Util::FancySplice;

my $allowed = join('|', qw(A ABBR ACRONYM B BIG CITE CODE DFN EM I KBD
+ Q SAMP SMALL SPAN STRONG SUB SUP TT VAR));

sub inline {
  my ($text) = @_;
  $text =~ s/\s\<!.+$//;
  
  my $brackets = fancy_splice(2,extract_multiple( $text, [ sub { extra
+ct_bracketed($_[0],'<>', qr(.*?(?=<)) ) } ] ));

  my $end = undef;
  if (scalar @{$brackets->[-1]} == 1) {
    $end = shift $brackets->[-1];
    pop @{$brackets};
  }

  my $line;
  for my $bracket (@$brackets) {
    (my $start = $bracket->[0]) =~ s/^(.+|)\b(.+)$/$1/;
    (my $tag   = $bracket->[0]) =~ s/^(.+|)\b(.+)$/$2/;

    if ($tag && $tag =~ /^(.+|)($allowed)$/) {
      $start .= $1;
      $tag = lc $2;
    }
    else {
      $tag = lc $tag;
    }

    (my $tagged = $bracket->[1]) =~ s/^\<(.+)\>$/$1/;
    my @in_parts = split(/\|/,$tagged);
    my $attribute = scalar @in_parts > 1 &&  $in_parts[-1] !~ /\>/ ? '
+ '.pop @in_parts : '';
    my $in_tag = join('|',@in_parts);

    $in_tag = $in_tag =~ /\</ ? inline($in_tag) : $in_tag;
    
    $line .= "$start<$tag$attribute>$in_tag</$tag>";
  }
  $line .= $end if $end;

  return $line;
}

1;
[download]

Here is the string I ran through my new inline function for the examples. I know it is long, but I will be parsing longer strings through inline.

my $text = 'Anyone who watches the Syfy channel knows that on Monday n
+ights they aired three television series I<A<EurSUP<e>ka|href="Movies
+_by_series.pl?series=EWA#EUReKA">|class="title">, I<A<Warehouse 13|hr
+ef="Movies_by_series.pl?series=EWA#Warehouse_13">>, and I<A<Alphas|hr
+ef="Movies_by_series.pl?series=EWA#Alphas">>. Some might not be aware
+ that these three series have formed a crossover cosmology which I ca
+ll A<EWA|href="Movies_by_series.pl?series=EWA"> <!-- This is a long s
+tring. -->';
[download]

The first thing I wanted to do was make sure any inline HTML comments I put in just for me are stripped. So I remove those first with s/\s\<!.+$// on line 15. Now the party can really start by extracting the brackets on line 17.

Since my lines can have more than one set of brackets in them, I have to use extract_multiple in conjunction with extract_bracketed. So, using those together, I get the following array.

$VAR1 = [
          'Anyone who watches the Syfy channel knows that on Monday ni
+ghts they aired three television series I',
          '<A<EurSUP<e>ka|href="Movies_by_series.pl?series=EWA#EUReKA"
+>>',
          ', I',
          '<A<Warehouse 13|href="Movies_by_series.pl?series=EWA#Wareho
+use_13">>',
          ', and I',
          '<A<Alphas|href="Movies_by_series.pl?series=EWA#Alphas">>',
          '. Some might not be aware that these three series have form
+ed a crossover cosmology which I call A',
          '<EWA|href="Movies_by_series.pl?series=EWA">',
          '.'
        ];
[download]

I was looking at it, and in moments I figured out how to group the results. It appears every other value is what is the bracket found with the value before it being what comes before the bracket. The last value is what appears after the last bracket, if anything. To make it a little easier for me to munge the lines, I need to group them. So I wrote a little function called fancy_splice to group the lines together.

fancy_splice

sub fancy_splice { 
  my ($amount, @in_list) = @_;
  my $list;
  while (@in_list) {
    push @$list,[splice(@in_list,0,$amount)];
  }
  return $list;
}
[download]

So, after putting the array returned by extract_multiple through fancy_splice, I get...

$VAR1 = [
          [
            'Anyone who watches the Syfy channel knows that on Monday 
+nights they aired three television series I',
            '<A<EurSUP<e>ka|href="Movies_by_series.pl?series=EWA#EUReK
+A">|class="title">'
          ],
          [
            ', I',
            '<A<Warehouse 13|href="Movies_by_series.pl?series=EWA#Ware
+house_13">>'
          ],
          [
            ', and I',
            '<A<Alphas|href="Movies_by_series.pl?series=EWA#Alphas">>'
          ],
          [
            '. Some might not be aware that these three series have fo
+rmed a crossover cosmology which I call A',
            '<EWA|href="Movies_by_series.pl?series=EWA">'
          ],
          [
            '.'
          ]
        ];
[download]

Now the party is in full swing with my lines with partners with the exception of the last guy standing alone at the end in this case. Because he is such a party pooper, I am going to deal with him first in lines 17 to 23 by assigning his value to $end. I pop off his group from the rest and send him to the bar.

Now I turn on the music and get the pairs dancing on line 25. The first partner I deal with is the lead, $bracket->[0]. On his tails is the tag which goes with the contents of the bracket (the last word of the value). So, I assign everything else in the lead to $start and put the last word into $tag.

Now, we will have to skip two loops ahead here to explain why $tag also needs munging. If you look at the first group's second value, you will see EurSUP. Well EurSUP is not an inline HTML tag, and I just want the SUP. So, lines 30 through 36 deal with getting the tags I want and rejoining the part, which is not the tag, with $start.

Now I am done dealing with the lead, it is time to deal with his partner which takes a bit more work. The first things I need to do are to name her $tagged and take her brackets off (line 38). I next look to see if the tag has any attributes, so I split $tagged by the pipes, in any. I then check the last value of the array has a > in it, and if not, pop it off the array and assign it to the $attribute. (If it did have a > in it, then it needs to be kept with the $in_tag because it may be the attribute of the next level tag or even the level after next.

When the dance is done for on set, I put the pieces together on line 43 and concatenate that piece to $line. When all the sets are done dancing, I bring $end back from the bar, and concatenate him too. All the dancing produces...

Anyone who watches the Syfy channel knows that on Monday nights they a
+ired three television series <i class="title"><a href="Movies_by_seri
+es.pl?series=EWA#EUReKA">Eur<sup>e</sup>ka</a></i>, <i><a href="Movie
+s_by_series.pl?series=EWA#Warehouse_13">Warehouse 13</a></i>, and <i>
+<a href="Movies_by_series.pl?series=EWA#Alphas">Alphas</a></i>. Some 
+might not be aware that these three series have formed a crossover co
+smology which I call <a href="Movies_by_series.pl?series=EWA">EWA</a>
[download]

Thank you for taking the time to read this, have another cookie. Please let me know if you see a problem or a place where I can tighten things up.

No matter how hysterical I get, my problems are not time sensitive. So, relax, have a cookie, and a very nice day!

Lady Aleena

Comment on RFC: Is there a better way to use Text::Balanced? Select or Download Code

Replies are listed 'Best First'.
Re: RFC: Is there a better way to use Text::Balanced? by Loops (Curate) on Nov 13, 2014 at 15:35 UTC
Hi Aleena, No idea if this is cookie worthy, but you got me interested in looking at the Text::Balanced module: use Text::Balanced qw(extract_multiple gen_extract_tagged); my $codes = qr(A\|ABBR\|ACRONYM\|B\|BIG\|CITE\|CODE\|DFN\|EM\|I\|KBD\|SAMP\|SMALL\| +SPAN\|STRONG\|SUB\|SUP\|TT\|VAR); my $extractor = [ gen_extract_tagged('<!--', '-->', ''), gen_extract_tagged("$codes<", '>', '') ]; # Join all plain text segments, and element substitutions; removing co +mments sub inline { my $text = shift; my $result = ''; for (extract_multiple $text, $extractor) { $_ = element(lc $1,$2) if /^($codes)<(.*)>$/; $result .= $_ unless (/^<!--/); } return $result; } # Handle a tag element with attribute if one exists, and then recurse +into it sub element { my ($id,$inner) = @_; return ($inner =~ s/\\|([^<>]+?)$// ? "<$id $1>" : "<$id>") . inline($inner) . "</$id>"; } print inline 'Anyone who watches the Syfy channel knows that on Monday nights ' +. 'they aired three television series I<A<EurSUP<e>ka\|href=' . '"Movies_by_series.pl?series=EWA#EUReKA">\|class="title">, I<A<Wareh +ouse ' . '13\|href="Movies_by_series.pl?series=EWA#Warehouse_13">>, and I<A<A +lphas' . '\|href="Movies_by_series.pl?series=EWA#Alphas">>. Some might not be + aware ' . 'that these three series have formed a crossover cosmology which I +call A<' . 'EWA\|href="Movies_by_series.pl?series=EWA"><!-- This is a long stri +ng. -->'; [download] Which prints the same output as in your example using Perl 5.8.8. The key is using extract_tagged which will match your codes exactly. For instance this is the results of the first extract_multiple : 'Anyone who watches the Syfy channel knows that on Monday nights they +aired three television series ' 'I<A<EurSUP<e>ka\|href="Movies_by_series.pl?series=EWA#EUReKA">\|class=" +title">' ', ' 'I<A<Warehouse 13\|href="Movies_by_series.pl?series=EWA#Warehouse_13">> +' ', and ' 'I<A<Alphas\|href="Movies_by_series.pl?series=EWA#Alphas">>' '. Some might not be aware that these three series have formed a cross +over cosmology which I call ' 'A<EWA\|href="Movies_by_series.pl?series=EWA">' '<!-- This is a long string. -->' [download] Everything is already broken up nicely and all that is left to do is recurse inside the tags found to handle any nested codes. This does nothing more than your code however, as usual just TIMTOWTDI.	[reply] [d/l] [select]
Re: RFC: Is there a better way to use Text::Balanced? by tobyink (Canon) on Nov 15, 2014 at 21:45 UTC
OK, this took me a lot longer than I expected it to, and the algorithm ended up a little convoluted, but I think this better handles a few edge cases... use v5.14; use strictures; package Parser { use Moo 1.006000; use Types::Standard qw( RegexpRef ArrayRef ); use Text::Balanced qw( extract_bracketed ); use HTML::Entities qw( encode_entities ); use namespace::autoclean; my $Allowance = RegexpRef->plus_coercions( ArrayRef, sub { qr/${\( join "\|", map quotemeta, @$_ )}/ }, ); has allowed_tags => ( is => 'ro', isa => $Allowance, coerce => 1, builder => sub { [qw(A ABBR ACRONYM B BIG CITE CODE DFN EM I KBD Q SAMP SMALL SPAN STRONG SUB SUP TT VAR)] }, ); sub print { my $self = shift; $self = $self->new unless ref $self; print $self->parse($_) for @_; } sub parse { my $self = shift; my ($text) = @_; my $tags = $self->allowed_tags; my ($before, $match) = ($text =~ m{ \A # start of string (.*?) # leading text ($before) ( # either... \<\!-- # the start of a comment \| # or... $tags\< # a tag ) }xsm) or do { my @return = split /\\|/, $text; $return[0] = encode_entities($return[0]); return @return; }; # strip $before from $text substr($text, 0, length($before)) = ''; # If the first thing that needed to be handled was a comment if ($match eq '<!--') { # Strip it out $text =~ s/\<\!--(.+?)--\>//g; # Handle the rest via recursion return join "", $before, $self->parse($text); } chop(my $found_tag = lc $match); substr($text, 0, length($found_tag)) = ''; my ($got, $remainder) = extract_bracketed($text, q/<"'>/); $got = substr($got, 1, length($got) - 2); my ($markup, @attrs) = $self->parse($got); my ($more_markup, @more_attrs) = $self->parse($remainder); $_ //= '' for $markup, $more_markup; join("", $before, (@attrs ? "<$found_tag @attrs>" : "<$found_tag>"), $markup, "</$found_tag>", $more_markup, ), @more_attrs; } } Parser->print(<<'TEXT'); Anyone who watches the Syfy channel knows that on Monday nights they aired three television series I<A<EurSUP<e>ka\|href="Movies_by_series.pl?series=EWA#EUReKA">\|class="t +itle">, I<A<Warehouse & 13\|href="Movies_by_series.pl?series=EWA#Warehouse_13"> +>, and I<A<Alphas\|href="Movies_by_series.pl?series=EWA#Alphas">>. Some might not be aware that these three series have formed a crossove +r cosmology which I call A<EWA\|href="Movies_by_series.pl?series=EWA"> <!-- This is a long string. --> TEXT [download] toby d�t ink	[reply] [d/l]
Re^2: RFC: Is there a better way to use Text::Balanced? by Lady_Aleena (Priest) on Nov 15, 2014 at 22:26 UTC
Hello tobyink. Thank you for taking the time to write this, however I don't understand it and would not be able to maintain it. You used objects which completely confuse me. Also, you used Perl 5.14 instead of 5.8.8, so it may not work on my web host. I hadn't heard of strictures until now. I think you encoded the entities prematurely (only encode the text within the tags). I'm not sure why autoclean is needed, or why you needed Types::Standard. Could this be rewritten in a purely functional way, or are objects required to make this work? I don't see a function called `inline` to return the parsed text (the text is printed later) or a way to create it. I'm sorry I can't make use of this, would cookies help? *No matter how hysterical I get, my problems are not time sensitive. So, relax, have a cookie, and a very nice day!* Lady Aleena	[reply] [d/l]
Re^3: RFC: Is there a better way to use Text::Balanced? by tobyink (Canon) on Nov 17, 2014 at 14:18 UTC
It's easy enough to change it to a single function... use strict; use warnings; use Text::Balanced qw( extract_bracketed ); use HTML::Entities qw( encode_entities ); my $default_allowed_tags = do { my @tags = qw( A ABBR ACRONYM B BIG CITE CODE DFN EM I KBD Q SAMP SMALL SPAN STRONG SUB SUP TT VAR ); qr/${\( join "\|", map quotemeta, @tags )}/; }; sub parse_markup { my ($text, $tags) = @_; $tags \|\|= $default_allowed_tags; my ($before, $match) = ($text =~ m{ \A # start of string (.*?) # leading text ($before) ( # either... \<\!-- # the start of a comment \| # or... $tags\< # a tag ) }xsm) or do { my @return = split /\\|/, $text; $return[0] = encode_entities($return[0]); return @return; }; # strip $before from $text substr($text, 0, length($before)) = ''; # If the first thing that needed to be handled was a comment if ($match eq '<!--') { # Strip it out $text =~ s/\<\!--(.+?)--\>//g; # Handle the rest via recursion return join "", $before, parse_markup($text, $tags); } chop(my $found_tag = lc $match); substr($text, 0, length($found_tag)) = ''; my ($got, $remainder) = extract_bracketed($text, q/<"'>/); $got = substr($got, 1, length($got) - 2); my ($markup, @attrs) = parse_markup($got, $tags); my ($more_markup, @more_attrs) = parse_markup($remainder, $tags); defined($_) or $_='' for $markup, $more_markup; join("", $before, (@attrs ? "<$found_tag @attrs>" : "<$found_tag>"), $markup, "</$found_tag>", $more_markup, ), @more_attrs; } print for parse_markup(<<'TEXT'); Anyone who watches the Syfy channel knows that on Monday nights they aired three television series I<A<EurSUP<e>ka\|href="Movies_by_series.pl?series=EWA#EUReKA">\|class="t +itle">, I<A<Warehouse & 13\|href="Movies_by_series.pl?series=EWA#Warehouse_13"> +>, and I<A<Alphas\|href="Movies_by_series.pl?series=EWA#Alphas">>. Some might not be aware that these three series have formed a crossove +r cosmology which I call A<EWA\|href="Movies_by_series.pl?series=EWA"> <!-- This is a long string. --> TEXT [download] But personally I prefer the OO version because it means I can have a `$parser` object, which I can pass around. Code that needs to process some markup gets given the markup, and given the parser, and simply throws the markup at the parser. This makes it easy to switch in a different parser if required (say, one that generated stricter XHTML, or one that processed Markdown). toby d�t ink c	[reply] [d/l] [select]
Re: RFC: Is there a better way to use Text::Balanced? by RonW (Parson) on Nov 13, 2014 at 18:09 UTC
This is very good. I am impressed. And your description of what your code does and how is very good. Your test string looks like it gives your code a thorough work out. There might be a very few corner cases to resolve, but I'd say this is ready to go live.	[reply]