Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re: RFC: Is there a better way to use Text::Balanced?

by tobyink (Canon)
on Nov 15, 2014 at 21:45 UTC ( [id://1107322]=note: print w/replies, xml ) Need Help??


in reply to RFC: Is there a better way to use Text::Balanced?

OK, this took me a lot longer than I expected it to, and the algorithm ended up a little convoluted, but I think this better handles a few edge cases...

use v5.14; use strictures; package Parser { use Moo 1.006000; use Types::Standard qw( RegexpRef ArrayRef ); use Text::Balanced qw( extract_bracketed ); use HTML::Entities qw( encode_entities ); use namespace::autoclean; my $Allowance = RegexpRef->plus_coercions( ArrayRef, sub { qr/${\( join "|", map quotemeta, @$_ )}/ }, ); has allowed_tags => ( is => 'ro', isa => $Allowance, coerce => 1, builder => sub { [qw(A ABBR ACRONYM B BIG CITE CODE DFN EM I KBD Q SAMP SMALL SPAN STRONG SUB SUP TT VAR)] }, ); sub print { my $self = shift; $self = $self->new unless ref $self; print $self->parse($_) for @_; } sub parse { my $self = shift; my ($text) = @_; my $tags = $self->allowed_tags; my ($before, $match) = ($text =~ m{ \A # start of string (.*?) # leading text ($before) ( # either... \<\!-- # the start of a comment | # or... $tags\< # a tag ) }xsm) or do { my @return = split /\|/, $text; $return[0] = encode_entities($return[0]); return @return; }; # strip $before from $text substr($text, 0, length($before)) = ''; # If the first thing that needed to be handled was a comment if ($match eq '<!--') { # Strip it out $text =~ s/\<\!--(.+?)--\>//g; # Handle the rest via recursion return join "", $before, $self->parse($text); } chop(my $found_tag = lc $match); substr($text, 0, length($found_tag)) = ''; my ($got, $remainder) = extract_bracketed($text, q/<"'>/); $got = substr($got, 1, length($got) - 2); my ($markup, @attrs) = $self->parse($got); my ($more_markup, @more_attrs) = $self->parse($remainder); $_ //= '' for $markup, $more_markup; join("", $before, (@attrs ? "<$found_tag @attrs>" : "<$found_tag>"), $markup, "</$found_tag>", $more_markup, ), @more_attrs; } } Parser->print(<<'TEXT'); Anyone who watches the Syfy channel knows that on Monday nights they aired three television series I<A<EurSUP<e>ka|href="Movies_by_series.pl?series=EWA#EUReKA">|class="t +itle">, I<A<Warehouse & 13|href="Movies_by_series.pl?series=EWA#Warehouse_13"> +>, and I<A<Alphas|href="Movies_by_series.pl?series=EWA#Alphas">>. Some might not be aware that these three series have formed a crossove +r cosmology which I call A<EWA|href="Movies_by_series.pl?series=EWA"> <!-- This is a long string. --> TEXT

Replies are listed 'Best First'.
Re^2: RFC: Is there a better way to use Text::Balanced?
by Lady_Aleena (Priest) on Nov 15, 2014 at 22:26 UTC

    Hello tobyink. Thank you for taking the time to write this, however I don't understand it and would not be able to maintain it. You used objects which completely confuse me. Also, you used Perl 5.14 instead of 5.8.8, so it may not work on my web host. I hadn't heard of strictures until now. I think you encoded the entities prematurely (only encode the text within the tags). I'm not sure why autoclean is needed, or why you needed Types::Standard. Could this be rewritten in a purely functional way, or are objects required to make this work? I don't see a function called inline to return the parsed text (the text is printed later) or a way to create it.

    I'm sorry I can't make use of this, would cookies help?

    No matter how hysterical I get, my problems are not time sensitive. So, relax, have a cookie, and a very nice day!
    Lady Aleena

      It's easy enough to change it to a single function...

      use strict; use warnings; use Text::Balanced qw( extract_bracketed ); use HTML::Entities qw( encode_entities ); my $default_allowed_tags = do { my @tags = qw( A ABBR ACRONYM B BIG CITE CODE DFN EM I KBD Q SAMP SMALL SPAN STRONG SUB SUP TT VAR ); qr/${\( join "|", map quotemeta, @tags )}/; }; sub parse_markup { my ($text, $tags) = @_; $tags ||= $default_allowed_tags; my ($before, $match) = ($text =~ m{ \A # start of string (.*?) # leading text ($before) ( # either... \<\!-- # the start of a comment | # or... $tags\< # a tag ) }xsm) or do { my @return = split /\|/, $text; $return[0] = encode_entities($return[0]); return @return; }; # strip $before from $text substr($text, 0, length($before)) = ''; # If the first thing that needed to be handled was a comment if ($match eq '<!--') { # Strip it out $text =~ s/\<\!--(.+?)--\>//g; # Handle the rest via recursion return join "", $before, parse_markup($text, $tags); } chop(my $found_tag = lc $match); substr($text, 0, length($found_tag)) = ''; my ($got, $remainder) = extract_bracketed($text, q/<"'>/); $got = substr($got, 1, length($got) - 2); my ($markup, @attrs) = parse_markup($got, $tags); my ($more_markup, @more_attrs) = parse_markup($remainder, $tags); defined($_) or $_='' for $markup, $more_markup; join("", $before, (@attrs ? "<$found_tag @attrs>" : "<$found_tag>"), $markup, "</$found_tag>", $more_markup, ), @more_attrs; } print for parse_markup(<<'TEXT'); Anyone who watches the Syfy channel knows that on Monday nights they aired three television series I<A<EurSUP<e>ka|href="Movies_by_series.pl?series=EWA#EUReKA">|class="t +itle">, I<A<Warehouse & 13|href="Movies_by_series.pl?series=EWA#Warehouse_13"> +>, and I<A<Alphas|href="Movies_by_series.pl?series=EWA#Alphas">>. Some might not be aware that these three series have formed a crossove +r cosmology which I call A<EWA|href="Movies_by_series.pl?series=EWA"> <!-- This is a long string. --> TEXT

      But personally I prefer the OO version because it means I can have a $parser object, which I can pass around. Code that needs to process some markup gets given the markup, and given the parser, and simply throws the markup at the parser. This makes it easy to switch in a different parser if required (say, one that generated stricter XHTML, or one that processed Markdown).

      c

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1107322]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having a coffee break in the Monastery: (3)
As of 2024-03-29 02:26 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found