Lady_Aleena has asked for the wisdom of the Perl Monks concerning the following question:
Hello. I have been writing a function to help me parse my lines. I had gotten help on this previously, however, I could not get it to work in Perl 5.8.8. I gave Text::Balanced a try and got it to work on my system. (I haven't uploaded the new code to my web host yet. I am trying to manage my expectations.) Before I get too excited about it, I ask for someone to look it over and let me know where I may encounter problems, if any. Have a cookie.
My system
A<a string> ABBR<a string> ACRONYM<a string> B<a string> BIG<a string> CITE<a string> CODE<a string> DFN<a string> EM<a string> I<a string> KBD<a string> SAMP<a string> SMALL<a string> SPAN<a string> STRONG<a string> SUB<a string> SUP<a string> TT<a string> VAR<a string>
Any of those could have attributes, so I denote them like...
A<link|href="url"> I<italic text|style="color:blue">
And they can be nested like...
I<A<link|href="url">|style="color:blue">
My code
package Base::HTML::Inline; use strict; use warnings; use Exporter qw(import); our @EXPORT_OK = qw(inline); use Text::Balanced qw(extract_bracketed extract_multiple); use Util::FancySplice; my $allowed = join('|', qw(A ABBR ACRONYM B BIG CITE CODE DFN EM I KBD + Q SAMP SMALL SPAN STRONG SUB SUP TT VAR)); sub inline { my ($text) = @_; $text =~ s/\s\<!.+$//; my $brackets = fancy_splice(2,extract_multiple( $text, [ sub { extra +ct_bracketed($_[0],'<>', qr(.*?(?=<)) ) } ] )); my $end = undef; if (scalar @{$brackets->[-1]} == 1) { $end = shift $brackets->[-1]; pop @{$brackets}; } my $line; for my $bracket (@$brackets) { (my $start = $bracket->[0]) =~ s/^(.+|)\b(.+)$/$1/; (my $tag = $bracket->[0]) =~ s/^(.+|)\b(.+)$/$2/; if ($tag && $tag =~ /^(.+|)($allowed)$/) { $start .= $1; $tag = lc $2; } else { $tag = lc $tag; } (my $tagged = $bracket->[1]) =~ s/^\<(.+)\>$/$1/; my @in_parts = split(/\|/,$tagged); my $attribute = scalar @in_parts > 1 && $in_parts[-1] !~ /\>/ ? ' + '.pop @in_parts : ''; my $in_tag = join('|',@in_parts); $in_tag = $in_tag =~ /\</ ? inline($in_tag) : $in_tag; $line .= "$start<$tag$attribute>$in_tag</$tag>"; } $line .= $end if $end; return $line; } 1;
Here is the string I ran through my new inline function for the examples. I know it is long, but I will be parsing longer strings through inline.
my $text = 'Anyone who watches the Syfy channel knows that on Monday n +ights they aired three television series I<A<EurSUP<e>ka|href="Movies +_by_series.pl?series=EWA#EUReKA">|class="title">, I<A<Warehouse 13|hr +ef="Movies_by_series.pl?series=EWA#Warehouse_13">>, and I<A<Alphas|hr +ef="Movies_by_series.pl?series=EWA#Alphas">>. Some might not be aware + that these three series have formed a crossover cosmology which I ca +ll A<EWA|href="Movies_by_series.pl?series=EWA"> <!-- This is a long s +tring. -->';
The first thing I wanted to do was make sure any inline HTML comments I put in just for me are stripped. So I remove those first with s/\s\<!.+$// on line 15. Now the party can really start by extracting the brackets on line 17.
Since my lines can have more than one set of brackets in them, I have to use extract_multiple in conjunction with extract_bracketed. So, using those together, I get the following array.
$VAR1 = [ 'Anyone who watches the Syfy channel knows that on Monday ni +ghts they aired three television series I', '<A<EurSUP<e>ka|href="Movies_by_series.pl?series=EWA#EUReKA" +>>', ', I', '<A<Warehouse 13|href="Movies_by_series.pl?series=EWA#Wareho +use_13">>', ', and I', '<A<Alphas|href="Movies_by_series.pl?series=EWA#Alphas">>', '. Some might not be aware that these three series have form +ed a crossover cosmology which I call A', '<EWA|href="Movies_by_series.pl?series=EWA">', '.' ];
I was looking at it, and in moments I figured out how to group the results. It appears every other value is what is the bracket found with the value before it being what comes before the bracket. The last value is what appears after the last bracket, if anything. To make it a little easier for me to munge the lines, I need to group them. So I wrote a little function called fancy_splice to group the lines together.
fancy_splice
sub fancy_splice { my ($amount, @in_list) = @_; my $list; while (@in_list) { push @$list,[splice(@in_list,0,$amount)]; } return $list; }
So, after putting the array returned by extract_multiple through fancy_splice, I get...
$VAR1 = [ [ 'Anyone who watches the Syfy channel knows that on Monday +nights they aired three television series I', '<A<EurSUP<e>ka|href="Movies_by_series.pl?series=EWA#EUReK +A">|class="title">' ], [ ', I', '<A<Warehouse 13|href="Movies_by_series.pl?series=EWA#Ware +house_13">>' ], [ ', and I', '<A<Alphas|href="Movies_by_series.pl?series=EWA#Alphas">>' ], [ '. Some might not be aware that these three series have fo +rmed a crossover cosmology which I call A', '<EWA|href="Movies_by_series.pl?series=EWA">' ], [ '.' ] ];
Now the party is in full swing with my lines with partners with the exception of the last guy standing alone at the end in this case. Because he is such a party pooper, I am going to deal with him first in lines 17 to 23 by assigning his value to $end. I pop off his group from the rest and send him to the bar.
Now I turn on the music and get the pairs dancing on line 25. The first partner I deal with is the lead, $bracket->[0]. On his tails is the tag which goes with the contents of the bracket (the last word of the value). So, I assign everything else in the lead to $start and put the last word into $tag.
Now, we will have to skip two loops ahead here to explain why $tag also needs munging. If you look at the first group's second value, you will see EurSUP. Well EurSUP is not an inline HTML tag, and I just want the SUP. So, lines 30 through 36 deal with getting the tags I want and rejoining the part, which is not the tag, with $start.
Now I am done dealing with the lead, it is time to deal with his partner which takes a bit more work. The first things I need to do are to name her $tagged and take her brackets off (line 38). I next look to see if the tag has any attributes, so I split $tagged by the pipes, in any. I then check the last value of the array has a > in it, and if not, pop it off the array and assign it to the $attribute. (If it did have a > in it, then it needs to be kept with the $in_tag because it may be the attribute of the next level tag or even the level after next.
When the dance is done for on set, I put the pieces together on line 43 and concatenate that piece to $line. When all the sets are done dancing, I bring $end back from the bar, and concatenate him too. All the dancing produces...
Anyone who watches the Syfy channel knows that on Monday nights they a +ired three television series <i class="title"><a href="Movies_by_seri +es.pl?series=EWA#EUReKA">Eur<sup>e</sup>ka</a></i>, <i><a href="Movie +s_by_series.pl?series=EWA#Warehouse_13">Warehouse 13</a></i>, and <i> +<a href="Movies_by_series.pl?series=EWA#Alphas">Alphas</a></i>. Some +might not be aware that these three series have formed a crossover co +smology which I call <a href="Movies_by_series.pl?series=EWA">EWA</a>
Thank you for taking the time to read this, have another cookie. Please let me know if you see a problem or a place where I can tighten things up.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: RFC: Is there a better way to use Text::Balanced?
by Loops (Curate) on Nov 13, 2014 at 15:35 UTC | |
|
Re: RFC: Is there a better way to use Text::Balanced?
by tobyink (Canon) on Nov 15, 2014 at 21:45 UTC | |
by Lady_Aleena (Priest) on Nov 15, 2014 at 22:26 UTC | |
by tobyink (Canon) on Nov 17, 2014 at 14:18 UTC | |
|
Re: RFC: Is there a better way to use Text::Balanced?
by RonW (Parson) on Nov 13, 2014 at 18:09 UTC |