PerlMonks
Re: Creating an abstract (updated)
by haukex (Archbishop) on Aug 09, 2021 at 22:36 UTC
Obligatory link: Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks

Based on your function, I'm presuming you want to preserve tags. If you didn't, the task would be easily accomplished with something like HTML::Strip.

You haven't provided any sample input, so I had to make some up; I hope it's representative. Note that it already demonstrates some flaws when I run it through your function: /(.*)<(.*)/ needs an /s flag, and the <p> and <i> tags are not closed properly. I could also easily break it completely with some of the tricks in the above link.

Doing the task "right" is unfortunately not exactly trivial, even with some of the nice HTML parsers. Here's my attempt, which I haven't fully put through its paces in terms of testing. It was a nice exercise because I haven't really used Mojo::DOM for DOM creation yet. Note how it counts characters of text only, not including the HTML tags.
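To illustrate the idea of counting text characters only while keeping the tags balanced, here is a minimal sketch. It is not necessarily the code from this post: it prunes the parsed tree in place rather than building a new DOM, and the names abstract_html and _prune are made up for the example. It assumes Mojolicious is installed and that the input string is already decoded, so that length counts characters.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: return $html cut down to at most $max characters
# of text. Tags are not counted, and the parser closes every tag it keeps.
sub abstract_html {
    my ($html, $max) = @_;
    my $dom  = Mojo::DOM->new($html);
    my $left = $max;              # remaining text-character budget
    _prune($dom, \$left);
    return $dom->to_string;
}

# Walk the children of $node in document order, spending the budget.
sub _prune {
    my ($node, $left) = @_;
    for my $n ( $node->child_nodes->each ) {
        if ( $$left <= 0 ) {
            $n->remove;           # budget used up: drop everything after
        }
        elsif ( $n->type eq 'text' ) {
            my $text = $n->content;
            if ( length($text) > $$left ) {
                # Cut the text node that crosses the limit.
                $n->content( substr $text, 0, $$left );
                $$left = 0;
            }
            else {
                $$left -= length $text;
            }
        }
        elsif ( $n->type eq 'tag' ) {
            _prune($n, $left);    # descend; the element itself is kept
        }
        # Other node types (comments etc.) are kept and cost nothing.
    }
}

print abstract_html('<p>Hello <i>world</i> and more</p>', 8), "\n";
```

With a budget of 8, the cut falls inside the <i> element, so the sample should come out as <p>Hello <i>wo</i></p>, with both tags still closed. Whitespace between tags counts toward the budget in this sketch.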
Update: The above can also be extended to filter certain tags by adding this before the elsif ( $n->type eq 'tag' ), where %filter is a hash with the keys being names of tags to remove (or the condition can be reversed to keep only those tags):
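A rough sketch of such a tag filter follows. It differs from the description above in that it runs as a separate pass over the whole document instead of as a branch in the tree walker. It also assumes that "remove" means dropping the tag while keeping its contents, which is what Mojo::DOM's strip does. The helper name strip_tags and the contents of %filter are made up for the example.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical filter set: keys are the names of tags to drop.
my %filter = map { $_ => 1 } qw/ b font /;

# Illustrative helper: remove every element whose tag name is a key of
# %$filter, but keep that element's children in place.
sub strip_tags {
    my ($html, $filter) = @_;
    my $dom = Mojo::DOM->new($html);
    $dom->descendant_nodes
        ->grep( sub { $_->type eq 'tag' && $filter->{ $_->tag } } )
        ->each( sub { $_->strip } );
    return $dom->to_string;
}

print strip_tags('<p>Hello <b>bold</b> and <i>it</i></p>', \%filter), "\n";
```

Negating the hash lookup in the grep turns this into a keep-list. Replacing strip with remove would drop the filtered elements together with their contents.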