Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Creating an abstract

by Bod (Parson)
on Aug 09, 2021 at 20:14 UTC ( [id://11135746]=perlquestion: print w/replies, xml ) Need Help??

Bod has asked for the wisdom of the Perl Monks concerning the following question:

In a few places I have code that takes a significant piece of text, such as a knowledgebase article, and splits a small section from the beginning as an abstract. The code I generally use came from an answer to Splitting long text for Template

But now, for a different application, I want to create an abstract of text which may contain HTML tags. The problem comes with not wanting to split an HTML tag in two. I want either all of it or none of it. So this is the code I am using...

sub abstract { my $text = shift; if (length $text > 200 and $text =~ /^(.{0,200}\b)(.*)$/s) { $text = "$1..."; } # Check we have not split an HTML tag my $lt = $text =~ tr/<//; my $gt = $text =~ tr/>//; if ($lt != $gt) { my ($keep, $strip) = $text =~ /(.*)<(.*)/; $text = "$keep..."; } return $text; }
It does exactly what I want.

However, I cannot help thinking that the code could be more succinct...
Can you suggest a better way to do it?

Replies are listed 'Best First'.
Re: Creating an abstract (updated)
by haukex (Archbishop) on Aug 09, 2021 at 22:36 UTC

    Obligatory Link to Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks...

    Based on your function I'm presuming you want to preserve tags - if you didn't, then the task would be easily accomplished with something like HTML::Strip.

    You haven't provided any sample input, so I had to make some up, I hope it's representative - but note that it already demonstrates some flaws if I run it through your function: /(.*)<(.*)/ needs an /s flag, and the <p> and <i> tags are not closed properly. I could also easily break it completely with some of the tricks in the above link.

    Doing the task "right" is unfortunately not exactly trivial even with some of the nice HTML parsers. Here's my attempt, which I haven't fully put through its paces in terms of testing. It was a nice exercise because I actually haven't really used Mojo::DOM for DOM creation yet. Note how it counts characters of text only, not including the HTML tags.

    use warnings; use strict; print html_abstract(<<'END_HTML', 200), "\n"; <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed tristique purus urna, a lacinia nulla euismod et. Pellentesque tempus et justo faucibus. <i>Fusce scelerisque, <b>magna</b> <a href="http://www.example.com">efficitur congue, leo nibh</a> volutpat nibh, ac mattis dolor ipsum sit amet quam.</i> Suspendisse eleifend id ligula quis placerat. Pellentesque fermentum eu magna sed mollis. Quisque placerat efficitur blandit. Vestibulum non.</p> END_HTML use Mojo::DOM; sub html_abstract { my ($html, $remain) = @_; my $walk; $walk = sub { my ($in, $out) = @_; for my $n ( @{ $in->child_nodes } ) { last unless $remain; if ( $n->type eq 'cdata' || $n->type eq 'text' ) { my $txt = $n->content; if ( length $txt < $remain ) { $out->append_content($txt); $remain -= length $txt; } else { $txt =~ /^(.{0,$remain}\b)/s; $out->append_content("$1..."); $remain = 0; } } elsif ( $n->type eq 'tag' ) { my $t = $out->new_tag( $n->tag, %{ $n->attr } ) # new_tag gives us a "root", but we want the tag ->child_nodes->first; $walk->($n, $t); $out->append_content($t); } # ignore other node types for now } return $out; }; return $walk->(Mojo::DOM->new($html), Mojo::DOM->new)->to_string; } __END__ <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed tristique purus urna, a lacinia nulla euismod et. Pellentesque tempus et justo faucibus. <i>Fusce scelerisque, <b>magna</b> <a href="http:// +www.example.com">efficitur congue, leo ...</a></i></p>

    Update: The above can also be extended to filter certain tags by adding this before the elsif ( $n->type eq 'tag' ), where %filter is a hash with the keys being names of tags to remove (or the condition can be reversed to keep only those tags):

    elsif ( $n->type eq 'tag' && $filter{$n->tag} ) { $walk->($n, $out) }
Re: Creating an abstract
by tybalt89 (Monsignor) on Aug 09, 2021 at 21:38 UTC

    succinct - sort of, better - ???

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11135746 use warnings; use Data::Dump 'dd'; local $_ = 'text <one 1> three <two 2> four <five 5> K <six 6> lastwor +d'; for my $len ( 30 .. length ) { my $text = s/^.{0,$len}\b\K.*//sr =~ s/<[^<>]*\z//sr . '...'; dd $text; }

    Outputs:

    "text <one 1> three <two 2> ..." "text <one 1> three <two 2> four..." "text <one 1> three <two 2> four..." "text <one 1> three <two 2> four ..." "text <one 1> three <two 2> four ..." "text <one 1> three <two 2> four ..." "text <one 1> three <two 2> four ..." "text <one 1> three <two 2> four ..." "text <one 1> three <two 2> four ..." "text <one 1> three <two 2> four ..." "text <one 1> three <two 2> four ..." "text <one 1> three <two 2> four <five 5> ..." "text <one 1> three <two 2> four <five 5> K..." "text <one 1> three <two 2> four <five 5> K..." "text <one 1> three <two 2> four <five 5> K ..." "text <one 1> three <two 2> four <five 5> K ..." "text <one 1> three <two 2> four <five 5> K ..." "text <one 1> three <two 2> four <five 5> K ..." "text <one 1> three <two 2> four <five 5> K ..." "text <one 1> three <two 2> four <five 5> K ..." "text <one 1> three <two 2> four <five 5> K ..." "text <one 1> three <two 2> four <five 5> K <six 6> ..." "text <one 1> three <two 2> four <five 5> K <six 6> ..." "text <one 1> three <two 2> four <five 5> K <six 6> ..." "text <one 1> three <two 2> four <five 5> K <six 6> ..." "text <one 1> three <two 2> four <five 5> K <six 6> ..." "text <one 1> three <two 2> four <five 5> K <six 6> ..." "text <one 1> three <two 2> four <five 5> K <six 6> ..." "text <one 1> three <two 2> four <five 5> K <six 6> ..." "text <one 1> three <two 2> four <five 5> K <six 6> lastword..."

      It's certainly more succinct and it behaves slightly differently. Your version includes the tag if it is to be split whereas mine removes it - this makes no difference to my application. It's taken me a little while to work out how the regexp works and I very much doubt I could write it for myself tomorrow...so probably not easily maintainable by me.

      But certainly succinct and rather impressive :)

        Nope, mine removes the last tag if it's not closed.

Re: Creating an abstract
by Anonymous Monk on Aug 10, 2021 at 10:14 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11135746]
Approved by haukex
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (7)
As of 2024-04-19 06:44 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found