in reply to Extracting a substring of N chars ignoring embedded HTML

Probably won't win any prizes as it doesn't use any modules and uses regexes for parsing HTML, but it seems to deal with most things I've thrown at it, including both types of quoted attributes even when they contain embedded '>' chars.

#! perl -slw
use vars qw[$required];
use strict;

sub abstract (\$$) {
    my ($data, $req) = @_;
    my ($s, $p) = (0, 0);
    while ($p < $req) {
        $$data =~ /[^<]{1,${\($req - $p)}}/gc;
        last if ($p += pos($$data) - $s) >= $req;
        my ($q) = $$data =~ /\G[^"'>]+(.)/gc;    #!"
        $$data =~ /\G[^$q]+/gc if $q =~ /["']/;  #!"
        $$data =~ /\G[^>]+/gc;
        $s = ++pos($$data);
    }
    $$data =~ /\G[\w]+/gc;
    return substr( $$data, 0, pos($$data) );
}

my $data = join '', <DATA>;
$data =~ tr/\n/ /d;
print abstract( $data, $required||40 );

__DATA__
A story on GettingIt about
<a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking polls</a>.
Contrary to what the article says, Time is <b>not</b> checking for multiple votes on
<a href="javascript:document.timedigital.submit();">their
poll</a>. And I'm happy to report that despite the fact that
my cheater scripts aren't running, I'm still beating Bill Gates.
Some <b>more</b> test data.
<table name='a <probably invalid> name' width='%80'>
<TR align=right><TH>Things</TH><TH>Values</TH></TR>
<TR align=center><TD>A thing</TD><TD>A value</TD></TR>
</table>

Some sample output

c:\test>226146.pl -required=225
A story on GettingIt about<a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking polls</a>.Contrary to what the article says, Time is <b>not</b> checking for multiple votes on<a href="javascript:document.timedigital.submit();">theirpoll</a>. And I'm happy to report that despite the fact thatmy cheater scripts aren't running, I'm still beating

c:\test>226146.pl -required=30
A story on GettingIt about<a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking

c:\test>226146.pl -required=30
A story on GettingIt about <a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking

c:\test>226146.pl -required=225
A story on GettingIt about <a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking polls</a>. Contrary to what the article says, Time is <b>not</b> checking for multiple votes on <a href="javascript:document.timedigital.submit();">their poll</a>. And I'm happy to report that despite the fact that my cheater scripts aren't running, I'm still

c:\test>226146.pl -required=300
A story on GettingIt about <a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking polls</a>. Contrary to what the article says, Time is <b>not</b> checking for multiple votes on <a href="javascript:document.timedigital.submit();">their poll</a>. And I'm happy to report that despite the fact that my cheater scripts aren't running, I'm still beating Bill Gates. Some <b>more</b> test data. <table name='a <probably invalid> name' width='%80'> <TR align=right><TH>Things</TH><TH

c:\test>226146.pl -required=30
A story on GettingIt about <a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking
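
The -required=300 run above stops on a dangling <TH, and elements that are still open at the cut are never closed. One way that might be avoided while staying module-free is sketched below. This is only an illustration, not the code posted above: the name abstract_closed, the short list of void elements, and the choice to cut at exactly N text characters (rather than running on to the end of the word) are all assumptions, and a regex tokeniser like this will still trip over comments, CDATA and similar constructs that a real parser handles.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return a prefix of $html
# holding at most $limit characters of text, never ending inside a tag, and
# with any elements still open at the cut closed again.
sub abstract_closed {
    my ($html, $limit) = @_;
    my ($out, $count, @open) = ('', 0);

    # Assumed, incomplete list of elements that never take a closing tag.
    my %void = map { $_ => 1 } qw(br hr img input meta link);

    # Walk the string token by token: either one whole tag (quoted attribute
    # values may contain '>') or one run of text up to the next '<'.
    while ($html =~ /\G(?:(<(?:"[^"]*"|'[^']*'|[^'">])*>)|([^<]+))/gc) {
        my ($tag, $text) = ($1, $2);

        if (defined $tag) {
            $out .= $tag;                      # tags cost nothing
            if ($tag =~ m{^</([\w:-]+)}) {     # closing tag
                my $name = lc $1;
                pop @open if @open && $open[-1] eq $name;
            }
            elsif ($tag =~ m{^<([\w:-]+)}) {   # opening tag
                my $name = lc $1;
                push @open, $name unless $void{$name} || $tag =~ m{/>$};
            }
            next;
        }

        my $room = $limit - $count;
        if (length($text) > $room) {           # this run crosses the limit
            $out .= substr($text, 0, $room);
            last;
        }
        $out   .= $text;
        $count += length $text;
        last if $count >= $limit;
    }

    # Close whatever is still open, innermost first.
    $out .= "</$_>" for reverse @open;
    return $out;
}

print abstract_closed(q{Time is <b>not</b> checking}, 10), "\n";
# prints: Time is <b>no</b>
print abstract_closed(q{See <a href="a>b" title='x>y'>their poll</a> now}, 9), "\n";
# prints: See <a href="a>b" title='x>y'>their</a>
```

Because only text runs are counted and a tag is always copied whole, the result cannot end in the middle of a tag; the stack of open element names is what lets the trailing </b> or </a> be supplied.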

Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re^2: Extracting a substring of N chars ignoring embedded HTML
by Aristotle (Chancellor) on Jan 12, 2003 at 15:03 UTC
    It leaves cut-off tags at the end of the output. I wouldn't want to print that verbatim. The parser solutions have no such problem.

    Makeshifts last the longest.