This probably won't win any prizes, as it uses no modules and parses HTML with regexes, but it seems to handle most of what I've thrown at it, including both kinds of quoted attributes, even when they contain embedded '>' chars.

#! perl -slw
use vars qw[$required];
use strict;

sub abstract (\$$) {
    my ($data, $req) = @_;
    my ($s, $p) = (0, 0);
    while ($p < $req) {
        $$data =~ /[^<]{1,${\($req - $p)}}/gc;
        last if ($p += pos($$data) - $s) >= $req;
        my ($q) = $$data =~ /\G[^"'>]+(.)/gc; #!"
        $$data =~ /\G[^$q]+/gc if $q =~ /["']/; #!"
        $$data =~ /\G[^>]+/gc;
        $s = ++pos($$data);
    }
    $$data =~ /\G[\w]+/gc;
    return substr( $$data, 0, pos($$data) );
}

my $data = join '', <DATA>;
$data =~ tr/\n/ /d;
print abstract( $data, $required||40 );

__DATA__
A story on GettingIt about
<a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking polls</a>.
Contrary to what the article says, Time is <b>not</b> checking for multiple votes on
<a href="javascript:document.timedigital.submit();">their
poll</a>. And I'm happy to report that despite the fact that
my cheater scripts aren't running, I'm still beating Bill Gates.
Some <b>more</b> test data.
<table name='a probably invalid> name' width='%80'>
<TR align=right><TH>Things</TH><TH>Values</TH></TR>
<TR align=center><TD>A thing</TD><TD>A value</TD></TR>
</table>

Some sample output:

c:\test>226146.pl -required=225
A story on GettingIt about<a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking polls</a>.Contrary to what the article says, Time is <b>not</b> checking for multiple votes on<a href="javascript:document.timedigital.submit();">theirpoll</a>. And I'm happy to report that despite the fact thatmy cheater scripts aren't running, I'm still beating

c:\test>226146.pl -required=30
A story on GettingIt about<a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking

c:\test>226146.pl -required=30
A story on GettingIt about <a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking

c:\test>226146.pl -required=225
A story on GettingIt about <a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking polls</a>. Contrary to what the article says, Time is <b>not</b> checking for multiple votes on <a href="javascript:document.timedigital.submit();">their poll</a>. And I'm happy to report that despite the fact that my cheater scripts aren't running, I'm still

c:\test>226146.pl -required=300
A story on GettingIt about <a href="http://ss.gettingit.com/cgi-bin/gx.cgi/AppLogic+FTContentServer?GXHC_gx_session_id_FutureTenseContentServer=7f12a816fa48a5b9&pagename=FutureTense/Demos/GI/Templates/Article_View&parm1=A1545-1999Oct12&topframe=true">hacking polls</a>. Contrary to what the article says, Time is <b>not</b> checking for multiple votes on <a href="javascript:document.timedigital.submit();">their poll</a>. And I'm happy to report that despite the fact that my cheater scripts aren't running, I'm still beating Bill Gates. Some <b>more</b> test data. <table name='a <probably invalid> name' width='%80'> <TR align=right><TH>Things</TH><TH
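
The -required=300 run above stops part-way through a tag (<TH), which suggests that back-to-back tags such as </TH><TH> can end up counted as text. What follows is a minimal sketch of one possible variant, assuming that behaviour is unwanted. It is not the code from this post: the name abstract_anchored is made up for illustration, it takes the string by value rather than by reference, and it anchors every match with \G so that a tag is either skipped whole or the scan stops in front of it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): return the shortest
# prefix of $html holding at least $want characters of text outside tags,
# extended to the end of the current word.
sub abstract_anchored {
    my ($html, $want) = @_;
    my $seen = 0;                       # text characters counted so far
    pos($html) = 0;
    while ($seen < $want) {
        if ($html =~ /\G([^<]+)/gc) {   # a run of plain text
            my $take = length $1;
            my $left = $want - $seen;
            if ($take > $left) {        # took too much: back up
                pos($html) -= $take - $left;
                $take = $left;
            }
            $seen += $take;
        }
        elsif ($html =~ /\G<(?>[^>"']+|"[^"]*"|'[^']*')*>/gc) {
            # one complete tag skipped; a '>' inside quotes does not end it
        }
        else {
            last;                       # end of input or unterminated tag
        }
    }
    $html =~ /\G\w+/gc;                 # finish the current word, if any
    return substr $html, 0, pos($html);
}

print abstract_anchored('<TR><TH>Things</TH><TH>Values</TH></TR>', 8), "\n";
# prints: <TR><TH>Things</TH><TH>Values
```

The design choice here is that only the [^<]+ branch adds to the count, so tags never use up any of the requested length. The result can still end with unclosed elements, as the original does.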

Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.


In reply to Re: Extracting a substring of N chars ignoring embedded HTML by BrowserUk
in thread Extracting a substring of N chars ignoring embedded HTML by FamousLongAgo
