I should have indicated in my previous post that the sample code is rather fragile, and depends heavily on how the website developer chooses to lay out his/her HTML.

As Ea says below, HTML::TokeParser is a much better choice for robust processing.
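For comparison, here is a minimal sketch of what a token-based approach might look like. It assumes HTML::TokeParser (from the HTML-Parser distribution on CPAN) is installed; the helper name extract_text and the sample HTML string are made up for illustration, and are not from the original thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the trimmed, non-empty text chunks found in an HTML string.
# extract_text() is a hypothetical helper name used only for this sketch.
sub extract_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @chunks;
    while (my $token = $parser->get_token) {
        next unless $token->[0] eq 'T';    # 'T' marks a text token
        my $text = $token->[1];
        $text =~ s/^\s+|\s+$//g;           # trim surrounding whitespace
        push @chunks, $text if length $text;
    }
    return @chunks;
}

# Sample input (an assumption, loosely based on the examples in this post)
print "$_\n" for extract_text('<h2>Some text</h2><p>text1 <br> text2</p>');
```

Because the parser works on tokens rather than on lines, it should not matter how the tags happen to be spread across lines in the source file.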

Having said that - to answer your questions: (I'm assuming that your lines 20,21 are these)

    m|^\s*<[^/>]+>(.+)</| and $_=$1; # Zap tags on both sides, if any
    # The line above looks for text enclosed in html tokens, and extracts the text.
    # Eg: applying the regex to : "<h2>Some text</h2>" places "Some text" into "$1", which is then copied into "$_"
    s|<[^>]+>||g; # Zap single </onetag> tags
    # The line above handles left-over single tags:
    # Eg: it zaps "<sometag/>" from "text1 <sometag/> text2"
    # Actually, it is rather crude, and does not care about tag termination, or matching.

In order to format the text better, you need to collect it into a scalar. Instead of "print", collect it using:

    $collected_text .= $_;

Of course, you should declare $collected_text outside the loop.
Then, after the loop, you will need to parse and clean $collected_text, before printing it.
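Putting those steps together, a rough sketch might look like the following. The sub name collect_text, the sample lines, and the choice of "cleaning" (collapsing whitespace and trimming the ends) are my assumptions; your own clean-up step may need to do more. I also append a newline after each piece so that text from adjacent lines does not run together.

```perl
use strict;
use warnings;

# Strip tags line by line, collect the text, then clean it up once at the end.
# collect_text() is a hypothetical name used only for this sketch.
sub collect_text {
    my @lines = @_;              # copy, so the caller's data is not modified
    my $collected_text = '';     # declared outside the loop
    for (@lines) {
        m|^\s*<[^/>]+>(.+)</| and $_ = $1;   # keep text between a tag pair
        s|<[^>]+>||g;                         # drop any remaining tags
        $collected_text .= "$_\n";            # collect instead of printing
    }
    $collected_text =~ s/\s+/ /g;             # collapse runs of whitespace
    $collected_text =~ s/^ | $//g;            # trim the ends
    return $collected_text;
}

# Sample lines standing in for the lines read from the HTML file
print collect_text(
    "<h2>Some text</h2>\n",
    "text1 <sometag/> text2\n",
    "   <p>More   text</p>\n",
), "\n";
```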

             I hope life isn't a big joke, because I don't get it.
                   -SNL


In reply to Re^5: How to output the words that you want that came thru an html file? by NetWallah
in thread How to output the words that you want that came thru an html file? by stone_ice
