Re^5: How to output the words that you want that came thru an html file?

I should have indicated in my previous post that the sample code is rather fragile, and very dependent on the way the website developer chooses to store his/her HTML.

As Ea says below, HTML::TokeParser is a much better choice for robust processing.
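For comparison, here is a minimal sketch of the HTML::TokeParser approach. It assumes the HTML::TokeParser module (part of the HTML-Parser distribution on CPAN) is installed; the sample markup in $html and the variable names are made up for illustration and are not from the original script.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup; in practice this would be the fetched page.
my $html = '<h2>Some text</h2><p>text1 <br/> text2</p>';

# Passing a reference to a scalar makes the parser read from that string.
my $parser = HTML::TokeParser->new(\$html)
    or die "Could not create parser";

my $collected_text = '';
while (my $token = $parser->get_token) {
    # Text tokens have the form ["T", $text, $is_data]; skip everything else.
    next unless $token->[0] eq 'T';
    $collected_text .= $token->[1];
}

print "$collected_text\n";
```

Because the parser tokenizes the markup itself, this does not depend on how the tags happen to be laid out across lines.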

Having said that, to answer your questions (I'm assuming that your lines 20 and 21 are these):

  m|^\s*<[^/>]+>(.+)</| and $_=$1; # Zap tags on both sides, if any
  # The line above looks for text enclosed in html tokens, and extracts the text.
  # Eg: applying the regex to :  "<h2>Some text</h2>" places "Some text" into "$1", which is then copied into "$_"


  s|<[^>]+>||g;           # Zap single </onetag> tags
  # The line above handles left-over single tags:
  # Eg: it zaps "<sometag/>" from  "text1 <sometag/> text2"
  # Actually, it is rather crude, and does not care about tag termination, or matching.

In order to format the text better, you need to collect it into a scalar. Instead of "print", collect it using:

  $collected_text .= $_;

Of course, you should declare $collected_text outside the loop.
Then, after the loop, you will need to parse and clean $collected_text, before printing it.
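Put together, the collect-then-clean idea might look roughly like the sketch below. The @lines array is a hypothetical stand-in for however your script reads the HTML, the space appended after each fragment is my own addition so that adjacent fragments do not run together, and the whitespace cleanup at the end is just one possible way to "parse and clean".

```perl
use strict;
use warnings;

# Hypothetical sample input standing in for the lines of the HTML file.
my @lines = (
    "<h2>Some text</h2>\n",
    "text1 <sometag/> text2\n",
);

my $collected_text = '';    # declared outside the loop

for (@lines) {
    m|^\s*<[^/>]+>(.+)</| and $_ = $1;   # Zap tags on both sides, if any
    s|<[^>]+>||g;                        # Zap left-over single tags
    $collected_text .= "$_ ";            # collect instead of print
}

# After the loop: clean up before printing (assumed cleanup steps).
$collected_text =~ s/\s+/ /g;            # collapse runs of whitespace
$collected_text =~ s/^\s+|\s+$//g;       # trim both ends

print "$collected_text\n";
```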

I hope life isn't a big joke, because I don't get it.
-SNL
