PerlMonks  

Re^2: Getting the text of the html document

by CountZero (Bishop)
on Jun 20, 2005 at 13:12 UTC ( [id://468322]=note )


in reply to Re: Getting the text of the html document
in thread Getting the text of the html page

The only reliable way to deal with HTML (or other mark-up languages) is to parse it. A "simple" regex solution is not guaranteed to work in all cases.

CountZero

"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Replies are listed 'Best First'.
Re^3: Getting the text of the html document
by dyer85 (Acolyte) on Jul 19, 2005 at 09:26 UTC

    That's a good point. My little regexes there don't convert every single entity, but they strip EVERY tag and convert the <'s, >'s, quotes, and ampersands. Not much else would be left behind, honestly.

    Regardless, bradcathey seems to have a very nice solution, which is much faster than the regexes anyway.

      What would your regex do with a tag like this:

      <img src="next.gif" alt="-->" />

      Honestly, it's best to use a real parser.
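
To make the failure concrete, here is a minimal sketch comparing a naive tag-stripping regex with a real parser on that exact tag. It is written in Python only because its standard library ships an HTML parser; in Perl, a module such as HTML::Parser plays the same role. The regex shown is an assumed stand-in for "strip every tag", not the regexes posted earlier in the thread, and the names `strip_tags_naive`, `strip_tags_parsed`, and `_TextCollector` are made up for illustration.

```python
import re
from html.parser import HTMLParser


def strip_tags_naive(html):
    """Regex approach: delete everything from '<' up to the next '>'."""
    return re.sub(r"<[^>]*>", "", html)


class _TextCollector(HTMLParser):
    """Collects only text nodes; tags and their attributes are ignored."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as
        # &lt; and &amp; before handing text to handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_parsed(html):
    """Parser approach: keep text nodes, then collapse runs of whitespace."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


sample = 'before <img src="next.gif" alt="-->" /> after'

# The regex stops at the '>' inside the alt attribute and leaves debris.
print(repr(strip_tags_naive(sample)))   # 'before " /> after'

# The parser knows the '>' is inside a quoted attribute value.
print(repr(strip_tags_parsed(sample)))  # 'before after'
```

The regex has no notion of quoted attribute values, so it treats the `>` in `alt="-->"` as the end of the tag. The parser tokenizes attributes first, which is why only the surrounding text survives.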

      --
      <http://www.dave.org.uk>

      "The first rule of Perl club is you do not talk about Perl club."
      -- Chip Salzenberg
