PerlMonks  

html analysis tool via regex

by stabu (Scribe)
on Oct 13, 2005 at 08:12 UTC ( [id://499805]=perlquestion )

stabu has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm using Perl on Win32, and every now and again I have to extract information from an HTML page. I use regexes to tell Perl precisely what I want pulled out, but this requires close study of the HTML source and much trial and error on the regexes themselves. What tools do you use for analysing HTML at such depth? I started with Word (ha!), then Notepad, then EditPlus, and now Vim. All give a good view of the HTML source, but each one's regex flavour differs from Perl's, so the trial-and-error factor is still very high. Does anybody have any suggestions?

Thanks in advance for answers.

Replies are listed 'Best First'.
Re: html analysis tool via regex
by davorg (Chancellor) on Oct 13, 2005 at 09:12 UTC
    I use regex to precisely tell perl what I want pulled out.

    You really don't want to do that. Regular expressions will potentially break on all but the simplest HTML pages. If you want to parse HTML then you should use HTML::Parser or one of its subclasses. Personally I usually use HTML::TreeBuilder.
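    A minimal sketch of the HTML::TreeBuilder approach, assuming a small table of records. The sample markup and the class names "name" and "price" are made up for illustration and are not taken from the original poster's pages.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input: tool-generated rows of database records.
my $html = <<'HTML';
<html><body>
<table>
<tr><td class="name">Widget</td><td class="price">9.99</td></tr>
<tr><td class="name">Gadget</td><td class="price">19.50</td></tr>
</table>
</body></html>
HTML

# Parse the whole document into a tree of HTML::Element objects.
my $tree = HTML::TreeBuilder->new_from_content($html);

my @rows;
for my $tr ($tree->look_down(_tag => 'tr')) {
    # In scalar context look_down returns the first matching element.
    my $name  = $tr->look_down(_tag => 'td', class => 'name');
    my $price = $tr->look_down(_tag => 'td', class => 'price');
    next unless $name && $price;    # skip rows that lack either cell
    push @rows, [ $name->as_trimmed_text, $price->as_trimmed_text ];
}

# Release the tree explicitly once the data has been copied out.
$tree->delete;

print join("\t", @$_), "\n" for @rows;
```

    Selecting by tag and attribute rather than by character position means extra whitespace or reordered attributes in the generated markup should not affect the result.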

    --
    <http://dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: html analysis tool via regex
by marto (Cardinal) on Oct 13, 2005 at 08:24 UTC
    Hi stabu,

    I am not 100% sure what you are trying to achieve, however you may want to check out the WWW::Mechanize and HTML::TokeParser modules. They may suit your requirements of getting HTML pages and extracting information.
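    As a rough sketch of the HTML::TokeParser side, the following pulls each link's target and text out of an in-memory string. The sample markup is invented for the example. Fetching a live page with WWW::Mechanize is left out so that the snippet runs without a network connection.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input.
my $html = '<p>See <a href="/one.html">first</a> and '
         . '<a href="/two.html">second <b>page</b></a>.</p>';

# Passing a scalar reference parses the string instead of opening a file.
my $p = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @links;
# get_tag('a') skips ahead to the next <a> start tag,
# or returns undef once the input is exhausted.
while (my $token = $p->get_tag('a')) {
    my $href = $token->[1]{href};           # element 1 is the attribute hash
    next unless defined $href;
    my $text = $p->get_trimmed_text('/a');  # text up to the closing </a>
    push @links, [ $href, $text ];
}

printf "%s => %s\n", @$_ for @links;
```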

    Hope this helps.

    Martin
      Yes, I think that regexps are only for the simplest digs into HTML or XML. Of course, you can write a very sophisticated regexp, but the result is, IMHO, hard to read and more painful to work with.
      So I suggest an HTML parser, especially if you _really_ cannot get a better data source than HTML.
Re: html analysis tool via regex
by GrandFather (Saint) on Oct 13, 2005 at 09:22 UTC

    The tool I use most for that sort of work is HTML::TreeBuilder, which is not the question you asked, but is very likely the answer you seek. Using regexen is not, let me stress it, Is Not the way to extract information from HTML.
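    For the analysis part of the question, one option is to let HTML::TreeBuilder print the structure it parsed. This is a small sketch with an invented fragment; it uses the dump method that tree elements inherit from HTML::Element, and captures the output in a string through an in-memory filehandle so it can be searched or saved.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical fragment standing in for a messy generated page.
my $html = '<div id="rec"><span>Widget</span><br>9.99</div>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# dump writes an indented outline of elements and text nodes.
# With no argument it prints to STDOUT; here it goes to a string.
my $outline = '';
open my $fh, '>', \$outline or die "Cannot open in-memory handle: $!";
$tree->dump($fh);
close $fh;
$tree->delete;

print $outline;
```

    Reading the outline shows how the parser nested the elements, which is usually quicker than reading the raw source in an editor.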

    One of the best tools you have for previewing the HTML itself is probably the browser that you are using. Most browsers allow you to view the HTML source: in Firefox use Ctrl-U, and in IE use the menu entry View|Source.


    Perl is Huffman encoded by design.
Re: html analysis tool via regex
by saintmike (Vicar) on Oct 13, 2005 at 08:17 UTC
    Try XML::XSH for an interactive shell that lets you navigate through the nodes of XML and HTML documents.
Re: html analysis tool via regex
by jbrugger (Parson) on Oct 13, 2005 at 08:19 UTC
    I'm not sure what you are trying to get from your pages (text, links, images), so I suggest you do a search for html on CPAN.

    "We all agree on the necessity of compromise. We just can't agree on when it's necessary to compromise." - Larry Wall.
Re: html analysis tool via regex
by stabu (Scribe) on Oct 13, 2005 at 10:01 UTC
    Thanks so much for all your answers.
    OK, I'll look into those modules, although the learning curve might be too much for me at this late stage.
    As a side comment: yes, regexes snap very easily, but the HTML is exceedingly horrible. It's a series of prettified database entries rendered into HTML by one of those automated tools. Speaking plainly, it's as close to a pool of vomit as HTML can ever get.
    Thank god I only have to do it once in a while!

Node Type: perlquestion [id://499805]
Approved by Corion