xmlschema has asked for the wisdom of the Perl Monks concerning the following question:

Not sure where to ask this but... I need a script that will parse a specific page and output certain elements (article title, blurb, etc), format them, maybe dump into a small DB, and then post to my Web site. Can anyone help? I can be reached via MSN Messenger (fadeq at hotmail dot com) if anyone is interested.

Replies are listed 'Best First'.
Re: Need Help With Parser
by Joost (Canon) on Aug 01, 2005 at 20:40 UTC
    As noted above, we can't really help you unless you specify what kind of page you're talking about (HTML, PDF, scanned/OCR-processed; from a specific site or publication, or in general; etc.). It would also help if you could explain which parts of the problem you need help with. If you have some code but are stuck, posting that code could help.

    Please also note that the chances of people helping you will increase significantly if you can be bothered to read the replies here instead of asking people to contact you in private. Most regular monks regard the site as a public information forum, and replying in private won't help others who might have a similar question.

    Please read how (not) to ask a question; it will probably help you get worthwhile answers and save everybody some time in the future.

Re: Need Help With Parser
by gellyfish (Monsignor) on Aug 01, 2005 at 20:30 UTC

    You will need to look at HTML::Parser to start with, but as you haven't given any indication of what you have tried or the nature of the 'page' you are trying to parse it's difficult to be more specific.
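    For illustration only, here is a minimal sketch of the HTML::Parser approach. The sample markup is entirely hypothetical (the real page was never posted), so the `h2 class="title"` and `p class="blurb"` conventions are assumptions that would have to be adapted to the actual page.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample input; the real page layout is unknown.
my $html = <<'HTML';
<html><body>
<h2 class="title">First article</h2>
<p class="blurb">A short summary.</p>
<h2 class="title">Second article</h2>
<p class="blurb">Another summary.</p>
</body></html>
HTML

my @articles;   # one hash per article: { title => ..., blurb => ... }
my $field;      # which field we are currently collecting text for, if any

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start tag: decide whether we are entering a title or a blurb.
    start_h => [ sub {
        my ($tag, $attr) = @_;
        my $class = $attr->{class} // '';
        if ($tag eq 'h2' && $class eq 'title') {
            push @articles, { title => '', blurb => '' };
            $field = 'title';
        }
        elsif ($tag eq 'p' && $class eq 'blurb' && @articles) {
            $field = 'blurb';
        }
    }, 'tagname, attr' ],
    # Text: append decoded text to the field being collected.
    text_h => [ sub {
        my ($text) = @_;
        $articles[-1]{$field} .= $text if defined $field;
    }, 'dtext' ],
    # End tag: stop collecting when the h2 or p closes.
    end_h => [ sub {
        my ($tag) = @_;
        undef $field if $tag eq 'h2' || $tag eq 'p';
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

printf "%s: %s\n", $_->{title}, $_->{blurb} for @articles;
```

    The handlers only track state; formatting the results or storing them in a database would be a separate step once `@articles` is filled.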

    /J\

Re: Need Help With Parser
by BaldPenguin (Friar) on Aug 01, 2005 at 20:51 UTC
    Not to repeat too much of the above, but there are many answers to this question. One approach that works if the page you are pulling from is reliably valid HTML is to pull the page down and use XML::Twig to parse the whole page, or specific nodes within it using twig_roots. I just completed some work like this. The only snag is that if the page you pull is not valid HTML, the parser could choke; I solved the majority of these errors by running the page through Tidy first.
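    As a rough sketch of the twig_roots idea (not the code from the project mentioned above): XML::Twig is an XML parser, so this assumes the page is already well-formed XHTML, for example after a Tidy pass. The sample markup and the `div class="article"` convention are hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical, already well-formed sample input.
my $xhtml = <<'XHTML';
<html><body>
<div class="nav">skip me</div>
<div class="article"><h2>First article</h2><p>A short summary.</p></div>
<div class="article"><h2>Second article</h2><p>Another summary.</p></div>
</body></html>
XHTML

my @articles;

my $twig = XML::Twig->new(
    # Only build subtrees for <div> elements; the rest of the page is skipped.
    twig_roots => {
        div => sub {
            my ($t, $div) = @_;
            # Keep only the divs marked as articles (assumed class name).
            if (($div->att('class') // '') eq 'article') {
                push @articles, {
                    title => $div->first_child_text('h2'),
                    blurb => $div->first_child_text('p'),
                };
            }
            $t->purge;   # free memory for what has been processed so far
        },
    },
);

$twig->parse($xhtml);

printf "%s: %s\n", $_->{title}, $_->{blurb} for @articles;
```

    Because only the listed roots are built into trees, this stays light even on large pages, but it will die on markup that is not well-formed, which is where the Tidy step comes in.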

    Don
    WHITEPAGES.COM | INC
    Everything I've learned in life can be summed up in a small perl script!
Re: Need Help With Parser
by xmlschema (Initiate) on Aug 01, 2005 at 21:36 UTC
    It's an HTML page. Sorry I'm new to the site. I am not a programmer so maybe I can offer a few dollars to anyone interested?

      Most of the people here will fall all over themselves in an attempt to help you, but not because you offer to pay them. What really incites cooperation on PerlMonks is a detailed problem description complete with sample input; expected output; and examples of code you have already tried - why those failed, or how you wish they could be made to perform better.

      In your case:

      1. What is the specific page you want to parse? Can you at least provide a mock-up with fake data?

      2. When you say you want "article title", are you looking for what's between two HTML tags? Is there some other way of designating what constitutes a "blurb"?

      3. What DB are you using? What format do you want the output to be in?

      4. What have you tried so far? Which packages/modules are you looking at? If you can't show actual code, can you provide pseudo-code? (PerlMonks are "pathologically helpful," but they do like to see that you've made an effort to at least begin solving your problem by yourself.)

      5. If you have a good start, but it doesn't do quite what you'd like, can you tell us how you'd like to see it improved?

      HTH,

      planetscape