shu has asked for the wisdom of the Perl Monks concerning the following question:

Hi all

Since everyone here is a Perl expert and I'm a total newbie, I would be very grateful if someone could help me out with my questions.

I am working on a project to develop a student-professor system, including databases. To start off, I need lots of professor data from the websites of various educational institutions (to populate my database). To extract this data and get started, I decided to use Perl, since its text extraction capabilities are well known.

The problem is that these sites all have totally different HTML formats and structures, and differ in how the information on each professor is listed, so I can't come up with generic Perl code to extract this data and put it in text files on my local hard disk. I think I'll need to use regexes and pattern matching to do the task, but I'm not sure how to go about it. I wrote one script that works with http://www.ntu.edu.sg/sce/staffacad.asp, but it is way too specific and doesn't work with any other staff site. I need to do the following:

  1. Visit the base site of any institute and extract professor information, which includes name, email, degree, research interests and publications.
  2. For publications, the listing appears either via a link on the professor's homepage or as a chunk of data under a heading such as "PUBLICATIONS". I think I can get the data if it's via a link, but I don't know how to extract that exact chunk from the middle of a page.
  3. All this information should be extracted to external text files.

I can manage if someone just helps me with snippets of code to get started with the extraction: accurate extraction of information from any random institution's site that has professors listed.

For example some sites are:

I'd greatly appreciate any help in any direction. I'm totally lost here. Please feel free to ask if you have any doubts regarding my question!

shuchi

Edited by BazB: added formatting.

Re: Regex/Pattern Matching
by bart (Canon) on Jan 08, 2004 at 23:17 UTC
    It reminds me of a little project I did for personal reasons: parsing online TV schedules and turning them into a single, simple format for all. I'll describe the basic skeleton here, leaving the details for you to fill in.

    Basically, for each style of website with programme listings, I have a different class (package). I placed them under the "Channel" hierarchy, so the methods to parse the BBC schedule listings, for example, are in the package Channel::BBC. Each "style" of HTML pages has its own package.

    Each of these Channel::* packages has a few class methods: one is "parse", for extracting the schedule; another is "date", to be able to check the date on the page... You get the idea. The important part is that the API is identical across all these packages.

    Now, it's possible to do a generic call to parsing a page from the BBC, like this:

        my $class = 'BBC';
        my @programme = "Channel::$class"->parse($html);

    As you can see: there's one statement to do the parsing, and it can be the exact same statement, class being a variable (parameter), irrespective of which style of page it is. Actually, the parameters (including the $class) are passed in a big Array Of Hashes, and one loop just fetches, checks, and parses all the different HTML pages, and eventually builds a new HTML file for each from the result.

    Do note that the above snippet works under strict, even though it actually is a symbolic reference.
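
    A minimal sketch of the skeleton described above, under stated assumptions: the Channel::ITV package, the placeholder regexes inside each parse method, and the sample HTML strings are all made up for illustration and are not the code from the original project. It shows one loop driving several site-specific classes through the same parse call, with the class name arriving as the first argument of each class method.

```perl
use strict;
use warnings;

# Hypothetical site-specific parsers. Each exposes the same parse() class method.
package Channel::BBC;
sub parse {
    my ($class, $html) = @_;    # first argument is the package name
    # Placeholder logic: a real parser would walk this site's HTML properly.
    return map { "BBC: $_" } $html =~ /<li>([^<]*)<\/li>/g;
}

package Channel::ITV;
sub parse {
    my ($class, $html) = @_;
    return map { "ITV: $_" } $html =~ /<td>([^<]*)<\/td>/g;
}

package main;

# One record per page: which class handles it, and the HTML to feed it.
my @pages = (
    { class => 'BBC', html => '<ul><li>News</li><li>Weather</li></ul>' },
    { class => 'ITV', html => '<table><tr><td>Film</td></tr></table>' },
);

my @programme;
for my $page (@pages) {
    my $package = "Channel::$page->{class}";
    # Same statement for every style of page; only the class name varies.
    push @programme, $package->parse($page->{html});
}

print "$_\n" for @programme;
```

    Adding support for another site then means writing one more package with a parse method and adding one more record to @pages.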

    Now, how can you parse an HTML file? The way I'd do it now is with HTML::TokeParser::Simple. Just look out for a specific tag, e.g. "form" or "table", then maybe one more, etc., and then finally grab the data you need. Don't worry about other styles of pages; you just have to be able to process this style of page.

    Do remember that the first parameter to the methods, like "parse", will be the package name, so don't forget to drop it.

      Hi Bart,

      Thanks for your suggestions. I'm very new to Perl, so I'm not even sure how to use the modules properly, let alone create my own classes. Anyway, I'll see what I can do, though I didn't quite understand the concept of the BBC class.

      As for grabbing the data, I did use HTML::TokeParser::Simple, LWP::UserAgent and HTML::Parser, but once I get the HTML into an external file on my hard disk I don't know how to search for the headings and grab only the data under a particular heading. The data is like HEADING 1 <data> <data> PUBLICATION <data> <data> <data> RESEARCH <data> and so on, and I need to grab the data under PUBLICATION. The heading may differ from page to page. I was trying to match 'pub', but there are other headings that might also contain those letters!

      It would be great if you could help me with a code snippet. Thanks a lot.

      Shuchi
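
      One possible way to approach the question above, sketched with HTML::TokeParser::Simple (a CPAN module, so it has to be installed). The assumptions are loud ones: section titles are marked up as <h1> to <h6> tags, and the sub name section_text, the sample page and the heading pattern are hypothetical. Pages that use bold text or table cells as headings would need a different test. The pattern is matched only against heading text and is anchored at its start, so a stray "pub" elsewhere on the page does not trigger it.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Return the text chunks between a heading matching $wanted and the next heading.
# Assumption: section titles are marked up as <h1>..<h6>.
sub section_text {
    my ($html, $wanted) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);

    my ($in_heading, $in_section, $heading, $heading_tag) = (0, 0, '', '');
    my @chunks;
    while (my $token = $parser->get_token) {
        if ($token->is_start_tag && $token->get_tag =~ /^h[1-6]$/) {
            last if $in_section;              # the next heading ends the section
            ($in_heading, $heading) = (1, '');
            $heading_tag = $token->get_tag;
        }
        elsif ($in_heading && $token->is_end_tag($heading_tag)) {
            $in_heading = 0;
            $in_section = 1 if $heading =~ $wanted;
        }
        elsif ($token->is_text) {
            my $text = $token->as_is;         # raw text, entities not decoded
            if ($in_heading) {
                $heading .= $text;
            }
            elsif ($in_section && $text =~ /\S/) {
                $text =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
                push @chunks, $text;
            }
        }
    }
    return @chunks;
}

# Made-up sample page for illustration only.
my $page = join '',
    '<h2>Research</h2><p>Databases</p>',
    '<h2>Selected Publications</h2><ul><li>Paper A</li><li>Paper B</li></ul>',
    '<h2>Teaching</h2><p>A public course</p>';

my @publications = section_text($page, qr/^\s*(?:selected\s+)?publications?\b/i);
print "$_\n" for @publications;
```

      Each site will probably need its own heading pattern, which fits the one-class-per-site layout suggested above.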
Re: Regex/Pattern Matching
by ysth (Canon) on Jan 08, 2004 at 18:38 UTC
    I don't think you'll be able to come up with a general way to do this. You'll have to repeat the work you did for www.ntu.edu.sg/sce/staffacad.asp for each site, creating a sub for each site that extracts and returns all the information.
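
    A rough sketch of the sub-per-site idea, with everything in it hypothetical: the sub names extract_ntu and extract_staff, the %extractor_for table, and the placeholder pattern (which is not taken from the real markup of that site). A hash keyed by host name picks the sub for each URL.

```perl
use strict;
use warnings;

# Hypothetical extractor: one sub per site, each returning a list of hashrefs.
sub extract_ntu {
    my ($html) = @_;
    # Placeholder pattern; the real one depends on that site's markup.
    return map { +{ name => $_ } } $html =~ /<td class="name">([^<]+)<\/td>/g;
}

# Dispatch table: host name => sub that knows that site's layout.
my %extractor_for = (
    'www.ntu.edu.sg' => \&extract_ntu,
);

sub extract_staff {
    my ($url, $html) = @_;
    my ($host) = $url =~ m{^https?://([^/:]+)}i
        or die "Cannot find a host name in '$url'\n";
    my $extractor = $extractor_for{ lc $host }
        or die "No extractor written yet for $host\n";
    return $extractor->($html);
}

# Made-up HTML fragment for illustration only.
my @staff = extract_staff(
    'http://www.ntu.edu.sg/sce/staffacad.asp',
    '<table><tr><td class="name">Example Name</td></tr></table>',
);
print "$_->{name}\n" for @staff;
```

    An unknown host dies with a clear message, which makes it obvious which site still needs its own sub.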

    If your problem with extracting the publication text is stripping out the HTML from the text, look at HTML::Parser.

    Best of luck to you.

Re: Regex/Pattern Matching
by TomDLux (Vicar) on Jan 08, 2004 at 23:27 UTC

    Since every university uses its own data layout, you will need an object-oriented approach with a separate subclass for each university.

    A cheaper solution is to provide a research assistant ( aka undergrad suffering beer withdrawal symptoms ) with a pencil and a pad of paper.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA