How do I extract text from an HTML page?

jamiel has asked for the wisdom of the Perl Monks concerning the following question:

Imagine that I have a document named finalgrades.html, located at http://somedomain.com/finalgrades.html, and it contains the following text:

<html>
<head>
<title></title>
</head>
<body>
<b>Guy 1</b>, grade: 100<br>
<b>Guy 2</b>, grade: 70<br>
<b>Guy 3</b>, grade: 98<br>
</body>
</html>

I want a Perl script (to run as a CGI) which takes the value after "grade:" and before "<br>" on the line containing a name that I choose (for example "Guy 1"), and prints it to the screen when run as a CGI on my webpage, something like this:

<html>
<head>
<title>CONGRATULATIONS TO THE PARTICIPANT Guy 1</title>
</head>
<body>
<h1>Guy 1</h1> Has just got a "100" as a 
grade for this Seminar. Congratulations!!!<br>
</body>
</html>

edited: Sun Aug 3 15:19:23 2003 by jeffa - formatting, linkafied link
edited: Mon Mar 13 10:25:23 2006 by jamiel - formatting, spelling, etc...


Replies are listed 'Best First'.
Re: How do I extract text from an HTML page? by bobn (Chaplain) on Aug 03, 2003 at 05:22 UTC
Use the CGI.pm module to generate your HTML. There are many modules to parse HTML. HTML::TokeParser::Simple looks promising, but there are others. You can check these and others at http://search.cpan.org

Update: fixed links.

--Bob Niederman, http://bob-n.com
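
To make the parser suggestion concrete, here is a minimal sketch. It uses HTML::TokeParser, the module that HTML::TokeParser::Simple subclasses, rather than the Simple variant named above, and it assumes the page has already been fetched into a string. The helper name grades_from_html and the hard-coded name 'Guy 1' are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not from the thread): map each name to its grade.
sub grades_from_html {
    my ($html) = @_;
    # A scalar reference tells the parser to treat the string as the document.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Can't set up the HTML parser";
    my %grade_for;
    # Each record starts with a <b> tag that wraps the name.
    while ($parser->get_tag('b')) {
        my $name = $parser->get_trimmed_text('/b');   # text up to </b>
        my $rest = $parser->get_trimmed_text('br');   # text up to <br>
        if ($rest =~ /grade:\s*(\d+)/) {
            $grade_for{$name} = $1;
        }
    }
    return \%grade_for;
}

# The sample page from the question, inlined so the sketch is self-contained.
my $html = <<'END_HTML';
<html>
<head>
<title></title>
</head>
<body>
<b>Guy 1</b>, grade: 100<br>
<b>Guy 2</b>, grade: 70<br>
<b>Guy 3</b>, grade: 98<br>
</body>
</html>
END_HTML

my $grades = grades_from_html($html);
my $name   = 'Guy 1';   # assumption: a real CGI would take this from the request
print qq{$name has just got a "$grades->{$name}" as a grade.\n};
```

Because the parser works on tokens rather than on line layout, this should keep working if the page is reflowed or the tags change case, which is the main argument for a parser over a hand-written regex.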
Re: How do I extract text from an HTML page? by CountZero (Bishop) on Aug 03, 2003 at 16:06 UTC
Well, whatever you do, the only way not to go is to regex the HTML-code yourself. This will only work for the most simple and regular of HTML-code and will break before you know it.

Another approach is to go to the source of your data in the first web-page. Assuming that this is based upon some database, can't you go directly to that database and query the data from there?

CountZero

"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: How do I extract text from an HTML page? by ido50 (Scribe) on Aug 03, 2003 at 17:53 UTC
Assuming that the format for every grade is '`<b>NAME</b>, grade: GRADE<br>`' and that there are no lines that begin incidentally with the same format (Well actually you don't have if you use the regex I wrote. At least I think so...), you can use this little snippet that uses regex to create this HTML page (You can easily alter it as you wish to just print it and not create an HTML file):

open (FILE, "+<$path/finalgrades.html") or die "Can't open file: $!";
# where $path is the full path to the directory where the file resides.

while (<FILE>) {
    if (m/^<b>[^<>]</b>, grade: (\d+)<br>$/) {
        my $name = $1;
        my $grade = $2;
        open (HTML, "+>>$path/$name.html") or die "$!";
        print HTML "<html><head><title>CONGRATULATIONS TO THE PARTICIPANT $name</title></head><body><h1>$name</h1> has just received "$grade" as a grade for this seminar. Congratulations!!!<br></body></html>
    }
}
close (FILE) or die "$!";

Note that you may want to use CGI.pm to print the HTML tags instead of printing them directly. Go to http://search.cpan.org/author/LDS/CGI.pm-2.93/CGI.pm for more info on the CGI module.

--------------------------
Live fat, die young
Re: Re: How do I extract text from an HTML page? by ido50 (Scribe) on Aug 03, 2003 at 17:57 UTC
A few corrections:

1. Replace "(Well actually you don't have if you..." with "(Well actually you don't have to worry about it if you...".
2. After the print HTML "bla bla" statement I forgot to include a ";", and you should add after it a "close HTML" statement too.

----------------------
Live fat, die young
Re: Re: Re: How do I extract text from an HTML page? by ido50 (Scribe) on Aug 03, 2003 at 18:00 UTC
Last two corrections (Not my day today):

1. I also forgot a terminating double quote in the print HTML "bla bla" statement.
2. Replace the "`[^<>]`" in the regex with "`([^<>])`".

----------------------
Live fat, die young
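
Putting the corrections in this sub-thread together, the snippet might end up looking like the sketch below. It is one possible reading, not ido50's final code, and it goes beyond the posted corrections in several places:

- The capture is written `([^<>]+)`, since without the `+` it only matches one-character names.
- The regex uses `m{...}` delimiters, because the `/` in `</b>` would otherwise end an `m/.../` pattern early.
- The quotes around the grade are escaped.
- The file handling uses three-argument open with lexical filehandles.
- Each page is opened with '>' so a re-run overwrites it instead of appending.

The sub names page_for_line and write_pages are invented for illustration.

```perl
use strict;
use warnings;

# Build the congratulation page for one line of finalgrades.html.
# Returns (name, page) on a match and an empty list otherwise.
sub page_for_line {
    my ($line) = @_;
    # m{...} avoids clashing with the "/" in </b>; \s*$ tolerates the newline.
    return unless $line =~ m{^<b>([^<>]+)</b>, grade: (\d+)<br>\s*$};
    my ($name, $grade) = ($1, $2);
    my $page =
          "<html><head><title>CONGRATULATIONS TO THE PARTICIPANT $name</title></head>"
        . "<body><h1>$name</h1> has just received \"$grade\" as a grade for this seminar. "
        . "Congratulations!!!<br></body></html>\n";
    return ($name, $page);
}

# Read $in_file and write one "<name>.html" page per matching line into $dir.
sub write_pages {
    my ($in_file, $dir) = @_;
    open my $in, '<', $in_file or die "Can't open $in_file: $!";
    while (my $line = <$in>) {
        # A non-matching line yields an empty list, so skip it.
        my ($name, $page) = page_for_line($line) or next;
        open my $out, '>', "$dir/$name.html"
            or die "Can't write $dir/$name.html: $!";
        print {$out} $page;
        close $out or die "Can't close $dir/$name.html: $!";
    }
    close $in or die "Can't close $in_file: $!";
}
```

Calling write_pages("$path/finalgrades.html", $path) would then correspond to the original loop, with $path being the directory that holds the file. The name is used as a file name unchanged, so this is only reasonable for input you trust.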
Re: How do I extract text from an HTML page? by jeffa (Bishop) on Aug 03, 2003 at 19:02 UTC
