kiat has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I don't seem to have any luck with parsing an html file to get the names of schools. I'm seeking your help to shed some light on how to make the regular expression work. The school names are capitalised and are of the form:

X PRIMARY SCHOOL

where X stands for any number of words before the word PRIMARY.

I've pasted a portion of the file in my scratchpad. I'll be very interested to see how it can be done. I just hope it's not something too easy :)

Update I realised that text pasted in the scratchpad is being parsed and formatted. Hm...not the original file. Let me think of a way.

Update2 I did a view source of the html file from the site and that's what I got. I plunged right into trying to do some regex on it. I found out later that by copying from the webpage and pasting the copied text into notepad, I got rather nicely formatted output. I ran a regex on that and got the school names I wanted (nearly there). Still, I would like to know how it can be done from the raw html file...

<html><head><title>School Directory Services</title></head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <META HTTP-EQUIV="pragma" CONTENT="no-cache"> <META HTTP-EQUIV="Refresh" CONTENT=1200> <body bgcolor="white"> <p> <font face="Arial" Size=4> <b>DIRECTORY OF SCHOOLS 2003 </b></font><br> <font face="Arial" Size=3>(Government/Government-Aided/Independent Sch +ools)</font> <br> <br> <br> <!--------------------------- PRIMARY (G) SCHOOL --------------------- +---------> <!--------------------------- COMMON CODE --------------------------- +-------------> <font face="Arial" Size=3><b>PRIMARY SCHOOLS</b></font><br><br> <font face="Arial" Size=3>Government</font><br> <font face="Arial" Size=2><i></i></font> <br><br> <TABLE cellspacing=0 border=0 cellpadding=2 width="100%"> <TR BgColor=#cccccc> <TD Width="7%"><FONT Size=3><i>SCH<br>CODE,<br>ZONE</i></FONT></TD> <TD Width="32%"><FONT Size=3><i>SCHOOL NAME &<br>ADDRESS</i></FONT></ +TD> <TD Width="45%"><FONT Size=3><i>PRINCIPAL &<br>VICE-PRINCIPAL</i></FO +NT></TD> <TD Width="20%"><FONT Size=3><i>TELEPHONE,<br>FAX NUMBERS &<br>EMAIL +ADDRESS</i></FONT></TD> </TR> <TR><td colspan=4><hr></td></TR> <tr BgColor=white><td valign='top'><Font Size=2>1744<br>North</font></ +td><td valign='top'><Font Size=2>ADMIRALTY PRIMARY SCHOOL + <br> +›小学 + <br>11 WOODLANDS CIRCLE <br>SINGAPORE 7398 +07</font></td><td valign='top'><Font Size=2>P : MDM LIM SOH LIAN + <br>&nbsp;&nbsp;&nbsp;&n +bsp;林素莲 + <br>VP: MR TAN MENG HUI + <br>&nbsp;&nbsp;&nbsp;&nbsp;陈明辉 + </font></t +d><td valign='top'><Font Size=2>Tel: 63620598 <br>Fax: 636 +27512 <br><a style ='text-decoration:none' href='mailto: ADMIRA +LTY_PS@MOE.EDU.SG '>ADMIRALTY_PS@MOE.EDU.SG </a></a></fon +t></td></tr><tr><td colspan=4><hr></td></tr><tr BgColor=#A5C9F9><td v +align='top'><Font Size=2>1738<br>North</font></td><td valign='top'><F +ont Size=2>AHMAD IBRAHIM PRIMARY SCHOOL + <br>依布‹欣小学 + <br>10 YIS +HUN STREET 11 <br>SINGAPORE 768643</font></td><td vali 
+gn='top'><Font Size=2>P : MISS FOONG YIN WEI + <br>&nbsp;&nbsp;&nbsp;&nbsp;冯燕慧 + </font> +</td><td valign='top'><Font Size=2>Tel: 67592906 <br>Fax: +67592927 <br><a style ='text-decoration:none' href='mailto: aip +s@moe.edu.sg '>AIPS@MOE.EDU.SG </a></a></ +font></td></tr><tr><td colspan=4><hr></td></tr><tr BgColor=white><td +valign='top'><Font Size=2>1766<br>North</font></td><td valign='top'>< +Font Size=2>ANDERSON PRIMARY SCHOOL + <br>安徳逊小学 + <br>19 AN +G MO KIO AVE 9 <br>SINGAPORE 569785</font></td><td val +ign='top'><Font Size=2>P : MR CHONG KWAI KUEN + <br>&nbsp;&nbsp;&nbsp;&nbsp;钟桂权 + <br>VP +: MR YAM WAI KUEN < +br>&nbsp;&nbsp;&nbsp;&nbsp;任伟权 + </font></td><td valign='top'><F +ont Size=2>Tel: 64560340 <br>Fax: 65522310 <br><a st +yle ='text-decoration:none' href='mailto: anderson_ps@moe.edu.sg + '>ANDERSON_PS@MOE.EDU.SG </a></a></font></td></tr><tr><td c +olspan=4><hr></td></tr><tr BgColor=#A5C9F9><td valign='top'><Font Siz +e=2>1150<br>North</font></td><td valign='top'><Font Size=2>ANG MO KIO + PRIMARY SCHOOL + <br>茂乔小学 + <br>20 ANG MO KIO AVENUE 3 + <br>SINGAPORE 569920</font></td><td valign='top'><Font Size=2 +>P : MR PATRICK SIH SEAH YANG + <br>&nbsp;&nbsp;&nbsp;&nbsp;›声延 + <br>VP: MISS HUA YEN CHEUNG + <br>&nbsp;&nbsp;&nbsp;& +nbsp;华燕增 + </font></td><td valign='top'><Font Size=2>Tel: 645207 +94 <br>Fax: 64588121 <br><a style ='text-decoration: +none' href='mailto: AMKPS@MOE.EDU.SG '>AMKPS@MOE.EDU.SG + </a></a></font></td></tr><tr><td colspan=4><hr></td></tr +><tr BgColor=white><td valign='top'><Font Size=2>1234<br>South</font> +</td><td valign='top'><Font Size=2>BALESTIER HILL PRIMARY SCHOOL + <br +>博理小学 + <br>565 BALESTIER ROAD <br>SINGAPORE 32 +9927</font></td><td valign='top'><Font Size=2>P : MRS IRENE HO SENG T +UCK <br>&nbsp;&nbsp;&nbsp; +&nbsp;‹玉珠 + </font></td><td valign='top'><Font Size=2>Tel: 63539 +451 <br>Fax: 62546150 <br><a style ='text-decoration +:none' href='mailto: bhps@moe.edu.sg '>BHPS@MOE.EDU.SG + </a></a></font></td></tr><tr><td 
colspan=4><hr></td></t +r><tr BgColor=#A5C9F9><td valign='top'><Font Size=2>1230<br>East</fon +t></td><td valign='top'><Font Size=2>BEDOK GREEN PRIMARY SCHOOL + < +br>育青小学 + <br>1 BEDOK SOUTH AVE 2 <br>SINGAPORE +469317</font></td><td valign='top'><Font Size=2>P : MR LEE YIN HIN + <br>&nbsp;&nbsp;&nbs +p;&nbsp;李恩‹ + <br>VP: MR TONY FOO YAP SENG + <br>&nbsp;&nbsp;&nbsp;&nbsp;符业成 + </font +></td><td valign='top'><Font Size=2>Tel: 64425416 <br>Fax: + 64491491 <br><a style ='text-decoration:none' href='mailto: bg +ps@moe.edu.sg '>BGPS@MOE.EDU.SG </a></a>< +/font></td></tr><tr><td colspan=4><hr></td></tr><tr BgColor=white><td + valign='top'><Font Size=2>1196<br>East</font></td><td valign='top'>< +Font Size=2>BEDOK WEST PRIMARY SCHOOL + <br>尚智小学 + <br>50 BE +DOK RESERVOIR CRESCENT <br>SINGAPORE 479225</font></td><td val +ign='top'><Font Size=2>P : MDM LEE LAY KEOK + <br>&nbsp;&nbsp;&nbsp;&nbsp;李丽菊 + </font +></td><td valign='top'><Font Size=2>Tel: 64451224 <br>Fax: + 64495856 <br><a style ='text-decoration:none' href='mailto: BW +PS@MOE.EDU.SG '>BWPS@MOE.EDU.SG </a></a>< +/font></td></tr><tr><td colspan=4><hr></td></tr><tr BgColor=#A5C9F9>< +td valign='top'><Font Size=2>1129<br>South</font></td><td valign='top +'><Font Size=2>BENDEMEER PRIMARY SCHOOL + <br>明智小学 + <br>1062 + SERANGOON ROAD <br>SINGAPORE 328174</font></td><td +valign='top'><Font Size=2>P : MR KOH CHEE SENG + <br>&nbsp;&nbsp;&nbsp;&nbsp;许志成 + </f +ont></td><td valign='top'><Font Size=2>Tel: 62982911 <br>F +ax: 62995735 <br><a style ='text-decoration:none' href='mailto: + BENDEMEER_PS@MOE.EDU.SG '>BENDEMEER_PS@MOE.EDU.SG </a></ +a></font></td></tr><tr><td colspan=4><hr></td></tr><tr BgColor=white> +<td valign='top'><Font Size=2>1145<br>South</font></td><td valign='to +p'><Font Size=2>BLANGAH RISE PRIMARY SCHOOL + <br>布兰›坡小学 + <br>91 + TELOK BLANGAH HEIGHTS <br>SINGAPORE 109100</font></td><td + valign='top'><Font Size=2>P : MRS THNG KIM PUI ANGELINA + <br>&nbsp;&nbsp;&nbsp;&nbsp;陈金培 + </ +font></td><td 
valign='top'><Font Size=2>Tel: 62717387 <br> +Fax: 62763037 <br><a style ='text-decoration:none' href='mailto +: BRPS@MOE.EDU.SG '>BRPS@MOE.EDU.SG </a>< +/a></font></td></tr><tr><td colspan=4><hr></td></tr><tr BgColor=#A5C9 +F9><td valign='top'><Font Size=2>1640<br>West</font></td><td valign=' +top'><Font Size=2>BOON LAY GARDEN PRIMARY SCHOOL + <br>文›小学 + <br>2 +0 BOON LAY DRIVE <br>SINGAPORE 649930</font></td>< +td valign='top'><Font Size=2>P : MRS LIM FLORENCE + <br>&nbsp;&nbsp;&nbsp;&nbsp;庄兼华 + +<br>VP: MDM YEO KIM GEK NOREEN + <br>&nbsp;&nbsp;&nbsp;&nbsp;杨金玉 + </font></td><td valign='t +op'><Font Size=2>Tel: 63160998 <br>Fax: 63160209 <br +><a style ='text-decoration:none' href='mailto: BLGPS@MOE.EDU.SG + '>BLGPS@MOE.EDU.SG </a></a></font></td></tr><tr +><td colspan=4><hr></td></tr><tr BgColor=white><td valign='top'><Font + Size=2>1013<br>West</font></td><td valign='top'><Font Size=2>BOON LA +Y PRIMARY SCHOOL + <br>文礼小学 + <br>320 JURONG EAST STREET 32 + <br>SINGAPORE 609476</font></td><td valign='top'><Font Siz +e=2>P : MR MOHD MANSOR BIN SHAIK A KADIR + </font></td><td valign='top'><Font Size=2>Tel: 65624978 + <br>Fax: 65631164 <br><a style ='text-decoration:none' href= +'mailto: BLPS@MOE.EDU.SG '>BLPS@MOE.EDU.SG + </a></a></font></td></tr><tr><td colspan=4><hr></td></tr><tr BgColo +r=#A5C9F9><td valign='top'><Font Size=2>1020<br>West</font></td><td v +align='top'><Font Size=2>BUKIT PANJANG PRIMARY SCHOOL + <br>武吉班 +让小学 + <br>109 CASHEW ROAD <br>SINGAPORE 679676</font +></td><td valign='top'><Font Size=2>P : MDM BALAKRISHNA VYJANTHIMALA + <br>VP: MDM POON CHOR CHOO + <br>&nbsp;&nbsp;&nbsp;&nbsp +;方楚如 + </font></td><td valign='top'><Font Size=2>Tel: 67691912 + <br>Fax: 67637462 <br><a style ='text-decoration:none +' href='mailto: BPPS@MOE.EDU.SG '>BPPS@MOE.EDU.SG + </a></a></font></td></tr><tr><td colspan=4><hr></td></tr><tr + BgColor=white><td valign='top'><Font Size=2>1247<br>West</font></td> +<td valign='top'><Font Size=2>BUKIT TIMAH PRIMARY SCHOOL + <br>武 
+知马小学 + <br>111 LORONG KISMIS <br>SINGAPORE 598112< +/font></td><td valign='top'><Font Size=2>P : MR RAJA RAJENDRA + </font></td><td valign='top +'><Font Size=2>Tel: 64662863 <br>Fax: 64692179 <br>< +a style ='text-decoration:none' href='mailto: bukittimahps@moe.edu.sg + '>BUKITTIMAHPS@MOE.EDU.SG </a></a></font></td></tr><tr>< +td colspan=4><hr></td></tr><tr BgColor=#A5C9F9><td valign='top'><Font + Size=2>1209<br>West</font></td><td valign='top'><Font Size=2>BUKIT V +IEW PRIMARY SCHOOL + <br>百德小学 + <br>18 BUKIT BATOK STREET 21 + <br>SINGAPORE 659634</font></td><td valign='top'><Font Siz +e=2>P : MDM JENNY S G LAW + <br>&nbsp;&nbsp;&nbsp;&nbsp;刘›› + <br>VP: MRS HARJIT SINGH +NEE DEWAN KAUR </font></td><td vali +gn='top'><Font Size=2>Tel: 65661980 <br>Fax: 65635015 + <br><a style ='text-decoration:none' href='mailto: BUKITVIEW_PS@MOE +.EDU.SG '>BUKITVIEW_PS@MOE.EDU.SG </a></a></font></td></t +r><tr><td colspan=4><hr></td></tr><tr BgColor=white><td valign='top'> +<Font Size=2>1751<br>North</font></td><td valign='top'><Font Size=2>C +ANBERRA PRIMARY SCHOOL + <br>康 1 ? - + <br>21 ADMIRALTY DRIVE + <br>SINGAPORE 757714</font></td><td valign='top'><Fo +nt Size=2>P : MISS RATNASINGAM SELVARANI + </font></td><td valign='top'><Font Size=2>Tel: 67597433 + <br>Fax: 67587312 <br><a style ='text-decoration:none' + href='mailto: CANBERRA_PS@MOE.EDU.SG '>CANBERRA_PS@MOE.EDU.SG + </a></a></font></td></tr><tr><td colspan=4><hr></td></tr><tr +BgColor=#A5C9F9><td valign='top'><Font Size=2>1771<br>East</font></td +><td valign='top'><Font Size=2>CASUARINA PRIMARY SCHOOL + <br>康 +岭小学 + <br>30 PASIR RIS ST 41 <br>SINGAPORE 518935 +</font></td><td valign='top'><Font Size=2>P

Update3 The code below picks only two schools:

my $file = 'directory[1].txt';
open (FH, "$file") or die $!;
#while (/(?<=>)([\w\s]+?) PRIMARY\s+SCHOOL/g ) { print $&,"\n"};
my @schools;
while (my $line = <FH>) {
    if ($line =~ /(?<=>)([\w\s]+?)\sPRIMARY\s+SCHOOL/g) {
        push (@schools, "$1");
    }
}
print "@schools\n";

Re: Help with regular expression - real file
by merlyn (Sage) on Dec 13, 2003 at 16:22 UTC
Re: Help with regular expression - real file
by hanenkamp (Pilgrim) on Dec 13, 2003 at 17:08 UTC

    I think merlyn is right: trying to scan HTML is difficult. On the other hand, for something as simple as what you are attempting, XML::LibXML may be overkill. In this case, assuming that the page doesn't change formatting frequently, you are really looking for a pattern like:

    /(?<=>)([\w ]+?) PRIMARY SCHOOL/

    This non-greedily matches any run of word characters and spaces that follows the ">" closing a tag and is itself followed by the words " PRIMARY SCHOOL". The overall match includes " PRIMARY SCHOOL" too, while the capture group holds only the name in front of it. This will fail if the line is broken in the middle, but you can get around that by using "\s" instead of literal spaces between the words.
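
A minimal sketch of that idea, under some assumptions: the whole file is held in one string (the `$html` literal here is a made-up stand-in for the real file, with a line break placed inside one name), and the pattern is a variant of the one above with "\s" in place of literal spaces plus a trailing "\b" so that the "PRIMARY SCHOOLS" heading is skipped. Using the match in list context with /g collects every capture rather than stopping at the first one.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the raw HTML. With the real file you
# would slurp it instead, e.g.: my $html = do { local $/; <$fh> };
my $html = q{<font face="Arial" Size=3><b>PRIMARY SCHOOLS</b></font><br>
<td valign='top'><Font Size=2>1744<br>North</font></td><td valign='top'><Font Size=2>ADMIRALTY PRIMARY SCHOOL <br>11 WOODLANDS CIRCLE</font></td>
<td valign='top'><Font Size=2>1150<br>North</font></td><td valign='top'><Font Size=2>ANG MO
KIO PRIMARY SCHOOL <br>20 ANG MO KIO AVENUE 3</font></td>};

# List context + /g returns the capture from every match in the string.
# \s (rather than a literal space) lets a name span a line break.
my @schools = $html =~ /(?<=>)([\w\s]+?)\s+PRIMARY\s+SCHOOL\b/g;

# Tidy each name: collapse line breaks and runs of spaces, trim the front.
for (@schools) {
    s/\s+/ /g;
    s/^ //;
}

print "$_\n" for @schools;
# prints:
# ADMIRALTY
# ANG MO KIO
```

Reading the file one line at a time and testing each line with a single `if` can find at most one school per line, which may be why only a couple of names turn up when the raw HTML consists of a few very long lines.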

Re: Help with regular expression - real file
by carric (Beadle) on Dec 14, 2003 at 00:01 UTC
    Another possible approach is using HTML::TableExtract to scrape the data (since you are targeting data in a table). You can pull the rows and data you want sans HTML tags.

    It's pretty simple to use and the documentation is pretty good.
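    #!/usr/bin/perl
    use lib qw( ..);
    use HTML::TableExtract;
    use LWP::Simple;
    use Data::Dumper;

    # Select one table by its nesting depth and its count within that depth
    my $te = HTML::TableExtract->new( depth=>3, count=>4, gridmap=>1 );

    # Fetch the page; get() returns undef on failure
    my $content = get("http://www.ice.com/customer/product_search.jsp?tofs=keywords&keywords=necklace");
    defined $content or die "Could not fetch the page\n";
    $te->parse($content);

    # Dump every row of each table that matched
    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {
            print Dumper $row;
        }
    }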
    I hope that helps out.

    Carric