Matthieu14 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I would like to parse a basic HTML file, and I tried the following Perl script without success.
My code is simple enough that you should see what I'm trying to do without any problem.
Maybe someone can tell me where my mistake is.
I would appreciate it, because I'm a newbie with complex regexps in Perl (multiple lines...) and I don't understand why it doesn't work...
use strict;

my $path_publication = "C:\\Mes_Programmes\\scripts";
my $entree = "";
my $contenu = "";
my @fichiers = ();

@fichiers = ();
opendir REP, $path_publication or die "Cannot read $path_publication";
@fichiers = readdir REP;
closedir REP;

foreach $entree (@fichiers) {
    $contenu = "";
    if ($entree =~ /diagram[[:alnum:]]+\.htm/) {
        open IN, '<' . $path_publication . '\\' . $entree or die "Cannot open $entree";
        while (<IN>) {
            $contenu = $contenu . $_;
        }
        if ( $contenu =~ m[.*<body>.*<br>.*</span>.*<H1>.*</H1>.*<HR>.*<SMALL>.*</SMALL>.*</body>.*]smi ) {
            print "IT WORKS";
        }
        close IN;
    }
}
Many thanks, Matthieu

Replies are listed 'Best First'.
Re: Parsing an html file
by Anonymous Monk on Apr 20, 2010 at 08:33 UTC

      The HTML::Tree(Builder) tutorial was written in 2003.

      HTML-Tree itself seems to have been last updated in 2006.

      Is HTML-Tree still a good choice for parsing HTML? Can anyone please direct me to a newer and more complete tutorial?

        Edit: I thought this question merited its own topic thread, so rather than answer here I asked here.

Re: Parsing an html file
by jethro (Monsignor) on Apr 20, 2010 at 08:47 UTC
    I don't see anything wrong, but I don't know the contents of your data file. To find bugs in a regexp like yours, you could start with a simpler regexp and add pieces one at a time until it no longer matches. The piece you added last is where your problem lies.

    Also, you should check that the file contents are really in $contenu by printing it after it has been filled.
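A minimal sketch of that incremental approach, under stated assumptions: the sample text is a hypothetical stand-in for the real contents of $contenu, and first_failing_piece is a helper name invented for illustration. It grows the pattern one piece at a time, joined by .* as in the original regexp, and reports the first piece at which the match fails.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the slurped file contents ($contenu).
my $contenu = <<'HTML';
<BODY>
<a href="x.htm">Home</a><br>Updated
<H1>Title</H1>
<HR><P>
<BR><SMALL>Footer</SMALL></P>
</BODY>
HTML

# The pieces of the original pattern, in the order they must appear.
my @pieces = ('<body>', '<br>', '</span>', '<H1>', '</H1>',
              '<HR>', '<SMALL>', '</SMALL>', '</body>');

# Return the first piece at which the growing pattern stops matching,
# or undef (in scalar context) if the whole pattern matches.
sub first_failing_piece {
    my ($text, @parts) = @_;
    my @so_far;
    for my $part (@parts) {
        push @so_far, quotemeta $part;      # treat each piece literally
        my $pattern = join '.*', @so_far;   # same .* glue as the original
        return $part unless $text =~ /$pattern/si;
    }
    return;
}

my $culprit = first_failing_piece($contenu, @pieces);
print defined $culprit
    ? "Pattern stops matching at: $culprit\n"
    : "Whole pattern matches\n";
```

Running the same helper on the real file contents, once they have been read into $contenu, should point at whichever piece of the pattern has no counterpart in the file.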

      Sorry. Here is the content of my HTML file, as stored in $contenu (this is the output of print $contenu):

      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//FR">
      <HTML lang="fr">
      <HEAD>
      <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
      <LINK HREF="CollapsibleList.css" REL="stylesheet" TYPE="text/css">
      <SCRIPT TYPE="text/javascript" SRC="CollapsibleList.js" ></SCRIPT>
      <link rel=stylesheet href=QimProcess.css>
      <TITLE>1-FIN01 - Facturer</TITLE>
      <META NAME="author" CONTENT="ADMIN">
      </HEAD>
      <BODY>
      <A NAME="topofpagediagram12htm"></A>
      <div id="entete">
      </div>
      <div id="menu_h">
      <ul id="menu_horizontal">
      <li>
      <a href="#">Index</a>
      <ul class="sous_menu_horizontal">
      <li><a href="indexprocess.htm" target="PAGE">Processus</a></li>
      <li><a href="indexdocument.htm" target="PAGE">Documents</a></li>
      </ul>
      </li>
      <li>
      <a href="#">Aide</a>
      <ul class="sous_menu_horizontal">
      <li><a href="diagram7.htm" target="PAGE">Légende</a></li>
      </ul>
      </li>
      </ul>
      </div>
      <div id="menu_v">
      <object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" id="menutree" width="100%" height="100%" codebase="http://fpdownload.macromedia.com/get/flashplayer/current/swflash.cab">
      <param name="movie" value="menutree.swf" />
      <param name="quality" value="high" />
      <param name="wmode" value="transparent" />
      <param name="allowScriptAccess" value="sameDomain" />
      <embed src="menutree.swf" quality="high" wmode="transparent" width="100%" height="100%" name="menutree" align="middle" play="true" loop="false" quality="high" allowScriptAccess="sameDomain" type="application/x-shockwave-flash" pluginspage="http://www.adobe.com/go/getflashplayer">
      </embed>
      </object>
      </div>
      <div id="corps">
      <a href="mailto:is.methodes@xxxxx.com?Subject=[REFERENTIEL] 1-FIN01 - Facturer (12)">Send us a comment</a> | <a HREF="diagram6bca80b88e2411dea2910019b93c8ff0.htm">Home</a><br>Dernière mise à jour effectuée par Administrator (22.03.2010 10:54:59)
      <H1>1-FIN01 - Facturer</H1>
      <MAP NAME="COORDdiagram12htm">
      <AREA SHAPE="RECT" COORDS="356, 171, 375, 190" HREF="document5.htm" ALT="Mode Opératoire Préparer la facturation">
      <AREA SHAPE="RECT" COORDS="95, 201, 189, 277" ALT="Événement interne Projet à facturer">
      <AREA SHAPE="RECT" COORDS="95, 680, 189, 756" ALT="Résultat interne Projet facturé">
      <AREA SHAPE="RECT" COORDS="267, 503, 361, 579" HREF="process63.htm" ALT="N3 - Activité Suivre les opérations de crédit">
      <AREA SHAPE="RECT" COORDS="463, 201, 557, 277" HREF="process54.htm" ALT="N3 - Activité Valider les factures">
      <AREA SHAPE="RECT" COORDS="267, 401, 361, 477" HREF="process55.htm" ALT="N3 - Activité Recouvrer les factures">
      <AREA SHAPE="RECT" COORDS="420, 76, 609, 756" HREF="organization18.htm" ALT="Fonction Directeur">
      <AREA SHAPE="RECT" COORDS="267, 103, 361, 179" HREF="process52.htm" ALT="N3 - Activité Préparer la facturation">
      <AREA SHAPE="RECT" COORDS="267, 201, 361, 277" HREF="process53.htm" ALT="N3 - Activité Etablir les factures">
      <AREA SHAPE="RECT" COORDS="267, 299, 361, 375" HREF="process56.htm" ALT="N3 - Activité Envoyer les factures">
      <AREA SHAPE="RECT" COORDS="224, 76, 413, 756" HREF="organization4.htm" ALT="Fonction Assistante de direction">
      <AREA SHAPE="RECT" COORDS="95, 582, 189, 658" ALT="Résultat interne Factures payées">
      <AREA SHAPE="RECT" COORDS="95, 401, 189, 477" ALT="Événement externe Délai de paiement atteint">
      <AREA SHAPE="RECT" COORDS="95, 103, 189, 179" ALT="Événement externe Fin de mois">
      <AREA SHAPE="RECT" COORDS="356, 171, 375, 190" HREF="document5.htm" ALT="Mode Opératoire Préparer la facturation">
      </MAP>
      <P><IMG SRC="diagram12.jpg" USEMAP="#COORDdiagram12htm" ALT="1-FIN01 - Facturer" LONGDESC=""></P>
      <HR><P>
      </div>
      <BR><SMALL>
      Créé à partir du modèle QimProcess le 19.04.2010 à 16:16</SMALL></P>
      </BODY>
      </HTML>
        There is no span tag in this file, but your regexp expects one. Naturally it won't match.
Re: Parsing an html file
by samarzone (Pilgrim) on Apr 20, 2010 at 08:41 UTC
    It's not easy to guess why it isn't working without the content of the HTML file. However, did you intentionally include only the closing </span> tag and not the opening one?