I'm trying to pull some data out of an HTML file.
Before I get to the specific problem, I would like to say that yes, I realize this is far from optimal in more ways than one.
I have read Death to Dot Star! and realize that this regex is probably horrendously inefficient. I have also been advised that HTML::TableExtract is a better way to get data out of an HTML table than a regex is.
With all that in mind, I would like to ask for help with this problem.
/^<(?:[Tt][Rr]).*?>(\d{5}).*?>(\d{2}).*?>(\d{3}).*?>(\d{3}).*?>(\d{2}).*?>(?:[&\w]).*?>(\w+(?:(?:[\s\w|&]+)?)*).*?>\s(\d).*?>(\w*?\d(?:[,\d\*]?)*)((?:[\w\d,]?)+).*?>(\w(?:(?:[\w\d-])?)*).*?<\/[tT][rR]>(<.*)?/
I am using that regex to pull the information out of a webpage with the following line format (all newlines are mine, to ease readability):
<tr><td width="0" align="center"><font face="Arial" size="2">5 Digits </font></td>
<td width="0" align="center"><font face="Arial" size="2">2 Digits </font></td>
<td width="0" align="center"><font face="Arial" size="2">3 Digits </font></td>
<td width="0" align="center"><font face="Arial" size="2">3 Digits </font></td>
<td width="0" align="center"><font face="Arial" size="2">2 Digits </font></td>
<td width="0" align="center"><font face="Arial" size="2"> </font></td>
<td align="center"><font face="Arial" size="2">A String</font></td>
<td align="center"><font face="Arial" size="2"> 1 Digit always preceded by a space </font></td>
<td align="center"><font face="Arial" size="2">Letters, Digits, (commas|asterisks)?</font></td>
<td align="center"><font face="Arial" size="2">A String always including a dash</font></td>
<td align="center"><font face="Arial" size="2"> </font></td></tr>
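For comparison, here is a minimal sketch of a different technique from the single big regex: pulling the text of each cell out of a row one cell at a time. It assumes every cell has exactly the td/font shape shown above and that cell text never contains a "<". The name row_cells is a hypothetical helper, not something from my script.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every <td><font>...</font></td>
# cell found in one row of HTML, in document order.
sub row_cells {
    my ($row) = @_;

    # [^>]* and [^<]* cannot run past a tag boundary, so each cell is
    # matched with very little backtracking. In list context with /g,
    # the match returns the captured text of every cell.
    my @cells = $row =~ m{<td[^>]*>\s*<font[^>]*>([^<]*)</font>\s*</td>}gi;

    # Trim leading and trailing whitespace from each cell.
    s/^\s+|\s+$//g for @cells;

    return @cells;
}
```

The fields can then be validated individually (for example, checking that the first cell is five digits), so a row in an unexpected format simply fails one small check.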
Now here's the problem: the 9th piece of data is on occasion the string "SEE SCHEDULE OF CLASSES", and the tenth will then be [...].
The regex is the condition of an if statement:
while (<FILE>) {
    if (/REGEX/) {    # REGEX stands for the pattern shown above
        # do stuff
    }
}
The problem comes in when a file with this alternate format is input. The script hangs and then says that there was an internal server error. I managed to get around this by using a different regex to first test for the "SEE SCHEDULE OF CLASSES" string and if it exists simply going to the next line of the file.
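Roughly, the workaround looks like the following. This is a minimal sketch rather than my actual script: the names is_alternate_row and process_lines, and the $fh and $row_re parameters, are placeholders for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the line is in the alternate format.
sub is_alternate_row {
    my ($line) = @_;
    return $line =~ /SEE SCHEDULE OF CLASSES/ ? 1 : 0;
}

# Sketch of the loop. $fh is an open filehandle and $row_re is the
# compiled row regex (both are assumptions for this example).
sub process_lines {
    my ($fh, $row_re) = @_;
    my @matched;
    while (my $line = <$fh>) {
        # Skip alternate-format lines before the big regex ever runs.
        next if is_alternate_row($line);
        push @matched, $line if $line =~ $row_re;
    }
    return @matched;
}
```

The cheap test runs first, so the expensive pattern is never applied to the lines that make it hang.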
My question is: why doesn't my regex simply fail to match and return false, so that the if block is skipped and the loop continues?
Thanks in advance for any and all help.
-Etan