comment on

I ask for any and all monks whose knowledge of regular expressions exceeds my own to help me optimize this regex.
Optimize for speed or readability the choice is yours. I am parsing computer generated, badly formatted, HTML which has a number of lines that look like this

<tr>
<td width="0" align="center"><font face="Arial" size="2">29022</font>
</td><td width="0" align="center"><font face="Arial" size="2">01</font
+>
</td><td width="0" align="center"><font face="Arial" size="2">354</fon
+t>
</td><td width="0" align="center"><font face="Arial" size="2">201</fon
+t>
</td><td width="0" align="center"><font face="Arial" size="2">01</font
+>
</td><td width="0" align="center"><font face="Arial" size="2">&nbsp;</
+font>
</td><td align="center"><font face="Arial" size="2">INTRODUCTION TO FI
+LM</font>
</td><td align="center"><font face="Arial" size="2"> 3</font>
</td><td align="center"><font face="Arial" size="2">MW5M7,8</font>
</td><td align="center"><font face="Arial" size="2">MI-100</font>
</td><td align="center"><font face="Arial" size="2">&nbsp;</font>
</td></tr>
[download]

The first piece of data is always 5 digits, the second 2, the third 3, the fourth 3, the fifth 2, the sixth a $nbsp or a single letter, the seventh an unspecified number of words, the eigth a single space followed by a digit, the ninth will always have 1 letter and then will either be another letter followed by a number or two numbers (that will possibly be followed by another similar set, as shown above), the tenth will always be letters followed by a dash and then some numbers or letters, and finally the eleventh can be ignored.

I am currently using /^<tr.*?(\d\d\d\d\d).*?<td.*?>(\d\d).*?<td.*?>(\d\d\d).*?<td.*?>(\d\d\d).*?<td.*?>(\d\d).*?<td\.*?>(?:&).*?<td.*?>(\w+(?:(?:\s\w+)?)*).*?<td.*?>(\s\d).*?<td.*?>(\w(?:(?:[\w\d,])?)*).*?<td.*?>(\w(?:\(?:[\w\d-])?)*)/ to pull out the relevant pieces of information. Any advice would be appreciated.

-Etan

P.S. Oh and if possible some slight explanation would be useful. I understand the basics of regex but flags and oddities still escape me. Thanks in advance.

P.P.S I had been checking for both the <td> and <font> tags but then read this node and realized that I was asking for too much. So I stopped asking for the font tags. Just thought I'd throw that in to put my thought processes up for inspection.

In reply to Regex optimization by deryni

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.