I'm sure this has been asked before but I wouldn't even begin to know what search terms to try for :)

I am using WWW::Mechanize to scrape a site that has images with text next to them. I want to rip through the page, pull out all the images, and put them in an array. I'd then use a similar regex to slurp up all the text and place it in a second array (the images and text have to stay in the same order as they are found).

I have a regex that strips out all the useless junk in the HTML file, keeping just the table that I'm looking for. I'm not sure how to loop over $page (the content dump) to pull out every instance of an image WITHOUT using the image function within this module. Using that function would still leave me stranded when trying to get the text that goes with each image.

Below is a sample of what I am working with:

</a><br><br><table width="100%" cellpadding="2" cellspacing="0" border="0"><tr><td align="left" valign="bottom"><img src='http://images.tek-tips.com/items/image001.gif' alt='Image001' width='40' height='40' border='0'> Description of image here</td><td align="right" valign="bottom"></td>
</tr><tr><td align="left" valign="bottom"><img src='http://images.tek-tips.com/items/image002.gif' alt='Image002' width='40' height='40' border='0'> Description of image here</td><td align="right" valign="bottom"></td>

I want all images to be in @images and the text next to each image to be in @text. There is surely a way to go through this in one pass and collect both, but would it be easier to use two separate regexes?
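For illustration, here is a minimal sketch of the one-pass idea. It assumes the markup always looks like the sample above: an img tag with a quoted src attribute, followed directly by the description and then the next tag. The $page contents here are a hard-coded stand-in for the cleaned-up dump, and the pattern is only a guess fitted to that sample, so it would likely need adjusting for the real page (an HTML parser would be more robust if the markup varies).

```perl
use strict;
use warnings;

# Stand-in for the table fragment left in $page after the cleanup regex.
# In the real script this would come from the WWW::Mechanize content dump.
my $page = <<'HTML';
<tr><td align="left" valign="bottom"><img src='http://images.tek-tips.com/items/image001.gif' alt='Image001' width='40' height='40' border='0'> Description of image here</td><td align="right" valign="bottom"></td>
</tr><tr><td align="left" valign="bottom"><img src='http://images.tek-tips.com/items/image002.gif' alt='Image002' width='40' height='40' border='0'> Description of image here</td><td align="right" valign="bottom"></td>
HTML

my (@images, @text);

# One pass: with /g in scalar context each match resumes where the
# previous one ended, so every iteration yields one image URL plus the
# text that follows its tag. Both arrays stay in document order.
while ( $page =~ m{<img\s[^>]*?\bsrc=(['"])(.*?)\1[^>]*>\s*([^<]*)}gis ) {
    my ($src, $desc) = ($2, $3);   # copy captures before running another regex
    $desc =~ s/\s+\z//;            # trim trailing whitespace
    push @images, $src;
    push @text,   $desc;
}

print "$images[$_] => $text[$_]\n" for 0 .. $#images;
```

Because both pushes happen in the same iteration, $images[$n] and $text[$n] always refer to the same table cell, which two separate regexes would not guarantee if one of them missed a match.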

Regexes are not my strong point, and I appreciate any and all help getting the data extracted.

In reply to Pulling all instances of a regex out by coldfingertips