Re: Oft encountered regex problem

Re: Re: Oft encountered regex problem by GermanHerman (Sexton) on Jul 18, 2003 at 22:26 UTC
The actual application is this. I have created a web crawler that searches various sites, retrieves data (names, prices, etc.) and stores it in a database, and I have another script that compares prices for similar items. Whenever I get to the part of the script that downloads the actual product pages, my code turns into something like this:

`$thisHash{'price'} = $1 if $pageContent =~ s/prices: \<table\>.+?\<\/table\>//s;
$thisHash{'cas'} = $1 if $pageContent =~ s/CAS: \<b\>\d+-\d{2}-\d+\<\/b\>//s;`

etc. etc. etc. I would like to do something along the lines of:

`$pageContent =~ /(CAS: \<b>\d+-\d{2}-\d+\<\/b\>\|).*?(prices: \<table\>.+?\<\/table\>\|)/s`

and then assign the captured values to a hash in an elegant manner. Sorry, I should have been much clearer in the first place.

-Herman
Re: Re: Re: Oft encountered regex problem by graff (Chancellor) on Jul 20, 2003 at 03:29 UTC
Yes, this is quite a different sort of problem from the initial example that started this thread. If you have a small number of distinct sites that you're scanning, and you are reasonably confident that each site follows its own pattern consistently, then you can try keeping the appropriate extraction regexes in their own hash, keyed by web-site name -- something like:

`my %regs = ( "site1.com" => [ "price", qr{prices: <table>(.+?)</table>}is ],
             "site2.com" => [ "cas", qr{cas: <b>(\d+-\d{2}-\d+)}is ],
             ...
           );
...
foreach my $site (keys %regs) {
    ... # fetch data into $pagecontent...
    # each value is an array ref holding [ hash key, compiled regex ]
    my ($key,$reg) = @{ $regs{$site} };
    $thisHash{$key} = $1 if $pagecontent =~ $reg;
}`

This at least makes it easier to keep track of each site's peculiarities, and it limits the number of executable lines you need to work through all the sites. (Maybe you need a slightly more elaborate structure if you're pulling "price" and "CAS" from the same site; maybe you can see the way to go given this example.)

I can't imagine doing this any more compactly, since it depends heavily on specific knowledge about how each site formats its price lists, etc. It would be hard to generalize any further unless all the sites somehow managed to present their information in roughly the same way, which seems implausible.

(It's not always a given that you can use regexes for this sort of thing at all -- many folks here would suggest that you use HTML::TokeParser or somesuch, which might not be a bad idea... Do look at least at HTML::TokeParser::Simple; it may make things a lot easier and give you a level of "abstraction" (generality) that will be useful.)

BTW, I noticed that your example in this reply was referring to "$1", though your regexes did not contain any parens. That won't work: without capturing parentheses around the part you want, $1 will not hold the matched text.
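For the case where "price" and "CAS" both come from the same site, one possible shape for the "slightly more elaborate structure" is a hash of hashes: site name, then field name, then a regex with a single capture group. The following is only a sketch under assumptions; the site name, the patterns, the sample page, and the helper name `extract_fields` are all made up for illustration and would need adjusting to the real pages.

```perl
use strict;
use warnings;

# Hypothetical per-site table: each site maps field names to a regex
# with exactly one capture group. Site names and patterns are assumptions.
my %regs = (
    "site1.com" => {
        price => qr{prices:\s*<table>(.+?)</table>}is,
        cas   => qr{CAS:\s*<b>(\d+-\d{2}-\d+)</b>}i,
    },
);

# Apply every regex registered for $site to $page; return a hash ref
# holding only the fields that matched.
sub extract_fields {
    my ($site, $page) = @_;
    my %found;
    my $fields = $regs{$site} or return \%found;
    for my $key (keys %$fields) {
        # A list-context match returns the captures directly, so $1 is
        # never read after a failed match.
        my ($value) = $page =~ $fields->{$key};
        $found{$key} = $value if defined $value;
    }
    return \%found;
}

# Stand-in for a fetched page (made-up sample data).
my $page = 'CAS: <b>64-17-5</b> prices: <table><tr><td>$12.50</td></tr></table>';
my $item = extract_fields("site1.com", $page);
print "$_ => $item->{$_}\n" for sort keys %$item;
```

Adding a field for a site then means adding one line to the table rather than another `if` statement, and a field that fails to match is simply absent from the result instead of picking up a stale `$1`.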


	PerlMonks