PerlMonks  

Re: Oft encountered regex problem

by blaze (Friar)
on Jul 18, 2003 at 21:50 UTC ( [id://275767]=note )


in reply to Oft encountered regex problem

Hmm... this seems a lot like homework. Yes, it is possible and not very difficult. What have you tried so far?
-Robert


Hopefully there will be a witty sig here at some point

Replies are listed 'Best First'.
Re: Re: Oft encountered regex problem
by GermanHerman (Sexton) on Jul 18, 2003 at 22:26 UTC
    The actual application is this. I have created a webcrawler that searches various sites, retrieves data (names, prices, etc.), and stores it in a database, and I have another script that compares prices for similar items. Whenever I get to the part of the script that downloads the actual product pages, my code turns into something like this:
    $thisHash('price') = $1 if $pageContent =~ s/prices: \<table\>.+?\<\/table\>//s
    $thisHash('cas') = $1 if $pageContent =~ s/CAS: \<b\>\d+-\d{2}-\d+\<\/b\>//s
    etc etc etc.
    I would like to do something along the lines of:
    $pageContent =~ /(CAS: \<b>\d+-\d{2}-\d+\<\/b\>|).*?(prices: \<table\>.+?\<\/table\>|)/s
    and then assign the captured values to a hash in an elegant manner. Sorry, I should have been much more clear in the first place.
    -Herman
      Yes, this is quite a different sort of problem from the initial example that started this thread. If you have a small number of distinct sites that you're scanning, and you are reasonably confident that each site has its own pattern that it follows consistently, then you can try keeping the appropriate regexes for price extraction in their own hash, keyed by web-site name -- something like:
      my %regs = (
          "site1.com" => [ "price", qr{prices: <table>(.+?)</table>}is ],
          "site2.com" => [ "cas",   qr{cas: <b>(\d+-\d{2}-\d+)}is ],
          ...
      );
      ...
      foreach my $site (keys %regs) {
          ...  # fetch data into $pagecontent...
          my ($key, $reg) = @{ $regs{$site} };  # unpack the [ key, regex ] pair for this site
          $thisHash{$key} = $1 if $pagecontent =~ $reg;
      }
      This at least makes it easier to keep track of each site's peculiarities, and to limit the number of executable lines you need to actually work through all the sites. (Maybe you need a slightly more elaborate structure, if you're pulling "price" and "CAS" from the same site; maybe you can see the way to go given this example.)
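      For the "slightly more elaborate structure" case, here is one possible sketch, not the only way to do it: each site maps to a hash of field name => regex, so "price" and "CAS" can both come from the same page. The site names, the sample page, and the helper extract_fields() are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical site names and patterns; real sites will need their own regexes.
my %regs = (
    "site1.com" => {
        price => qr{prices: <table>(.+?)</table>}is,
        cas   => qr{CAS: <b>(\d+-\d{2}-\d+)</b>}is,
    },
    "site2.com" => {
        cas   => qr{cas: <b>(\d+-\d{2}-\d+)}is,
    },
);

# Apply every regex registered for $site to $pagecontent and return
# a hash ref holding only the fields that actually matched.
sub extract_fields {
    my ($site, $pagecontent) = @_;
    my %found;
    my $fields = $regs{$site} or return \%found;   # unknown site: nothing to extract
    for my $key (keys %$fields) {
        $found{$key} = $1 if $pagecontent =~ $fields->{$key};
    }
    return \%found;
}

# Stand-in for a fetched page (made-up sample data).
my $page = 'CAS: <b>64-17-5</b> ... prices: <table><tr><td>$12.50</td></tr></table>';
my $thisHash = extract_fields("site1.com", $page);
print "$_ => $thisHash->{$_}\n" for sort keys %$thisHash;
```

      A field whose regex does not match is simply left out of the result, so the caller can check with exists() which values were found on a given page.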

      I can't imagine doing this any more compactly, since it does depend heavily on specific knowledge about how each site formats its price lists, etc. It would be hard to generalize any further unless all the sites somehow managed to do roughly the same thing to present their information, which seems implausible. (It's not always a given that you can use regexes for this sort of thing at all -- many folks here would suggest that you use HTML::TokeParser or somesuch, which might not be a bad idea... Do look at least at HTML::TokeParser::Simple; it may make things a lot easier and give you a level of "abstraction" (generality) that will be useful.)

      BTW, I noticed that your example in this reply was referring to "$1", though your regexes did not contain any parens. That would be wrong.
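
      To make the "$1" point concrete, here is a small sketch using a made-up sample string: the same pattern is tried without and with capturing parentheses, and only the second version puts anything into $1.

```perl
use strict;
use warnings;

my $page = 'CAS: <b>64-17-5</b>';   # made-up sample data

# No parentheses: the match succeeds, but nothing is captured,
# so $1 is undefined after the match.
my $without;
$without = $1 if $page =~ /CAS: <b>\d+-\d{2}-\d+<\/b>/;

# With parentheses around the part we want, $1 holds the CAS number.
my $with;
$with = $1 if $page =~ /CAS: <b>(\d+-\d{2}-\d+)<\/b>/;

print "without parens: ", (defined $without ? $without : "undef"), "\n";
print "with parens: $with\n";
```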
