Oft encountered regex problem

GermanHerman has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Oft encountered regex problem by Limbic~Region (Chancellor) on Jul 18, 2003 at 21:47 UTC
Herman, I started to reply to this node about three times and stopped. Are you going to be creating a bunch of hashes that each hold a single record, or would you like an array of hashes (AoH) that holds all the records? In any case, what you want to do is figure out how to split your data into key and value pieces. You have three things to split on: ": ", " is ", and " at ". You will want to limit the split to just two pieces, since the rest of the string may contain the delimiter again. `chomp; my ($key, $val) = split /: | is | at /, $_, 2; $info{$key} = $val;` Note that the pattern has no capturing parentheses: with them, split would also return the delimiter, and $val would end up holding the delimiter instead of the value.

This has some problems. You are assuming you will not encounter the same key name twice in the same data file, since a repeat would overwrite the first value. You are also not building the hash with a single regex like you asked, but iterating over the data and splitting each line into two pieces. It seems to me that if you were processing a lot of these types of records at once you would want to use an AoH. If you can give a little more context about what you are trying to accomplish, I would be glad to help further.

Cheers - L~R
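The array-of-hashes idea mentioned above could look roughly like the following. This is only a sketch: it assumes records are separated by blank lines, which the thread does not state, and the variable names `$data` and `@records` are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed input layout: one record per paragraph, blank line between records.
my $data = <<'END';
name is Doug
eyes: brown
email at bill@hotmail.com

name is Fred
eyes: black
era: prehistoric
email at fflinstone@hotmail.com
END

my @records;    # the array of hashes, one hash per record
for my $chunk (split /\n{2,}/, $data) {
    my %info;
    for my $line (split /\n/, $chunk) {
        # Limit of 2: only the first delimiter on the line splits it.
        my ($key, $val) = split /: | is | at /, $line, 2;
        next unless defined $val;    # skip lines with no delimiter
        $info{$key} = $val;
    }
    push @records, \%info if %info;
}

printf "%s has %s eyes\n", $_->{name}, $_->{eyes} for @records;
```

Each record gets its own hash, so a repeated key such as `name` in the next record no longer overwrites the previous one. A key repeated inside a single record would still be overwritten.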
Re: Oft encountered regex problem by dbwiz (Curate) on Jul 18, 2003 at 22:14 UTC
This should do what you want.

    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    my $first = <<'END';
    name is Doug
    eyes: brown
    email at bill@hotmail.com
    END

    my $second = <<'END';
    name is Fred
    eyes: black
    era: prehistoric
    email at fflinstone@hotmail.com
    END

    sub parse {
        my $text = shift;
        my %info = ();
        while ($text =~ /
                ^       # start of line
                (\w+)   # a word
                [:\s]   # colon or space
                .*      # anything
                \s      # a space
                (\S+)   # a sequence of non spaces
                $       # till the end of line
                /xmg)
        {
            $info{$1} = $2;
        }
        return \%info;
    }

    for ($first, $second) {
        print Data::Dumper->Dump([parse($_)], ['info']), "\n";
    }
    __END__
    $info = {
              'eyes' => 'brown',
              'email' => 'bill@hotmail.com',
              'name' => 'Doug'
            };

    $info = {
              'eyes' => 'black',
              'era' => 'prehistoric',
              'email' => 'fflinstone@hotmail.com',
              'name' => 'Fred'
            };

You can assign the result from parse() to a hash reference and eventually build the structure that suits your needs best.
Re: Oft encountered regex problem by CountZero (Bishop) on Jul 18, 2003 at 21:51 UTC
Did you consider using `split`?

    while (<INPUT>) {
        my ($key, $value) = split(/ is |: | at /);
        # save $key and $value; or print them; or ...
    }

CountZero

Update: Beaten by L~R

"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
Re: Oft encountered regex problem by pzbagel (Chaplain) on Jul 18, 2003 at 21:57 UTC
This pretty much does it:

    #!/usr/bin/perl -lw
    use Data::Dumper;

    my %info;
    while (<DATA>) {
        $info{$1} = $2 if /(\w+)[ :](?:is|at)? (.+)/;
    }
    print Dumper(\%info);

    __DATA__
    name is Doug
    eyes: brown
    email at bill@hotmail.com

Output with each dataset:

Fred Flintstone:

    $VAR1 = {
              'email' => 'fflinstone@hotmail.com',
              'name' => 'Fred',
              'eyes' => 'black',
              'era' => 'prehistoric'
            };

Doug:

    $VAR1 = {
              'email' => 'bill@hotmail.com',
              'name' => 'Doug',
              'eyes' => 'brown'
            };

Addendum: Updated the regex to get rid of the space at the beginning of the data.
Re: Oft encountered regex problem by blaze (Friar) on Jul 18, 2003 at 21:50 UTC
Hmm... this seems a lot like homework. Yes, it is possible and not very difficult. What have you tried so far?

-Robert

Hopefully there will be a witty sig here at some point
Re: Re: Oft encountered regex problem by GermanHerman (Sexton) on Jul 18, 2003 at 22:26 UTC
The actual application is this. I have created a web crawler that searches various sites, retrieves data (names, prices, etc.), and stores it in a database. I have another script that compares prices for similar items. Whenever I get to the part of the script that downloads the actual product pages from these sites, my code turns into something like this:

    $thisHash{'price'} = $1 if $pageContent =~ s/prices: \<table\>.+?\<\/table\>//s;
    $thisHash{'cas'} = $1 if $pageContent =~ s/CAS: \<b\>\d+-\d{2}-\d+\<\/b\>//s;

etc. etc. etc. I would like to do something along the lines of:

    $pageContent =~ /(CAS: \<b>\d+-\d{2}-\d+\<\/b\>|).*?(prices: \<table\>.+?\<\/table\>|)/s

and then assign the captured values to a hash in an elegant manner. Sorry, I should have been much more clear in the first place.

-Herman
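One way the "single regex, then assign the captures to a hash" idea described above can be written is with a hash slice, since a match in list context returns its captures. This is a minimal sketch, not a tested crawler component: the sample page fragment is invented, and the pattern assumes both fields are present and that CAS comes before the prices table.

```perl
use strict;
use warnings;

# Invented page fragment for illustration; real pages will differ.
my $pageContent =
    '<p>CAS: <b>50-00-0</b></p> prices: <table><tr><td>$10</td></tr></table>';

my %thisHash;

# A successful match in list context returns ($1, $2, ...),
# so the captures can be assigned to several keys at once.
if (my @captures = $pageContent =~
        m{CAS: <b>(\d+-\d{2}-\d+)</b>.*?prices: <table>(.+?)</table>}s)
{
    @thisHash{qw(cas price)} = @captures;
}

print "$_ => $thisHash{$_}\n" for sort keys %thisHash;
```

If either field is missing the whole match fails and the hash stays empty, so this form only suits pages where the layout is fixed.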
Re: Re: Re: Oft encountered regex problem by graff (Chancellor) on Jul 20, 2003 at 03:29 UTC
Yes, this is quite a different sort of problem from the initial example that started this thread. If you have a small number of distinct sites that you're scanning, and you are reasonably confident that each site has its own pattern that it follows consistently, then you can try keeping the appropriate regexes for price extraction in their own hash, keyed by web-site name, something like:

    my %regs = (
        "site1.com" => [ "price", qr{prices: <table>(.+?)</table>}is ],
        "site2.com" => [ "cas",   qr{cas: <b>(\d+-\d{2}-\d+)}is ],
        ...
    );
    ...
    foreach my $site (keys %regs) {
        ...  # fetch data into $pagecontent...
        my ($key, $reg) = @{ $regs{$site} };
        $thisHash{$key} = $1 if $pagecontent =~ $reg;
    }

This at least makes it easier to keep track of each site's peculiarities, and it limits the number of executable lines you need to work through all the sites. (Maybe you need a slightly more elaborate structure if you're pulling "price" and "CAS" from the same site; maybe you can see the way to go given this example.)

I can't imagine doing this any more compactly, since it depends heavily on specific knowledge about how each site formats its price lists, etc. It would be hard to generalize any further unless all the sites somehow managed to present their information in roughly the same way, which seems implausible.

(It's not always a given that you can use regexes for this sort of thing at all. Many folks here would suggest that you use HTML::TokeParser or some such, which might not be a bad idea. Do look at least at HTML::TokeParser::Simple; it may make things a lot easier and give you a level of "abstraction" (generality) that will be useful.)

BTW, I noticed that your example in this reply was referring to "$1", though your regexes did not contain any parens. That would be wrong.
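For the case mentioned above where "price" and "CAS" come from the same site, one possible "slightly more elaborate structure" is a hash of hashes: site name, then field name, then regex. The sketch below is an assumption about how that might look, not something from the thread. The site names, patterns, and the `%pages` hash (standing in for the crawler's fetch step) are all placeholders.

```perl
use strict;
use warnings;

# Placeholder sites, fields, and patterns.
my %regs = (
    'site1.example' => {
        price => qr{prices: <table>(.+?)</table>}is,
        cas   => qr{cas: <b>(\d+-\d{2}-\d+)</b>}is,
    },
    'site2.example' => {
        cas => qr{cas number (\d+-\d{2}-\d+)}i,
    },
);

# Canned content standing in for whatever the crawler downloads.
my %pages = (
    'site1.example' =>
        'CAS: <b>64-17-5</b> Prices: <table><tr><td>12.50</td></tr></table>',
    'site2.example' => 'CAS number 67-56-1, price on request',
);

my %found;    # site => { field => value }
for my $site (sort keys %regs) {
    my $pagecontent = $pages{$site};
    for my $field (sort keys %{ $regs{$site} }) {
        # List-context match: $value stays undef when the pattern fails,
        # so a stale $1 from an earlier match is never picked up.
        my ($value) = $pagecontent =~ $regs{$site}{$field};
        $found{$site}{$field} = $value if defined $value;
    }
}

for my $site (sort keys %found) {
    print "$site: $_ = $found{$site}{$_}\n" for sort keys %{ $found{$site} };
}
```

Adding a new site or field then means adding one entry to `%regs`; the loop itself does not change.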