Herman,
I started to reply to this node about 3 times and stopped. Are you going to be creating a bunch of hashes that hold a single record or would you like an Array of Hashes that hold all the records?
In any case - what you are going to want to do is figure out how to split your data into key and value pieces. You have three things to split on - ": ", " is ", and " at". You are going to want to limit the split to just two pieces since the rest of the string may duplicate the delimiter.
my ($key, $val) = split /: | is | at /, $_, 2;
$info{$key} = $val;
This has some problems. You are assuming you will not encounter the same key name twice in the same data file - that would overwrite the first one. You are also not building the hash with a single regex like you asked, but iterating over the data splitting each line into two pieces.
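A minimal sketch of the limited split, assuming the three delimiters named above. The sample lines (including the "note" line) are made up for illustration. Note that the pattern has no capturing parentheses, since with captures split would also return the delimiter itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines; the third value contains a delimiter (" at ").
my @lines = (
    "name is Doug",
    "eyes: brown",
    "note: meet me at noon",
);

my %info;
for my $line (@lines) {
    # The limit of 2 stops the split after the first delimiter found,
    # so the rest of the line stays intact in $val.
    my ($key, $val) = split /: | is | at /, $line, 2;
    next unless defined $val;    # skip lines with no delimiter
    warn "duplicate key '$key'\n" if exists $info{$key};
    $info{$key} = $val;
}

print "$_ => $info{$_}\n" for sort keys %info;
```

Without the limit, the "note" line would be cut into three pieces and the value would lose everything after " at ".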
It seems to me that if you were processing a lot of these types of records at once you would want to use an AoH.
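One way an AoH might be built, as a rough sketch. It assumes the records arrive in one string separated by blank lines, which is only a guess at the input layout, and it reuses the sample data from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed input layout: records separated by blank lines.
my $data = <<'END';
name is Doug
eyes: brown
email at bill@hotmail.com

name is Fred
eyes: black
email at fflinstone@hotmail.com
END

my @records;    # the array of hashes
for my $chunk (split /\n{2,}/, $data) {
    my %info;
    for my $line (split /\n/, $chunk) {
        my ($key, $val) = split /: | is | at /, $line, 2;
        $info{$key} = $val if defined $val;
    }
    # Each pass creates a fresh %info, so every reference is distinct.
    push @records, \%info if %info;
}

print "$_->{name}: $_->{email}\n" for @records;
```

Because each record gets its own hash, a repeated key such as "name" no longer overwrites the value from an earlier record.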
If you can give a little bit more context of what you are trying to accomplish, I would be glad to help further.
Cheers - L~R
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my $first = <<'END';
name is Doug
eyes: brown
email at bill@hotmail.com
END
my $second = <<'END';
name is Fred
eyes: black
era: prehistoric
email at fflinstone@hotmail.com
END
sub parse {
    my $text = shift;
    my %info = ();
    while ($text =~ /
        ^       # start of line
        (\w+)   # a word
        [:\s]   # colon or space
        .*      # anything
        \s      # a space
        (\S+)   # a sequence of non spaces
        $       # till the end of line
        /xmg) {
        $info{$1} = $2;
    }
    return \%info;
}
for ($first, $second) {
    print Data::Dumper->Dump([parse($_)], ['info']),"\n";
}
__END__
$info = {
          'eyes' => 'brown',
          'email' => 'bill@hotmail.com',
          'name' => 'Doug'
        };
$info = {
          'eyes' => 'black',
          'era' => 'prehistoric',
          'email' => 'fflinstone@hotmail.com',
          'name' => 'Fred'
        };
You can assign the result from parse() to a hash reference and eventually build the structure that suits your needs best.
Did you consider using split?
while (<INPUT>) {
    chomp;  # drop the trailing newline so it does not end up in $value
    my ($key, $value) = split(/ is |: | at /);
    # save $key and $value; or print them; or ...
}
CountZero
Update: Beaten by L~R
"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
#!/usr/bin/perl -lw
use Data::Dumper;
my %info;
while(<DATA>){
    $info{$1}=$2 if /(\w+)[ :](?:is|at)? (.*)/;
}
print Dumper(\%info);
__DATA__
name is Doug
eyes: brown
email at bill@hotmail.com
##### Output with each dataset:
#Fred Flintstone
$VAR1 = {
          'email' => 'fflinstone@hotmail.com',
          'name' => 'Fred',
          'eyes' => 'black',
          'era' => 'prehistoric'
        };
#Doug
$VAR1 = {
          'email' => 'bill@hotmail.com',
          'name' => 'Doug',
          'eyes' => 'brown'
        };
Addendum: Updated regex to get rid of space at beginning of data.
Hmm... this seems a lot like homework. Yes, it is possible and not very difficult. What have you tried so far?
-Robert
Hopefully there will be a witty sig here at some point
The actual application is this.
I have created a web crawler that searches various sites, retrieves data (names, prices, etc.), and stores it in a database, and I have another script that compares prices for similar items. Whenever I get to the part of the script that downloads the actual product pages, my code turns into something like this:
$thisHash('price') = $1 if $pageContent =~ s/prices: \<table\>.+?\<\/table\>//s
$thisHash('cas') = $1 if $pageContent =~ s/CAS: \<b\>\d+-\d{2}-\d+\<\/b\>//s
etc etc etc.
I would like to do something along the lines of:
$pageContent =~ /(CAS: \<b>\d+-\d{2}-\d+\<\/b\>|).*?(prices: \<table\>.+?\<\/table\>|)/s
and then assign the captured values to a hash in an elegant manner.
Sorry, I should have been much more clear in the first place.
-Herman
Yes, this is quite a different sort of problem from the initial example that started this thread. If you have a small number of distinct sites that you're scanning, and you are reasonably confident that each site has its own pattern that it follows consistently, then you can try keeping the appropriate regexes for price extraction in its own hash, keyed by web-site name -- something like:
my %regs = ( "site1.com" => [ "price", qr{prices: <table>(.+?)</table>}is ],
             "site2.com" => [ "cas", qr{cas: <b>(\d+-\d{2}-\d+)}is ],
             ...
           );
...
foreach my $site (keys %regs) {
    ... # fetch data into $pagecontent...
    my ($key,$reg) = @{ $regs{$site} };  # dereference this site's [key, regex] pair
    $thisHash{$key} = $1 if $pagecontent =~ $reg;
}
This at least makes it easier to keep track of each site's peculiarities, and to limit the number of executable lines you need to actually work through all the sites. (Maybe you need a slightly more elaborate structure, if you're pulling "price" and "CAS" from the same site; maybe you can see the way to go given this example.)
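For the "slightly more elaborate structure" case, one possibility is a hash of hashes with one regex per field per site. This is only a sketch: the site name, the patterns, and the %pages stand-in for downloaded content are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical site name and patterns: one regex per field per site.
my %regs = (
    "site1.com" => {
        price => qr{prices:\s*<table>(.+?)</table>}is,
        cas   => qr{CAS:\s*<b>(\d+-\d{2}-\d+)</b>}i,
    },
);

# Stand-in for downloaded pages; a real crawler would fetch these.
my %pages = (
    "site1.com" =>
        "CAS: <b>64-17-5</b> prices: <table><tr><td>9.99</td></tr></table>",
);

my %found;
for my $site (sort keys %regs) {
    my $pagecontent = $pages{$site};
    for my $field (sort keys %{ $regs{$site} }) {
        # A list-context match returns the captures directly, so a
        # failed match cannot pick up a stale $1 from an earlier one.
        my ($value) = $pagecontent =~ $regs{$site}{$field};
        $found{$site}{$field} = $value if defined $value;
    }
}

for my $site (sort keys %found) {
    print "$site $_ = $found{$site}{$_}\n" for sort keys %{ $found{$site} };
}
```

Adding a new site or field then means adding one entry to %regs rather than another executable line.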
I can't imagine doing this any more compactly, since it depends heavily on specific knowledge about how each site formats its price lists, etc. It would be hard to generalize any further unless all the sites somehow managed to do roughly the same thing to present their information, which seems implausible. (It's not always a given that you can use regexes for this sort of thing at all -- many folks here would suggest that you use HTML::TokeParser or somesuch, which might not be a bad idea... Do look at least at HTML::TokeParser::Simple; it may make things a lot easier and give you a level of "abstraction" (generality) that will be useful.)
BTW, I noticed that your example in this reply was referring to "$1", though your regexes did not contain any parens. That would be wrong.
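As a rough sketch of what the capturing version could look like, here is a single match whose captures go straight into a hash slice. The page fragment is invented, and the pattern assumes the CAS number always appears before the price table.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical page fragment standing in for downloaded content.
my $pageContent =
    'CAS: <b>64-17-5</b> ... prices: <table><tr><td>9.99</td></tr></table>';

my %thisHash;
# Each pair of parentheses is one capture. In list context the match
# returns the captures in order, so a hash slice can take them at once.
@thisHash{qw(cas price)} = $pageContent =~ m{
    CAS:    \s* <b> (\d+-\d{2}-\d+) </b>    # capture 1: CAS number
    .*?
    prices: \s* <table> (.+?) </table>      # capture 2: table contents
}xs;

print "cas = $thisHash{cas}\n"     if defined $thisHash{cas};
print "price = $thisHash{price}\n" if defined $thisHash{price};
```

If the match fails, both keys are created with undefined values, hence the defined checks before printing.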