in reply to Help regarding regular expression
Regular expressions really aren't the most reliable way to parse XML or HTML. Line breaks can show up anywhere, so it is entirely possible for one of your <dd> tags to be split across two lines. You will get the most reliable results if you use one of the many XML or HTML parsers available on CPAN, for example XML::Twig. These let you pick out specific elements and their attributes (e.g. the value of the attribute "term" on the <dd> element).
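To make that concrete, here is a minimal sketch of pulling the "term" attribute off each <dd> element with XML::Twig. The sample document and its ID values are made up for illustration, since the original poster's file layout isn't shown here; the first <dd> tag is deliberately split across two lines to show that the parser doesn't care.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical input: the real file layout is an assumption.
# Note the first <dd> start tag is split across two lines.
my $xml = <<'XML';
<dl>
  <dd
      term="GO:0000001">first</dd>
  <dd term="GO:0000002">second</dd>
</dl>
XML

my @terms;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once for each complete <dd> element.
        dd => sub {
            my ($t, $elt) = @_;
            my $term = $elt->att('term');
            push @terms, $term if defined $term;
        },
    },
);
$twig->parse($xml);    # use $twig->parsefile($path) for a file on disk

print "$_\n" for @terms;
```

A line-by-line regex would have to special-case the split tag; the handler above sees the element only once it has been parsed as a whole.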
To make sure you have one and only one of each number, use a hash. Each time you retrieve a new number, add it as a hash key. When you are done parsing the file, you can extract the list of keys from the hash (see keys) and you will have a unique list of numbers. The pseudo-code looks like this:
# declare your hash
my %hGoIds;

# read in the file using XML::Twig or line by line
# if you must. Now for each number you find:
$hGoIds{$go_id} = 1;

# when you have read in all of the lines and found all of
# the numbers, extract your keys
my @aKeys = keys %hGoIds;
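If it helps, here is the same idea as a small runnable sketch using only core Perl. The list in @found is made-up sample data standing in for whatever your parser hands you; the sort is my addition, because keys returns the keys in no particular order.

```perl
use strict;
use warnings;

# Made-up sample IDs containing duplicates; in practice these
# would come from the parser, one at a time.
my @found = qw(0005 0001 0005 0003 0001);

# Each ID becomes a hash key; repeats simply overwrite the same key.
my %hGoIds;
$hGoIds{$_} = 1 for @found;

# keys returns the keys in no particular order, so sort for stable output.
my @aKeys = sort keys %hGoIds;

print "$_\n" for @aKeys;
```

The value stored (1 here) doesn't matter; only the keys are used. If you ever want to know how often each number appeared, you could store a count with $hGoIds{$_}++ instead.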
For more information on hashes, see perldata and search for the word hash (for some reason it isn't in the table of contents).
Best, beth
Replies are listed 'Best First'.
Re^2: Help regarding regular expression
by chavanak (Initiate) on Aug 06, 2009 at 11:27 UTC
by ELISHEVA (Prior) on Aug 06, 2009 at 11:41 UTC
by ig (Vicar) on Aug 06, 2009 at 12:22 UTC