in reply to Help regarding regular expression

Regular expressions really aren't the most reliable way to parse XML or HTML. Line breaks can show up anywhere so it is entirely possible to have one of your dd tags split across two lines. You will get the most reliable results if you use one of the many XML or HTML parsers available on CPAN, for example, XML::Twig. These will let you pick out specific elements and their attributes (e.g. the value of attribute "term" for the <dd> element).
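To give a feel for it, here is a minimal sketch of the XML::Twig approach. It assumes input shaped roughly like the sample string below; the `<dl>` wrapper, the attribute values, and the helper name `dd_terms` are made up for illustration, so adjust them to whatever your real file looks like. Note that the second `<dd>` tag is deliberately split across two lines, and the parser does not care.

```perl
use strict;
use warnings;
use XML::Twig;    # CPAN module, not part of core Perl

# Hypothetical sample input; the real file's layout may differ.
my $xml = <<'END_XML';
<dl>
  <dd term="GO:0000001">first entry</dd>
  <dd
     term="GO:0000002">tag split across two lines</dd>
  <dd term="GO:0000001">repeat of the first entry</dd>
</dl>
END_XML

# Return the "term" attribute of every <dd> element, in document order.
sub dd_terms {
    my ($xml_text) = @_;
    my @terms;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Called once for each complete <dd> element.
            dd => sub {
                my ($t, $elt) = @_;
                my $term = $elt->att('term');
                push @terms, $term if defined $term;
            },
        },
    );
    $twig->parse($xml_text);
    return @terms;
}

print "$_\n" for dd_terms($xml);
```

For a file on disk, `$twig->parsefile($filename)` can be used in place of `parse`.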

To make sure you have one and only one of each number, use a hash. Each time you retrieve a new number, add it as a hash key. When you are done parsing the file, you can extract the list of keys from the hash (see keys) and you will have a unique list of numbers. The pseudo-code looks like this:

    # declare your hash
    my %hGoIds;

    # read in the file using XML::Twig or line by line
    # if you must. Now for each number you find:
    $hGoIds{$go_id} = 1;

    # when you have read in all of the lines and found all of
    # the numbers, extract your keys
    my @aKeys = keys %hGoIds;
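A runnable version of the same idea might look like the sketch below. The sample IDs and the helper name `unique_go_ids` are placeholders for illustration; in real use the list would come from whatever your parser found.

```perl
use strict;
use warnings;

# Return each distinct ID once. Hash keys are unique, so storing an
# ID a second time simply overwrites the earlier entry.
sub unique_go_ids {
    my @ids = @_;
    my %hGoIds;
    $hGoIds{$_} = 1 for @ids;

    # keys returns the IDs in no particular order, so sort them
    # to get stable output.
    my @sorted = sort keys %hGoIds;
    return @sorted;
}

# Hypothetical IDs standing in for the ones found while parsing.
my @found = ('GO:0000002', 'GO:0000001', 'GO:0000002');
my @aKeys = unique_go_ids(@found);
print "@aKeys\n";
```

The value stored under each key (1 here) does not matter; only the keys are used.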

For more information on hashes, see perldata and search for the word hash (for some reason it isn't in the table of contents).

Best, beth

Re^2: Help regarding regular expression
by chavanak (Initiate) on Aug 06, 2009 at 11:27 UTC
    Hi, I am really sorry, but I am not that adept at using HTML::Parser. Also, the problem is that the text I want to extract contains letters, a colon, and a number (e.g. GO:1234567). Since I am not adept at HTML or XML parsing, I was trying a regex. I have never used Perl before; I have only been using it for two days :(

      I'm guessing from the material you posted that you will probably be doing a lot more parsing of gene data in XML format over the next few weeks or months, so it is *well* worth your while to learn the correct tools. It isn't as hard as you think, and there are *many* people here to help you, including some who are also doing gene research! The beauty of modules like XML::Twig is that you don't actually need to know how to parse XML yourself, since the module does the parsing for you. You just need to learn how to start the process and use the results.

      So I'd start instead by looking up XML::Twig, reading the documentation, and asking any questions you have, either here or in a new thread. If you decide to stay with this thread, you might want to update your original post to indicate the change of strategy. It would also be a good idea to change the title to something like "Using XML::Twig to parse gene data". Such a title would do a better job of attracting the right people to help you.

      If you decide to start a new node, be sure to update Help regarding regular expression with an explanation of your change in strategy and a link to the new node. Also, in the new node, link back to this node so that people understand the whole context of the discussion (you'll get much better advice that way). To link to nodes within PerlMonks, you can use [id://NNNN], where NNNN is the node id of the post (that's the number in the left column on Nodes You Wrote). The title of the node will be displayed automatically.

      I don't recommend asking your questions in reply to this node. People will be less likely to see a deeply nested node, so you won't get the widest help.

      If you have general questions about how to use CPAN or modules, you can also ask in the chatterbox (sidebar to the right). You can also get boatloads of information about XML::Twig (and any other module) using this link: cpan module search. It lets you find all of the PerlMonks nodes (questions, answers, tutorials) that discuss the module you are interested in learning how to use.

      Best, beth

      I agree with what ELISHEVA says, and add that unless your XML/HTML is extremely trivial it will be much less effort to learn the appropriate modules than it will be to write robust regular expressions. It is easy to start writing the regular expressions but parsing XML and HTML is much more complex than it first appears and regular expressions are not up to the task.
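As a small illustration of how a line-by-line regular expression can go wrong, the sketch below scans hypothetical input in which one `<dd>` tag is split across two lines. The sample text, the pattern, and the helper name `terms_by_line_regex` are all assumptions made for the demonstration, not anything taken from the original poster's data.

```perl
use strict;
use warnings;

# Hypothetical input: the second <dd> tag is split across two lines.
my $text = <<'END_TEXT';
<dd term="GO:0000001">first entry</dd>
<dd
   term="GO:0000002">second entry</dd>
END_TEXT

# Naive approach: look for <dd term="..."> one line at a time.
sub terms_by_line_regex {
    my ($input) = @_;
    my @terms;
    for my $line (split /\n/, $input) {
        while ($line =~ /<dd\s+term="([^"]*)"/g) {
            push @terms, $1;
        }
    }
    return @terms;
}

my @terms = terms_by_line_regex($text);
print scalar(@terms), " of 2 terms found: @terms\n";
```

The split tag is missed because neither of its two lines matches the pattern on its own. Matching against the whole file at once would fix this particular case, but comments, entities, attribute order, and quoting styles each need yet another special case, which is the work a real parser already does.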