chavanak has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

Update: I have decided to use HTML::Parser to get this work done. Please see Using XML::Twig | HTML::Parser to help me with my other queries.

I have to write a program to extract certain strings from a text file. The text file looks like this:
dd class="xref">GeneID:947412</dd>
<dd class="xref">GenomeReviews:AP009048_GR</dd>
<dd class="xref">GenomeReviews:U00096_GR</dd>
<dd class="xref"><a href="http://amigo.geneontology.org/cgi-bin/amigo/term-details.cgi?term=GO:0005524">GO:0005524</a></dd>
<dd class="xref"><a href="http://amigo.geneontology.org/cgi-bin/amigo/term-details.cgi?term=GO:0005886">GO:0005886</a></dd>
<dd class="xref"><a href="http://amigo.geneontology.org/cgi-bin/amigo/term-details.cgi?term=GO:0006810">GO:0006810</a></dd>
<dd class="xref"><a href="http://amigo.geneontology.org/cgi-bin/amigo/term-details.cgi?term=GO:0016301">GO:0016301</a></dd>
<dd class="xref">HOGENOM:P27254</dd>
I want to write a regular expression that extracts only the "GO:" identifiers (GO: followed by digits) from this text, and each identifier should be extracted only once. The extracted strings have to be stored in an array so I can pass them to another search query. I tried this code: $line1 =~ m/"^\s+(.+)(GO:)(\d+).+"/g; and it is not working. Can anyone help me out?

Replies are listed 'Best First'.
Re: Help regarding regular expression
by moritz (Cardinal) on Aug 06, 2009 at 09:39 UTC
    The regex won't match because there are quotes in there that don't appear in the input.

    This seems to work:

    use strict;
    use warnings;
    while (<DATA>) {
        print "$1\n" if m/GO:(\d+)/
    }
    __DATA__
    dd class="xref">GeneID:947412</dd>
    <dd class="xref">GenomeReviews:AP009048_GR</dd>
    <dd class="xref">GenomeReviews:U00096_GR</dd>
    <dd class="xref"><a href="http://amigo.geneontology.org/cgi-bin/amigo/term-details.cgi?term=GO:0005524">GO:0005524</a></dd>
    <dd class="xref"><a href="http://amigo.geneontology.org/cgi-bin/amigo/term-details.cgi?term=GO:0005886">GO:0005886</a></dd>
    <dd class="xref"><a href="http://amigo.geneontology.org/cgi-bin/amigo/term-details.cgi?term=GO:0006810">GO:0006810</a></dd>
    <dd class="xref"><a href="http://amigo.geneontology.org/cgi-bin/amigo/term-details.cgi?term=GO:0016301">GO:0016301</a></dd>
    <dd class="xref">HOGENOM:P27254</dd>
      Hi, thanks a lot for the reply. The code seems to work, but it gives no output when I read the input from another file. Here is my full code:
      use LWP::Simple;
      my $query = shift(@ARGV);
      die "Dead!!" unless (open(IN,">/home/vivek/Desktop/test12.txt"));
      $content = get("http://amigo.geneontology.org/cgi-bin/amigo/search.cgi?query=$query;search_constraint=gp;action=query;view=query/");
      die "Couldn’t get the website!" unless defined $content;
      print IN "$content";
      while(IN) {
          my $line1;
          print "entered whileloop";
          @line1 =~ "$1\n" if m/GO:(\d+)/;
          print "@line1\n";
          close(IN);
      }
      Can you please tell me where I am going wrong?
        You open IN for writing, and later try to read from it - that won't work. Also reading from a file is done with <IN> (note the angle brackets).

        You also match against @line1 which you never initialized.
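A minimal sketch of the fix described above, assuming the page text is already held in a string: the file is opened with ">" for writing, closed, then reopened with "<" and read line by line through the angle-bracket operator. The sample $content, the temporary file, and the helper name save_and_scan are illustrative assumptions, not part of the original script, which would obtain the text from LWP::Simple's get().

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the page that get() would return (an assumption made so
# the sketch runs without network access).
my $content = join "\n",
    '<dd class="xref">GeneID:947412</dd>',
    '<dd class="xref"><a href="http://amigo.geneontology.org/cgi-bin/amigo/term-details.cgi?term=GO:0005524">GO:0005524</a></dd>',
    '<dd class="xref">HOGENOM:P27254</dd>',
    '';

# Hypothetical helper: write the text to $path, then read it back and
# collect the first GO id found on each line.
sub save_and_scan {
    my ($text, $path) = @_;

    # ">" opens the file for writing only.
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print {$out} $text;
    close($out);

    # "<" opens it for reading; <$in> returns one line per iteration.
    open(my $in, '<', $path) or die "Cannot read $path: $!";
    my @found;
    while (my $line = <$in>) {
        push @found, $1 if $line =~ /(GO:\d+)/;
    }
    close($in);    # close after the loop, not inside it
    return @found;
}

my (undef, $path) = tempfile(UNLINK => 1);
my @ids = save_and_scan($content, $path);
print "$_\n" for @ids;
```

If the file is only a debugging aid, the round trip may be unnecessary: the same match could be applied to $content directly.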

Re: Help regarding regular expression
by ELISHEVA (Prior) on Aug 06, 2009 at 10:48 UTC

    Regular expressions really aren't the most reliable way to parse XML or HTML. Line breaks can show up anywhere, so it is entirely possible to have one of your dd tags split across two lines. You will get the most reliable results if you use one of the many XML or HTML parsers available on CPAN, for example, XML::Twig. These will let you pick out specific elements and their attributes (e.g. the text of the <a> element inside each <dd class="xref"> element, or its "href" attribute).

    To make sure you have one and only one of each number, use a hash. Each time you retrieve a new number, add it as a hash key. When you are done parsing the file, you can extract the list of keys from the hash (see keys) and you will have a unique list of numbers. The pseudo-code looks like this:

    # declare your hash
    my %hGoIds;

    # read in the file using XML::Twig or line by line
    # if you must. Now for each number you find:
    $hGoIds{$go_id} = 1;

    # when you have read in all of the lines and found all of
    # the numbers, extract your keys
    my @aKeys = keys %hGoIds;
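The pseudo-code above can be sketched as a runnable script. This is only an illustration under stated assumptions: the @found list stands in for ids already pulled out of the file, and the helper name unique_ids is hypothetical. Because keys returns hash keys in no particular order, the sketch sorts them to get a stable result.

```perl
use strict;
use warnings;

# Assumed input: GO ids as they might come out of the file, with
# repeats (each id appears in both the href and the link text).
my @found = qw(GO:0005524 GO:0005524 GO:0005886 GO:0005886 GO:0005524);

# Hypothetical helper: use each id as a hash key so duplicates
# collapse, then return the keys in sorted order.
sub unique_ids {
    my @ids = @_;
    my %seen;
    $seen{$_} = 1 for @ids;
    my @unique = sort keys %seen;
    return @unique;
}

my @unique = unique_ids(@found);
print "$_\n" for @unique;
```

Sorting is a design choice for predictable output; if the order of first appearance matters, a separate array filled only when a key is new would preserve it.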

    For more information on hashes, see perldata and search for the word hash (for some reason it isn't in the table of contents).

    Best, beth

      Hi, I am really sorry, but I am not that adept at using HTML::Parser. Also, the problem is that the text I want to extract contains letters, a colon, and digits (e.g. GO:1234567). Since I am not adept at HTML or XML parsing, I was trying a regex. I have never used Perl before; I have only been using it for two days :(

        I'm guessing from the material you posted, you will probably be doing a lot more parsing of gene data in XML format over the next few weeks, months(?), so it is *well* worth your while to learn the correct tools. It isn't as hard as you think, and there are *many* people to help you here, including some who are also doing gene research! The beauty of modules like XML::Twig is that you don't actually need to know how to parse HTML since it does the parsing for you. You just need to learn how to start the process and use the results.

        So I'd start instead by looking up XML::Twig, reading the documentation, and asking about any questions you have here or on a new thread. If you decide to stay with this thread, you might want to update your original post to indicate the change of strategy. Also it would be a good idea to change the title to something like "Using XML::Twig to parse gene data". Such a title would do a better job of attracting the right people to help you.

        If you decide to start a new node, be sure to update Help regarding regular expression with an explanation of your change in strategy and a link to the new node. Also, in the new node, link back to this node so that people understand the whole context of the discussion (you'll get much better advice that way). To link to nodes within PerlMonks, you can use [id://NNNN] where NNNN is the node id of the post (that's the number in the left column on Nodes You Wrote). The title of the node will be displayed automatically.

        I don't recommend asking your questions in reply to this node. People will be less likely to see a deeply nested node, so you won't get the widest help.

        If you have general questions about how to use CPAN or modules, you can also ask in the chatterbox (sidebar to the right). You can also get boatloads of information about XML::Twig (and any other module) using this link: cpan module search. It lets you find all of the PerlMonks nodes (questions, answers, tutorials) that discuss the module you are interested in learning how to use.

        Best, beth

        I agree with what ELISHEVA says, and add that unless your XML/HTML is extremely trivial it will be much less effort to learn the appropriate modules than it will be to write robust regular expressions. It is easy to start writing the regular expressions but parsing XML and HTML is much more complex than it first appears and regular expressions are not up to the task.

Re: Help regarding regular expression
by merlponk (Scribe) on Aug 06, 2009 at 09:50 UTC
    push(@a, $1) if ($line =~ /(GO:\d+)/);
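One caveat, sketched here under assumptions: the statement above stores at most one id per line, so if several links land on the same line the later ids are missed. A match with the /g modifier in list context returns every capture at once. The sample $line and the helper name go_ids_in are illustrative, not taken from the thread.

```perl
use strict;
use warnings;

# Assumed sample: two links on a single line, as can happen when the
# HTML has no line breaks between <dd> elements.
my $line = '<dd class="xref"><a href="http://amigo.geneontology.org/cgi-bin/amigo/term-details.cgi?term=GO:0005524">GO:0005524</a></dd> '
         . '<dd class="xref"><a href="http://amigo.geneontology.org/cgi-bin/amigo/term-details.cgi?term=GO:0005886">GO:0005886</a></dd>';

# Hypothetical helper: in list context, m//g returns all captured
# substrings, not just the first one.
sub go_ids_in {
    my ($text) = @_;
    my @matches = $text =~ /(GO:\d+)/g;
    return @matches;
}

my @a = go_ids_in($line);
print "@a\n";
```

Each id shows up twice here (once from the href, once from the link text), so the result would still need de-duplicating before being passed to another query.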
      Still not working :( I really don't know what I am doing wrong here:
      use LWP::Simple;
      use warnings;
      #use strict;
      my $query = shift(@ARGV);
      #print "$query\n";
      die "waste fellow dnt know hw to write a program" unless (open(IN,">/home/vivek/Desktop/test12.txt"));
      my $content = get("http://amigo.geneontology.org/cgi-bin/amigo/search.cgi?query=$query;search_constraint=gp;action=query;view=query/");
      die "Couldn’t get the website!" unless defined $content;
      print IN "$content";
      print "At the whileloop\n";
      while(IN) {
          push(my @a, $1) if (my $line =~ /(GO:\d+)/);
          print "Reached";
          print "@a\n";
          close(IN);
      }
      Printing that array gives me nothing.