Helan_Ahmed has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I am a new Perl user. I want to extract two values from an XML file and save them in a new file. The values are ssId and subSnpClass. I wrote the following code, but it just prints ssId twice and does not print subSnpClass. Any idea how to fix it? Thanks

my $Gene='ds_chY.xml';
my $filename = 'ID.txt';
my $ID;
my $subSnpClass;
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
print $fh "RS_SNPs\tsubSnpClass";
open(GTFFILE, $Gene) or die ("Cannot open the file");
while(<GTFFILE>){
    $_=~ s/^\s+//;
    if ($_= ~/ssId=\s*?(\S+)/) {
        $ID=$1;
        $ID =~ s/"//;
        chop $ID;
        print $fh "$ID\n";
    }
    if ($_= ~/subSnpClass=\s*?(\S+)/) {
        $subSnpClass=$1;
        $subSnpClass =~ s/"//;
        chop $subSnpClass;
        print $fh "\t$subSnpClass\n";
    }
}

Here is sample of my input file

===========================================================

<?xml version="1.0" encoding="UTF-8"?>
<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/docsum ftp://ftp.ncbi.nlm.nih.gov/snp/specs/docsum_3.4.xsd" specVersion="3.4" dbSnpBuild="144" generated="2015-05-26 09:54">
    <SourceDatabase taxId="9606" organism="human" gpipeOrgAbbr="hs"/>
    <Rs rsId="3894" snpClass="snp" snpType="notwithdrawn" molType="genomic" genotype="true" bitField="050028000005130500030100" taxId="9606">
        <Het type="est" value="0.05" stdError="0.1547"/>
        <Validation byCluster="true" byOtherPop="true" byHapMap="true" by1000G="true">
            <otherPopBatchId>7179</otherPopBatchId>
        </Validation>
        <Create build="36" date="2000-09-19 17:02"/>
        <Update build="144" date="2015-05-07 10:52"/>
        <Sequence exemplarSs="491581208" ancestralAllele="C,C">
            <Seq5>ATAAGCAAATAACTGAAGTTTAATCAGTCTCCTCCCAGCAAGTGATATGCAACTGAGATTCCTTATGACACATCTGAACACTAGTGGATTTGCTTTGTAGTAGGAACAAGGTACATTCGCGGGATAAATGTGGCCAAGTTTTATCTGCTGCCAGGGCTTTCAAATAGGTTGACCTGACAATGGGTCACCTCTGGGACTGA</Seq5>
            <Observed>C/T</Observed>
            <Seq3>AATTAGGAAGAGCTGGTACCTAAAATGAAAGATGCCCTTAAATTTCAGATTCACAATTTTTTTTTCTTAGTATAAGCATGTCCCATGTAATATCTGGGATATACTCATACCTTTAAAAATGTGCTCATTGTTTATCTGAAATTCACATTTTAACAGGGAACCATTGTTTTGTTATTGTTTATTGTTTTGTTTCTAAATAA</Seq3>
        </Sequence>
        <Ss ssId="3931" handle="OEFNER" batchId="489" locSnpId="M3" subSnpClass="snp" orient="forward" strand="bottom" molType="genomic" buildId="36" methodClass="DHPLC" validated="by-cluster">
            <Sequence>
                <Seq5>TAATCAGTCTCCTCCCAGCAAGTGATATGCAACTGAGATTCCTTATGACACATCTGAACACTAGTGGATTTGCTTTGTAGTAGGAACAAGGTACATTCGCGGGATAAATGTGGCCAAGTTTTATCTGCTGCCAGGGCTTTCAAATAGGTTGACCTGACAATGGGTCACCTCTGGGACTGA</Seq5>
                <Observed>C/T</Observed>
                <Seq3>AATTAGGAAGAGCTGGTACCTAAAATGAAAGATGCCCTTAAATTTCAGATTCACAATTTT</Seq3>
            </Sequence>
        </Ss>
        <Ss ssId="76536062" handle="AFFY" batchId="52074" locSnpId="AFFY_6_1M_SNP_A-8397107" subSnpClass="snp" orient="forward" strand="bottom" molType="genomic" buildId="130" methodClass="hybridize" validated="by-submitter">
            <Sequence>
                <Seq5>TCACCTCTGGGACTGA</Seq5>
                <Observed>C/T</Observed>
                <Seq3>AATTAGGAAGAGCTGG</Seq3>
            </Sequence>
        </Ss>

Replies are listed 'Best First'.
Re: Print multiple value from file
by kcott (Archbishop) on Aug 10, 2015 at 11:49 UTC

    G'day Helan_Ahmed,

    Welcome to the Monastery.

    The binding operator '=~' (see Binding Operators) must be written without internal whitespace: '=~' is correct; '= ~' is incorrect, because Perl reads it as an assignment ('=') followed by a bitwise negation ('~').

    Here's how Perl parses that part of the code you posted:

    $ perl -MO=Deparse,-p -e '$_= ~/ssId=\s*?(\S+)/'
    ($_ = (~/ssId=\s*?(\S+)/));
    -e syntax OK

    Removing the erroneous whitespace, and Perl now knows what you mean:

    $ perl -MO=Deparse,-p -e '$_=~/ssId=\s*?(\S+)/'
    ($_ =~ /ssId=\s*?(\S+)/);
    -e syntax OK

    Adding appropriate whitespace (around the operator) and both Perl and human readers know what you mean:

    $ perl -MO=Deparse,-p -e '$_ =~ /ssId=\s*?(\S+)/'
    ($_ =~ /ssId=\s*?(\S+)/);
    -e syntax OK

    [Aside: Please indicate updates at the place where the update was made. I see you mentioned the update elsewhere in the thread but, until I'd read that far, earlier responses to your original post made no sense. See "How do I change/delete my post?" for details and further explanation.]

    Update (additional information): As you're new to Perl, the 'perl -MO=Deparse ...' line possibly means little to you. It's explained in B::Deparse.

    — Ken
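
The Deparse output above also suggests why the posted script printed ssId twice. What follows is only a rough sketch, using a made-up one-line sample in the shape of the posted <Ss> tag: with '= ~', the first test overwrites $_ with a number, so the second regex has nothing left to match, yet the condition is still true and $1 still holds the earlier ssId capture. The variable names (@with_typo, @fixed, $clobbered) are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical one-line sample modelled on the <Ss> tag in the question.
my $line = '<Ss ssId="3931" handle="OEFNER" subSnpClass="snp" orient="forward">';

# With the typo: '$_ = ~/.../' matches against $_, bitwise-negates the
# result and assigns that number back to $_, destroying the input line.
my @with_typo;
$_ = $line;
if ($_ = ~/ssId=\s*?(\S+)/)        { push @with_typo, $1 }
if ($_ = ~/subSnpClass=\s*?(\S+)/) { push @with_typo, $1 }
my $clobbered = $_;    # no longer the original line

# With the fix: '=~' binds each regex to $_ and leaves $_ untouched.
my @fixed;
$_ = $line;
if ($_ =~ /ssId=\s*?(\S+)/)        { push @fixed, $1 }
if ($_ =~ /subSnpClass=\s*?(\S+)/) { push @fixed, $1 }

print "typo:  @with_typo\n";
print "fixed: @fixed\n";
```

The captures still carry their surrounding double quotes here, because \S+ grabs everything up to the next whitespace; that is why the original script followed up with s/"// and chop.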

Re: Print multiple value from file
by Laurent_R (Canon) on Aug 10, 2015 at 09:54 UTC
    You are using the same regex twice ($_= ~/ssId=\s*?(\S+)/), so you get the same result (ssId) twice. Change your second regex to something matching your subSnpClass.
Re: Print multiple value from file
by poj (Abbot) on Aug 10, 2015 at 09:54 UTC

    Second if block

    ##if ($_=~ /ssId=\s*?(\S+)/)
    if ($_=~ /subSnpClass=\s*?(\S+)/)
    poj

      Oh sorry, I made a syntax error. I updated the script, but even when I search for subSnpClass, it prints ssId twice.

        After corrections to these 2 lines
        # $_= ~ /ssId=\s*?(\S+)/
        $_ =~ /ssId=\s*?(\S+)/
        #if ($_= ~ /subSnpClass=\s*?(\S+)/)
        if ($_ =~ /subSnpClass=\s*?(\S+)/)

        I got this result

        RS_SNPs subSnpClass3931
                snp
        76536062
                snp
        maybe try this
        #!perl
        use strict;
        use warnings;

        my $Gene = 'ds_chY.xml';
        my $outfile = 'ID.txt';

        open my $out,'>',$outfile
          or die "Could not open file '$outfile' $!";
        print $out "RS_SNPs\tsubSnpClass\n";

        open my $in,'<',$Gene
          or die "Could not open file '$Gene' $!";

        while (<$in>){
          if ( /ssId=\s*"([^"]+)/ ){
            print $out "$1\t";
          }
          if ( /subSnpClass=\s*"([^"]+)/ ){
            print $out "$1\n";
          }
        }
        poj
Re: Print multiple value from file
by RichardK (Parson) on Aug 10, 2015 at 09:54 UTC

    That's because you search for 'ssId' twice! You'll have to change that if you want it to do something else.

    if ($_= ~/ssId=\s*?(\S+)/) { ... }
    if ($_= ~/ssId=\s*?(\S+)/) { ... }
Re: Print multiple value from file
by Monk::Thomas (Friar) on Aug 10, 2015 at 13:58 UTC

    Hello Helan

    To be frank: your posted code looks like you cobbled it together without really understanding what you're doing. I took the liberty of cleaning it up somewhat. Request: just have a look at it and see what could be done differently. The provided code makes some fundamental assumptions about your XML format and works with the provided input, but I really, really recommend using a true XML parser instead.

    #!/usr/bin/perl
    use strict;
    use warnings;
    # ^-- Just do it.

    my $Gene     = 'ds_chY.xml';
    my $filename = 'ID.txt';

    open my $fh_i, '<', $Gene
        or die "Cannot open the file '$Gene': $!";
    open my $fh_o, '>', $filename
        or die "Could not open file '$filename' $!";
    # ^-- use the same style of open() and error message
    #     for both file operations
    #     (I like to put chained and/or conditionals on the next line.)

    print {$fh_o} "RS_SNPs\tsubSnpClass\n";
    # ^- use braced filehandles in order to make it explicit we're
    #    writing to a file handle

    # either use explicit variable for looping:
    while (my $line = <$fh_i>) {
        $line =~ s/^\s+//;    # this is probably irrelevant

        # don't write to file immediately - wait a bit
        my ($ssId, $subSnpClass);

        if ( $line =~ /ssId=["](\S*)["]\s/ ) {    # regexp optimization
            $ssId = $1;                           # -> no need to s/// and chop
        }
        if ( $line =~ /subSnpClass=["](\S*)["]/ ) {
            $subSnpClass = $1;
        }

        # ...in order to perform a small sanity check
        if (defined $ssId and defined $subSnpClass) {
            print {$fh_o} "$ssId\t$subSnpClass\n";
        }
        elsif (not defined $ssId and not defined $subSnpClass) {
            # nothing
        }
        else {
            print "Data inconsistency detected in line $line\n";
        }
    }

    # ... or use implicit variable $_:
    while (<$fh_i>) {
        s/^\s+//;

        my ($ssId, $subSnpClass);

        if ( /ssId=["](\S*)["]\s/ ) {
            $ssId = $1;
        }
        if ( /subSnpClass=["](\S*)["]/ ) {
            $subSnpClass = $1;
        }

        if (defined $ssId and defined $subSnpClass) {
            print {$fh_o} "$ssId\t$subSnpClass\n";
        }
        elsif (not defined $ssId and not defined $subSnpClass) {
            # nothing
        }
        else {
            print "Data inconsistency detected in line ", $_, "\n";
        }
    }

    (The second while loop is useless, because there's no more stuff to be read. It's just an example how the loop could be written in a different manner.)
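
The "regexp optimization" comment in the code above amounts to putting the quotes in the pattern so that the capture needs no cleanup afterwards. A minimal sketch of that idea follows. It is an assumption-laden variation, not the code from this reply: it uses the character class [^"]* (as poj's reply does) instead of \S*, anchors the names with \b, and runs on two made-up lines in the shape of the posted input.

```perl
use strict;
use warnings;

# Hypothetical sample lines modelled on the input in the question.
my $ss_line     = '<Ss ssId="3931" handle="OEFNER" subSnpClass="snp" orient="forward">';
my $create_line = '<Create build="36" date="2000-09-19 17:02"/>';

# [^"]* means "anything up to the closing quote", so the capture needs
# no later s/// or chop.
my ($ssId)        = $ss_line =~ /\bssId="([^"]*)"/;
my ($subSnpClass) = $ss_line =~ /\bsubSnpClass="([^"]*)"/;

# \S* cannot cross the space inside this value, so that match fails;
# [^"]* captures the whole value.
my ($date_nonspace) = $create_line =~ /\bdate="(\S*)"/;
my ($date_quoted)   = $create_line =~ /\bdate="([^"]*)"/;

print "$ssId\t$subSnpClass\n";
print defined $date_nonspace ? "\\S* matched\n" : "\\S* did not match\n";
print "$date_quoted\n";
```

Neither ssId nor subSnpClass contains whitespace in the sample, so both patterns behave the same there; the difference only shows up on attributes such as date.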

    Please do NOT use:

    while (<$fh_i>) {
        if ($_ =~ ...)
        ...

    ...because it's a bit like wearing your pants backwards. Sure you can do it and it keeps your legs warm, but everyone will laugh at you behind your back.

      Thanks a lot, it really helped me. Could you also help me print the content of <Observed>C/T</Observed>, which is C/T for all of them?

        Sorry, but no. I won't do that. As I already wrote: This is just meant as an example to improve your understanding of Perl.

        If you want more functionality then please implement it on your own. This way you also have a better understanding of what goes wrong when it goes wrong. (And since the input format is XML I'm very confident something will go wrong.) You really should not use regular expressions to parse XML - use a real parser instead.
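
For readers wondering what "a real parser" might look like here, the following is a minimal sketch only. It assumes the CPAN module XML::LibXML (the thread does not name a particular parser), and it runs on a cut-down, made-up document that keeps just the default namespace and the elements of interest from the posted sample. The prefix 'd' and the variable names are arbitrary choices for this example.

```perl
use strict;
use warnings;
use XML::LibXML;                  # CPAN module; an assumption, not named in the thread
use XML::LibXML::XPathContext;

# Cut-down, hypothetical document in the shape of the posted sample.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<ExchangeSet xmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum">
  <Rs rsId="3894" snpClass="snp">
    <Ss ssId="3931" handle="OEFNER" subSnpClass="snp">
      <Sequence>
        <Observed>C/T</Observed>
      </Sequence>
    </Ss>
    <Ss ssId="76536062" handle="AFFY" subSnpClass="snp">
      <Sequence>
        <Observed>C/T</Observed>
      </Sequence>
    </Ss>
  </Rs>
</ExchangeSet>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# The elements live in a default namespace, so XPath needs a prefix for it.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(d => 'http://www.ncbi.nlm.nih.gov/SNP/docsum');

my @rows;
for my $ss ($xpc->findnodes('//d:Ss')) {
    push @rows, join "\t",
        $ss->getAttribute('ssId'),
        $ss->getAttribute('subSnpClass'),
        $xpc->findvalue('d:Sequence/d:Observed', $ss);
}

print "ssId\tsubSnpClass\tObserved\n";
print "$_\n" for @rows;
```

Because the parser works on elements and attributes rather than on lines, this approach does not depend on how the file happens to be wrapped. For the real file, load_xml(location => ...) would be the natural variant, although a very large dump may call for a streaming reader instead.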

Re: Print multiple value from file
by Helan_Ahmed (Initiate) on Aug 10, 2015 at 10:06 UTC

    Oh sorry, I made a syntax error. I updated the script, but even when I search for subSnpClass, it prints ssId twice.

      Please show us a sample of your input file.

        I added a sample of the input file to my post. Thanks