Parse XML with Perl regex

ad23 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to parse an XML file with Perl regexes (I know about the XML::Parse and ::Twig modules, but using regex is a requirement).

My XML document looks like this:

<?xml version="1.0"?>
<t_volume>
        <info>
            <info_name>FZGA34177.b1</info_name>
            <center_project>4085729</center_project>
            <base_file>SETARIA_ITALICA/JGI/fasta/FZGA34177.b1.fasta</base_file>
            <it_size>35000</it_size>
            <it_stdev>3500</it_stdev>
            <plate_id>357</plate_id>
            <program_id>KB 1.3.0</program_id>
            <seq_lib_id>FZGA</seq_lib_id>
            <project_id>32913</project_id>
            <info_archive>
                <ti>2167749207</ti>
                <taxid>4555</taxid>
        <basecall_length>899</basecall_length>
                <state>active</state>
            </info_archive>
        </info>
<info>
            <info_name>FZGA34177.b1</info_name>
            <center_project>4085729</center_project>
            <base_file>SETARIA_ITALICA/JGI/fasta/FZGA34177.b1.fasta</base_file>
            <it_size>35000</it_size>
            <it_stdev>3500</it_stdev>
            <plate_id>357</plate_id>
            <program_id>KB 1.3.0</program_id>
            <seq_lib_id>FZGA</seq_lib_id>
            <project_id>32913</project_id>
            <info_archive>
                <ti>2167749207</ti>
                <taxid>4555</taxid>
        <basecall_length>899</basecall_length>
                <state>active</state>
            </info_archive>
        </info>
<info>
            <info_name>FZGA34177.b1</info_name>
            <center_project>4085729</center_project>
            <base_file>SETARIA_ITALICA/JGI/fasta/FZGA34177.b1.fasta</base_file>
            <it_size>35000</it_size>
            <it_stdev>3500</it_stdev>
            <plate_id>357</plate_id>
            <program_id>KB 1.3.0</program_id>
            <seq_lib_id>FZGA</seq_lib_id>
            <project_id>32913</project_id>
            <info_archive>
                <ti>2167749207</ti>
                <taxid>4555</taxid>
        <basecall_length>899</basecall_length>
                <state>active</state>
            </info_archive>
        </info>
<t_volume>

I have written the following code so far:

#!/usr/bin/perl
my @files = glob('/abc*/info.xml')

foreach my $xmlname(@xml)
{

    open XML, $xmlname or die "Cannot open $xmlname for reading: $!\n";
    
    while($line=<XML>){
    
    if($line=~ /\<info_name\>/i){
        $info_name = $line =~ /\<info_name\>(\S+)\<\/info_name\>/i;
    }
    if($line=~ /\<it_size\>/i){
        $it_size = $line =~ /\<it_size\>(\S+)\<\/it_size\>/i;
    }
    
    }
    print "$info_name : $it_size\n";
}

I want to get these values as a hash, with the data in <info_name> as the key and that in <it_size> as the value.

How do I go about creating a hash for this?

Thanks in advance!

Re: Parse XML with Perl regex by graff (Chancellor) on Jul 07, 2010 at 22:13 UTC
The code you've written so far has a few problems, and it's taking you in the wrong direction -- away from using XML parsing methods and towards the much more troublesome and unreliable approach of using regex matches on XML data.

Sure, regex matches on XML data seem "easier" at first, but in the long run, they aren't. Note that two XML files can differ drastically in line count and white space content, despite containing the exact same set of information. XML parsing handles this variation automatically, while line-oriented regexes tend to choke on it. Also, there are cases where the ordering of XML elements may vary, yet the data content would still be considered "the same".

Of course, if you are really, completely sure and confident that white-space / line-break patterns in your XML data will never change from the sample data you've shown, then a regex solution would probably suffice.

Since you don't have `use strict;` you may have missed the fact that you are loading `@files` with file names, but then using `@xml` to run your "foreach" loop. You're also missing a semi-colon where you need one (at the end of the `glob` line).

After you fix that, you'll want to declare the hash that will hold your data (do this before the foreach loop over the files), and then in the block that matches the "it_size" element, you assign the hash element. Here's how it probably should look (not tested):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper 'Dumper';

    my @files = glob('/abc*/info.xml');
    my %hash;        # info_name => it_size, across all files
    my $info_name;   # most recent <info_name> value seen

    foreach my $xmlname (@files) {
        open XML, $xmlname or die "Cannot open $xmlname for reading: $!\n";
        while (<XML>) {
            if ( /\<info_name\>/i ) {
                # [^<]+ captures everything up to the closing tag
                ($info_name) = (/<info_name>([^<]+)/i);
            }
            if ( /\<it_size\>/i ) {
                my ($it_size) = (/<it_size>([^<]+)/i);
                $hash{$info_name} = $it_size;
            }
        }
        close XML;
    }
    print Dumper( \%hash );

(updated code to remove residual use of "$line", and to add parens for regex assignments to work right)

Note that I've simplified the regexes a bit.

One last thing that would be worth checking: might there be any duplicate "info_name" values scattered within or across your set of XML files? If so, does it matter that some "it_size" values may be lost (over-written) in the process?
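To make the whitespace point concrete, here is a minimal sketch of a regex approach that reads the whole document as one string instead of going line by line. It assumes the file fits in memory and that <info> blocks never nest; the sub name `extract_sizes` and both sample strings are invented for illustration and are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull info_name => it_size pairs out of a whole
# XML document held in one string, tolerating arbitrary line breaks.
sub extract_sizes {
    my ($xml) = @_;
    my %size_for;
    # Walk each <info>...</info> block; /s lets . match newlines.
    while ( $xml =~ m{<info>(.*?)</info>}gs ) {
        my $block = $1;
        my ($name) = $block =~ m{<info_name>\s*([^<]+?)\s*</info_name>};
        my ($size) = $block =~ m{<it_size>\s*([^<]+?)\s*</it_size>};
        next unless defined $name && defined $size;
        $size_for{$name} = $size;
    }
    return %size_for;
}

# The same (invented) data in two layouts: spread over many lines, and
# squeezed onto a single line.
my $pretty  = "<t_volume>\n<info>\n<info_name>\nA1\n</info_name>\n"
            . "<it_size>35000</it_size>\n</info>\n</t_volume>\n";
my $compact = "<t_volume><info><info_name>A1</info_name>"
            . "<it_size>35000</it_size></info></t_volume>";

for my $doc ( $pretty, $compact ) {
    my %h = extract_sizes($doc);
    print "$_ => $h{$_}\n" for sort keys %h;
}
```

Both layouts give the same pair here, whereas a line-by-line match would miss the first one because the opening and closing tags sit on different lines. This is still a regex hack: comments, CDATA sections, attributes on the tags, or nested <info> elements would all break it.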
Re: Parse XML with Perl regex by graff (Chancellor) on Jul 07, 2010 at 22:24 UTC
I somehow missed this point when I made my first reply -- you said:

"(I know about the XML::Parse and ::Twig modules, but using regex is a requirement)"

Who/what makes this a requirement? Why are you required to do inferior work and to stifle your own experience in solving problems using best practices?

If it's because this is a homework assignment from a teacher using bad methods to make you learn regexes, shame on you for not saying so. (I might still have posted the code I did, even knowing it was homework, since you did show some effort on your own. Honesty always pays in any case.)

If it's because the machine you're using doesn't have the required modules, there are ways around that -- like installing the modules you need inside your own home directory if necessary.

If it's because some PHB forbids using the correct tools for the job, I offer my sympathy, and suggest you keep an eye open for other places to work.
Re^2: Parse XML with Perl regex by ad23 (Acolyte) on Jul 08, 2010 at 13:53 UTC
Thanks graff. Appreciate your help with the code.

I am new to Perl and programming as a whole. Parsing XML with the available modules is the obvious answer to my question, and I did just that (using XML::Simple) before trying a regex approach. Regex may be the most outdated way to do this task, but for some beginners Perl regex can be a nightmare (at least for me). So practicing more problems using regex might help me to use these modules even better.

About the "requirement" --- IT'S NOT. Whenever I searched for something to parse XML in Perl, I always ended up with these modules and tutorials for the same. It might sound cliche, but that's what it is. There is no "homework assignment" and no one is "forcing" me to do something.

    #!/usr/bin/perl
    use XML::Simple;
    use Data::Dumper;

    $xml  = new XML::Simple(KeyAttr => []);
    $data = $xml->XMLin("INFO.xml");
    #print Dumper($data);
    #print "XML read in\n";

    # XMLin drops the root element by default, so the repeated <info>
    # elements arrive as an array reference under the key 'info'.
    foreach $e (@{ $data->{info} }) {
        print $e->{info_name}, "\n";
        print "It Size: ", $e->{it_size}, "\n";
        print "\n";
    }

Thanks again for your help!!
Re: Parse XML with Perl regex by ikegami (Patriarch) on Jul 07, 2010 at 22:15 UTC
XML::Parser::Lite (from SOAP-Lite) is a regex-based XML parser.
Re: Parse XML with Perl regex by rowdog (Curate) on Jul 07, 2010 at 23:44 UTC
You're pretty close. In scalar context a regex match returns true or false (1 or the empty string), not what it captured. In list context, it returns the list of captured groups.

    my ($info_name) = $line =~ /\<info_name\>(\S+)\<\/info_name\>/i;

And now for some notes...

- DON'T DO THAT! Using regexes on XML is fragile. Use something like XML::LibXML.
- I see fasta in there, so you may like Perl and Bioinformatics.
- `use strict;`
- `use warnings;`
- XML element names are case-sensitive, so you don't need to ignore case in your regex.
- Your example XML has 3 copies of the same structure, so you will end up with one unique key in your hash.
- Unless this is the beginning of nested t_volumes, you missed the / in `</t_volume>`.

For my example, I decided to rely on the fact that the interesting tags do not contain other tags. If that changes, my code breaks. I also rely on the order of the tags as shown in the example XML, which is generally a dumb assumption since things like XML::LibXML can reorder the elements.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @files = glob('./*.xml');
    my %results;

    foreach my $xmlname (@files) {
        open my $fh, '<', $xmlname or die "$xmlname: $!";
        while ( my $line = <$fh> ) {
            # skip ahead to the next <info_name> line
            my ($name) = $line =~ /\<info_name\>([^<]+)\<\/info_name\>/
                or next;
            # then take the first <it_size> that follows it
            while ( my $l = <$fh> ) {
                $l =~ /\<it_size\>([^<]+)\<\/it_size\>/ or next;
                $results{$name} = $1;
                last;
            }
        }
    }

    print map { "$_ => $results{$_}\n" } keys %results;

    jth@reina:~/tmp$ perl 848551.pl
    FZGA34177.b1 => 35000

And finally, my XML::LibXML alternative, which does not rely on tag ordering or the content of the tag.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::LibXML;

    my @files = glob('./*.xml');
    my %results;

    foreach my $xmlname (@files) {
        my $dom = XML::LibXML->load_xml(
            location => $xmlname,
            recover  => 1,    # no </t_volume> in example
        ) or die $!;
        foreach my $node ( $dom->findnodes('//info') ) {
            # findvalue returns the matched text as a plain string
            $results{ $node->findvalue('info_name') }
                = $node->findvalue('it_size');
        }
    }

    print map { "$_ => $results{$_}\n" } keys %results;

    jth@reina:~/tmp$ perl 848551.pl
    ./848551.xml:55: parser error : Premature end of data in tag t_volume line 54

    ^
    ./848551.xml:55: parser error : Premature end of data in tag t_volume line 2

    ^
    FZGA34177.b1 => 35000
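The scalar versus list context difference mentioned at the top of this reply is easy to see in isolation. A small sketch, using a made-up one-line input rather than the real file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = '<it_size>35000</it_size>';

# Scalar context: the match reports success (1), not the captured text.
my $scalar = $line =~ m{<it_size>([^<]+)</it_size>};

# List context (note the parentheses around the variable): the match
# returns its captured groups.
my ($list) = $line =~ m{<it_size>([^<]+)</it_size>};

print "scalar context: $scalar\n";   # prints "scalar context: 1"
print "list context:   $list\n";     # prints "list context:   35000"
```

This is why the original code printed 1 for both values: `$info_name = $line =~ /.../` assigns the success flag, while `my ($info_name) = $line =~ /.../` assigns the capture.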
Re: Parse XML with Perl regex by AndyZaft (Hermit) on Jul 07, 2010 at 21:51 UTC
    my %hash;
    $hash{$info_name} = $it_size;

Replacing the print line with this would create the hash and fill it as you go. That answers your question, but I have a feeling you wanted to ask something else.
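One wrinkle when filling a hash as you go: the declaration has to sit outside the loop, and a repeated key silently overwrites the earlier value. A minimal sketch of both points, using invented (name, size) pairs in place of values read from the XML files; `@pairs` and `@duplicates` are names made up for this example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented (name, size) pairs standing in for values pulled from the XML.
my @pairs = (
    [ 'FZGA34177.b1', 35000 ],
    [ 'FZGA34178.b1', 40000 ],
    [ 'FZGA34177.b1', 36000 ],    # same name again, different size
);

my %hash;          # declared once, outside the loop
my @duplicates;    # names that were seen more than once

for my $pair (@pairs) {
    my ( $info_name, $it_size ) = @$pair;
    # Remember the name if an earlier value is about to be replaced
    push @duplicates, $info_name if exists $hash{$info_name};
    $hash{$info_name} = $it_size;    # last value wins
}

print "$_ => $hash{$_}\n" for sort keys %hash;
print "duplicate names: @duplicates\n" if @duplicates;
```

If the earlier values matter, storing an array reference per key (`push @{ $hash{$info_name} }, $it_size`) keeps all of them instead of only the last.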
Re^2: Parse XML with Perl regex by ad23 (Acolyte) on Jul 07, 2010 at 22:01 UTC
Thanks for your reply! I replaced the "print" statement in the above code with your code and am trying to print the hash, like:

    ....
        }
        $xmlhash{$info_name} = $it_size;
        #print "$info_name : $it_size\n";
        close(XML);
    }

    foreach $k (sort keys %xmlhash){
        print "$key: $xmlhash{$key}\n";
    }

I am very sure I am going wrong here, as obviously I cannot print the hash. Also, since I am reading from multiple files, I want to print the values from all the files read in.
Re^3: Parse XML with Perl regex by almut (Canon) on Jul 07, 2010 at 23:11 UTC
Use `my $key` instead of `$k`. And also `use strict; use warnings;` ...
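Put together, that fix might look like the following. This is only a sketch: `%xmlhash` is hard-coded here as a stand-in for the data collected from the files, and the second entry is invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in data; in the real script this hash is filled inside the file loop.
my %xmlhash = (
    'FZGA34177.b1' => 35000,
    'FZGA34178.b1' => 40000,    # invented second entry, for illustration
);

# The loop variable and the variable used in the body must have the same name.
foreach my $key ( sort keys %xmlhash ) {
    print "$key: $xmlhash{$key}\n";
}
```

With `use strict;` in place, the original mix-up of `$k` and `$key` is reported at compile time as an undeclared variable instead of quietly printing empty values.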
Re^4: Parse XML with Perl regex by ad23 (Acolyte) on Jul 08, 2010 at 13:58 UTC
Re: Parse XML with Perl regex by Anonymous Monk on Jul 07, 2010 at 21:56 UTC
"I know about the XML::Parse and ::Twig modules, but using regex is a requirement"

Isn't it also a requirement that you do your own homework?