Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I am rather new to Perl and was hoping you could help me. I need to parse a file that's in an XML-like format (not full XML syntax) and pull out data to run different types of jobs. The file looks like this:

<PROJECT_ID>12345</PROJECT_ID>
<JOBID>101</JOBID>
<TYPE1>add</TYPE1>
<FILE1>/tmp/file_data_gros</FILE1>
<JOBID>102</JOBID>
<TYPE2>delete</TYPE2>
<FILE2>/tmp/file_myvalues</FILE2>

The above file will always have a PROJECT_ID number and can have multiple JOBID fields defined in it. Each JOBID will always have a corresponding TYPE* and FILE* associated with it. What I am trying to write is something that can loop through multiple files like this. When each file is processed, the code would pull out the PROJECT_ID value and then, for each JOBID (in order), pull out the corresponding values for TYPE* and FILE*. So in the above example the code would pull out the PROJECT_ID value, get the value of the first JOBID seen in the file, then get the values for TYPE1 and FILE1, and output a string in the following format with the values for each field:

PROJECT_ID(value) JOBID(value) TYPE1(value) FILE1(value)

.....I would then do some processing with these values. The loop would then carry onto the next JOBID seen and then output the values for these fields:

PROJECT_ID(value) JOBID(value) TYPE2(value) FILE2(value)

.....I would then do some processing with these values. The loop would then carry on to the next JOBID seen and do the same until there are no more to process within this file.

I am really struggling with how to go about doing this. I was thinking maybe I should read the whole file into a hash, but I am not too sure that this is the right approach. I have written this so far:

open FILE, "$file_to_process" or die;
my %hash;
while (my $line=<FILE1>) {
    chomp;
    (my $xmltag, $xmlvalue) = split /\<|\>/, $line;
    $hash{$xmltag} = $xmlvalue;
}

If anyone can help me with some code that would be able to do what I need I would greatly appreciate it. My attempt is not working at all :-(

Replies are listed 'Best First'.
Re: Parsing an file that has xml like syntax
by choroba (Cardinal) on Apr 02, 2014 at 18:58 UTC
    I usually process similar files line by line, keeping the status (project, job) in a variable that survives the loop. It is not clear from your description whether a job can have more than one type and file, so I assumed it can't.
    #!/usr/bin/perl
    use warnings;
    use strict;

    sub output {
        my ($id, @jobs) = @_;
        print join("\t", $id, @{$_}{qw(JOBID TYPE FILE)}), "\n" for @jobs;
    }

    my @jobs;
    my $id;
    while (<>) {
        my ($tag, $value) = m{<(.*)>(.*)</\1>} or next;
        if ('PROJECT_ID' eq $tag) {
            output($id, @jobs);
            $id = $value;
            @jobs = ();
        } else {
            $tag =~ s/[0-9]+$//;
            $#jobs++ if 'JOBID' eq $tag;
            $jobs[-1]{$tag} = $value;
        }
    }
    output($id, @jobs); # Don't forget to output the last job.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Wow Choroba, that's great. You wrote that very quickly. Thanks ever so much. This looks like it will do exactly what I am looking for. Yes, you are right, there can only be one TYPE* and FILE* associated with each JOBID. Once I get these values for each JOBID I ultimately need to supply them as arguments to a program and execute it. So when reading the first JOBID, I need the values for TYPE1 and FILE1, and with these I then execute another program. Once that has been executed I read the next JOBID in the file, get the values for TYPE2 and FILE2, and execute the program again with those values. And I keep doing this until there are no more JOBIDs defined in this file.

      I think this is the way to do it.
      use strict;
      use warnings;
      use Data::Dumper;

      my $config;
      while (<DATA>) {
          chomp;
          my ($key, $variable) = ( $_ =~ /<(\w+?)>([^<]*)/ );
          next unless defined $key;    # skip lines that hold no tag
          if ($key =~ m/JOBID/) {
              # collect every JOBID value, in file order
              push (@{$config->{JOBID}}, $variable);
          } else {
              $config->{$key} = $variable;
          }
      }
      print Dumper $config;

      __DATA__
      <PROJECT_ID>12345</PROJECT_ID>
      <JOBID>101</JOBID>
      <JOBID>102</JOBID>
      <JOBID>103</JOBID>
      <TYPE1>add</TYPE1>
      <FILE1>/tmp/file_data_gros</FILE1>
      <JOBID>102</JOBID>
      <TYPE2>delete</TYPE2>
      <FILE2>/tmp/file_myvalues</FILE2>
      Output
      $VAR1 = {
          'JOBID' => [
              '101',
              '102',
              '103',
              '102'
          ],
          'FILE2' => '/tmp/file_myvalues',
          'TYPE2' => 'delete',
          'FILE1' => '/tmp/file_data_gros',
          'TYPE1' => 'add',
          'PROJECT_ID' => '12345'
      };

        Hi crusty_collins I think I understand what you are doing here. Are you creating an array of jobid values and then associating each jobid value with the corresponding value for TYPE and FILE? So for instance if this was seen in the file:

        <PROJECT_ID>12345</PROJECT_ID>
        <JOBID>101</JOBID>
        <TYPE1>add</TYPE1>
        <FILE1>/tmp/file_data_gros</FILE1>
        <JOBID>104</JOBID>
        <TYPE2>delete</TYPE2>
        <FILE2>/tmp/file_myvalues</FILE2>

        I would need to parse this file like so: found value for first JOBID and also found values for the associated FILE1 and TYPE1 and will now execute my program with these values:

        `myprogram -action add -file /tmp/file_data_gros`

        The loop would then carry on to see if there was another JOBID, and if there is, which there is in this example, then my program would execute:

        `myprogram -action delete -file /tmp/file_myvalues`

        The loop would then carry on to see if there was another JOBID, and if there was, it would find the values of the next TYPE* and FILE* fields defined and then execute my program again.

        I don't fully understand your code but how would I loop over this file to do the above and execute my program?
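        One possible shape for such a loop, offered only as a rough sketch: collect one hash per JOBID (as in choroba's reply), then turn each into an argument list for `myprogram`. The sub name `build_commands` and the in-memory sample lines are made up for illustration; the `-action` and `-file` options are taken from the post above, and the sketch assumes every JOBID is followed by exactly one TYPE* and one FILE*.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the tag lines of one file into a list of
# argument lists, one per JOBID, in the order the jobs appear.
sub build_commands {
    my (@lines) = @_;
    my @jobs;
    for my $line (@lines) {
        my ($tag, $value) = $line =~ m{<(\w+)>([^<]*)</\1>} or next;
        $tag =~ s/[0-9]+\z//;            # TYPE1 -> TYPE, FILE2 -> FILE
        push @jobs, {} if $tag eq 'JOBID';   # a new job starts here
        next unless @jobs;               # ignore tags before the first JOBID
        $jobs[-1]{$tag} = $value;
    }
    return map { ['myprogram', '-action', $_->{TYPE}, '-file', $_->{FILE}] } @jobs;
}

# Sample data modelled on the example in the post.
my @lines = (
    '<PROJECT_ID>12345</PROJECT_ID>',
    '<JOBID>101</JOBID>',
    '<TYPE1>add</TYPE1>',
    '<FILE1>/tmp/file_data_gros</FILE1>',
    '<JOBID>104</JOBID>',
    '<TYPE2>delete</TYPE2>',
    '<FILE2>/tmp/file_myvalues</FILE2>',
);

for my $cmd (build_commands(@lines)) {
    print "would run: @$cmd\n";
    # To really execute it, something like:
    # system(@$cmd) == 0 or warn "myprogram failed: $?";
}
```

        Passing the arguments to `system` as a list rather than interpolating them into backticks avoids the shell, so a file name containing spaces or shell metacharacters is handed to the program unchanged.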

Re: Parsing an file that has xml like syntax
by GotToBTru (Prior) on Apr 02, 2014 at 18:53 UTC

    Your split is going to return more than you expect. There will be 4 values returned by the split of your first line:

    0 ''
    1 project_id
    2 12345
    3 /project_id

    Try:

    ($xmltag, $xmlvalue) = (split /\<|\>/, $string)[1,2];

    The parentheses tell perl to treat the values returned by split as a list, and the subscripts pick out the elements you are specifically looking for (a list slice).
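    To see the list slice at work, here is a small self-contained sketch using the first line of the sample file. The character class `[<>]` is just an equivalent way of writing the `\<|\>` alternation.

```perl
use strict;
use warnings;

# Sample line taken from the file format in the question.
my $line = '<PROJECT_ID>12345</PROJECT_ID>';

# split on '<' or '>' yields: '', 'PROJECT_ID', '12345', '/PROJECT_ID'
my @fields = split /[<>]/, $line;

# The list slice [1, 2] keeps only the tag name and its value.
my ($xmltag, $xmlvalue) = (split /[<>]/, $line)[1, 2];

print scalar(@fields), " fields; tag=$xmltag value=$xmlvalue\n";
```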

    I do the same thing in several of my programs using the following:

    ($tag,$value) = ($inline =~ /<(\/?\w+?)>([^<]*)/);

    I'm using a regex to handle 3 cases I see in my input:

    <tag>value</tag> or <tag> or </tag>

    I can supply the details on how this works, but you might find it informative to try to figure out how it works yourself.
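    For anyone who wants to experiment, this sketch runs that regex over the three cases plus a line with no tag at all. The sub name `parse_tag` and the `<JOB>` lines are invented for the demonstration; they are not part of the original file format.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex from the reply above.
sub parse_tag {
    my ($inline) = @_;
    # \/? allows a closing tag, \w+? captures the tag name,
    # [^<]* captures everything up to the next '<' (possibly nothing).
    my ($tag, $value) = ($inline =~ /<(\/?\w+?)>([^<]*)/);
    return ($tag, $value);
}

for my $inline ('<TYPE1>add</TYPE1>', '<JOB>', '</JOB>', 'no tags here') {
    my ($tag, $value) = parse_tag($inline);
    if (defined $tag) {
        print "'$inline' gives tag='$tag' value='$value'\n";
    }
    else {
        print "'$inline' does not match\n";
    }
}
```

    Note that a closing tag comes back with its leading slash (`/JOB`) and an empty value, so the caller can tell the three cases apart.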

      That's brilliant, that works really well. Thanks a lot. I have been trying for ages to get the tag and value out correctly. Would you be able to help me with a loop that, for each JOBID, gets the corresponding TYPE* and FILE* values and prints this info on one line of output? So the first line of output would be:

      12345 101 add /tmp/file_data_gros

      Then the loop moves on to find the next JOBID and the values for the associated TYPE* and FILE*. The next line printed would be:

      12345 102 delete /tmp/file_myvalues

      And it will continue to do this until all JOBIDs and associated TYPE*s and FILE*s have been processed in the file.
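      A minimal sketch of such a loop, built on the regex from the reply above. It assumes PROJECT_ID comes first and that each JOBID is followed by its TYPE* and then its FILE*, as in the sample file. The sub name `job_lines` and the `@sample` array are made up for illustration; in a real script the lines would come from the file being read.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one output line per JOBID.
sub job_lines {
    my (@lines) = @_;
    my ($project, @jobs);
    for my $inline (@lines) {
        my ($tag, $value) = ($inline =~ /<(\/?\w+?)>([^<]*)/);
        next unless defined $tag;            # line holds no tag
        if ($tag eq 'PROJECT_ID') {
            $project = $value;
        }
        elsif ($tag eq 'JOBID') {
            push @jobs, [$project, $value];  # start a new output row
        }
        elsif ($tag =~ /^(?:TYPE|FILE)\d+$/ && @jobs) {
            push @{ $jobs[-1] }, $value;     # append to the current row
        }
    }
    return map { join ' ', @$_ } @jobs;
}

my @sample = (
    '<PROJECT_ID>12345</PROJECT_ID>',
    '<JOBID>101</JOBID>',
    '<TYPE1>add</TYPE1>',
    '<FILE1>/tmp/file_data_gros</FILE1>',
    '<JOBID>102</JOBID>',
    '<TYPE2>delete</TYPE2>',
    '<FILE2>/tmp/file_myvalues</FILE2>',
);

print "$_\n" for job_lines(@sample);
```

      With the sample lines this prints `12345 101 add /tmp/file_data_gros` followed by `12345 102 delete /tmp/file_myvalues`.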