legendx has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I would appreciate advice on how to go about writing a Perl script solution to the following problem:

The Layout:
- There are several flat files, used as config files, that contain blocks of text similar to XML format.
- Each block starts with a "[" and ends with a "]" and in between contains configuration arguments such as a unique "ID" number and "Start" and "Done" fields.
- The "Start" field has values that are files that must be present
- The "Done" field has values of files that are created once the "Start" files are present

Example of flat file:
[
ID: 123
Start: /tmp/file.1 /tmp/file.2 /tmp/file.3
Done: /complete/success.1 /complete/success.2
]
[
ID: 456
Start: /complete/success.1 /complete/success.2 /tmp/file.3
Done: /complete/success.3 /complete/success.4
]
... etc

What I am trying to do
I would like to parse through the file using Perl and find the file dependencies, recursively, that each "ID" has.
For example, ID 456 depends on the two files /complete/success.1 and /complete/success.2 that are from ID 123

What next?
I think that I would have to
- parse the file out and match between each block, that is, between each "[" and "]"
- save each block to a hash with the ID and then the Start/Done fields as keys and the values as the filenames
e.g.

$hash{123}{Start} => /tmp/file.1
                  => /tmp/file.2
                  => /tmp/file.3
$hash{123}{Done}  => /complete/success.1
                  => /complete/success.2
Then do I compare all the hash values to see which one matches to find what depends on what? That seems a bit complicated but I'm not sure if there is an easier way.
Could someone point me in the right direction with the logic or provide other solutions?

Replies are listed 'Best First'.
Re: Parsing a file and finding the dependencies in it
by Marshall (Canon) on Jul 05, 2011 at 21:06 UTC
    To get you started, see the parser below. I generate two hash tables which are linked by the id number. One hash table translates any output file to the id number which produced it. The other hash translates an id number into the input files that it needs.

    I haven't written any code to make a report, but I think that this is enough to travel backwards from an output fileA -> id, then id->inputfilesX, therefore output fileA depends upon inputfiles X. Those input files can be looked up to see where they came from, etc.

    Hope this provides fuel for thought. It could be that a different data structure is better than this, but at least it shows one way to get the parsing done.

    Update: Added some printing code to make a basic report to show all files used to generate a particular output file.

    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    my %id;
    my %done;
    my %record=();
    while (<DATA>)
    {
       if (my $num = /\[/.../\]/)
       {
          my ($tag, @values) = split;
          @{$record{$tag}} = @values;
          if ($num =~ /E0/)
          {
             my ($id) = @{$record{'ID:'}};
             @{$id{$id}} = @{$record{'Start:'}};
             foreach (@{$record{'Done:'}})
             {
                $done{$_}= $id;
             }
             %record=();
          }
       }
    }
    print Dumper \%done;
    print Dumper \%id;

    foreach my $file (keys %done)
    {
       print "$file\n";
       my %seen;
       print map{" $_\n"}grep{!$seen{$_}++}priorFiles($file);
       print "\n";
    }

    sub priorFiles
    {
       my ($file) = @_;
       return() if !exists $done{$file};
       my @prior = @{$id{$done{$file}}};
       foreach (@prior)
       {
          push @prior, priorFiles($_);
       }
       return @prior;
    }

    =output
    %done shows the id number which produced each file
    $VAR1 = {
              '/complete/success.3' => '456',
              '/complete/success.2' => '123',
              '/complete/success.1' => '123',
              '/complete/success.4' => '456'
            };
    %id shows the input files were used by the id
    $VAR1 = {
              '456' => [
                         '/complete/success.1',
                         '/complete/success.2',
                         '/tmp/file.3'
                       ],
              '123' => [
                         '/tmp/file.1',
                         '/tmp/file.2',
                         '/tmp/file.3'
                       ]
            };
    #This is a basic listing..all files that affected the first file
    /complete/success.3
     /complete/success.1
     /complete/success.2
     /tmp/file.3
     /tmp/file.1
     /tmp/file.2

    /complete/success.2
     /tmp/file.1
     /tmp/file.2
     /tmp/file.3

    /complete/success.1
     /tmp/file.1
     /tmp/file.2
     /tmp/file.3

    /complete/success.4
     /complete/success.1
     /complete/success.2
     /tmp/file.3
     /tmp/file.1
     /tmp/file.2

    =cut
    __DATA__
    [
    ID: 123
    Start: /tmp/file.1 /tmp/file.2 /tmp/file.3
    Done: /complete/success.1 /complete/success.2
    ]
    [
    ID: 456
    Start: /complete/success.1 /complete/success.2 /tmp/file.3
    Done: /complete/success.3 /complete/success.4
    ]

      Hm, I see. I had a similar idea to what you did there.

      I will try this and see where it leads. Thanks!

      Thanks for adding the update and your help.

      This works fine on my machine, but I don't get $num and its regexp at first.

      if (my $num = /\[/.../\]/)

      must be

      if (my $num = $_ =~ /\[/.../\]/)

      I don't understand what the regexp '/\[/.../\]/' means. I asked YAPE::Regex::Explain

      my $regexp='\[/.../\]';
      use YAPE::Regex::Explain;
      my $exp = YAPE::Regex::Explain->new($regexp)->explain;
      print $exp;

      and it says

      The regular expression:

      (?-imsx:\[/.../\])

      matches as follows:

      NODE                     EXPLANATION
      ----------------------------------------------------------------------
      (?-imsx:                 group, but do not capture (case-sensitive)
                               (with ^ and $ matching normally) (with . not
                               matching \n) (matching whitespace and #
                               normally):
      ----------------------------------------------------------------------
        \[                       '['
      ----------------------------------------------------------------------
        /                        '/'
      ----------------------------------------------------------------------
        .                        any character except \n
      ----------------------------------------------------------------------
        .                        any character except \n
      ----------------------------------------------------------------------
        .                        any character except \n
      ----------------------------------------------------------------------
        /                        '/'
      ----------------------------------------------------------------------
        \]                       ']'
      ----------------------------------------------------------------------
      )                        end of grouping
      ----------------------------------------------------------------------

      Where does it match against the __DATA__ lines? I have no idea. And why /E0/? Could someone please shed some light on this for me?

        It's not a single regular expression but two separate regular expressions, connected by the ... operator (see perlop).

        Also, every regular expression matches against $_ by default, so $_ =~ /foo/ is identical to /foo/.

        Hope you understood Corion's reply. "while (<DATA>)" causes $_ to be set to the next line in __DATA__ at every iteration.

        This construct: my $num = /regex/.../regex/ uses what is also known as the flip-flop operator. A classic post on this by Grandfather is: Flipin good, or a total flop?.

        A single regex that matches returns a true value; I think a numeric 1 is returned. When two regexes are joined by the ... operator, a sequence number is returned that tells us which line of the record we are on, counting from 1 at the line that matched the first regex.

        I would suggest that you put a print "num=$num\n"; statement in the loop and watch what happens. You will see values like: 1,2,3,4E0.

        The 4E0 means that something is different about this line number! And indeed there is. It is the line that contains the ']' character (the last line of the record - the line that matches the 2nd regex). The E0 is just exponential notation meaning 10**0. Any number raised to the zero'th power is 1. So 4E0 = 4 * 10**0 = 4 * 1 = 4 from a numeric perspective. So this is a clever way to return 2 pieces of information with a single number. A number in exponential format means the record is over and if I wanted to do some math on this number, it is a perfectly legitimate representation of the number 4.
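
        For anyone who wants to watch this happen, here is a minimal sketch. The @lines array is only an illustrative stand-in for the __DATA__ section and @nums is a hypothetical collector; neither is part of the code above. It runs the same flip-flop over one five-line record and prints the sequence number for each line.

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;

        # Illustrative stand-in for one record of the config file.
        my @lines = (
            "[\n",
            "ID: 123\n",
            "Start: /tmp/file.1 /tmp/file.2 /tmp/file.3\n",
            "Done: /complete/success.1 /complete/success.2\n",
            "]\n",
        );

        my @nums;    # sequence numbers returned by the flip-flop
        for (@lines) {
            # In scalar context ... is the flip-flop: it is true from the
            # line matching the first regex through the line matching the second.
            if (my $num = /\[/ ... /\]/) {
                push @nums, $num;
                my $note = $num =~ /E0$/ ? " (last line of record)" : "";
                print "num=$num$note\n";
            }
        }
        ```

        Because this record has five lines including the brackets, the values run from 1 to 4 and the closing "]" line reports 5E0.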

        Update:
        I could have written the code with a more conventional parsing scheme. When the first line of a record is detected, call a subroutine which processes lines until the last line of the record is detected. This eliminates the need to have some flag values like "I'm inside the record now..", etc. The flip-flop implementation essentially does what the below would have done:

        #!/usr/bin/perl -w
        use strict;

        while (<DATA>)
        {
           process_record() if /^\[/;  #start of record
        }

        sub process_record
        {
           my %record;
           my $line;
           my $line_num=1;
           while (defined ($line = <DATA>) and $line !~ /\]/)
           {
              print "line= ",$line_num++," ",$line;
              # do splits and fill in %record here
           }
           print "Record Complete!\n\n";
           # use %record here to populate other hashes
           # %record is thrown away when sub returns.
        }

        =prints
        line= 1 ID: 123
        line= 2 Start: /tmp/file.1 /tmp/file.2 /tmp/file.3
        line= 3 Done: /complete/success.1 /complete/success.2
        Record Complete!

        line= 1 ID: 456
        line= 2 Start: /complete/success.1 /complete/success.2 /tmp/file.3
        line= 3 Done: /complete/success.3 /complete/success.4
        Record Complete!

        =cut
        __DATA__
        [
        ID: 123
        Start: /tmp/file.1 /tmp/file.2 /tmp/file.3
        Done: /complete/success.1 /complete/success.2
        ]
        [
        ID: 456
        Start: /complete/success.1 /complete/success.2 /tmp/file.3
        Done: /complete/success.3 /complete/success.4
        ]
      This works, but I have another question: if there is another field, say "Desc", with a value that is space delimited, how can I put the entire value into a variable?
      Example:
      [
      ID: 456
      Desc: This is a test job
      Start: /complete/success.1 /complete/success.2 /tmp/file.3
      Done: /complete/success.3 /complete/success.4
      ]
      The code currently splits all by space so only the first word is in the value.
      E.g.
      my ($desc) = @{$record{'Desc:'}};
      $desc will only contain the word "This" and not the full value "This is a test job".
      What is the best way to solve this?
        There is nothing wrong with making Desc: a special case for the splitting. I show some code below...

        In this special situation, you can just test for /^Desc:/. The technique is to limit the number of things returned from the split, in this case 2 things. Doing that requires that we take care of one more detail, a chomp() is needed.

        When we let split() do its default thing, a chomp() is not needed because the trailing \n will be removed (default split is on any sequence of the 5 whitespace characters: space, \n, \f, \r, \t). If we tell split() to stop working after it has 2 things, then we have to do manually what it would have done to the last thing.
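
        To make the difference concrete, here is a small sketch with a hard-coded line (the variable names are only for illustration). It compares the default split with a split limited to 2 fields, with and without chomp.

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;

        my $line = "Desc: This is a test job\n";

        # Default-style split: runs of whitespace separate the fields, so the
        # trailing newline disappears, but the description is cut into words.
        my ($tag, @words) = split ' ', $line;

        # Limit of 2: the rest of the line stays in one piece, and the
        # trailing newline stays with it because split stops early.
        my ($tag_raw, $rest_raw) = split /\s+/, $line, 2;

        # Same split on a chomped copy gives the clean description.
        my $copy = $line;
        chomp $copy;
        my ($tag2, $desc) = split /\s+/, $copy, 2;

        print "default split gives ", scalar(@words), " separate words\n";
        print "limit 2 without chomp ends in newline: ",
              ($rest_raw =~ /\n\z/ ? "yes" : "no"), "\n";
        print "limit 2 with chomp: '$desc'\n";
        ```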

        I set up %record so that it is a Hash of Array, each value is a reference to an anonymous array of data. That is true even for a single value like the id number. "Same-ness" is a good thing in programming. So, I would do the same for the description string.

        Then the question is: what do you do with this description once the record is complete? You could, say, add another dimension to the hash which has ids as the key. However, there is something to be said for keeping things simple. You could just make another hash that is keyed on ids with the string as the value. Some purists might shudder in horror, but again simplicity has virtues!

        # ........ snip
           if (my $num = /\[/.../\]/)
           {
              if (/^Desc:/)
              {
                 chomp;
                 my ($desc, $string) = split(/\s+/,$_,2);
                 $record{$desc} = [$string];
                 next;
              }
              my ($tag, @values) = split;
              @{$record{$tag}} = @values;
        #........ snip

        OR....perhaps...

              if (/^Desc:/)
              {
                 chomp;
                 my ($desc, $string) = split(/\s+/,$_,2);
                 $record{$desc} = [$string]; # same as @{$record{$desc}} = ($string);
              }
              else
              {
                 my ($tag, @values) = split;
                 @{$record{$tag}} = @values;
        #....snip...
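
        The "another hash keyed on ids" idea could look roughly like the sketch below. The %desc hash is a hypothetical name, and %record is filled in by hand here to stand for what the parser would hold when it reaches the "]" line of ID 456.

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hand-built stand-in for %record at the end of the ID 456 block.
        my %record = (
            'ID:'    => ['456'],
            'Desc:'  => ['This is a test job'],
            'Start:' => ['/complete/success.1', '/complete/success.2', '/tmp/file.3'],
            'Done:'  => ['/complete/success.3', '/complete/success.4'],
        );

        my (%id, %done, %desc);    # %desc is the extra hash: id => description

        my ($id) = @{ $record{'ID:'} };
        @{ $id{$id} } = @{ $record{'Start:'} };
        $done{$_} = $id for @{ $record{'Done:'} };

        # Desc: may be missing from some records, so copy it only when present.
        ($desc{$id}) = @{ $record{'Desc:'} } if exists $record{'Desc:'};

        print "$id: $desc{$id}\n";
        ```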

        It is a custom format.

        I've heard that often, but I'll take your word for it :)

        The problem is not parsing the file, but finding the dependencies as explained above. Thanks.

        I hate to be contrary :) but yes, the problem is parsing.

        First you build a data structure (parse), then you walk the data structure looking for dependencies.
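
        As a rough sketch of those two steps, assuming the format shown in the question with one field per line: the names %start, %producer and depends_on are made up for illustration, and the %seen-style guard against circular dependencies is an addition that the question did not ask for.

        ```perl
        #!/usr/bin/perl
        use strict;
        use warnings;

        # Illustrative input: the two records from the question.
        my @lines = (
            "[", "ID: 123",
            "Start: /tmp/file.1 /tmp/file.2 /tmp/file.3",
            "Done: /complete/success.1 /complete/success.2", "]",
            "[", "ID: 456",
            "Start: /complete/success.1 /complete/success.2 /tmp/file.3",
            "Done: /complete/success.3 /complete/success.4", "]",
        );

        # Step 1 (parse): id => [input files], and output file => producing id.
        my (%start, %producer, %record);
        for my $line (@lines) {
            if ($line =~ /^\s*\[/) { %record = (); next; }
            if ($line =~ /^\s*\]/) {
                my ($id) = @{ $record{ID} };
                $start{$id} = [ @{ $record{Start} || [] } ];
                $producer{$_} = $id for @{ $record{Done} || [] };
                next;
            }
            my ($tag, @values) = split ' ', $line;
            next unless defined $tag;    # skip blank lines
            $tag =~ s/:$//;
            $record{$tag} = \@values;
        }

        # Step 2 (walk): every id that $id depends on, directly or indirectly.
        sub depends_on {
            my ($id, $seen) = @_;
            $seen ||= { $id => 1 };      # guards against circular dependencies
            my @deps;
            for my $file (@{ $start{$id} || [] }) {
                my $dep = $producer{$file};
                next if !defined $dep or $seen->{$dep}++;
                push @deps, $dep, depends_on($dep, $seen);
            }
            return @deps;
        }

        for my $id (sort keys %start) {
            my @deps = depends_on($id);
            print "ID $id depends on: ", (@deps ? "@deps" : "(nothing)"), "\n";
        }
        ```

        With the sample data this reports that ID 456 depends on ID 123 and that ID 123 depends on nothing, which is the relationship described in the question.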

        If parsing wasn't a problem, surely you would have shared your parser, or at least, the data structure it creates?

        Your response to Marshall's node Re: Parsing a file and finding the dependencies in it firmly confirms that parsing was indeed a problem.

Re: Parsing a file and finding the dependencies in it
by Anonymous Monk on Jul 05, 2011 at 20:26 UTC

    Hello all, I would appreciate advice on how to go about writing a Perl script solution to the following problem:

    The first thing I would do is identify the file format, and look to CPAN for a parser...

    Alternatively see String Search ( esp Re: String Search,Re^5: String Search )

      It is a custom format.
      The problem is not parsing the file, but finding the dependencies as explained above. Thanks.