in reply to Parsing a file and finding the dependencies in it

To get you started, see the parser below. It builds two hash tables which are linked by the id number. One hash translates any output file to the id number which produced it. The other translates an id number into the input files which that id consumed.

I haven't written any code to make a report, but I think that this is enough to travel backwards from an output fileA -> id, then id->inputfilesX, therefore output fileA depends upon inputfiles X. Those input files can be looked up to see where they came from, etc.

Hope this provides fuel for thought. It could be that a different data structure is better than this, but at least it shows one way to get the parsing done.

Update: Added some printing code to make a basic report to show all files used to generate a particular output file.

#!/usr/bin/perl -w
use strict;
use Data::Dumper;

my %id;
my %done;
my %record=();

while (<DATA>)
{
   if (my $num = /\[/.../\]/)
   {
      my ($tag, @values) = split;
      @{$record{$tag}} = @values;
      if ($num =~ /E0/)
      {
         my ($id) = @{$record{'ID:'}};
         @{$id{$id}} = @{$record{'Start:'}};
         foreach (@{$record{'Done:'}})
         {
            $done{$_}= $id;
         }
         %record=();
      }
   }
}
print Dumper \%done;
print Dumper \%id;

foreach my $file (keys %done)
{
   print "$file\n";
   my %seen;
   print map{" $_\n"}grep{!$seen{$_}++}priorFiles($file);
   print "\n";
}

sub priorFiles
{
   my ($file) = @_;
   return() if !exists $done{$file};
   my @prior = @{$id{$done{$file}}};
   foreach (@prior)
   {
      push @prior, priorFiles($_);
   }
   return @prior;
}

=output
%done shows the id number which produced each file
$VAR1 = {
          '/complete/success.3' => '456',
          '/complete/success.2' => '123',
          '/complete/success.1' => '123',
          '/complete/success.4' => '456'
        };
%id shows the input files were used by the id
$VAR1 = {
          '456' => [
                     '/complete/success.1',
                     '/complete/success.2',
                     '/tmp/file.3'
                   ],
          '123' => [
                     '/tmp/file.1',
                     '/tmp/file.2',
                     '/tmp/file.3'
                   ]
        };
#This is a basic listing..all files that affected the first file
/complete/success.3
 /complete/success.1
 /complete/success.2
 /tmp/file.3
 /tmp/file.1
 /tmp/file.2

/complete/success.2
 /tmp/file.1
 /tmp/file.2
 /tmp/file.3

/complete/success.1
 /tmp/file.1
 /tmp/file.2
 /tmp/file.3

/complete/success.4
 /complete/success.1
 /complete/success.2
 /tmp/file.3
 /tmp/file.1
 /tmp/file.2
=cut

__DATA__
[
ID: 123
Start: /tmp/file.1 /tmp/file.2 /tmp/file.3
Done: /complete/success.1 /complete/success.2
]
[
ID: 456
Start: /complete/success.1 /complete/success.2 /tmp/file.3
Done: /complete/success.3 /complete/success.4
]
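A possible variation on the report step, offered as a sketch rather than a replacement: priorFiles() pushes onto @prior while foreach is iterating over that same array, which perlsyn advises against, and a circular dependency in the input would make it recurse without end. The version below walks the same two hashes with an explicit queue and a %seen hash. The helper name all_prior_files and the hard-coded hash contents are assumptions made for illustration; the hashes have the same shape as the ones the parser builds.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same shape as the hashes built by the parser:
# %done maps an output file to the id that produced it,
# %id maps an id to the input files it consumed.
my %done = (
    '/complete/success.1' => '123',
    '/complete/success.2' => '123',
    '/complete/success.3' => '456',
    '/complete/success.4' => '456',
);
my %id = (
    '123' => [ '/tmp/file.1', '/tmp/file.2', '/tmp/file.3' ],
    '456' => [ '/complete/success.1', '/complete/success.2', '/tmp/file.3' ],
);

# all_prior_files() is a hypothetical helper, not part of the original code.
# It visits dependencies breadth-first. %seen guarantees each file is
# reported once and that a circular dependency cannot loop forever.
sub all_prior_files {
    my ($file) = @_;
    my %seen  = ($file => 1);
    my @queue = ($file);
    my @prior;
    while (defined(my $current = shift @queue)) {
        next unless exists $done{$current};   # raw input: no job produced it
        foreach my $input (@{ $id{ $done{$current} } }) {
            next if $seen{$input}++;          # already reported
            push @prior, $input;
            push @queue, $input;              # look up its own inputs later
        }
    }
    return @prior;
}

print "$_\n" for all_prior_files('/complete/success.3');
```

Because duplicates are filtered inside the walk, the caller no longer needs the grep{!$seen{$_}++} step.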

Re^2: Parsing a file and finding the dependencies in it
by legendx (Acolyte) on Jul 06, 2011 at 01:53 UTC

    Hm, I see. I had a similar idea to what you did there.

    I will try this and see where it leads. Thanks!

Re^2: Parsing a file and finding the dependencies in it
by Anonymous Monk on Jul 06, 2011 at 03:43 UTC
    Thanks for adding the update and your help.
Re^2: Parsing a file and finding the dependencies in it
by remiah (Hermit) on Jul 06, 2011 at 14:49 UTC

    This works fine on my machine, but at first I didn't understand $num and its regexp.

    if (my $num = /\[/.../\]/)

    must be

    if (my $num = $_ =~ /\[/.../\]/)

    I don't understand what the regexp '/\[/.../\]/' means. I asked YAPE::Regex::Explain

    my $regexp='\[/.../\]';
    use YAPE::Regex::Explain;
    my $exp = YAPE::Regex::Explain->new($regexp)->explain;
    print $exp;

    and it says

    The regular expression:

    (?-imsx:\[/.../\])

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      \[                       '['
    ----------------------------------------------------------------------
      /                        '/'
    ----------------------------------------------------------------------
      .                        any character except \n
    ----------------------------------------------------------------------
      .                        any character except \n
    ----------------------------------------------------------------------
      .                        any character except \n
    ----------------------------------------------------------------------
      /                        '/'
    ----------------------------------------------------------------------
      \]                       ']'
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

    Where does it match against the __DATA__ lines? I have no idea. And why /E0/? Someone please shed some light on this for me.

      It's not a single regular expression but two separate regular expressions, connected by the ... operator (see perlop).

      Also, every regular expression matches against $_ by default, so $_ =~ /foo/ is identical to /foo/.

      Hope you understood Corion's reply. "while (<DATA>)" causes $_ to be set to the next line in __DATA__ at every iteration.

      This construct: my $num = /regex/.../regex/ uses what is also known as the flip-flop operator. A classic post on this by Grandfather is: Flipin good, or a total flop?.

      A single regex that matches returns a true value (I think a numeric 1). When two regexes are joined by the ... operator, a sequence number is returned instead, telling you which line of the current record you are on. Outside of a record the value is false.

      I would suggest that you put a print "num=$num\n"; statement in the loop and watch what happens. You will see values like: 1,2,3,4E0.

      The 4E0 means that something is different about this line number! And indeed there is. It is the line that contains the ']' character (the last line of the record - the line that matches the 2nd regex). The E0 is just exponential notation meaning 10**0. Any number raised to the zero'th power is 1. So 4E0 = 4 * 10**0 = 4 * 1 = 4 from a numeric perspective. So this is a clever way to return 2 pieces of information with a single number. A number in exponential format means the record is over and if I wanted to do some math on this number, it is a perfectly legitimate representation of the number 4.
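To watch those values without the rest of the parser, a small experiment along these lines may help. It is only a sketch: the @lines array stands in for the __DATA__ section and its contents are made up for illustration, including a stray line between records to show the false value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the __DATA__ lines (illustrative values only).
my @lines = (
    "[", "ID: 123", "Start: /tmp/file.1", "Done: /complete/success.1", "]",
    "stray line between records",
    "[", "ID: 456", "]",
);

my @nums;
foreach (@lines) {
    # Scalar-context flip-flop: "" while outside a record, a sequence
    # number while inside, and "E0" appended on the line that ends it.
    my $num = /\[/ ... /\]/;
    push @nums, $num;
    my $state = $num eq ''     ? 'outside any record'
              : $num =~ /E0\z/ ? 'last line of record'
              :                  'inside record';
    printf "%-28s num=%-6s %s\n", $_, "'$num'", $state;
}

# "5E0" is still a perfectly good number when used in arithmetic.
print "numeric value of $nums[4] is ", $nums[4] + 0, "\n";
```

With this input the sequence of values is 1, 2, 3, 4, 5E0, the empty string, then 1, 2, 3E0, because the counter restarts for every record.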

      Update:
      I could have written the code with a more conventional parsing scheme. When the first line of a record is detected, call a subroutine which processes lines until the last line of the record is detected. This eliminates the need to have some flag values like "I'm inside the record now..", etc. The flip-flop implementation essentially does what the below would have done:

      #!/usr/bin/perl -w
      use strict;

      while (<DATA>)
      {
         process_record() if /^\[/;  #start of record
      }

      sub process_record
      {
         my %record;
         my $line;
         my $line_num=1;
         while (defined ($line = <DATA>) and $line !~ /\]/)
         {
            print "line= ",$line_num++," ",$line;
            # do splits and fill in %record here
         }
         print "Record Complete!\n\n";
         # use %record here to populate other hashes
         # %record is thrown away when sub returns.
      }
      =prints
      line= 1 ID: 123
      line= 2 Start: /tmp/file.1 /tmp/file.2 /tmp/file.3
      line= 3 Done: /complete/success.1 /complete/success.2
      Record Complete!

      line= 1 ID: 456
      line= 2 Start: /complete/success.1 /complete/success.2 /tmp/file.3
      line= 3 Done: /complete/success.3 /complete/success.4
      Record Complete!

      =cut
      __DATA__
      [
      ID: 123
      Start: /tmp/file.1 /tmp/file.2 /tmp/file.3
      Done: /complete/success.1 /complete/success.2
      ]
      [
      ID: 456
      Start: /complete/success.1 /complete/success.2 /tmp/file.3
      Done: /complete/success.3 /complete/success.4
      ]

        I had no idea that '...' is an operator. I'm now reading perlop's Range Operator section and I am gradually understanding the mystery.

        Thanks to Corion and Marshall. Marshall's explanation is a great help for me.

        Thousand miles to go before I sleep.

Re^2: Parsing a file and finding the dependencies in it
by legendx (Acolyte) on Jul 06, 2011 at 14:24 UTC
    This works, but I have another question: if there is another field, say "Desc", whose value contains spaces, how can I put the entire value into a variable?
    Example:
    [
    ID: 456
    Desc: This is a test job
    Start: /complete/success.1 /complete/success.2 /tmp/file.3
    Done: /complete/success.3 /complete/success.4
    ]
    The code currently splits everything on whitespace, so only the first word ends up in the value.
    E.g.
    my ($desc) = @{$record{'Desc:'}};
    $desc will only contain the word "This" and not the full value of "This is a test job".
    What is the best way to solve this?
      There is nothing wrong with making Desc: a special case for the splitting. I show some code below...

      In this special situation, you can just test for /^Desc:/. The technique is to limit the number of things returned from the split, in this case 2 things. Doing that requires that we take care of one more detail: a chomp() is needed.

      When we let split() do its default thing, a chomp() is not needed because the trailing \n will be removed (default split is on any sequence of the 5 whitespace characters: space, \n, \f, \r, \t). If we tell split() to stop working after it has 2 things, then we have to do manually what it would have done to the last thing.
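The difference is easy to see in isolation. The following is a minimal sketch using a made-up input line; the variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "Desc: This is a test job\n";

# Default-style split on whitespace: every word becomes its own field
# and the trailing newline disappears along with the other whitespace.
my ($tag, @words) = split ' ', $line;
print "default: tag='$tag' words=", scalar(@words), "\n";

# LIMIT of 2: split stops after the first separator, so the rest of the
# line stays in one piece. chomp first, or the newline stays attached.
my $copy = $line;
chomp $copy;
my ($tag2, $rest) = split /\s+/, $copy, 2;
print "limit 2: tag='$tag2' rest='$rest'\n";

# Same split without the chomp, to show what is left behind.
my (undef, $unchomped) = split /\s+/, $line, 2;
print "without chomp the value ends in a newline\n" if $unchomped =~ /\n\z/;
```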

      I set up %record so that it is a Hash of Array, each value is a reference to an anonymous array of data. That is true even for a single value like the id number. "Same-ness" is a good thing in programming. So, I would do the same for the description string.

      Then there is the question of what to do with this description once the record is complete. You could, say, put another dimension on the hash which has id's as the key. However, there is something to be said for keeping things simple. You could just make another hash that is keyed on id's with the string as the value. Some purists might shudder in horror, but again simplicity has virtues!

      # ........ snip
         if (my $num = /\[/.../\]/)
         {
            if (/^Desc:/)
            {
               chomp;
               my ($desc, $string) = split(/\s+/,$_,2);
               $record{$desc} = [$string];
               next;
            }
            my ($tag, @values) = split;
            @{$record{$tag}} = @values;
      #........ snip

      OR....perhaps...
            if (/^Desc:/)
            {
               chomp;
               my ($desc, $string) = split(/\s+/,$_,2);
               $record{$desc} = [$string];  # same as @{$record{$desc}} = ($string);
            }
            else
            {
               my ($tag, @values) = split;
               @{$record{$tag}} = @values;
      #....snip...
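For the "another hash keyed on id's" idea, a complete loop might look roughly like the sketch below. The %desc hash name and the @lines stand-in for the __DATA__ section are assumptions for illustration; the rest follows the structure of the parser at the top of the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the __DATA__ section; split /^/ keeps one line per element.
my @lines = split /^/, <<'END';
[
ID: 456
Desc: This is a test job
Start: /complete/success.1 /complete/success.2 /tmp/file.3
Done: /complete/success.3 /complete/success.4
]
END

my (%id, %done, %desc, %record);

foreach (@lines) {
    if (my $num = /\[/ ... /\]/) {
        chomp;
        if (/^Desc:/) {
            # keep everything after the tag as one string
            my ($tag, $string) = split /\s+/, $_, 2;
            $record{$tag} = [$string];
        }
        else {
            my ($tag, @values) = split;
            $record{$tag} = [@values];
        }
        if ($num =~ /E0/) {                       # last line of the record
            my ($id) = @{ $record{'ID:'} };
            $id{$id}   = $record{'Start:'};
            $done{$_}  = $id for @{ $record{'Done:'} };
            # %desc (hypothetical name): id => description string
            $desc{$id} = $record{'Desc:'}[0] if $record{'Desc:'};
            %record = ();
        }
    }
}

print "$_: $desc{$_}\n" for sort keys %desc;
```

Keeping the description in its own flat hash means the code that walks %id and %done does not have to change at all.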
        I didn't realize or check to see if you could limit what is returned from split. Thanks for that, that works perfectly fine.
        The "flip-flop" implementation that Marshall referred to was new to me as well, so much to learn!

        Can anyone help to explain what this line does?

        print map{" $_\n"}grep{!$seen{$_}++}priorFiles($file);

        I've read the perldoc on the map function and I think the grep{!$seen{$_}++}priorFiles($file) portion extracts unique elements and the priorFiles subroutine returns the "Start" files? Could someone explain it please?

        Also, I have been trying to figure out how I would be able to tell if an "ID" or "Desc" depends on another "ID" or "Desc", such as showing that ID 456 depends on ID 423. That basically entails looking up the input or "Start" files to see where (which "ID") they came from.
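One way this lookup could be approached, sketched with the two hashes from the parser: for each "Start" file of a job, check %done to see whether some job produced it, and collect those ids. The helper name id_dependencies and the hard-coded hash contents are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same shape as the parser's hashes.
my %done = (
    '/complete/success.1' => '123',
    '/complete/success.2' => '123',
    '/complete/success.3' => '456',
    '/complete/success.4' => '456',
);
my %id = (
    '123' => [ '/tmp/file.1', '/tmp/file.2', '/tmp/file.3' ],
    '456' => [ '/complete/success.1', '/complete/success.2', '/tmp/file.3' ],
);

# id_dependencies() is a hypothetical helper: it returns the ids whose
# output files appear among the given job's input files.
sub id_dependencies {
    my ($job) = @_;
    my %upstream;
    foreach my $input (@{ $id{$job} || [] }) {
        next unless exists $done{$input};   # not produced by any known job
        $upstream{ $done{$input} } = 1;     # hash key removes duplicates
    }
    my @ids = sort keys %upstream;
    return @ids;
}

foreach my $job (sort keys %id) {
    my @deps = id_dependencies($job);
    print "ID $job depends on: ", (@deps ? "@deps" : "(no other ID)"), "\n";
}
```

This only reports direct dependencies. If a description is stored per id in a separate hash, it can be printed next to each id in the same loop.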

      It is a custom format.

      I've heard that often, but I'll take your word for it :)

      The problem is not parsing the file, but finding the dependencies as explained above. Thanks.

      I hate to be contrary :) but yes, the problem is parsing.

      First you build a data structure (parse), then you walk the data structure looking for dependencies.

      If parsing wasn't a problem, surely you would have shared your parser, or at least, the data structure it creates?

      Your response to Marshall's node Re: Parsing a file and finding the dependencies in it firmly confirms that parsing was indeed a problem.

        Yes, I agree it is a parsing issue and that I indeed was incorrect. I've been using Marshall's parser to check the dependencies but just need to get it to pull out the full value field as I noted in http://www.perlmonks.org/?node_id=912984