Parsing a file and finding the dependencies in it

legendx has asked for the wisdom of the Perl Monks concerning the following question:

Re: Parsing a file and finding the dependencies in it by Marshall (Canon) on Jul 05, 2011 at 21:06 UTC
To get you started, see the parser below. I generate two hash tables which are linked by the id number. One hash translates any output file to the id number which produced it. The other hash translates an id number into the input files which created it.

I haven't written any code to make a report, but I think that this is enough to travel backwards from an output fileA -> id, then id -> inputfilesX; therefore output fileA depends upon inputfilesX. Those input files can be looked up to see where they came from, etc. Hope this provides fuel for thought. It could be that a different data structure is better than this, but at least it shows one way to get the parsing done.

Update: Added some printing code to make a basic report to show all files used to generate a particular output file.

    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    my %id;
    my %done;
    my %record=();
    while (<DATA>)
    {
       if (my $num = /\[/.../\]/)
       {
          my ($tag, @values) = split;
          @{$record{$tag}} = @values;
          if ($num =~ /E0/)
          {
             my ($id) = @{$record{'ID:'}};
             @{$id{$id}} = @{$record{'Start:'}};
             foreach (@{$record{'Done:'}})
             {
                $done{$_}= $id;
             }
             %record=();
          }
       }
    }
    print Dumper \%done;
    print Dumper \%id;

    foreach my $file (keys %done)
    {
       print "$file\n";
       my %seen;
       print map{" $_\n"}grep{!$seen{$_}++}priorFiles($file);
       print "\n";
    }

    sub priorFiles
    {
       my ($file) = @_;
       return() if !exists $done{$file};
       my @prior = @{$id{$done{$file}}};
       foreach (@prior)
       {
          push @prior, priorFiles($_);
       }
       return @prior;
    }

    =output
    %done shows the id number which produced each file
    $VAR1 = {
              '/complete/success.3' => '456',
              '/complete/success.2' => '123',
              '/complete/success.1' => '123',
              '/complete/success.4' => '456'
            };
    %id shows the input files were used by the id
    $VAR1 = {
              '456' => [
                         '/complete/success.1',
                         '/complete/success.2',
                         '/tmp/file.3'
                       ],
              '123' => [
                         '/tmp/file.1',
                         '/tmp/file.2',
                         '/tmp/file.3'
                       ]
            };
    #This is a basic listing..all files that affected the first file
    /complete/success.3
     /complete/success.1
     /complete/success.2
     /tmp/file.3
     /tmp/file.1
     /tmp/file.2

    /complete/success.2
     /tmp/file.1
     /tmp/file.2
     /tmp/file.3

    /complete/success.1
     /tmp/file.1
     /tmp/file.2
     /tmp/file.3

    /complete/success.4
     /complete/success.1
     /complete/success.2
     /tmp/file.3
     /tmp/file.1
     /tmp/file.2
    =cut

    __DATA__
    [
    ID: 123
    Start: /tmp/file.1 /tmp/file.2 /tmp/file.3
    Done: /complete/success.1 /complete/success.2
    ]
    [
    ID: 456
    Start: /complete/success.1 /complete/success.2 /tmp/file.3
    Done: /complete/success.3 /complete/success.4
    ]
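The backwards walk described in this reply (output file -> id -> input files, repeated) can also be written with an explicit queue, which avoids pushing onto an array while looping over it and stops if the data ever contains a cycle. This is only a sketch under assumptions: the sub name `all_prior_files` and the hard-coded sample hashes are illustrative, and the hashes simply mirror the shape of `%done` and `%id` from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data in the same shape as the two hashes from the post:
# %done maps an output file to the id that produced it,
# %id maps an id to the input files that id started from.
my %done = (
    '/complete/success.1' => '123',
    '/complete/success.2' => '123',
    '/complete/success.3' => '456',
    '/complete/success.4' => '456',
);
my %id = (
    '123' => [ '/tmp/file.1', '/tmp/file.2', '/tmp/file.3' ],
    '456' => [ '/complete/success.1', '/complete/success.2', '/tmp/file.3' ],
);

# all_prior_files is an illustrative name, not from the thread.
# Returns every file that $file depends on, directly or indirectly,
# each listed once, in breadth-first order.
sub all_prior_files {
    my ($file, $done, $id) = @_;
    my %seen  = ($file => 1);
    my @queue = ($file);
    my @prior;
    while (@queue) {
        my $current = shift @queue;
        # A file that no id produced is a plain source file.
        next unless exists $done->{$current};
        my $inputs = $id->{ $done->{$current} } || [];
        for my $input (@$inputs) {
            next if $seen{$input}++;    # skip duplicates and cycles
            push @prior, $input;
            push @queue, $input;
        }
    }
    return @prior;
}

print "$_\n" for all_prior_files('/complete/success.3', \%done, \%id);
```

For `/complete/success.3` this lists the same five files as the report in the post. The `%seen` hash does the de-duplication that the original leaves to the caller's `grep`.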
Re^2: Parsing a file and finding the dependencies in it by legendx (Acolyte) on Jul 06, 2011 at 01:53 UTC
Hm, I see. I had a similar idea to what you did there. I will try this and see where it leads. Thanks!
Re^2: Parsing a file and finding the dependencies in it by Anonymous Monk on Jul 06, 2011 at 03:43 UTC
Thanks for adding the update and your help.
Re^2: Parsing a file and finding the dependencies in it by remiah (Hermit) on Jul 06, 2011 at 14:49 UTC
This works fine on my machine, but I didn't get $num and its regexp at first.

`if (my $num = /\[/.../\]/)`

must be

`if (my $num = $_ =~ /\[/.../\]/)`

I don't understand what the regexp '/\[/.../\]/' means. I asked YAPE::Regex::Explain

    my $regexp='\[/.../\]';
    use YAPE::Regex::Explain;
    my $exp = YAPE::Regex::Explain->new($regexp)->explain;
    print $exp;

and it says

    The regular expression:

    (?-imsx:\[/.../\])

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      \[                       '['
    ----------------------------------------------------------------------
      /                        '/'
    ----------------------------------------------------------------------
      .                        any character except \n
    ----------------------------------------------------------------------
      .                        any character except \n
    ----------------------------------------------------------------------
      .                        any character except \n
    ----------------------------------------------------------------------
      /                        '/'
    ----------------------------------------------------------------------
      \]                       ']'
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

Where does it match against the __DATA__ lines? I have no idea. And why /E0/? Someone please shed some light for me.
Re^3: Parsing a file and finding the dependencies in it by Corion (Patriarch) on Jul 06, 2011 at 14:57 UTC
It's not a single regular expression but two separate regular expressions, connected by the `...` operator (see perlop). Also, every regular expression matches against `$_` by default, so `$_ =~ /foo/` is identical to `/foo/`.
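To watch the two regexes and the `...` operator at work, a small sketch can collect the value the operator yields for each line. The helper name `flip_flop_values` and the sample lines are made up for illustration; only the `/\[/ ... /\]/` expression comes from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# flip_flop_values is an illustrative helper, not from the thread.
# It returns what the three-dot range operator yields for each line.
sub flip_flop_values {
    my @lines = @_;
    my @values;
    for (@lines) {
        # Two separate regexes, each matched against $_,
        # joined by the range operator in scalar context.
        my $num = /\[/ ... /\]/;
        push @values, $num;
    }
    return @values;
}

my @lines = ("junk\n", "[\n", "ID: 123\n", "]\n", "junk\n");

# Show the empty string (false) as '' so it is visible in the output.
print join(',', map { $_ eq '' ? "''" : $_ } flip_flop_values(@lines)), "\n";
```

This prints `'',1,2,3E0,''`: false outside the brackets, a running sequence number inside them, and `E0` appended on the line that matches the second regex.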
Re^3: Parsing a file and finding the dependencies in it by Marshall (Canon) on Jul 06, 2011 at 19:35 UTC
Hope you understood Corion's reply. "while (<DATA>)" causes $_ to be set to the next line in __DATA__ at every iteration. This construct:

    my $num = /regex/.../regex/

uses what is also known as the flip-flop operator. A classic post on this by Grandfather is: Flipin good, or a total flop?.

A single regex that matches will have a true value; I think a numeric 1 is returned. In the case where 2 regexes are joined by the ... operator, a sequence number is returned representing which line of the record we are on. I would suggest that you put a print "num=$num\n"; statement in the loop and watch what happens. You will see values like: 1,2,3,4E0. The 4E0 means that something is different about this line number! And indeed there is. It is the line that contains the ']' character (the last line of the record - the line that matches the 2nd regex).

The E0 is just exponential notation meaning 10**0. Any number raised to the zeroth power is 1. So 4E0 = 4 * 10**0 = 4 * 1 = 4 from a numeric perspective. So this is a clever way to return 2 pieces of information with a single number. A number in exponential format means the record is over, and if I wanted to do some math on this number, it is a perfectly legitimate representation of the number 4.

Update: I could have written the code with a more conventional parsing scheme. When the first line of a record is detected, call a subroutine which processes lines until the last line of the record is detected. This eliminates the need to have some flag values like "I'm inside the record now..", etc. The flip-flop implementation essentially does what the below would have done:

    #!/usr/bin/perl -w
    use strict;

    while (<DATA>)
    {
       process_record() if /^\[/;  #start of record
    }

    sub process_record
    {
       my %record;
       my $line;
       my $line_num=1;
       while (defined ($line = <DATA>) and $line !~ /\]/)
       {
          print "line= ",$line_num++," ",$line;
          # do splits and fill in %record here
       }
       print "Record Complete!\n\n";
       # use %record here to populate other hashes
       # %record is thrown away when sub returns.
    }

    =prints
    line= 1 ID: 123
    line= 2 Start: /tmp/file.1 /tmp/file.2 /tmp/file.3
    line= 3 Done: /complete/success.1 /complete/success.2
    Record Complete!

    line= 1 ID: 456
    line= 2 Start: /complete/success.1 /complete/success.2 /tmp/file.3
    line= 3 Done: /complete/success.3 /complete/success.4
    Record Complete!

    =cut

    __DATA__
    [
    ID: 123
    Start: /tmp/file.1 /tmp/file.2 /tmp/file.3
    Done: /complete/success.1 /complete/success.2
    ]
    [
    ID: 456
    Start: /complete/success.1 /complete/success.2 /tmp/file.3
    Done: /complete/success.3 /complete/success.4
    ]
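One possible way to fill in the "do splits and fill in %record here" step of the subroutine approach is sketched below. It is an assumption-laden variant rather than the code from the post: the sub names `read_record` and `parse_records` are made up, the subs take a filehandle argument instead of reading `DATA` directly, and each record is returned as a hash reference instead of being thrown away.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# read_record is an illustrative name. It assumes the opening "[" line
# has already been consumed and reads up to the closing "]" line.
sub read_record {
    my ($fh) = @_;
    my %record;
    while (defined(my $line = <$fh>)) {
        last if $line =~ /\]/;                   # end of record
        my ($tag, @values) = split ' ', $line;   # "Start:" then file names
        next unless defined $tag;                # skip blank lines
        $record{$tag} = [@values];
    }
    return \%record;
}

# parse_records is also illustrative: returns one hash ref per record.
sub parse_records {
    my ($fh) = @_;
    my @records;
    while (defined(my $line = <$fh>)) {
        push @records, read_record($fh) if $line =~ /^\[/;
    }
    return @records;
}

# Sample input in the thread's format, read from an in-memory filehandle.
my $text = <<'END';
[
ID: 123
Start: /tmp/file.1 /tmp/file.2 /tmp/file.3
Done: /complete/success.1 /complete/success.2
]
END

open my $fh, '<', \$text or die "cannot open in-memory file: $!";
for my $rec (parse_records($fh)) {
    print "id $rec->{'ID:'}[0] produced @{ $rec->{'Done:'} }\n";
}
close $fh;
```

Returning a reference per record keeps the "no inside-the-record flag" property of the subroutine scheme while letting the caller decide how to populate hashes such as `%id` and `%done`.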
Re^4: Parsing a file and finding the dependencies in it by remiah (Hermit) on Jul 07, 2011 at 07:30 UTC
Re^2: Parsing a file and finding the dependencies in it by legendx (Acolyte) on Jul 06, 2011 at 14:24 UTC
This works, but I have another question: if there is another field, say "Desc", with a value that is space delimited, how can I put the entire value into a variable? Example:

    [
    ID: 456
    Desc: This is a test job
    Start: /complete/success.1 /complete/success.2 /tmp/file.3
    Done: /complete/success.3 /complete/success.4
    ]

The code currently splits everything on whitespace, so only the first word is in the value. E.g.

`my ($desc) = @{$record{'Desc:'}};`

$desc will only contain the word "This" and not the full value of "This is a test job". What is the best way to solve this?
Re^3: Parsing a file and finding the dependencies in it by Marshall (Canon) on Jul 06, 2011 at 19:16 UTC
There is nothing wrong with making Desc: a special case for the splitting. I show some code below... In this special situation, you can just test for /^Desc:/. The technique is to limit the number of things returned from the split, in this case 2 things. Doing that requires that we take care of one more detail: a chomp() is needed. When we let split() do its default thing, a chomp() is not needed because the trailing \n will be removed (default split is on any sequence of the 5 whitespace characters: space, \n, \f, \r, \t). If we tell split() to stop working after it has 2 things, then we have to do manually what it would have done to the last thing.

I set up %record so that it is a Hash of Array: each value is a reference to an anonymous array of data. That is true even for a single value like the id number. "Same-ness" is a good thing in programming. So, I would do the same for the description string.

Then the question is: what do you do with this description once the record is complete? You could, say, put another dimension on the hash which has ids as the key. However, there is something to be said for keeping things simple. You could just make another hash that is keyed on ids with the string as the value. Some purists might shudder in horror, but again simplicity has virtues!

    # ........ snip
       if (my $num = /\[/.../\]/)
       {
          if (/^Desc:/)
          {
             chomp;
             my ($desc, $string) = split(/\s+/,$_,2);
             $record{$desc} = [$string];
             next;
          }
          my ($tag, @values) = split;
          @{$record{$tag}} = @values;
    #........ snip

    OR....perhaps...

          if (/^Desc:/)
          {
             chomp;
             my ($desc, $string) = split(/\s+/,$_,2);
             $record{$desc} = [$string];  # same as @{$record{$desc}} = ($string);
          }
          else
          {
             my ($tag, @values) = split;
             @{$record{$tag}} = @values;
    #....snip...
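The limit argument to split can be tried in isolation with a small sketch like the following. The helper name `parse_field` and the sample lines are illustrative only; the `/^Desc:/` test and the limit of 2 follow the reply above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_field is an illustrative helper, not from the thread.
# Returns the tag and an array ref of values for one record line.
sub parse_field {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^Desc:/) {
        # Limit of 2: the tag, then the rest of the line untouched.
        my ($tag, $string) = split /\s+/, $line, 2;
        $string = '' unless defined $string;   # "Desc:" with no text
        return ($tag, [$string]);
    }
    # Default-style split: any run of whitespace separates the values.
    my ($tag, @values) = split ' ', $line;
    return ($tag, \@values);
}

for my $line ("Desc: This is a test job\n", "Start: /tmp/file.1 /tmp/file.2\n") {
    my ($tag, $values) = parse_field($line);
    print "$tag has ", scalar(@$values), " value(s): @$values\n";
}
```

Both branches return an array reference, so the "same-ness" of the Hash of Array structure is preserved: the description is simply a one-element array holding the whole string.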
Re^4: Parsing a file and finding the dependencies in it by legendx (Acolyte) on Jul 07, 2011 at 13:28 UTC
Re^4: Parsing a file and finding the dependencies in it by legendx (Acolyte) on Jul 07, 2011 at 14:49 UTC
Re^5: Parsing a file and finding the dependencies in it by Marshall (Canon) on Jul 09, 2011 at 15:10 UTC
Re^5: Parsing a file and finding the dependencies in it by Anonymous Monk on Jul 09, 2011 at 15:23 UTC
Re^3: Parsing a file and finding the dependencies in it by Anonymous Monk on Jul 06, 2011 at 15:10 UTC
"It is a custom format."

I've heard that often, but I'll take your word for it :)

"The problem is not parsing the file, but finding the dependencies as explained above. Thanks."

I hate to be contrary :) but yes, the problem is parsing. First you build a data structure (parse), then you walk the data structure looking for dependencies. If parsing wasn't a problem, surely you would have shared your parser, or at least the data structure it creates? Your response to Marshall's node Re: Parsing a file and finding the dependencies in it firmly confirms that parsing was indeed a problem.
Re^4: Parsing a file and finding the dependencies in it by legendx (Acolyte) on Jul 06, 2011 at 15:31 UTC
Re: Parsing a file and finding the dependencies in it by Anonymous Monk on Jul 05, 2011 at 20:26 UTC
"Hello all, I would appreciate advice on how to go about writing a Perl script solution to the following problem:"

The first thing I would do is identify the file format, and look to CPAN for a parser... Alternatively, see String Search (esp. Re: String Search, Re^5: String Search).
Re^2: Parsing a file and finding the dependencies in it by legendx (Acolyte) on Jul 06, 2011 at 01:39 UTC
It is a custom format. The problem is not parsing the file, but finding the dependencies as explained above. Thanks.