Creates a dependency and statistics report of a directory with BASH scripts.

The script will generate a dependency report for each BASH script in the specified path.

Right now it call's 'rpm' to get dependencies for RPM's, if your system doesn't have RPM just replace the line that calls it with an undef variable.

Guess I will make a test for it in the next version.

The current output looks like below for each script and in the end there is a summary for all scripts.

/etc/init.d/multipathd |-- binary: basename |-- binary: test |-- teardown_slaves | |-- binary: pwd | |-- binary: sed | |-- binary: echo | |-- binary: readlink | |-- function: teardown_slaves | |-- binary: sed | `-- binary: echo |-- binary: test |-- binary: echo |-- binary: touch |-- binary: echo |-- binary: echo |-- binary: echo `-- binary: echo /etc/init.d/multipathd Code lines: 102 Comment lines: 13 Empty lines: 12 Total lines: 127 Function(s): teardown_slaves Uses function(s): teardown_slaves Uses binarie(s): basename echo info pwd readlink rm sed test touch Uses RPM(s): coreutils-5.97-19.el5 info-4.8-14.el5 sed-4.1.5-5.fc6

The summary looks as following. This output is slightly cut down and is taken from CentOS 5.x running a report on "/etc/init.d"

------------------------------------------- Duplicate local function names: 5 stop 5 start 2 status 2 restart 2 condrestart 1 do_restart_sanity_check 1 makedev 1 start_isdnlog ... ------------------------------------------- Most used functions: 18 stop 12 start 6 restart 4 status 4 invoke_command ... ------------------------------------------- Most used binaries: 422 echo 72 rm 57 touch 40 grep 36 test 30 awk ... ------------------------------------------- Most used RPM's: 671 coreutils-5.97-19.el5 50 grep-2.5.1-54.2.el5 30 gawk-3.1.5-14.el5 19 util-linux-2.13-0.50.el5 10 sed-4.1.5-5.fc6 9 file-4.17-15.el5_3.1 4 xorg-x11-xfs-1.0.2-4 ... ------------------------------------------- Total code lines: 4438 Total comment lines: 989 Total empty lines: 705 Total lines: 6132 Total files: 51

Without further ado here's the code in all it's ugliness.

Updated to v0.9 Solved some of the issues mentioned in the replies. Thanks again for the input.

#!/usr/bin/perl # Bash Parser v0.9 # # Creates a dependency and statistics report of a directory with BASH +scripts. # # Copyright (c) 2009, Michael Persson (mickep76@mac.com) # All rights reserved. # Error checking for files use strict; # Check arguments die "usage: bash_reporter directory\n" if $#ARGV; my $path = $ARGV[0]; die "Please specify a valid directory\n" if not -d $path; # Ignore functions our %ignore_functions = ( # 'echo' => 1, # 'true' => 1 ); # Ignore binaries our %ignore_binaries = ( # 'sed' => 1, # 'awk' => 1, # 'echo' => 1, # 'cut' => 1 ); # BASH keywords 'compgen -k' our %bash_keywords = ( 'if' => 1, 'then' => 1, 'else' => 1, 'elif' => 1, 'fi' => 1, 'case' => 1, 'esac' => 1, 'for' => 1, 'select' => 1, 'while' => 1, 'until' => 1, 'do' => 1, 'done' => 1, 'in' => 1, 'function' => 1, 'time' => 1 ); # Global hashes our %binaries_in_path; our %binaries_to_rpm; our %all_sources; our %all_functions; our %dependencies; our %variables; # Global statistics hashes our %stat_sources; our %stat_functions; our %stat_uses_binaries; our %stat_uses_functions; our %stat_uses_rpms; # Global totals our $tot_code_lines = 0; our $tot_comment_lines = 0; our $tot_empty_lines = 0; our $tot_lines = 0; our $tot_files = 0; # Get all binaries in the PATH get_binaries(); # Recurse through all files in the path recurse($path); # Go through all sourced files foreach(keys %all_sources) { match($_, 0) } # Print statistics print_stat_hash(\%stat_sources, "Most sourced files:"); print_stat_hash(\%stat_functions, "Duplicate local function names:", 1 +); print_stat_hash(\%stat_uses_functions, "Most used functions:"); print_stat_hash(\%stat_uses_binaries, "Most used binaries:"); print_stat_hash(\%stat_uses_rpms, "Most used RPM's:"); # Print totals print_separator(); print "Total code lines:\t$tot_code_lines\n"; print "Total comment lines:\t$tot_comment_lines\n"; print "Total empty lines:\t$tot_empty_lines\n"; print "Total lines:\t\t$tot_lines\n"; print "Total files:\t\t$tot_files\n"; # Get all binaries in the PATH sub get_binaries { foreach my $directory (split ':', $ENV{'PATH'}) { foreach(glob("$directory/*")) { my $file = $_; if(-f $_ && -x $_) { s/.*\///; $binaries_in_path{$_} = $file; } } } } # Recurse through all files in a given path sub recurse { my $path = shift; foreach(glob("$path/*")) { if(-d $_) { recurse($_) } else { match($_, 1) } } } # Find sources, functions and binaries used in file sub match { my $file = shift; my $check_if_bash = shift; undef %all_functions; undef %dependencies; undef %variables; my %functions; my %sources; my %uses_binaries; my %uses_functions; my %uses_rpms; my $function = undef; my $line_number = 1; my $comment_lines = 0; my $empty_lines = 0; my $text_block = undef; my $function_line = undef; open FH, "<$file" or die "Failed to open file $file: $!\n"; foreach(<FH>) { chomp(); my $ignore_line = 0; # Check if file is a Bash script if($check_if_bash == 1 && $line_number == 1 && ! /bash/) { return +} # Count comments and empty lines elsif(/(\%\%.*\%\%)/) { if($1 eq $text_block) { $text_block = undef } else { $text_block = $1; } } elsif(/^[ \t]*#/) { $comment_lines++; $ignore_line = 1; } elsif($text_block) { $ignore_line = 1 } elsif($_ eq '') { $empty_lines++ } # Get variable assignments elsif(/([\w\.\-\_]+)=(.*)/) { $variables{$1} = $2; } # Get files sourced by script elsif(/^[ \t]*source +([\w\-\/\.\_]+)/) { my $source = expand_path($1); $sources{$source}++; $stat_sources{$source}++; push @{$dependencies{$file}}, $source; get_sources($source); } # Get functions in script elsif((/function +([\w\.\-\_]+)/ || /([\w\.\-\_]+)\(\)/) && ! $bas +h_keywords{$1}) { $function = $1; $function_line = $line_number; } elsif($function ne undef && ($function_line + 1) == $line_number +) { if(/^{/) { $functions{$function}++; $all_functions{$function}++; $stat_functions{$function}++; push @{$dependencies{$file}}, $function; } else { $function = undef; $function_line = undef; } } elsif(/^}/) { $function = undef; $function_line = undef; } $line_number++; if($ignore_line) { next } # Get all binaries and functions called from script s/\#.*//; foreach my $word (split /[\`\;\| \t]+/) { if($all_functions{$word} && ! $ignore_functions{$word}) { $uses_functions{$word}++; $stat_uses_functions{$word}++; if($function eq undef) { push @{$dependencies{$file}}, "functi +on: $word" } else { push @{$dependencies{$function}}, "function: $word" } } elsif($binaries_in_path{$word} ne undef && ! $ignore_binaries{$w +ord}) { $uses_binaries{$word}++; $stat_uses_binaries{$word}++; if($function eq undef) { push @{$dependencies{$file}}, "binar +y: $word" } else { push @{$dependencies{$function}}, "binary: $word" } if($binaries_to_rpm{$word} eq undef) { my $rpm = `rpm -qf $binaries_in_path{$word}`; if($rpm =~ /([\w\.\-\_]+)/ && $rpm !~ /not owned/) { $uses_rpms{$1}++; $binaries_to_rpm{$word} = $1; $stat_uses_rpms{$1}++; } } else { $uses_rpms{$binaries_to_rpm{$word}}++; $stat_uses_rpms{$binaries_to_rpm{$word}}++; } } } } close FH; $tot_code_lines += $line_number - $comment_lines - $empty_lines - 1; $tot_comment_lines += $comment_lines; $tot_empty_lines += $empty_lines; $tot_lines += $line_number - 1; $tot_files++; print_separator(); print_dependencies($file); print "\n$file\n"; printf "\tCode lines:\t%s\n", $line_number - $comment_lines - $empty +_lines - 1; printf "\tComment lines:\t%s\n", $comment_lines; printf "\tEmpty lines:\t%s\n", $empty_lines; printf "\tTotal lines:\t%s\n", $line_number - 1; print_hash(\%sources, 'Source(s):'); print_hash(\%functions, 'Function(s):'); print_hash(\%uses_functions, 'Uses function(s):'); print_hash(\%uses_binaries, 'Uses binarie(s):'); print_hash(\%uses_rpms, 'Uses RPM(s):'); } # Get all functions for sourced files and place in all_functions sub get_sources { my $file = shift; $all_sources{$file}++; my $function = undef; open FH, "<$file" or die "Failed to open file $file: $!\n"; foreach(<FH>) { chomp(); # Get files sourced by script if(/^[ \t]*source +([\w\-\/\.\_\$]+)/) { my $source = expand_path($1); push @{$dependencies{$file}}, $source; get_sources($source); } # Get variable assignments elsif(/([\w\.\-\_]+)=(.*)/) { $variables{$1} = $2; } # Get functions in script elsif(/function +([\w\.\-]+)/ || /([\w\.\-]+)\(\)/) { $function = +$1; } elsif($function ne undef && /{/ && ! $bash_keywords{$function}) { +$all_functions{$function}++; } elsif(/}/) { $function = undef; } } close FH; } # Expand variables in path sub expand_path { my $path = shift; if($path =~ /\$([\w\-\.\_]+)/) { my $variable = $1; $path =~ s/\$$variable/$variables{$variable}/; } return $path; } # Print separator sub print_separator { print "-" x 100 . "\n"; } # Print hash sub print_hash { my $hash = shift; my $heading = shift; if(scalar(keys %{$hash}) > 0) { print "\t$heading\n" } foreach(sort keys %{$hash}) { print "\t\t$_\n" } } # Print statistics hash sub print_stat_hash { my $hash = shift; my $heading = shift; my $minimum = shift; if(scalar(keys %{$hash}) > 0) { print_separator(); print "$heading\n"; } foreach(sort { ${$hash}{$b} <=> ${$hash}{$a} } keys %{$hash}) { if(${$hash}{$_} > $minimum) { printf "%s\t$_\n", ${$hash}{$_} } } } # Print dependency tree sub print_dependencies { my $function = shift; my $padding = shift; my $where = shift; if($where eq 'middle') { print "$padding|-- "; $padding .= '| '; } elsif($where eq 'last') { print "$padding`-- "; $padding .= ' '; } print "$function\n"; $where = "middle"; my $count = 1; foreach(@{$dependencies{$function}}) { if($count++ == scalar(@{$dependencies{$function}})) { $where = "la +st" } if($function ne $_) { print_dependencies($_, $padding, $where) } } }

Replies are listed 'Best First'.
Re: Bash Parser
by JavaFan (Canon) on Sep 30, 2009 at 09:54 UTC
    undef my %functions;
    What's the point of using an explicite undef here? What's wrong with my %functions;?
    my $regex = "\$file =~ s/\\\$$variable/$path/g;"; eval "$regex";
    I don't get this. Why not just $file =~ s/\$$variable/$path/g;, or, if you're worried the substitution has a fatal syntax: eval {$file =~ s/\$$variable/$path/g;}? But you aren't checking the result of the eval, so you must be confident that it won't fail.
    open FH2, "<$file";
    Remember what one of the creators of Unix (Thompson?) said: not being able to open a file is not exceptional. Always check if opening a file succeeded.
    if(/^[ \t]*source +([\w\-\/\.\_\$]+)/)
    Hmmm. So, I can have tabs before the source keyword, but not after them? And I cannot surround the filenames with quotes? Or have non-word characters other than '-', '/', '.', '_' or '$' in them?
    if($check_if_bash == 1 && $line_number == 1 && ! /bash/) { return }
    A bash script doesn't have to start with a line saying "bash". Just like a Perl program doesn't have to start with a line saying "perl". And even if such a script starts with a she-bang line, it may says #!/usr/bin/sh, where /usr/bin/sh and /usr/bin/bash are links to the same file.
    # Count comments and empty lines elsif(/(\%\%.*\%\%)/) {
    What makes you think that %% ... %% on a single line is a comment block in bash? My bash manual mentions various meanings for %% depending on context, but none of them have anything to do with comments.

    You also seem to assume bash programs use only newlines as statement separators, and all newlines separate statements. Neither statement is true.

    You don't seem to deal with any of the many quoting and grouping mechanisms of a bash program. Nor does your program seem to be able to deal with here documents.

    There's quite some duplication of code between 'match' and 'get_sources', but there are also differences. It's totally unclear to me why there need to be any differences.

    It looks to me there's no guards against infinite recursion. If file 'A' contains source A, then your program will never stop by itself.

      Using undef my %... is redundant as you say, really serves no purpose.

      eval {$file =~ s/\$$variable/$path/g;}

      I will add proper error checking to files, just didn't bother.

      You are correct about the substitution, just couldn't find a better approach to do it. Which you neatly just gave me :D.

      if(/^[ \t]*source +([\w\-\/\.\_\$]+)/)

      The regular expression was more adapted to what people use through testing than what they might use, I will add you suggestions.

      if($check_if_bash == 1 && $line_number == 1 && ! /bash/) { return }

      Was mainly to exclude scripts that are executable but not bash. If you have a good idea how to verify if a script is indeed bash when it lacks #!/usr/bin/bash id be greatful.

      You are right %%...%% isn't a comment it's usually a text block, the main reason to detect it was to avoid parsing it later. But it should not be included in comment statistics.

      match/get_sources have slightly different structure since I don't want the statistics from sourced files since the statistics is per script and sourced files are individually included in the end.

      There is as you say a possibility of a race condition I will fix this. Thanks

        %% - do you handle here docs as well and ignore embedded multiline strings containing perl or awk, for all styles of ', ", \-escaped or HERE-document?

        Identifying bash scripts:

        If the script's not being invoked by sourcing or an explicit bash filename: -T file AND -x file AND strings [\s/](ba)?sh or sh<versionstring> in the shebang line AND (to be safe) some specific bourne-shell style idiom like N>&N.

        Also consider asking file(1) for a guess.

        Definitely a lie, but still quite common:

        #!/bin/sh #!perl # -w -Sx: line count is one off eval 'exec perl -Sx $0 ${1:+"$@"}' if 0;

        Trying to distinguish between bourne-descending shell might be an additional challenge. Bash vs ksh93 vs dash vs pdksh vs zsh vs csh-considered-harmful vs ... .

        Homework for the astute reader: which of the above shells would be the most easy to detect and exclude?

      This script is not perfect and probably has more flaws than what you mentioned. It was a quick hack to solve a problem and works fairly well and is fairly accurate but by no accounts is it perfect. Thanks for the input, will try to address your points.
        add use autodie; and you're checking for open/close failure