=pod

=head1 Introduction to Text Utilities

UGA Chemistry Summer Lecture Series, June 2, 2006

=head1 Intended Audience

Chemistry summer research students (upperclassmen & graduate) doing computational tasks.

=head1 Abstract

This guide is an introduction, demonstration, and reference point that exposes Chemistry summer research students to text processing via the command-line interface and the power and efficiency it offers. Using the standard text utilities will lead up to more advanced scripting with Perl.

=head1 Environment

Linux running the bash shell

=head1 Author & Presenter

David Westbrook, E<lt>dwestbrook@gmail.comE<gt>

=head1 Warning Label

This presentation is a crash course. The primary goal is exposure to the freely available tools and methods so that they can be learned and used in the future as the need arises.

=head1 The Toolbox

First we will review the basic tools that are available by default for working with and manipulating text files.

=over 4

=item man

THIS IS ONE OF THE MOST IMPORTANT COMMANDS. "man" stands for "Manual" and provides documentation for all of the commands in this document.

    man man
    man cd
    man ls

=back

=head2 Seeing files

Before we can work with a file, we have to know where it is.

=over 2

=item cd

This is used to Change Directories.

    cd /tmp
    cd ~
    cd -
    cd ..
    cd ../../foo/bar

=item pwd

This Prints the Working Directory -- i.e. tells you what directory you are in.

=item ls

LiSts files. Shows the contents of the current directory. See the man page for the many options.

    ls
    ls /tmp
    ls -lart
    ls -lart ../foo

=item locate

Searches all the filenames on the system for a given search string.
The availability of this command is system-dependent, and there are several caveats: 1) it works off a database that is generated nightly, so files created today won't be found; 2) it respects permissions, so files that you can't read won't be found; 3) it only works on the local filesystem, so files in mounted directories won't be found.

    locate host
    locate pass
    locate etc/pa

=item find

Recursively lists all of the files in the given directory (defaults to the current one). Many, many options in the man page.

    find
    find /tmp
    find /tmp -type f
    find /tmp -maxdepth 1 -type f -mtime +6 -exec echo {} \;

=back

=head2 File information

We need to be able to obtain basic information about a file to know what we're working with.

=over 2

=item ls -l

List the details of a file. This includes the permissions, owner, group, size, and last modified date.

=item wc

WordCount. Displays the number of lines, words, and bytes in a file.

=item file

Attempts to determine the file's contents -- e.g. html or text or binary or excel, etc.

=item identify

Similar to L</file> but for graphics files. Will include size and color information. This is provided by the ImageMagick toolset.

=back

=head2 File contents

Now we can begin to work with the file's actual contents. Note that "print" means "output to the screen" in this context.

=over 2

=item cat

Just prints out the contents of each file it's given. (Same as I<type> in DOS)

    cat file1
    cat file1 file2
    cat -n file1

=item less

Shows a file one screen at a time (known as a 'pager'). (There is also a command 'more', but it has fewer features than I<less>.)

=item head

Prints out the first N lines of a file.

    head file1
    head -2 file1
    head -3 file1 file2

=item tail

Prints out the last N lines of a file.

    tail file1
    tail -2 file1
    tail -3 file1 file2
    tail -n +5 file1

=item grep

Search files for a given string and print the matching lines. See man page for many, many options.

    grep foo file1
    grep -i foo file1
    grep -l foo *
    grep -n foo file1
    grep -A3 foo file1

=item strings

Prints out the runs of printable text found in a file. Especially useful on binary files for finding the pieces of text buried in their compiled contents.

    strings a.out
    strings foo.exe
    strings /bin/ls

=item sort

Orders (i.e. sorts!!) the lines of a file. See man page for details.

    sort file1

=item uniq

Displays just the unique lines of a file. The file must be sorted first, because only adjacent duplicate lines are collapsed.

=item cut

Print just the specified columns of a file. See man page for details.

    cut -f1,3 file1
    cut -f1,5,6 -d: file1

=item split

Split a file into chunks. See man page.

=item join

Combine two files based on a common column. See man page.

=back

=head2 File Management

These are listed for quick reference -- refer to the man pages for further details.

=over 2

=item cp

=item mv

=item rm

=item mkdir

=item rmdir

=back

=head2 Editors

=over 2

=item vi

vi (or vim) does have a bit of a learning curve, but it is well worth it -- it is very powerful and is available on pretty much every *nix machine (there is gvim for Windows, too). It is best to find a reference (book or online tutorial) for the commands. Some essentials:

    :q              quits
    :q!             quits w/o saving
    :w              save
    :w!             force save
    :wq             saves and quits
    i               enter editing (insert) mode
    ESC             return to command mode
    /foo            search for foo
    :s/foo/bar      replace foo with bar

Others that you'll want to know (in no particular order):

    yy p dd dw w :$ :1 :55
    :s/foo/bar/g  :%s/foo/bar  :%s/foo/bar/g  :5,10s/foo/bar
    n N :n :N :wn :wN ctrl-g s x

=item view

Same as L</vi> but starts it in read-only mode. It's a very good habit to use I<view> when you know you're only looking at a file so you don't accidentally change it.

=back

=head2 Miscellaneous

=over 2

=item clear

Clears the screen -- same as cls in DOS.

=item echo

Just displays its arguments to the screen (same as DOS).

    echo blah
    echo path=$PATH
    echo -n foo
    echo -e 'foo\tbar\nstuff'

=item touch

Updates the last modified timestamp on a file.
If the file doesn't exist, creates a 0-byte file.

=item seq

Prints out sequences of numbers. See options. Also see L</Loops> for example usage.

    seq 1 10
    seq 1 2 10

=item cal

Prints out a nicely formatted calendar.

    cal
    cal 7 2006

=item look

Prints out words from a dictionary file that start with the given string.

    look foo
    look princ

=item date

Prints out the date. See options in man page for various formats.

    date
    date +%Y-%m-%d

=item sleep

Pauses for N seconds.

    sleep 2

=item alias

Define your own commands.

    alias cls=clear

=item wget

Gets files from the web (or ftp). Extremely useful and powerful -- can mirror entire sites. See man page for lots of options.

    wget http://foo.example.com/blah.tar.gz
    wget ftp://foo.example.com/blah.tar.gz

=item curl

Another tool to get remote files (in case wget isn't available).

    curl --remote-name http://foo.example.com/blah.tar.gz

=item lynx

A text-based web browser! Useful for simple pages, testing connections, sucking down source code, converting html to text, or downloading files from HTTP or FTP sites.

=back

=head1 Combining Tools

=head2 Pipes

'|' is the "pipe" character. It is used to take the output from the left-hand side (LHS) and give/shove ("pipe") it as input to the right-hand side (RHS). Here are several example tasks that consecutively use two or more of the tools we have discussed.

=head3 Find a word that starts with "c" and has a "mel" in it.

    look c | grep mel

=head3 See if the word FOO is in the first 3 lines of a file.

    head -3 file1 | grep FOO

=head3 Take the lines that have FOO, look at just the first column, and show the unique values

    grep FOO file1 | cut -f1 | sort -u

=head3 Determine the location of a file with FOO in its name.

    find | grep FOO
    locate FOO

=head2 Redirection

The output of a command can be saved to another file.
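
As a minimal sketch of the redirection operators this section covers, the following runs in a throwaway directory. The names demo_dir, file1 and foo_lines, and the file contents, are made up for illustration only.

```shell
# Work in a scratch directory so nothing real is overwritten.
# (demo_dir, file1 and foo_lines are hypothetical names for this sketch.)
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Create a small input file to search.
printf 'FOO one\nbar two\nFOO three\n' > file1

# '>' creates (or overwrites) the target file with the command's output.
grep FOO file1 > foo_lines

# '>>' appends to the target file instead of overwriting it.
grep bar file1 >> foo_lines

# '<' feeds a file to a command's standard input.
line_count=$(wc -l < foo_lines)

echo "foo_lines now has $line_count lines"
```

Repeating the first grep with '>' would throw away what foo_lines already held, which is why the second grep uses '>>'.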

=head3 Output

    grep FOO file1 > foo_lines

=head3 Append Output

    grep FOO file1 >> foo_lines
    grep FOO file2 >> foo_lines

=head3 Input

    grep FOO file1
    cat file1 | grep FOO
    grep FOO < file1
    a.out < input.dat

=head3 Backticks

    echo `date`
    ls -lart `find | grep FOO`

=head1 Bash

A commonly used shell (although there are many) is bash. Besides just running regular commands, it also supports setting and retrieving variables, loops, and conditionals. The man page is extensive.

=head2 Variables

    foo=Bar
    echo "my foo var = '$foo'"

We won't discuss it here, but bash supports variable mangling, e.g.

    foo=blah.stuff.bar
    echo $foo
    echo ${foo%%.*}
    echo ${foo##*.}
    echo ${foo#*.}

=head2 Loops

    for s in foo bar stff ; do echo s=$s ; done

    for s in foo bar stff
    do
      echo s=$s
    done

    for n in `seq 1 5` ; do touch /tmp/f$n.txt ; done

=head1 Text Processing

There are three powerful interpreters that can be used to filter text. The man pages for each contain a wealth of information.

=head2 sed

Useful & efficient for substitutions.

    sed s/1/AAAA/g /etc/hosts

=head2 awk

Useful for working with columns.

    awk '{print $2,$1}' /etc/hosts

=head2 perl

Useful for everything :) We'll come back to it in a moment, but here are examples that serve as replacements for many of the above commands.

    # cat
    perl -pe '' $f

    # sed s///
    perl -pe 's/1/AAAA/g' $f

    # cut/awk
    perl -lane 'print $F[1], " ", $F[0]' $f

    # grep
    perl -ne 'print if /foo/' $f

    # head
    perl -ne 'print if $. <= 10' $f

=head1 Regular Expressions

What is a regular expression (regex)? It is just a pattern describing something you want to match in a string. That pattern can be anything from very simple to very complex.

What uses them? grep/egrep, sed, vi, and perl (and other languages).

Note that there are several different "flavors" of regex depending on what's using it, but they are all more-or-less the same. We will focus on perl regex.
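
As a small illustration of those "flavors", here is a sketch using grep on a made-up scratch file (the variable names and file contents are assumptions for this example). In grep's default (basic) syntax a bare '+' is an ordinary character, while with -E (extended syntax, which is closer to what perl uses) it means "one or more".

```shell
# regex_demo is a hypothetical scratch file created just for this sketch.
regex_demo=$(mktemp)
printf 'ab\naab\nb\na+b\n' > "$regex_demo"

# Basic regex: '+' is literal, so only the line "a+b" matches.
basic_hits=$(grep 'a+b' "$regex_demo")

# Extended regex (-E): '+' means "one or more", so "ab" and "aab" match.
extended_hits=$(grep -E 'a+b' "$regex_demo")

# Escaping the '+' in extended syntax makes it literal again.
escaped_hits=$(grep -E 'a\+b' "$regex_demo")

echo "basic:    $basic_hits"
echo "extended: $extended_hits"
echo "escaped:  $escaped_hits"

# Clean up the scratch file.
rm -f "$regex_demo"
```

The same pattern text can therefore select different lines depending on the tool and its options, so it is worth checking the relevant man page when a pattern behaves unexpectedly.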

    man perlretut
    man perlre

Regular expressions can be scary at first, so we will try to look at them from a general overview:

=head2 Matching

    /a/

The slashes simply mark our pattern (note that perl can use other delimiters with the I<m> operator) and the I<a> is what we're matching, which is just the lower-case letter 'a'.

    /a*b/

This is zero or more 'a' followed by a 'b'.

    /a+b/

This is one or more 'a' followed by a 'b'.

    /a\+b/

This is literally "a+b" -- the backslash is used to escape otherwise special characters.

    /Number: \d+ Some word: \w+/

This is a string that includes a number and a word, e.g. "Blah Number: 1234 Some word: foo1bar Blah"

=head2 Substitution

Expressions can be replaced with new values using the I<s///> substitution operator:

    s/a/b/

Replaces the first 'a' with 'b'

    s/a/b/g

Replaces all 'a' with 'b'

    s/a/b/ig

Replaces all 'a' or 'A' with 'b'

    s/n=(\d+)/N($1)/

Changes "n=1234" to "N(1234)". When there are parentheses in the pattern, they are used for grouping and for capturing -- the first set of parens becomes $1, the second $2, and so on.

=head2 More Regex

This has barely scratched the surface, but we will see example usage of more regex components below.

=head1 Perl

The first place to start with command-line perl is the perlrun manpage, and looking at & copying/using one-liner examples.

    perl -e 'print "hello world\n"'

Using I<-p> to loop through a file and print each line:

    f=/tmp/datafile.txt
    perl -pe '' $f
    perl -pe 's/a/BBBBB/' $f
    perl -pe 's/a/BBBBB/g' $f

Using I<-n> to loop through a file and look at each line:

    perl -ne '' $f
    perl -ne 'print' $f
    perl -ne 'print $_' $f
    perl -ne 'print if /a/' $f
    perl -ne 'print "$.)" . $_' $f
    perl -ne 'print "$.)" . $_ if $. % 2 == 0' $f

Some things seen so far:

=over 2

=item $_

This is one of many special variables (see man perlvar) that perl has.
It is perhaps the most special because it is the "default" -- whenever you don't supply a command with something, it assumes you want to use $_

=item if(){}

Basic IF clause in perl -- similar to other languages.

=item ... if ... ;

Perl lets you short-hand simple if statements by reversing the order, which is also nice because it takes fewer lines (and no curlies) and can be more natural to read. Perl also provides I<unless>, which is simply a shortcut for I<if !>.

    print "ok" if $ok;
    print "bad" if ! $ok;
    print "bad" unless $ok;
    while( ... ){
      next unless ... ;
      last if ... ;
    }

=item $.

This is another special variable (see man perlvar) that is the current line number when reading in a file.

=back

So now we can take a closer look at this:

    perl -ne 'print if /a/' $f

And write it more explicitly in several ways to demonstrate the syntax:

    perl -ne 'print $_ if /a/' $f
    perl -ne 'if( /a/ ){ print $_ }' $f
    perl -ne 'print $_ if $_ =~ /a/' $f
    perl -ne 'print $_ unless $_ !~ /a/' $f

Here is a good time to note that the unofficial Perl motto is B<TMTOWTDI> (There's More Than One Way To Do It).

Another powerful command-line option is I<-a> to Auto-split each line into the @F array, much like cut & awk do; add I<-F> to choose the field separator.

=head1 Examples

=head2 A geometry file needs to become many files

    split --lines=30 geoms.xyz /tmp/g___
    for f in /tmp/g___* ; do
      d=`head -1 $f | sed 's/^**//'`
      mkdir -p blah/$d
      tail -n +2 $f > blah/$d/geom
    done

=head2 Rename a bunch of .tpl files, dropping the extension

    for n in `seq 1 3` ; do touch f$n.tpl ; done

    for f in *.tpl ; do mv $f ${f%.tpl} ; done
    ls *.tpl | perl -ne 'chomp;$f0=$_;s/\.tpl$//;print "mv $f0 $_\n"'
    ls *.tpl | perl -pe 's/^(.+)(\..*)/mv $1$2 $1/'

=head2 Get the first & fourth numbers from certain lines of a file

If you look at the second line, it starts with BOMD, and then numbers. I want to pick the first (-264.05765232) and the fourth number (0.00000000000) and write them to a new file. Then I want to repeat this for every data entry in the file (as you can see, one entry takes 13 lines).

    grep '^ BOMD' deMon.mol | awk '{print $3, $6}' > deMon.mol.filtered
    grep '^ BOMD' deMon.mol | perl -alne 'print "$F[2] $F[5]"' > deMon.mol.filtered
    perl -alne 'print "$F[2] $F[5]" if $F[0] eq "BOMD"' deMon.mol > deMon.mol.filtered

=head2 Get the number of days between two dates

    perl -MDate::Calc=Delta_Days -le 'print Delta_Days(2005,9,16, 2006,2,28)'

=head2 Display a web page's source

    perl -MLWP::Simple -e "print get(shift)" http://www.perlmonks.org
    wget -O - http://www.perlmonks.org
    lynx --source http://www.perlmonks.org

=head2 Get lines N -> M of a file

These examples show how to display lines 5-8, inclusive, from the /etc/passwd file:

    head -8 /etc/passwd | tail -4
    tail -n +5 /etc/passwd | head -4
    perl -ne 'print if 5<=$. && $.<=8' /etc/passwd    # man perlvar for explanation of $.
    cat -n /etc/passwd | perl -ne 'print if s/^\s*[5678]\s+//'

Now, to get the lines from /etc/passwd starting at a line with "news" in it, and stopping at a line with "ftp" in it, these all work (they are all the same except for ordering, which determines whether or not the start and/or end lines are included):

=over 1

=item [start, end]

    perl -ne '$ok||=/news/; print if $ok; $ok=0 if /ftp/' /etc/passwd

=item [start,end)

    perl -ne '$ok||=/news/; $ok=0 if /ftp/; print if $ok' /etc/passwd

=item (start,end]

    perl -ne 'print if $ok; $ok||=/news/; $ok=0 if /ftp/' /etc/passwd

=item (start,end)

    perl -ne '$ok=0 if /ftp/; print if $ok; $ok||=/news/' /etc/passwd

=back

The basic approach is to take advantage of -n (man perlrun) and flip a flag on/off at the boundaries. Note that /news/ is a regex, and can take complex patterns (man perlre).

=head2 Lazy math

    perl -le 'print( 3+5 )'    # need the parens here

There is also the 'bc' command.

=head2 Make & use a program to sum numbers

    alias add="perl -lne '\$x+=\$_; END{print \$x}'"
    cut -f1 file1 | add

=head2 Perl One-liners

=over 2

=item Favourite One-liners?

A web server!

    perl -MIO::All -e 'io(":8080")->fork->accept->(sub { $_[0] < io(-x $1 ? "./$1 |" : $1) if /^GET \/(.*) / })'

dos2unix

    perl -pi -e 's/\r//' filename

=item What one-liners do people actually use?

=item One Liners

=back

=head1 Reference Material

=over 2

=item man

Also note that the command I<apropos> (or I<man -k>) searches man pages.

=item man perl

This is basically a table of contents for the many perl man pages. Ones of particular interest are these manpages:

    perl
    perlrun
    perlsyn
    perlfunc
    perlre
    perlretut

=item perldoc

Displays documentation for everything perl.

    perldoc -f sleep
    perldoc perlfunc
    perldoc -q how
    perldoc File::Find

=item CPAN

CPAN is one of Perl's great strengths -- it is a huge repository of modules (libraries) to do pretty much anything and everything with perl.

=item Perl Monks

Perl Monks is a great Perl community site. The knowledge base of the forums, tutorials, and FAQs is very extensive and the members are very open & willing to help with questions at any level (complete beginner through guru). This talk is posted at http://perlmonks.org/?node_id=553278

=item ME!

I love to help with this stuff -- it's my vocation & hobby. I'm reachable at E<lt>dwestbrook@gmail.comE<gt> or as davidryan0 on AIM.

=back

=cut