=pod

=head1 Introduction to Text Utilities

UGA Chemistry Summer Lecture Series, June 2, 2006

=head1 Intended Audience

Chemistry summer research students (upperclassmen & graduate) doing computational tasks.

=head1 Abstract

This guide is an introduction, demonstration, and reference point that exposes Chemistry summer research students to text processing via the command-line interface and the power and efficiency it offers. The standard text utilities lead up to more advanced scripting with Perl.

=head1 Environment

Linux running the bash shell

=head1 Author & Presenter

David Westbrook, E<lt>dwestbrook@gmail.comE<gt>

=head1 Warning Label

This presentation is a crash course. The primary goal is exposure to the freely available tools and methods so that they can be learned and utilized in the future as the need arises.

=head1 The Toolbox

First we will review the basic tools that are available by default for working with and manipulating text files.

=over 4

=item man

THIS IS ONE OF THE MOST IMPORTANT COMMANDS. "man" stands for "Manual" and provides documentation for all of the commands in this document.

  man man
  man cd
  man ls

=back

=head2 Seeing files

Before we can work with a file, we have to know where it is.

=over 2

=item cd

This is used to Change Directories.

  cd /tmp
  cd ~
  cd -
  cd ..
  cd ../../foo/bar

=item pwd

This Prints the Working Directory -- i.e. tells you what directory you are in.

=item ls

LiSts files. Shows the contents of the current directory. See the man page for the many options.

  ls
  ls /tmp
  ls -lart
  ls -lart ../foo

=item locate

Searches all the filenames on the system for a given search string.
The availability of this command can be system-dependent, and there are several caveats: 1) it works off a database that is generated nightly, so files created today won't be found; 2) it respects permissions, so files that you can't read won't be found; 3) it only works on the local filesystem, so files in mounted directories won't be found.

  locate host
  locate pass
  locate etc/pa

=item find

Recursively lists all of the files in the given directory (defaults to the current directory). Many, many options in the man page. Note that the starting directory comes before the tests.

  find
  find /tmp
  find /tmp -type f
  find /tmp -maxdepth 1 -type f -mtime +6 -exec echo {} \;

=back

=head2 File information

We need to be able to obtain basic information about a file to know what we're working with.

=over 2

=item ls -l

List the details of a file. This includes the permissions, owner, group, size, and last modified date.

=item wc

Word Count. Displays the number of lines, words, and bytes in a file.

=item file

Attempts to determine the file's contents -- e.g. html or text or binary or excel, etc.

=item identify

Similar to L<file> but for graphics files. Will include size and color information. This is provided by the L<http://www.imagemagick.org> toolset.

=back

=head2 File contents

Now we can begin to work with the file's actual contents. Note that "print" means "output to the screen" in this context.

=over 2

=item cat

Just prints out the contents of each file it's given. (Same as I<type> in DOS)

  cat file1
  cat file1 file2
  cat -n file1

=item less

Shows a file one screen at a time (known as a 'pager'). (There is also a command 'more', but it has fewer features than L<less>.)

=item head

Prints out the first N lines of a file.

  head file1
  head -2 file1
  head -3 file1 file2

=item tail

Prints out the last N lines of a file (or, with I<-n +N>, everything from line N onward).

  tail file1
  tail -2 file1
  tail -3 file1 file2
  tail -n +5 file1

=item grep

Search files for a given string and print the matching lines. See the man page for many, many options.

  grep foo file1
  grep -i foo file1
  grep -l foo *
  grep -n foo file1
  grep -A3 foo file1

=item strings

Prints out the runs of printable text found in a file. Especially useful on binary files for finding the pieces of text buried in their compiled contents.

  strings a.out
  strings foo.exe
  strings /bin/ls

=item sort

Orders (i.e. sorts!!) the lines of a file. See man page for details.

  sort file1

=item uniq

Displays just the unique lines of a file. It only collapses adjacent duplicates, so the file must be sorted first.

=item cut

Print just the specified columns of a file. See man page for details.

  cut -f1,3 file1
  cut -f1,5,6 -d: file1

=item split

Split a file into chunks. See man page.

=item join

Combine two files based on a common column. See man page.

=back

=head2 File Management

These are listed for quick reference -- refer to the man pages for further details.

=over 2

=item cp

=item mv

=item rm

=item mkdir

=item rmdir

=back

=head2 Editors

=over 2

=item vi

vi (or vim) does have a bit of a learning curve, but is well worth it -- it is very powerful and is available on pretty much every *nix machine (there is gvim for Windows, too). It is best to find a reference (book or online tutorial) for the commands. Some essentials:

  :q          quits
  :q!         quits w/o saving
  :w          save
  :w!         force save
  :wq         saves and quits
  i           enter editing (insert) mode
  ESC         return to command mode
  /foo        search for foo
  :s/foo/bar  replace foo with bar

Others that you'll want to know (in no particular order):

  yy  p  dd  dw  w  :$  :1  :55
  :s/foo/bar/g  :%s/foo/bar  :%s/foo/bar/g  :5,10s/foo/bar
  n  N  :n  :N  :wn  :wN  ctrl-g  s  x

=item view

Same as L<vi> but starts it in read-only mode. It's a very good habit to use L<view> when you know you're only looking at a file so you don't accidentally change it.

=back

=head2 Miscellaneous

=over 2

=item clear

Clears the screen -- same as cls in DOS.

=item echo

Just displays its arguments to the screen (same as DOS). Quote any backslash escapes so the shell passes them through to I<echo -e>.

  echo blah
  echo path=$PATH
  echo -n foo
  echo -e 'foo\tbar\nstuff'

=item touch

Updates the last modified timestamp on a file.
If the file doesn't exist, creates a 0-byte file.

=item seq

Prints out sequences of numbers. With three arguments the order is first, increment, last. See options. Also see L</Loops> for example usage.

  seq 1 10
  seq 1 2 10

=item cal

Prints out a nicely formatted calendar.

  cal
  cal 7 2006

=item look

Prints out words from a dictionary file that start with the given string.

  look foo
  look princ

=item date

Prints out the date. See options in man page for various formats.

  date
  date +%Y-%m-%d

=item sleep

Pauses for N seconds.

  sleep 2

=item alias

Define your own commands.

  alias cls=clear

=item wget

Gets files from the web (or ftp). Extremely useful and powerful -- can mirror entire sites. See man page for lots of options.

  wget http://foo.example.com/blah.tar.gz
  wget ftp://foo.example.com/blah.tar.gz

=item curl

Another tool to get remote files (in case wget isn't available).

  curl --remote-name http://foo.example.com/blah.tar.gz

=item lynx

A text-based web browser! Useful for simple pages, testing connections, sucking down source code, converting html to text, or downloading files from HTTP or FTP sites.

=back

=head1 Combining Tools

=head2 Pipes

'|' is the "pipe" character. It is used to take the output from the left-hand side (LHS) and give/shove ("pipe") it as input to the right-hand side (RHS). Here are several example tasks that consecutively use two or more of the tools we have discussed.

=head3 Find a word that starts with "c" and has a "mel" in it.

  look c | grep mel

=head3 See if the word FOO is in the first 3 lines of a file.

  head -3 file1 | grep FOO

=head3 Take the lines that have FOO, look at just the first column, and show the unique values

  grep FOO file1 | cut -f1 | sort -u

=head3 Determine the location of a file with FOO in its name.

  find | grep FOO
  locate FOO

=head2 Redirection

The output of a command can be saved to another file.
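
As a minimal sketch of how '>' and '>>' differ, the following uses a throwaway directory from mktemp and made-up sample files (the variable names and file contents are assumptions for illustration, not part of the talk). It counts the lines left in foo_lines after overwriting and after appending.

```shell
#!/bin/sh
# Hypothetical scratch area and sample data, invented for this sketch.
workdir=$(mktemp -d)
printf 'FOO one\nbar two\nFOO three\n' > "$workdir/file1"
printf 'FOO four\n' > "$workdir/file2"

# '>' creates or truncates the target, so only the most recent output survives.
grep FOO "$workdir/file1" > "$workdir/foo_lines"
grep FOO "$workdir/file2" > "$workdir/foo_lines"
after_overwrite=$(grep -c '' "$workdir/foo_lines")   # count every line

# '>>' appends to the target, so the earlier contents are kept.
grep FOO "$workdir/file1" >> "$workdir/foo_lines"
after_append=$(grep -c '' "$workdir/foo_lines")

echo "lines after overwrite: $after_overwrite"
echo "lines after append:    $after_append"

rm -r "$workdir"
```

The second '>' silently discards the two lines written by the first one, which is the usual way results get lost; use '>>' when collecting output from several commands into one file.
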

=head3 Output

  grep FOO file1 > foo_lines

=head3 Append Output

  grep FOO file1 >> foo_lines
  grep FOO file2 >> foo_lines

=head3 Input

  grep FOO file1
  cat file1 | grep FOO
  grep FOO < file1
  a.out < input.dat

=head3 Backticks

  echo `date`
  ls -lart `find | grep FOO`

=head1 Bash

A commonly used shell (although there are many) is bash. Besides just running regular commands, it also supports setting/retrieving variables, loops, and conditionals. The man page is extensive.

=head2 Variables

  foo=Bar
  echo "my foo var = '$foo'"

We won't discuss it here, but bash supports variable mangling (parameter expansion), e.g.

  foo=blah.stuff.bar
  echo $foo         # blah.stuff.bar
  echo ${foo%%.*}   # blah
  echo ${foo##*.}   # bar
  echo ${foo#*.}    # stuff.bar

=head2 Loops

  for s in foo bar stff ; do echo s=$s ; done

  for s in foo bar stff
  do
    echo s=$s
  done

  for n in `seq 1 5` ; do touch /tmp/f$n.txt ; done

=head1 Text Processing

There are three powerful interpreters that can be used to filter text. The man pages for each contain a wealth of information.

=head2 sed

Useful & efficient for substitutions.

  sed s/1/AAAA/g /etc/hosts

=head2 awk

Useful for working with columns.

  awk '{print $2,$1}' /etc/hosts

=head2 perl

Useful for everything :) We'll come back to it in a moment, but here are examples that serve as replacements for many of the above commands.

  # cat
  perl -pe '' $f

  # sed s///
  perl -pe 's/1/AAAA/g' $f

  # cut/awk
  perl -lane 'print "$F[1] $F[0]"' $f

  # grep
  perl -ne 'print if /foo/' $f

  # head
  perl -ne 'print if $. <= 10' $f

=head1 Regular Expressions

What is a regular expression (regex)? It is just a pattern of something you want to match in a string. And that pattern can be anything, simple or very complex.

What uses them? grep/egrep, sed, vi, and perl (and other languages). Note that there are several different "flavors" of regex depending on what's using it, but they are all more-or-less the same. We will focus on perl regex.
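
To make the point about flavors concrete, here is a small sketch using grep alone (the three sample lines are invented for illustration). The same pattern is counted once with grep's default basic flavor and once with the extended flavor that egrep uses.

```shell
#!/bin/sh
# Invented sample input: three short lines.
sample=$(printf 'ab\naab\na+b\n')

# Basic flavor: a bare '+' is an ordinary character, so only "a+b" matches.
basic_hits=$(printf '%s\n' "$sample" | grep -c 'a+b')

# Extended flavor (-E): '+' means "one or more", so "ab" and "aab" match.
extended_hits=$(printf '%s\n' "$sample" | grep -E -c 'a+b')

echo "basic: $basic_hits  extended: $extended_hits"
```

Perl treats '+' the way the extended flavor does, so patterns written for perl usually carry over to grep -E more directly than to plain grep.
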

  man perlretut
  man perlre

Regular expressions can be scary at first, so we will try to look at them from a general overview:

=head2 Matching

  /a/

The I</>'s simply mark our pattern (note that perl can use anything for the delimiters with the I<m//> operator, e.g. I<m#a#>, I<m!a!>) and the I<a> is what we're matching, which is just the lower-case letter 'a'.

  /a*b/

This is 0 or more 'a' followed by a 'b'.

  /a+b/

This is one or more 'a' followed by a 'b'.

  /a\+b/

This is literally "a+b" -- the backslash is used to escape otherwise special characters.

  /Number: \d+ Some word: \w+/

This is a string that includes a number and a word, e.g. "Blah Number: 1234 Some word: foo1bar Blah"

=head2 Substitution

Expressions can be replaced with new values using the I<s///> substitution operator:

  s/a/b/

Replaces the first 'a' with 'b'

  s/a/b/g

Replaces all 'a' with 'b'

  s/a/b/ig

Replaces all 'a' or 'A' with 'b'

  s/n=(\d+)/N($1)/

Changes "n=1234" to "N(1234)". When there are parentheses in the pattern, they are used for grouping and for capturing -- the first set of parens becomes $1, the second $2, and so on.

=head2 More Regex

This has barely scratched the surface, but we will see example usage of more regex components below.

=head1 Perl

The first place to start with command-line perl is the perlrun manpage, and looking at & copying/using one-liner examples.

  perl -e 'print "hello world\n"'

Using I<-p> to loop through a file and print each line:

  f=/tmp/datafile.txt
  perl -pe '' $f
  perl -pe 's/a/BBBBB/' $f
  perl -pe 's/a/BBBBB/g' $f

Using I<-n> to loop through a file and look at each line:

  perl -ne '' $f
  perl -ne 'print' $f
  perl -ne 'print $_' $f
  perl -ne 'print if /a/' $f
  perl -ne 'print "$.)" . $_' $f
  perl -ne 'print "$.)" . $_ if $. % 2 == 0' $f

Some things seen so far:

=over 2

=item $_

This is one of many special variables (see man perlvar) that perl has.
It is perhaps the most special because it is the "default" -- whenever you don't supply a command with something, it assumes you want to use $_

=item if(){}

Basic IF clause in perl -- similar to other languages. I<if( ... ){ ... }elsif( ... ){ ... }else{ ... }>

=item ... if ... ;

Perl lets you short-hand simple if statements by reversing the order, which is also nice because it's fewer lines (and no curlies) and can be more natural to read. Perl also provides I<unless>, which is simply a shortcut for I<if(!( ... ))>

  print "ok" if $ok;
  print "bad" if ! $ok;
  print "bad" unless $ok;
  while( ... ){
    next unless ... ;
    last if ... ;
  }

=item $.

This is another special variable (see man perlvar) that is the current line number when reading in a file.

=back

So now we can take a closer look at this:

  perl -ne 'print if /a/' $f

And write it more explicitly in several ways to demonstrate the syntax:

  perl -ne 'print $_ if /a/' $f
  perl -ne 'if( /a/ ){ print $_ }' $f
  perl -ne 'print $_ if $_ =~ /a/' $f
  perl -ne 'print $_ unless $_ !~ /a/' $f

Here is a good time to note that the unofficial Perl motto is B<TMTOWTDI> (There's More Than One Way To Do It).

Another powerful command-line option is I<-a> to Auto-split each line into the array @F, much like cut & awk do. Add I<-F> to choose the delimiter (the default is whitespace):

  perl -F: -lane 'print $F[0]' /etc/passwd

=head1 Examples

=head2 A geometry file needs to become many files

  # break the file into 30-line chunks named /tmp/g___aa, /tmp/g___ab, ...
  split --lines=30 geoms.xyz /tmp/g___
  for f in /tmp/g___* ; do
    # directory name = first line of the chunk, minus any leading asterisks
    d=`head -n 1 "$f" | sed 's/^\**//'`
    mkdir -p "blah/$d"
    # everything from line 2 onward is the geometry itself
    tail -n +2 "$f" > "blah/$d/geom"
  done

=head2 Rename a bunch of .tpl files, dropping the extension

  # make some test files
  for n in `seq 1 3` ; do touch f$n.tpl ; done

  # rename directly in bash
  for f in *.tpl ; do mv $f ${f%.tpl} ; done

  # or have perl print the mv commands (pipe the output to sh to run them)
  ls *.tpl | perl -ne 'chomp;$f0=$_;s/\.tpl$//;print "mv $f0 $_\n"'
  ls *.tpl | perl -pe 's/^(.+)(\..*)/mv $1$2 $1/'

=head2 Get the first & fourth numbers from certain lines of a file

If you look at the second line, it starts with BOMD, and then numbers. I want to pick the first number (-264.05765232) and the fourth number (0.00000000000) and write them to a new file.
Then I want to repeat this for every data entry in the file (as you can see, one entry takes 13 lines).

  grep '^ BOMD' deMon.mol | awk '{print $3, $6}' > deMon.mol.filtered
  grep '^ BOMD' deMon.mol | perl -alne 'print "$F[2] $F[5]"' > deMon.mol.filtered
  perl -alne 'print "$F[2] $F[5]" if $F[0] eq "BOMD"' deMon.mol > deMon.mol.filtered

=head2 Get the number of days between two dates

  perl -MDate::Calc=Delta_Days -le 'print Delta_Days(2005,9,16, 2006,2,28)'

=head2 Display a web page's source

  perl -MLWP::Simple -e "print get(shift)" http://www.perlmonks.org
  wget -O - http://www.perlmonks.org
  lynx --source http://www.perlmonks.org

=head2 Get lines N -> M of a file

These examples show how to display lines 5-8, inclusive, from the /etc/passwd file:

  head -8 /etc/passwd | tail -4
  tail -n +5 /etc/passwd | head -4
  perl -ne 'print if 5<=$. && $.<=8' /etc/passwd
    # man perlvar for explanation of $.
  cat -n /etc/passwd | perl -ne 'print if s/^\s*[5678]\s+//'

Now, to get the lines from /etc/passwd starting at a line with "news" in it and stopping at a line with "ftp" in it, these all work (they are the same except for the ordering, which determines whether or not the start and/or end lines are included):

=over 1

=item [start,end]

  perl -ne '$ok||=/news/; print if $ok; $ok=0 if /ftp/' /etc/passwd

=item [start,end)

  perl -ne '$ok||=/news/; $ok=0 if /ftp/; print if $ok' /etc/passwd

=item (start,end]

  perl -ne 'print if $ok; $ok||=/news/; $ok=0 if /ftp/' /etc/passwd

=item (start,end)

  perl -ne '$ok=0 if /ftp/; print if $ok; $ok||=/news/' /etc/passwd

=back

The basic approach is to take advantage of -n (man perlrun) and flip a flag on/off at the boundaries. Note that the /news/ is a regex, and can take complex patterns (man perlre).

=head2 Lazy math

  perl -le 'print( 3+5 )'
    # keep the parens around the whole expression: print (3+5)*2 would print 8

There is also the 'bc' command.

=head2 Make & use a program to sum numbers

  alias add="perl -lne '\$x+=\$_; END{print \$x}'"
  cut -f1 file1 | add

=head2 Perl One-liners

=over 2

=item Favourite One-liners?

L<http://perlmonks.org/?node_id=470397>

A web server!

  perl -MIO::All -e 'io(":8080")->fork->accept->(sub { $_[0] < io(-x $1 ? "./$1 |" : $1) if /^GET \/(.*) / })'

dos2unix

  perl -pi -e 's/\r//' filename

=item What one-liners do people actually use?

L<http://perlmonks.org/?node_id=515336>

=item One Liners

L<http://perlmonks.org/?node_id=421195>

=back

=head1 Reference Material

=over 2

=item man

Also note that the command I<apropos> searches man pages.

=item man perl

This is basically a table of contents for the many perl man pages. These manpages are of particular interest:

  perl perlrun perlsyn perlfunc perlre perlretut

=item perldoc

Displays documentation for everything perl.

  perldoc -f sleep
  perldoc perlfunc
  perldoc -q how
  perldoc File::Find

=item CPAN

L<http://search.cpan.org> is one of Perl's great strengths -- it is a huge repository of modules (libraries) to do pretty much anything and everything with perl.

=item Perl Monks

L<http://perlmonks.org> is a great Perl community site. The knowledge base of the forums, tutorials, and FAQs is very extensive and the members are very open & willing to help with a question at any level (complete beginner through guru). This talk is posted at http://perlmonks.org/?node_id=553278

=item ME!

I love to help with this stuff -- it's my vocation & hobby. I'm reachable at E<lt>dwestbrook@gmail.comE<gt> or as davidryan0 on AIM.

=back

=cut