They use a program called a spider that looks for a file called "robots.txt" in the root directory of the specified (registered) domain. So they try to fetch "http://www.perlmonks.org/robots.txt". This file specifies what the spider may read and which directories it may search. Even then, though, the file is only a convention, and a spider is not forced to obey it.
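As a minimal sketch of that idea (assuming the CPAN modules LWP::Simple and WWW::RobotRules are installed; the spider name and target page below are just made-up examples), a well-behaved spider would do something like this before fetching a page:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use WWW::RobotRules;

    # hypothetical spider identity and target page
    my $agent = 'MySpider/1.0';
    my $page  = 'http://www.perlmonks.org/index.pl';

    my $rules = WWW::RobotRules->new($agent);

    # fetch robots.txt from the root of the domain and parse it
    my $robots_url = 'http://www.perlmonks.org/robots.txt';
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    # only fetch the page if robots.txt allows it for this agent
    if ($rules->allowed($page)) {
        my $content = get($page);
        print "fetched ", length($content // ''), " bytes\n";
    } else {
        print "robots.txt disallows $page for $agent\n";
    }

(LWP::RobotUA wraps this same check into a normal user agent, if you want it done for you automatically.)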
You can read more about this in the help sections of these search engines, which describe how such a file must and can look and how it works.