
On my to-do list is to write a Google Desktop Search plug-in (probably in Perl) that hands Google only the searchable words, or only the unique searchable words, from text files (unless the text file is short enough not to need such processing) and that logs when this still leaves parts of a text file unindexed. I'd then override many of Google's "filters" to use this one instead.
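A rough sketch of the word-reduction step, just to make the idea concrete. This is not a GDS plug-in and touches no GDS API; the sub names are made up for illustration, "searchable" is assumed to mean a run of \w characters, and uniqueness is assumed to be case-insensitive:

```perl
use strict;
use warnings;

# All searchable words in $text, in order. "Searchable" is an assumption:
# here it means a run of word characters (\w+).
sub searchable_words {
    my ($text) = @_;
    return $text =~ /\w+/g;
}

# Choose what to hand to the indexer: every searchable word if they fit
# under $limit, otherwise only the unique ones (in first-appearance order).
# Returns a reference to the kept words plus the number of unique words
# that still did not fit.
sub words_to_index {
    my ($text, $limit) = @_;
    my @words = searchable_words($text);
    return (\@words, 0) if @words <= $limit;

    my %seen;
    my @unique = grep { !$seen{lc $_}++ } @words;   # case-insensitive (assumed)
    my $dropped = @unique > $limit ? @unique - $limit : 0;
    splice @unique, $limit if $dropped;             # cut off what won't fit
    return (\@unique, $dropped);
}

my ($keep, $dropped) = words_to_index("foo ##### bar foo baz", 3);
print "@$keep\n";
warn "left $dropped unique words unindexed\n" if $dropped;
```

The second return value is what would drive the logging: anything above zero means part of the file is still going unindexed.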

GDS has some unfortunate design problems (last I checked and as near as I can tell -- I haven't checked out GDS v2 because I could find absolutely no mention of improvements in these problems and it sounds like just a lots-of-flash, Microsoft-imitating interface do-over that I'd probably hate):

It will only index 5000 "words" per file
Repeated words count against this total
Punctuation counts against this total even though you can't search on punctuation
80 punctuation characters in a row counts as 80 words! (but you can't search on any of them)
The tool gives you no way to check which files have been indexed and makes no mention of the fact that it only indexed the first 2% or 5% or whatever of tons of your files
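Since the tool won't tell you, here is a sketch for estimating how much of a file survives the cap. It only approximates the counting rules listed above (each run of \w characters is one "word", each punctuation character is one "word"); GDS's real tokenizer may well differ, and `indexed_fraction` is a made-up name:

```perl
use strict;
use warnings;

# Estimate the fraction of a text's searchable words that fall within the
# first $limit "words", counting the way the list above describes.
# This is an approximation, not GDS's actual tokenizer.
sub indexed_fraction {
    my ($text, $limit) = @_;
    my @tokens     = $text =~ /\w+|[^\w\s]/g;   # words and single punctuation chars
    my @searchable = grep { /\w/ } @tokens;
    return 1 unless @searchable;                # nothing searchable to lose

    my $last    = $#tokens < $limit - 1 ? $#tokens : $limit - 1;
    my $reached = grep { /\w/ } @tokens[0 .. $last];
    return $reached / @searchable;
}

# 80 punctuation characters in a row, then three real words.
my $text = ("#" x 80) . "\nfoo bar baz\n";
printf "%.0f%% of searchable words indexed\n",
    100 * indexed_fraction($text, 81);
```

With a cap of 81, the 80 `#` characters eat 80 of the slots and only `foo` gets in, so this reports 33%.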

Note that I have many files with well under 5000 unique "words" (where "words" here means things that GDS will actually let me search for) of which GDS silently bothered to index only the first tiny fraction, in part because they contained chunks of punctuation characters (I dislike speed-bump comments, but they were enshrined in the company coding standard before I arrived). It was a long and frustrating task to figure out that this was the problem.

But I'm sure that if I only had a PhD or two, I'd understand why this design is actually superior to one that, I don't know, indexes most of the words of files such that you can search for them or at least tells you when its indexing of a file missed the vast majority of its content.

(Yes, I do understand that most people have limited disk space, that sometimes a hard upper bound is a necessary evil, and that there is some validity to the Microsoft^WGoogle mindset of not showing people too much information because it confuses many of them. But it appears that Google felt it much more important to give me the ability to look at "cached" copies of the first few kilobytes of every previous version of a file than to let me search beyond the first few kilobytes. I also understand that resorting to only unique words will mean that searches for "adjacent words" won't work if those two words weren't adjacent the first time they appear in the document. Silly me, I find being able to search the entire content of larger files without "adjacent words" always working to be far superior to being able to use "adjacent words" and other searches over 100% of the first 2% of the file.) Yes, I'm bitter; thanks for noticing. (:

- tye

In reply to Re: Write Google Desktop Search plug-ins in Perl. ("words") by tye
in thread Write Google Desktop Search plug-ins in Perl. by techcode
