On my to-do list is to write a Google Desktop Search plug-in (probably in Perl) that hands Google only the searchable words, or only the unique searchable words, from text files (unless the text file is short enough not to require such processing) and logs when that still leaves parts of the text file unindexed. Then I'd override many of Google's "filters" to use this one instead.
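
A minimal sketch of that filtering idea, in Python rather than Perl just to show the logic. Everything here is an assumption for illustration: the name `words_for_indexing`, the logger name, treating a "searchable word" as a run of word characters (`\w+`), and comparing words case-insensitively when de-duplicating. It does not touch the real GDS plug-in interface at all; the only number taken from this post is the 5000-word limit.

```python
import logging
import re

log = logging.getLogger("gds_word_filter")  # hypothetical logger name

WORD_LIMIT = 5000  # the per-file limit described in this post
WORD_RE = re.compile(r"\w+")  # assumption: a searchable word is a run of word characters


def words_for_indexing(text, limit=WORD_LIMIT, name="<unnamed>"):
    """Return (words to hand to the indexer, count of unique words left out)."""
    # Step 1: drop punctuation, keep the words in their original order.
    words = WORD_RE.findall(text)

    # Step 2: only if that is still too long, fall back to unique words,
    # keeping the first occurrence of each (assumed case-insensitive).
    if len(words) > limit:
        seen = set()
        unique = []
        for word in words:
            key = word.lower()
            if key not in seen:
                seen.add(key)
                unique.append(word)
        words = unique

    # Step 3: whatever still does not fit gets logged instead of silently lost.
    kept = words[:limit]
    dropped = len(words) - len(kept)
    if dropped:
        log.warning("%s: %d unique words left unindexed", name, dropped)
    return kept, dropped
```

For example, `words_for_indexing("#### foo bar foo ####", limit=10)` returns `(['foo', 'bar', 'foo'], 0)`: the punctuation is gone, and because the word list already fits, the repeats stay in place so "adjacent words" searches would still work for that file.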

GDS has some unfortunate design problems (last I checked and as near as I can tell -- I haven't checked out GDS v2 because I could find absolutely no mention of fixes for these problems and it sounds like just a lots-of-flash, Microsoft-imitating interface do-over that I'd probably hate):

  1. It will only index 5000 "words" per file
  2. Repeated words count against this total
  3. Punctuation counts against this total even though you can't search on punctuation
  4. 80 punctuation characters in a row count as 80 words! (but you can't search on any of them)
  5. The tool gives you no way to check which files have been indexed and makes no mention of the fact that it indexed only the first 2% or 5% or whatever of tons of your files
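
To see how quickly those rules eat the budget, here is a rough model of the counting as described in the list above. It is only a model built from this post's description, not GDS's actual tokenizer: the assumptions are that every run of word characters costs one slot, every single punctuation character costs one slot, and whitespace is free. The function name `searchable_coverage` is made up for the example.

```python
import re

# Assumed token model: a run of word characters, or one punctuation character.
TOKEN_RE = re.compile(r"(\w+)|[^\w\s]")


def searchable_coverage(text, limit=5000):
    """Return (searchable words inside the limit, searchable words in the file)."""
    indexed = total = 0
    for position, match in enumerate(TOKEN_RE.finditer(text)):
        if match.group(1) is None:
            continue  # punctuation uses up a slot but can never be searched for
        total += 1
        if position < limit:
            indexed += 1
    return indexed, total


# One 80-character speed-bump comment followed by two real words.
sample = "#" * 80 + "\nfoo bar\n"
print(searchable_coverage(sample, limit=80))
```

With the limit scaled down to 80 for the demonstration, the comment line alone uses every slot, so the result is `(0, 2)`: two searchable words in the file, none of them indexed.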

Note that I have many files with well under 5000 unique "words" (where here "words" means things that GDS will actually let me search for) of which GDS silently indexed only the first tiny fraction, in part because they contained chunks of punctuation characters (I dislike speed-bump comments, but they were enshrined in the company coding standard before I arrived). It was a long and frustrating task to figure out that this was the problem.

But I'm sure that if I only had a PhD or two, I'd understand why this design is actually superior to one that, I don't know, indexes most of the words of files such that you can search for them or at least tells you when its indexing of a file missed the vast majority of its content.

(Yes, I do understand that most people have limited disk space, that sometimes a hard upper bound is a necessary evil, and that there is some validity to the Microsoft^WGoogle mindset of not showing people too much information because it confuses many of them. But it appears that Google felt it much more important to give me the ability to look at "cached" copies of the first few kilobytes of every previous version of a file than to let me search beyond the first few kilobytes. I also understand that resorting to only unique words will mean that searches for "adjacent words" won't work if those two words weren't adjacent the first time they appear in the document. Silly me, I find being able to search the entire content of larger files w/o "adjacent words" always working to be far superior to being able to use "adjacent words" and other searches over 100% of the first 2% of the file.) Yes, I'm bitter; thanks for noticing. (:

- tye        


In reply to Re: Write Google Desktop Search plug-ins in Perl. ("words") by tye
in thread Write Google Desktop Search plug-ins in Perl. by techcode
