PerlMonks  

Re: Write Google Desktop Search plug-ins in Perl. ("words")

by tye (Sage)
on Sep 08, 2005 at 20:27 UTC ( [id://490310] )


in reply to Write Google Desktop Search plug-ins in Perl.

On my to-do list is to write a google-desktop-search plug-in (probably in Perl) that hands to google only searchable words or only the unique searchable words from text files (unless the text file is short enough to not require such processing) and to log when that leaves parts of the text file unindexed. And then override many of Google's "filters" to use this one instead.
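The filtering step described above might look something like this in Perl. This is a hypothetical sketch, not a real GDS plug-in: the 5000-word limit is the one discussed in this node, but the subroutine names, the definition of a "searchable" word (alphanumeric runs), and the logging are all my assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $LIMIT = 5000;    # GDS's per-file "word" budget, per the discussion above

# Reduce text to its unique searchable words, in order of first
# appearance, so none of the budget is spent on repeats or punctuation.
sub unique_searchable_words {
    my ($text) = @_;
    my (@words, %seen);
    while ($text =~ /([A-Za-z0-9_]+)/g) {   # punctuation isn't searchable anyway
        my $w = lc $1;
        push @words, $w unless $seen{$w}++;
    }
    return @words;
}

# Produce the text to hand to the indexer, logging when even the
# de-duplicated word list still exceeds the limit.
sub index_text {
    my ($text) = @_;
    my @uniq = unique_searchable_words($text);
    if (@uniq > $LIMIT) {
        warn sprintf "only %d of %d unique words will be indexed\n",
            $LIMIT, scalar @uniq;
        $#uniq = $LIMIT - 1;    # truncate to the budget
    }
    return join ' ', @uniq;
}
```

A file whose unique-word count fits under the limit would then be fully searchable, at the cost of "adjacent words" queries, as discussed below.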

GDS has some unfortunate design problems (last I checked and as near as I can tell -- I haven't checked out GDS v2 because I could find absolutely no mention of improvements in these problems and it sounds like just a lots-of-flash, Microsoft-imitating interface do-over that I'd probably hate):

  1. It will only index 5000 "words" per file
  2. Repeated words count against this total
  3. Punctuation counts against this total even though you can't search on punctuation
  4. 80 punctuation characters in a row counts as 80 words! (but you can't search on any of them)
  5. The tool gives you no way to check which files have been indexed and makes no mention of the fact that it only indexed the first 2% or 5% or whatever of tons of your files
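To make items 2 through 4 concrete, here is a small demonstration of how a "speed-bump" comment block burns the budget. The GDS-style count is a guess at a tokenization consistent with the behavior listed above (each punctuation character in a run counts as a "word"), not Google's actual code; only the contrast between the two counts matters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A typical speed-bump comment: two 80-character rules around one line.
my $chunk = ('#' x 80) . "\n"
          . "# important_function does the real work\n"
          . ('#' x 80) . "\n";

# Guessed GDS-style count: alphanumeric runs, plus every individual
# punctuation character, each count against the limit.
my $gds_count  = () = $chunk =~ /[A-Za-z0-9_]+|[^\sA-Za-z0-9_]/g;

# What you can actually search for: alphanumeric runs only.
my $searchable = () = $chunk =~ /[A-Za-z0-9_]+/g;

printf "counted against limit: %d, actually searchable: %d\n",
    $gds_count, $searchable;
# -> counted against limit: 166, actually searchable: 5
```

Under this counting, one three-line comment block consumes over 3% of the 5000-word budget while contributing five searchable words.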

Note that I have many files that have well under 5000 unique "words" (where here "words" means things that GDS will actually let me search for) that GDS silently only bothered to index the first tiny fraction of, in part, because they contained chunks of punctuation characters (I dislike speed-bump comments, but they were enshrined in the company coding standard before I arrived). It was a long and frustrating task to figure out that this was the problem.

But I'm sure that if I only had a PhD or two, I'd understand why this design is actually superior to one that, I don't know, indexes most of the words of files such that you can search for them or at least tells you when its indexing of a file missed the vast majority of its content.

(Yes, I do understand that most people have limited disk space, that sometimes a hard upper bound is a necessary evil, and that there is some validity to the Microsoft^WGoogle mindset of not showing people too much information because it confuses many of them. But it appears that Google felt it much more important to give me the ability to look at "cached" copies of the first few kilobytes of every previous version of a file over letting me search beyond the first few kilobytes. I also understand that resorting to only unique words will mean that searches for "adjacent words" won't work if those two words weren't adjacent the first time they appear in the document. Silly me, I find being able to search the entire content of larger files w/o "adjacent words" always working to be far superior to being able to use "adjacent words" and other searches over 100% of the first 2% of the file.) Yes, I'm bitter; thanks for noticing. (:

- tye        


Replies are listed 'Best First'.
Re^2: Write Google Desktop Search plug-ins in Perl. ("words")
by zby (Vicar) on Sep 09, 2005 at 19:55 UTC
    Interesting. Is it likely that the web google indexer has similar limitations?
