PerlMonks  

Searching local files

by tomazos (Deacon)
on Jan 13, 2006 at 13:34 UTC ( [id://522975]=perlquestion )

tomazos has asked for the wisdom of the Perl Monks concerning the following question:

I have a directory containing 50,000 text files for a total of 1.5GB of data.

I want to quickly grep it for a Perl regular expression, and get a list of occurrences by file name and line number.

I could just:

use strict;
use warnings;
use File::Find;

my $regexp = shift @ARGV;

sub process_file {
    return unless -f $_;
    # File::Find chdirs into each directory, so open the basename in $_
    open my $fh, '<', $_ or return;
    while (<$fh>) {
        # $. holds the current (1-based) line number of $fh
        print "$File::Find::name $.\n" if /$regexp/;
    }
    close $fh;
}

find(\&process_file, @ARGV);

(or indeed just use unix grep)

...but speed is important and disk space is cheap.

My vision is to somehow run a process (overnight) that creates a large index of regexp-beginning vs files. (Possibly even 30GB+ if it would be useful.)

The index could cache the results for all possible starting prefixes of every regexp, to some depth limited by the index size. This would allow the search algorithm to skip some of the work.
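One rough way to picture the prefix idea, as a sketch only: index every fixed-length substring of every line, pull a literal prefix out of the regexp with a deliberately naive scan, and run the real regexp only on files the index says contain that prefix. Everything here is an assumption for illustration: the names `literal_prefix`, `build_index` and `candidates`, the depth of 3, the in-memory hash standing in for an on-disk index, and the restriction to case-sensitive patterns matched one line at a time.

```perl
use strict;
use warnings;

my $DEPTH = 3;    # hypothetical prefix depth; deeper means a bigger index

# Return the leading run of literal characters of a regexp source string.
# Naive on purpose: gives up ('') on any alternation, stops at the first
# metacharacter, and drops the last literal if it may match zero times.
sub literal_prefix {
    my ($re) = @_;
    return '' if $re =~ /\|/;
    my ($lit) = $re =~ /\A([A-Za-z0-9 _]*)/;
    chop $lit if length($lit) and substr($re, length($lit)) =~ /\A[*?{]/;
    return $lit;
}

# Map every $DEPTH-character substring of every line to the documents
# that contain it. Takes name => text pairs instead of reading files.
sub build_index {
    my (%docs) = @_;
    my %index;
    for my $name (keys %docs) {
        for my $line (split /\n/, $docs{$name}) {
            for my $i (0 .. length($line) - $DEPTH) {
                $index{ substr($line, $i, $DEPTH) }{$name} = 1;
            }
        }
    }
    return \%index;
}

# Documents that could match $re; falls back to all of @all when the
# literal prefix is shorter than the index depth.
sub candidates {
    my ($index, $re, @all) = @_;
    my $lit = literal_prefix($re);
    return sort @all if length($lit) < $DEPTH;
    return sort keys %{ $index->{ substr($lit, 0, $DEPTH) } || {} };
}
```

The index never decides a match by itself; it only narrows the list of files that still need the full regexp run over them, so a regexp with no usable prefix simply costs a full scan as before.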

I realize search is a hard problem, but has anyone got any ideas about:

Architecture? I'd have to reimplement or modify Perl's regexp parser to do this, right? Is there no way to tokenize a regexp to get a prefix?

CPAN modules that might be useful?

Caching algorithm for the index?

Alternative solutions (open source only)? Similar projects?

General feedback about the idea?

-Andrew.

Replies are listed 'Best First'.
Re: Searching local files
by holli (Abbot) on Jan 13, 2006 at 17:15 UTC
    There is, how surprising ;-), something readymade on CPAN (MyConText) that does exactly what the above posts suggest:
    MyConText is a pure man's solution for indexing contents of documents. It uses the MySQL database to store the information about words and documents and provides Perl interface for indexing new documents, making changes and searching for matches. For MyConText, a document is nearly anything -- Perl scalar, file, Web document, database field.
    It's as simple as:

    use MyConText;
    use DBI;

    # connect to database (regular DBI)
    my $dbh = DBI->connect('dbi:mysql:database', 'user', 'passwd');

    # create a new index
    my $ctx = MyConText->create($dbh, 'indexname',
        'frontend' => 'file', 'backend' => 'phrase');

    # or open existing one
    # my $ctx = MyConText->open($dbh, 'indexname');

    # index documents
    $ctx->index_document('/path/to/file');

    # search for matches, finds documents that contain "anybody", "somebody", "nobody", etc.
    my @documents = $ctx->contains('%body');

    Alternatives:


    holli, /regexed monk/
Re: Searching local files
by NetWallah (Canon) on Jan 13, 2006 at 16:35 UTC
    Agree with Sioln's answer, and would like to make it more "relational":

    I would make 3 tables :
    FILE (Contains ID and File Name, and possibly # of lines, or some indicator of size);
    REGEX (Contains an ID, the Regex)
    OCCURRENCE (Contains Regex_ID, File_ID, Line number).

    You can pick a regex from the REGEX table, and find all occurrences from the OCCURRENCE table.

    Obviously, run a batch process to populate the tables.

    Consider SQLite, if you want a fast, very lightweight, free database.
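The three-table layout could be sketched with DBI roughly as follows. This is an illustration under assumptions, not a tested design: it assumes DBD::SQLite is installed, the table and column names (`files`, `regexes`, `occurrences`) and the helper names are made up here, and the batch step takes file contents as a string rather than walking a directory. Note that it can only answer regexes that were recorded ahead of time.

```perl
use strict;
use warnings;
use DBI;    # assumes the DBD::SQLite driver is available

# Create the three tables (hypothetical names).
sub create_schema {
    my ($dbh) = @_;
    $dbh->do($_) for (
        'CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT UNIQUE, line_count INTEGER)',
        'CREATE TABLE regexes (id INTEGER PRIMARY KEY, pattern TEXT UNIQUE)',
        'CREATE TABLE occurrences (regex_id INTEGER, file_id INTEGER, line_number INTEGER)',
    );
}

# Batch step: store every line of $text (the contents of $name) that
# matches $pattern. Calling it twice for the same pair duplicates rows.
sub record_matches {
    my ($dbh, $pattern, $name, $text) = @_;
    my @lines = split /\n/, $text;
    $dbh->do('INSERT OR IGNORE INTO regexes (pattern) VALUES (?)', undef, $pattern);
    $dbh->do('INSERT OR IGNORE INTO files (name, line_count) VALUES (?, ?)',
        undef, $name, scalar @lines);
    my ($rid) = $dbh->selectrow_array(
        'SELECT id FROM regexes WHERE pattern = ?', undef, $pattern);
    my ($fid) = $dbh->selectrow_array(
        'SELECT id FROM files WHERE name = ?', undef, $name);
    my $ins = $dbh->prepare('INSERT INTO occurrences VALUES (?, ?, ?)');
    my $re  = qr/$pattern/;
    for my $i (0 .. $#lines) {
        $ins->execute($rid, $fid, $i + 1) if $lines[$i] =~ $re;
    }
}

# Query step: [file name, line number] pairs stored for a known regex.
sub lookup {
    my ($dbh, $pattern) = @_;
    return $dbh->selectall_arrayref(
        'SELECT f.name, o.line_number
           FROM occurrences o
           JOIN files f   ON f.id = o.file_id
           JOIN regexes r ON r.id = o.regex_id
          WHERE r.pattern = ?
          ORDER BY f.name, o.line_number',
        undef, $pattern);
}
```

A connection such as `DBI->connect('dbi:SQLite:dbname=index.db', '', '', { RaiseError => 1 })` would be passed in as `$dbh`; the file name `index.db` is only a placeholder.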

         You're just jealous cause the voices are only talking to me.

         No trees were killed in the sending of this message.    However, a large number of electrons were terribly inconvenienced.

Re: Searching local files
by Sioln (Sexton) on Jan 13, 2006 at 13:50 UTC

    Sorry, I don't truly understand the question.

    But the answer is the same :) Use a DB, build an index on the files' content and use SQL LIKE to extract.

    HINT:
    TABLE STRUCTURE:
    ID(PRIMARY KEY,LONG)
    FILE_NAME(TEXT)
    STR_NUMBER(LONG) # - LINE NUMBER IN FILE
    STR_DATA(TEXT) # - TEXT OF LINE No. STR_NUMBER FROM FILE_NAME

    AND SQL: SELECT FILE_NAME, STR_NUMBER FROM TABLE WHERE (place UCASE here if DB is case-sensitive) STR_DATA LIKE '%REGEXP_HERE%'
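For what it's worth, the one-row-per-line table might look like this through DBI. It is a minimal sketch: it assumes DBD::SQLite, and the table name `file_lines` and the helper names are invented for illustration. Keep in mind that LIKE only understands the `%` and `_` wildcards, so this finds substrings rather than true regexps, and that SQLite's LIKE ignores ASCII case by default.

```perl
use strict;
use warnings;
use DBI;    # assumes the DBD::SQLite driver is available

# One row per line of every indexed file; column names follow the post.
sub create_lines_table {
    my ($dbh) = @_;
    $dbh->do('CREATE TABLE file_lines (
        id INTEGER PRIMARY KEY,
        file_name TEXT,
        str_number INTEGER,
        str_data TEXT)');
}

# Store each line of $text under $file_name with a 1-based line number.
sub index_text {
    my ($dbh, $file_name, $text) = @_;
    my $ins = $dbh->prepare(
        'INSERT INTO file_lines (file_name, str_number, str_data) VALUES (?, ?, ?)');
    my $n = 0;
    $ins->execute($file_name, ++$n, $_) for split /\n/, $text;
}

# Substring search via LIKE; a % or _ inside $needle acts as a wildcard.
sub find_substring {
    my ($dbh, $needle) = @_;
    return $dbh->selectall_arrayref(
        'SELECT file_name, str_number FROM file_lines
          WHERE str_data LIKE ?
          ORDER BY file_name, str_number',
        undef, "%$needle%");
}
```

Because the pattern starts with `%`, an ordinary index on `str_data` generally cannot be used, so this query still scans every stored line; the gain over grep comes mostly from the data already being loaded and parsed.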

Re: Searching local files
by superfrink (Curate) on Jan 13, 2006 at 22:08 UTC
    You might want to look into "full text search" tools. You did say regexes, and I think full-text searches might be limited to substrings, but it might be worth checking into.

    CPAN lists the DBIx::FullTextSearch module that might be useful.

Node Type: perlquestion [id://522975]
Approved by Corion