saranrsm has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I came across "Bit vector" from one of my posts where I was said that I could compress my file by 75% using Bit vector and thus 75% less memory required and 75% shorter the search space to look up for a substring from a string.

So now I wanted to know that what a "bit vectors" is all about,(Sry I havn't heard of it) and how can i use it for my exact string matching problem.I request to our fellow monks to shed some light over it or direct me to some digestible article so that i could grasp it easily (as i am not from the computer science background).

Here is the reference node where i was introduced to it at first Re: Exact string matching

Replies are listed 'Best First'.
Re: Bit Vector ??
by BrowserUk (Patriarch) on Oct 24, 2011 at 10:58 UTC

    There is a big problem with encoding sequences in 2-bit format.

    Whilst they occupy 1/4 the space, and are theoretically 4x faster to search, the problem is that computers only compare bytes.

    With each byte holding 4 codons, you will only be able to find matches at every fourth position.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Bit Vector ??
by Anonymous Monk on Oct 24, 2011 at 10:28 UTC
Re: Bit Vector ??
by mrstlee (Beadle) on Oct 24, 2011 at 13:02 UTC
    Maybe it would be possible to split the file into smaller files and run scripts in parallel on the chunks?
    Of course you would have to compensate for the string lengths:

    For example if you want length to be 3 then the last 2 characters in file n have to be combined with the 1st and 2nd characters in file n+1. So for each chunk there will 2 extra strings - except for the last chunk.

    Naturally parallel processing is only viable if you have sufficient CPU & RAM.
      So what's the conclusion guys..does index uses Boyer moore algorithm or what??
        Dear Anonymous monk, I think you are lost and you are in the wrong thread. We are here discussing about bit vectors...