in reply to Re^2: Selecting the difference between two strings
in thread Selecting the difference between two strings

I am parsing the set of file merges and branches and finding the branching structure of the source archive. There could easily be tens of thousands or hundreds of thousands of these records that have to be parsed.

Well, how complicated is the branching structure, really? Is it something as simple as this?

            - branch1 -
           /- branch2 -\
          /-- branch3 --\
    *** ---- branch4 ----[same sub-structure for all branches]
          \-- branch5 --/
           \- branch6 -/
            \   ...   /

Or is it more complicated? Do some branches only contain one or another subset of the overall structure (branch FOO contains everything, but branch BAR only contains component X)? Or are there additional versioning branches at lower levels along some paths (branch FOO contains components X, Y and Z, but there are two versions of Y within FOO)?

If it's really as simple as my crude diagram, you only need to look for common strings from one direction (left-to-right, or "top-down"). If it's not that simple, you need to be a little more explicit about what you want to derive from the overall structure. What sort of representation for the (more complicated) branching structure would be desirable?

(Not in the sense of "tell us what you want so we can do it for you", but rather "make sure you really know what you want so you work on solving the right problems".)
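
For the simple left-to-right case, a minimal sketch of finding what a set of names shares from the left might look like the following. The '/' separator, the sub name common_leading_components, and the sample names are all made up for illustration; the real record layout is not shown in this thread.

```perl
use strict;
use warnings;

# Return the leading path components shared by every name in the list.
# ASSUMPTION: names are '/'-separated; adjust the split for the real format.
sub common_leading_components {
    my @paths = @_;
    return () unless @paths;
    my @common = split m{/}, shift @paths;
    for my $path (@paths) {
        my @parts = split m{/}, $path;
        my $i = 0;
        $i++ while $i < @common && $i < @parts && $common[$i] eq $parts[$i];
        $#common = $i - 1;    # keep only the components that still match
    }
    return @common;
}

# Hypothetical sample names, not taken from the real archive.
my @names = (
    'proj/branch4/src/main.c',
    'proj/branch4/src/util.c',
    'proj/branch4/doc/readme.txt',
);
print join('/', common_leading_components(@names)), "\n";   # prints "proj/branch4"
```

Comparing whole components rather than single characters keeps a name like branch1 from being treated as a shared prefix of branch12.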

Re^4: Selecting the difference between two strings
by qazwart (Scribe) on Sep 27, 2006 at 04:27 UTC
    The branching structure can be quite complex, with one branch coming off another. And there is no way to detect this by simply looking at the file names, since the branching structure is flattened in the file names.

    Nor is there any one file on all branches. There are dozens of applications in this source archive, and almost all branches are involved with just a single app.

    In order to analyze the branching structure, I have to look at the tens of thousands of branching records. Each record is one file being branched in a single branching event. Creating one branch can create hundreds or even thousands of these records, since there could be hundreds or thousands of files branched at a single time.

    The best way to analyze the data is to simplify these records: If I can strip out the directory and file information from the branch names, I then get a simple fromBranch->toBranch record. Throw away the duplicates, and I have maybe a few dozen records. Build a data tree from these records, and I have the branching structure.
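
    The last two steps (throw away the duplicates, build a tree) could be sketched roughly as below. This assumes the branch names have already been stripped down to bare (from, to) pairs and that the pairs contain no cycles; the sub names and the sample branch names are hypothetical.

```perl
use strict;
use warnings;

# Collapse [from, to] pairs into a unique set and index children by parent.
sub build_branch_tree {
    my @pairs = @_;                      # each element: [ $from, $to ]
    my (%seen, %children);
    for my $pair (@pairs) {
        my ($from, $to) = @$pair;
        next if $seen{$from}{$to}++;     # same branching event, another file
        push @{ $children{$from} }, $to;
    }
    return \%children;
}

# Render the tree depth first, one branch per line, indented by depth.
# ASSUMPTION: no cycles in the data, otherwise this recurses forever.
sub render_tree {
    my ($children, $branch, $depth) = @_;
    $depth = 0 unless defined $depth;
    my @lines = ( ('  ' x $depth) . $branch );
    push @lines, render_tree($children, $_, $depth + 1)
        for sort @{ $children->{$branch} || [] };
    return @lines;
}

# Hypothetical records, for illustration only.
my @records = (
    [ 'main', 'rel1' ],
    [ 'main', 'rel1' ],      # duplicate: same event, different file
    [ 'main', 'rel2' ],
    [ 'rel1', 'rel1_fix' ],
);
print "$_\n" for render_tree(build_branch_tree(@records), 'main');
```

    With tens of thousands of records, the %seen hash does the de-duplication in a single pass, so the tree is only ever built from the few dozen unique pairs.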

    Where I am getting stuck is removing the directory and file names from the branch names. That's why I asked this particular question.

    Even though looping isn't that efficient, I could have easily written a program with a loop in an hour or two, and I doubt the whole program would have taken more than a few minutes to run. I would have saved a lot of time in attempting to research this problem and requesting help. My problem would have been solved, I would have gotten the kudos of those around me, and at the end of the week, I would collect my paycheck. What I wouldn't have done is improve my Perl hashing skills.

    Instead, I decided there has to be a better way to manipulate the whole string at once instead of looping a single character at a time. Given Perl's toolkit of bitwise operations and regular expressions, I figured there must be some way to XOR or AND the strings together to separate the chaff from the wheat.
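
    One way the XOR idea could work, as a sketch rather than a finished answer: XORing two strings yields a "\0" byte at every position where they agree, so a regex on the result measures the common leading run without a per-character loop. The sub name string_difference and the sample names are made up, and this assumes plain byte strings that contain no NUL characters.

```perl
use strict;
use warnings;

# Return what is left of $x and $y after removing their common leading
# and trailing text. ASSUMPTION: byte strings with no "\0" in them.
sub string_difference {
    my ($x, $y) = @_;

    # "\0" marks each position where both strings hold the same byte.
    my ($lead) = ($x ^ $y) =~ /^(\0*)/;
    my $rest_x = substr $x, length $lead;
    my $rest_y = substr $y, length $lead;

    # Reverse the remainders so the common tail lines up at position 0.
    my ($trail) = ((scalar reverse $rest_x) ^ (scalar reverse $rest_y)) =~ /^(\0*)/;
    my $drop = length $trail;

    return (
        substr($rest_x, 0, length($rest_x) - $drop),
        substr($rest_y, 0, length($rest_y) - $drop),
    );
}

# Hypothetical names, for illustration only.
my ($from, $to) = string_difference(
    'src/branch4/lib/foo.c',
    'src/branch12/lib/foo.c',
);
print "$from -> $to\n";    # prints "4 -> 12"
```

    Note that this works character by character, so branch1 versus branch12 would come back as an empty string and "2"; the result would still need widening to the nearest separator to get whole branch names.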

    Finding an answer improves my understanding of Perl. That's what I am really after.