in reply to Re: Selecting the difference between two strings
in thread Selecting the difference between two strings

This problem has to do with the Perforce version control system, and finding the branching structure in the source archive. I am parsing the set of file merges and branches and finding the branching structure of the source archive. There could easily be tens of thousands or hundreds of thousands of these records that have to be parsed.

The truth is I can take a few shortcuts to make this task easier. For example, it is common practice in Perforce to put branch names in all caps, so I could really search for the first instance of m</[A-Z_]+/> and find what I'm looking for. Or, I can simply assume that the fourth set of characters between the slashes is the name of the branch because that's where it is in my particular source archive.
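For what it's worth, both shortcuts are one-liners. A minimal sketch, assuming a path shaped like the examples further down this thread; the variable names and the field index are illustrative only and depend on how your depot is laid out:

```perl
use strict;
use warnings;

# Hypothetical depot path, modelled on the examples in this thread.
my $path = '//Efp/Acme/MAIN/Mydir/bar.c';

# Shortcut 1: the first path component made up only of capitals and underscores.
my ($by_case) = $path =~ m</([A-Z_]+)/>;

# Shortcut 2: a fixed position. Splitting on "/" yields two empty leading
# fields (from the "//"), then Efp, Acme, MAIN, ... so MAIN is index 4 here.
my $by_position = (split m{/}, $path)[4];

print "$by_case $by_position\n";
```

Both rely on conventions of one particular archive, which is exactly why they feel like cheating.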

However, I look at this and say, "These are strings, I want to manipulate them, and that's what you use Perl for." There's gotta be a way to simply say, "Take these two strings and snip the matching parts." Maybe not quite that simple, but you get the idea.

I'm really just looking for a Perl solution to this problem. Something that will make Python programmers weep and shell scripters green with envy.

Re^3: Selecting the difference between two strings
by hgolden (Pilgrim) on Sep 26, 2006 at 22:22 UTC
    I don't know much about Perforce, but from what you've said, there are probably easier ways to do this than what you're asking for.

    If all of the information is of the format that you show in your original post, you could pretty easily form an indented list showing the branches.

    For simplicity, let's call the text between slashes units. Now, sort the list ASCIIbetically, and run a very simple loop that remembers the previous entry. By comparing the current entry to the one above it, you can determine the number of tabs to print before the current entry. I.e., if they have no units in common, we print no tabs. If they have three units in common, but both are longer than three units, then we insert three tabs. If they have three units in common, but the above entry is only three units long, then we need four tabs.
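The loop described above might look roughly like this. It is only a sketch of that rule, not tested against real Perforce output; `indent_depths` is a made-up name, and the sample paths are the ones used elsewhere in this thread:

```perl
use strict;
use warnings;

# Return [depth, path] pairs for the sorted paths, following the rule above:
# depth = number of leading units shared with the previous entry, plus one
# when the previous entry is entirely a prefix of the current one.
sub indent_depths {
    my @paths = sort @_;
    my (@prev, @out);
    for my $path (@paths) {
        my @units  = grep { length } split m{/}, $path;
        my $common = 0;
        $common++ while $common < @prev
                     && $common < @units
                     && $prev[$common] eq $units[$common];
        my $depth = $common;
        $depth++ if @prev && $common == @prev;   # previous entry is a full prefix
        push @out, [$depth, $path];
        @prev = @units;
    }
    return @out;
}

my @paths = (
    '//Efp/Acme/MAIN/Mydir/bar.c',
    '//Efp/Acme/FOO/Mydir/bar.c',
    '//Efp/Acme/BAR/Mydir/bar.c',
);
print "\t" x $_->[0], $_->[1], "\n" for indent_depths(@paths);
```

With these three paths the BAR entry gets no tabs and the FOO and MAIN entries get two each, since each shares only Efp and Acme with the entry above it.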

    Anyway, you can fit the particulars to what you want. To me, a Perl solution is one that's elegant in Perl, and not one that takes advantage of Perl's string abilities.

    Hays

      There is a difference between a branch and a directory for a file. It just happens that the branch name is the fourth of what you called a unit. I have a few dozen branches but tens of thousands of files, so sorting the way you suggest will simply give me a sorted list of all files, not of the branches.

      Second of all, although I can tell you that the branch name is the fourth unit, I cannot tell you the relationship between branches simply by looking at the names. For example:

      //Efp/Acme/MAIN/Mydir/bar.c
      //Efp/Acme/FOO/Mydir/bar.c
      //Efp/Acme/BAR/Mydir/bar.c

      In this example, MAIN, FOO, and BAR are branches. However, what is the relationship between these branches? Did both BAR and FOO branch directly off of MAIN, or did BAR branch off of MAIN, and then FOO branch off of BAR? Nothing in the name tells you this. Maybe MAIN was what branched off of FOO. Simply sorting this information tells me nothing.

      The only way I know is by looking at the integration records that tell me for each and every file the fromBranch and the toBranch. So, I might have a few hundred records of files branching from MAIN to FOO. By stripping out the file names (which is what I want to do), and removing duplicates, I end up with a single record telling me that FOO branched off of MAIN.
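A unit-by-unit version of that stripping step, as a rough sketch. The record layout here (pairs of from/to file paths) and the name `branch_pair` are assumptions for illustration, not Perforce's actual output format, and the FOO-to-BAR record is invented sample data:

```perl
use strict;
use warnings;

# Reduce a (fromFile, toFile) pair to the units that differ.
sub branch_pair {
    my ($from, $to) = @_;
    my @f = split m{/}, $from;
    my @t = split m{/}, $to;
    # Drop matching units from the front ...
    while (@f && @t && $f[0] eq $t[0])   { shift @f; shift @t }
    # ... and from the back.
    while (@f && @t && $f[-1] eq $t[-1]) { pop @f; pop @t }
    return (join('/', @f), join('/', @t));
}

# Hypothetical integration records: [fromFile, toFile].
my @records = (
    ['//Efp/Acme/MAIN/Mydir/bar.c', '//Efp/Acme/FOO/Mydir/bar.c'],
    ['//Efp/Acme/MAIN/Mydir/foo.c', '//Efp/Acme/FOO/Mydir/foo.c'],
    ['//Efp/Acme/FOO/Mydir/bar.c',  '//Efp/Acme/BAR/Mydir/bar.c'],
);

# A hash throws away the duplicates while keeping first-seen order.
my (%seen, @edges);
for my $rec (@records) {
    my $edge = join '->', branch_pair(@$rec);
    push @edges, $edge unless $seen{$edge}++;
}
print "$_\n" for @edges;
```

Note that if a file was renamed or moved as part of the integration, more than just the branch unit will survive the stripping, so such records would need separate handling.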

      By taking these from/to branch entries, I can then reconstruct the entire tree structure of the branches. However, getting there means stripping the file information, which is why I asked this question.

Re^3: Selecting the difference between two strings
by graff (Chancellor) on Sep 27, 2006 at 03:37 UTC
    I am parsing the set of file merges and branches and finding the branching structure of the source archive. There could easily be tens of thousands or hundreds of thousands of these records that have to be parsed.

    Well, how complicated is the branching structure, really? Is it something as simple as this?

                - branch1 -
              /- branch2 -\
             /-- branch3 --\
        *** ---- branch4 ----[same sub-structure for all branches]
             \-- branch5 --/
              \- branch6 -/
               \   ...   /

    Or is it more complicated? Do some branches only contain one or another subset of the overall structure (branch FOO contains everything, but branch BAR only contains component X)? Or are there additional versioning branches at lower levels along some paths (branch FOO contains components X, Y and Z, but there are two versions of Y within FOO)?

    If it's really as simple as my crude diagram, you only need to look for common strings from one direction (left-to-right, or "top-down"). If it's not that simple, you need to be a little more explicit about what you want to derive from the overall structure. What sort of representation for the (more complicated) branching structure would be desirable?

    (Not in the sense of "tell us what you want so we can do it for you", but rather "make sure you really know what you want so you work on solving the right problems".)

      The branching structure can be quite complex with one branch coming off another. And, there is no way to detect this by simply looking at the file names since the branching structure is flattened in the file names.

      Nor is there any one file that exists on all branches. There are dozens of applications in this source archive, and almost all branches are involved with just a single app.

      In order to analyze the branching structure, I have to look at the tens of thousands of branching records. Each record is one file being branched in a single branching event. Creating one branch can create hundreds or even thousands of these records, since there could be hundreds or thousands of files branched at a single time.

      The best way to analyze the data is to simplify these records: If I can strip out the directory and file information from the branch names, I then get a simple fromBranch->toBranch record. Throw away the duplicates, and I have maybe a few dozen records. Build a data tree from these records, and I have the branching structure.
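The last step, building the tree from the de-duplicated records, could be sketched like this. The edge list is invented sample data (BAZ is a made-up branch), and the sketch assumes the records contain no cycles; merges back into a parent branch would have to be filtered out first, or `render` would recurse forever:

```perl
use strict;
use warnings;

# Hypothetical, already de-duplicated [fromBranch, toBranch] records.
my @edges = (['MAIN', 'FOO'], ['FOO', 'BAR'], ['MAIN', 'BAZ']);

my (%children, %has_parent);
for my $edge (@edges) {
    my ($from, $to) = @$edge;
    push @{ $children{$from} }, $to;
    $has_parent{$to} = 1;
}

# Render the tree below $branch as indented lines (two spaces per level).
sub render {
    my ($branch, $depth) = @_;
    my $pad   = '  ' x $depth;
    my @lines = ("$pad$branch");
    push @lines, render($_, $depth + 1) for sort @{ $children{$branch} || [] };
    return @lines;
}

# Roots are branches that never appear as a toBranch.
my @roots = grep { !$has_parent{$_} } sort keys %children;
my @tree  = map { render($_, 0) } @roots;
print "$_\n" for @tree;
```

A hash of arrays keyed by parent is about the simplest structure that works here; with only a few dozen edges there is no need for anything heavier.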

      Where I am getting stuck is removing the directory and file names from the branch names. That's why I asked this particular question.

      Even though looping isn't that efficient, I could have easily written a program with a loop in an hour or two, and I doubt the whole program would have taken more than a few minutes to run. I would have saved a lot of time in attempting to research this problem and requesting help. My problem would have been solved, I would have gotten the kudos of those around me, and at the end of the week, I would collect my paycheck. What I wouldn't have done is improve my Perl hacking skills.

      Instead, I decided there has to be a better way to manipulate the whole string at once instead of looping a single character at a time. Given Perl's toolkit of bitwise operations and regular expressions, I figured there must be some way to XOR or AND the strings together to separate the chaff from the wheat.
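One way the XOR idea can be made to work, as a sketch rather than a definitive answer: XORing two strings yields a NUL byte at every position where they agree, so the run of leading NULs gives the length of the common prefix, and doing the same on the reversed strings gives the common suffix. The names `common_prefix_len` and `snip_common` are made up here. It assumes `^` is acting as the string operator, which is the default unless the `bitwise` feature is enabled (for example by `use v5.28`), in which case `^.` would be needed instead:

```perl
use strict;
use warnings;
use List::Util qw(min);

# Length of the run of identical leading characters in two strings.
sub common_prefix_len {
    my ($s1, $s2) = @_;
    # String XOR gives "\0" at every position where the characters agree.
    my $xor = $s1 ^ $s2;
    $xor =~ /^\0*/;
    return $+[0];
}

# Return what is left of each string after snipping the shared head and tail.
sub snip_common {
    my ($s1, $s2) = @_;
    my $head = common_prefix_len($s1, $s2);
    # Back up to the last slash so a unit is never cut in half (MAIN vs MAINT).
    $head = rindex(substr($s1, 0, $head), '/') + 1;

    my $tail = common_prefix_len(scalar reverse($s1), scalar reverse($s2));
    # Keep the shared tail from overlapping the shared head.
    $tail = min($tail, length($s1) - $head, length($s2) - $head);
    # Likewise, start the shared tail at a slash.
    my $slash = index(substr($s1, length($s1) - $tail), '/');
    $tail = $slash < 0 ? 0 : $tail - $slash;

    return map { substr($_, $head, length($_) - $head - $tail) } $s1, $s2;
}

my ($from, $to) = snip_common(
    '//Efp/Acme/MAIN/Mydir/bar.c',
    '//Efp/Acme/FOO/Mydir/bar.c',
);
print "$from -> $to\n";
```

There is no explicit character loop here; the XOR and the regex do the scanning. The slash adjustments are the part that is specific to paths: a pure character-level snip would turn FOO and BOO into F and B.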

      Finding an answer improves my understanding of Perl. That's what I am really after.