in reply to Re^11: Renaming an image file
in thread Renaming an image file

So where is the problem?
Let's see: you have back-end storage. It stores images. The user has stored an image twice (a copy, perhaps). For whatever reason, filenames are remapped. (You aren't assuming the actual user is going to remember the random file names, are you?) The OP implements the suggestion to rehash based on content, making the back end merge the two files. The user, whatever front end he's using, still sees two files. He then decides to delete one of his copies. Or to modify one (keeping the other copy as an original).

Oops.

Of course, maybe the OP's system doesn't work that way. We do not know. But I think it's really, really bad to give the OP advice that may cause data loss, and then, when it's pointed out, to wiggle out of it by making assumptions about the OP's system.

Replies are listed 'Best First'.
Re^13: Renaming an image file
by Anonymous Monk on Nov 29, 2010 at 09:37 UTC
    But I think it's really, really bad to give the OP advice that may cause data loss, and then, when it's pointed out, to wiggle out of it by making assumptions about the OP's system.

    It's not that bad. To answer the question you have to make some assumptions, like taking the OP at his word. If you want better advice, ask better questions.

    And by the way, no one is trying to wiggle out of anything. I think it's poor form to go off on a tangent about other people's advice, when what you really wanted was to tell the OP that his requirements are unclear.

      Uhm, I don't find anything unclear about the OP's request. It certainly doesn't provide any wiggle room for a solution that makes it possible to delete data. Now, I don't think BrowserUK realized the chance of collision is 1 if there are duplicated files when he suggested hashing on file content; but after morgon pointed this out, Anonymous Monk presented that as a feature. That was what I was objecting to, and then BrowserUK came back saying he couldn't see how data loss would happen, based on assumptions about what the OP's environment might look like.

      If instead people had said "you know, morgon is right, hashing on content has a much higher chance of collisions than hashing on file names", rather than defending the bad advice, this long subthread wouldn't have happened.

      But Perl is a religion, isn't it? We don't admit mistakes or better points of view.

        But Perl is a religion, isn't it? We don't admit mistakes or better points of view.

        You know that's bullshit.

Re^13: Renaming an image file
by BrowserUk (Patriarch) on Nov 29, 2010 at 16:47 UTC
    Or modify one (keeping the other copy as an original).

    If the OP's system allows users to "edit" their images, then once an image has been edited, the new image will hash to a different value. No data loss.
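    To illustrate the point, here is a minimal sketch using Digest::MD5 (the module named in the original suggestion); the image bytes are made up for the example:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical image contents before and after an "edit".
my $original = 'original image bytes';
my $edited   = 'original image bytes, one pixel changed';

# Any change to the contents yields a different digest, so the
# edited copy gets a new name and the original is not overwritten.
my $old_name = 'joe_' . md5_hex($original) . '.jpg';
my $new_name = 'joe_' . md5_hex($edited)   . '.jpg';

print "$old_name\n$new_name\n";
```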

    (You aren't assuming the actual user is going to remember the random file names, are you?).

    Indeed not. Hence the only reason for stating my one assumption about the OP's system: that for the renaming process to work, users would not be able to have direct links to their files.

    Therefore, as part of the renaming process it would be necessary to adjust the mechanisms that map user references to their files.

    The user, whatever front end he's using, still sees two files. He then decides to delete one of his copies.... Oops.

    As a part of the necessary re-mapping of user references to files, it would be equally necessary--and frankly, a trivial matter--to detect such double references.

    And once detected, any number of mechanisms might be used to deal with your assumptive scenario.

    The OP might opt for a simple mechanism like the one sometimes used by Windows Explorer, for example: "Copy of joe_abcdef1234567890.jpg". It has the merit of bringing the duplication to the user's attention, but is maybe too wordy.

    Perhaps the older mechanism: "joe_abcdef1234567890(2).jpg".

    Or perhaps it makes sense for the OP's system simply to implement a reference-counting mechanism (say, in his DB) that ensures deletions only delete references--not files--until the reference count drops to one.
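    A reference-counting sketch of that idea, with a plain hash standing in for whatever DB table the OP actually uses (the table, the helper names, and the image directory are all assumptions here):

```perl
use strict;
use warnings;

my %refcount;    # content-hash => number of user references

sub add_reference {
    my ($hash) = @_;
    return ++$refcount{$hash};
}

sub delete_reference {
    my ($hash) = @_;
    return unless $refcount{$hash};
    if ( --$refcount{$hash} == 0 ) {
        delete $refcount{$hash};
        # unlink "images/$hash.jpg";    # only now is the real file removed
        return 'file deleted';
    }
    return 'reference deleted';         # another reference still points at the file
}

add_reference('abcdef1234567890');      # joe's first copy
add_reference('abcdef1234567890');      # joe's duplicate maps to the same file
print delete_reference('abcdef1234567890'), "\n";   # reference deleted
print delete_reference('abcdef1234567890'), "\n";   # file deleted
```

    Deleting one "copy" then only removes joe's reference; the stored file survives until the last reference goes.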

    Of course, maybe the OP's system doesn't work that way. We do not know. But I think it's really, really bad to give the OP advice that may cause data loss,

    All the alternatives so far offered have possible failure scenarios. For example, the "hash the name" suggestion can fail--silently and undetectably--as I outlined in Re^5: Renaming an image file:

    Why is replacing the number with the md5 of that number a poor idea?
    Let's say you have the file joe_1.jpg and you rename it to joe_c4ca4238a0b923820dcc509a6f75849b.jpg; then next week joe uploads a new photo and calls it joe_1.jpg. Different photo, but same name. Overwrite. People tend to be very lazy and will likely stick to numbers in the low thousands, so the potential for clashes is huge.

    It is the OP's knowledge of his system that is required in order to make a definitive selection between the solutions offered. I offered my suggestion in good faith, just as I'm sure all the other responders to the original question did.

    Without further details of the OP's system, none of us could possibly either fully guarantee and endorse, or withdraw as broken, our initial offerings.

    And it is impossible to realistically address possible flaws in our suggestions until realistic failure modes are stated rather than implied. As I did above, and as you didn't until I prompted you to.

    But this has gone way beyond what is good for the OP's thread.


    Of course, maybe the OP's system doesn't work that way. We do not know. But I think it's really, really bad to give the OP advice that may cause data loss, and then, when it's pointed out, to wiggle out of it by making assumptions about the OP's system.

    I was not seeking "wiggle room", just a clear statement of the envisaged failure scenario--along with the assumptions that underpin it--rather than some half-mumbled, indirect implications of doom and disaster. Shame it took so many direct questions to get a direct answer.

    Perhaps the OP's best solution would be to hash the existing name combined with the file's contents. And while he is at it, he could throw the file length and upload/creation date-time into the pot. That would certainly address all the failure scenarios so far outlined.
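    A sketch of that combined approach; the helper name and the separator are my own invention, and MD5 is kept only because it is the digest already under discussion:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Digest the original name, the length, the upload time, and the
# contents together, so that even byte-identical uploads made at
# different times (or under different names) hash differently.
sub combined_digest {
    my ( $name, $contents, $upload_time ) = @_;
    return md5_hex( join "\0", $name, length($contents), $upload_time, $contents );
}

my $dup1 = combined_digest( 'joe_1.jpg', 'same bytes', 1291000000 );
my $dup2 = combined_digest( 'joe_1.jpg', 'same bytes', 1291000060 );
print $dup1 eq $dup2 ? "collision\n" : "distinct\n";   # distinct
```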

    But, of course, all those failure modes--including yours--are based upon assumptions about the OP's system.

    And I'm damn sure that whatever solution is offered, in the absence of specific details of the OP's system, someone will be able to imagineer a new failure scenario.

    Bottom line: Whenever we reply to an OP, we inevitably make some assumptions. It is unavoidable in the absence of a fully detailed specification document--which, from experience, can run to hundreds if not thousands of pages. We here are not in that game.

    We make some assumptions; come up with solutions based upon our assumptions; and offer them to the OP. It falls to the OP to assess them within their knowledge of their systems and arrive at their conclusions.

    In the case of this specific OP's question, I considered my "hash the contents" suggestion superior to the "hash the name" or "generate a number from the date and rand()" solutions previously offered, because I envisaged fewer failure modes. And based upon the specific scenarios so far outlined, I still do.

    Do I see my original, brief suggestion as the perfect solution? No.

    But it is still superior to the solution that you offered. By simple virtue of the fact that I actually offered one.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      If the OP's system allows users to "edit" their images, then when the image has been edited, the new image will hash to a different value. No data loss.
      Really? What did the OP write that makes you assume this? All he describes is renaming existing files. I'd say that's another assumption.
      Whenever we reply to an OP, we inevitably make some assumptions.
      Sure. State them in your reply. Knowingly assuming things and keeping them hidden from your reply is very, very bad. This was your answer: "Rather than a random number, I'd suggest using a digest of the file's contents. Digest::MD5 for example. Whilst not guaranteed to be unique, the chance of collision is remote." All the assumptions you made here, you kept to yourself.

      But, again, I'd like to point out that I did not have a problem with your original suggestion. I had problems with the replies following morgon's observation that there's a problem if there are two files with the same content.

      But it is still superior to the solution that you offered.
      I did not offer a solution because, by the time I read the thread, good suggestions had already been made. There's little value in repeating suggestions.
        What did the OP write that makes you assume this?

        The OP doesn't enter the equation. You invented the "editing" scenario.

        I took it to its logical conclusion, should the "hash the contents" suggestion be adopted.

        So that's a strawman.

        All the assumptions you made here, you kept for yourself.

        Oh my! Do you really credit me with such prescient foresight that I can instantly foresee every possible scenario the OP might encounter?

        Sir! You do me much too much credit.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.