> I don't think forks would allow me to write output to the same file in sequence without some kind of IPC
I guess you didn't bother to follow the link or didn't bother to understand the material presented at that link.
> without some kind of IPC, which would probably slow things down
So you either ignored or don't believe the point I just stated. If you had bothered to read and understand what I linked to, you would not have clung to this assumption.
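The linked material isn't reproduced here, so the following is not necessarily what it describes. It is just one minimal sketch, assuming a Unix-like system with a native fork(), of forked children whose output lands in one file strictly in sequence. `fork_ordered_output` is a hypothetical helper name. Each child gets its own pipe and only the parent ever writes to the file, so no locking or shared state is involved.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper (not from any module): run $work->($item) in a forked
# child for each item, then write every child's output to $out_fh in the
# original item order. Each child has its own pipe, and only the parent
# touches $out_fh, so the file sees the results strictly in sequence.
# For simplicity this forks one child per item; real code would cap that.
sub fork_ordered_output {
    my ($out_fh, $work, @items) = @_;
    my @children;
    for my $item (@items) {
        pipe(my $r, my $w) or die "pipe: $!";
        my $pid = fork() // die "fork: $!";
        if ($pid == 0) {                    # child: compute, send, exit
            close $r;
            print {$w} $work->($item);
            close $w;
            POSIX::_exit(0);                # skip END blocks and destructors
        }
        close $w;                           # parent keeps only the read end
        push @children, [ $pid, $r ];
    }
    for my $child (@children) {             # collect in submission order
        my ($pid, $r) = @$child;
        my $output = do { local $/; <$r> }; # slurp everything the child wrote
        close $r;
        waitpid($pid, 0);
        print {$out_fh} $output // '';
    }
    return;
}

# Example: lines come out in item order even if later children finish first.
fork_ordered_output(\*STDOUT, sub { "item $_[0] squared is " . $_[0]**2 . "\n" }, 1 .. 3);
```

The children still run concurrently; only the final write is serialized, and the only IPC is one pipe read per child.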
> the worker threads in my script do a lot of work, but they also require a lot of data
Then you will get better performance if you go to some unusual lengths to make that data available more efficiently than the easy options, like threads::shared, can. Copying data from a parent process to child processes will be significantly faster using Perl's built-in fork than using the forks module (which is itself significantly faster than threads::shared).
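A minimal sketch of the built-in fork point, assuming a Unix-like system (on Windows, Perl emulates fork() with threads, so this reasoning does not carry over). The data is built before fork(), so each child simply has it, with nothing serialized on the way in; only each child's small partial result comes back over a pipe. `parallel_sum` is a hypothetical name, and the striped split of indexes is an arbitrary choice.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: sum @$data using $n_workers forked children.
# @$data is built before fork(), so every child already has it (shared
# copy-on-write with the parent); nothing is serialized on the way in.
# Only each child's partial sum travels back, over a per-child pipe.
sub parallel_sum {
    my ($data, $n_workers) = @_;
    my @kids;
    for my $k (0 .. $n_workers - 1) {
        pipe(my $r, my $w) or die "pipe: $!";
        my $pid = fork() // die "fork: $!";
        if ($pid == 0) {
            close $r;
            my $s = 0;
            # Child $k takes indexes $k, $k + $n_workers, $k + 2*$n_workers, ...
            for (my $i = $k; $i < @$data; $i += $n_workers) {
                $s += $data->[$i];
            }
            print {$w} "$s\n";
            close $w;
            POSIX::_exit(0);
        }
        close $w;
        push @kids, [ $pid, $r ];
    }
    my $total = 0;
    for my $kid (@kids) {
        my ($pid, $r) = @$kid;
        chomp(my $partial = <$r>);
        close $r;
        waitpid($pid, 0);
        $total += $partial;
    }
    return $total;
}

my @numbers = (1 .. 1_000);
print parallel_sum(\@numbers, 4), "\n";    # prints 500500
```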
Since you said "IPC [...] would probably slow things down", it sounds like you haven't even tried that. Frankly, that is what I would try first (probably using an approach similar to MCE's, though I have my own, simpler implementation of that idea).
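A rough sketch of that kind of chunking approach. It is loosely in the spirit of MCE but does not use or mimic MCE's actual API, and it is not the simpler implementation mentioned above; `chunked_pool` and its line-based protocol are hypothetical. A fixed pool of forked workers each talk to the parent over a socketpair, and a worker gets its next chunk only after returning the result for the previous one, so faster workers naturally take more chunks. It assumes a Unix-like fork() and items that are plain strings without tabs or newlines.

```perl
use strict;
use warnings;
use IO::Handle;
use IO::Select;
use POSIX ();
use Socket qw(AF_UNIX SOCK_STREAM PF_UNSPEC);

# Hypothetical helper, loosely in the spirit of MCE but NOT its API.
# $work->(\@chunk) runs in a worker and must return one line of text
# (no newline). Returns the per-chunk results in chunk order.
sub chunked_pool {
    my ($items, $chunk_size, $n_workers, $work) = @_;
    my @chunks;
    for (my $i = 0; $i < @$items; $i += $chunk_size) {
        my $end = $i + $chunk_size - 1;
        $end = $#$items if $end > $#$items;
        push @chunks, [ @{$items}[ $i .. $end ] ];
    }

    my @workers;    # [ pid, parent-side socket ]
    for my $slot (1 .. $n_workers) {
        socketpair(my $parent, my $child, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
            or die "socketpair: $!";
        $_->autoflush(1) for $parent, $child;
        my $pid = fork() // die "fork: $!";
        if ($pid == 0) {
            # Drop inherited parent-side sockets of earlier workers so each
            # worker sees EOF as soon as the parent closes its own socket.
            close $_->[1] for @workers;
            close $parent;
            while (my $line = <$child>) {     # one chunk per line
                chomp $line;
                print {$child} $work->([ split /\t/, $line, -1 ]), "\n";
            }
            POSIX::_exit(0);
        }
        close $child;
        push @workers, [ $pid, $parent ];
    }

    my (@results, %chunk_of);    # %chunk_of: fileno => chunk being worked on
    my $sel  = IO::Select->new;
    my $next = 0;
    my $dispatch = sub {         # hand the next chunk to $sock, or retire it
        my ($sock) = @_;
        if ($next < @chunks) {
            print {$sock} join("\t", @{ $chunks[$next] }), "\n";
            $chunk_of{ fileno $sock } = $next++;
            return 1;
        }
        $sel->remove($sock) if $sel->exists($sock);
        close $sock;             # worker reads EOF and exits
        return 0;
    };
    for my $w (@workers) { $sel->add($w->[1]) if $dispatch->($w->[1]) }
    while ($sel->count) {
        for my $sock ($sel->can_read) {
            my $line = <$sock>;
            die "worker exited early\n" unless defined $line;
            chomp $line;
            $results[ $chunk_of{ fileno $sock } ] = $line;
            $dispatch->($sock);
        }
    }
    waitpid($_->[0], 0) for @workers;
    return @results;
}

my @sums = chunked_pool([1 .. 10], 3, 2, sub { my $s = 0; $s += $_ for @{ $_[0] }; $s });
print "@sums\n";    # prints "6 15 24 10"
```

Sending whole chunks rather than single items keeps the per-message IPC cost small relative to the real work, which is why it is worth measuring before assuming IPC is too slow.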
If that method of communicating data is too slow, then you probably want to do more work to communicate the data using shared memory. From parent to child, that can be done by simply storing the data in a contiguous block of memory that the children can read without forcing copy-on-write copies of memory pages (which happens even with read-only access to Perl data structures, since Perl may update internal fields, such as reference counts or cached numeric values, even when your code only reads). Going the other direction is similar but harder.
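A minimal sketch of the parent-to-child half, assuming numeric data and a Unix-like fork(). The values are packed into one contiguous string of native doubles before forking, and children read individual elements with substr/unpack through a reference. Reading that way copies out only 8 bytes at a time and never writes to the big buffer, so its pages should be able to stay shared. `pack_doubles`, `element_count`, `double_at`, and `parallel_packed_sum` are hypothetical helpers, and this does not address the harder child-to-parent direction.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helpers: keep bulk numeric data in ONE packed string instead
# of a Perl array of scalars. Children read it through a reference and copy
# out only the 8 bytes they need, so the buffer itself is never written to.
sub pack_doubles  { return pack('d*', @_) }
sub element_count { my ($buf_ref) = @_; return length($$buf_ref) / 8 }
sub double_at {
    my ($buf_ref, $i) = @_;
    return unpack('d', substr($$buf_ref, 8 * $i, 8));
}

sub parallel_packed_sum {
    my ($buf_ref, $n_workers) = @_;
    my $n = element_count($buf_ref);
    my @kids;
    for my $k (0 .. $n_workers - 1) {
        my $from = int($k * $n / $n_workers);          # contiguous slice
        my $to   = int(($k + 1) * $n / $n_workers) - 1;
        pipe(my $r, my $w) or die "pipe: $!";
        my $pid = fork() // die "fork: $!";
        if ($pid == 0) {
            close $r;
            my $s = 0;
            $s += double_at($buf_ref, $_) for $from .. $to;
            print {$w} "$s\n";
            close $w;
            POSIX::_exit(0);
        }
        close $w;
        push @kids, [ $pid, $r ];
    }
    my $total = 0;
    for my $kid (@kids) {
        my ($pid, $r) = @$kid;
        chomp(my $partial = <$r>);
        close $r;
        waitpid($pid, 0);
        $total += $partial;
    }
    return $total;
}

my $packed = pack_doubles(map { $_ * 0.5 } 1 .. 1_000);
print parallel_packed_sum(\$packed, 4), "\n";    # prints 250250
```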
> I'd rather have a dozen cores being utilized than one.
And I'd think you'd rather have those cores getting more real work done than spending their time on expensive operations like creating tons of new Perl "threads".
> the worker threads in my script do a lot of work
Then decisions about optimizing the real code should probably not be based on performance comparisons of approaches benchmarked with threads that do only trivial amounts of work.