Multiple independent embedded Python Interpreters on multiple operating system threads invoked from C/C++ program
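This isn't exactly an answer to your question, but you could use separate processes instead of threads. Each process then gets its own interpreter (and its own GIL), so the problems of running several interpreters inside one process should vanish.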
Pros:
- No need to hack Python (or to make sure the result works in all of the intended cases)
- Probably less development effort overall
- Easy upgrades to new Python versions
- Clearly defined interfaces between different processes, thus easier to get right and debug
Cons:
- Possibly somewhat more heavyweight, depending on your platform (processes are relatively lightweight on Linux)
If you use shared memory for IPC, your resulting application code shouldn't differ too much from what you'd get with threads.
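To illustrate the shared-memory point, here is a minimal sketch in Python. It is not your C/C++ host, just the same pattern shown from the Python side. It assumes a POSIX system where the "fork" start method is available. The names `worker` and `run_workers` are made up for this example. Each worker runs in its own process, and therefore its own interpreter, and writes into a shared array much as a thread would.

```python
import multiprocessing as mp


def worker(index, shared):
    # Runs in a separate process, i.e. in its own Python interpreter.
    # Each worker writes only to its own slot, so no lock is needed here.
    shared[index] = index * index


def run_workers(count):
    # Assumption: POSIX host, so the "fork" start method exists.
    ctx = mp.get_context("fork")
    # Zero-initialised C ints living in memory shared with the children.
    shared = ctx.RawArray("i", count)
    procs = [ctx.Process(target=worker, args=(i, shared)) for i in range(count)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Copy the results out of shared memory into an ordinary list.
    return list(shared)


if __name__ == "__main__":
    print(run_workers(4))  # prints [0, 1, 4, 9]
```

In a C/C++ host, the rough equivalent would be to `fork()` the workers and share a region created with `mmap()` or `shm_open()`, with each child initializing its own embedded interpreter.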
Given that some people argue you should always prefer processes over threads, I'd at least consider this as an alternative if it fits your constraints.