
C++ replacement for C99 VLAs (goal: preserve performance)


Create a large buffer (MB+) in thread-local storage. (Actual memory on heap, management in TLS).

Allow clients to request memory from it in a FILO (stack-like) manner. This mimics how C VLAs work, and it is efficient: each request/return is just an integer addition/subtraction.

Get your VLA storage from it.

Wrap it pretty, so you can say stack_array<T> x(1024);, and have that stack_array deal with construction/destruction (note that ->~T() where T is int is a legal noop, and construction can similarly be a noop), or make stack_array<T> wrap a std::vector<T, TLS_stack_allocator>.
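
For concreteness, here is a minimal sketch of the idea, assuming a fixed per-thread capacity; all names are illustrative, not a drop-in implementation: a thread_local bump arena with FILO release, and a stack_array that takes its storage from it.

    // Minimal sketch; every name here is illustrative.
    #include <cstddef>
    #include <new>

    namespace tls_stack {
        constexpr std::size_t capacity = 4 * 1024 * 1024;   // e.g. 4 MB per thread
        thread_local char* buffer = nullptr;                 // actual memory on the heap
        thread_local std::size_t top = 0;                    // current stack offset

        inline std::size_t round_up(std::size_t n) {
            constexpr std::size_t a = alignof(std::max_align_t);
            return (n + a - 1) & ~(a - 1);
        }

        inline void* push(std::size_t bytes) {               // request: one addition
            bytes = round_up(bytes);
            if (!buffer) buffer = static_cast<char*>(::operator new(capacity));
            if (top + bytes > capacity) throw std::bad_alloc();
            void* p = buffer + top;
            top += bytes;
            return p;
        }

        inline void pop(std::size_t bytes) {                 // return: one subtraction (FILO)
            top -= round_up(bytes);
        }
    }

    template <class T>
    class stack_array {
        T* data_;
        std::size_t size_;
    public:
        explicit stack_array(std::size_t n)
            : data_(static_cast<T*>(tls_stack::push(n * sizeof(T)))), size_(n) {
            for (std::size_t i = 0; i < size_; ++i) ::new (data_ + i) T;   // no-op for int
        }
        ~stack_array() {
            for (std::size_t i = size_; i-- > 0; ) data_[i].~T();          // no-op for int
            tls_stack::pop(size_ * sizeof(T));
        }
        stack_array(const stack_array&) = delete;
        stack_array& operator=(const stack_array&) = delete;

        T& operator[](std::size_t i) { return data_[i]; }
        T* begin() { return data_; }
        T* end()   { return data_ + size_; }
        std::size_t size() const { return size_; }
    };

Used as an ordinary scoped local (stack_array<int> x(1024);), destruction happens in reverse order of construction, so the FILO discipline holds automatically.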

The data will not be as local as C VLA data, because it effectively lives on a separate stack. If locality really matters, you can use SBO (small buffer optimization).

An SBO stack_array<T> can be implemented with an allocator and a std::vector unioned with a std::array, with a unique_ptr and a custom deleter, or in a myriad of other ways. You can probably retrofit your existing solution, replacing your new/malloc/free/delete with calls to the above TLS storage.
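
One rough way an SBO variant could look, reusing the illustrative tls_stack arena from the sketch above: keep a small inline buffer inside the object and only spill to the separate stack for larger sizes.

    // Illustrative SBO sketch built on the tls_stack arena above.
    template <class T, std::size_t N = 64>
    class sbo_stack_array {
        alignas(T) unsigned char small_[N * sizeof(T)];  // inline storage, best locality
        T* data_;
        std::size_t size_;
        bool spilled_;
    public:
        explicit sbo_stack_array(std::size_t n) : size_(n), spilled_(n > N) {
            data_ = spilled_ ? static_cast<T*>(tls_stack::push(n * sizeof(T)))
                             : reinterpret_cast<T*>(small_);
            for (std::size_t i = 0; i < size_; ++i) ::new (data_ + i) T;
        }
        ~sbo_stack_array() {
            for (std::size_t i = size_; i-- > 0; ) data_[i].~T();
            if (spilled_) tls_stack::pop(size_ * sizeof(T));
        }
        sbo_stack_array(const sbo_stack_array&) = delete;
        sbo_stack_array& operator=(const sbo_stack_array&) = delete;
        T& operator[](std::size_t i) { return data_[i]; }
        std::size_t size() const { return size_; }
    };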

I say go with TLS, as that removes the need for synchronization while still allowing multi-threaded use, and it mirrors the fact that the stack itself is implicitly thread-local.

Stack-buffer based STL allocator? is a SO Q&A with at least two "stack" allocators in the answers. They will need some adaptation to get their buffer from TLS automatically.

Note that the TLS being one large buffer is, in a sense, an implementation detail. You could do one large allocation and, when you run out of space, do another. You just need to keep track of each "stack page"'s current capacity and a list of stack pages, so that when you empty one you can move back to an earlier one. That lets you be more conservative in your initial TLS allocation without worrying about running out of memory; the important part is that allocations are FILO and rare, not that the entire FILO buffer is one contiguous block.


I think you have already enumerated most options in your question and the comments.

  • Use std::vector. This is the most obvious and most hassle-free solution, but maybe also the slowest.
  • Use platform-specific extensions on those platforms that provide them. For example, GCC supports variable-length arrays in C++ as an extension. alloca, while not actually standardized, is widely supported on POSIX-like systems to allocate memory on the stack. Even Microsoft Windows provides _malloca, as a quick web search told me.

    In order to avoid maintenance nightmares, you'll really want to encapsulate these platform dependencies in an abstract interface that automatically and transparently chooses the appropriate mechanism for the current platform. Implementing this for all platforms will be a bit of work, but if this single feature accounts for the 3× speed difference you're reporting, it might be worth it. As a fallback for unknown platforms, I'd keep std::vector in reserve as a last resort. It is better to run slowly but correctly than to behave erratically or not run at all.

  • Build your own variable-sized array type that implements a “small array” optimization embedded as a buffer inside the object itself as you have shown in your question. I'll just note that I'd rather try using a union of a std::array and a std::vector instead of rolling my own container.

    Once you have a custom type in place, you can do interesting profiling such as maintaining a global hash table of all occurrences of this type (by source-code location) and recording each allocation size during a stress test of your program. You can then dump the hash table at program exit and plot the distributions in allocation sizes for the individual arrays. This might help you to fine-tune the amount of storage to reserve for each array individually on the stack.

  • Use a std::vector with a custom allocator. At program startup, allocate a few megabytes of memory and give it to a simple stack allocator. For a stack allocator, allocation is just comparing and adding two integers and deallocation is simply a subtraction; I doubt that compiler-generated stack allocation can be much faster. (A sketch of such an allocator follows after this list.) Your “array stack” would then grow and shrink in step with your “program stack”. This design would also have the advantage that accidental buffer overruns – while still invoking undefined behavior, trashing random data and all that bad stuff – wouldn't corrupt the program stack (return addresses) as easily as they would with native VLAs.

    Custom allocators in C++ are a somewhat dirty business but some people do report using them successfully. (I don't have much experience with them myself.) You might want to start by looking at cppreference. Alisdair Meredith, one of the people who promote the use of custom allocators, gave a double-session talk at CppCon'14 titled “Making Allocators Work” (part 1, part 2) that you might find interesting as well. If the std::allocator interface is too awkward for you, implementing your own variably sized (as opposed to dynamically sized) array class with your own allocator should be doable as well.
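
To make that last option concrete, below is a minimal C++11-style allocator sketch; the arena is just a fixed thread_local bump buffer, and every name is illustrative rather than taken from an existing library.

    #include <cstddef>
    #include <new>
    #include <vector>

    namespace arena {
        constexpr std::size_t size = 1 << 20;                   // e.g. 1 MB per thread
        thread_local alignas(std::max_align_t) char buf[size];
        thread_local std::size_t used = 0;

        inline std::size_t round_up(std::size_t n) {
            constexpr std::size_t a = alignof(std::max_align_t);
            return (n + a - 1) & ~(a - 1);
        }
    }

    template <class T>
    struct stack_alloc {
        using value_type = T;
        stack_alloc() = default;
        template <class U> stack_alloc(const stack_alloc<U>&) noexcept {}

        T* allocate(std::size_t n) {
            std::size_t bytes = arena::round_up(n * sizeof(T));
            if (arena::used + bytes > arena::size) throw std::bad_alloc();  // or fall back to the heap
            T* p = reinterpret_cast<T*>(arena::buf + arena::used);
            arena::used += bytes;                  // allocation: compare and add
            return p;
        }
        void deallocate(T* p, std::size_t n) noexcept {
            std::size_t bytes = arena::round_up(n * sizeof(T));
            // Only the most recent block can be reclaimed; anything else stays
            // until the arena unwinds past it (fine for nested, FILO lifetimes).
            if (reinterpret_cast<char*>(p) + bytes == arena::buf + arena::used)
                arena::used -= bytes;              // deallocation: a subtraction
        }
    };

    template <class T, class U>
    bool operator==(const stack_alloc<T>&, const stack_alloc<U>&) noexcept { return true; }
    template <class T, class U>
    bool operator!=(const stack_alloc<T>&, const stack_alloc<U>&) noexcept { return false; }

A std::vector<double, stack_alloc<double>> v(n); then behaves much like a VLA as long as object lifetimes nest; note that reallocation (e.g. from push_back) abandons the old block until the arena unwinds past it, so reserve the final size up front where you can.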


Regarding support for MSVC:

MSVC has _alloca, which allocates stack space. It also has _malloca, which allocates stack space if there is enough free stack space and otherwise falls back to dynamic allocation.

You cannot take advantage of the VLA type system, so you would have to change your code to work in terms of a pointer to the first element of such an array.

You may end up needing a macro with different definitions depending on the platform, e.g. one that invokes _alloca or _malloca on MSVC, and on g++ or other compilers either calls alloca (if they support it) or declares a VLA and takes a pointer to it.
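
One way such a macro might look (purely illustrative; adjust the branches to the compilers you actually target, and note that the non-VLA branches only suit trivially constructible/destructible element types):

    #if defined(_MSC_VER)
      #include <malloc.h>
      // _malloca falls back to the heap for large sizes and must be paired with _freea.
      #define VLA(type, name, count) \
          type* name = static_cast<type*>(_malloca((count) * sizeof(type)))
      #define VLA_FREE(name) _freea(name)
    #elif defined(__GNUC__) || defined(__clang__)
      // GCC and Clang accept C99-style VLAs in C++ as an extension.
      #define VLA(type, name, count) \
          type name##_vla[count]; type* name = name##_vla
      #define VLA_FREE(name) ((void)0)
    #else
      #include <cstdlib>
      // Last-resort fallback: plain heap allocation.
      #define VLA(type, name, count) \
          type* name = static_cast<type*>(std::malloc((count) * sizeof(type)))
      #define VLA_FREE(name) std::free(name)
    #endif

    // Usage: 'scratch' is a float* in every configuration.
    // void f(std::size_t n) {
    //     VLA(float, scratch, n);
    //     /* ... use scratch[0] .. scratch[n-1] ... */
    //     VLA_FREE(scratch);
    // }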


Consider investigating ways to rewrite the code without needing to allocate an unknown amount of stack. One option is to allocate a fixed-size buffer that is the maximum you will ever need. (If that would cause a stack overflow, it means your code is buggy anyway.)