All about thread-local storage

theamk · on Feb 16, 2021

Interesting!

I had known that thread-local variables can be accessed pretty fast, via a dedicated segment register, but I was not clear on how one can make this work for dynamically loaded PIC code, like most .so files.

Turns out you can't. You only get fast access via the dedicated register if you are using a variable declared in the main program. The .so files have to call a special function which does multiple memory lookups to get the actual location, probably severely reducing performance.

(and this is another case where a seemingly simple operation -- getting a variable's value -- gets internally translated to dozens of operations and a function call)

bregma · on Feb 16, 2021

That's one particular implementation for one language runtime on one common OS. It's neither a requirement of the C language nor of the ELF executable file format.

LargoLasskhyfv · on Feb 16, 2021

What do you think of this, then? https://github.com/ssrg-vt/hermitux/wiki/Fast-system-calls-i...

I just had this in mind because I skimmed it a few hours ago in the context of https://ssrg-vt.github.io/hermitux

mentioned in https://news.ycombinator.com/item?id=26142285

richardwhiuk · on Feb 16, 2021

That's relying on a unikernel and static linking I think.

zmodem · on Feb 16, 2021

https://www.akkadia.org/drepper/tls.pdf is also a great write-up

jstrong · on Feb 16, 2021

I have done plenty with threads but never used thread-local storage except when forced to by some other library using it. To me it seems like a bolted-on monstrosity that provides thread safety to thread-naive code after the fact. Am I missing something? Is this a good solution to some problem I haven't encountered? Are there situations where the performance of TLS is better than some other solution?

CodesInChaos · on Feb 16, 2021

I've used it to store the state of a global RNG. Thread local state is faster than locking a shared RNG and the observable behaviour is the same.

For high quality code you often pass around the RNG instance explicitly (improves testability), but often I just want a random number without bothering with explicit state management.
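
A minimal sketch of that pattern in Rust, assuming a toy xorshift64 generator stands in for a real RNG. The names `RNG_STATE`, `seed_rng` and `next_random` are made up for illustration and do not come from any library. Each thread gets its own state through `thread_local!`, so callers take no lock and pass no RNG handle around:

```rust
use std::cell::Cell;
use std::thread;

thread_local! {
    // Each thread gets its own xorshift64 state; the default seed is arbitrary.
    static RNG_STATE: Cell<u64> = Cell::new(0x2545_F491_4F6C_DD1D);
}

/// Hypothetical helper: seed the calling thread's generator.
fn seed_rng(seed: u64) {
    assert!(seed != 0, "xorshift state must be non-zero");
    RNG_STATE.with(|s| s.set(seed));
}

/// Hypothetical helper: next value from the calling thread's generator.
/// No lock is taken because no other thread can see this state.
fn next_random() -> u64 {
    RNG_STATE.with(|s| {
        let mut x = s.get();
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        s.set(x);
        x
    })
}

fn main() {
    let handles: Vec<_> = (1..=4u64)
        .map(|id| {
            thread::spawn(move || {
                // Seed per thread so the streams differ.
                seed_rng(id);
                (id, next_random())
            })
        })
        .collect();
    for h in handles {
        let (id, value) = h.join().unwrap();
        println!("thread {}: {}", id, value);
    }
}
```

The trade-off is the one mentioned above: the hidden state makes call sites short, but tests that need reproducible output have to remember to seed the calling thread first.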

jstrong · on Feb 16, 2021

that seems pretty legit.

electricshampo1 · on Feb 16, 2021

One use is scalable counting:

Is Parallel Programming Hard, And, If So, What Can You Do About It?

https://cdn.kernel.org/pub/linux/kernel/people/paulmck/perfb...

jstrong · on Feb 16, 2021

I read the section in that book about storing each thread's count in a TLS variable, and I think the root of the confusion is related to how C vs. Rust (the language I use and learned how to write multithreaded code in) deal with threads.

if I wanted to store a per-thread count in rust, it would make zero sense to use TLS for that, it would just be on the stack in the context of the thread's lexical scope:

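    use std::sync::Arc;
    use std::sync::atomic::{AtomicUsize, Ordering};

    // `n_threads` and `some_iteration` are placeholders for the caller's values.
    let mut threads = Vec::new();
    let global_count = Arc::new(AtomicUsize::new(0));
    for _ in 0..n_threads {
        threads.push(std::thread::spawn({
            let global_count = global_count.clone();
            move || {
                // The per-thread count lives on this thread's stack.
                let mut thread_count = 0;
                for _ in some_iteration {
                    thread_count += 1;
                }
                // Publish the local count once, when the thread finishes.
                global_count.fetch_add(thread_count, Ordering::Release);
            }
        }));
    }
    for handle in threads { let _ = handle.join().unwrap(); }
    println!("global count = {}", global_count.load(Ordering::Acquire));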

that is a form of "thread-local storage", I guess, but does not involve any of the TLS primitives.
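
For comparison, here is a rough sketch of the same count done with Rust's `thread_local!` macro. The names `LOCAL_COUNT`, `count_event`, `flush_into` and `run` are invented for this example, and it simplifies the book's scheme: each thread folds its count into a shared total once at exit, rather than a reader summing the per-thread counters. The practical difference from the stack version is that `count_event` can be called from anywhere in the call graph without threading a counter through every function signature:

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

thread_local! {
    // One counter per thread; incrementing it never touches shared memory.
    static LOCAL_COUNT: Cell<usize> = Cell::new(0);
}

/// Hypothetical helper: bump the calling thread's private counter.
fn count_event() {
    LOCAL_COUNT.with(|c| c.set(c.get() + 1));
}

/// Hypothetical helper: move the calling thread's count into the shared total.
fn flush_into(total: &AtomicUsize) {
    let n = LOCAL_COUNT.with(|c| c.replace(0));
    // Relaxed is enough here because `join` below orders the final read.
    total.fetch_add(n, Ordering::Relaxed);
}

fn run(n_threads: usize, events_per_thread: usize) -> usize {
    let total = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..events_per_thread {
                    count_event();
                }
                flush_into(&total);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    total.load(Ordering::Relaxed)
}

fn main() {
    println!("total = {}", run(4, 1000)); // prints "total = 4000"
}
```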