top button
Flag Notify
    Connect to us
      Site Registration

Site Registration

How would you measure the time spent in a context switch?

+3 votes
338 views
How would you measure the time spent in a context switch?
posted Oct 11, 2014 by Kali Mishra

Share this question
Facebook Share Button Twitter Share Button LinkedIn Share Button

1 Answer

0 votes

You can't easily differentiate the waste due to thread-switching and that due to memory cache contention. You CAN measure the thread contention.. Namely, on linux, you can cat /proc/PID/XXX and get tons of detailed per-thread statistics. HOWEVER, since the pre-emptive scheduler is not going to shoot itself in the foot, you're not going to get more than say 30 ctx switches per second no matter how many threads you use.. And that time is going to be relatively small v.s. the amount of work you're doing.. The real cost of context-switching is the cache pollution. e.g. there is a high probability that you'll have mostly cache misses once you're context-switched back in. Thus OS time and context-switch-counts are of minimal value.

What's REALLY valuable is the ratio of inter-thread cache-line dirties. Depending on the CPU, a cache-line dirty followed by a peer-CPU read is SLOWER than a cache-miss - because you have to force the peer CPU to write it's value to main-mem before you can even start reading.. Some CPUs let you pull from peer cache-lines without hitting main-mem.

So the key is the absolutely minimize ANY shared modified memory structures.. Make everything as read-only as possible.. This INCLUDES share FIFO buffers (including Executor pools).. Namely if you used a synchronized queue - then every sync-op is a shared dirty memory region. And more-over, if the rate is high enough, it'll likely trigger an OS trap to stall, waiting for peer thread's mutex's.

The ideal is to segment RAM, distribute to a fixed number of workers a single large unit of work, then use a count-down-latch or some other memory barrier (such that each thread would only touch it once). Ideally any temporary buffers are pre-allocated instead of going into and out of a shared memory pool (which then causes cache contention). Java 'synchronized' blocks leverage (behind the scenes) a shared hash-table memory space and thus trigger the undesirable dirty-reads, I haven't determined if java 5 Lock objects avoid this, but you're still leveraging OS stalls which won't help in your throughput. Obviously most OutputStream operations trigger such synchronized calls (and of course are typically filling a common stream buffer).

Generally my experience is that single-threading is faster than mulithreading for a common byte-array/object-array, etc. At least with simplistic sorting/filtering algorithms that I've experimented with. This is true both in Java and C in my experience. I haven't tried FPU intesive ops (like divides, sqrt), where cache-lines may be less of a factor.

Basically if you're a single CPU you don't have cache-line problems (unless the OS is always flushing the cache even in shared threads), but multithreading buys you less than nothing. In hyperthreading, it's the same deal. In single-CPU shared L2/L3 cache configurations (e.g. AMDs), you might find some benefit. In multi CPU Intel BUS's, forget it - shared write-memory is worse than single-threading

answer Jun 6, 2017 by Manikandan J
Similar Questions
+1 vote

Which is the best page replacement algorithm and Why? How much time is spent usually in each phases and why?

+1 vote

Can anyone suggest how you can measure space and time complexities for any algorithms in C/C++?

+1 vote

while going through various synchronization mechanism like semaphore, mutex. I stopped at Monitor. I want to know how it is differ from semaphore and mutex. when use of monitor is preferable than semaphore and mutex ?

...