Monday, March 23, 2009

Concurrency

I'm depressed.

Over the last couple of weeks, I've been playing with various Smalltalk implementations. I always had this idea that some of them could use multiple CPU cores. As it turns out, none of them do it, and the ones that do concurrency don't do it well.

VisualWorks
is single-threaded. Squeak is single-threaded. Smalltalk/X didn't fully use two CPU cores. Smalltalk/MT should, but didn't. Huemul 0.7 tried... but crashed.

[update: Smalltalk/MT does use multiple cores as Peter Lount points out in the comments, but is severely limitied by the garbage collector.]

Igor's Hydra VM does concurrency the hard way: by making a completely separate VM in a separate thread but in the same process (using OS terminology here): VMs can communicate with each other, but each is single-threaded. Gemstone, I hear, can make multiple Gems which share a transacted memory, but each Gem is single threaded.

Surely it can't be that hard to make a VM that does fine-grained concurrency?

Smalltalk, the language, is ideally suited to concurrency. The language lends itself to creating concurrent abstractions and already has basic support, but there's no implementation which can take advantage of the CPU power.

Smalltalk VMs are not going to be running any faster if we don't start exploring concurrency. The free ride in MHz increases is about to end soon (in theory, anyway). Moreso, the best Smalltalk VM can only use quarter of the power of a quad-core CPU, and this unused potential will increase exponentially as the number of cores increases exponentially.

What am I going to do about it? Well, I'm going to be modifying the Process scheduler in Squeak to simulate multiple CPU cores. Individual Processes will have their speed throttled down so that using multiple forked processes will be required to get a speed increase. Then I will, eventually, start writing concurrent frameworks to take advantage of the "multiple cores". This way, if some smart alec makes me a nice concurrent VM, they have concurrent code to take advantage of it.

7 comments:

Peter William Lount said...

>Smalltalk/MT should, but didn't.

What was your test? Please post your code test.

Peter William Lount said...

Run this example after starting the Windows Resource Monitoring Tools and finding the Smalltalk process in the list.

10 timesRepeat: [
  [
    100000 timesRepeat: [170 factorial].
  ] fork
]


This rather simplistic threading example demonstrates that Smalltalk/MT does in fact start up windows Native threads and use more than 25% of a quad core processor - meaning more than one core. The proof in part, the Windows Thread count for the process goes up by ten when running the above code, and then drops when the threads complete, and a CPU usage of 34% of the FOUR cores, meaning 100% on one core and 9% on another (or some combination thereof).

Of course, looking at the source code always helps and there we see that the forking calls WINAPI functions to do the deed.

What specific limitations have you found in Smalltalk/MT's implementation of Native Threading?

I've been told by the Smalltalk/MT guys that there is a limit to the number of threads, which is the limit set in the Windows API and OS Version (if I'm not mistaken). Currently this is more than four as the above example shows, but less then 60 (can't remember the exact limit at the moment).

Peter William Lount said...

I do agree with you that overall the Smalltalk versions really need multi-core capabilities. It's quite shocking how lacking this is.

In pressing this issue in the Smalltalk community no one seems to want to do the very hard work.

Although it may be that vendors are working on it without revealing the full scope of their plans.

Most efforts that I've heard about are the lame-o attempts at using one image per thread and one or more threads per core. Not quite the Smalltalk ideals that we'd hope for I think most would agree.

They claim they want it simpler. Ok, make it simpler but let's also make it powerful for those that need the true power of multi-core that other languages have! Yes, and some of the programs too.

As I've pointed out it's not possible to completely avoid all the multi-threading issues by using the one-image per native thread unless you also make it only one-smalltalk-green-thread per image-native-thread.

Furthermore, the required fracturing and segregation of the world of objects into multiple images presents just as many thorny issues except in the simplest of cases, so you're back to where you started. Stuck.

I've said it before and I'll say it here again, Smalltalk needs FULL multi-core capabilities. In fact this is so important that it needs to be in any new Smalltalk Standard!

Peter William Lount said...

Oh, the quad core that I ran the tests on had some other competing processes. After killing them I put the simplistic example to exercise with longer iterations of a million loops and 100+ threads forked doing their thang. With this I've seen up to 44% cpu utilization by Smalltalk/MT.

While I've seen higher cpu utilization with this example it tends to hover just above an average of ~26.25% which leads one to the conclusion that maybe something wonky is going on.... hmmm...

I wonder if Task/Thread Processor Affinity is getting in the way?

"One of the primary parallelism blockers is CPU affinity. If I have a task that is affined to one CPU, then if that CPU is busy but another is available, I have no means of executing the task. One of the classical cases of CPU affinity was the NDIS interrupt affinity, binding the network card to a single CPU which can process its incoming packets. If that CPU was particularly busy processing other interrupts, network receive-side processing was adversely affected. Furthermore, if multiple CPUs were available to perform receive-side processing, only one of them would do the actual work." - Parallelism and CPU Affinity

"Thread affinity and CPU affinity, by the way, are not the same. CPU affinity means that a specific thread is bound to a specific CPU. Thread affinity means that a set of tasks is bound to a specific thread."

Oh, I gather I started too many threads as it's got a debugger now.

ps. I'm running Vista 64 bit with the latest 32 bit Smalltalk/MT version.

T.Kerroum said...

The problem is that you're introducing garbage collection in the test.
Try:
10 timesRepeat: [
[
1000000000 timesRepeat: ['abc' size].
] fork
]
It will give you 100% CPU.

With factorial, everything greater than a SmallInteger (2^31) creates garbage so you mostly end up testing GC.

GC has to suspend all managed threads during the mark phase (it is also possible to have unmanaged threads running at the same time, the condition is that they don't allocate from the default object pool).

This is why, in the factorial example, CPU shortly goes to 100% then drops and hovers around 50+x%, as most work is being done in GC not in the actual code.

In a real world application, you usually have less GC and more computations, though the bottleneck with GC will remain as long as objects are shared betwen threads. Another design would be to have separated object spaces, but this places a burden on the application, which then has to know about object boundaries and deal with shared objects explicitly. Also, separated object spaces waste memory and are less scalable.

The number of threads in MT has been limited to 62 so far (Windows cannot wait for more than 64 objects). The next version has a much higher limit (512), and the 64-bit version we're working on is unlimited. We are also looking at ways to speed up GC by using multiple threads within the GC, but since GC is memory-bound, I don't expect huge savings.

Peter William Lount said...

Tarik Kerroum,

Thanks very much for the correction to the example and the information on Garbage Collection and the number of threads limit.

I've run the new example test and can confirm the results of 100% utilization of the four cores on the quad core cpu! Yes! (A few percent, ~6%, were other processes in other applications and the system that run as usual).

This is an important demonstration of how at least one Smalltalk, Smalltalk/MT, does in fact make FULL use of multiple cpu cores today! So to all those naysayers who say it can't be done or that it's too hard to do or that we need to wimp out and do a crippled version of multi-threading be advised that it has been done and that it works well!

Sure, it's not perfect - you have to watch out for your garbage - but the next step towards perfection is on it's way with the 64 bit version of Smalltalk/MT.

Clearly one must consider the garbage collection usage in any mutli-core application, whether it's in Smalltalk or not. I saw a paper presented at OOPSLA on the complexities of multi-core garbage collection and it's not pretty for anyone to solve. Many millions of dollars has been spent on researching multi-core garbage collection.

What's cool about Smalltalk/MT is that it does work modestly well now. With the new information about issues with GC understood it's clear that there are ways to work around this in actual practical real world applications.

I purposefully used a simple example to demonstrate multi-core running on four cores and in the process unintentionally uncovered issues with GC that can be worked around. The purpose of the simple example is to keep it simple and provide a benchmark for other Smalltalks to live up to!

Now it's time for other Smalltalk vendors to step up to the multi-core plate!

10 timesRepeat: [
  [
    1000000000 timesRepeat: [
       'abc' size
    ]
  ] fork
]


This example demonstrates conclusively that Smalltalk/MT uses all the available multiple cores and thus is the current king of the hill in Smalltalk land when it comes to mutli-core usage! And Smalltalk/MT works well today! It has worked for years in fact! Download it, give it a try and buy it! Support your Smalltalk vendor.

Still depressed gulik?

How do other smalltalk systems compare?

All the best,

Peter

gulik said...

Peter: well, yes, I'm still pretty depressed, but I am feeling a bit better.

You can get 100% CPU on two cores using Smalltalk/MT but only if you don't make any garbage. I replaced the "'abc' size" with "'abc' copy" and CPU utilisation dropped down to 50%. It seems that the garbage collector interferes with concurrency somewhat significantly.

The problems I had with Smalltalk/MT are:
1. Semaphores were odd. They didn't implement >>signal (instead, they had >>release) and didn't behave like I expected them too.
2. If you did Semaphore>>wait, the UI froze up and I couldn't work out how to recover it.
3. The tools were a bit unfamiliar. Eventually I'd work them out, but it slowed me down.
4. Windows only, and very specifically so. Even the WINAPI calls were exposed to Smalltalk, which is good for Windows hackers, but bad for cross-platform code.

Being limited to a mere 62 threads seems a bit sucky.

But then again, the other Smalltalks had issues too.

The code I used was similar to yours: fork a busy block a few times. I've discarded the code since; sorry!