Saturday, September 12, 2009

Why Mono 2.4 is slow?

Let me start by saying that I am a big fan of .NET, and Mono. I believe that C# as a language is superior to Java and C++, and if I have the choice, I always work in C#. So I would like to believe that C# is as portable as Java, thanks to Novell and the Mono project.

Recently, in connection with a molecular docking software, I wrote a sort of benchmark program, which takes a large list of points in 3D space, calculates the distances between each pair of points, and finds the largest distance. The core of the program is a double loop which looks like this:

for (int i = 0; i < N3; i++)
{
P = points[i];

for (int j=i+1; j < N; j++)
{
if ( dist(P, Q) > max_dist)
max_dist = dist(P,Q);
}
}

... where points is a System.Collections.Generic.List with 27000 items. I have rewritten this program also in C++, using STL vectors, and compiled with GCC for my Intel Core 2 processor, using --mtune=native

If I compare the running time on Windows between the C++ and the C# versions, they are roughly equal. However, on Linux, if I use Mono 2.4 as the CLR, the GNU/C++ version is much faster, it is around 2.5 sec, while running with Mono takes more than 7 seconds.

So why running with Mono is slow? Using the Mono Profiler, it turns out that each access to the points list entails an 'array bounds check' for the i and j indices:

########################
58070.103 364540502 0.000 System.Collections.Generic.List`1::get_Count()
Callers (with count) that contribute at least for 1%:
364540502 5 % Benchmarks.MainClass::Main(string[])

... even though allegedly array bound checks are eliminated from for loops! Indeed, they are, but here I am using List and not Array. Let's switch to Array, and the running time decreases from 7.3 sec to 5.4. This is still twice as much as the time taken by the C++ version, so let's run mono --profile again:

########################
59853.928 364486500 0.000 Benchmarks.MainClass::dist(Point,Point)
Callers (with count) that contribute at least for 1%:

... we still have 3 billion calls to the dist(Point,Point) function. Why is it not inlined? Checking the GNU Compiler's output for the C++ version, I can see that the dist() is inlined. But for mono, it is probably too big (let me add here that I always use the --optimize=all option for mono, which includes inline.)
Let's inline the dist(...) function manually:

for (int i = 0; i < N3; i++)
{
P = points[i];

for (int j=i+1; j < N; j++)
{
double d = Math.Sqrt( .... );

if ( d > max_dist)
max_dist = d;
}
}

This speeds up the C# / Mono version to 3.9 sec, still far from the 2.5 s of the C++ version.

Why is it that the C# version on Windows / Microsoft CLR is as fast as the C++ version, without any manual optimization (using generic List, no manual inlining...)

Comparing the GCC generated assembly code with the code, generated by mono --aot -O=all, we can see the the GCC-emitted code is using SSE-based arithmetics, while the Mono JIT code is using x87 floating point instructions:

10ae: dd 45 e8 fldl 0xffffffe8(%ebp)
10b1: dd 45 e8 fldl 0xffffffe8(%ebp)
10b4: de c9 fmulp %st,%st(1)
10b6: de c1 faddp %st,%st(1)
10b8: d9 fa fsqrt

Moreover, for the if ( d > max_dist) max_dist = d; statement, the GCC code is using the maxsd instruction, while the Mono-generated code is branching with a jump...

So Miguel de Icasa and his team still has a lot to do... I look forward to Mono 2.6, which perhaps will amend some of these issues.

0 comments: