Analysing .NET start-up time with Flamegraphs
Recently I gave a talk at the NYAN Conference called ‘From ‘dotnet run’ to ‘hello world’:
In the talk I demonstrate how you can use PerfView to analyse where the .NET Runtime is spending its time during start-up:
From 'dotnet run' to 'hello world' from
Matt Warren
This post is a step-by-step guide to that demo.
Code Sample
For this exercise I deliberately only look at what the .NET Runtime is doing during program start-up, so I ensure the minimum amount of user code is running; hence the following ‘Hello World’:
using System;

namespace HelloWorld
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
            Console.WriteLine("Press <ENTER> to exit");
            Console.ReadLine();
        }
    }
}
The Console.ReadLine() call is added because I want to ensure the process doesn’t exit whilst PerfView is still collecting data.
Data Collection
PerfView is a very powerful program, but not the most user-friendly of tools, so I’ve put together a step-by-step guide:
- Download and run a recent version of ‘PerfView.exe’
- Click ‘Run a command’ (Alt+R) to “collect data while the command is running”
- Ensure that you’ve entered values for:
- “Command”
- “Current Dir”
- Tick ‘Cpu Samples’ if it isn’t already selected
- Set ‘Max Collect Sec’ to 15 seconds (because our ‘HelloWorld’ app never exits, we need to ensure PerfView stops collecting data at some point)
- Ensure that ‘.NET Symbol Collection’ is selected
- Hit ‘Run Command’
If you then inspect the log you can see that it’s collecting data, obtaining symbols and then finally writing everything out to a .zip file. Once the process is complete you should see the newly created file in the left-hand pane of the main UI, in this case it’s called ‘PerfViewData.etl.zip’
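If you'd rather script the collection than click through the GUI, PerfView also has a command-line mode; the sketch below should be roughly equivalent to the steps above, but the exact qualifier spellings are from memory of PerfView’s built-in help, so double-check them with `PerfView.exe /?` before relying on them:

```shell
REM Collect machine-wide CPU samples while HelloWorld runs,
REM stopping after 15 seconds (matching the 'Max Collect Sec' GUI setting)
PerfView.exe /AcceptEULA /MaxCollectSec:15 run HelloWorld.exe
```

As in the GUI, the result is a ‘PerfViewData.etl.zip’ file written to the current directory.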
Data Processing
Once you have your ‘.etl.zip’ file, double-click on it and you will see a tree-view with all the available data. Now, select ‘CPU Stacks’ and you’ll be presented with a view like this:
Notice there’s a lot of ‘?’ characters in the list; this means that PerfView is not able to work out the method names, as it hasn’t resolved the necessary symbols for the Runtime dlls. Let’s fix that:
- Open ‘CPU Stacks’
- In the list, select the ‘HelloWorld’ process (PerfView collects data machine-wide)
- In the ‘GroupPats’ drop-down, select ‘[no grouping]’
- Optional, change the ‘Symbol Path’ from the default to something else
- In the ‘By name’ tab, hit ‘Ctrl+A’ to select all the rows
- Right-click and select ‘Lookup Symbols’ (or just hit ‘Alt+S’)
Now the ‘CPU Stacks’ view should look something like this:
Finally, we can get the data we want:
- Select the ‘Flame Graph’ tab
- Change ‘GroupPats’ to one of the following for a better flame graph:
- [group module entries] {%}!=>module $1
- [group class entries] {%!*}.%(=>class $1;{%!*}::=>class $1
- Change ‘Fold%’ to a higher number, maybe 3%, to get rid of any thin bars (any higher and you start to lose information)
Now, at this point I actually recommend exporting the PerfView data into a format that can be loaded into https://speedscope.app/ as it gives you a much better experience. To do this click File -> Save View As and then in the ‘Save as type’ box select Speed Scope Format. Once that’s done you can ‘browse’ that file at speedscope.app, or if you want you can just take a look at one I’ve already created.
Note: If you’ve never encountered ‘flamegraphs’ before, I really recommend reading this excellent explanation by Julia Evans:
perf & flamegraphs pic.twitter.com/duzWs2hoLT — 🔎Julia Evans🔍 (@b0rk), December 26, 2017
Analysis of .NET Runtime Start-up
Finally, we can answer our original question:
Where does the .NET Runtime spend time during start-up?
Here’s the data from the flamegraph summarised as text, with links to the corresponding functions in the ‘.NET Core Runtime’ source code:
- Entire Application - 100% - 233.28ms
  - Everything except helloworld!wmain - 21%
  - helloworld!wmain - 79% - 184.57ms
    - hostpolicy!create_hostpolicy_context - 30% - 70.92ms here
    - hostpolicy!create_coreclr - 22% - 50.51ms here
      - coreclr!CorHost2::Start - 9% - 20.98ms here
      - coreclr!CorHost2::CreateAppDomain - 10% - 23.52ms here
    - hostpolicy!runapp - 20% - 46.20ms here, ends up calling into Assembly::ExecuteMainMethod here
      - coreclr!RunMain - 9.9% - 23.12ms here
      - coreclr!RunStartupHooks - 8.1% - 19.00ms here
    - hostfxr!resolve_frameworks_for_app - 3.4% - 7.89ms here
So, the main places that the runtime spends time are:
- 30% of the total time is spent launching the runtime, controlled via the ‘host policy’; this mostly takes place in hostpolicy!create_hostpolicy_context
- 22% of the time is spent on initialisation of the runtime itself and the initial (and only) AppDomain it creates; this can be seen in CorHost2::Start (native) and CorHost2::CreateAppDomain (managed). For more info on this see The 68 things the CLR does before executing a single line of your code
- 20% was used JITting and executing the Main method in our ‘Hello World’ code sample; this started in Assembly::ExecuteMainMethod above.
To confirm the last point, we can return to PerfView and take a look at the ‘JIT Stats Summary’ it produces. From the main menu, under ‘Advanced Group’ -> ‘JIT Stats’, we see that 23.1 ms or 9.1% of the total CPU time was spent JITting:
Tue, 3 Mar 2020, 12:00 am
Under the hood of "Default Interface Methods"
Background
‘Default Interface Methods’ (DIM), sometimes referred to as ‘Default Implementations in Interfaces’, appeared in C# 8. In case you’ve never heard of the feature, here are some links to get you started:
Also, there are quite a few other blog posts discussing this feature, but as you can see opinion is split on whether it’s useful or not:
But this post isn’t about what they are, how you can use them or if they’re useful or not. Instead we will be exploring how ‘Default Interface Methods’ work under-the-hood, looking at what the .NET Core Runtime has to do to make them work and how the feature was developed.
Table of Contents
Development Timeline and PRs
First of all, there are a few places you can go to get a ‘high-level’ understanding of what was done:
Initial work, Prototype and Timeline
Interesting PRs done after the prototype (newest -> oldest)
Once the prototype was merged in, there was additional feature work done to ensure that DIMs worked across different scenarios:
Bug fixes done since the prototype (newest -> oldest)
In addition, various bug fixes were done to ensure that existing parts of the CLR played nicely with DIMs:
Possible future work
Finally, there’s no guarantee if or when this will be done, but here are the remaining issues associated with the project:
Default Interface Methods ‘in action’
Now that we’ve seen what was done, let’s look at what that all means, starting with this code that simply demonstrates ‘Default Interface Methods’ in action:
using static System.Console;

interface INormal {
    void Normal();
}

interface IDefaultMethod {
    void Default() => WriteLine("IDefaultMethod.Default");
}

class CNormal : INormal {
    public void Normal() => WriteLine("CNormal.Normal");
}

class CDefault : IDefaultMethod {
    // Nothing to do here!
}

class CDefaultOwnImpl : IDefaultMethod {
    void IDefaultMethod.Default() => WriteLine("CDefaultOwnImpl.IDefaultMethod.Default");
}

// Test out the Normal/DefaultMethod Interfaces
INormal iNormal = new CNormal();
iNormal.Normal(); // prints "CNormal.Normal"
IDefaultMethod iDefault = new CDefault();
iDefault.Default(); // prints "IDefaultMethod.Default"
IDefaultMethod iDefaultOwnImpl = new CDefaultOwnImpl();
iDefaultOwnImpl.Default(); // prints "CDefaultOwnImpl.IDefaultMethod.Default"
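One detail worth calling out from the sample above: a default implementation is a member of the interface, not of the implementing class, so (unless the class supplies its own implementation) it can only be invoked through an interface-typed reference. A minimal sketch, reusing the names from the sample above:

```csharp
using System;

IDefaultMethod viaInterface = new CDefault();
viaInterface.Default();  // OK - dispatched via the interface, prints "IDefaultMethod.Default"

var viaClass = new CDefault();
// viaClass.Default();   // does NOT compile: 'CDefault' contains no definition for
//                       // 'Default' - the default body belongs to the interface

interface IDefaultMethod {
    void Default() => Console.WriteLine("IDefaultMethod.Default");
}

class CDefault : IDefaultMethod { }
```

This is also why the interface dispatch machinery discussed later is involved at all: every call to a default method is an interface call.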
The first way we can understand how they are implemented is by using Type.GetInterfaceMap(Type) (which actually had to be fixed to work with DIMs); this can be done with code like this:
private static void ShowInterfaceMapping(Type @implementation, Type @interface) {
    InterfaceMapping map = @implementation.GetInterfaceMap(@interface);
    Console.WriteLine($"{map.TargetType}: GetInterfaceMap({map.InterfaceType})");
    for (int counter = 0; counter < map.InterfaceMethods.Length; counter++) {
        MethodInfo im = map.InterfaceMethods[counter];
        MethodInfo tm = map.TargetMethods[counter];
        Console.WriteLine($"  {im} --> {tm} ({(im == tm ? "same" : "different")})");
        Console.WriteLine($"    MethodHandle 0x{im.MethodHandle.Value.ToInt64():X} --> MethodHandle 0x{tm.MethodHandle.Value.ToInt64():X}");
        Console.WriteLine($"    FunctionPtr 0x{im.MethodHandle.GetFunctionPointer().ToInt64():X} --> FunctionPtr 0x{tm.MethodHandle.GetFunctionPointer().ToInt64():X}");
    }
}
Running this for each class produces output like the following:
TestApp.CNormal: GetInterfaceMap(TestApp.INormal)
TestApp.INormal::Normal --> TestApp.CNormal::Normal (different)
MethodHandle 0x7FF993916A80 --> MethodHandle 0x7FF993916B10
FunctionPtr 0x7FF99385FC50 --> FunctionPtr 0x7FF993861880
TestApp.CDefault: GetInterfaceMap(TestApp.IDefaultMethod)
TestApp.IDefaultMethod::Default --> TestApp.IDefaultMethod::Default (same)
MethodHandle 0x7FF993916BD8 --> MethodHandle 0x7FF993916BD8
FunctionPtr 0x7FF99385FC78 --> FunctionPtr 0x7FF99385FC78
TestApp.CDefaultOwnImpl: GetInterfaceMap(TestApp.IDefaultMethod)
TestApp.IDefaultMethod::Default --> TestApp.CDefaultOwnImpl::TestApp.IDefaultMethod.Default (different)
MethodHandle 0x7FF993916BD8 --> MethodHandle 0x7FF993916D10
FunctionPtr 0x7FF99385FC78 --> FunctionPtr 0x7FF9938663A0
So here we can see that for the IDefaultMethod interface on the CDefault class, the interface method and the method implementation are the same. In the other scenarios the interface method maps to a different method implementation.
But let’s look a bit lower, making use of WinDbg and the SOS extension to get a peek into the internal ‘data structures’ that the runtime uses. First, let’s take a look at the MethodTable (dumpmt) for the INormal interface:
> dumpmt -md 00007ff8bcc31dd8
EEClass: 00007FF8BCC2C420
Module: 00007FF8BCC0F788
Name: TestApp.INormal
mdToken: 0000000002000002
File: C:\DefaultInterfaceMethods\TestApp\bin\Debug\netcoreapp3.0\TestApp.dll
BaseSize: 0x0
ComponentSize: 0x0
Slots in VTable: 1
Number of IFaces in IFaceMap: 0
--------------------------------------
MethodDesc Table
Entry MethodDesc JIT Name
00007FF8BCB70580 00007FF8BCC31DC8 NONE TestApp.INormal.Normal()
So we can see that the interface has an entry for the Normal() method, as expected, but let’s look in more detail at the MethodDesc (dumpmd):
> dumpmd 00007FF8BCC31DC8
Method Name: TestApp.INormal.Normal()
Class: 00007ff8bcc2c420
MethodTable: 00007ff8bcc31dd8
mdToken: 0000000006000001
Module: 00007ff8bcc0f788
IsJitted: no
Current CodeAddr: ffffffffffffffff
Version History:
ILCodeVersion: 0000000000000000
ReJIT ID: 0
IL Addr: 0000000000000000
CodeAddr: 0000000000000000 (MinOptJitted)
NativeCodeVersion: 0000000000000000
So whilst the method exists in the interface definition, it’s clear that the method has not been jitted (IsJitted: no) and in fact it never will be, as it can never be executed.
Now let’s compare that output with the one for the IDefaultMethod interface, again looking at the MethodTable (dumpmt) and the MethodDesc (dumpmd):
> dumpmt -md 00007ff8bcc31e68
EEClass: 00007FF8BCC2C498
Module: 00007FF8BCC0F788
Name: TestApp.IDefaultMethod
mdToken: 0000000002000003
File: C:\DefaultInterfaceMethods\TestApp\bin\Debug\netcoreapp3.0\TestApp.dll
BaseSize: 0x0
ComponentSize: 0x0
Slots in VTable: 1
Number of IFaces in IFaceMap: 0
--------------------------------------
MethodDesc Table
Entry MethodDesc JIT Name
00007FF8BCB70590 00007FF8BCC31E58 JIT TestApp.IDefaultMethod.Default()
> dumpmd 00007FF8BCC31E58
Method Name: TestApp.IDefaultMethod.Default()
Class: 00007ff8bcc2c498
MethodTable: 00007ff8bcc31e68
mdToken: 0000000006000002
Module: 00007ff8bcc0f788
IsJitted: yes
Current CodeAddr: 00007ff8bcb765c0
Version History:
ILCodeVersion: 0000000000000000
ReJIT ID: 0
IL Addr: 0000000000000000
CodeAddr: 00007ff8bcb765c0 (MinOptJitted)
NativeCodeVersion: 0000000000000000
Here we see something very different: the MethodDesc entry in the MethodTable actually has jitted, executable code associated with it.
Enabling Methods on an Interface
So we’ve seen that ‘default interface methods’ are wired up by the runtime, but how does that happen?
Firstly, it’s very illuminating to look at the initial prototype of the feature in CoreCLR PR #10505, because we can understand at the lowest level what the feature is actually enabling, from /src/vm/classcompat.cpp:
Here we see why DIM didn’t require any changes to the .NET ‘Intermediate Language’ (IL) op-codes, instead they are enabled by relaxing a previous restriction. Before this change, you weren’t able to add ‘virtual, non-abstract’ or ‘non-virtual’ methods to an interface:
- “Virtual Non-Abstract Interface Method.” (
BFA_VIRTUAL_NONAB_INT_METHOD
)
- “Nonvirtual Instance Interface Method.” (
BFA_NONVIRT_INST_INT_METHOD
)
This ties in with the proposed changes to the ECMA-335 specification, from the ‘Default interface methods’ design doc:
The major changes are:
- Interfaces are now allowed to have instance methods (both virtual and non-virtual). Previously we only allowed abstract virtual methods.
- Interfaces obviously still can’t have instance fields.
- Interface methods are allowed to MethodImpl other interface methods the interface requires (but we require the MethodImpls to be final to keep things simple) - i.e. an interface is allowed to provide (or override) an implementation of another interface’s method
However, just allowing ‘virtual, non-abstract’ or ‘non-virtual’ methods to exist on an interface is only the start; the runtime then needs to allow code to call those methods, and that is far harder!
Resolving the Method Dispatch
In .NET, ever since version 2.0, all interface method calls have taken place via a mechanism known as Virtual Stub Dispatch:
Virtual stub dispatching (VSD) is the technique of using stubs for virtual method invocations instead of the traditional virtual method table. In the past, interface dispatch required that interfaces had process-unique identifiers, and that every loaded interface was added to a global interface virtual table map. This requirement meant that all interfaces and all classes that implemented interfaces had to be restored at runtime in NGEN scenarios, causing significant startup working set increases. The motivation for stub dispatching was to eliminate much of the related working set, as well as distribute the remaining work throughout the lifetime of the process.
Although it is possible for VSD to dispatch both virtual instance and interface method calls, it is currently used only for interface dispatch.
For more information I recommend reading the section on C#’s slotmaps in the excellent article on ‘Interface Dispatch’ by Lukas Atkinson.
So, to make DIM work, the runtime has to wire up any ‘default methods’ so that they integrate with the ‘virtual stub dispatch’ mechanism. We can see this in action by looking at the call stack, from the hand-crafted assembly stub (ResolveWorkerAsmStub) all the way down to FindDefaultInterfaceImplementation(..), which finds the correct default method to call, given an interface method (pInterfaceMD) and an interface type (pInterfaceMT):
- coreclr.dll!MethodTable::FindDefaultInterfaceImplementation(MethodDesc *pInterfaceMD, MethodTable *pInterfaceMT, MethodDesc **ppDefaultMethod, int allowVariance, int throwOnConflict) Line 6985 C++
- coreclr.dll!MethodTable::FindDispatchImpl(unsigned int typeID, unsigned int slotNumber, DispatchSlot *pImplSlot, int throwOnConflict) Line 6851 C++
- coreclr.dll!MethodTable::FindDispatchSlot(unsigned int typeID, unsigned int slotNumber, int throwOnConflict) Line 7251 C++
- coreclr.dll!VirtualCallStubManager::Resolver(MethodTable *pMT, DispatchToken token, OBJECTREF *protectedObj, unsigned __int64 *ppTarget, int throwOnConflict) Line 2208 C++
- coreclr.dll!VirtualCallStubManager::ResolveWorker(StubCallSite *pCallSite, OBJECTREF *protectedObj, DispatchToken token, VirtualCallStubManager::StubKind stubKind) Line 1874 C++
- coreclr.dll!VSD_ResolveWorker(TransitionBlock *pTransitionBlock, unsigned __int64 siteAddrForRegisterIndirect, unsigned __int64 token, unsigned __int64 flags) Line 1683 C++
- coreclr.dll!ResolveWorkerAsmStub() Line 42 Unknown
If you want to explore the call-stack in more detail, you can follow the links below:
- ResolveWorkerAsmStub here
- VSD_ResolveWorker(..) here
- VirtualCallStubManager::ResolveWorker(..) here
- VirtualCallStubManager::Resolver(..) here
- MethodTable::FindDispatchSlot(..) here
- MethodTable::FindDispatchImpl(..) here or here
- Finally ending up in MethodTable::FindDefaultInterfaceImplementation(..) here
Analysis of FindDefaultInterfaceImplementation(..)
So the code in FindDefaultInterfaceImplementation(..)
is at the heart of the feature, but what does it need to do and how does it do it? This list from Finalize override lookup algorithm #12753 gives us some idea of the complexity:
- properly detect diamond shape positive case (where I4 overrides both I2/I3 which both overrides I1) by keep tracking of a current list of best candidates. I went for the simplest algorithm and didn’t build any complex graph / DFS since the majority case the list of interfaces would be small, and interface dispatch cache would ensure majority of cases we don’t need to redo the (slow) dispatch. If needed we can revisit this to make it a proper topological sort.
- VerifyVirtualMethodsImplemented now properly validates default interface scenarios - it is happy if there is at least one implementation and early returns. It doesn’t worry about conflicting overrides, for performance reasons.
- NotSupportedException thrown in conflicting override scenario now has a proper error message
- properly supports GVM when detecting method impl overrides
- Revisited code that adds method impl for interfaces. added proper methodimpl validation and ensure methodimpl are virtual and final (and throw exception if it is not final)
- Added test scenario with method that has multiple method impl. found and fixed a bug where the slot array is not big enough when building method impls for interfaces.
In addition, the ‘two-pass’ algorithm was implemented in Implement two pass algorithm for variant interface dispatch #21355, which contains an interesting discussion of the edge-cases that need to be handled.
So onto the code, this is the high-level view of the algorithm:
- The process actually starts in MethodTable::FindDispatchImpl(..) here, where FindDefaultInterfaceImplementation can be called twice:
  - first, to try and find an ‘exact match’ (allowVariance=false)
  - then, if that fails, it’s called again to try and find a ‘variant match’ (allowVariance=true)
- The entire FindDefaultInterfaceImplementation method is here; it’s fairly straight-forward and relatively easy to understand, plus there’s only ~270 LOC and they’re all very well commented. The high-level algorithm is the following:
- Walk the interfaces from the derived class up to the parent class here; this is a straight-forward implementation that may be revisited if it doesn’t scale well
- Then scan through each class looking for a match:
- an ‘exact match’
- a ‘generic variance match’, i.e. the interfaces match via ‘casting’, but ultimately have the same
TypeDef
- a ‘more specific interface’ that matches, this match is made more complicated by the fact that ‘generic instantiations’ are involved
- a ‘more specific interface’ matches, but without generics involved, so much simpler to calculate
- If the previous step produced a match, double-check that it is the most specific interface match seen so far, by keeping a ‘candidates list’ and classifying each scenario as:
- a ‘tie’ which is ignored, i.e. a ‘variant match’ on the same type
- a ‘more specific’ match, which is used to update the ‘candidates list’
- a ‘less-specific’ match, so no need to carry on with this candidate
- Finally, a scan is done to see if there are any conflicts here; a conflict is acceptable when allowVariance=true, but otherwise an exception is thrown
- That’s it, the ‘best-candidate’ is then returned to the caller (assuming there is one)
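The ‘more specific’ relation used to rank candidates can be illustrated with plain reflection: interface B is more specific than interface A when B itself requires A. This is only a sketch of the shape of the check (the runtime works on MethodTables, not System.Type, and IPlain/ISpecific are made-up names):

```csharp
using System;

// candidate is 'more specific' than current if it extends/implements it
static bool IsMoreSpecific(Type candidate, Type current) =>
    candidate != current && current.IsAssignableFrom(candidate);

Console.WriteLine(IsMoreSpecific(typeof(ISpecific), typeof(IPlain)));  // True  - ISpecific extends IPlain
Console.WriteLine(IsMoreSpecific(typeof(IPlain), typeof(ISpecific))); // False - IPlain is less specific

interface IPlain { }
interface ISpecific : IPlain { }
```

A candidate that is more specific than the current best replaces it; one that is less specific is discarded; two unrelated candidates are the conflict case handled in the final scan.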
Diamond Inheritance Problem
Finally, the ‘diamond inheritance problem’ was mentioned in a few of the PRs/Issues related to the feature, but what is it?
A good place to start is one of the test cases, diamondshape.cs. However, there’s a more concise example in the C# 8 Language Proposal:
interface IA
{
    void M();
}
interface IB : IA
{
    override void M() { WriteLine("IB"); }
}
class Base : IA
{
    void IA.M() { WriteLine("Base"); }
}
class Derived : Base, IB // allowed?
{
    static void Main()
    {
        IA a = new Derived();
        a.M(); // what does it do?
    }
}
So the issue is: which of the matching interface methods should be used, in this case IB.M() or Base.IA.M()? The resolution, as outlined in the C# 8 language proposal, was to use the most specific override:
Closed Issue: Confirm the draft spec, above, for most specific override as it applies to mixed classes and interfaces (a class takes priority over an interface). See https://github.com/dotnet/csharplang/blob/master/meetings/2017/LDM-2017-04-19.md#diamonds-with-classes.
Which ties in with the ‘more-specific’ and ‘less-specific’ steps we saw in the outline of FindDefaultInterfaceImplementation
above.
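To see the rule in action, here’s a runnable version of the proposal’s example. Note that the proposal’s `override void M()` syntax didn’t ship; C# 8 uses explicit interface implementation inside interfaces instead, so treat this as my adaptation rather than the proposal’s exact code:

```csharp
using System;

IA a = new Derived();
a.M(); // prints "Base" - the class implementation takes priority over IB's default

interface IA { void M(); }

interface IB : IA
{
    // C# 8 spelling of "IB overrides IA.M"
    void IA.M() => Console.WriteLine("IB");
}

class Base : IA
{
    void IA.M() => Console.WriteLine("Base");
}

class Derived : Base, IB { }
```

Because a class implementation is always ‘more specific’ than any interface default, Base.IA.M() wins and the default in IB is never considered.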
Summary
So there you have it, an entire feature delivered end-to-end, yay for .NET (Core) being open source! Thanks to the runtime engineers for making their Issues and PRs easy to follow and for adding such great comments to their code! Also kudos to the language designers for making their proposals and meeting notes available for all to see (e.g. LDM-2017-04-19).
Whether or not you think they are useful, it’s hard to argue that ‘Default Interface Methods’ aren’t well designed and well implemented.
But what makes the feature even more unique is that it required the compiler and runtime teams to work together to make it possible!
Wed, 19 Feb 2020, 12:00 am
Research based on the .NET Runtime
Over the last few years, I’ve come across more and more research papers based, in some way, on the ‘Common Language Runtime’ (CLR).
So armed with Google Scholar and ably assisted by Semantic Scholar, I put together the list below.
Note: I put the papers into the following categories to make them easier to navigate (papers in each category are sorted by date, newest -> oldest):
- Using the .NET Runtime as a case-study
- to prove its correctness, study how it works or analyse its behaviour
- Research carried out by Microsoft Research, the research subsidiary of Microsoft.
- “It was formed in 1991, with the intent to advance state-of-the-art computing and solve difficult world problems through technological innovation in collaboration with academic, government, and industry researchers” (according to Wikipedia)
- Papers based on the Mono Runtime
- a ‘Cross-Platform, open-source .NET framework’
- Using ‘Rotor’, real name ‘Shared Source CLI (SSCLI)’
- from Wikipedia “Microsoft provides the Shared Source CLI as a reference CLI implementation suitable for educational use”
Any papers I’ve missed? If so, please let me know in the comments or on Twitter
- .NET Runtime as a Case-Study
- Pitfalls of C# Generics and Their Solution Using Concepts (Belyakova & Mikhalkovich, 2015)
- Efficient Compilation of .NET Programs for Embedded Systems (Sallenaveab & Ducournaub, 2011)
- Type safety of C# and .Net CLR (Fruja, 2007)
- Modeling the .NET CLR Exception Handling Mechanism for a Mathematical Analysis (Fruja & Börger, 2006)
- Analysis of the .NET CLR Exception Handling Mechanism (Fruja & Börger, 2005)
- A Modular Design for the Common Language Runtime (CLR) Architecture (Fruja, 2005)
- Cross-language Program Slicing in the .NET Framework (Pócza, Biczó & Porkoláb, 2005)
- Design and Implementation of a high-level multi-language . NET Debugger (Strein, 2005)
- A High-Level Modular Definition of the Semantics of C# (Börger, Fruja, Gervasi & Stärk, 2004)
- An ASM Specification of C# Threads and the .NET Memory Model (Stärk and Börger, 2004)
- Common Language Runtime : a new virtual machine (Ferreira, 2004)
- JVM versus CLR: a comparative study (Singer, 2003)
- Runtime Code Generation with JVM And CLR (Sestoft, 2002)
- Microsoft Research
- Project Snowflake: Non-blocking safe manual memory management in .NET (Parkinson, Vaswani, Costa, Deligiannis, Blankstein, McDermott, Balkind & Vytiniotis, 2017)
- Simple, Fast and Safe Manual Memory Management (Kedia, Costa, Vytiniotis, Parkinson, Vaswani & Blankstein, 2017)
- Uniqueness and Reference Immutability for Safe Parallelism (Gordon, Parkinson, Parsons, Bromfield & Duffy, 2012)
- A study of concurrent real-time garbage collectors (Pizlo, Petrank & Steensgaard, 2008)
- Optimizing concurrency levels in the .NET ThreadPool: A case study of controller design and implementation (Hellerstein, Morrison & Eilebrecht, 2008)
- Stopless: a real-time garbage collector for multiprocessors. (Pizlo, Frampton, Petrank & Steensgaard, 2007)
- Securing the .NET Programming Model (Kennedy, 2006)
- Combining Generics, Pre-compilation and Sharing Between Software-Based Processes (Syme & Kennedy, 2004)
- Formalization of Generics for the .NET Common Language Runtime (Yu, Kennedy & Syme, 2004)
- Runtime Verification of .NET Contracts (Barnett & Schulte, 2003)
- Design and Implementation of Generics for the .NET Common Language Runtime (Kennedy & Syme, 2001)
- Typing a Multi-Language Intermediate Code (Gordon & Syme, 2001)
- Mono Runtime
- Static and Dynamic Analysis of Android Malware and Goodware Written with Unity Framework (Shim, Lim, Cho, Han & Park, 2018)
- Reducing startup time of a deterministic virtualizing runtime environment (Däumler & Werner, 2013)
- Detecting Clones Across Microsoft .NET Programming Languages (Al-Omari, Keivanloo, Roy & Rilling, 2012)
- Language-independent sandboxing of just-in-time compilation and self-modifying code (Ansel & Marchenko, 2012)
- VMKit: a Substrate for Managed Runtime Environments (Geoffray, Thomas, Lawall, Muller & Folliot, 2010)
- MMC: the Mono Model Checker (Ruys & Aan de Brugh, 2007)
- Numeric performance in C, C# and Java (Sestoft, 2007)
- Mono versus .Net: A Comparative Study of Performance for Distributed Processing. (Blajian, Eggen, Eggen & Pitts, 2006)
- Automated detection of performance regressions: the mono experience (Kalibera, Bulej & Tuma, 2005)
- Shared Source Common Language Infrastructure (SSCLI) - a.k.a ‘Rotor’
- Efficient virtual machine support of runtime structural reflection (Ortina, Redondoa & Perez-Schofield, 2009)
- Extending the SSCLI to Support Dynamic Inheritance (Redondo, Ortin & Perez-Schofield, 2008)
- Sampling profiler for Rotor as part of optimizing compilation system (Chilingarova & Safonov, 2006)
- To JIT or not to JIT: The effect of code-pitching on the performance of .NET framework (Anthony, Leung & Srisa-an, 2005)
- Adding structural reflection to the SSCLI (Ortin, Redondo, Vinuesa & Lovelle, 2005)
- Static Analysis for Identifying and Allocating Clusters of Immortal Objects (Ravindar & Srikant, 2005)
- An Optimizing Just-In-Time Compiler for Rotor (Trindade & Silva, 2005)
- Software Interactions into the SSCLI platform (Charfi & Emsellem, 2004)
- Experience Integrating a New Compiler and a New Garbage Collector Into Rotor (Anderson, Eng, Glew, Lewis, Menon & Stichnoth, 2004)
.NET Runtime as a Case-Study
Abstract
In comparison with Haskell type classes and C ++ concepts, such object-oriented languages as C# and Java provide much limited mechanisms of generic programming based on F-bounded polymorphism. Main pitfalls of C# generics are considered in this paper. Extending C# language with concepts which can be simultaneously used with interfaces is proposed to solve the problems of generics; a design and translation of concepts are outlined.
Abstract
Compiling under the closed-world assumption (CWA) has been shown to be an appropriate way for implementing object-oriented languages such as Java on low-end embedded systems. In this paper, we explore the implications of using whole program optimizations such as Rapid Type Analysis (RTA) and coloring on programs targeting the .NET infrastructure. We extended RTA so that it takes into account .NET specific features such as (i) array covariance, a language feature also supported in Java, (ii) generics, whose specifications in .Net impacts type analysis and (iii) delegates, which encapsulate methods within objects. We also use an intraprocedural control flow analysis in addition to RTA. We evaluated the optimizations that we implemented on programs written in C#. Preliminary results show a noticeable reduction of the code size, class hierarchy and polymorphism of the programs we optimize. Array covariance is safe in almost all cases, and some delegate calls can be implemented as direct calls.
Abstract
Type safety plays a crucial role in the security enforcement of any typed programming language. This thesis presents a formal proof of C#’s type safety. For this purpose, we develop an abstract
framework for C#, comprising formal specifications of the language’s grammar, of the statically correct programs, and of the static and operational semantics. Using this framework, we prove that C# is type-safe, by showing that the execution of statically correct C# programs does not lead to type errors.
Abstract
This work is part of a larger project which aims at establishing some important properties of C# and CLR by mathematical proofs. Examples are the correctness of the bytecode verifier of CLR, the type safety (along the lines of the first author’s correctness proof for the definite assignment rules) of C#, the correctness of a general compilation scheme.
Abstract
We provide a complete mathematical model for the exception handling mechanism of the Common Language Runtime (CLR), the virtual machine underlying the interpretation of .NET programs. The goal is to use this rigorous model in the corresponding part of the still-to-be-developed soundness proof for the CLR bytecode verifier.
Abstract
This paper provides a modular high-level design of the Common Language Runtime (CLR) architecture. Our design is given in terms of Abstract State Machines (ASMs) and takes the form of an interpreter. We describe the CLR as a hierarchy of eight submachines, which correspond to eight submodules into which the Common Intermediate Language (CIL) instruction set can be decomposed.
Abstract
Dynamic program slicing methods are very attractive for debugging because many statements can be ignored in the process of localizing a bug. Although language interoperability is a key concept in modern development platforms, current slicing techniques are still restricted to a single language. In this paper a cross-language dynamic program slicing technique is introduced for the .NET environment. The method is utilizing the CLR Debugging Services API, hence it can be applied to large multi-language applications.
Abstract
The Microsoft .NET Common Language Runtime (CLR) provides a low-level debugging application programmers interface (API), which can be used to implement traditional source code debuggers but can also be useful to implement other dynamic program introspection tools. This paper describes our experience in using this API for the implementation of a high-level debugger. The API is difficult to use from a technical point of view because it is implemented as a set of Component Object Model (COM) interfaces instead of a managed .NET API. Nevertheless, it is possible to implement a debugger in managed C# code using COM-interop. We describe our experience in taking this approach. We define a high-level debugging API and implement it in the C# language using COM-interop to access the low-level debugging API. Furthermore, we describe the integration of this high-level API in the multi-language development environment X-develop to enable source code debugging of .NET languages. This paper can be useful for anybody who wants to take the same approach to implement debuggers or other tools for dynamic program introspection.
Abstract
We propose a structured mathematical definition of the semantics of programs to provide a platform-independent interpreter view of the language for the programmer, which can also be used for a precise analysis of the ECMA standard of the language and as a reference model for teaching. The definition takes care to reflect directly and faithfully—as much as possible without becoming inconsistent or incomplete—the descriptions in the standard to become comparable with the corresponding models for Java in Stärk et al. (Java and Java Virtual Machine—Definition, Verification, Validation, Springer, Berlin, 2001) and to provide for implementors the possibility to check their basic design decisions against an accurate high-level model. The model sheds light on some of the dark corners of and on some critical differences between the ECMA standard and the implementations of the language.
Abstract
We present a high-level ASM model of C# threads and the .NET memory model. We focus on purely managed, fully portable threading features of C#. The sequential model interleaves the computation steps of the currently running threads and is suitable for uniprocessors. The parallel model addresses problems of true concurrency on multiprocessor systems. The models provide a sound basis for the development of multi-threaded applications in C#. The thread and memory models complete the abstract operational semantics of C# in.
Abstract
Virtual Machines provide a runtime execution platform combining bytecode portability with a performance close to native code. An overview of current approaches precedes an insight into Microsoft CLR (Common Language Runtime), comparing it to Sun JVM (Java Virtual Machine) and to a native execution environment (IA 32). A reference is also made to CLR in a Unix platform and to techniques on how CLR improves code execution.
Abstract
We present empirical evidence to demonstrate that there is little or no difference between the Java Virtual Machine and the .NET Common Language Runtime, as regards the compilation and execution of object-oriented programs. Then we give details of a case study that proves the superiority of the Common Language Runtime as a target for imperative programming language compilers (in particular GCC).
Abstract
Modern bytecode execution environments with optimizing just-in-time compilers, such as Sun’s Hotspot Java Virtual Machine, IBM’s Java Virtual Machine, and Microsoft’s Common Language Runtime, provide an infrastructure for generating fast code at runtime. Such runtime code generation can be used for efficient implementation of parametrized algorithms. More generally, with runtime code generation one can introduce an additional binding-time without performance loss. This permits improved performance and improved static correctness guarantees.
Microsoft Research
Abstract
Garbage collection greatly improves programmer productivity and ensures memory safety. Manual memory management on the other hand often delivers better performance but is typically unsafe and can lead to system crashes or security vulnerabilities. We propose integrating safe manual memory management with garbage collection in the .NET runtime to get the best of both worlds. In our design, programmers can choose between allocating objects in the garbage collected heap or the manual heap. All existing applications run unmodified, and without any performance degradation, using the garbage collected heap. Our programming model for manual memory management is flexible: although objects in the manual heap can have a single owning pointer, we allow deallocation at any program point and concurrent sharing of these objects amongst all the threads in the program. Experimental results from our .NET CoreCLR implementation on real-world applications show substantial performance gains especially in multithreaded scenarios: up to 3x savings in peak working sets and 2x improvements in runtime.
Abstract
Safe programming languages are readily available, but many applications continue to be written in unsafe languages, because the latter are more efficient. As a consequence, many applications continue to have exploitable memory safety bugs. Since garbage collection is a major source of inefficiency in the implementation of safe languages, replacing it with safe manual memory management would be an important step towards solving this problem.
Previous approaches to safe manual memory management use programming models based on regions, unique pointers, borrowing of references, and ownership types. We propose a much simpler programming model that does not require any of these concepts. Starting from the design of an imperative type safe language (like Java or C#), we just add a delete operator to free memory explicitly and an exception which is thrown if the program dereferences a pointer to freed memory. We propose an efficient implementation of this programming model that guarantees type safety. Experimental results from our implementation based on the C# native compiler show that this design achieves up to 3x reduction in peak working set and run time.
Abstract
A key challenge for concurrent programming is that side-effects (memory operations) in one thread can affect the behavior of another thread. In this paper, we present a type system to restrict the updates to memory to prevent these unintended side-effects. We provide a novel combination of immutable and unique (isolated) types that ensures safe parallelism (race freedom and deterministic execution). The type system includes support for polymorphism over type qualifiers, and can easily create cycles of immutable objects. Key to the system’s flexibility is the ability to recover immutable or externally unique references after violating uniqueness without any explicit alias tracking. Our type system models a prototype extension to C# that is in active use by a Microsoft team. We describe their experiences building large systems with this extension. We prove the soundness of the type system by an embedding into a program logic.
Abstract
Concurrent garbage collection is highly attractive for real-time systems, because offloading the collection effort from the executing threads allows faster response, allowing for extremely short deadlines at the microseconds level. Concurrent collectors also offer much better scalability over incremental collectors. The main problem with concurrent real-time collectors is their complexity. The first concurrent real-time garbage collector that can support fine synchronization, STOPLESS, has recently been presented by Pizlo et al. In this paper, we propose two additional (and different) algorithms for concurrent real-time garbage collection: CLOVER and CHICKEN. Both collectors obtain reduced complexity over the first collector STOPLESS, but need to trade a benefit for it. We study the algorithmic strengths and weaknesses of CLOVER and CHICKEN and compare them to STOPLESS. Finally, we have implemented all three collectors on the Bartok compiler and runtime for C# and we present measurements to compare their efficiency and responsiveness.
Abstract
This paper presents a case study of developing a hill climbing concurrency controller (HC3) for the .NET ThreadPool. The intent of the case study is to provide insight into software considerations for controller design, testing, and implementation. The case study is structured as a series of issues encountered and approaches taken to their resolution. Examples of issues and approaches include: (a) addressing the need to combine a hill climbing control law with rule-based techniques by the use of hybrid control; (b) increasing the efficiency and reducing the variability of the test environment by using resource emulation; and (c) effectively assessing design choices by using test scenarios for which the optimal concurrency level can be computed analytically and hence desired test results are known a priori. We believe that these issues and approaches have broad application to controllers for resource management of software systems.
Abstract
We present STOPLESS: a concurrent real-time garbage collector suitable for modern multiprocessors running parallel multithreaded applications. Creating a garbage-collected environment that supports real-time on modern platforms is notoriously hard, especially if real-time implies lock-freedom. Known real-time collectors either restrict the real-time guarantees to uniprocessors only, rely on special hardware, or just give up supporting atomic operations (which are crucial for lock-free software). STOPLESS is the first collector that provides real-time responsiveness while preserving lock-freedom, supporting atomic operations, controlling fragmentation by compaction, and supporting modern parallel platforms. STOPLESS is adequate for modern languages such as C# or Java. It was implemented on top of the Bartok compiler and runtime for C# and measurements demonstrate high responsiveness (a factor of a 100 better than previously published systems), virtually no pause times, good mutator utilization, and acceptable overheads.
Abstract
The security of the .NET programming model is studied from the standpoint of fully abstract compilation of C#. A number of failures of full abstraction are identified, and fixes described. The most serious problems have recently been fixed for version 2.0 of the .NET Common Language Runtime.
Abstract
We describe problems that have arisen when combining the proposed design for generics for the Microsoft .NET Common Language Runtime (CLR) with two resource-related features supported by the Microsoft CLR implementation: application domains and pre-compilation. Application domains are “software based processes” and the interaction between application domains and generics stems from the fact that code and descriptors are generated on a per-generic-instantiation basis, and thus instantiations consume resources which are preferably both shareable and recoverable. Pre-compilation runs at install-time to reduce startup overheads. This interacts with application domain unloading: compilation units may contain shareable generated instantiations. The paper describes these interactions and the different approaches that can be used to avoid or ameliorate the problems.
Abstract
We present a formalization of the implementation of generics in the .NET Common Language Runtime (CLR), focusing on two novel aspects of the implementation: mixed specialization and sharing, and efficient support for run-time types. Some crucial constructs used in the implementation are dictionaries and run-time type representations. We formalize these aspects type-theoretically in a way that corresponds in spirit to the implementation techniques used in practice. Both the techniques and the formalization also help us understand the range of possible implementation techniques for other languages, e.g., ML, especially when additional source language constructs such as run-time types are supported. A useful by-product of this study is a type system for a subset of the polymorphic IL proposed for the .NET CLR.
Abstract
We propose a method for implementing behavioral interface specifications on the .NET platform. Our interface specifications are expressed as executable model programs. Model programs can be run either as stand-alone simulations or used as contracts to check the conformance of an implementation class to its specification. We focus on the latter, which we call runtime verification. In our framework, model programs are expressed in the new specification language AsmL. We describe how AsmL can be used to describe contracts independently from any implementation language, how AsmL allows properties of component interaction to be specified using mandatory calls, and how AsmL is used to check the behavior of a component written in any of the .NET languages, such as VB, C#, or C++.
Abstract
The Microsoft .NET Common Language Runtime provides a shared type system, intermediate language and dynamic execution environment for the implementation and inter-operation of multiple source languages. In this paper we extend it with direct support for parametric polymorphism (also known as generics), describing the design through examples written in an extended version of the C# programming language, and explaining aspects of implementation by reference to a prototype extension to the runtime. Our design is very expressive, supporting parameterized types, polymorphic static, instance and virtual methods, “F-bounded” type parameters, instantiation at pointer and value types, polymorphic recursion, and exact run-time types. The implementation takes advantage of the dynamic nature of the runtime, performing just-in-time type specialization, representation-based code sharing and novel techniques for efficient creation and use of run-time types. Early performance results are encouraging and suggest that programmers will not need to pay an overhead for using generics, achieving performance almost matching hand-specialized code.
Abstract
The Microsoft .NET Framework is a new computing architecture designed to support a variety of distributed applications and web-based services. .NET software components are typically distributed in an object-oriented intermediate language, Microsoft IL, executed by the Microsoft Common Language Runtime. To allow convenient multi-language working, IL supports a wide variety of high-level language constructs, including class-based objects, inheritance, garbage collection, and a security mechanism based on type safe execution. This paper precisely describes the type system for a substantial fragment of IL that includes several novel features: certain objects may be allocated either on the heap or on the stack; those on the stack may be boxed onto the heap, and those on the heap may be unboxed onto the stack; methods may receive arguments and return results via typed pointers, which can reference both the stack and the heap, including the interiors of objects on the heap. We present a formal semantics for the fragment. Our typing rules determine well-typed IL instruction sequences that can be assembled and executed. Of particular interest are rules to ensure no pointer into the stack outlives its target. Our main theorem asserts type safety, that well-typed programs in our IL fragment do not lead to untrapped execution errors. Our main theorem does not directly apply to the product. Still, the formal system of this paper is an abstraction of informal and executable specifications we wrote for the full product during its development. Our informal specification became the basis of the product team’s working specification of type-checking. The process of writing this specification, deploying the executable specification as a test oracle, and applying theorem proving techniques, helped us identify several security critical bugs during development.
Mono Runtime
Abstract
Unity is the most popular cross-platform development framework to develop games for multiple platforms such as Android, iOS, and Windows Mobile. While Unity developers can easily develop mobile apps for multiple platforms, adversaries can also easily build malicious apps based on the “write once, run anywhere” (WORA) feature. Even though malicious apps were discovered among Android apps written with Unity framework (Unity apps), little research has been done on analysing the malicious apps. We propose static and dynamic reverse engineering techniques for malicious Unity apps. We first inspect the executable file format of a Unity app and present an effective static analysis technique of the Unity app. Then, we also propose a systematic technique to analyse dynamically the Unity app. Using the proposed techniques, the malware analyst can statically and dynamically analyse Java code, native code in C or C++, and the Mono runtime layer where the C# code is running.
Abstract
Virtualized runtime environments like Java Virtual Machine (JVM) or Microsoft .NET’s Common Language Runtime (CLR) introduce additional challenges to real-time software development. Since applications for such environments are usually deployed in platform independent intermediate code, one issue is the timing of code transformation from intermediate code into native code. We have developed a solution for this problem, so that code transformation is suitable for real-time systems. It combines pre-compilation of intermediate code with the elimination of indirect references in native code. The gain of determinism comes with an increased application startup time. In this paper we present an optimization that utilizes an Ahead-of-Time compiler to reduce the startup time while keeping the real-time suitable timing behaviour. In an experiment we compare our approach with existing ones and demonstrate its benefits for certain application cases.
Abstract
The Microsoft .NET framework and its language family focus on multi-language development to support interoperability across several programming languages. The framework allows for the development of similar applications in different languages through the reuse of core libraries. As a result of such a multi-language development, the identification and traceability of similar code fragments (clones) becomes a key challenge. In this paper, we present a clone detection approach for the .NET language family. The approach is based on the Common Intermediate Language, which is generated by the .NET compiler for the different languages within the .NET framework. In order to achieve an acceptable recall while maintaining the precision of our detection approach, we define a set of filtering processes to reduce noise in the raw data. We show that these filters are essential for Intermediate Language-based clone detection, without significantly affecting the precision of the detection approach. Finally, we study the quantitative and qualitative performance aspects of our clone detection approach. We evaluate the number of reported candidate clone-pairs, as well as the precision and recall (using manual validation) for several open source cross-language systems, to show the effectiveness of our proposed approach.
Abstract
When dealing with dynamic, untrusted content, such as on the Web, software behavior must be sandboxed, typically through use of a language like JavaScript. However, even for such specially-designed languages, it is difficult to ensure the safety of highly-optimized, dynamic language runtimes which, for efficiency, rely on advanced techniques such as Just-In-Time (JIT) compilation, large libraries of native-code support routines, and intricate mechanisms for multi-threading and garbage collection. Each new runtime provides a new potential attack surface and this security risk raises a barrier to the adoption of new languages for creating untrusted content. Removing this limitation, this paper introduces general mechanisms for safely and efficiently sandboxing software, such as dynamic language runtimes, that make use of advanced, low-level techniques like runtime code modification. Our language-independent sandboxing builds on Software-based Fault Isolation (SFI), a traditionally static technique. We provide a more flexible form of SFI by adding new constraints and mechanisms that allow safety to be guaranteed despite runtime code modifications. We have added our extensions to both the x86-32 and x86-64 variants of a production-quality, SFI-based sandboxing platform; on those two architectures SFI mechanisms face different challenges. We have also ported two representative language platforms to our extended sandbox: the Mono common language runtime and the V8 JavaScript engine. In detailed evaluations, we find that sandboxing slowdown varies between different benchmarks, languages, and hardware platforms. Overheads are generally moderate and they are close to zero for some important benchmark/platform combinations.
Abstract
Managed Runtime Environments (MREs), such as the JVM and the CLI, form an attractive environment for program execution, by providing portability and safety, via the use of a bytecode language and automatic memory management, as well as good performance, via just-in-time (JIT) compilation. Nevertheless, developing a fully featured MRE, including e.g. a garbage collector and JIT compiler, is a herculean task. As a result, new languages cannot easily take advantage of the benefits of MREs, and it is difficult to experiment with extensions of existing MRE based languages. This paper describes and evaluates VMKit, a first attempt to build a common substrate that eases the development of high-level MREs. We have successfully used VMKit to build two MREs: a Java Virtual Machine and a Common Language Runtime. We provide an extensive study of the lessons learned in developing this infrastructure, and assess the ease of implementing new MREs or MRE extensions and the resulting performance. In particular, it took one of the authors only one month to develop a Common Language Runtime using VMKit. VMKit furthermore has performance comparable to the well established open source MREs Cacao, Apache Harmony and Mono, and is 1.2 to 3 times slower than JikesRVM on most of the Dacapo benchmarks.
Abstract
The Mono Model Checker (mmc) is a software model checker for cil bytecode programs. mmc has been developed on the Mono platform. mmc is able to detect deadlocks and assertion violations in cil programs. The design of mmc is inspired by the Java PathFinder (jpf), a model checker for Java programs. The performance of mmc is comparable to jpf. This paper introduces mmc and presents its main architectural characteristics.
Abstract
We compare the numeric performance of C, C# and Java on three small cases.
Abstract
Microsoft has released .NET, a platform-dependent standard for the C# programming language. Sponsored by Ximian/Novell, Mono, the open source development platform based on the .NET framework, has been developed to be a platform-independent version of the C# programming environment. While .NET is platform dependent, Mono allows developers to build Linux and cross-platform applications. Mono’s .NET implementation is based on the ECMA standards for C#. This paper examines both of these programming environments with the goal of evaluating the performance characteristics of each. Testing is done with various algorithms. We also assess the trade-offs associated with using a cross-platform versus a platform.
Abstract
Engineering a large software project involves tracking the impact of development and maintenance changes on the software performance. An approach for tracking the impact is regression benchmarking, which involves automated benchmarking and evaluation of performance at regular intervals. Regression benchmarking must tackle the nondeterminism inherent to contemporary computer systems and execution environments and the impact of the nondeterminism on the results. On the example of a fully automated regression benchmarking environment for the mono open-source project, we show how the problems associated with nondeterminism can be tackled using statistical methods.
Shared Source Common Language Infrastructure (SSCLI) - a.k.a ‘Rotor’
Abstract
Increasing trends towards adaptive, distributed, generative and pervasive software have made object-oriented dynamically typed languages become increasingly popular. These languages offer dynamic software evolution by means of reflection, facilitating the development of dynamic systems. Unfortunately, this dynamism commonly imposes a runtime performance penalty. In this paper, we describe how to extend a production JIT-compiler virtual machine to support runtime object-oriented structural reflection offered by many dynamic languages. Our approach improves runtime performance of dynamic languages running on statically typed virtual machines. At the same time, existing statically typed languages are still supported by the virtual machine.
We have extended the .Net platform with runtime structural reflection adding prototype-based object-oriented semantics to the statically typed class-based model of .Net, supporting both kinds of programming languages. The assessment of runtime performance and memory consumption has revealed that a direct support of structural reflection in a production JIT-based virtual machine designed for statically typed languages provides a significant performance improvement for dynamically typed languages.
Abstract
This paper presents a step forward on a research trend focused on increasing runtime adaptability of commercial JIT-based virtual machines, describing how to include dynamic inheritance into this kind of platforms. A considerable amount of research aimed at improving runtime performance of virtual machines has converted them into the ideal support for developing different types of software products. Current virtual machines do not only provide benefits such as application interoperability, distribution and code portability, but they also offer a competitive runtime performance.
Since JIT compilation has played a very important role in improving runtime performance of virtual machines, we first extended a production JIT-based virtual machine to support efficient language-neutral structural reflective primitives of dynamically typed programming languages. This article presents the next step in our research work: supporting language-neutral dynamic inheritance for both statically and dynamically typed programming languages. Executing both kinds of programming languages over the same platform provides a direct interoperation between them.
Abstract
This paper describes a low-overhead self-tuning sampling-based runtime profiler integrated into the SSCLI virtual machine. Our profiler estimates how “hot” a method is and builds a call context graph based on managed stack samples analysis. The frequency of sampling is tuned dynamically at runtime, based on the information of how often the same activation record appears on top of the stack. The call graph is presented as a novel Call Context Map (CC-Map) structure that combines compact representation and accurate information about the context. It enables fast extraction of data helpful in making compilation decisions, as well as fast placing of data into the map. The sampling mechanism is integrated with intrinsic Rotor mechanisms of thread preemption and stack walk. A separate system thread is responsible for organizing data in the CC-Map. This thread gathers and stores samples quickly queued by managed threads, thus decreasing the time they must hold up their user-scheduled job.
Abstract
The .NET Compact Framework is designed to be a high-performance virtual machine for mobile and embedded devices that operate on Windows CE (version 4.1 and later). It achieves fast execution time by compiling methods dynamically instead of using interpretation. Once compiled, these methods are stored in a portion of the heap called code-cache and can be reused quickly to satisfy future method calls. While code-cache provides a high level of reusability, it can also use a large amount of memory. As a result, the Compact Framework provides a “code pitching” mechanism that can be used to discard the previously compiled methods as needed. In this paper, we study the effect of code pitching on the overall performance and memory utilization of .NET applications. We conduct our experiments using Microsoft’s Shared-Source Common Language Infrastructure (SSCLI). We profile the access behavior of the compiled methods. We also experiment with various code-cache configurations to perform pitching. We find that programs can operate efficiently with a small code-cache without incurring substantial recompilation and execution overheads.
Abstract
Although dynamic languages are becoming widely used due to the flexibility needs of specific software products, their major drawback is their runtime performance. Compiling the source program to an abstract machine’s intermediate language is the current technique used to obtain the best performance results. This intermediate code is then executed by a virtual machine developed as an interpreter. Although JIT adaptive optimizing compilation is currently used to speed up Java and .NET intermediate code execution, this practice has not been employed successfully in the implementation of dynamically adaptive platforms yet. We present an approach to improve the runtime performance of a specific set of structural reflective primitives, extensively used in adaptive software development. Looking for a better performance, as well as interaction with other languages, we have employed the Microsoft Shared Source CLI platform, making use of its JIT compiler. The SSCLI computational model has been enhanced with semantics of the prototype-based object-oriented computational model. This model is much more suitable for reflective environments. The initial assessment of performance results reveals that augmenting the semantics of the SSCLI model, together with JIT generation of native code, produces better runtime performance than the existing implementations.
Abstract
Long living objects lengthen the trace time which is a critical phase of the garbage collection process. However, it is possible to recognize object clusters i.e. groups of long living objects having approximately the same lifetime and treat them separately to reduce the load on the garbage collector and hence improve overall performance. Segregating objects this way leaves the heap for objects with shorter lifetimes and now a typical collection can find more garbage than before. In this paper, we describe a compile time analysis strategy to identify object clusters in programs. The result of the compile time analysis is the set of allocation sites that contribute towards allocating objects belonging to such clusters. All such allocation sites are replaced by a new allocation method that allocates objects into the cluster area rather than the heap. This study was carried out for a concurrent collector which we developed for Rotor, Microsoft’s Shared Source Implementation of .NET. We analyze the performance of the program with combinations of the cluster and stack allocation optimizations. Our results show that the clustering optimization reduces the number of collections by 66.5% on average, even eliminating the need for collection in some programs. As a result, the total pause time reduces by 62.8% on average. Using both stack allocation and the cluster optimizations brings down the number of collections by 91.5% thereby improving the total pause time by 79.33%.
Abstract
The Shared Source CLI (SSCLI), also known as Rotor, is an implementation of the CLI released by Microsoft in source code. Rotor includes a single pass just-in-time compiler that generates non-optimized code for Intel IA-32 and IBM PowerPC processors. We extend Rotor with an optimizing just-in-time compiler for IA-32. This compiler has three passes: control flow graph generation, data dependence graph generation and final code generation. Dominance relations in the control flow graph are used to detect natural loops. A number of optimizations are performed during the generation of the data dependence graph. During native code generation, the rich address modes of IA-32 are used for instruction folding, reducing code size and usage of register names. Despite the overhead of three passes and optimizations, this compiler is only 1.4 to 1.9 times slower than the original SSCLI compiler and generates code that runs 6.4 to 10 times faster.
Abstract
By using an Interaction Specification Language (ISL), interactions between components can be expressed in a language-independent way. At class level, interaction patterns specified in ISL represent models of future interactions when applied to some component instances. The Interaction Server is in charge of managing the life cycle of interactions (interaction pattern registration and instantiation, destruction of interactions, merging). It acts as a central repository that keeps the global coherency of the adaptations realized on the component instances. The Interaction service allows creating interactions between heterogeneous components. Noah is an implementation of this Interaction Service. It can be thought of as a dynamic aspect repository with a weaver that uses an aspect composition mechanism that ensures commutable and associative adaptations. In this paper, we propose the implementation of the Interaction Service in the SSCLI. In contrast to other implementations such as Java where interaction management represents an additional layer, SSCLI enables us to integrate Interaction Management as an intrinsic part of the CLI runtime.
Abstract
Microsoft’s Rotor is a shared-source CLI implementation intended for use as a research platform. It is particularly attractive for research because of its complete implementation and extensive libraries, and because its modular design allows different implementations of certain components such as just-in-time compilers (JITs). Our group has independently developed our own high-performance JIT and garbage collector (GC) and wanted to take advantage of Rotor to experiment with these components in a CLI environment. In this paper, we describe our experience integrating these components into Rotor and evaluate the flexibility of Rotor’s design toward this goal. We found it easier to integrate our JIT than our GC because Rotor has a well-defined interface for the former but not the latter. However, our JIT integration still required significant changes to both Rotor and our JIT. For example, we modified Rotor to support multiple JITs. We also added support for a second JIT manager in Rotor, and implemented a new code manager compatible with our JIT. We had to change our JIT compiler to support Rotor’s calling conventions, helper functions, and exception model. Our GC integration was complicated by the many places in Rotor where components make assumptions about how its garbage collector is implemented, as well as Rotor’s lack of a well-defined GC interface. We also had to reconcile the different assumptions made by Rotor and our garbage collector about the layout of objects, virtual-method tables, and thread structures.
Fri, 25 Oct 2019, 12:00 am
"Stubs" in the .NET Runtime
As the saying goes:
“All problems in computer science can be solved by another level of indirection”
- David Wheeler
and it certainly seems like the ‘.NET Runtime’ Engineers took this advice to heart!
‘Stubs’, as they’re known in the runtime (sometimes ‘Thunks’), provide a level of indirection throughout the source code; there are almost 500 mentions of them!
This post will explore what they are, how they work and why they’re needed.
Table of Contents
What are stubs?
In the context of the .NET Runtime, ‘stubs’ look something like this:
Call-site Callee
+--------------+ +---------+ +-------------+
| | | | | |
| +---------->+ Stub + - - - - ->+ |
| | | | | |
+--------------+ +---------+ +-------------+
So they sit between a method ‘call-site’ (i.e. code such as var result = Foo(..);
) and the ‘callee’ (where the method itself is implemented, the native/assembly code) and I like to think of them as doing tidy-up or fix-up work. Note that moving from the ‘stub’ to the ‘callee’ isn’t another full method call (hence the dotted line), it’s often just a single jmp
or call
assembly instruction, so the 2nd transition doesn’t involve all the same work that was initially done at the call-site (pushing/popping arguments into registers, increasing the stack space, etc).
The stubs themselves can be as simple as just a few assembly instructions or something more complicated, we’ll look at individual examples later on in this post.
Now, to be clear, not all method calls require a stub; a regular call to a static or instance method goes directly from the ‘call-site’ to the ‘callee’. But once you involve virtual methods, delegates or generics, things get a bit more complicated.
Why are stubs needed?
There are several reasons that stubs need to be created by the runtime:
- Required Functionality
- For instance, Delegates and Arrays must be provided by the runtime; their method bodies are not generated by the C#/F#/VB.NET compiler, nor do they exist in the Base-Class Libraries. This requirement is outlined in the ECMA 335 Spec, for instance ‘Partition I’ in section ‘8.9.1 Array types’ says:
Exact array types are created automatically by the VES when they are required. Hence, the operations on an array type are defined by the CTS. These generally are: allocating the array based on size and lower-bound information, indexing the array to read and write a value, computing the address of an element of the array (a managed pointer), and querying for the rank, bounds, and the total number of values stored in the array.
Likewise for delegates, which are covered in ‘I.8.9.3 Delegates’:
While, for the most part, delegates appear to be simply another kind of user-defined class, they are tightly controlled. The implementations of the methods are provided by the VES, not user code. The only additional members that can be defined on delegate types are static or instance methods.
- Performance
- Consistent method calls
- A final factor is that having ‘stubs’ makes the work of the JIT compiler easier. As we will see in the rest of the post, stubs deal with a variety of different types of method calls. This means that the JIT can generate more straightforward code for any given ‘call-site’, because it (mostly) doesn’t care what’s happening in the ‘callee’. If stubs didn’t exist, for a given method call the JIT would have to generate different code depending on whether generics were involved or not, if it was a virtual or non-virtual call, if it was going via a delegate, etc. Stubs abstract a lot of this behaviour away from the JIT, allowing it to deal with a simpler ‘Application Binary Interface’ (ABI).
CLR ‘Application Binary Interface’ (ABI)
Therefore, another way to think about ‘stubs’ is that they are part of what makes the CLR-specific ‘Application Binary Interface’ (ABI) work.
All code needs to work with the ABI or ‘calling convention’ of the CPU/OS that it’s running on, for instance by following the x86 calling convention, x64 calling convention or System V ABI. This applies across runtimes, for more on this see:
As an aside, if you want more information about ‘calling conventions’ here’s some links that I found useful:
However, on top of what the CLR has to support due to the CPU/OS conventions, it also has its own extended ABI for .NET-specific use cases, including:
- “this” pointer:
The managed “this” pointer is treated like a new kind of argument not covered by the native ABI, so we chose to always pass it as the first argument in (AMD64) RCX
or (ARM, ARM64) R0
.
AMD64-only: Up to .NET Framework 4.5, the managed “this” pointer was treated just like the native “this” pointer (meaning it was the second argument when the call used a return buffer and was passed in RDX
instead of RCX
). Starting with .NET Framework 4.5, it is always the first argument.
- Generics or more specifically to handle ‘Shared generics’:
In cases where the code address does not uniquely identify a generic instantiation of a method, then a ‘generic instantiation parameter’ is required. Often the “this” pointer can serve dual-purpose as the instantiation parameter. When the “this” pointer is not the generic parameter, the generic parameter is passed as an additional argument.
- Hidden Parameters, covering ‘Stub dispatch’, ‘Fast Pinvoke’, ‘Calli Pinvoke’ and ‘Normal PInvoke’. For instance, here’s why ‘PInvoke’ has a hidden parameter:
Normal PInvoke - The VM shares IL stubs based on signatures, but wants the right method to show up in call stack and exceptions, so the MethodDesc for the exact PInvoke is passed in the (x86) EAX
/ (AMD64) R10
/ (ARM, ARM64) R12
(in the JIT: REG_SECRET_STUB_PARAM
). Then in the IL stub, when the JIT gets CORJIT_FLG_PUBLISH_SECRET_PARAM
, it must move the register into a compiler temp.
Not all of these scenarios need a stub, for instance the ‘this’ pointer is handled directly by the JIT, but many do as we’ll see in the rest of the post.
Stub Management
So we’ve seen why stubs are needed and what type of functionality they can provide. But before we look at all the specific examples that exist in the CoreCLR source, I just wanted to take some time to understand the common or shared concerns that apply to all stubs.
Stubs in the CLR are snippets of assembly code, but they have to be stored in memory and have their life-time managed. Also, they have to play nice with the debugger, from What Every CLR Developer Must Know Before Writing Code:
2.8 Is your code compatible with managed debugging?
- ..
- If you add a new stub (or way to call managed code), make sure that you can source-level step-in (F11) it under the debugger. The debugger is not psychic. A source-level step-in needs to be able to go from the source-line before a call to the source-line after the call, or managed code developers will be very confused. If you make that call transition be a giant 500 line stub, you must cooperate with the debugger for it to know how to step-through it. (This is what StubManagers are all about. See src\vm\stubmgr.h). Try doing a step-in through your new codepath under the debugger.
So every type of stub has a StubManager
which deals with the allocation, storage and lookup of the stubs. The lookup is significant, as it provides the mapping from an arbitrary memory address to the type of stub (if any) that created the code. As an example, here’s what the CheckIsStub_Internal(..)
method here and DoTraceStub(..)
method here look like for the DelegateInvokeStubManager
:
BOOL DelegateInvokeStubManager::CheckIsStub_Internal(PCODE stubStartAddress)
{
LIMITED_METHOD_DAC_CONTRACT;
bool fIsStub = false;
#ifndef DACCESS_COMPILE
#ifndef _TARGET_X86_
fIsStub = fIsStub || (stubStartAddress == GetEEFuncEntryPoint(SinglecastDelegateInvokeStub));
#endif
#endif // !DACCESS_COMPILE
fIsStub = fIsStub || GetRangeList()->IsInRange(stubStartAddress);
return fIsStub;
}
BOOL DelegateInvokeStubManager::DoTraceStub(PCODE stubStartAddress, TraceDestination *trace)
{
LIMITED_METHOD_CONTRACT;
LOG((LF_CORDB, LL_EVERYTHING, "DelegateInvokeStubManager::DoTraceStub called\n"));
_ASSERTE(CheckIsStub_Internal(stubStartAddress));
// If it's a MC delegate, then we want to set a BP & do a context-ful
// manager push, so that we can figure out if this call will be to a
// single multicast delegate or a multi multicast delegate
trace->InitForManagerPush(stubStartAddress, this);
LOG_TRACE_DESTINATION(trace, stubStartAddress, "DelegateInvokeStubManager::DoTraceStub");
return TRUE;
}
The code to initialise the various stub managers is here in SystemDomain::Attach()
and by working through the list we can get a sense of what each category of stub does (plus the informative comments in the code help!)
PrecodeStubManager
implemented here
- ‘Stub manager functions & globals’
DelegateInvokeStubManager
implemented here
- ‘Since we don’t generate delegate invoke stubs at runtime on IA64, we can’t use the StubLinkStubManager for these stubs. Instead, we create an additional DelegateInvokeStubManager instead.’
JumpStubStubManager
implemented here
- ‘Stub manager for jump stubs created by ExecutionManager::jumpStub() These are currently used only on the 64-bit targets IA64 and AMD64’
RangeSectionStubManager
implemented here
- ‘Stub manager for code sections. It forwards the query to the more appropriate stub manager, or handles the query itself.’
ILStubManager
implemented here
- ‘This is the stub manager for IL stubs’
InteropDispatchStubManager
implemented here
- ‘This is used to recognize GenericComPlusCallStub, VarargPInvokeStub, and GenericPInvokeCalliHelper.’
StubLinkStubManager
implemented here
ThunkHeapStubManager
implemented here
- ‘Note, the only reason we have this stub manager is so that we can recgonize UMEntryThunks for IsTransitionStub. ..’
TailCallStubManager
implemented here
- ‘This is the stub manager to help the managed debugger step into a tail call. It helps the debugger trace through JIT_TailCall().’ (from stubmgr.h)
ThePreStubManager
implemented here (in prestub.cpp)
- ‘The following code manages the PreStub. All method stubs initially use the prestub.’
VirtualCallStubManager
implemented here (in virtualcallstub.cpp)
Finally, we can also see the ‘StubManagers’ in action if we use the eeheap
SOS command to inspect the ‘heap dump’ of a .NET Process, as it helps report the size of the different ‘stub heaps’:
> !eeheap -loader
Loader Heap:
--------------------------------------
System Domain: 704fd058
LowFrequencyHeap: Size: 0x0(0)bytes.
HighFrequencyHeap: 002e2000(8000:1000) Size: 0x1000(4096)bytes.
StubHeap: 002ea000(2000:1000) Size: 0x1000(4096)bytes.
Virtual Call Stub Heap:
- IndcellHeap: Size: 0x0(0)bytes.
- LookupHeap: Size: 0x0(0)bytes.
- ResolveHeap: Size: 0x0(0)bytes.
- DispatchHeap: Size: 0x0(0)bytes.
- CacheEntryHeap: Size: 0x0(0)bytes.
Total size: 0x2000(8192)bytes
--------------------------------------
(output taken from .NET Generics and Code Bloat (or its lack thereof))
You can see that in this case the entire ‘stub heap’ is taking up 4096 bytes and in addition there are more in-depth statistics covering the heaps used by virtual call dispatch.
Types of stubs
The different stubs used by the runtime fall into 3 main categories:
Most stubs are wired up in MethodDesc::DoPrestub(..)
, in this section of code or this section for COM Interop. The stubs generated include the following (definitions taken from BOTR - ‘Kinds of MethodDescs’, also see enum MethodClassification
here):
- Instantiating (
FEATURE_SHARE_GENERIC_CODE
, on by default) in MakeInstantiatingStubWorker(..)
here
- Used for less common IL methods that have generic instantiation or that do not have preallocated slot in method table.
- P/Invoke (a.k.a NDirect) in
GetStubForInteropMethod(..)
here
- P/Invoke methods. These are methods marked with DllImport attribute.
- FCall methods in
ECall::GetFCallImpl(..)
here
- Internal methods implemented in unmanaged code. These are methods marked with
MethodImplAttribute(MethodImplOptions.InternalCall)
attribute, delegate constructors and tlbimp constructors.
- Array methods in
GenerateArrayOpStub(..)
here
- Array methods whose implementation is provided by the runtime (Get, Set, Address)
- EEImpl in
PCODE COMDelegate::GetInvokeMethodStub(EEImplMethodDesc* pMD)
here
- Delegate methods, implementation provided by the runtime
- COM Interop (
FEATURE_COMINTEROP
, on by default) in GetStubForInteropMethod(..)
here
- COM interface methods. Since the non-generic interfaces can be used for COM interop by default, this kind is usually used for all interface methods.
- Unboxing in
Stub * MakeUnboxingStubWorker(MethodDesc *pMD)
here
Right, now let’s look at the individual stubs in more detail.
Precode
First up, we’ll take a look at ‘precode’ stubs, because they are used by all other types of stubs, as explained in the BotR page on Method Descriptors:
The precode is a small fragment of code used to implement temporary entry points and an efficient wrapper for stubs. Precode is a niche code-generator for these two cases, generating the most efficient code possible. In an ideal world, all native code dynamically generated by the runtime would be produced by the JIT. That’s not feasible in this case, given the specific requirements of these two scenarios. The basic precode on x86 may look like this:
mov eax,pMethodDesc // Load MethodDesc into scratch register
jmp target // Jump to a target
Efficient Stub wrappers: The implementation of certain methods (e.g. P/Invoke, delegate invocation, multi dimensional array setters and getters) is provided by the runtime, typically as hand-written assembly stubs. Precode provides a space-efficient wrapper over stubs, to multiplex them for multiple callers.
The worker code of the stub is wrapped by a precode fragment that can be mapped to the MethodDesc and that jumps to the worker code of the stub. The worker code of the stub can be shared between multiple methods this way. It is an important optimization used to implement P/Invoke marshalling stubs.
By providing a ‘pointer’ to the MethodDesc class, the precode allows any subsequent stub to have access to a lot of information about a method call and its containing Type via the MethodTable (‘hot’) and EEClass (‘cold’) data structures. The MethodDesc data-structure is one of the most fundamental types in the runtime, hence why it has its own BotR page.
Each ‘precode’ is created in MethodDesc::GetOrCreatePrecode()
here and there are several different types as we can see in this enum
from /vm/precode.h:
enum PrecodeType {
PRECODE_INVALID = InvalidPrecode::Type,
PRECODE_STUB = StubPrecode::Type,
#ifdef HAS_NDIRECT_IMPORT_PRECODE
PRECODE_NDIRECT_IMPORT = NDirectImportPrecode::Type,
#endif // HAS_NDIRECT_IMPORT_PRECODE
#ifdef HAS_FIXUP_PRECODE
PRECODE_FIXUP = FixupPrecode::Type,
#endif // HAS_FIXUP_PRECODE
#ifdef HAS_THISPTR_RETBUF_PRECODE
PRECODE_THISPTR_RETBUF = ThisPtrRetBufPrecode::Type,
#endif // HAS_THISPTR_RETBUF_PRECODE
};
As always, the BotR page describes the different types in great detail, but in summary:
- StubPrecode - .. is the basic precode type. It loads MethodDesc into a scratch register and then jumps. It must be implemented for precodes to work. It is used as fallback when no other specialized precode type is available.
- FixupPrecode - .. is used when the final target does not require MethodDesc in scratch register. The FixupPrecode saves a few cycles by avoiding loading MethodDesc into the scratch register. The most common usage of FixupPrecode is for method fixups in NGen images.
- ThisPtrRetBufPrecode - .. is used to switch a return buffer and the this pointer for open instance delegates returning valuetypes. It is used to convert the calling convention of
MyValueType Bar(Foo x)
to the calling convention of MyValueType Foo::Bar()
.
- NDirectImportPrecode (a.k.a P/Invoke) - .. is used for lazy binding of unmanaged P/Invoke targets. This precode is for convenience and to reduce amount of platform specific plumbing.
Finally, to give you an idea of some real-world scenarios for ‘precode’ stubs, take a look at this comment from the DoesSlotCallPrestub(..)
method (AMD64):
// AMD64 has the following possible sequences for prestub logic:
// 1. slot -> temporary entrypoint -> prestub
// 2. slot -> precode -> prestub
// 3. slot -> precode -> jumprel64 (jump stub) -> prestub
// 4. slot -> precode -> jumprel64 (NGEN case) -> prestub
‘Just-in-time’ (JIT) and ‘Tiered’ Compilation
However, another piece of functionality that ‘precodes’ provide is related to ‘just-in-time’ (JIT) compilation, again from the BotR page:
Temporary entry points: Methods must provide entry points before they are jitted so that jitted code has an address to call them. These temporary entry points are provided by precode. They are a specific form of stub wrappers.
This technique is a lazy approach to jitting, which provides a performance optimization in both space and time. Otherwise, the transitive closure of a method would need to be jitted before it was executed. This would be a waste, since only the dependencies of taken code branches (e.g. if statement) require jitting.
Each temporary entry point is much smaller than a typical method body. They need to be small since there are a lot of them, even at the cost of performance. The temporary entry points are executed just once before the actual code for the method is generated.
So these ‘temporary entry points’ provide something concrete that can be referenced before a method has been JITted. They then trigger the JIT-compilation which does the job of generating the native code for a method. The entire process looks like this (dotted lines represent a pointer indirection, solid lines are a ‘control transfer’ e.g. a jmp/call assembly instruction):
Before JITing
Here we see the ‘temporary entry point’ pointing to the ‘fixup precode’, which ultimately calls into the PrestubWorker()
function here.
After JITing
Once the method has been JITted, we can see that the PrestubWorker
is now out of the picture and instead we have the native code for the function. In addition, there is now a ‘stable entry point’ that can be used by any other code that wants to execute the function. Also, we can see that the ‘fixup precode’ has been ‘backpatched’ to also point at the ‘native code’. For an idea of how this ‘back-patching’ works, see the StubPrecode::SetTargetInterlocked(..)
method here (ARM64).
After JITing - Tiered Compilation
However, there is also another ‘after’ scenario, now that .NET Core has ‘Tiered Compilation’. Here we see that the ‘stable entry point’ still goes via the ‘fixup precode’, it doesn’t directly call into the ‘native code’. This is because ‘tiered compilation’ counts how many times a method is called and once it decides the method is ‘hot’, it re-compiles a more optimised version that will give better performance. This ‘call counting’ takes place in this code in MethodDesc::DoPrestub(..)
which calls into CodeVersionManager::PublishNonJumpStampVersionableCodeIfNecessary(..)
here and then if shouldCountCalls
is true, it ends up calling CallCounter::OnMethodCodeVersionCalledSubsequently(..)
here.
What’s been interesting to watch during the development of ‘tiered compilation’ is that (not surprisingly) there has been a significant amount of work to ensure that the extra level of indirection doesn’t make the entire process slower, for instance see Patch vtable slots and similar when tiering is enabled #21292.
Like all the other stubs, ‘precodes’ have different versions for different CPU architectures. As a reference, the list below contains links to all of them:
Precodes
(a.k.a ‘Precode Fixup Thunk’):
ThePreStub
:
PreStubWorker(..)
in /vm/prestub.cpp
MethodDesc::DoPrestub(..)
here
MethodDesc::DoBackpatch(..)
here
Finally, for even more information on the JITing process, see:
Stubs-as-IL
‘Stubs as IL’ actually describes several types of individual stubs, but what they all have in common is they’re generated from ‘Intermediate Language’ (IL) which is then compiled by the JIT, in exactly the same way it handles the code we write (after it’s first been compiled from C#/F#/VB.NET into IL by another compiler).
This makes sense, it’s far easier to write the IL once and then have the JIT worry about compiling it for different CPU architectures, rather than having to write raw assembly each time (for x86/x64/arm/etc). However all stubs were hand-written assembly in .NET Framework 1.0:
What you have described is how it actually works. The only difference is that the shuffle thunk is hand-emitted in assembly and not generated by the JIT for historic reasons. All stubs (including all interop stubs) were hand-emitted like this in .NET Framework 1.0. Starting with .NET Framework 2.0, we have been converting the stubs to be generated by the JIT (the runtime generates IL for the stub, and then the JIT compiles the IL as regular method). The shuffle thunk is one of the few remaining ones not converted yet. Also, we have the IL path on some platforms but not others - FEATURE_STUBS_AS_IL
is related to it.
In the CoreCLR source code, ‘stubs as IL’ are controlled by the feature flag FEATURE_STUBS_AS_IL
, with the following additional flags for each specific type:
- StubsAsIL
- ArrayStubAsIL
- MulticastStubAsIL

On Windows, only some features are implemented with IL stubs (see this code); e.g. ‘ArrayStubAsIL’ is disabled on ‘x86’, but enabled elsewhere. On Unix, they are all done in IL, regardless of CPU architecture, as this code shows.
Finally, here’s the complete list of stubs that can be implemented in IL from /vm/ilstubresolver.h:
enum ILStubType
{
Unassigned = 0,
CLRToNativeInteropStub,
CLRToCOMInteropStub,
CLRToWinRTInteropStub,
NativeToCLRInteropStub,
COMToCLRInteropStub,
WinRTToCLRInteropStub,
#ifdef FEATURE_ARRAYSTUB_AS_IL
ArrayOpStub,
#endif
#ifdef FEATURE_MULTICASTSTUB_AS_IL
MulticastDelegateStub,
#endif
#ifdef FEATURE_STUBS_AS_IL
SecureDelegateStub,
UnboxingILStub,
InstantiatingStub,
#endif
};
But the usage of IL stubs has grown over time and it seems that they are the preferred mechanism where possible as they’re easier to write and debug. See [x86/Linux] Enable FEATURE_ARRAYSTUB_AS_IL, Switch multicast delegate stub on Windows x64 to use stubs-as-il and Fix GenerateShuffleArray to support cyclic shuffles #26169 (comment) for more information.
P/Invoke, Reverse P/Invoke and ‘calli’
All these stubs have one thing in common, they allow a transition between ‘managed’ and ‘un-managed’ (or native) code. To make this safe and to preserve the guarantees that the .NET runtime provides, stubs are used every time the transition is made.
This entire process is outlined in great detail in the BotR page CLR ABI - PInvokes, from the ‘Per-call-site PInvoke work’ section:
- For direct calls, the JITed code sets
InlinedCallFrame->m_pDatum
to the MethodDesc of the call target.
- For JIT64, indirect calls within IL stubs sets it to the secret parameter (this seems redundant, but it might have changed since the per-frame initialization?).
- For JIT32 (ARM) indirect calls, it sets this member to the size of the pushed arguments, according to the comments. The implementation however always passed 0.
- For JIT64/AMD64 only: Next for non-IL stubs, the InlinedCallFrame is ‘pushed’ by setting
Thread->m_pFrame
to point to the InlinedCallFrame (recall that the per-frame initialization already set InlinedCallFrame->m_pNext
to point to the previous top). For IL stubs this step is accomplished in the per-frame initialization.
- The Frame is made active by setting
InlinedCallFrame->m_pCallerReturnAddress
.
- The code then toggles the GC mode by setting
Thread->m_fPreemptiveGCDisabled = 0
.
- Starting now, no GC pointers may be live in registers. RyuJit LSRA meets this requirement by adding special refPositon
RefTypeKillGCRefs
before unmanaged calls and special helpers.
- Then comes the actual call/PInvoke.
- The GC mode is set back by setting
Thread->m_fPreemptiveGCDisabled = 1
.
- Then we check to see if
g_TrapReturningThreads
is set (non-zero). If it is, we call CORINFO_HELP_STOP_FOR_GC
.
- For ARM, this helper call preserves the return register(s):
R0
, R1
, S0
, and D0
.
- For AMD64, the generated code must manually preserve the return value of the PInvoke by moving it to a non-volatile register or a stack location.
- Starting now, GC pointers may once again be live in registers.
- Clear the
InlinedCallFrame->m_pCallerReturnAddress
back to 0.
- For JIT64/AMD64 only: For non-IL stubs ‘pop’ the Frame chain by resetting
Thread->m_pFrame
back to InlinedCallFrame.m_pNext
.
Saving/restoring all the non-volatile registers helps by preventing any registers that are unused in the current frame from accidentally having a live GC pointer value from a parent frame. The argument and return registers are ‘safe’ because they cannot be GC refs. Any refs should have been pinned elsewhere and instead passed as native pointers.
For IL stubs, the Frame chain isn’t popped at the call site, so instead it must be popped right before the epilog and right before any jmp calls. It looks like we do not support tail calls from PInvoke IL stubs?
As you can see, quite a bit of the work is to keep the Garbage Collector (GC) happy. This makes sense because once execution moves into un-managed/native code the .NET runtime has no control over what’s happening, so it needs to ensure that the GC doesn’t clean up or move around objects that are being used in the native code. It achieves this by constraining what the GC can do (on the current thread) from the time execution moves into un-managed code and keeps that in place until it returns back to the managed side.
On top of that, there needs to be support for ‘stack walking’ or ‘unwinding’, to allow debugging and to produce meaningful stack traces. This is done by setting up frames that are put in place when control transitions from managed -> un-managed, before being removed (‘popped’) when transitioning back. Here’s a list of the different scenarios that are covered, from /vm/frames.h:
This is the list of Interop stubs & transition helpers with information
regarding what (if any) Frame they used and where they were set up:
P/Invoke:
JIT inlined: The code to call the method is inlined into the caller by the JIT.
InlinedCallFrame is erected by the JITted code.
Requires marshaling: The stub does not erect any frames explicitly but contains
an unmanaged CALLI which turns it into the JIT inlined case.
Delegate over a native function pointer:
The same as P/Invoke but the raw JIT inlined case is not present (the call always
goes through an IL stub).
Calli:
The same as P/Invoke.
PInvokeCalliFrame is erected in stub generated by GenerateGetStubForPInvokeCalli
before calling to GetILStubForCalli which generates the IL stub. This happens only
the first time a call via the corresponding VASigCookie is made.
ClrToCom:
Late-bound or eventing: The stub is generated by GenerateGenericComplusWorker
(x86) or exists statically as GenericComPlusCallStub[RetBuffArg] (64-bit),
and it erects a ComPlusMethodFrame frame.
Early-bound: The stub does not erect any frames explicitly but contains an
unmanaged CALLI which turns it into the JIT inlined case.
ComToClr:
Normal stub:
Interpreted: The stub is generated by ComCall::CreateGenericComCallStub
(in ComToClrCall.cpp) and it erects a ComMethodFrame frame.
Prestub:
The prestub is ComCallPreStub (in ComCallableWrapper.cpp) and it erects a ComPrestubMethodFrame frame.
Reverse P/Invoke (used for C++ exports & fixups as well as delegates
obtained from function pointers):
Normal stub:
x86: The stub is generated by UMEntryThunk::CompileUMThunkWorker
(in DllImportCallback.cpp) and it is frameless. It calls directly
the managed target or to IL stub if marshaling is required.
non-x86: The stub exists statically as UMThunkStub and calls to IL stub.
Prestub:
The prestub is generated by GenerateUMThunkPrestub (x86) or exists statically
as TheUMEntryPrestub (64-bit), and it erects an UMThkCallFrame frame.
Reverse P/Invoke AppDomain selector stub:
The asm helper is IJWNOADThunkJumpTarget (in asmhelpers.asm) and it is frameless.
The P/Invoke IL stubs are wired up in the MethodDesc::DoPrestub(..)
method (note that P/Invoke is also known as ‘NDirect’), in addition they are also created here when being used for ‘COM Interop’. That code then calls into GetStubForInteropMethod(..)
in /vm/dllimport.cpp, before branching off to handle each case:
- P/Invoke calls into
NDirect::GetStubForILStub(..)
here
- Reverse P/Invoke calls into another overload of
NDirect::GetStubForILStub(..)
here
- COM Interop goes to
ComPlusCall::GetStubForILStub(..)
here in /vm/clrtocomcall.cpp
- EE implemented methods end up in
COMDelegate::GetStubForILStub(..)
here (for more info on EEImpl
methods see ‘Kinds of MethodDescs’)
There are also hand-written assembly stubs for the different scenarios, such as JIT_PInvokeBegin
, JIT_PInvokeEnd
and VarargPInvokeStub
, these can be seen in the files below:
As an example, calli
method calls (see OpCodes.Calli) end up in GenericPInvokeCalliHelper
, which has a nice bit of ASCII art in the i386 version:
// stack layout at this point:
//
// | ... |
// | stack arguments | ESP + 16
// +----------------------+
// | VASigCookie* | ESP + 12
// +----------------------+
// | return address | ESP + 8
// +----------------------+
// | CALLI target address | ESP + 4
// +----------------------+
// | stub entry point | ESP + 0
// ------------------------
However, all these stubs can have an adverse impact on start-up time, see Large numbers of Pinvoke stubs created on startup for example. This impact has been mitigated by compiling the stubs ‘Ahead-of-Time’ (AOT) and storing them in the ‘Ready-to-Run’ images (replacement format for NGEN (Native Image Generator)). From R2R ilstubs:
IL stub generation for interop takes measurable time at startup, and it is possible to generate some of them in an ahead of time
This change introduces ahead of time R2R compilation of IL stubs
Related work was done in Enable R2R compilation/inlining of PInvoke stubs where no marshalling is required and PInvoke stubs for Unix platforms (‘Enables inlining of PInvoke stubs for Unix platforms’).
Finally, for even more information on the issues involved, see:
Marshalling
However, dealing with the ‘managed’ to ‘un-managed’ transition is only one part of the story. The other is that there are also stubs created to deal with the ‘marshalling’ of arguments between the 2 sides. This process of ‘Interop Marshalling’ is explained nicely in the Microsoft docs:
Interop marshaling governs how data is passed in method arguments and return values between managed and unmanaged memory during calls. Interop marshaling is a run-time activity performed by the common language runtime’s marshaling service.
Most data types have common representations in both managed and unmanaged memory. The interop marshaler handles these types for you. Other types can be ambiguous or not represented at all in managed memory.
Like many stubs in the CLR, the marshalling stubs have evolved over time. As we can read in the excellent post Improvements to Interop Marshaling in V4: IL Stubs Everywhere:
History
The 1.0 and 1.1 versions of the CLR had several different techniques for creating and executing these stubs that were each designed for marshaling different types of signatures. These techniques ranged from directly generated x86 assembly instructions for simple signatures to generating specialized ML (an internal marshaling language) and running them through an internal interpreter for the most complicated signatures. This system worked well enough – although not without difficulties – in 1.0 and 1.1 but presented us with a serious maintenance problem when 2.0, and its support for multiple processor architectures, came around.
That’s right, there was an internal interpreter built into early versions of the .NET CLR that had the job of running the ‘marshalling language’ (ML) code!
However, it then goes on to explain why this process wasn’t sustainable:
We realized early in the process of adding 64 bit support to 2.0 that this approach was not sustainable across multiple architectures. Had we continued with the same strategy we would have had to create parallel marshaling infrastructures for each new architecture we supported (remember in 2.0 we introduced support for both x64 and IA64) which would, in addition to the initial cost, at least triple the cost of every new marshaling feature or bug fix. We needed one marshaling stub technology that would work on multiple processor architectures and could be efficiently executed on each one: enter IL stubs.
The solution was to implement all stubs using ‘Intermediate Language’ (IL) that is CPU-agnostic. Then the JIT-compiler is used to convert the IL into machine code for each CPU architecture, which makes sense because it’s exactly what the JIT is good at. Also worth noting is that this work still continues today, for instance see Implement struct marshalling via IL Stubs instead of via FieldMarshalers #26340.
Finally, there is a really nice investigation into the whole process in PInvoke: beyond the magic (also Compile time marshalling). What’s also nice is that you can use PerfView to see the stubs that the runtime generates.
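As a rough mental model of what such a marshalling stub has to do, here is a hypothetical C++ sketch (the ManagedString layout and all function names are invented for illustration, this is not the real CLR code):

```cpp
#include <cstring>
#include <cstddef>

// Simplified stand-in for a managed, length-prefixed string layout
struct ManagedString {
    int length;
    const char* chars;          // NOT guaranteed to be NUL-terminated
};

// The unmanaged function being P/Invoked
size_t UnmanagedStrlen(const char* s) { return std::strlen(s); }

// What a marshalling stub conceptually does around the call
size_t CallViaMarshallingStub(const ManagedString& ms) {
    // 1. allocate an unmanaged buffer
    char* native = new char[ms.length + 1];
    // 2. copy the characters across and NUL-terminate
    std::memcpy(native, ms.chars, ms.length);
    native[ms.length] = '\0';
    // 3. call the unmanaged target
    size_t result = UnmanagedStrlen(native);
    // 4. clean up the temporary unmanaged copy
    delete[] native;
    return result;
}
```

The real stubs also deal with pinning, GC transitions and error paths, but the core ‘copy arguments into the representation the other side expects’ job is the same.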
Generics
It is reasonably well known that generics in .NET use ‘code sharing’ to save space. That is, given a generic method such as public void Insert&lt;T&gt;(..), one method body of ‘native code’ will be created and shared by the instantiated types of Insert&lt;Foo&gt;(..) and Insert&lt;Bar&gt;(..) (assuming that Foo and Bar are reference types), but different versions will be created for Insert&lt;int&gt;(..) and Insert&lt;double&gt;(..) (as int/double are value types). This is possible for the reasons outlined by Jon Skeet in a StackOverflow question:
.. consider what the CLR needs to know about a type. It includes:
- The size of a value of that type (i.e. if you have a variable of some type, how much space will that memory need?)
- How to treat the value in terms of garbage collection: is it a reference to an object, or a value which may in turn contain other references?
For all reference types, the answers to these questions are the same. The size is just the size of a pointer, and the value is always just a reference (so if the variable is considered a root, the GC needs to recursively descend into it).
For value types, the answers can vary significantly.
But this poses a problem: what if the ‘shared’ method needs to do something specific for each type, like call typeof(T)?
This whole issue is explained in these 2 great posts, which I really recommend you take the time to read:
I’m not going to repeat what they cover here, except to say that (not surprisingly) ‘stubs’ are used to solve this issue, in conjunction with a ‘hidden’ parameter. These stubs are known as ‘instantiating’ stubs and we can find out more about them in this comment:
Instantiating Stubs - Return TRUE if this is a special stub used to implement an instantiated generic method or per-instantiation static method. The action of an instantiating stub is to pass on a MethodTable or InstantiatedMethodDesc extra argument to shared code
The different scenarios are handled in MakeInstantiatingStubWorker(..)
in /vm/prestub.cpp, you can see the check for HasMethodInstantiation
and the fall-back to a ‘per-instantiation static method’:
// It's an instantiated generic method
// Fetch the shared code associated with this instantiation
pSharedMD = pMD->GetWrappedMethodDesc();
_ASSERTE(pSharedMD != NULL && pSharedMD != pMD);
if (pMD->HasMethodInstantiation())
{
extraArg = pMD;
}
else
{
// It's a per-instantiation static method
extraArg = pMD->GetMethodTable();
}
Stub *pstub = NULL;
#ifdef FEATURE_STUBS_AS_IL
pstub = CreateInstantiatingILStub(pSharedMD, extraArg);
#else
CPUSTUBLINKER sl;
_ASSERTE(pSharedMD != NULL && pSharedMD != pMD);
sl.EmitInstantiatingMethodStub(pSharedMD, extraArg);
pstub = sl.Link(pMD->GetLoaderAllocator()->GetStubHeap());
#endif
As a reminder, FEATURE_STUBS_AS_IL
is defined for all Unix versions of the CoreCLR, but on Windows it’s only used with ARM64.
- When FEATURE_STUBS_AS_IL is defined, the code calls into CreateInstantiatingILStub(..) here. To get an overview of what it’s doing, we can take a look at the steps called-out in the code comments:
// 1. Build the new signature
here
// 2. Emit the method body
here
// 2.2 Push the rest of the arguments for x86
here
// 2.3 Push the hidden context param
here
// 2.4 Push the rest of the arguments for not x86
here
// 2.5 Push the target address
here
// 2.6 Do the calli
here
- When FEATURE_STUBS_AS_IL is not defined, per-CPU/OS versions of EmitInstantiatingMethodStub(..) are used; they exist for:
In the last case (EmitInstantiatingMethodStub(..) on ARM), the stub shares code with the instantiating version of the unboxing stub, so the heavy lifting is done in StubLinkerCPU::ThumbEmitCallWithGenericInstantiationParameter(..) here. This method is over 400 lines of fairly complex code, although there is also a nice piece of ASCII art (for info on why this ‘complex’ case is needed see this comment):
// Complex case where we need to emit a new stack frame and copy the arguments.
// Calculate the size of the new stack frame:
//
// +------------+
//     SP -> |            |
…
            srcofs & ShuffleEntry::OFSMASK));
    }
else if (pEntry->dstofs & ShuffleEntry::REGMASK)
{
// source must be on the stack
_ASSERTE(!(pEntry->srcofs & ShuffleEntry::REGMASK));
EmitLoadStoreRegImm(eLOAD, IntReg(pEntry->dstofs & ShuffleEntry::OFSMASK), RegSp, pEntry->srcofs * sizeof(void*));
}
else
{
// source must be on the stack
_ASSERTE(!(pEntry->srcofs & ShuffleEntry::REGMASK));
// dest must be on the stack
_ASSERTE(!(pEntry->dstofs & ShuffleEntry::REGMASK));
EmitLoadStoreRegImm(eLOAD, IntReg(9), RegSp, pEntry->srcofs * sizeof(void*));
EmitLoadStoreRegImm(eSTORE, IntReg(9), RegSp, pEntry->dstofs * sizeof(void*));
}
}
// Tailcall to target
// br x16
EmitJumpRegister(IntReg(16));
}
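The ‘hidden argument’ technique that instantiating stubs use can be sketched in C++ (all names here, MethodTable, SharedTypeOfName and so on, are invented for the illustration; the real runtime passes a genuine MethodTable or InstantiatedMethodDesc):

```cpp
#include <string>

// One shared 'native code' body serves every reference-type instantiation,
// and a hidden MethodTable-style argument supplies the type-specific
// information - i.e. what a call to typeof(T) would need.
struct MethodTable {
    const char* typeName;       // just enough state for the demo
};

// The single shared method body: note the extra, hidden argument
std::string SharedTypeOfName(const MethodTable* hiddenArg) {
    return hiddenArg->typeName; // behaves like typeof(T).Name
}

// One MethodTable per instantiated type
static const MethodTable FooMethodTable{"Foo"};
static const MethodTable BarMethodTable{"Bar"};

// The 'instantiating stubs': tiny thunks whose only job is to pass the
// correct hidden argument on to the shared code
std::string TypeOfName_Foo() { return SharedTypeOfName(&FooMethodTable); }
std::string TypeOfName_Bar() { return SharedTypeOfName(&BarMethodTable); }
```

Each stub is trivial, which is why the runtime can afford to create one per instantiation while still sharing the expensive method body.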
Unboxing
I’ve written about this type of ‘stub’ before in A look at the internals of ‘boxing’ in the CLR, but in summary the unboxing stub needs to handle steps 2) and 3) from the diagram below:
1. MyStruct: [0x05 0x00 0x00 0x00]
| Object Header | MethodTable | MyStruct |
2. MyStruct (Boxed): [0x40 0x5b 0x6f 0x6f 0xfe 0x7 0x0 0x0 0x5 0x0 0x0 0x0]
^
object 'this' pointer |
| Object Header | MethodTable | MyStruct |
3. MyStruct (Boxed): [0x40 0x5b 0x6f 0x6f 0xfe 0x7 0x0 0x0 0x5 0x0 0x0 0x0]
^
adjusted 'this' pointer |
Key to the diagram
- Original struct, on the stack
- The struct being boxed into an object that lives on the heap
- Adjustment made to the ‘this’ pointer so MyStruct::ToString() will work
These stubs make it possible for ‘value types’ (structs) to override methods from System.Object, such as ToString() and GetHashCode(). The fix-up is needed because structs don’t have an ‘object header’, but when they’re boxed into an Object they do. So the stub has the job of moving or adjusting the ‘this’ pointer, so that the code in the ToString() method can work the same regardless of whether it’s operating on a regular ‘struct’ or one that’s been boxed into an ‘object’.
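The pointer adjustment can be sketched in C++ with a simplified, hypothetical boxed layout (the real object header and MethodTable handling is more involved):

```cpp
#include <cstdint>
#include <cstddef>

// The 'value type' and the method it overrides from System.Object (simplified)
struct MyStruct {
    int32_t value;
    int32_t Get() const { return value; }   // stand-in for ToString()/GetHashCode()
};

// A simplified boxed layout: an object header (here just a MethodTable
// pointer) followed by the struct's payload
struct Box {
    void* methodTable;
    MyStruct payload;
};

// What an unboxing stub conceptually does: adjust 'this' past the object
// header so the struct's method sees only its own fields
int32_t CallViaUnboxingStub(void* boxedObj) {
    auto* adjustedThis = reinterpret_cast<MyStruct*>(
        reinterpret_cast<char*>(boxedObj) + offsetof(Box, payload));
    return adjustedThis->Get();
}
```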
The unboxing stubs are created in MethodDesc::DoPrestub(..)
here, which in turn calls into MakeUnboxingStubWorker(..)
here
- when
FEATURE_STUBS_AS_IL
is disabled it then calls EmitUnboxMethodStub(..)
to create the stub, there are per-CPU versions:
- when
FEATURE_STUBS_AS_IL
is enabled is instead calls into CreateUnboxingILStubForSharedGenericValueTypeMethods(..)
here
For more information on some of the internal details of unboxing stubs and how they interact with ‘generic instantiations’ see this informative comment and one in the code for MethodDesc::FindOrCreateAssociatedMethodDesc(..)
here.
Arrays
As discussed at the beginning, the method bodies for arrays are provided by the runtime; that is, the array access methods, ‘get’ and ‘set’, that allow var a = myArray[5] and myArray[7] = 5 to work. Not surprisingly, these are implemented as stubs so they can be as small and efficient as possible.
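As a rough sketch (not the generated stub itself), the core job of a single-dimensional array ‘get’ stub looks something like this:

```cpp
#include <cstddef>
#include <stdexcept>

// What an array 'get' stub must do: a bounds check, then compute the
// element address from the array base plus the scaled index
int ArrayGet(const int* base, std::size_t length, std::size_t index) {
    if (index >= length)
        throw std::out_of_range("IndexOutOfRangeException");  // the failure path
    return *(base + index);   // address = base + index * sizeof(int)
}
```

The real stubs additionally handle element-type checks for covariant arrays, multi-dimensional arrays and lower bounds, which is where most of the hundreds of lines go.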
Here is the flow for wiring up ‘array stubs’. It all starts up in MethodDesc::DoPrestub(..)
here:
- If FEATURE_ARRAYSTUB_AS_IL is defined (see ‘Stubs-as-IL’), it happens in GenerateArrayOpStub(ArrayMethodDesc* pMD) here
- Then ArrayOpLinker::EmitStub() here, which is responsible for generating 3 types of stubs: { ILSTUB_ARRAYOP_GET, ILSTUB_ARRAYOP_SET, ILSTUB_ARRAYOP_ADDRESS }.
- Before calling ILStubCache::CreateAndLinkNewILStubMethodDesc(..) here
- Finally ending up in JitILStub(..) here
- When FEATURE_ARRAYSTUB_AS_IL isn’t defined, it happens in another version of GenerateArrayOpStub(ArrayMethodDesc* pMD) lower down
- Then void GenerateArrayOpScript(..) here
- Followed by a call to StubCacheBase::Canonicalize(..) here, which ends up in ArrayStubCache::CompileStub(..) here.
- Eventually, we end up in StubLinkerCPU::EmitArrayOpStub(..) here, which does the heavy lifting (despite living under ‘\src\vm\i386’ it appears to support both x86 and AMD64)
I’m not going to include the code for the ‘stub-as-IL’ (ArrayOpLinker::EmitStub()) or the assembly code (StubLinkerCPU::EmitArrayOpStub(..)) versions of the array stubs because they’re both hundreds of lines long, dealing with type and bounds checking, address computation, multi-dimensional arrays and more. But to give an idea of the complexities, take a look at this comment from StubLinkerCPU::EmitArrayOpStub(..) here:
// Register usage
//
// x86 AMD64
// Inputs:
// managed array THIS_kREG (ecx) THIS_kREG (rcx)
// index 0 edx rdx
// index 1/value r8
// index 2/value r9
// expected element type for LOADADDR eax rax rdx
// Working registers:
// total (accumulates unscaled offset) edi r10
// factor (accumulates the slice factor) esi r11
Finally, these stubs are still being improved, for example see Use unsigned index extension in multi-dimensional array stubs.
Tail Calls
The .NET runtime provides a nice optimisation when doing ‘tail calls’, which (among other things) can prevent StackOverflowExceptions in recursive scenarios. For more on why these tail call optimisations are useful and how they work, take a look at:
In summary, a tail call optimisation allows the same stack frame to be re-used if, in the caller, no work is done after the function call to the callee (see Tail call JIT conditions (2007) for a more precise definition).
And why is this beneficial? From Tail Call Improvements in .NET Framework 4:
The primary reason for a tail call as an optimization is to improve data locality, memory usage, and cache usage. By doing a tail call the callee will use the same stack space as the caller. This reduces memory pressure. It marginally improves the cache because the same memory is reused for subsequent callers and thus can stay in the cache, rather than evicting some older cache line to make room for a new cache line.
To make this clear, the code below can benefit from the optimisation, because both functions return straight after calling each other:
public static long Ping(int cnt, long val)
{
if (cnt-- == 0)
return val;
return Pong(cnt, val + cnt);
}
public static long Pong(int cnt, long val)
{
if (cnt-- == 0)
return val;
return Ping(cnt, val + cnt);
}
However, if the code was changed to the version below, the optimisation would no longer work because PingNotOptimised(..)
does some extra work between calling Pong(..)
and when it returns:
public static long PingNotOptimised(int cnt, long val)
{
if (cnt-- == 0)
return val;
var result = Pong(cnt, val + cnt);
result += 1; // prevents the Tail-call optimization
return result;
}
public static long Pong(int cnt, long val)
{
if (cnt-- == 0)
return val;
return PingNotOptimised(cnt, val + cnt);
}
You can see the difference in the code emitted by the JIT compiler for the different scenarios in SharpLab.
But where do the ‘tail call optimisation stubs’ come into play? Helpfully there is a tail call related design doc that explains, from ‘current way of handling tail-calls’:
Fast tail calls
These are tail calls that are handled directly by the jitter and no runtime cooperation is needed. They are limited to cases where:
- Return value and call target arguments are all either primitive types, reference types, or valuetypes with a single primitive type or reference type fields
- The aligned size of call target arguments is less or equal to aligned size of caller arguments
So the stubs aren’t always needed; sometimes the work can be done by the JIT, if the scenario is simple enough.
However for the more complex cases, a ‘helper’ stub is needed:
Tail calls using a helper
Tail calls in cases where we cannot perform the call in a simple way are implemented using a tail call helper. Here is a rough description of how it works:
- For each tail call target, the jitter asks runtime to generate an assembler argument copying routine. This routine reads vararg list of arguments and places the arguments in their proper slots in the CONTEXT or on the stack. Together with the argument copying routine, the runtime also builds a list of offsets of references and byrefs for return value of reference type or structs returned in a hidden return buffer and for structs passed by ref. The gc layout data block is stored at the end of the argument copying thunk.
- At the time of the tail call, the caller generates a vararg list of all arguments of the tail called function and then calls
JIT_TailCall
runtime function. It passes it the copying routine address, the target address and the vararg list of the arguments.
- The
JIT_TailCall
then performs the following:
…
To see the rest of the steps that JIT_TailCall
takes you can read the design doc or if you’re really keen you can look at the code in /vm/jithelpers.cpp. Also, there’s a useful explanation of what it needs to handle in the JIT code, see here and here.
However, we’re just going to focus on the stubs, referred to as an ‘assembler argument copying routine’. Firstly, we can see that they have their own stub manager, TailCallStubManager, which is implemented here and allows the stubs to play nicely with the debugger. Also interesting to look at is the TailCallFrame here, which is used to ensure that the ‘stack walker’ works well with tail calls.
Now, onto the stubs themselves, the ‘copying routines’ are provided by the runtime via a call to CEEInfo::getTailCallCopyArgsThunk(..)
in /vm/jitinterface.cpp. This in turn calls the CPU specific versions of CPUSTUBLINKER::CreateTailCallCopyArgsThunk(..)
:
These routines have the complex and hairy job of dealing with the CPU registers and calling conventions. They achieve this by dynamically emitting assembly instructions, to create a function that looks like the following pseudo-code (the register names show this is the AMD64 version):
// size_t CopyArguments(va_list args, (RCX)
// CONTEXT *pCtx, (RDX)
// DWORD64 *pvStack, (R8)
// size_t cbStack) (R9)
// {
// if (pCtx != NULL) {
// foreach (arg in args) {
// copy into pCtx or pvStack
// }
// }
// return ;
// }
In addition there is one other type of stub that is used, known as the TailCallHelperStub; these also come in per-CPU versions:
Going forward, there are several limitations to this approach of using per-CPU stubs, as the design doc explains:
- It is expensive to port to new platforms
- Parsing the vararg list is not possible to do in a portable way on Unix. Unlike on Windows, the list is not stored as a linear sequence of the parameter data bytes in memory. va_list on Unix is an opaque data type, some of the parameters can be in registers and some in the memory.
- Generating the copying asm routine needs to be done for each target architecture / platform differently. And it is also very complex, error prone and impossible to do on platforms where code generation at runtime is not allowed.
- It is slower than it has to be
- The parameters are copied possibly twice - once from the vararg list to the stack and then one more time if there was not enough space in the caller’s stack frame.
RtlRestoreContext
restores all registers from the CONTEXT
structure, not just a subset of them that is really necessary for the functionality, so it results in another unnecessary memory accesses.
- Stack walking over the stack frames of the tail calls requires runtime assistance.
Fortunately, it then goes into great depth discussing how a new approach could be implemented and how it would solve these issues. Even better, work has already started and we can follow along in Implement portable tailcall helpers #26418 (currently sitting at ‘31 of 55’ tasks completed, with over 50 files modified, it’s not a small job!).
Finally, for other PRs related to tail calls, see:
Virtual Stub Dispatch (VSD)
I’ve saved the best for last: ‘Virtual Stub Dispatch’ or VSD is such an in-depth topic that it has an entire BotR page devoted to it! From the introduction:
Virtual stub dispatching (VSD) is the technique of using stubs for virtual method invocations instead of the traditional virtual method table. In the past, interface dispatch required that interfaces had process-unique identifiers, and that every loaded interface was added to a global interface virtual table map. This requirement meant that all interfaces and all classes that implemented interfaces had to be restored at runtime in NGEN scenarios, causing significant startup working set increases. The motivation for stub dispatching was to eliminate much of the related working set, as well as distribute the remaining work throughout the lifetime of the process.
It then goes on to say:
Although it is possible for VSD to dispatch both virtual instance and interface method calls, it is currently used only for interface dispatch.
So despite having the word ‘virtual’ in the title, it’s not actually used for C# methods with the virtual modifier on them. However, if you look at the IL for interface methods you can see why they are also known as ‘virtual’.
Virtual Stub Dispatch is so complex, it actually has several different stub types, from /vm/virtualcallstub.h:
enum StubKind {
SK_UNKNOWN,
SK_LOOKUP, // Lookup Stubs are SLOW stubs that simply call into the runtime to do all work.
SK_DISPATCH, // Dispatch Stubs have a fast check for one type otherwise jumps to runtime. Works for monomorphic sites
SK_RESOLVE, // Resolve Stubs do a hash lookup before fallling back to the runtime. Works for polymorphic sites.
SK_VTABLECALL, // Stub that jumps to a target method using vtable-based indirections. Works for non-interface calls.
SK_BREAKPOINT
};
So there are the following types (these are links to the AMD64 versions; the x86 versions are in /vm/i386/virtualcallstubcpu.hpp):
- Lookup Stubs:
// Virtual and interface call sites are initially setup to point at LookupStubs. This is because the runtime type of the pointer is not yet known, so the target cannot be resolved.
- Dispatch Stubs:
// Monomorphic and mostly monomorphic call sites eventually point to DispatchStubs. A dispatch stub has an expected type (expectedMT), target address (target) and fail address (failure). If the calling frame does in fact have the type be of the expected type, then control is transfered to the target address, the method implementation. If not, then control is transfered to the fail address, a fail stub (see below) where a polymorphic lookup is done to find the correct address to go to.
- There are also specific versions, DispatchStubShort and DispatchStubLong; see this comment for why both are needed.
- Resolve Stubs:
// Polymorphic call sites and monomorphic calls that fail end up in a ResolverStub. There is only one resolver stub built for any given token, even though there may be many call sites that use that token and many distinct types that are used in the calling call frames. A resolver stub actually has two entry points, one for polymorphic call sites and one for dispatch stubs that fail on their expectedMT test. There is a third part of the resolver stub that enters the ee when a decision should be made about changing the callsite.
- V-Table or Virtual Call Stubs
//These are jump stubs that perform a vtable-base virtual call. These stubs assume that an object is placed in the first argument register (this pointer). From there, the stub extracts the MethodTable pointer, followed by the vtable pointer, and finally jumps to the target method at a given slot in the vtable.
The diagram below shows the general control flow between these stubs:
(Image from ‘Design of Virtual Stub Dispatch’)
Finally, if you want even more in-depth information see this comment.
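To make the fast/slow split concrete, here is a hypothetical C++ sketch (all names are invented) of the guard a DispatchStub performs for the monomorphic case, with a ResolveStub-style hash lookup as the failure path:

```cpp
#include <unordered_map>

// Sketch of VSD control flow, not the real runtime code
struct MethodTable {};                 // stands in for a type's MethodTable
using Target = int (*)();

int FooImpl() { return 1; }            // Foo's implementation of the interface method
int BarImpl() { return 2; }            // Bar's implementation

MethodTable fooMT, barMT;

// ResolveStub-style slow path: a hash lookup from type to target
std::unordered_map<const MethodTable*, Target> resolveCache = {
    {&fooMT, &FooImpl},
    {&barMT, &BarImpl},
};
int ResolveStub(const MethodTable* mt) { return resolveCache.at(mt)(); }

// DispatchStub: a cheap guard against the one expected type (the
// monomorphic case), falling back to the resolver when the guard fails
int DispatchStub(const MethodTable* actualMT) {
    if (actualMT == &fooMT)            // pointer compare against expectedMT
        return FooImpl();              // fast path: direct jump to the method
    return ResolveStub(actualMT);      // failure path: polymorphic lookup
}
```

A call site that only ever sees Foo pays just the pointer-compare; only genuinely polymorphic sites fall through to the hash lookup, which is the trade-off the BotR page describes.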
However, these stubs come at a cost: they make virtual method calls more expensive than direct ones. This is why de-virtualization is so important, i.e. the process of the .NET JIT detecting when a virtual call can be replaced by a direct one. There has been some work done in .NET Core to improve this, see Simple devirtualization #9230, which covers sealed classes/methods and the case when the object type is known exactly. However, there is still more to be done, as shown in JIT: devirtualization next steps #9908, where ‘5 of 23’ tasks have been completed.
Other Types of Stubs
This post is already way too long, so I don’t intend to offer any analysis of the following stubs. Instead I’ve just included some links to more information so you can read up on any that interest you!
‘Jump’ stubs
‘Function Pointer’ stubs
- ‘Function Pointer’ Stubs, see /vm/fptrstubs.cpp and /vm/fptrstubs.h
// FuncPtrStubs contains stubs that is used by GetMultiCallableAddrOfCode() if the function has not been jitted. Using a stub decouples ldftn from the prestub, so prestub does not need to be backpatched. This stub is also used in other places which need a function pointer
‘Thread Hijacking’ stubs
From the BotR page on ‘Threading’:
- If fully interruptable, it is safe to perform a GC at any point, since the thread is, by definition, at a safe point. It is reasonable to leave the thread suspended at this point (because it’s safe) but various historical OS bugs prevent this from working, because the CONTEXT retrieved earlier may be corrupt). Instead, the thread’s instruction pointer is overwritten, redirecting it to a stub that will capture a more complete CONTEXT, leave cooperative mode, wait for the GC to complete, reenter cooperative mode, and restore the thread to its previous state.
- If partially-interruptable, the thread is, by definition, not at a safe point. However, the caller will be at a safe point (method transition). Using that knowledge, the CLR “hijacks” the top-most stack frame’s return address (physically overwrite that location on the stack) with a stub similar to the one used for fully-interruptable code. When the method returns, it will no longer return to its actual caller, but rather to the stub (the method may also perform a GC poll, inserted by the JIT, before that point, which will cause it to leave cooperative mode and undo the hijack).
This is done with the OnHijackTripThread
method in /vm/amd64/AsmHelpers.asm, which calls into OnHijackWorker(..)
in /vm/threadsuspend.cpp.
‘NGEN Fixup’ stubs
From CLR Inside Out - The Performance Benefits of NGen (2006):
Throughput of NGen-compiled code is lower than that of JIT-compiled code primarily for one reason: cross-assembly references. In JIT-compiled code, cross-assembly references can be implemented as direct calls or jumps since the exact addresses of these references are known at run time. For statically compiled code, however, cross-assembly references need to go through a jump slot that gets populated with the correct address at run time by executing a method pre-stub. The method pre-stub ensures, among other things, that the native images for assemblies referenced by that method are loaded into memory before the method is executed. The pre-stub only needs to be executed the first time the method is called; it is short-circuited out for subsequent calls. However, every time the method is called, cross-assembly references do need to go through a level of indirection. This is principally what accounted for the 5-10 percent drop in throughput for NGen-compiled code when compared to JIT-compiled code.
Also see the ‘NGEN’ section of the ‘jump stub’ design doc.
Stubs in the Mono Runtime
Mono refers to ‘Stubs’ as ‘Trampolines’ and they’re widely used in the source code.
The Mono docs have an excellent page all about ‘Trampolines’, that lists the following types:
- JIT Trampolines
- Virtual Call Trampolines
- Jump Trampolines
- Class Init Trampolines
- Generic Class Init Trampoline
- RGCTX Lazy Fetch Trampolines
- AOT Trampolines
- Delegate Trampolines
- Monitor Enter/Exit Trampolines
Also the docs page on Generic Sharing has some good, in-depth information.
Conclusion
So it turns out that ‘stubs’ are way more prevalent in the .NET Core Runtime than I imagined when I first started on this post. They are an interesting technique and they contain a fair amount of complexity. In addition, I only covered each stub in isolation; in reality many of them have to play nicely together. For instance, imagine a delegate calling a virtual method that has generic type parameters and you can see that things start to get complex! (That scenario might involve 3 separate stubs, although they are also shared where possible.) If you were then to add array methods, P/Invoke marshalling and un-boxing to the mix, things get even more hairy and even more complex!
If anyone has read this far and wants a fun challenge: try to figure out the most stubs you can force a single method call to go via! If you do, let me know in the comments or via Twitter.
Finally, by knowing where and when stubs are involved in our method calls, we can start to understand the overhead of each scenario. For instance, it explains why delegate method calls are a bit slower than calling a method directly, and why ‘de-virtualization’ is so important. Having the JIT perform extra analysis to determine that a virtual call can be converted into a direct one skips an entire level of indirection; for more on this see:
Thu, 26 Sep 2019, 12:00 am
ASCII Art in .NET Code
Who doesn’t like a nice bit of ‘ASCII Art’? I know I certainly do!
To see what Matt’s CLR was all about, you can watch the recording of my talk ‘From ‘dotnet run’ to ‘Hello World!’’ (from about ~24:30 in).
So armed with a trusty regex /\*(.*?)\*/|//(.*?)\r?\n|"((\\[^\n]|[^"\n])*)"|@("[^"]*")+
, I set out to find all the interesting ASCII Art used in source code comments in the following .NET related repositories:
- dotnet/CoreCLR - “the runtime for .NET Core. It includes the garbage collector, JIT compiler, primitive data types and low-level classes.”
- Mono - “open source ECMA CLI, C# and .NET implementation.”
- dotnet/CoreFX - “the foundational class libraries for .NET Core. It includes types for collections, file systems, console, JSON, XML, async and many others.”
- dotnet/Roslyn - “provides C# and Visual Basic languages with rich code analysis APIs”
- aspnet/AspNetCore - “a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.”
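As an aside, a simplified version of the comment-matching part of that regex can be driven from std::regex; this is my own sketch, not the tooling actually used for the post:

```cpp
#include <regex>
#include <string>
#include <vector>

// Find /* ... */ block comments and // line comments in a source string,
// using a simplified form of the pattern from the post (the string-literal
// alternatives are omitted here)
std::vector<std::string> FindComments(const std::string& source) {
    static const std::regex commentRe(R"(/\*[\s\S]*?\*/|//[^\n]*)");
    std::vector<std::string> found;
    for (auto it = std::sregex_iterator(source.begin(), source.end(), commentRe);
         it != std::sregex_iterator(); ++it) {
        found.push_back(it->str());
    }
    return found;
}
```

The lazy `*?` in the block-comment alternative matters: without it the match would greedily swallow everything between the first `/*` and the last `*/` in the file.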
Note: Yes, I shamelessly ‘borrowed’ this idea from John Regehr, I was motivated to write this because his excellent post ‘Explaining Code using ASCII Art’ didn’t have any .NET related code in it!
If you’ve come across any interesting examples I’ve missed out, please let me know!
Table of Contents
To make the examples easier to browse, I’ve split them up into categories:
Dave Cutler
There’s no art in this one, but it deserves its own category as it quotes the amazing Dave Cutler, who led the development of Windows NT. Therefore there’s no better person to ask a deep, technical question about how Thread Suspension works on Windows. From coreclr/src/vm/threadsuspend.cpp:
// Message from David Cutler
/*
After SuspendThread returns, can the suspended thread continue to execute code in user mode?
[David Cutler] The suspended thread cannot execute any more user code, but it might be currently "running"
on a logical processor whose other logical processor is currently actually executing another thread.
In this case the target thread will not suspend until the hardware switches back to executing instructions
on its logical processor. In this case even the memory barrier would not necessarily work - a better solution
would be to use interlocked operations on the variable itself.
After SuspendThread returns, does the store buffer of the CPU for the suspended thread still need to drain?
Historically, we've assumed that the answer to both questions is No. But on one 4/8 hyper-threaded machine
running Win2K3 SP1 build 1421, we've seen two stress failures where SuspendThread returns while writes seem to still be in flight.
Usually after we suspend a thread, we then call GetThreadContext. This seems to guarantee consistency.
But there are places we would like to avoid GetThreadContext, if it's safe and legal.
[David Cutler] Get context delivers a APC to the target thread and waits on an event that will be set
when the target thread has delivered its context.
Chris.
*/
For more info on Dave Cutler, see this excellent interview ‘Internets of Interest #6: Dave Cutler on Dave Cutler’ or ‘The engineer’s engineer: Computer industry luminaries salute Dave Cutler’s five-decade-long quest for quality’
Syntax Trees
The inner workings of the .NET ‘Just-in-Time’ (JIT) Compiler have always been a bit of a mystery to me. But having informative comments like this one from coreclr/src/jit/lsra.cpp goes some way to showing what it’s doing:
// For example, for this tree (numbers are execution order, lower is earlier and higher is later):
//
//                  +---------+----------+
//                  |      GT_ADD (3)    |
//                  +---------+----------+
//                            |
//                          /    \
//                        /        \
//                      /            \
//   +-------------------+        +----------------------+
//   |       x (1)       | "tree" |        y (2)         |
//   +-------------------+        +----------------------+
//
// generate this tree:
//
//                  +---------+----------+
//                  |      GT_ADD (4)    |
//                  +---------+----------+
//                            |
//                          /    \
//                        /        \
//                      /            \
//   +-------------------+        +----------------------+
//   |   GT_RELOAD (3)   |        |        y (2)         |
//   +-------------------+        +----------------------+
//             |
//   +-------------------+
//   |       x (1)       | "tree"
//   +-------------------+
There’s also a more in-depth example in coreclr/src/jit/morph.cpp
Also from roslyn/src/Compilers/VisualBasic/Portable/Semantics/TypeInference/RequiredConversion.vb
'// These restrictions form a partial order composed of three chains: from less strict to more strict, we have:
'// [reverse chain] [None] < AnyReverse < ReverseReference < Identity
'// [middle chain] None < [Any,AnyReverse] < AnyConversionAndReverse < Identity
'// [forward chain] [None] < Any < ArrayElement < Reference < Identity
'//
'// = KEY:
'// / | \ = Identity
'// / | \ +r Reference
'// -r | +r -r ReverseReference
'// | +-any | +-any AnyConversionAndReverse
'// | /|\ +arr +arr ArrayElement
'// | / | \ | +any Any
'// -any | +any -any AnyReverse
'// \ | / none None
'// \ | /
'// none
'//
Timelines
This example from coreclr/src/vm/comwaithandle.cpp was unique! I didn’t find another example of ASCII Art used to illustrate time-lines; it’s a really novel approach.
// In case the CLR is paused inbetween a wait, this method calculates how much
// the wait has to be adjusted to account for the CLR Freeze. Essentially all
// pause duration has to be considered as "time that never existed".
//
// Two cases exists, consider that 10 sec wait is issued
// Case 1: All pauses happened before the wait completes. Hence just the
// pause time needs to be added back at the end of wait
// 0 3 8 10
// |-----------|###################|------>
// 5-sec pause
// ....................>
// Additional 5 sec wait
// |=========================>
//
// Case 2: Pauses ended after the wait completes.
// 3 second of wait was left as the pause started at 7 so need to add that back
// 0 7 10
// |---------------------------|###########>
// 5-sec pause 12
// ...................>
// Additional 3 sec wait
// |==================>
//
// Both cases can be expressed in the same calculation
// pauseTime: sum of all pauses that were triggered after the timer was started
// expDuration: expected duration of the wait (without any pauses) 10 in the example
// actDuration: time when the wait finished. Since the CLR is frozen during pause it's
// max of timeout or pause-end. In case-1 it's 10, in case-2 it's 12
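The calculation hinted at in the last few lines of the comment can be sketched as follows. Note this is a paraphrase of what the comment describes, not the actual CLR code; the variable names just follow the comment:

```csharp
// Sketch of the adjustment described in the comment above (illustrative,
// not the real CLR code): pause time "never existed", so the wait has to
// be extended by whatever part of the pause wasn't already absorbed.
static int AdjustedWait(int expDuration, int actDuration, int pauseTime)
{
    // How far the wait ran past its expected duration
    // (0 in case 1, 2 in case 2 above)
    int overrun = actDuration - expDuration;
    int additionalWait = pauseTime - overrun;
    return additionalWait > 0 ? additionalWait : 0;
}
```

With the numbers from the diagrams: case 1 gives `AdjustedWait(10, 10, 5) == 5` and case 2 gives `AdjustedWait(10, 12, 5) == 3`.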
Logic Tables
A sweet-spot for ASCII art seems to be tables; there are so many examples. Starting with coreclr/src/vm/methodtablebuilder.cpp (bonus points for combining comments and code together!)
// | Base type
// Subtype | mdPrivateScope mdPrivate mdFamANDAssem mdAssem mdFamily mdFamORAssem mdPublic
// --------------+-------------------------------------------------------------------------------------------------------
/*mdPrivateScope | */ { { e_SM, e_NO, e_NO, e_NO, e_NO, e_NO, e_NO },
/*mdPrivate | */ { e_SM, e_YES, e_NO, e_NO, e_NO, e_NO, e_NO },
/*mdFamANDAssem | */ { e_SM, e_YES, e_SA, e_NO, e_NO, e_NO, e_NO },
/*mdAssem | */ { e_SM, e_YES, e_SA, e_SA, e_NO, e_NO, e_NO },
/*mdFamily | */ { e_SM, e_YES, e_YES, e_NO, e_YES, e_NSA, e_NO },
/*mdFamORAssem | */ { e_SM, e_YES, e_YES, e_SA, e_YES, e_YES, e_NO },
/*mdPublic | */ { e_SM, e_YES, e_YES, e_YES, e_YES, e_YES, e_YES } };
Also coreclr/src/jit/importer.cpp which shows how the JIT deals with boxing/un-boxing
/*
----------------------------------------------------------------------
| \ helper | | |
| \ | | |
| \ | CORINFO_HELP_UNBOX | CORINFO_HELP_UNBOX_NULLABLE |
| \ | (which returns a BYREF) | (which returns a STRUCT) |
| opcode \ | | |
|---------------------------------------------------------------------
| UNBOX | push the BYREF | spill the STRUCT to a local, |
| | | push the BYREF to this local |
|---------------------------------------------------------------------
| UNBOX_ANY | push a GT_OBJ of | push the STRUCT |
| | the BYREF | For Linux when the |
| | | struct is returned in two |
| | | registers create a temp |
| | | which address is passed to |
| | | the unbox_nullable helper. |
|---------------------------------------------------------------------
*/
Finally, there’s some other nice examples showing the rules for operator overloading in the C# (Roslyn) Compiler and which .NET data-types can be converted via the System.ToXXX()
functions.
Class Hierarchies
Of course, most IDEs come with tools that will generate class hierarchies for you, but it’s much nicer to see them in ASCII, from coreclr/src/vm/object.h
* COM+ Internal Object Model
*
*
* Object - This is the common base part to all COM+ objects
* | it contains the MethodTable pointer and the
* | sync block index, which is at a negative offset
* |
* +-- code:StringObject - String objects are specialized objects for string
* | storage/retrieval for higher performance
* |
* +-- BaseObjectWithCachedData - Object Plus one object field for caching.
* | |
* | +- ReflectClassBaseObject - The base object for the RuntimeType class
* | +- ReflectMethodObject - The base object for the RuntimeMethodInfo class
* | +- ReflectFieldObject - The base object for the RtFieldInfo class
* |
* +-- code:ArrayBase - Base portion of all arrays
* | |
* | +- I1Array - Base type arrays
* | | I2Array
* | | ...
* | |
* | +- PtrArray - Array of OBJECTREFs, different than base arrays because of pObjectClass
* |
* +-- code:AssemblyBaseObject - The base object for the class Assembly
There’s also an even larger one that I stumbled across when writing “Stack Walking” in the .NET Runtime.
Component Diagrams
When you have several different components in a code-base it’s always nice to see how they fit together. From coreclr/src/vm/codeman.h we can see how the top-level parts of the .NET JIT work together:
ExecutionManager
|
+-----------+---------------+---------------+-----------+--- ...
| | | |
CodeType | CodeType |
| | | |
v v v v
+---------------+ +--------+
Memory Layouts
From mono/mono/utils/dlmalloc.c (Mono’s embedded copy of Doug Lea’s ‘dlmalloc’ allocator) we can see the layout of an allocated ‘malloc chunk’:
chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Size of previous chunk (if P = 1) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |P|
| Size of this chunk 1| +-+
mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+- -+
| |
+- -+
| :
+- size - sizeof(size_t) available payload bytes -+
: |
chunk-> +- -+
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |1|
| Size of next chunk (may or may not be in use) | +-+
mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
And if it's free, it looks like this:
chunk-> +- -+
| User payload (must be in use, or we would have merged!) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |P|
| Size of this chunk 0| +-+
mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Next pointer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Prev pointer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| :
+- size - sizeof(struct chunk) unused bytes -+
: |
chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Size of this chunk |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |0|
| Size of next chunk (must be in use, or we would have merged)| +-+
mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| :
+- User payload -+
: |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|
+-+
Also, from corefx/src/Common/src/CoreLib/System/MemoryExtensions.cs we can see how overlapping memory regions are detected:
// Visually, the two sequences are located somewhere in the 32-bit
// address space as follows:
//
// [----------------------------------------------) normal address space
// 0 2³²
// [------------------) first sequence
// xRef xRef + xLength
// [--------------------------) . second sequence
// yRef . yRef + yLength
// : . . .
// : . . .
// . . .
// . . .
// . . .
// [----------------------------------------------) relative address space
// 0 . . 2³²
// [------------------) : first sequence
// x1 . x2 :
// -------------) [------------- second sequence
// y2 y1
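The trick shown in the diagram, shifting both sequences into a ‘relative address space’ so that unsigned wrap-around does the work, can be sketched with plain `uint` arithmetic. This is a simplification of the real code, which operates on managed references rather than raw addresses:

```csharp
// Sketch of the overlap test illustrated above (simplified; the real
// implementation works on refs, not integer addresses). Subtracting xRef
// maps the first sequence to [0, xLength); with unsigned wrap-around the
// sequences overlap iff the relative start of the second sequence lands
// either inside the first sequence, or "negative" by less than yLength.
static bool Overlaps(uint xRef, uint xLength, uint yRef, uint yLength)
{
    if (xLength == 0 || yLength == 0)
        return false; // an empty sequence overlaps nothing

    uint relativeY = unchecked(yRef - xRef); // 'y1' in the diagram
    return relativeY < xLength || relativeY > unchecked(0u - yLength);
}
```

For example, `Overlaps(100, 10, 105, 10)` is true (the ranges share 105..109), while `Overlaps(100, 10, 110, 10)` is false (they are merely adjacent).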
State Machines
This comment from mono/benchmark/zipmark.cs gives a great overview of the implementation of RFC 1951 - DEFLATE Compressed Data Format Specification
/*
* The Deflater can do the following state transitions:
*
* (1) -> INIT_STATE ----> INIT_FINISHING_STATE ---.
* / | (2) (5) |
* / v (5) |
* (3)| SETDICT_STATE ---> SETDICT_FINISHING_STATE |(3)
* \ | (3) | ,-------'
* | | | (3) /
* v v (5) v v
* (1) -> BUSY_STATE ----> FINISHING_STATE
* | (6)
* v
* FINISHED_STATE
* \_____________________________________/
* | (7)
* v
* CLOSED_STATE
*
* (1) If we should produce a header we start in INIT_STATE, otherwise
* we start in BUSY_STATE.
* (2) A dictionary may be set only when we are in INIT_STATE, then
* we change the state as indicated.
* (3) Whether a dictionary is set or not, on the first call of deflate
* we change to BUSY_STATE.
* (4) -- intentionally left blank -- :)
* (5) FINISHING_STATE is entered, when flush() is called to indicate that
* there is no more INPUT. There are also states indicating, that
* the header wasn't written yet.
* (6) FINISHED_STATE is entered, when everything has been flushed to the
* internal pending output buffer.
* (7) At any time (7)
*
*/
This might be pushing the definition of ‘state machine’ a bit far, but I wanted to include it because it shows just how complex ‘exception handling’ can be, from coreclr/src/jit/jiteh.cpp
// fgNormalizeEH: Enforce the following invariants:
//
// 1. No block is both the first block of a handler and the first block of a try. In IL (and on entry
// to this function), this can happen if the "try" is more nested than the handler.
//
// For example, consider:
//
// try1 ----------------- BB01
// | BB02
// |--------------------- BB03
// handler1
// |----- try2 ---------- BB04
// | | BB05
// | handler2 ------ BB06
// | | BB07
// | --------------- BB08
// |--------------------- BB09
//
// Thus, the start of handler1 and the start of try2 are the same block. We will transform this to:
//
// try1 ----------------- BB01
// | BB02
// |--------------------- BB03
// handler1 ------------- BB10 // empty block
// | try2 ---------- BB04
// | | BB05
// | handler2 ------ BB06
// | | BB07
// | --------------- BB08
// |--------------------- BB09
//
RFC’s and Specs
Next up, how the Kestrel web-server handles RFC 7540 - Hypertext Transfer Protocol Version 2 (HTTP/2).
Firstly, from aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2Frame.cs
/* https://tools.ietf.org/html/rfc7540#section-4.1
+-----------------------------------------------+
| Length (24) |
+---------------+---------------+---------------+
| Type (8) | Flags (8) |
+-+-------------+---------------+-------------------------------+
|R| Stream Identifier (31) |
+=+=============================================================+
| Frame Payload (0...) ...
+---------------------------------------------------------------+
*/
and then in aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2Frame.Headers.cs
/* https://tools.ietf.org/html/rfc7540#section-6.2
+---------------+
|Pad Length? (8)|
+-+-------------+-----------------------------------------------+
|E| Stream Dependency? (31) |
+-+-------------+-----------------------------------------------+
| Weight? (8) |
+-+-------------+-----------------------------------------------+
| Header Block Fragment (*) ...
+---------------------------------------------------------------+
| Padding (*) ...
+---------------------------------------------------------------+
*/
There are other notable examples in aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2FrameReader.cs and aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2FrameWriter.cs.
Also RFC 3986 - Uniform Resource Identifier (URI) is discussed in corefx/src/Common/src/System/Net/IPv4AddressHelper.Common.cs
Finally, RFC 7541 - HPACK: Header Compression for HTTP/2, is covered in aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/HPack/HPackDecoder.cs
// http://httpwg.org/specs/rfc7541.html#rfc.section.6.1
// 0 1 2 3 4 5 6 7
// +---+---+---+---+---+---+---+---+
// | 1 | Index (7+) |
// +---+---------------------------+
private const byte IndexedHeaderFieldMask = 0x80;
private const byte IndexedHeaderFieldRepresentation = 0x80;
// http://httpwg.org/specs/rfc7541.html#rfc.section.6.2.1
// 0 1 2 3 4 5 6 7
// +---+---+---+---+---+---+---+---+
// | 0 | 1 | Index (6+) |
// +---+---+-----------------------+
private const byte LiteralHeaderFieldWithIncrementalIndexingMask = 0xc0;
private const byte LiteralHeaderFieldWithIncrementalIndexingRepresentation = 0x40;
// http://httpwg.org/specs/rfc7541.html#rfc.section.6.2.2
// 0 1 2 3 4 5 6 7
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 0 | 0 | Index (4+) |
// +---+---+-----------------------+
private const byte LiteralHeaderFieldWithoutIndexingMask = 0xf0;
private const byte LiteralHeaderFieldWithoutIndexingRepresentation = 0x00;
// http://httpwg.org/specs/rfc7541.html#rfc.section.6.2.3
// 0 1 2 3 4 5 6 7
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 0 | 1 | Index (4+) |
// +---+---+-----------------------+
private const byte LiteralHeaderFieldNeverIndexedMask = 0xf0;
private const byte LiteralHeaderFieldNeverIndexedRepresentation = 0x10;
// http://httpwg.org/specs/rfc7541.html#rfc.section.6.3
// 0 1 2 3 4 5 6 7
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 1 | Max size (5+) |
// +---+---------------------------+
private const byte DynamicTableSizeUpdateMask = 0xe0;
private const byte DynamicTableSizeUpdateRepresentation = 0x20;
// http://httpwg.org/specs/rfc7541.html#rfc.section.5.2
// 0 1 2 3 4 5 6 7
// +---+---+---+---+---+---+---+---+
// | H | String Length (7+) |
// +---+---------------------------+
private const byte HuffmanMask = 0x80;
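Each mask/representation pair is used to classify the first byte of an encoded header field. A hedged sketch of that dispatch (the real decoder’s control flow is more involved, but the bit tests are the same):

```csharp
// Sketch of how the mask/representation constants above classify the
// first byte of an HPACK-encoded field (simplified vs. the real decoder).
const byte IndexedHeaderFieldMask = 0x80;
const byte IndexedHeaderFieldRepresentation = 0x80;
const byte LiteralHeaderFieldWithIncrementalIndexingMask = 0xc0;
const byte LiteralHeaderFieldWithIncrementalIndexingRepresentation = 0x40;

static string Classify(byte b)
{
    // Mask off the prefix bits, then compare against the representation
    if ((b & IndexedHeaderFieldMask) == IndexedHeaderFieldRepresentation)
        return "Indexed Header Field";
    if ((b & LiteralHeaderFieldWithIncrementalIndexingMask) ==
        LiteralHeaderFieldWithIncrementalIndexingRepresentation)
        return "Literal Header Field with Incremental Indexing";
    return "other";
}
```

So a byte like `0x82` (‘1’ in the top bit) is an Indexed Header Field, while `0x40` starts a Literal Header Field with Incremental Indexing.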
Dates & Times
It is pretty widely accepted that dates and times are hard, and that’s reflected in the number of comments explaining different scenarios. For example from corefx/src/Common/src/CoreLib/System/TimeZoneInfo.cs
// startTime and endTime represent the period from either the start of DST to the end and
// ***does not include*** the potentially overlapped times
//
// -=-=-=-=-=- Pacific Standard Time -=-=-=-=-=-=-
// April 2, 2006 October 29, 2006
// 2AM 3AM 1AM 2AM
// | +1 hr | | -1 hr |
// | | | |
// [========== DST ========>)
//
// -=-=-=-=-=- Some Weird Time Zone -=-=-=-=-=-=-
// April 2, 2006 October 29, 2006
// 1AM 2AM 2AM 3AM
// | -1 hr | | +1 hr |
// | | | |
// [======== DST ========>)
//
Also, from corefx/src/Common/src/CoreLib/System/TimeZoneInfo.Unix.cs we see some details on how ‘leap-years’ are handled:
// should be n Julian day format which we don't support.
//
// This specifies the Julian day, with n between 0 and 365. February 29 is counted in leap years.
//
// n would be a relative number from the begining of the year. which should handle if the
// the year is a leap year or not.
//
// In leap year, n would be counted as:
//
// 0 30 31 59 60 90 335 365
// |-------Jan--------|-------Feb--------|-------Mar--------|....|-------Dec--------|
//
// while in non leap year we'll have
//
// 0 30 31 58 59 89 334 364
// |-------Jan--------|-------Feb--------|-------Mar--------|....|-------Dec--------|
//
//
// For example if n is specified as 60, this means in leap year the rule will start at Mar 1,
// while in non leap year the rule will start at Mar 2.
//
// If we need to support n format, we'll have to have a floating adjustment rule support this case.
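The n = 60 example at the end of the comment is easy to verify with `DateTime` (this snippet is just an illustration of the comment’s arithmetic, not runtime code):

```csharp
using System;

// Verifying the comment's example: a zero-based day number n = 60
// lands on different calendar dates depending on whether the year
// is a leap year.
var leap    = new DateTime(2016, 1, 1).AddDays(60); // 2016 is a leap year
var nonLeap = new DateTime(2017, 1, 1).AddDays(60); // 2017 is not

Console.WriteLine(leap.ToString("MMM d"));    // prints "Mar 1"
Console.WriteLine(nonLeap.ToString("MMM d")); // prints "Mar 2"
```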
Finally, this comment from corefx/src/System.Runtime/tests/System/TimeZoneInfoTests.cs discusses invalid and ambiguous times that are covered in tests:
// March 26, 2006 October 29, 2006
// 2AM 3AM 2AM 3AM
// | +1 hr | | -1 hr |
// | | | |
// *========== DST ========>*
//
// * 00:59:59 Sunday March 26, 2006 in Universal converts to
// 01:59:59 Sunday March 26, 2006 in Europe/Amsterdam (NO DST)
//
// * 01:00:00 Sunday March 26, 2006 in Universal converts to
// 03:00:00 Sunday March 26, 2006 in Europe/Amsterdam (DST)
//
Stack Layouts
To finish off, I wanted to look at ‘stack layouts’ because they seem to be a favourite of the .NET/Mono runtime engineers; there are sooo many examples!
First up, x86 from coreclr/src/jit/lclvars.cpp (you can also see the x64, ARM and ARM64 versions).
* The frame is laid out as follows for x86:
*
* ESP frames
*
* | |
* |-----------------------|
* | incoming |
* | arguments |
*      |-----------------------|
Then from Mono’s MIPS backend, in mono/mono/mini/mini-mips.c:
*    ------------------- sp + stack_usage-4
*      spilled regs
* ------------------- sp +
* MonoLMF structure optional
* ------------------- sp + cfg->arch.lmf_offset
* saved registers s0-s8
* ------------------- sp + cfg->arch.iregs_offset
* locals
* ------------------- sp + cfg->param_area
* param area outgoing
* ------------------- sp + MIPS_STACK_PARAM_OFFSET
* a0-a3 outgoing
* ------------------- sp
* red zone
*/
Finally, there’s another example covering [DllImport]
callbacks and one more involving funclet frames in ARM64. I told you there were lots!!
The Rest
If you aren’t sick of ‘ASCII Art’ by now, here’s a few more examples for you to look at!!
- CoreCLR
- Roslyn
- CoreFX
- AspNetCore
- Mono
Thu, 25 Apr 2019, 12:00 am
Is C# a low-level language?
I’m a massive fan of everything Fabien Sanglard does, I love his blog and I’ve read both his books cover-to-cover (for more info on his books, check out the recent Hanselminutes podcast).
Recently he wrote an excellent post where he deciphered a postcard sized raytracer, un-packing the obfuscated code and providing a fantastic explanation of the maths involved. I really recommend you take the time to read it!
But it got me thinking, would it be possible to port that C++ code to C#?
Partly because in my day job I’ve been having to write a fair amount of C++ recently and I’ve realised I’m a bit rusty, so I thought this might help!
But more significantly, I wanted to get a better insight into the question: is C# a low-level language?
A slightly different, but related question is how suitable is C# for ‘systems programming’? For more on that I really recommend Joe Duffy’s excellent post from 2013.
Line-by-line port
I started by simply porting the un-obfuscated C++ code line-by-line to C#. Turns out that this was pretty straightforward, I guess the story about C# being C++++ is true after all!!
Let’s look at an example, the main data structure in the code is a ‘vector’, here’s the code side-by-side, C++ on the left and C# on the right:
So there’s a few syntax differences, but because .NET lets you define your own ‘Value Types’ I was able to get the same functionality. This is significant because treating the ‘vector’ as a struct
means we can get better ‘data locality’ and the .NET Garbage Collector (GC) doesn’t need to be involved as the data will go onto the stack (probably, yes I know it’s an implementation detail).
For more info on structs
or ‘value types’ in .NET see:
In particular that last post from Eric Lippert contains this helpful quote that makes it clear what ‘value types’ really are:
Surely the most relevant fact about value types is not the implementation detail of how they are allocated, but rather the by-design semantic meaning of “value type”, namely that they are always copied “by value”. If the relevant thing was their allocation details then we’d have called them “heap types” and “stack types”. But that’s not relevant most of the time. Most of the time the relevant thing is their copying and identity semantics.
Now let’s look at how some other methods look side-by-side (again C++ on the left, C# on the right), first up RayTracing(..):
Next QueryDatabase(..):
(see Fabien’s post for an explanation of what these 2 functions are doing)
But the point is that again, C# lets us very easily write C++ code! In this case what helps us out the most is the ref
keyword which lets us pass a value by reference. We’ve been able to use ref
in method calls for quite a while, but recently there’s been an effort to allow ref
in more places:
Now sometimes using ref
can provide a performance boost because it means that the struct
doesn’t need to be copied, see the benchmarks in Adam Sitniks post and Performance traps of ref locals and ref returns in C# for more information.
However what’s most important for this scenario is that it allows us to have the same behaviour in our C# port as the original C++ code. Although I want to point out that ‘Managed References’ as they’re known aren’t exactly the same as ‘pointers’, most notably you can’t do arithmetic on them, for more on this see:
So, it’s all well and good being able to port the code, but ultimately the performance also matters. Especially in something like a ‘ray tracer’ that can take minutes to run! The C++ code contains a variable called sampleCount
that controls the final quality of the image, with sampleCount = 2
it looks like this:
Which clearly isn’t that realistic!
However once you get to sampleCount = 2048
things look a lot better:
But, running with sampleCount = 2048
means the rendering takes a long time, so all the following results were run with it set to 2
, which means the test runs completed in ~1 minute. Changing sampleCount
only affects the number of iterations of the outermost loop of the code, see this gist for an explanation.
Results after a ‘naive’ line-by-line port
To be able to give a meaningful side-by-side comparison of the C++ and C# versions I used the time-windows tool that’s a port of the Unix time
command. My initial results looked like this:
| | C++ (VS 2017) | .NET Framework (4.7.2) | .NET Core (2.2) |
|---|---|---|---|
| Elapsed time (secs) | 47.40 | 80.14 | 78.02 |
| Kernel time | 0.14 (0.3%) | 0.72 (0.9%) | 0.63 (0.8%) |
| User time | 43.86 (92.5%) | 73.06 (91.2%) | 70.66 (90.6%) |
| page fault # | 1,143 | 4,818 | 5,945 |
| Working set (KB) | 4,232 | 13,624 | 17,052 |
| Paged pool (KB) | 95 | 172 | 154 |
| Non-paged pool | 7 | 14 | 16 |
| Page file size (KB) | 1,460 | 10,936 | 11,024 |
So initially we see that the C# code is quite a bit slower than the C++ version, but it does get better (see below).
However, let’s first look at what the .NET JIT is doing for us even with this ‘naive’ line-by-line port. Firstly, it’s doing a nice job of in-lining the smaller ‘helper methods’; we can see this by looking at the output of the brilliant Inlining Analyzer tool (green overlay = inlined):
However, it doesn’t inline all methods, for example QueryDatabase(..)
is skipped because of its complexity:
Another feature that the .NET Just-In-Time (JIT) compiler provides is converting specific methods calls into corresponding CPU instructions. We can see this in action with the sqrt
wrapper function, here’s the original C# code (note the call to Math.Sqrt
):
// intnv square root
public static Vec operator !(Vec q) {
return q * (1.0f / (float)Math.Sqrt(q % q));
}
And here’s the assembly code that the .NET JIT generates, there’s no call to Math.Sqrt
and it makes use of the vsqrtsd
CPU instruction:
; Assembly listing for method Program:sqrtf(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) float -> mm0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M8216_IG01:
vzeroupper
G_M8216_IG02:
vcvtss2sd xmm0, xmm0
vsqrtsd xmm0, xmm0
vcvtsd2ss xmm0, xmm0
G_M8216_IG03:
ret
; Total bytes of code 16, prolog size 3 for method Program:sqrtf(float):float
; ============================================================
(to get this output you need to follow these instructions, use the ‘Disasmo’ VS2019 Add-in or take a look at SharpLab.io)
These replacements are also known as ‘intrinsics’ and we can see the JIT generating them in the code below. This snippet just shows the mapping for AMD64
, the JIT also targets X86
, ARM
and ARM64
, the full method is here
bool Compiler::IsTargetIntrinsic(CorInfoIntrinsics intrinsicId)
{
#if defined(_TARGET_AMD64_) || (defined(_TARGET_X86_) && !defined(LEGACY_BACKEND))
switch (intrinsicId)
{
// AMD64/x86 has SSE2 instructions to directly compute sqrt/abs and SSE4.1
// instructions to directly compute round/ceiling/floor.
//
// TODO: Because the x86 backend only targets SSE for floating-point code,
// it does not treat Sine, Cosine, or Round as intrinsics (JIT32
// implemented those intrinsics as x87 instructions). If this poses
// a CQ problem, it may be necessary to change the implementation of
// the helper calls to decrease call overhead or switch back to the
// x87 instructions. This is tracked by #7097.
case CORINFO_INTRINSIC_Sqrt:
case CORINFO_INTRINSIC_Abs:
return true;
case CORINFO_INTRINSIC_Round:
case CORINFO_INTRINSIC_Ceiling:
case CORINFO_INTRINSIC_Floor:
return compSupports(InstructionSet_SSE41);
default:
return false;
}
...
}
As you can see, some methods are implemented like this, e.g. Sqrt
and Abs
, but for others the CLR instead uses the C++ runtime functions, for instance powf.
This entire process is explained very nicely in How is Math.Pow() implemented in .NET Framework?, but we can also see it in action in the CoreCLR source:
However, I wanted to see if my ‘naive’ line-by-line port could be improved. After some profiling I made two main changes:
- Remove in-line array initialisation
- Switch from
Math.XXX(..)
functions to the MathF.XXX()
counterparts.
These changes are explained in more depth below.
Remove in-line array initialisation
For more information about why this is necessary see this excellent Stack Overflow answer from Andrey Akinshin complete with benchmarks and assembly code! It comes to the following conclusion:
Conclusion
- Does .NET caches hardcoded local arrays? Kind of: the Roslyn compiler put it in the metadata.
- Do we have any overhead in this case? Unfortunately, yes: JIT will copy the array content from the metadata for each invocation; it will work longer than the case with a static array. Runtime also allocates objects and produce memory traffic.
- Should we care about it? It depends. If it’s a hot method and you want to achieve a good level of performance, you should use a static array. If it’s a cold method which doesn’t affect the application performance, you probably should write “good” source code and put the array in the method scope.
You can see the change I made in this diff.
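In practice the fix looks something like this. These method names and array values are made up for illustration; see the diff above for the real change:

```csharp
// Illustrative version of the change (hypothetical names/values, not the
// exact diff): hoisting a hardcoded local array out to a static readonly
// field means the runtime doesn't re-create the array on every call.

// Before: allocates and initialises a new array on every invocation
static float Before(int i)
{
    float[] coefficients = { 0.1f, 0.5f, 0.4f };
    return coefficients[i];
}

// After: the array is created once and reused
static readonly float[] Coefficients = { 0.1f, 0.5f, 0.4f };
static float After(int i) => Coefficients[i];
```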
Using MathF functions instead of Math
Secondly and most significantly I got a big perf improvement by making the following changes:
#if NETSTANDARD2_1 || NETCOREAPP2_0 || NETCOREAPP2_1 || NETCOREAPP2_2 || NETCOREAPP3_0
// intnv square root
public static Vec operator !(Vec q) {
return q * (1.0f / MathF.Sqrt(q % q));
}
#else
public static Vec operator !(Vec q) {
return q * (1.0f / (float)Math.Sqrt(q % q));
}
#endif
As of ‘.NET Standard 2.1’ there are now specific float
implementations of the common maths functions, located in the System.MathF class. For more information on this API and its implementation see:
After these changes, the C# code is ~10% slower than the C++ version:
| | C++ (VS C++ 2017) | .NET Framework (4.7.2) | .NET Core (2.2) TC OFF | .NET Core (2.2) TC ON |
|---|---|---|---|---|
| Elapsed time (secs) | 41.38 | 58.89 | 46.04 | 44.33 |
| Kernel time | 0.05 (0.1%) | 0.06 (0.1%) | 0.14 (0.3%) | 0.13 (0.3%) |
| User time | 41.19 (99.5%) | 58.34 (99.1%) | 44.72 (97.1%) | 44.03 (99.3%) |
| page fault # | 1,119 | 4,749 | 5,776 | 5,661 |
| Working set (KB) | 4,136 | 13,440 | 16,788 | 16,652 |
| Paged pool (KB) | 89 | 172 | 150 | 150 |
| Non-paged pool | 7 | 13 | 16 | 16 |
| Page file size (KB) | 1,428 | 10,904 | 10,960 | 11,044 |
TC = Tiered Compilation (I believe that it’ll be on by default in .NET Core 3.0)
For completeness, here’s the results across several runs:
| Run | C++ (VS C++ 2017) | .NET Framework (4.7.2) | .NET Core (2.2) TC OFF | .NET Core (2.2) TC ON |
|---|---|---|---|---|
| TestRun-01 | 41.38 | 58.89 | 46.04 | 44.33 |
| TestRun-02 | 41.19 | 57.65 | 46.23 | 45.96 |
| TestRun-03 | 42.17 | 62.64 | 46.22 | 48.73 |
Note: the difference between .NET Core and .NET Framework is due to the lack of the MathF
API in .NET Framework v4.7.2, for more info see Support .Net Framework (4.8?) for netstandard 2.1.
However I’m sure that others can do better!
If you’re interested in trying to close the gap the C# code is available. For comparison, you can see the assembly produced by the C++ compiler courtesy of the brilliant Compiler Explorer.
Finally, if it helps, here’s the output from the Visual Studio Profiler showing the ‘hot path’ (after the perf improvement described above):
Is C# a low-level language?
Or more specifically:
What language features of C#/F#/VB.NET or BCL/Runtime functionality enable ‘low-level’* programming?
* yes, I know ‘low-level’ is a subjective term 😊
Note: Any C# developer is going to have a different idea of what ‘low-level’ means, these features would be taken for granted by C++ or Rust programmers.
Here’s the list that I came up with:
- ref returns and ref locals
- “tl;dr Pass and return by reference to avoid large struct copying. It’s type and memory safe. It can be even faster than
unsafe!
”
- Unsafe code in .NET
- “The core C# language, as defined in the preceding chapters, differs notably from C and C++ in its omission of pointers as a data type. Instead, C# provides references and the ability to create objects that are managed by a garbage collector. This design, coupled with other features, makes C# a much safer language than C or C++.”
- Managed pointers in .NET
- “There is, however, another pointer type in CLR – a managed pointer. It could be defined as a more general type of reference, which may point to other locations than just the beginning of an object.”
- C# 7 Series, Part 10: Span and universal memory management
- “
System.Span
is a stack-only type (ref struct
) that wraps all memory access patterns, it is the type for universal contiguous memory access. You can think the implementation of the Span contains a dummy reference and a length, accepting all 3 memory access types."
- Interoperability (C# Programming Guide)
- “The .NET Framework enables interoperability with unmanaged code through platform invoke services, the
System.Runtime.InteropServices
namespace, C++ interoperability, and COM interoperability (COM interop).”
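A tiny example pulling a few of the features above together — stackalloc, Span<T> and ref returns. This is purely illustrative (the function names are mine):

```csharp
using System;

// Illustrative combination of the 'low-level' features listed above:
// stackalloc + Span<T> avoids a heap allocation, and a ref return hands
// back an alias into the buffer rather than a copy of the element.
static ref int Largest(Span<int> values)
{
    int maxIndex = 0;
    for (int i = 1; i < values.Length; i++)
        if (values[i] > values[maxIndex])
            maxIndex = i;
    return ref values[maxIndex]; // a managed pointer into the span
}

static void Demo()
{
    Span<int> buffer = stackalloc int[] { 3, 9, 4 }; // lives on the stack
    ref int max = ref Largest(buffer);
    max = 0;                      // writes through the ref, into the buffer
    Console.WriteLine(buffer[1]); // prints 0 - the buffer itself changed
}
```

No garbage is created and no struct copying happens: the `ref int` is a managed pointer straight into the stack-allocated buffer.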
However, I know my limitations and so I asked on twitter and got a lot more replies to add to the list:
- Ben Adams “Platform intrinsics (CPU instruction access)”
- Marc Gravell “SIMD via Vector (which mixes well with Span) is *fairly* low; .NET Core should (soon?) offer direct CPU intrinsics for more explicit usage targeting particular CPU ops"
- Marc Gravell “powerful JIT: things like range elision on arrays/spans, and the JIT using per-struct-T rules to remove huge chunks of code that it knows can’t be reached for that T, or on your particular CPU (BitConverter.IsLittleEndian, Vector.IsHardwareAccelerated, etc)”
- Kevin Jones “I would give a special shout-out to the
MemoryMarshal
and Unsafe
classes, and probably a few other things in the System.Runtime.CompilerServices
namespace.”
- Theodoros Chatzigiannakis “You could also include
__makeref
and the rest.”
- damageboy “Being able to dynamically generate code that fits the expected input exactly, given that the latter will only be known at runtime, and might change periodically?”
- Robert Haken “dynamic IL emission”
- Victor Baybekov “Stackalloc was not mentioned. Also ability to write raw IL (not dynamic, so save on a delegate call), e.g. to use cached
ldftn
and call them via calli
. VS2017 has a proj template that makes this trivial via extern methods + MethodImplOptions.ForwardRef + ilasm.exe rewrite.”
- Victor Baybekov “Also MethodImplOptions.AggressiveInlining “does enable ‘low-level’ programming” in a sense that it allows to write high-level code with many small methods and still control JIT behavior to get optimized result. Otherwise uncomposable 100s LOCs methods with copy-paste…”
- Ben Adams “Using the same calling conventions (ABI) as the underlying platform and p/invokes for interop might be more of a thing though?”
- Victor Baybekov “Also since you mentioned #fsharp - it does have
inline
keyword that does the job at IL level before JIT, so it was deemed important at the language level. C# lacks this (so far) for lambdas which are always virtual calls and workarounds are often weird (constrained generics).”
- Alexandre Mutel “new SIMD intrinsics, Unsafe Utility class/IL post processing (e.g custom, Fody…etc.). For C#8.0, upcoming function pointers…”
- Alexandre Mutel “related to IL, F# has support for direct IL within the language for example”
- OmariO “BinaryPrimitives. Low-level but safe.” (https://docs.microsoft.com/en-us/dotnet/api/system.buffers.binary.binaryprimitives?view=netcore-3.0)
- Kouji (Kozy) Matsui “How about native inline assembler? It’s difficult for how relation both toolchains and runtime, but can replace current P/Invoke solution and do inlining if we have it.”
- Frank A. Krueger “Ldobj, stobj, initobj, initblk, cpyblk.”
- Konrad Kokosa “Maybe Thread Local Storage? Fixed Size Buffers? unmanaged constraint and blittable types should be probably mentioned:)”
- Sebastiano Mandalà “Just my two cents as everything has been said: what about something as simple as struct layout and how padding and memory alignment and order of the fields may affect the cache line performance? It’s something I have to investigate myself too”
- Nino Floris “Constants embedding via readonlyspan, stackalloc, finalizers, WeakReference, open delegates, MethodImplOptions, MemoryBarriers, TypedReference, varargs, SIMD, Unsafe.AsRef can coerce struct types if layout matches exactly (used for a.o. TaskAwaiter and its version)"
So in summary, I would say that C# certainly lets you write code that looks a lot like C++, and in conjunction with the Runtime and Base-Class Libraries it gives you a lot of low-level functionality.
Discuss this post on Hacker News, /r/programming, /r/dotnet or /r/csharp
Further Reading
The Unity ‘Burst’ Compiler:
Fri, 1 Mar 2019, 12:00 am
"Stack Walking" in the .NET Runtime
What is ‘stack walking’? Well, as always, the ‘Book of the Runtime’ (BotR) helps us, from the relevant page:
The CLR makes heavy use of a technique known as stack walking (or stack crawling). This involves iterating the sequence of call frames for a particular thread, from the most recent (the thread’s current function) back down to the base of the stack.
The runtime uses stack walks for a number of purposes:
- The runtime walks the stacks of all threads during garbage collection, looking for managed roots (local variables holding object references in the frames of managed methods that need to be reported to the GC to keep the objects alive and possibly track their movement if the GC decides to compact the heap).
- On some platforms the stack walker is used during the processing of exceptions (looking for handlers in the first pass and unwinding the stack in the second).
- The debugger uses the functionality when generating managed stack traces.
- Various miscellaneous methods, usually those close to some public managed API, perform a stack walk to pick up information about their caller (such as the method, class or assembly of that caller).
The rest of this post will explore what ‘Stack Walking’ is, how it works and why so many parts of the runtime need to be involved.
Table of Contents
Where does the CLR use ‘Stack Walking’?
Before we dig into the ‘internals’, let’s take a look at where the runtime utilises ‘stack walking’, below is the full list (as of .NET Core CLR ‘Release 2.2’). All these examples end up calling into the Thread::StackWalkFrames(..)
method here and provide a callback
that is triggered whenever the API encounters a new section of the stack (see How to use it below for more info).
Common Scenarios
- Garbage Collection (GC)
- Exception Handling (unwinding)
- Exception Handling (resumption):
ExceptionTracker::FindNonvolatileRegisterPointers(..)
here -> callback
ExceptionTracker::RareFindParentStackFrame(..)
here -> callback
- Threads:
- Thread Suspension:
Debugging/Diagnostics
- Debugger
- Managed APIs (e.g.
System.Diagnostics.StackTrace
)
- Managed code calls via an
InternalCall
(C#) here into DebugStackTrace::GetStackFramesInternal(..)
(C++) here
- Before ending up in
DebugStackTrace::GetStackFramesHelper(..)
here -> callback
- DAC (via SOS) - Scan for GC ‘Roots’
- Profiling API
ProfToEEInterfaceImpl::ProfilerStackWalkFramesWrapper(..)
here -> callback
- Event Pipe (Diagnostics)
- CLR prints a Stack Trace (to the console/log, DEBUG builds only)
Obscure Scenarios
- Reflection
- Application (App) Domains (See ‘Stack Crawl Marks’ below)
SystemDomain::GetCallersMethod(..)
here (also GetCallersType(..)
and GetCallersModule(..)
) (callback)
SystemDomain::GetCallersModule(..)
here (callback)
- ‘Code Pitching’
- Extensible Class Factory (
System.Runtime.InteropServices.ExtensibleClassFactory
)
- Stack Sampler (unused?)
Stack Crawl Marks
One of the above scenarios deserves a closer look. But firstly, why are ‘stack crawl marks’ used? From coreclr/issues/#21629 (comment):
Unfortunately, there is a ton of legacy APIs that were added during netstandard2.0 push whose behavior depend on the caller. The caller is basically passed in as an implicit argument to the API. Most of these StackCrawlMarks are there to support these APIs…
So we can see that multiple functions within the CLR itself need to have knowledge of their caller. To understand this some more, let’s look at an example, the GetType(string typeName)
method. Here’s the flow from the externally-visible method all the way down to where the work is done, note how a StackCrawlMark
instance is passed through:
Type::GetType(string typeName)
implementation (Creates StackCrawlMark.LookForMyCaller
)
RuntimeType::GetType(.., ref StackCrawlMark stackMark)
implementation
RuntimeType::GetTypeByName(.., ref StackCrawlMark stackMark, ..)
implementation
extern void GetTypeByName(.., ref StackCrawlMark stackMark, ..)
definition (call into native code, i.e. [DllImport(JitHelpers.QCall, ..)]
)
RuntimeTypeHandle::GetTypeByName(.., QCall::StackCrawlMarkHandle pStackMark, ..)
implementation
TypeHandle TypeName::GetTypeManaged(.., StackCrawlMark* pStackMark, ..)
implementation
TypeHandle TypeName::GetTypeWorker(.. , StackCrawlMark* pStackMark, ..)
implementation
SystemDomain::GetCallersAssembly(StackCrawlMark *stackMark,..)
implementation
SystemDomain::GetCallersModule(StackCrawlMark* stackMark, ..)
implementation
SystemDomain::CallersMethodCallbackWithStackMark(..)
callback implementation
In addition the JIT (via the VM) has to ensure that all relevant methods are available in the call-stack, i.e. they can’t be removed:
However, the StackCrawlMark
feature is currently being cleaned up, so it may look different in the future:
Exception Handling
The place that most .NET Developers will run into ‘stack traces’ is when dealing with exceptions. I originally intended to also describe ‘exception handling’ here, but then I opened up /src/vm/exceptionhandling.cpp and saw that it contained over 7,000 lines of code!! So I decided that it can wait for a future post 😁.
However, if you want to learn more about the ‘internals’ I really recommend Chris Brumme’s post The Exception Model (2003) which is the definitive guide on the topic (also see his Channel9 Videos) and as always, the ‘BotR’ chapter ‘What Every (Runtime) Dev needs to Know About Exceptions in the Runtime’ is well worth a read.
Also, I recommend taking a look at the slides from the ‘Internals of Exceptions’ talk and the related post .NET Inside Out Part 2 — Handling and rethrowing exceptions in C#, both by Adam Furmanek.
The ‘Stack Walking’ API
Now that we’ve seen where it’s used, let’s look at the ‘stack walking’ API itself. Firstly, how is it used?
How to use it
It’s worth pointing out that the only way you can access it from C#/F#/VB.NET code is via the StackTrace
class; only the runtime itself can call into Thread::StackWalkFrames(..)
directly. The simplest usage in the runtime is EventPipe::WalkManagedStackForThread(..)
(see here), which is shown below. As you can see it’s as simple as specifying the relevant flags, in this case ALLOW_ASYNC_STACK_WALK | FUNCTIONSONLY | HANDLESKIPPEDFRAMES | ALLOW_INVALID_OBJECTS
and then providing the callback, which in the EventPipe class is the StackWalkCallback
method (here)
bool EventPipe::WalkManagedStackForThread(Thread *pThread, StackContents &stackContents)
{
CONTRACTL
{
NOTHROW;
GC_NOTRIGGER;
MODE_ANY;
PRECONDITION(pThread != NULL);
}
CONTRACTL_END;
// Calling into StackWalkFrames in preemptive mode violates the host contract,
// but this contract is not used on CoreCLR.
CONTRACT_VIOLATION( HostViolation );
stackContents.Reset();
StackWalkAction swaRet = pThread->StackWalkFrames(
(PSTACKWALKFRAMESCALLBACK) &StackWalkCallback,
&stackContents,
ALLOW_ASYNC_STACK_WALK | FUNCTIONSONLY | HANDLESKIPPEDFRAMES | ALLOW_INVALID_OBJECTS);
return ((swaRet == SWA_DONE) || (swaRet == SWA_CONTINUE));
}
The StackWalkFrames(..)
function then does the heavy-lifting of actually walking the stack, before triggering the callback shown below. In this case it just records the ‘Instruction Pointer’ (IP) and the ‘managed function’, which is an instance of the MethodDesc
obtained via the pCf->GetFunction()
call:
StackWalkAction EventPipe::StackWalkCallback(CrawlFrame *pCf, StackContents *pData)
{
CONTRACTL
{
NOTHROW;
GC_NOTRIGGER;
MODE_ANY;
PRECONDITION(pCf != NULL);
PRECONDITION(pData != NULL);
}
CONTRACTL_END;
// Get the IP.
UINT_PTR controlPC = (UINT_PTR)pCf->GetRegisterSet()->ControlPC;
if (controlPC == 0)
{
if (pData->GetLength() == 0)
{
// This happens for pinvoke stubs on the top of the stack.
return SWA_CONTINUE;
}
}
_ASSERTE(controlPC != 0);
// Add the IP to the captured stack.
pData->Append(controlPC, pCf->GetFunction());
// Continue the stack walk.
return SWA_CONTINUE;
}
How it works
Now onto the most interesting part, how the runtime actually walks the stack. Well, first let’s understand what the stack looks like, from the ‘BotR’ page:
The main thing to note is that a .NET ‘stack’ can contain 3 types of methods:
- Managed - this represents code that started off as C#/F#/VB.NET, was turned into IL and then finally compiled to native code by the ‘JIT Compiler’.
- Unmanaged - completely native code that exists outside of the runtime, i.e. an OS function the runtime calls into or a user call via
P/Invoke
. The runtime only cares about transitions into or out of regular unmanaged code, it doesn’t care about the stack frames within it.
- Runtime Managed - still native code, but this is slightly different because the runtime cares more about this code. For example there are quite a few parts of the Base-Class libraries that make use of
InternalCall
methods, for more on this see the ‘Helper Method’ Frames section later on.
So the ‘stack walk’ has to deal with these different scenarios as it proceeds. Now let’s look at the ‘code flow’ starting with the entry-point method StackWalkFrames(..)
:
Thread::StackWalkFrames(..)
here
- the entry-point function, the type of ‘stack walk’ can be controlled via these flags
Thread::StackWalkFramesEx(..)
here
- worker-function that sets up the
StackFrameIterator
, via a call to StackFrameIterator::Init(..)
here
StackFrameIterator::Next()
here, then hands off to the primary worker method StackFrameIterator::NextRaw()
here that does 5 things:
CheckForSkippedFrames(..)
here, deals with frames that may have been allocated inside a managed stack frame (e.g. an inlined p/invoke call).
UnwindStackFrame(..)
here, in-turn calls:
x64
- Thread::VirtualUnwindCallFrame(..)
here, then calls VirtualUnwindNonLeafCallFrame(..)
here or VirtualUnwindLeafCallFrame(..)
here. All of these functions make use of the Windows API function RtlLookupFunctionEntry(..)
to do the actual unwinding.
x86
- ::UnwindStackFrame(..)
here, in turn calls UnwindEpilog(..)
here and UnwindEspFrame(..)
here. Unlike x64
, under x86
all the ‘stack-unwinding’ is done manually, within the CLR code.
PostProcessingForManagedFrames(..)
here, determines if the stack-walk is actually within a managed method rather than a native frame.
ProcessIp(..)
here has the job of looking up the current managed method (if any) based on the current instruction pointer (IP). It does this by calling into EECodeInfo::Init(..)
here and then ends up in one of:
EEJitManager::JitCodeToMethodInfo(..)
here, that uses a very cool looking data structure referred to as a ‘nibble map’
NativeImageJitManager::JitCodeToMethodInfo(..)
here
ReadyToRunJitManager::JitCodeToMethodInfo(..)
here
ProcessCurrentFrame(..)
here, does some final house-keeping and tidy-up.
CrawlFrame::GotoNextFrame()
here
- in-turn calls
pFrame->Next()
here to walk through the ‘linked list’ of frames which drive the ‘stack walk’ (more on these ‘frames’ later)
StackFrameIterator::Filter()
here
When it gets a valid frame it triggers the callback in Thread::MakeStackwalkerCallback(..)
here and passes in a pointer to the current CrawlFrame
class defined here, this exposes methods such as IsFrameless()
, GetFunction()
and GetThisPointer()
. The CrawlFrame
actually represents 2 scenarios, based on the current IP:
- Native code, represented by a
Frame
class defined here, which we’ll discuss more in a moment.
- Managed code, well technically ‘managed code’ that was JITted to ‘native code’, so more accurately a managed stack frame. In this situation the
MethodDesc
class defined here is provided, you can read more about this key CLR data-structure in the corresponding BotR chapter.
See it ‘in Action’
Fortunately we’re able to turn on some nice diagnostics in a debug build of the CLR (COMPLUS_LogEnable
, COMPLUS_LogToFile
& COMPLUS_LogFacility
). With that in place, given C# code like this:
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

internal class Program {
    private static void Main() {
        MethodA();
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void MethodA() {
        MethodB();
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void MethodB() {
        MethodC();
    }
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void MethodC() {
        var stackTrace = new StackTrace(fNeedFileInfo: true);
        Console.WriteLine(stackTrace.ToString());
    }
}
We get the output shown below, in which you can see the ‘stack walking’ process. It starts in InitializeSourceInfo
and CaptureStackTrace
which are methods internal to the StackTrace
class (see here), before moving up the stack MethodC
-> MethodB
-> MethodA
and then finally stopping in the Main
function. Along the way it does a ‘CONSIDER’ and ‘FILTER’ step before actually unwinding (‘finished unwind for …’):
TID 4740: STACKWALK starting with partial context
TID 4740: STACKWALK: [000] FILTER : EXPLICIT : PC= 00000000`00000000 SP= 00000000`00000000 Frame= 00000002`9977cc48 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: [001] CONSIDER: EXPLICIT : PC= 00000000`00000000 SP= 00000000`00000000 Frame= 00000002`9977cc48 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: [001] FILTER : EXPLICIT : PC= 00000000`00000000 SP= 00000000`00000000 Frame= 00000002`9977cc48 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: [002] CONSIDER: EXPLICIT : PC= 00000000`00000000 SP= 00000000`00000000 Frame= 00000002`9977cdd8 vtbl= 00007ffd`74995220
TID 4740: STACKWALK LazyMachState::unwindLazyState(ip:00007FFD7439C45C,sp:000000029977C338)
TID 4740: STACKWALK: [002] CALLBACK: EXPLICIT : PC= 00000000`00000000 SP= 00000000`00000000 Frame= 00000002`9977cdd8 vtbl= 00007ffd`74995220
TID 4740: STACKWALK HelperMethodFrame::UpdateRegDisplay cached ip:00007FFD72FE9258, sp:000000029977D300
TID 4740: STACKWALK: [003] CONSIDER: FRAMELESS: PC= 00007ffd`72fe9258 SP= 00000002`9977d300 method=InitializeSourceInfo
TID 4740: STACKWALK: [003] CALLBACK: FRAMELESS: PC= 00007ffd`72fe9258 SP= 00000002`9977d300 method=InitializeSourceInfo
TID 4740: STACKWALK: [004] about to unwind for 'InitializeSourceInfo', SP: 00000002`9977d300 , IP: 00007ffd`72fe9258
TID 4740: STACKWALK: [004] finished unwind for 'InitializeSourceInfo', SP: 00000002`9977d480 , IP: 00007ffd`72eeb671
TID 4740: STACKWALK: [004] CONSIDER: FRAMELESS: PC= 00007ffd`72eeb671 SP= 00000002`9977d480 method=CaptureStackTrace
TID 4740: STACKWALK: [004] CALLBACK: FRAMELESS: PC= 00007ffd`72eeb671 SP= 00000002`9977d480 method=CaptureStackTrace
TID 4740: STACKWALK: [005] about to unwind for 'CaptureStackTrace', SP: 00000002`9977d480 , IP: 00007ffd`72eeb671
TID 4740: STACKWALK: [005] finished unwind for 'CaptureStackTrace', SP: 00000002`9977d5b0 , IP: 00007ffd`72eeadd0
TID 4740: STACKWALK: [005] CONSIDER: FRAMELESS: PC= 00007ffd`72eeadd0 SP= 00000002`9977d5b0 method=.ctor
TID 4740: STACKWALK: [005] CALLBACK: FRAMELESS: PC= 00007ffd`72eeadd0 SP= 00000002`9977d5b0 method=.ctor
TID 4740: STACKWALK: [006] about to unwind for '.ctor', SP: 00000002`9977d5b0 , IP: 00007ffd`72eeadd0
TID 4740: STACKWALK: [006] finished unwind for '.ctor', SP: 00000002`9977d5f0 , IP: 00007ffd`14c620d3
TID 4740: STACKWALK: [006] CONSIDER: FRAMELESS: PC= 00007ffd`14c620d3 SP= 00000002`9977d5f0 method=MethodC
TID 4740: STACKWALK: [006] CALLBACK: FRAMELESS: PC= 00007ffd`14c620d3 SP= 00000002`9977d5f0 method=MethodC
TID 4740: STACKWALK: [007] about to unwind for 'MethodC', SP: 00000002`9977d5f0 , IP: 00007ffd`14c620d3
TID 4740: STACKWALK: [007] finished unwind for 'MethodC', SP: 00000002`9977d630 , IP: 00007ffd`14c62066
TID 4740: STACKWALK: [007] CONSIDER: FRAMELESS: PC= 00007ffd`14c62066 SP= 00000002`9977d630 method=MethodB
TID 4740: STACKWALK: [007] CALLBACK: FRAMELESS: PC= 00007ffd`14c62066 SP= 00000002`9977d630 method=MethodB
TID 4740: STACKWALK: [008] about to unwind for 'MethodB', SP: 00000002`9977d630 , IP: 00007ffd`14c62066
TID 4740: STACKWALK: [008] finished unwind for 'MethodB', SP: 00000002`9977d660 , IP: 00007ffd`14c62016
TID 4740: STACKWALK: [008] CONSIDER: FRAMELESS: PC= 00007ffd`14c62016 SP= 00000002`9977d660 method=MethodA
TID 4740: STACKWALK: [008] CALLBACK: FRAMELESS: PC= 00007ffd`14c62016 SP= 00000002`9977d660 method=MethodA
TID 4740: STACKWALK: [009] about to unwind for 'MethodA', SP: 00000002`9977d660 , IP: 00007ffd`14c62016
TID 4740: STACKWALK: [009] finished unwind for 'MethodA', SP: 00000002`9977d690 , IP: 00007ffd`14c61f65
TID 4740: STACKWALK: [009] CONSIDER: FRAMELESS: PC= 00007ffd`14c61f65 SP= 00000002`9977d690 method=Main
TID 4740: STACKWALK: [009] CALLBACK: FRAMELESS: PC= 00007ffd`14c61f65 SP= 00000002`9977d690 method=Main
TID 4740: STACKWALK: [00a] about to unwind for 'Main', SP: 00000002`9977d690 , IP: 00007ffd`14c61f65
TID 4740: STACKWALK: [00a] finished unwind for 'Main', SP: 00000002`9977d6d0 , IP: 00007ffd`742f9073
TID 4740: STACKWALK: [00a] FILTER : NATIVE : PC= 00007ffd`742f9073 SP= 00000002`9977d6d0
TID 4740: STACKWALK: [00b] CONSIDER: EXPLICIT : PC= 00007ffd`742f9073 SP= 00000002`9977d6d0 Frame= 00000002`9977de58 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: [00b] FILTER : EXPLICIT : PC= 00007ffd`742f9073 SP= 00000002`9977d6d0 Frame= 00000002`9977de58 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: [00c] CONSIDER: EXPLICIT : PC= 00007ffd`742f9073 SP= 00000002`9977d6d0 Frame= 00000002`9977e7e0 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: [00c] FILTER : EXPLICIT : PC= 00007ffd`742f9073 SP= 00000002`9977d6d0 Frame= 00000002`9977e7e0 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: SWA_DONE: reached the end of the stack
To find out more, you can search for these diagnostic messages in \vm\stackwalk.cpp, e.g. in Thread::DebugLogStackWalkInfo(..)
here
Unwinding ‘Native’ Code
As explained in this excellent article:
There are fundamentally two main ways to implement exception propagation in an ABI (Application Binary Interface):
-
“dynamic registration”, with frame pointers in each activation record, organized as a linked list. This makes stack unwinding fast at the expense of having to set up the frame pointer in each function that calls other functions. This is also simpler to implement.
-
“table-driven”, where the compiler and assembler create data structures alongside the program code to indicate which addresses of code correspond to which sizes of activation records. This is called “Call Frame Information” (CFI) data in e.g. the GNU tool chain. When an exception is generated, the data in this table is loaded to determine how to unwind. This makes exception propagation slower but the general case faster.
It turns out that .NET uses the ‘table-driven’ approach, for the reason explained in the ‘BotR’:
The exact definition of a frame varies from platform to platform and on many platforms there isn’t a hard definition of a frame format that all functions adhere to (x86 is an example of this). Instead the compiler is often free to optimize the exact format of frames. On such systems it is not possible to guarantee that a stackwalk will return 100% correct or complete results (for debugging purposes, debug symbols such as pdbs are used to fill in the gaps so that debuggers can generate more accurate stack traces).
This is not a problem for the CLR, however, since we do not require a fully generalized stack walk. Instead we are only interested in those frames that are managed (i.e. represent a managed method) or, to some extent, frames coming from unmanaged code used to implement part of the runtime itself. In particular there is no guarantee about fidelity of 3rd party unmanaged frames other than to note where such frames transition into or out of the runtime itself (i.e. one of the frame types we do care about).
Frames
To enable ‘unwinding’ of native code or more strictly the transitions ‘into’ and ‘out of’ native code, the CLR uses a mechanism of Frames
, which are defined in the source code here. These frames are arranged into a hierarchy and there is one type of Frame
for each scenario, for more info on these individual Frames
take a look at the excellent source-code comments here.
- Frame (abstract/base class)
- GCFrame
- FaultingExceptionFrame
- HijackFrame
- ResumableFrame
- InlinedCallFrame
- HelperMethodFrame
- HelperMethodFrame_1OBJ
- HelperMethodFrame_2OBJ
- HelperMethodFrame_3OBJ
- HelperMethodFrame_PROTECTOBJ
- TransitionFrame
- StubHelperFrame
- SecureDelegateFrame
- FramedMethodFrame
- ComPlusMethodFrame
- PInvokeCalliFrame
- PrestubMethodFrame
- StubDispatchFrame
- ExternalMethodFrame
- TPMethodFrame
- UnmanagedToManagedFrame
- ComMethodFrame
- UMThkCallFrame
- ContextTransitionFrame
- TailCallFrame
- ProtectByRefsFrame
- ProtectValueClassFrame
- DebuggerClassInitMarkFrame
- DebuggerSecurityCodeMarkFrame
- DebuggerExitFrame
- DebuggerU2MCatchHandlerFrame
- FuncEvalFrame
- ExceptionFilterFrame
‘Helper Method’ Frames
But to make sense of this, let’s look at one type of Frame
, known as HelperMethodFrame
(above). This is used when .NET code in the runtime calls into C++ code to do the heavy-lifting, often for performance reasons. One example is if you call Environment.GetCommandLineArgs()
you end up in this code (C#), but note that it ends up calling an extern
method marked with InternalCall
:
[MethodImplAttribute(MethodImplOptions.InternalCall)]
private static extern string[] GetCommandLineArgsNative();
This means that the rest of the method is implemented in the runtime in C++. You can see how the method call is wired up, before ending up in SystemNative::GetCommandLineArgs
here, which is shown below:
FCIMPL0(Object*, SystemNative::GetCommandLineArgs)
{
    FCALL_CONTRACT;

    PTRARRAYREF strArray = NULL;

    HELPER_METHOD_FRAME_BEGIN_RET_1(strArray);

    // ... (the body that fetches the command line, splits it into
    //      arguments and copies each one into strArray is omitted) ...

    delete [] argv;

    HELPER_METHOD_FRAME_END();

    return OBJECTREFToObject(strArray);
}
FCIMPLEND
When the JIT compiles a method it also emits ‘unwind info’, describing how the method’s prolog can be undone. In a DEBUG build of the runtime you can get this data dumped out, for example:
Unwind Info:
>> End offset : 0x00004e (not in unwind data)
Version : 1
Flags : 0x00
SizeOfProlog : 0x07
CountOfUnwindCodes: 4
FrameRegister : none (0)
FrameOffset : N/A (no FrameRegister) (Value=0)
UnwindCodes :
CodeOffset: 0x07 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 11 * 8 + 8 = 96 = 0x60
CodeOffset: 0x03 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6)
CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rdi (7)
CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5)
Unwind Info:
>> Start offset : 0x00004e (not in unwind data)
>> End offset : 0x0000e2 (not in unwind data)
Version : 1
Flags : 0x00
SizeOfProlog : 0x07
CountOfUnwindCodes: 4
FrameRegister : none (0)
FrameOffset : N/A (no FrameRegister) (Value=0)
UnwindCodes :
CodeOffset: 0x07 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 5 * 8 + 8 = 48 = 0x30
CodeOffset: 0x03 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6)
CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rdi (7)
CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5)
This ‘unwind info’ is then looked up during a ‘stack walk’ as explained in the How it works section above.
So next time you encounter a ‘stack trace’ remember that a lot of work went into making it possible!!
Further Reading
‘Stack Walking’ or ‘Stack Unwinding’ is a very large topic, so if you want to know more, here are some links to get you started:
Stack Unwinding (general)
Stack Unwinding (other runtimes)
In addition, it’s interesting to look at how other runtimes handle this process:
Mon, 21 Jan 2019, 12:00 am
Exploring the .NET Core Runtime (in which I set myself a challenge)
It seems like this time of year anyone with a blog is doing some sort of ‘advent calendar’, i.e. 24 posts leading up to Christmas. For instance there’s a F# one which inspired a C# one (C# copying from F#, that never happens 😉)
However, that’s a bit of a problem for me, I struggled to write 24 posts in my most productive year, let alone a single month! Also, I mostly blog about ‘.NET Internals’, a subject which doesn’t necessarily lend itself to the more ‘light-hearted’ posts you get in these ‘advent calendar’ blogs.
Until now!
Recently I’ve been giving a talk titled from ‘dotnet run’ to ‘hello world’, which attempts to explain everything that the .NET Runtime does from the point you launch your application till “Hello World” is printed on the screen:
From 'dotnet run' to 'hello world' from
Matt Warren
But as I was researching and presenting this talk, it made me think about the .NET Runtime as a whole, what does it contain and most importantly what can you do with it?
Note: this is mostly for informational purposes, for the recommended way of achieving the same thing, take a look at this excellent Deep-dive into .NET Core primitives by Nate McMaster.
In this post I will explore what you can do using only the code in the dotnet/coreclr repository and along the way we’ll find out more about how the runtime interacts with the wider .NET Ecosystem.
To make things clearer, there are 3 challenges that will need to be solved before a simple “Hello World” application can be run. That’s because in the dotnet/coreclr repository there is:
- No compiler, that lives in dotnet/Roslyn
- No Framework Class Library (FCL) a.k.a. ‘dotnet/CoreFX’
- No
dotnet run
as it’s implemented in the dotnet/CLI repository
Building the CoreCLR
But before we even work through these ‘challenges’, we need to build the CoreCLR itself. Helpfully there is a really nice guide available in ‘Building the Repository’:
The build depends on Git, CMake, Python and of course a C++ compiler. Once these prerequisites are installed
the build is simply a matter of invoking the ‘build’ script (build.cmd
or build.sh
) at the base of the repository.
The details of installing the components differ depending on the operating system. See the following pages based on your OS. There is no cross-building across OS (only for ARM, which is built on X64). You have to be on the particular platform to build that platform.
If you follow these steps successfully, you’ll end up with the following files (at least on Windows, other OSes may produce something slightly different):
No Compiler
First up, how do we get around the fact that we don’t have a compiler? After all, we need some way of turning our simple “Hello World” code into a .exe!
namespace Hello_World
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Hello World!");
}
}
}
Fortunately we do have access to the ILASM tool (IL Assembler), which can turn Common Intermediate Language (CIL) into an .exe file. But how do we get the correct IL code? Well, one way is to write it from scratch, maybe after reading Inside .NET IL Assembler and Expert .NET 2.0 IL Assembler by Serge Lidin (yes, amazingly, 2 books have been written about IL!)
Another, much easier way, is to use the amazing SharpLab.io site to do it for us! If you paste the C# code from above into it, you’ll get the following IL code:
.class private auto ansi ''
{
} // end of class
.class private auto ansi beforefieldinit Hello_World.Program
extends [mscorlib]System.Object
{
// Methods
.method private hidebysig static
void Main (
string[] args
) cil managed
{
// Method begins at RVA 0x2050
// Code size 11 (0xb)
.maxstack 8
IL_0000: ldstr "Hello World!"
IL_0005: call void [mscorlib]System.Console::WriteLine(string)
IL_000a: ret
} // end of method Program::Main
.method public hidebysig specialname rtspecialname
instance void .ctor () cil managed
{
// Method begins at RVA 0x205c
// Code size 7 (0x7)
.maxstack 8
IL_0000: ldarg.0
IL_0001: call instance void [mscorlib]System.Object::.ctor()
IL_0006: ret
} // end of method Program::.ctor
} // end of class Hello_World.Program
Then, if we save this to a file called ‘HelloWorld.il’ and run the cmd ilasm HelloWorld.il /out=HelloWorld.exe
, we get the following output:
Microsoft (R) .NET Framework IL Assembler. Version 4.5.30319.0
Copyright (c) Microsoft Corporation. All rights reserved.
Assembling 'HelloWorld.il' to EXE --> 'HelloWorld.exe'
Source file is ANSI
HelloWorld.il(38) : warning : Reference to undeclared extern assembly 'mscorlib'. Attempting autodetect
Assembled method Hello_World.Program::Main
Assembled method Hello_World.Program::.ctor
Creating PE file
Emitting classes:
Class 1: Hello_World.Program
Emitting fields and methods:
Global
Class 1 Methods: 2;
Emitting events and properties:
Global
Class 1
Writing PE file
Operation completed successfully
Nice, so part 1 is done, we now have our HelloWorld.exe
file!
No Base Class Library
Well, not exactly, one problem is that System.Console
lives in dotnet/corefx, in there you can see the different files that make up the implementation, such as Console.cs
, ConsolePal.Unix.cs
, ConsolePal.Windows.cs
, etc.
Fortunately, the nice CoreCLR developers included a simple Console
implementation in System.Private.CoreLib.dll
, the managed part of the CoreCLR, which was previously known as ‘mscorlib’. This internal version of Console
is pretty small and basic, but it provides enough for what we need.
To use this ‘workaround’ we need to edit our HelloWorld.il
to look like this (note the change from mscorlib
to System.Private.CoreLib
)
.class public auto ansi beforefieldinit C
extends [System.Private.CoreLib]System.Object
{
.method public hidebysig static void M () cil managed
{
.entrypoint
// Code size 11 (0xb)
.maxstack 8
IL_0000: ldstr "Hello World!"
IL_0005: call void [System.Private.CoreLib]Internal.Console::WriteLine(string)
IL_000a: ret
} // end of method C::M
...
}
Note: You can achieve the same thing with C# code instead of raw IL, by invoking the C# compiler with the following cmd-line:
csc -optimize+ -nostdlib -reference:System.Private.Corelib.dll -out:HelloWorld.exe HelloWorld.cs
So we’ve completed part 2, we are able to at least print “Hello World” to the screen without using the CoreFX repository!
Now this is a nice little trick, but I wouldn’t ever recommend writing real code like this. Compiling against System.Private.CoreLib
isn’t the right way of doing things. What the compiler normally does is compile against the publicly exposed surface area that lives in dotnet/corefx, but then at run-time a process called ‘Type-Forwarding’ is used to make that ‘reference’ implementation in CoreFX map to the ‘real’ implementation in the CoreCLR. For more on this entire process see The Rough History of Referenced Assemblies.
However, only a small amount of managed code (i.e. C#) actually exists in the CoreCLR. To show this, the directory tree for /dotnet/coreclr/src/System.Private.CoreLib is available here and the tree with all ~1280 .cs files included is here.
As a concrete example, if you look in CoreFX, you’ll see that the System.Reflection implementation is pretty empty! That’s because it’s a ‘partial facade’ that is eventually ‘type-forwarded’ to System.Private.CoreLib.
If you’re interested, the entire API that is exposed in CoreFX (but actually lives in CoreCLR) is contained in System.Runtime.cs. But back to our example, here is the code that describes all the GetMethod(..)
functions in the ‘System.Reflection’ API.
To learn more about ‘type forwarding’, I recommend watching ‘.NET Standard - Under the Hood’ (slides) by Immo Landwerth and there is also some more in-depth information in ‘Evolution of design time assemblies’.
But why is this code split useful, from the CoreFX README:
Runtime-specific library code (mscorlib) lives in the CoreCLR repo. It needs to be built and versioned in tandem with the runtime. The rest of CoreFX is agnostic of runtime-implementation and can be run on any compatible .NET runtime (e.g. CoreRT).
And from the other point-of-view, in the CoreCLR README:
By itself, the Microsoft.NETCore.Runtime.CoreCLR
package is actually not enough to do much. One reason for this is that the CoreCLR package tries to minimize the amount of the class library that it implements. Only types that have a strong dependency on the internal workings of the runtime are included (e.g, System.Object
, System.String
, System.Threading.Thread
, System.Threading.Tasks.Task
and most foundational interfaces).
Instead most of the class library is implemented as independent NuGet packages that simply use the .NET Core runtime as a dependency. Many of the most familiar classes (System.Collections, System.IO, System.Xml and so on) live in packages defined in the dotnet/corefx repository.
One huge benefit of this approach is that Mono can share large amounts of the CoreFX code, as shown in this tweet:
How Mono reuses .NET Core sources for BCL (doesn't include runtime, tools, etc) according to my calculations 🙂 pic.twitter.com/8JCDxqwnNi
— Egor Bogatov (@EgorBo)
March 27, 2018
No Launcher
So far we’ve ‘compiled’ our code (well, technically ‘assembled’ it) and we’ve been able to access a simple version of System.Console, but how do we actually run our .exe? Remember we can’t use the dotnet run command because that lives in the dotnet/CLI repository (and that would be breaking the rules of this slightly contrived challenge!!).
Again, fortunately those clever runtime engineers have thought of this exact scenario and they built the very helpful corerun application. You can read more about it in Using corerun To Run .NET Core Application, but the tl;dr is that it will only look for dependencies in the same folder as your .exe.
So, to complete the challenge, we can now run CoreRun HelloWorld.exe:
# CoreRun HelloWorld.exe
Hello World!
Yay, the least impressive demo you’ll see this year!!
For more information on how you can ‘host’ the CLR in your application I recommend this excellent tutorial Write a custom .NET Core host to control the .NET runtime from your native code. In addition, the docs page on ‘Runtime Hosts’ gives a nice overview of the different hosts that are available:
The .NET Framework ships with a number of different runtime hosts, including the hosts listed in the following table.
Runtime Host
Description
ASP.NET
Loads the runtime into the process that is to handle the Web request. ASP.NET also creates an application domain for each Web application that will run on a Web server.
Microsoft Internet Explorer
Creates application domains in which to run managed controls. The .NET Framework supports the download and execution of browser-based controls. The runtime interfaces with the extensibility mechanism of Microsoft Internet Explorer through a mime filter to create application domains in which to run the managed controls. By default, one application domain is created for each Web site.
Shell executables
Invokes runtime hosting code to transfer control to the runtime each time an executable is launched from the shell.
Thu, 13 Dec 2018, 12:00 am
Open Source .NET – 4 years later
A little over 4 years ago Microsoft announced that they were open sourcing large parts of the .NET framework and as this slide from New Features in .NET Core and ASP.NET Core 2.1 shows, the community has been contributing in a significant way:
Side-note: This post forms part of an on-going series, if you want to see how things have changed over time you can check out the previous ones:
Runtime Changes
Before I look at the numbers, I just want to take a moment to look at the significant runtime changes that have taken place over the last 4 years. Partly because I really like looking at the ‘Internals’ of CoreCLR, but also because the runtime is the one repository that makes all the others possible, they rely on it!
To give some context, here’s the slides from a presentation I did called ‘From ‘dotnet run’ to ‘hello world’. If you flick through them you’ll see what components make up the CoreCLR code-base and what they do to make your application run.
From 'dotnet run' to 'hello world' from
Matt Warren
So, after a bit of digging through the 19,059 commits, 5,790 issues and the 8 projects, here’s the list of significant changes in the .NET Core Runtime (CoreCLR) over the last few years (if I’ve missed any out, please let me know!!):
- Span (more info)
- ref-like types (to support Span)
- Tiered Compilation (more info)
- Cross-platform (Unix, OS X, etc, see list of all ‘os-xxx’ labels)
- New CPU Architectures
- Hardware Intrinsics (project)
- Default Interface Methods (project)
- Performance Monitoring and Diagnostics (project)
- Ready-to-Run Images
- LocalGC (project)
- Unloadability (project)
So there have been quite a few large, fundamental changes to the runtime since it was open-sourced.
Repository activity over time
But onto the data, first we are going to look at an overview of the level of activity in each repo, by analysing the total number of ‘Issues’ (created) or ‘Pull Requests’ (closed) per month. (Sparklines FTW!!). If you are interested in how I got the data, see the previous post because the process is the same.
Note: Numbers in black are from the most recent month, with the red dot showing the lowest and the green dot the highest previous value. You can toggle between Issues and Pull Requests by clicking on the buttons, hover over individual sparklines to get a tooltip showing the per/month values and click on the project name to take you to the GitHub page for that repository.
Issues
Pull Requests
This data gives a good indication of how healthy the different repos are: are they growing over time, or staying the same? You can also see the different levels of activity each repo has and how they compare to other ones.
Whilst it’s clear that Visual Studio Code is way ahead of all the other repos (in ‘# of Issues’), it’s interesting to see that some of the .NET-only ones are still pretty large, notably CoreFX (base-class libraries), Roslyn (compiler) and CoreCLR (runtime).
Next we will look at the total participation from the last 4 years, i.e. November 2014 to November 2018. All Pull Requests and Issues are treated equally, so a large PR counts the same as one that fixes a spelling mistake. Whilst this isn’t ideal, it’s the simplest way to get an idea of the Microsoft/Community split. In addition, ‘Community’ does include people paid by other companies to work on .NET projects, for instance Samsung engineers.
Note: You can hover over the bars to get the actual numbers, rather than percentages.
Issues:
Microsoft
Community
Pull Requests:
Microsoft
Community
Finally we can see the ‘per-month’ data from the last 4 years, i.e. November 2014 to November 2018.
Note: You can inspect different repos by selecting them from the pull-down list, but be aware that the y-axis on the graphs are re-scaled, so the maximum value will change each time.
Issues:
Microsoft
Community
Pull Requests:
Microsoft
Community
Summary
It’s clear that the community continues to be invested in the .NET-related, Open Source repositories, contributing significantly and for a sustained period of time. I think this is good for all .NET developers, whether you contribute to OSS or not, having .NET be a thriving, Open Source product has many benefits!
Tue, 4 Dec 2018, 12:00 am
A History of .NET Runtimes
Recently I was fortunate enough to chat with Chris Bacon who wrote DotNetAnywhere (an alternative .NET Runtime) and I quipped with him:
.. you’re probably one of only a select group(*) of people who’ve written a .NET runtime, that’s pretty cool!
* if you exclude people who were paid to work on one, i.e. Microsoft/Mono/Xamarin engineers, it’s a very select group.
But it got me thinking, how many .NET Runtimes are there? I put together my own list, then enlisted a crack team of highly-paid researchers, a.k.a my twitter followers:
#LazyWeb, fun Friday quiz, how many different .NET Runtimes are there? (that implement ECMA-335 https://t.co/76stuYZLrw)
- .NET Framework
- .NET Core
- Mono
- Unity
- .NET Compact Framework
- DotNetAnywhere
- Silverlight
What have I missed out?
— Matt Warren (@matthewwarren)
September 14, 2018
For the purposes of this post I’m classifying a ‘.NET Runtime’ as anything that implements the ECMA-335 Standard for .NET (more info here). I don’t know if there’s a more precise definition or even some way of officially verifying conformance, but in practice it means that the runtimes can take a .NET exe/dll produced by any C#/F#/VB.NET compiler and run it.
Once I had the list, I made copious use of wikipedia (see the list of ‘References’) and came up with the following timeline:
Timeline maker
(If the interactive timeline isn’t working for you, take a look at this version)
If I’ve missed out any runtimes, please let me know!
To make the timeline a bit easier to understand, I put each runtime into one of the following categories:
- Microsoft .NET Frameworks
- Other Microsoft Runtimes
- Mono/Xamarin Runtimes
- 'Ahead-of-Time' (AOT) Runtimes
- Community Projects
- Research Projects
The rest of the post will look at the different runtimes in more detail: why they were created, what they can do and how they compare to each other.
Microsoft .NET Frameworks
The original ‘.NET Framework’ was started by Microsoft in the late 1990s and has been going strong ever since. Recently they’ve changed course somewhat with the announcement of .NET Core, which is ‘open-source’ and ‘cross-platform’. In addition, by creating the .NET Standard they’ve provided a way for different runtimes to remain compatible:
.NET Standard is for sharing code. .NET Standard is a set of APIs that all .NET implementations must provide to conform to the standard. This unifies the .NET implementations and prevents future fragmentation.
As an aside, if you want more information on the ‘History of .NET’, I really recommend Anders Hejlsberg - What brought about the birth of the CLR? and this presentation by Richard Campbell who really knows how to tell a story!
(Also available as a podcast if you’d prefer and he’s working on a book covering the same subject. If you want to learn more about the history of the entire ‘.NET Ecosystem’ not just the Runtimes, check out ‘Legends of .NET’)
Other Microsoft Runtimes
But outside of the main general purpose ‘.NET Framework’, Microsoft have also released other runtimes, designed for specific scenarios.
.NET Compact Framework
The Compact (.NET CF) and Micro (.NET MF) Frameworks were both attempts to provide cut-down runtimes that would run on more constrained devices, for instance .NET CF:
… is designed to run on resource constrained mobile/embedded devices such as personal digital assistants (PDAs), mobile phones, factory controllers, set-top boxes, etc. The .NET Compact Framework uses some of the same class libraries as the full .NET Framework and also a few libraries designed specifically for mobile devices such as .NET Compact Framework controls. However, the libraries are not exact copies of the .NET Framework; they are scaled down to use less space.
.NET Micro Framework
The .NET MF is even more constrained:
… for resource-constrained devices with at least 256 KB of flash and 64 KB of random-access memory (RAM). It includes a small version of the .NET Common Language Runtime (CLR) and supports development in C#, Visual Basic .NET, and debugging (in an emulator or on hardware) using Microsoft Visual Studio. NETMF features a subset of the .NET base class libraries (about 70 classes with about 420 methods),..
NETMF also features added libraries specific to embedded applications. It is free and open-source software released under Apache License 2.0.
If you want to try it out, Scott Hanselman did a nice write-up The .NET Micro Framework - Hardware for Software People.
Silverlight
Although now only in support mode (or ‘dead’/‘sunsetted’ depending on your POV), it’s interesting to go back to the original announcement and see what Silverlight was trying to do:
Silverlight is a cross platform, cross browser .NET plug-in that enables designers and developers to build rich media experiences and RIAs for browsers. The preview builds we released this week currently support Firefox, Safari and IE browsers on both the Mac and Windows.
Back in 2007, Silverlight 1.0 had the following features (it even worked on Linux!):
- Built-in codec support for playing VC-1 and WMV video, and MP3 and WMA audio within a browser…
- Silverlight supports the ability to progressively download and play media content from any web-server…
- Silverlight also optionally supports built-in media streaming…
- Silverlight enables you to create rich UI and animations, and blend vector graphics with HTML to create compelling content experiences…
- Silverlight makes it easy to build rich video player interactive experiences…
Mono/Xamarin Runtimes
Mono came about when Miguel de Icaza and others explored the possibility of making .NET work on Linux (from Mono early history):
Who came first is not an important question to me, because Mono to me is a means to an end: a technology to help Linux succeed on the desktop.
The same post also talks about how it started:
On the Mono side, the events were approximately like this:
As soon as the .NET documents came out in December 2000, I got really interested in the technology, and started where everyone starts: at the byte code interpreter, but I faced a problem: there was no specification for the metadata though.
The last modification to the early VM sources was done on January 22 2001, around that time I started posting to the .NET mailing lists asking for the missing information on the metadata file format.
…
About this time Sam Ruby was pushing at the ECMA committee to get the binary file format published, something that was not part of the original agenda. I do not know how things developed, but by April 2001 ECMA had published the file format.
Over time, Mono (now Xamarin) has branched out into wider areas. It runs on Android and iOS/Mac and was acquired by Microsoft in Feb 2016. In addition, Unity & Mono/Xamarin have long worked together to provide C# support in Unity, and Unity is now a member of the .NET Foundation.
'Ahead-of-Time' (AOT) Runtimes
I wanted to include AOT runtimes as a separate category because traditionally .NET has been ‘Just-in-Time’ compiled, but over time more and more ‘Ahead-of-Time’ compilation options have become available.
As far as I can tell, Mono was the first, with an ‘AOT’ mode since Aug 2006, but more recently Microsoft have released .NET Native and they’re working on CoreRT - A .NET Runtime for AOT.
Community Projects
However, not all ‘.NET Runtimes’ were developed by Microsoft, or companies that they later acquired. There are some ‘Community’ owned ones:
- The oldest is DotGNU Portable.NET, which started at the same time as Mono, with the goal ‘to build a suite of Free Software tools to compile and execute applications for the Common Language Infrastructure (CLI)..’.
- Secondly, there is DotNetAnywhere, the work of just one person, Chris Bacon. DotNetAnywhere has the claim to fame that it provided the initial runtime for the Blazor project. However it’s also an excellent resource if you want to look at what makes up a ‘.NET Compatible-Runtime’ and don’t have the time to wade through the millions of lines-of-code that make up the CoreCLR!
- Next comes CosmosOS (GitHub project), which is not just a .NET Runtime, but a ‘Managed Operating System’. If you want to see how it achieves this I recommend reading through the excellent FAQ or taking a quick look under the hood. Another similar effort is SharpOS.
- Finally, I recently stumbled across CrossNet, which takes a different approach, it ‘parses .NET assemblies and generates unmanaged C++ code that can be compiled on any standard C++ compiler.’ Take a look at the overview docs and example of generated code to learn more.
Research Projects
Finally, onto the more esoteric .NET Runtimes. These are the research projects run by Microsoft, with the aim of seeing just how far you can extend a ‘managed runtime’ and what it can be used for. Some of this research work has made its way back into commercial/shipping .NET Runtimes; for instance Span came from Midori.
Shared Source Common Language Infrastructure (SSCLI) (a.k.a. ‘Rotor’):
is Microsoft’s shared source implementation of the CLI, the core of .NET. Although the SSCLI is not suitable for commercial use due to its license, it does make it possible for programmers to examine the implementation details of many .NET libraries and to create modified CLI versions. Microsoft provides the Shared Source CLI as a reference CLI implementation suitable for educational use.
An interesting side-effect of releasing Rotor is that they were also able to release the ‘Gyro’ Project, which gives an idea of how Generics were added to the .NET Runtime.
Midori:
Midori was the code name for a managed code operating system being developed by Microsoft with joint effort of Microsoft Research. It had been reported to be a possible commercial implementation of the Singularity operating system, a research project started in 2003 to build a highly dependable operating system in which the kernel, device drivers, and applications are all written in managed code. It was designed for concurrency, and could run a program spread across multiple nodes at once. It also featured a security model that sandboxes applications for increased security. Microsoft had mapped out several possible migration paths from Windows to Midori. The operating system was discontinued some time in 2015, though many of its concepts were rolled into other Microsoft projects.
Midori is the project that appears to have led to the most ideas making their way back into the ‘.NET Framework’, you can read more about this in Joe Duffy’s excellent series Blogging about Midori
- A Tale of Three Safeties
- Objects as Secure Capabilities
- Asynchronous Everything
- Safe Native Code
- The Error Model
- Performance Culture
- 15 Years of Concurrency
Singularity (operating system) (also Singularity RDK)
Singularity is an experimental operating system (OS) which was built by Microsoft Research between 2003 and 2010. It was designed as a high dependability OS in which the kernel, device drivers, and application software were all written in managed code. Internal security uses type safety instead of hardware memory protection.
Last, but not least, there is Redhawk:
Codename for experimental minimal managed code runtime that evolved into CoreRT.
References
Below are the Wikipedia articles I referenced when creating the timeline:
Tue, 2 Oct 2018, 12:00 am
Fuzzing the .NET JIT Compiler
I recently came across the excellent ‘Fuzzlyn’ project, created as part of the ‘Language-Based Security’ course at Aarhus University. As per the project description Fuzzlyn is a:
… fuzzer which utilizes Roslyn to generate random C# programs
And what is a ‘fuzzer’, from the Wikipedia page for ‘fuzzing’:
Fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program.
Or in other words, a fuzzer is a program that tries to create source code that finds bugs in a compiler.
Massive kudos to the developers behind Fuzzlyn, Jakob Botsch Nielsen (who helped answer my questions when writing this post), Chris Schmidt and Jonas Larsen, it’s an impressive project!! (to be clear, I have no link with the project and can’t take any of the credit for it)
Compilation in .NET
But before we dive into ‘Fuzzlyn’ and what it does, we’re going to take a quick look at ‘compilation’ in the .NET Framework. When you write C#/VB.NET/F# code (delete as appropriate) and compile it, the compiler converts it into Intermediate Language (IL) code. The IL is then stored in a .exe or .dll, which the Common Language Runtime (CLR) reads and executes when your program is actually run. However it’s the job of the Just-in-Time (JIT) Compiler to convert the IL code into machine code.
Why is this relevant? Because Fuzzlyn works by comparing the output of a Debug and a Release version of a program and if they are different, there’s a bug! But it turns out that very few optimisations are actually done by the ‘Roslyn’ compiler, compared to what the JIT does, from Eric Lippert’s excellent post What does the optimize switch do? (2009)
The /optimize flag does not change a huge amount of our emitting and generation logic. We try to always generate straightforward, verifiable code and then rely upon the jitter to do the heavy lifting of optimizations when it generates the real machine code. But we will do some simple optimizations with that flag set. For example, with the flag set:
He then goes on to list the 15 things that the C# Compiler will optimise, before finishing with this:
That’s pretty much it. These are very straightforward optimizations; there’s no inlining of IL, no loop unrolling, no interprocedural analysis whatsoever. We let the jitter team worry about optimizing the heck out of the code when it is actually spit into machine code; that’s the place where you can get real wins.
So in .NET, very few of the techniques that an ‘Optimising Compiler’ uses are done at compile-time. They are almost all done at run-time by the JIT Compiler (leaving aside AOT scenarios for the time being).
For reference, most of the differences in IL are there to make the code easier to debug, for instance given this C# code:
public void M() {
foreach (var item in new [] { 1, 2, 3, 4 }) {
Console.WriteLine(item);
}
}
The differences in IL are shown below (‘Release’ on the left, ‘Debug’ on the right). As you can see there are a few extra nop instructions to allow the debugger to ‘step-through’ more locations in the code, plus an extra local variable, which makes it easier/possible to see the value when debugging.
(click for larger image or you can view the ‘Release’ version and the ‘Debug’ version on the excellent SharpLab)
For more information on the differences in Release/Debug code-gen see the ‘Release (optimized)’ section in this doc on CodeGen Differences. Also, because Roslyn is open-source we can see how this is handled in the code:
This all means that the ‘Fuzzlyn’ project has actually been finding bugs in the .NET JIT, not in the Roslyn Compiler (well, except this one: ‘Finally block belonging to unexecuted try runs anyway’, which was fixed here).
How it works
At the simplest level, Fuzzlyn works by compiling and running a piece of randomly generated code in ‘Debug’ and ‘Release’ versions and comparing the output. If the 2 versions produce different results, then it’s a bug, specifically a bug in the optimisations that the JIT compiler has attempted.
The .NET JIT, known as ‘RyuJIT’, has several modes. It can produce fully optimised code that has the highest performance, or it can produce more ‘debug’-friendly code that has no optimisations but is much simpler. You can find out more about the different ‘optimisations’ that RyuJIT performs in this excellent tutorial, in this design doc, or you can search through the code for usages of the ‘compDbgCode’ flag.
From a high-level Fuzzlyn goes through the following steps:
- Randomly generate a C# program
- Check if the code produces an error (Debug v. Release)
- Reduce the code to its simplest form
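The loop described above can be sketched as follows. To be clear, this is not Fuzzlyn’s actual code (it uses Roslyn in-process); the compiler invocation and file names here are illustrative assumptions, but the differential-testing idea is the same: build the candidate program with optimisations off and on, run both, and flag any difference.

```csharp
using System;
using System.Diagnostics;
using System.IO;

class DifferentialHarness
{
    // The heart of the technique: two builds of the *same* program should
    // behave identically - if they don't, an optimisation (in .NET, usually
    // one done by the JIT) has changed observable behaviour
    public static bool IsSuspicious(string debugOutput, string releaseOutput)
        => debugOutput != releaseOutput;

    // Compile 'sourceFile' with optimisations off/on and capture its output
    // (assumes a 'csc' compiler is on the PATH - purely illustrative)
    static string CompileAndRun(string sourceFile, bool optimize)
    {
        string exe = optimize ? "release.exe" : "debug.exe";
        Process.Start("csc", $"/optimize{(optimize ? "+" : "-")} /out:{exe} {sourceFile}")
               .WaitForExit();
        var psi = new ProcessStartInfo(exe)
        {
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using var p = Process.Start(psi);
        string output = p.StandardOutput.ReadToEnd();
        p.WaitForExit();
        return output;
    }

    static void Main()
    {
        // 'candidate.cs' stands in for a randomly-generated program
        if (!File.Exists("candidate.cs"))
        {
            Console.WriteLine("no candidate.cs found - nothing to test");
            return;
        }

        string debug = CompileAndRun("candidate.cs", optimize: false);
        string release = CompileAndRun("candidate.cs", optimize: true);

        if (IsSuspicious(debug, release))
            Console.WriteLine("Found a candidate bug - now reduce candidate.cs!");
    }
}
```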
If you want to see this in action, I ran Fuzzlyn until it produced a randomly generated program with a bug. You can see the original source (6,802 LOC) and the reduced version (28 LOC). What’s interesting is that you can clearly see the buggy line-of-code in the original code, before it’s turned into a simplified version:
// Generated by Fuzzlyn v1.1 on 2018-08-22 15:19:26
// Seed: 14928117313359926641
// Reduced from 256.3 KiB to 0.4 KiB in 00:01:58
// Debug: Prints 0 line(s)
// Release: Prints 1 line(s)
public class Program
{
static short s_18;
static byte s_33 = 1;
static int[] s_40 = new int[]{0};
static short s_74 = 1;
public static void Main()
{
s_18 = -1;
// This comparison is the bug, in Debug it's False, in Release it's True
// However, '(ushort)(s_18 | 2L)' is 65,535 in Debug *and* Release
if (((ushort)(s_18 | 2L)
Tue, 28 Aug 2018, 12:00 am
Monitoring and Observability in the .NET Runtime
.NET is a managed runtime, which means that it provides high-level features that ‘manage’ your program for you, from Introduction to the Common Language Runtime (CLR) (written in 2007):
The runtime has many features, so it is useful to categorize them as follows:
- Fundamental features – Features that have broad impact on the design of other features. These include:
- Garbage Collection
- Memory Safety and Type Safety
- High level support for programming languages.
- Secondary features – Features enabled by the fundamental features that may not be required by many useful programs:
- Program isolation with AppDomains
- Program Security and sandboxing
- Other Features – Features that all runtime environments need but that do not leverage the fundamental features of the CLR. Instead, they are the result of the desire to create a complete programming environment. Among them are:
- Versioning
- Debugging/Profiling
- Interoperation
You can see that ‘Debugging/Profiling’, whilst not a Fundamental or Secondary feature, still makes it into the list because of a ‘desire to create a complete programming environment’.
The rest of this post will look at what Monitoring, Observability and Introspection features the Core CLR provides, why they’re useful and how it provides them.
To make it easier to navigate, the post is split up into 3 main sections (with some ‘extra-reading material’ at the end):
Diagnostics
Firstly we are going to look at the diagnostic information that the CLR provides, which has traditionally been supplied via ‘Event Tracing for Windows’ (ETW).
There is quite a wide range of events that the CLR provides related to:
- Garbage Collection (GC)
- Just-in-Time (JIT) Compilation
- Module and AppDomains
- Threading and Lock Contention
- and much more
For example this is where the AppDomain Load event is fired, this is the Exception Thrown event and here is the GC Allocation Tick event.
Perf View
If you want to see the ETW Events coming from your .NET program I recommend using the excellent PerfView tool and starting with these PerfView Tutorials or this excellent talk PerfView: The Ultimate .NET Performance Tool. PerfView is highly regarded because it provides invaluable information; for instance, Microsoft engineers regularly use it for performance investigations.
Common Infrastructure
However, in case it wasn’t clear from the name, ETW events are only available on Windows, which doesn’t really fit into the new ‘cross-platform’ world of .NET Core. You can use PerfView for Performance Tracing on Linux (via LTTng), but that is only the cmd-line collection tool, known as ‘PerfCollect’, the analysis and rich UI (which includes flamegraphs) is currently Windows only.
But if you do want to analyse .NET performance on Linux, there are some other approaches:
The 2nd link above discusses the new ‘EventPipe’ infrastructure that is being worked on in .NET Core (along with EventSources & EventListeners, can you spot a theme!), you can see its aims in Cross-Platform Performance Monitoring Design. At a high-level it will provide a single place for the CLR to push ‘events’ related to diagnostics and performance. These ‘events’ will then be routed to one or more loggers which may include ETW, LTTng, and BPF for example, with the exact logger being determined by which OS/Platform the CLR is running on. There is also more background information in .NET Cross-Plat Performance and Eventing Design that explains the pros/cons of the different logging technologies.
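The EventSource/EventListener pairing mentioned above can already be used in-process today. The sketch below subscribes to the runtime’s built-in ‘Microsoft-Windows-DotNETRuntime’ provider and prints the GC-related events it receives (the keyword value 0x1 is the GC keyword; which events actually fire depends on the runtime version).

```csharp
using System;
using System.Diagnostics.Tracing;

// An in-process listener for the CLR's own events; no external tooling
// (ETW, LTTng, etc.) is required
class ClrEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        // The CLR exposes its events via this built-in provider name
        if (source.Name == "Microsoft-Windows-DotNETRuntime")
            EnableEvents(source, EventLevel.Informational, (EventKeywords)0x1); // GC keyword
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        Console.WriteLine(eventData.EventName);
    }
}

class Program
{
    static void Main()
    {
        using var listener = new ClrEventListener();
        GC.Collect(); // force a collection so some GC events are raised
    }
}
```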
All the work being done on ‘Event Pipes’ is being tracked in the ‘Performance Monitoring’ project and the associated ‘EventPipe’ Issues.
Future Plans
Finally, there are also future plans for a Performance Profiling Controller which has the following goal:
The controller is responsible for control of the profiling infrastructure and exposure of performance data produced by .NET performance diagnostics components in a simple and cross-platform way.
The idea is for it to expose the following functionality via a HTTP server, by pulling all the relevant data from ‘Event Pipes’:
REST APIs
- Pri 1: Simple Profiling: Profile the runtime for X amount of time and return the trace.
- Pri 1: Advanced Profiling: Start tracing (along with configuration)
- Pri 1: Advanced Profiling: Stop tracing (the response to calling this will be the trace itself)
- Pri 2: Get the statistics associated with all EventCounters or a specified EventCounter.
Browsable HTML Pages
- Pri 1: Textual representation of all managed code stacks in the process.
- Provides a snapshot overview of what’s currently running for use as a simple diagnostic report.
- Pri 2: Display the current state (potentially with history) of EventCounters.
- Provides an overview of the existing counters and their values.
- OPEN ISSUE: I don’t believe the necessary public APIs are present to enumerate EventCounters.
I’m excited to see where the ‘Performance Profiling Controller’ (PPC?) goes; I think it’ll be really valuable for .NET to have this built into the CLR, as it’s something that other runtimes already have.
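The EventCounters that the proposed REST APIs would expose can be published from your own code today (on recent .NET Core versions). A minimal sketch, where the source and counter names (‘MyApp-Events’, ‘request-time’) are purely illustrative:

```csharp
using System;
using System.Diagnostics.Tracing;

// A custom EventSource that publishes a single EventCounter; consumers
// (PerfView, dotnet-counters, or a future controller) can then sample it
[EventSource(Name = "MyApp-Events")]
sealed class MyAppEventSource : EventSource
{
    public static readonly MyAppEventSource Log = new MyAppEventSource();

    private readonly EventCounter _requestTime;

    private MyAppEventSource() =>
        _requestTime = new EventCounter("request-time", this);

    // Record one measurement; the runtime aggregates (min/max/mean) per interval
    public void RecordRequest(float milliseconds) =>
        _requestTime.WriteMetric(milliseconds);
}

class Program
{
    static void Main()
    {
        MyAppEventSource.Log.RecordRequest(12.5f);
        Console.WriteLine("metric recorded");
    }
}
```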
Profiling
Another powerful feature the CLR provides is the Profiling API, which is (mostly) used by 3rd-party tools to hook into the runtime at a very low level. You can find out more about the API in this overview, but at a high level it allows you to wire up callbacks that are triggered when:
- GC-related events happen
- Exceptions are thrown
- Assemblies are loaded/unloaded
- much, much more
Image from the BOTR page Profiling API – Overview
In addition, it has other very powerful features. Firstly, you can set up hooks that are called every time a .NET method is executed, whether in the runtime or in user code. These callbacks are known as ‘Enter/Leave’ hooks and there is a nice sample that shows how to use them; however, to make them work you need to understand ‘calling conventions’ across different OSes and CPU architectures, which isn’t always easy. Also, as a warning, the Profiling API is a COM component that can only be accessed via C/C++ code, you can’t use it from C#/F#/VB.NET!
Secondly, the Profiler is able to re-write the IL code of any .NET method before it is JITted, via the SetILFunctionBody() API. This API is hugely powerful and forms the basis of many .NET APM Tools, you can learn more about how to use it in my previous post How to mock sealed classes and static methods and the accompanying code.
ICorProfiler API
It turns out that the run-time has to perform all sorts of crazy tricks to make the Profiling API work, just look at what went into this PR Allow rejit on attach (for more info on ‘ReJIT’ see ReJIT: A How-To Guide).
The overall definition for all the Profiling API interfaces and callbacks is found in \vm\inc\corprof.idl (see Interface description language). But it’s divided into 2 logical parts; one is the Profiler -> ‘Execution Engine’ (EE) interface, known as ICorProfilerInfo:
// Declaration of class that implements the ICorProfilerInfo* interfaces, which allow the
// Profiler to communicate with the EE. This allows the Profiler DLL to get
// access to private EE data structures and other things that should never be exported
// outside of the EE.
Which is implemented in the following files:
The other main part is the EE -> Profiler callbacks, which are grouped together under the ICorProfilerCallback interface:
// This module implements wrappers around calling the profiler's
// ICorProfilerCallaback* interfaces. When code in the EE needs to call the
// profiler, it goes through EEToProfInterfaceImpl to do so.
These callbacks are implemented across the following files:
Finally, it’s worth pointing out that the Profiler APIs might not work across all OSes and CPU-archs that .NET Core runs on, e.g. ELT call stub issues on Linux, see Status of CoreCLR Profiler APIs for more info.
Profiling v. Debugging
As a quick aside, ‘Profiling’ and ‘Debugging’ do have some overlap, so it’s helpful to understand what the different APIs provide in the context of the .NET Runtime, from CLR Debugging vs. CLR Profiling
Debugging
Debugging means different things to different people, for instance I asked on Twitter “what are the ways that you’ve debugged a .NET program” and got a wide range of different responses, although both sets of responses contain a really good list of tools and techniques, so they’re worth checking out, thanks #LazyWeb!
But perhaps this quote best sums up what Debugging really is 😊
Debugging is like being the detective in a crime movie where you are also the murderer.
— Filipe Fortes (@fortes)
November 10, 2013
The CLR provides a very extensive range of features related to Debugging, but why does it need to provide these services? The excellent post Why is managed debugging different than native-debugging? provides 3 reasons:
- Native debugging can be abstracted at the hardware level but managed debugging needs to be abstracted at the IL level
- Managed debugging needs a lot of information not available until runtime
- A managed debugger needs to coordinate with the Garbage Collector (GC)
So to give a decent experience, the CLR has to provide the higher-level debugging API known as ICorDebug, which is shown in the image below of a 'common debugging scenario' from the BOTR:
In addition, there is a nice description of how the different parts interact in How do Managed Breakpoints work?:
Here’s an overview of the pipeline of components:
1) End-user
2) Debugger (such as Visual Studio or MDbg).
3) CLR Debugging Services (which we call "The Right Side"). This is the implementation of ICorDebug (in mscordbi.dll).
---- process boundary between Debugger and Debuggee ----
4) CLR. This is mscorwks.dll. This contains the in-process portion of the debugging services (which we call "The Left Side") which communicates directly with the RS in stage #3.
5) Debuggee's code (such as end users C# program)
ICorDebug API
But how is all this implemented and what are the different components, from CLR Debugging, a brief introduction:
All of .Net debugging support is implemented on top of a dll we call "The Dac". This file (usually named mscordacwks.dll) is the building block for both our public debugging API (ICorDebug) as well as the two private debugging APIs: The SOS-Dac API and IXCLR.

In a perfect world, everyone would use ICorDebug, our public debugging API. However a vast majority of features needed by tool developers such as yourself is lacking from ICorDebug. This is a problem that we are fixing where we can, but these improvements go into CLR v.next, not older versions of CLR. In fact, the ICorDebug API only added support for crash dump debugging in CLR v4. Anyone debugging CLR v2 crash dumps cannot use ICorDebug at all!

(for an additional write-up, see SOS & ICorDebug)
The ICorDebug API is actually split up into multiple interfaces, there are over 70 of them!! I won't list them all here, but I will show the categories they fall into, for more info see Partition of ICorDebug where this list came from, as it goes into much more detail.
- Top-level: ICorDebug + ICorDebug2 are the top-level interfaces which effectively serve as a collection of ICorDebugProcess objects.
- Callbacks: Managed debug events are dispatched via methods on a callback object implemented by the debugger
- Process: This set of interfaces represents running code and includes the APIs related to eventing.
- Code / Type Inspection: Could mostly operate on a static PE image, although there are a few convenience methods for live data.
- Execution Control: Execution is the ability to “inspect” a thread’s execution. Practically, this means things like placing breakpoints (F9) and doing stepping (F11 step-in, F10 step-over, S+F11 step-out). ICorDebug’s Execution control only operates within managed code.
- Threads + Callstacks: Callstacks are the backbone of the debugger’s inspection functionality. The following interfaces are related to taking a callstack. ICorDebug only exposes debugging managed code, and thus the stacks traces are managed-only.
- Object Inspection: Object inspection is the part of the API that lets you see the values of the variables throughout the debuggee. For each interface, I list the "MVP" method that I think most succinctly conveys the purpose of that interface.
One other note, as with the Profiling APIs the level of support for the Debugging API varies across OSes and CPU architectures. For instance, as of Aug 2018 there's "no solution for Linux ARM of managed debugging and diagnostic". For more info on 'Linux' support in general, see this great post Debugging .NET Core on Linux with LLDB and check out the Diagnostics repository from Microsoft that has the goal of making it easier to debug .NET programs on Linux.
Finally, if you want to see what the ICorDebug APIs look like in C#, take a look at the wrappers included in the CLRMD library, including all the available callbacks (CLRMD will be covered in more depth later on in this post).
SOS and the DAC
The 'Data Access Component' (DAC) is discussed in detail in the BOTR page, but in essence it provides 'out-of-process' access to the CLR data structures, so that their internal details can be read from another process. This allows a debugger (via ICorDebug) or the 'Son of Strike' (SOS) extension to reach into a running instance of the CLR or a memory dump and find things like:
- all the running threads
- what objects are on the managed heap
- full information about a method, including the machine code
- the current ‘stack trace’
Quick aside, if you want an explanation of all the strange names and a bit of a ‘.NET History Lesson’ see this Stack Overflow answer.
The full list of SOS Commands is quite impressive and using it alongside WinDBG gives you a very low-level insight into what's going on in your program and the CLR. To see how it's implemented, let's take a look at the !HeapStat command that gives you a summary of the size of the different Heaps that the .NET GC is using:
(image from SOS: Upcoming release has a few new commands – HeapStat)
Here’s the code flow, showing how SOS and the DAC work together:
- SOS: The full !HeapStat command (link)
- SOS: The code in the !HeapStat command that deals with the 'Workstation GC' (link)
- SOS: The GCHeapUsageStats(..) function that does the heavy-lifting (link)
- Shared: The DacpGcHeapDetails data structure that contains pointers to the main data in the GC heap, such as segments, card tables and individual generations (link)
- DAC: The GetGCHeapStaticData function that fills-out the DacpGcHeapDetails struct (link)
- Shared: The DacpHeapSegmentData data structure that contains details for an individual 'segment' within the GC Heap (link)
- DAC: The GetHeapSegmentData(..) function that fills-out the DacpHeapSegmentData struct (link)
3rd Party ‘Debuggers’
Because Microsoft published the debugging API it allowed 3rd parties to make use of the ICorDebug interfaces, here's a list of some that I've come across:
Memory Dumps
The final area we are going to look at is 'memory dumps', which can be captured from a live system and analysed off-line. The .NET runtime has always had good support for creating 'memory dumps' on Windows and now that .NET Core is 'cross-platform', there are also tools available to do the same on other OSes.
One of the issues with 'memory dumps' is that it can be tricky to get hold of the correct, matching versions of the SOS and DAC files. Fortunately Microsoft have just released the dotnet symbol CLI tool that:
can download all the files needed for debugging (symbols, modules, SOS and DAC for the coreclr module given) for any given core dump, minidump or any supported platform’s file formats like ELF, MachO, Windows DLLs, PDBs and portable PDBs.
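It ships as a 'global tool', so usage is roughly as follows (the dump path here is a placeholder):

```shell
# Install the global tool (one-off)
dotnet tool install -g dotnet-symbol

# Download the matching symbols, modules, SOS and DAC next to the dump file
dotnet-symbol /path/to/coredump.1234
```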
Finally, if you spend any length of time analysing 'memory dumps' you really should take a look at the excellent CLR MD library that Microsoft released a few years ago. I've previously written about what you can do with it, but in a nutshell, it allows you to interact with memory dumps via an intuitive C# API, with classes that provide access to the ClrHeap, GC Roots, CLR Threads, Stack Frames and much more. In fact, aside from the time needed to implement the work, CLR MD could implement most (if not all) of the SOS commands.
But how does it work, from the announcement post:
The ClrMD managed library is a wrapper around CLR internal-only debugging APIs. Although those internal-only APIs are very useful for diagnostics, we do not support them as a public, documented release because they are incredibly difficult to use and tightly coupled with other implementation details of the CLR. ClrMD addresses this problem by providing an easy-to-use managed wrapper around these low-level debugging APIs.
By making these APIs available, in an officially supported library, Microsoft have enabled developers to build a wide range of tools on top of CLRMD, which is a great result!
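To give a feel for the API, here's a minimal sketch (using the ClrMD v1 API from the Microsoft.Diagnostics.Runtime NuGet package) that loads a dump and prints the types using the most space on the managed heap; the dump path is whatever you pass on the command line:

```csharp
using System;
using System.Linq;
using Microsoft.Diagnostics.Runtime; // NuGet: Microsoft.Diagnostics.Runtime (v1.x)

class HeapStats
{
    static void Main(string[] args)
    {
        // Load a previously captured memory dump (path supplied by the user)
        using (DataTarget target = DataTarget.LoadCrashDump(args[0]))
        {
            ClrRuntime runtime = target.ClrVersions[0].CreateRuntime();

            // Group every object on the managed heap by type, largest total size first
            var stats = runtime.Heap.EnumerateObjects()
                .GroupBy(obj => obj.Type?.Name ?? "<unknown>")
                .Select(g => new { Type = g.Key, Count = g.Count(), Size = g.Sum(o => (long)o.Size) })
                .OrderByDescending(s => s.Size)
                .Take(10);

            foreach (var s in stats)
                Console.WriteLine($"{s.Size,12:N0} bytes  {s.Count,8:N0} objects  {s.Type}");
        }
    }
}
```

This is essentially a 10-line re-implementation of the core of SOS's !DumpHeap -stat, which shows why so many tools have been built on top of the library.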
So in summary, the .NET Runtime provides a wide-range of diagnostic, debugging and profiling features that allow a deep-insight into what’s going on inside the CLR.
Discuss this post on HackerNews, /r/programming or /r/csharp
Further Reading
Where appropriate I’ve included additional links that covers the topics discussed in this post.
General
ETW Events and PerfView:
Profiling API:
Debugging:
Memory Dumps:
Tue, 21 Aug 2018, 12:00 am
Presentations and Talks covering '.NET Internals'
I’m constantly surprised at just how popular resources related to ‘.NET Internals’ are, for instance take this tweet and the thread that followed:
If you like learning about '.NET Internals' here's a few talks/presentations I've watched that you might also like. First 'Writing High Performance Code in .NET' by Bart de Smet https://t.co/L5S9BsBlWe
— Matt Warren (@matthewwarren)
July 9, 2018
All I’d done was put together a list of Presentations/Talks (based on the criteria below) and people really seemed to appreciate it!!
Criteria
To keep things focussed, the talks or presentations:
- Must explain some aspect of the ‘internals’ of the .NET Runtime (CLR)
- i.e. something ‘under-the-hood’, the more ‘low-level’ the better!
- e.g. how the GC works, what the JIT does, how assemblies are structured, how to inspect what’s going on, etc
- Be entertaining and worth watching!
- i.e. worth someone giving up 40-50 mins of their time for
- this is hard when you’re talking about low-level details, not all speakers manage it!
- Needs to be a talk that I’ve watched myself and actually learnt something from
- i.e. I don’t just hope it’s good based on the speaker/topic
- Doesn’t have to be unique, fine if it overlaps with another talk
- it often helps having two people cover the same idea, from different perspectives
If you want more general lists of talks and presentations see Awesome talks and Awesome .NET Performance
List of Talks
Here’s the complete list of talks, including a few bonus ones that weren’t in the tweet:
- PerfView: The Ultimate .NET Performance Tool by Sasha Goldshtein
- Writing High Performance Code in .NET by Bart De Smet
- State of the .NET Performance by Adam Sitnik
- Let’s talk about microbenchmarking by Andrey Akinshin
- Safe Systems Programming in C# and .NET (summary) by Joe Duffy
- FlingOS - Using C# for an OS by Ed Nutting
- Maoni Stephens on .NET GC by Maoni Stephens
- What’s new for performance in .NET Core 2.0 by Ben Adams
- Open Source Hacking the CoreCLR by Geoff Norton
- .NET Core & Cross Platform by Matt Ellis
- .NET Core on Unix by Jan Vorlicek
- Multithreading Deep Dive by Gael Fraiteur
- Everything you need to know about .NET memory by Ben Emmett
I also added these 2 categories:
If I’ve missed any out, please let me know in the comments (or on twitter)
PerfView: The Ultimate .NET Performance Tool by Sasha Goldshtein (slides)
In fact, just watch all the talks/presentations that Sasha has done, they’re great!! For example Modern Garbage Collection in Theory and Practice and Making .NET Applications Faster
This talk is a great 'how-to' guide for PerfView, what it can do and how to use it (JIT stats, memory allocations, CPU profiling). For more on PerfView see this interview with its creator, Vance Morrison: Performance and PerfView.
Writing High Performance Code in .NET by Bart De Smet (he also has a some Pluralsight Courses on the same subject)
Features CLRMD, WinDBG, ETW Events and PerfView, plus some great ‘real world’ performance issues
State of the .NET Performance by Adam Sitnik (slides)
How to write high-perf code that plays nicely with the .NET GC, covering Span, Memory & ValueTask
Let’s talk about microbenchmarking by Andrey Akinshin (slides)
Primarily a look at how to benchmark .NET code, but along the way it demonstrates some of the internal behaviour of the JIT compiler (Andrey is the creator of BenchmarkDotNet)
Safe Systems Programming in C# and .NET (summary) by Joe Duffy (slides and blog)
Joe Duffy (worked on the Midori project) shows why C# is a good ‘System Programming’ language, including what low-level features it provides
FlingOS - Using C# for an OS by Ed Nutting (slides)
Shows what you need to do if you want to write an entire OS in C# (!!) The FlingOS project is worth checking out, it's a great learning resource.
Maoni Stephens on .NET GC by Maoni Stephens who is the main (only?) .NET GC developer. In addition CLR 4.5 Server Background GC and .NET 4.5 in Practice: Bing are also worth a watch.
An in-depth Q&A on how the .NET GC works, why it does what it does and how to use it efficiently
What’s new for performance in .NET Core 2.0 by Ben Adams (slides)
Whilst it mostly focuses on performance, there is some great internal details on how the JIT generates code for ‘de-virtualisation’, ‘exception handling’ and ‘bounds checking’
Open Source Hacking the CoreCLR by Geoff Norton
Making .NET Core (the CoreCLR) work on OSX was mostly a 'community contribution', this talk is a 'walk-through' of what it took to make it happen
.NET Core & Cross Platform by Matt Ellis, one of the .NET Runtime Engineers (this one on how they made .NET Core 'Open Source' is also worth a watch)
Discussion of the early work done to make CoreCLR ‘cross-platform’, including the build setup, ‘Platform Abstraction Layer’ (PAL) and OS differences that had to be accounted for
.NET Core on Unix by Jan Vorlicek a .NET Runtime Engineer (slides)
This talk discusses which parts of the CLR had to be changed to run on Unix, including exception handling, calling conventions, runtime suspension and the PAL
Multithreading Deep Dive by Gael Fraiteur (creator of PostSharp)
Takes a really in-depth look at the CLR memory-model and threading primitives
Everything you need to know about .NET memory by Ben Emmett (slides)
Explains how the .NET GC works using Lego! A very innovative and effective approach!!
Channel 9
The Channel 9 videos recorded by Microsoft deserve their own category, because there’s so much deep, technical information in them. This list is just a selection, including some of my favourites, there are many, many more available!!
Ones to watch
I can’t recommend these yet, because I haven’t watched them myself! (I can’t break my own rules!!).
But they all look really interesting and I will watch them as soon as I get a chance, so I thought they were worth including:
If this post causes you to go off and watch hours and hours of videos, ignoring friends, family and work for the next few weeks, Don’t Blame Me
Thu, 12 Jul 2018, 12:00 am
.NET JIT and CLR - Joined at the Hip
I’ve been digging into .NET Internals for a while now, but never really looked closely at how the ‘Just-in-Time’ (JIT) compiler works. In my mind, the interaction between the .NET Runtime and the JIT has always looked like this:
Nice and straight-forward, the CLR asks the JIT to compile some ‘Intermediate Language’ (IL) code into machine code and the JIT hands back the bytes when it’s done.
However, it turns out the interaction is much more complicated, in reality it looks more like this:
The JIT and the CLR’s ‘Execution Engine’ (EE) or ‘Virtual Machine’ (VM) work closely with one another, they really are ‘joined at the hip’.
The rest of this post will explore the interaction between the 2 components, how they work together and why they need to.
The JIT Compiler
As a quick aside, this post will not be talking about the internals of the JIT compiler itself, if you want to find out more about how that works I recommend reading the fantastic overview in the BOTR and this excellent tutorial, where this very helpful diagram comes from:
After all that, if you still want more, you can take a look at the ‘JIT’ section in the ‘Hitchhikers-Guide-to-the-CoreCLR-Source-Code’.
Components within the CLR
Before we go any further it's helpful to discuss how the 'Common Language Runtime' (CLR) is actually composed. It's made up of several different components including the VM/EE, JIT, GC and others. The treemap below shows the different areas of the source code, grouped by colour into the top-level sections they fall under. You can clearly see that the VM and JIT dominate, as well as 'mscorlib' which is the only component written in C#.
You can hover over an individual box to get more detailed information and can click on the different radio buttons to toggle the sizing (LOC/Files/Commits)
Note: This treemap is from my previous post ‘Hitchhikers-Guide-to-the-CoreCLR-Source-Code’ which was written over a year ago, so the exact numbers will have changed in the meantime.
You can also see these 'components' or 'areas' reflected in the classification scheme used for the CoreCLR GitHub issues (one difference is that area-CodeGen is used instead of JIT).
The CLR and the JIT Compiler
Onto the main subject, just how do the CLR and the JIT compiler work together to transform a method from IL to machine code? As always, the ‘Book of the Runtime’ is a good place to start, from the ‘Execution Environment and External Interface’ section of the RyuJIT Overview:
RyuJIT provides the just in time compilation service for the .NET runtime. The runtime itself is variously called the EE (execution engine), the VM (virtual machine) or simply the CLR (common language runtime). Depending upon the configuration, the EE and JIT may reside in the same or different executable files. RyuJIT implements the JIT side of the JIT/EE interfaces:
- ICorJitCompiler – this is the interface that the JIT compiler implements. This interface is defined in src/inc/corjit.h and its implementation is in src/jit/ee_il_dll.cpp. The following are the key methods on this interface:
  - compileMethod is the main entry point for the JIT. The EE passes it an ICorJitInfo object, and the "info" containing the IL, the method header, and various other useful tidbits. It returns a pointer to the code, its size, and additional GC, EH and (optionally) debug info.
  - getVersionIdentifier is the mechanism by which the JIT/EE interface is versioned. There is a single GUID (manually generated) which the JIT and EE must agree on.
  - getMaxIntrinsicSIMDVectorLength communicates to the EE the largest SIMD vector length that the JIT can support.
- ICorJitInfo – this is the interface that the EE implements. It has many methods defined on it that allow the JIT to look up metadata tokens, traverse type signatures, compute field and vtable offsets, find method entry points, construct string literals, etc. The bulk of this interface is inherited from ICorDynamicInfo which is defined in src/inc/corinfo.h. The implementation is defined in src/vm/jitinterface.cpp.
So there are 2 main interfaces: ICorJitCompiler, which is implemented by the JIT compiler and allows the EE to control how a method is compiled, and ICorJitInfo, which the EE implements to allow the JIT to request the information it needs during compilation.
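You can actually trigger this EE ➜ JIT handshake yourself from C#: the EE only asks the JIT for native code when a method is first needed, but RuntimeHelpers.PrepareMethod forces the compilation to happen eagerly. A small sketch:

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

class JitDemo
{
    // A method we can ask the EE to have JIT-compiled ahead of its first call
    public static int Add(int a, int b) => a + b;

    static void Main()
    {
        MethodInfo method = typeof(JitDemo).GetMethod(nameof(Add));

        // This is the point at which the EE calls into the JIT
        // (via ICorJitCompiler::compileMethod) for this method
        RuntimeHelpers.PrepareMethod(method.MethodHandle);

        // The handle now resolves to the JIT-compiled native code
        IntPtr nativeCode = method.MethodHandle.GetFunctionPointer();
        Console.WriteLine($"Native code for Add() is at 0x{nativeCode.ToInt64():X}");
    }
}
```

(PrepareMethod is normally used to pre-JIT code before taking a lock or entering a constrained execution region, but it's also a handy way to observe when compilation happens.)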
Let’s now look at these interfaces in more detail.
Firstly, we'll examine ICorJitCompiler, the interface exposed by the JIT. It's actually pretty straight-forward and only contains 7 methods:
CorJitResult __stdcall compileMethod (..)
void clearCache()
BOOL isCacheCleanupRequired()
void ProcessShutdownWork(ICorStaticInfo* info)
void getVersionIdentifier(..)
unsigned getMaxIntrinsicSIMDVectorLength(..)
void setRealJit(..)
Of these, the most interesting one is compileMethod(..), which has the following signature:
virtual CorJitResult __stdcall compileMethod (
ICorJitInfo *comp, /* IN */
struct CORINFO_METHOD_INFO *info, /* IN */
unsigned /* code:CorJitFlag */ flags, /* IN */
BYTE **nativeEntry, /* OUT */
ULONG *nativeSizeOfCode /* OUT */
) = 0;
The EE provides the JIT with information about the method it wants compiled (CORINFO_METHOD_INFO) as well as flags (CorJitFlag) which control:
- Level of optimisation
- Whether the code is compiled in Debug or Release mode
- If the code needs to be 'Profilable' or support 'Edit-and-Continue'
- Alignment of loops, i.e. should they be aligned on byte-boundaries
- If SSE3/SSE4 should be used
- and many other scenarios
The final parameter is a reference to the ICorJitInfo interface, which is covered in the next section.
The APIs that the EE has to implement to work with the JIT are not simple, there are almost 180 functions or callbacks!!
| Interface       | Method Count |
|-----------------|-------------:|
| ICorJitHost     | 5            |
| ICorJitInfo     | 19           |
| ICorDynamicInfo | 36           |
| ICorStaticInfo  | 118          |
| **Total**       | **178**      |
Note: The links take you to the function ‘definitions’ for a given interface. Alternatively all the methods are listed together in this gist.
ICorJitHost makes available 'functionality that would normally be provided by the operating system', predominantly the ability to allocate the 'pages' of memory that the JIT uses during compilation.

ICorJitInfo (class ICorJitInfo : public ICorDynamicInfo) contains more specific memory allocation routines, including ones for the 'GC Info' data, a 'method/funclet's unwind information', '.rdata and .pdata for a method' and the 'exception handler blocks'.

ICorDynamicInfo (class ICorDynamicInfo : public ICorStaticInfo) provides data that can change from 'invocation to invocation', i.e. the JIT cannot cache the results of these method calls. It includes functions that provide:
- Thread Local Storage (TLS) index
- Function Entry Point (address)
- EE ‘helper functions’
- Address of a Field
- Constructor for a delegate
- and much more
Finally, there is ICorStaticInfo, which is further sub-divided into more specific interfaces:
| Interface          | Method Count |
|--------------------|-------------:|
| ICorMethodInfo     | 28           |
| ICorModuleInfo     | 9            |
| ICorClassInfo      | 49           |
| ICorFieldInfo      | 7            |
| ICorDebugInfo      | 4            |
| ICorArgInfo        | 4            |
| ICorErrorInfo      | 7            |
| Diagnostic methods | 6            |
| General methods    | 2            |
| Misc methods       | 2            |
| **Total**          | **118**      |
Because the interface is nicely composed we can easily see what it provides. The bulk of the functions are concerned with information about a module, class, method or field. For instance the JIT can query the class size, GC layout and obtain the address of a field within a class. It can also learn about a method's signature, find its parent class and get 'exception handling' information (the full list of methods is available in this gist).
These interfaces and the methods they contain give a nice insight into what information the JIT requests from the runtime and therefore what knowledge it requires when compiling a single method.
Now, let’s look at the end-to-end flow of a couple of these methods and see where they are implemented in the CoreCLR source code.
EE ➜ JIT getFunctionEntryPoint(..)
First we’ll look at a method where the EE provides information to the JIT:
JIT ➜ EE reportInliningDecision()
Next we’ll look at a scenario where the data flows from the JIT back to the EE:
Finally, I just want to cover the ‘SuperPMI’ tool that showed up in the previous 2 scenarios. What is this tool and what does it do? From the CoreCLR glossary:
SuperPMI - JIT component test framework (super fast JIT testing - it mocks/replays EE in EE-JIT interface)
So in a nutshell it allows JIT development and testing to be de-coupled from the EE, which is useful because we’ve just seen that the 2 components are tightly integrated.
But how does it work? From the README:
SuperPMI works in two phases: collection and playback. In the collection phase, the system is configured to collect SuperPMI data. Then, run any set of .NET managed programs. When these managed programs invoke the JIT compiler, SuperPMI gathers and captures all information passed between the JIT and its .NET host. In the playback phase, SuperPMI loads the JIT directly, and causes it to compile all the functions that it previously compiled, but using the collected data to provide answers to various questions that the JIT needs to ask. The .NET execution engine (EE) is not invoked at all.
This explains why there is a SuperPMI implementation for every method that is part of the JIT EE interface. SuperPMI needs to ‘record’ or ‘collect’ each interaction with the EE and store the information so that it can be ‘played back’ at a later time, when the EE isn’t present.
Discuss this post on Hacker News or /r/dotnet
Further Reading
As always, if you’ve read this far, here’s some further information that you might find useful:
Thu, 5 Jul 2018, 12:00 am
Tools for Exploring .NET Internals
Whether you want to look at what your code is doing ‘under-the-hood’ or you’re trying to see what the ‘internals’ of the CLR look like, there is a whole range of tools that can help you out.
To give ‘credit where credit is due’, this post is based on a tweet, so thanks to everyone who contributed to the list and if I’ve missed out any tools, please let me know in the comments below.
While you’re here, I’ve also written other posts that look at the ‘internals’ of the .NET Runtime:
Honourable Mentions
Firstly I’ll start by mentioning that Visual Studio has a great debugger and so does VSCode. Also there are lots of very good (commercial) .NET Profilers and Application Monitoring Tools available that you should also take a look at. For example I’ve recently been playing around with Codetrack and I’m very impressed by what it can do!
However, the rest of the post is going to look at some more single-use tools that give an even deeper insight into what is going on. As an added bonus they're all 'open-source', so you can take a look at the code and see how they work!!
PerfView is simply an excellent tool and is the one that I’ve used most over the years. It uses ‘Event Tracing for Windows’ (ETW) Events to provide a deep insight into what the CLR is doing, as well as allowing you to profile Memory and CPU usage. It does have a fairly steep learning curve, but there are some nice tutorials to help you along the way and it’s absolutely worth the time and effort.
Also, if you need more proof of how useful it is, Microsoft Engineers themselves use it and many of the recent performance improvements in MSBuild were carried out after using PerfView to find the bottlenecks.
PerfView is built on-top of the Microsoft.Diagnostics.Tracing.TraceEvent library which you can use in your own tools. In addition, since it’s been open-sourced the community has contributed and it has gained some really nice features, including flame-graphs:
(Click for larger version)
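As a taste of what the underlying TraceEvent library lets you do, here's a hedged sketch of a real-time ETW session that prints a line for every GC in any .NET process on the machine (it needs the Microsoft.Diagnostics.Tracing.TraceEvent NuGet package and must be run elevated, on Windows):

```csharp
using System;
using Microsoft.Diagnostics.Tracing;          // NuGet: Microsoft.Diagnostics.Tracing.TraceEvent
using Microsoft.Diagnostics.Tracing.Parsers;
using Microsoft.Diagnostics.Tracing.Session;

class EtwGcWatcher
{
    static void Main()
    {
        // Real-time ETW sessions require admin rights; the session name is arbitrary
        using (var session = new TraceEventSession("SampleGcSession"))
        {
            // Subscribe to the CLR provider's GC-related events
            session.EnableProvider(ClrTraceEventParser.ProviderName,
                                   TraceEventLevel.Informational,
                                   (ulong)ClrTraceEventParser.Keywords.GC);

            // Print a line every time any .NET process starts a garbage collection
            session.Source.Clr.GCStart += gc =>
                Console.WriteLine($"GC #{gc.Count} (gen {gc.Depth}) in process {gc.ProcessID}");

            session.Source.Process(); // blocks, pumping events until the session is disposed
        }
    }
}
```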
SharpLab started out as a tool for inspecting the IL code emitted by the Roslyn compiler, but has now grown into much more:
SharpLab is a .NET code playground that shows intermediate steps and results of code compilation.
Some language features are thin wrappers on top of other features – e.g. using() becomes try/catch.
SharpLab allows you to see the code as compiler sees it, and get a better understanding of .NET languages.
It supports C#, Visual Basic and F#, but most impressive are the 'Decompilation/Disassembly' features:
There are currently four targets for decompilation/disassembly:
- C#
- Visual Basic
- IL
- JIT Asm (Native Asm Code)
That’s right, it will output the assembly code that the .NET JIT generates from your C#:
This tool gives you an insight into the memory layout of your .NET objects, i.e. it will show you how the JITter has decided to arrange the fields within your class or struct. This can be useful when writing high-performance code and it's helpful to have a tool that does it for us because doing it manually is tricky:
There is no official documentation about fields layout because the CLR authors reserved the right to change it in the future. But knowledge about the layout can be helpful if you’re curious or if you’re working on a performance critical application.
How can we inspect the layout? We can look at a raw memory in Visual Studio or use !dumpobj
command in SOS Debugging Extension. These approaches are tedious and boring, so we’ll try to write a tool that will print an object layout at runtime.
From the example in the GitHub repo, if you use TypeLayout.Print() with code like this:
public struct NotAlignedStruct
{
public byte m_byte1;
public int m_int;
public byte m_byte2;
public short m_short;
}
You'll get the following output, showing exactly how the CLR will lay out the struct in memory, based on its padding and optimization rules.
Size: 12. Paddings: 4 (%33 of empty space)
|================================|
| 0: Byte m_byte1 (1 byte) |
|--------------------------------|
| 1-3: padding (3 bytes) |
|--------------------------------|
| 4-7: Int32 m_int (4 bytes) |
|--------------------------------|
| 8: Byte m_byte2 (1 byte) |
|--------------------------------|
| 9: padding (1 byte) |
|--------------------------------|
| 10-11: Int16 m_short (2 bytes) |
|================================|
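You can sanity-check the overall size (though not the per-field offsets) without any library, using C#'s sizeof operator in an unsafe context; a quick sketch, assuming the project is compiled with unsafe code enabled:

```csharp
using System;

public struct NotAlignedStruct
{
    public byte m_byte1;
    public int m_int;
    public byte m_byte2;
    public short m_short;
}

class Program
{
    static unsafe void Main()
    {
        // sizeof on a struct in an unsafe context reports the managed size,
        // including the padding inserted between the fields
        Console.WriteLine($"Size of NotAlignedStruct: {sizeof(NotAlignedStruct)} bytes");
        // With the default sequential layout this is 12:
        // 1 byte + 3 padding + 4 (int) + 1 byte + 1 padding + 2 (short)
    }
}
```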
TUNE is a really intriguing tool; as it says on the GitHub page, its purpose is to help you
… learn .NET internals and performance tuning by experiments with C# code.
You can find out more information about what it does in this blog post, but at a high-level it works like this:
- write a sample, valid C# script which contains at least one class with public method taking a single string parameter. It will be executed by hitting Run button. This script can contain as many additional methods and classes as you wish. Just remember that first public method from the first public class will be executed (with single parameter taken from the input box below the script). …
- after clicking Run button, the script will be compiled and executed. Additionally, it will be decompiled both to IL (Intermediate Language) and assembly code in the corresponding tabs.
- all the time Tune is running (including time during script execution) a graph with GC data is being drawn. It shows information about generation sizes and GC occurrences (illustrated as vertical lines with the number below indicating which generation has been triggered).
And looks like this:
(Click for larger version)
Finally, we’re going to look at a particular category of tools. Since .NET came out you’ve always been able to use WinDBG and the SOS Debugging Extension to get deep into the .NET runtime. However it’s not always the easiest tool to get started with and as this tweet says, it’s not always the most productive way to do things:
Besides how complex it is, the idea is to build better abstractions. Raw debugging at the low level is just usually too unproductive. That to me is the promise of ClrMD, that it lets us build specific extensions to extract quickly the right info
— Tomas Restrepo (@tomasrestrepo)
March 14, 2018
Fortunately Microsoft made the ClrMD library available (a.k.a Microsoft.Diagnostics.Runtime), so now anyone can write a tool that analyses memory dumps of .NET programs. You can find out even more info in the official blog post and I also recommend taking a look at ClrMD.Extensions that “.. provide integration with LINQPad and to make ClrMD even more easy to use”.
I wanted to pull together a list of all the existing tools, so I enlisted twitter to help. Note to self: careful what you tweet, the WinDBG Product Manager might read your tweets and get a bit upset!!
Well this just hurts my feelings :(
— Andy Luhrs (@aluhrs13)
March 14, 2018
Most of these tools are based on ClrMD because it’s the easiest way to do things, however you can use the underlying COM interfaces directly if you want. Also, it’s worth pointing out that any tool based on ClrMD is not cross-platform, because ClrMD itself is Windows-only. For cross-platform options see Analyzing a .NET Core Core Dump on Linux
Finally, in the interest of balance, there have been lots of recent improvements to WinDBG and because it’s extensible there have been various efforts to add functionality to it:
Having said all that, onto the list:
- SuperDump (GitHub)
- msos (GitHub)
- Command-line environment a-la WinDbg for executing SOS commands without having SOS available.
- MemoScope.Net (GitHub)
- A tool to analyze .NET process memory. It can dump an application’s memory to a file and read it later.
- The dump file contains all data (objects) and threads (state, stack, call stack). MemoScope.Net will analyze the data and help you to find memory leaks and deadlocks
- dnSpy (GitHub)
- .NET debugger and assembly editor
- You can use it to edit and debug assemblies even if you don’t have any source code available!!
- MemAnalyzer (GitHub)
- A command line memory analysis tool for managed code.
- Can show which objects use most space on the managed heap just like `!DumpHeap` from Windbg without the need to install and attach a debugger.
- DumpMiner (GitHub)
- UI tool for playing with ClrMD, with more features coming soon
- Trace CLI (GitHub)
- A production debugging and tracing tool
- Shed (GitHub)
- Shed is an application that allows you to inspect the .NET runtime of a program in order to extract useful information. It can be used to inspect malicious applications to get a first, general overview of the information stored once the malware is executed. Shed is able to:
- Extract all objects stored in the managed heap
- Print strings stored in memory
- Save the snapshot of the heap in a JSON format for post-processing
- Dump all modules that are loaded in memory
You can also find many other tools that make use of ClrMD; it was a very good move by Microsoft to make it available.
A few other tools that are also worth mentioning:
- DebugDiag
- The DebugDiag tool is designed to assist in troubleshooting issues such as hangs, slow performance, memory leaks or memory fragmentation, and crashes in any user-mode process (now with ‘CLRMD Integration’)
- SOSEX (might not be developed any more)
- … a debugging extension for managed code that begins to alleviate some of my frustrations with SOS
- VMMap from Sysinternals
Discuss this post on Hacker News or /r/programming
Fri, 15 Jun 2018, 12:00 am
CoreRT - A .NET Runtime for AOT
Firstly, what exactly is CoreRT? From its GitHub repo:
.. a .NET Core runtime optimized for AOT (ahead of time compilation) scenarios, with the accompanying .NET native compiler toolchain
The rest of this post will look at what that actually means.
Contents
- Existing .NET ‘AOT’ Implementations
- High-Level Overview
- The Compiler
- The Runtime
- ‘Hello World’ Program
- Limitations
- Further Reading
Existing .NET ‘AOT’ Implementations
However, before we look at what CoreRT is, it’s worth pointing out there are existing .NET ‘Ahead-of-Time’ (AOT) implementations that have been around for a while:
Mono
.NET Native (Windows 10/UWP apps only, a.k.a ‘Project N’)
So if there were existing implementations, why was CoreRT created? The official announcement gives us some idea:
If we want to shortcut this two-step compilation process and deliver a 100% native application on Windows, Mac, and Linux, we need an alternative to the CLR. The project that is aiming to deliver that solution with an ahead-of-time compilation process is called CoreRT.
The main difference is that CoreRT is designed to support .NET Core scenarios, i.e. .NET Standard, cross-platform, etc.
Also worth pointing out is that whilst .NET Native is a separate product, they are related and in fact “.NET Native shares many CoreRT parts”.
High-Level Overview
Because all the code is open source, we can very easily identify the main components and understand where the complexity is. Firstly, let’s look at where the most ‘lines of code’ are:
We clearly see that the majority of the code is written in C#, with only the Native component written in C++. The largest single component is System.Private.CoreLib which is all C# code, although there are other sub-components that contribute to it (‘System.Private.XXX’), such as System.Private.Interop (36,547 LOC), System.Private.TypeLoader (30,777) and System.Private.Reflection.Core (24,964). Other significant components are the ‘Intermediate Language (IL) Compiler’ and the Common code that is re-used by everything else.
All these components are discussed in more detail below.
The Compiler
So whilst CoreRT is a run-time, it also needs a compiler to put everything together, from Intro to .NET Native and CoreRT:
.NET Native is a native toolchain that compiles CIL byte code to machine code (e.g. X64 instructions). By default, .NET Native (for .NET Core, as opposed to UWP) uses RyuJIT as an ahead-of-time (AOT) compiler, the same one that CoreCLR uses as a just-in-time (JIT) compiler. It can also be used with other compilers, such as LLILC, UTC for UWP apps and IL to CPP (an IL to textual C++ compiler we have built as a reference prototype).
But what does this actually look like in practice, as they say ‘a picture paints a thousand words’:
(Click for larger version)
To give more detail, the main compilation phases (started from \ILCompiler\src\Program.cs) are the following:
- Calculate the reachable modules/types/classes, i.e. the ‘compilation roots’ using the ILScanner.cs
- Allow for reflection, via an optional rd.xml file and generate the necessary metadata using ILCompiler.MetadataWriter
- Compile the IL using the specific back-end (generic/shared code is in Compilation.cs)
- Finally, write out the compiled methods using ObjectWriter which in turn uses LLVM under-the-hood
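In practice you don’t invoke the ILCompiler phases directly; the official CoreRT samples drive the whole pipeline through `dotnet publish` after referencing the ILCompiler package. A sketch of that workflow (the exact package version and feed changed over the alpha releases, so treat them as assumptions):

```shell
dotnet new console -o HelloWorld
cd HelloWorld
# reference the AOT compiler toolchain (version is illustrative)
dotnet add package Microsoft.DotNet.ILCompiler --version 1.0.0-alpha-26529-02
# publishing for a concrete runtime identifier triggers the native compilation
dotnet publish -r win-x64 -c Release
```

The output of the publish step is a single, self-contained native executable rather than an IL assembly plus a runtime.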
But it’s not just your code that ends up in the final .exe, along the way the CoreRT compiler also generates several ‘helper methods’ to cover the following scenarios:
Fortunately the compiler doesn’t blindly include all the code it finds; it is intelligent enough to only include code that’s actually used:
We don’t use ILLinker, but everything gets naturally treeshaken by the compiler itself (we start with compiling `Main`/`NativeCallable` exports and continue compiling other methods and generating necessary data structures as we go). If there’s a type or method that is not used, the compiler doesn’t even look at it.
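The ‘compile only what’s reachable’ approach the quote describes is essentially a worklist algorithm over the call graph. A toy sketch of the idea (the types and call graph here are hypothetical, not the real ILCompiler code):

```csharp
using System;
using System.Collections.Generic;

class TreeShaker
{
    // a hypothetical call graph: each method maps to the methods it calls
    static readonly Dictionary<string, string[]> CallGraph = new Dictionary<string, string[]>
    {
        ["Main"] = new[] { "WriteLine", "ReadLine" },
        ["WriteLine"] = new[] { "get_Out" },
        ["ReadLine"] = new string[0],
        ["get_Out"] = new string[0],
        ["NeverCalled"] = new[] { "AlsoNeverCalled" },
        ["AlsoNeverCalled"] = new string[0],
    };

    static void Main()
    {
        // start from the compilation roots (Main / NativeCallable exports)
        // and keep 'compiling' methods as they are discovered
        var compiled = new HashSet<string>();
        var worklist = new Queue<string>();
        worklist.Enqueue("Main");

        while (worklist.Count > 0)
        {
            string method = worklist.Dequeue();
            if (!compiled.Add(method))
                continue; // already processed
            foreach (string callee in CallGraph[method])
                worklist.Enqueue(callee);
        }

        // 'NeverCalled' and 'AlsoNeverCalled' are never even looked at
        Console.WriteLine(string.Join(", ", compiled));
    }
}
```

Anything never enqueued is never examined, which is why unused types and methods add nothing to the final executable.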
The Runtime
All the user/helper code then sits on-top of the CoreRT runtime, from Intro to .NET Native and CoreRT:
CoreRT is the .NET Core runtime that is optimized for AOT scenarios, which .NET Native targets. This is a refactored and layered runtime. The base is a small native execution engine that provides services such as garbage collection (GC). This is the same GC used in CoreCLR. Many other parts of the traditional .NET runtime, such as the type system, are implemented in C#. We’ve always wanted to implement runtime functionality in C#. We now have the infrastructure to do that. In addition, library implementations that were built deep into CoreCLR, have also been cleanly refactored and implemented as C# libraries.
This last point is interesting: why is it advantageous to implement ‘runtime functionality in C#’? Well, it turns out that it’s hard to do in an un-managed language because there are some very subtle and hard-to-track-down ways that you can get it wrong:
Reliability and performance. The C/C++ code has to be manually managed. It means that one has to be very careful to report all GC references to the GC. The manually managed code is both very hard to get right and it has performance overhead.
— Jan Kotas (@JanKotas7)
April 24, 2018
These are known as ‘GC Holes’ and the BOTR (‘Book of the Runtime’) provides more detail on them. The author of that tweet is significant: Jan Kotas has worked on the .NET runtime for a long time, so if he thinks something is hard, it really is!!
Runtime Components
As previously mentioned it’s a layered runtime, i.e made up of several, distinct components, as explained in this comment:
At the core of CoreRT, there’s a runtime that provides basic services for the code to run (think: garbage collection, exception handling, stack walking). This runtime is pretty small and mostly depends on C/C++ runtime (even the C++ runtime dependency is not a hard requirement as Jan pointed out - #3564). This code mostly lives in src/Native/Runtime, src/Native/gc, and src/Runtime.Base. It’s structured so that the places that do require interacting with the underlying platform (allocating native memory, threading, etc.) go through a platform abstraction layer (PAL). We have a PAL for Windows, Linux, and macOS, but others can be added.
And you can see the PAL Components in the following locations:
C# Code shared with CoreCLR
One interesting aspect of the CoreRT runtime is that wherever possible it shares code with the CoreCLR runtime; this is part of a larger effort to ensure that code is shared across multiple repositories:
This directory contains the shared sources for System.Private.CoreLib. These are shared between dotnet/corert, dotnet/coreclr and dotnet/corefx.
The sources are synchronized with a mirroring tool that watches for new commits on either side and creates new pull requests (as @dotnet-bot) in the other repository.
Recently there has been a significant amount of work done to move more and more code over into the ‘shared partition’, to ensure work isn’t duplicated and any fixes are shared across both locations. You can see how this works by looking at the links below:
What this means is that about 2/3 of the C# code in `System.Private.CoreLib` is shared with CoreCLR and only 1/3 is unique to CoreRT:
| Group | C# LOC (Files) |
|-------|----------------|
| shared | 170,106 (759) |
| src | 96,733 (351) |
| Total | 266,839 (1,110) |
Native Code
Finally, whilst it is advantageous to write as much code as possible in C#, there are certain components that have to be written in C++, these include the GC (the majority of which is one file, gc.cpp which is almost 37,000 LOC!!), the JIT Interface, ObjWriter (based on LLVM) and most significantly the Core Runtime that contains code for activities like:
- Threading
- Stack Frame handling
- Debugging/Profiling
- Interfacing to the OS
- CPU specific helpers for:
- Exception handling
- GC Write Barriers
- Stubs/Thunks
- Optimised object allocation
‘Hello World’ Program
One of the first things people asked about CoreRT is “what is the size of a ‘Hello World’ app” and the answer is ~3.93 MB (if you compile in Release mode), but there is work being done to reduce this. At a ‘high-level’, the .exe that is produced looks like this:
Note the different colours correspond to the original format of a component, obviously the output is a single, native, executable file.
This file comes with a full .NET specific ‘base runtime’ or ‘class libraries’ (‘System.Private.XXX’) so you get a lot of functionality, it is not the absolute bare-minimum app. Fortunately there is a way to see what a ‘bare-minimum’ runtime would look like by compiling against the Test.CoreLib project included in the CoreRT source. By using this you end up with an .exe that looks like this:
But it’s so minimal that OOTB you can’t even write ‘Hello World’ to the console, as there is no `System.Console` type! After a bit of hacking I was able to build a version that did have a working `Console` output (if you’re interested, the diff is available here). To make it work I had to include the following components:
So `Test.CoreLib` really is a minimal runtime!! But the difference in size is dramatic, it shrinks down to 0.49 MB compared to 3.93 MB for the fully-featured runtime!
| Type | Standard (bytes) | Test.CoreLib (bytes) | Difference |
|------|------------------|----------------------|------------|
| .data | 163,840 | 36,864 | -126,976 |
| .managed | 1,540,096 | 65,536 | -1,474,560 |
| .pdata | 147,456 | 20,480 | -126,976 |
| .rdata | 1,712,128 | 81,920 | -1,630,208 |
| .reloc | 98,304 | 4,096 | -94,208 |
| .text | 360,448 | 299,008 | -61,440 |
| rdata | 98,304 | 4,096 | -94,208 |
| Total (bytes) | 4,120,576 | 512,000 | -3,608,576 |
| Total (MB) | 3.93 | 0.49 | -3.44 |
These data sizes were obtained by using the Microsoft DUMPBIN tool and the `/DISASM` cmd line switch (zip file of the full output), which produces the following summary (note: size values are in HEX):
Summary
28000 .data
178000 .managed
24000 .pdata
1A2000 .rdata
18000 .reloc
58000 .text
18000 rdata
Also contained in the output is the assembly code for a simple `Hello World` method:
HelloWorld_HelloWorld_Program__Main:
0000000140004C50: 48 8D 0D 19 94 37 lea rcx,[__Str_Hello_World__E63BA1FD6D43904697343A373ECFB93457121E4B2C51AF97278C431E8EC85545]
00
0000000140004C57: 48 8D 05 DA C5 00 lea rax,[System_Console_System_Console__WriteLine_12]
00
0000000140004C5E: 48 FF E0 jmp rax
0000000140004C61: 90 nop
0000000140004C62: 90 nop
0000000140004C63: 90 nop
and if we dig further we can see the code for `System.Console.WriteLine(..)`:
System_Console_System_Console__WriteLine_12:
0000000140011238: 56 push rsi
0000000140011239: 48 83 EC 20 sub rsp,20h
000000014001123D: 48 8B F1 mov rsi,rcx
0000000140011240: E8 33 AD FF FF call System_Console_System_Console__get_Out
0000000140011245: 48 8B C8 mov rcx,rax
0000000140011248: 48 8B D6 mov rdx,rsi
000000014001124B: 48 8B 00 mov rax,qword ptr [rax]
000000014001124E: 48 8B 40 68 mov rax,qword ptr [rax+68h]
0000000140011252: 48 83 C4 20 add rsp,20h
0000000140011256: 5E pop rsi
0000000140011257: 48 FF E0 jmp rax
000000014001125A: 90 nop
000000014001125B: 90 nop
Limitations
Missing Functionality
There have been some people who’ve successfully run complex apps using CoreRT but, as it stands, CoreRT is still an alpha product, at least according to the NuGet package version (‘1.0.0-alpha-26529-02’) that the official samples instruct you to use; I’ve not seen any information about when a full 1.0 release will be available.
So there is some functionality that is not yet implemented, e.g. F# support, GC.GetMemoryInfo or canGetCookieForPInvokeCalliSig (a `calli` to a p/invoke). For more information on this I recommend the entertaining presentation Building Native Executables from .NET with CoreRT by Mark Rendle. In the 2nd half he chronicles all the issues that he ran into when trying to run an ASP.NET app under CoreRT (some of which may well be fixed now).
Reflection
But more fundamentally, because of the nature of AOT compilation, there are 2 main stumbling blocks that you may run into: Reflection and Runtime Code-Generation.
Firstly, if you want to use reflection in your code you need to tell the CoreRT compiler about the types you expect to reflect over, because by default it only includes the types it knows about. You can do this by using a file called `rd.xml` as shown here. Unfortunately this will always require manual intervention, for the reasons explained in this issue. More information is available in this comment ‘…some details about CoreRT’s restriction on MakeGenericType and MakeGenericMethod’.
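For illustration, a minimal `rd.xml` might look like the following (the assembly and type names are hypothetical; ‘Required All’ asks the compiler to keep full reflection metadata for the type):

```xml
<Directives xmlns="http://schemas.microsoft.com/netfx/2013/01/metadata">
  <Application>
    <Assembly Name="HelloWorld">
      <!-- keep reflection metadata for this type and all its members -->
      <Type Name="HelloWorld.MyReflectedType" Dynamic="Required All" />
    </Assembly>
  </Application>
</Directives>
```

Any type not rooted this way (or reachable from normal code) simply has no metadata in the final executable, so reflecting over it fails at runtime.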
To make reflection work the compiler adds the required metadata to the final .exe using this process:
This would reuse the same scheme we already have for the RyuJIT codegen path:
- The compiler generates a blob of bytes that describes the metadata (namespaces, types, their members, their custom attributes, method parameters, etc.). The data is generated as a byte array in the ComputeMetadata method.
- The metadata gets embedded as a data blob into the executable image. This is achieved by adding the blob to a “ready to run header”. Ready to run header is a well known data structure that can be located by the code in the framework at runtime.
- The ready to run header along with the blobs it refers to is emitted into the final executable.
- At runtime, pointer to the byte array is located using the RhFindBlob API, and a parser is constructed over the array, to be used by the reflection stack.
Runtime Code-Generation
In .NET you often use reflection once (because it can be slow), followed by ‘dynamic’ or ‘runtime’ code-generation with `Reflection.Emit(..)`. This technique is widely used in .NET libraries for Serialisation/Deserialisation, Dependency Injection, Object Mapping and ORMs.
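As a concrete illustration of the pattern, here is the kind of `Reflection.Emit` code such libraries generate at runtime; on CoreCLR the emitted IL is handed to the JIT, which is exactly the step an AOT-only runtime cannot perform:

```csharp
using System;
using System.Reflection.Emit;

class Program
{
    static void Main()
    {
        // build a method body at runtime: (int a, int b) => a + b
        var method = new DynamicMethod("Add", typeof(int),
                                       new[] { typeof(int), typeof(int) });
        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Ldarg_1);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Ret);

        // the freshly-emitted IL is JIT-compiled when the delegate is invoked -
        // there is no JIT in an AOT-only runtime, hence the incompatibility
        var add = (Func<int, int, int>)method.CreateDelegate(typeof(Func<int, int, int>));
        Console.WriteLine(add(2, 3)); // prints 5
    }
}
```

Libraries typically cache the resulting delegate, which is why this pattern is so common on performance-critical paths.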
The issue is that ‘runtime’ code generation is problematic in an ‘AOT’ scenario:
ASP.NET dependency injection introduced dependency on Reflection.Emit in aspnet/DependencyInjection#630 unfortunately. It makes it incompatible with CoreRT.
We can make it functional in a CoreRT AOT environment by introducing an IL interpreter (#5011), but it would still perform poorly. The dependency injection framework is using Reflection.Emit on performance critical paths.
It would be really up to ASP.NET to provide AOT-friendly flavor that generates all code at build time instead of runtime to make this work well. It would likely help the startup without CoreRT as well.
I’m sure this will be solved one way or the other (see #5011), but at the moment it’s still ‘work-in-progress’.
Discuss this post on HackerNews and /r/dotnet
Further Reading
If you’ve got this far, here’s some other links that you might be interested in:
Thu, 7 Jun 2018, 12:00 am
Taking a look at the ECMA-335 Standard for .NET
It turns out that the .NET Runtime has a technical standard (or specification), known by its full name ECMA-335 - Common Language Infrastructure (CLI) (not to be confused with ECMA-334 which is the ‘C# Language Specification’). The latest update is the 6th edition from June 2012.
The specification or standard was written before .NET Core existed, so it only applies to the .NET Framework; I’d be interested to know if there are any plans for an updated version.
The rest of this post will take a look at the standard, exploring the contents and investigating what we can learn from it (hint: lots of low-level details and information about .NET internals).
Why is it useful?
Having a standard means that different implementations, such as Mono and DotNetAnywhere can exist, from Common Language Runtime (CLR):
Compilers and tools are able to produce output that the common language runtime can consume because the type system, the format of metadata, and the runtime environment (the virtual execution system) are all defined by a public standard, the ECMA Common Language Infrastructure specification. For more information, see ECMA C# and Common Language Infrastructure Specifications.
and from the CoreCLR documentation on .NET Standards:
There was a very early realization by the founders of .NET that they were creating a new programming technology that had broad applicability across operating systems and CPU types and that advanced the state of the art of late 1990s (when the .NET project started at Microsoft) programming language implementation techniques. This led to considering and then pursuing standardization as an important pillar of establishing .NET in the industry.
The key addition to the state of the art was support for multiple programming languages with a single language runtime, hence the name Common Language Runtime. There were many other smaller additions, such as value types, a simple exception model and attributes. Generics and language integrated query were later added to that list.
Looking back, standardization was quite effective, leading to .NET having a strong presence on iOS and Android, with the Unity and Xamarin offerings, both of which use the Mono runtime. The same may end up being true for .NET on Linux.
The various .NET standards have been made meaningful by the collaboration of multiple companies and industry experts that have served on the working groups that have defined the standards. In addition (and most importantly), the .NET standards have been implemented by multiple commercial (ex: Unity IL2CPP, .NET Native) and open source (ex: Mono) implementors. The presence of multiple implementations proves the point of standardization.
As the last quote points out, the standard is not produced solely by Microsoft:
There is also a nice Wikipedia page that has some additional information.
What is in it?
At a high-level overview, the specification is divided into the following ‘partitions’:
- I: Concepts and Architecture
- A great introduction to the CLR itself, explaining many of the key concepts and components, as well as the rationale behind them
- II: Metadata Definition and Semantics
- An explanation of the format of .NET dll/exe files, the different sections within them and how they’re laid out in-memory
- III: CIL Instruction Set
- A complete list of all the Intermediate Language (IL) instructions that the CLR understands, along with a detailed description of what they do and how to use them
- IV: Profiles and Libraries
- Describes the various different ‘Base Class libraries’ that make-up the runtime and how they are grouped into ‘Profiles’
- V: Binary Formats (Debug Interchange Format)
- An overview of ‘Portable CILDB files’, which give a way for additional debugging information to be provided
- VI: Annexes
- Annex A - Introduction
- Annex B - Sample programs
- Annex C - CIL assembler implementation
- Annex D - Class library design guidelines
- Annex E - Portability considerations
- Annex F - Imprecise faults
- Annex G - Parallel library
But, working your way through the entire specification is a mammoth task, generally I find it useful to just search for a particular word or phrase and locate the parts I need that way. However if you do want to read through one section, I recommend ‘Partition I: Concepts and Architecture’, at just over 100 pages it is much easier to fully digest! This section is a very comprehensive overview of the key concepts and components contained within the CLR and well worth a read.
Also, I’m convinced that the authors of the spec wanted to help out any future readers, so to break things up they included lots of very helpful diagrams:
For more examples see:
On top of all that, they also dropped in some Comic Sans 😀, just to make it clear when the text is only ‘informative’:
How has it changed?
The spec has been through six editions and it’s interesting to look at the changes over time:
| Edition | Release Date | CLR Version | Significant Changes |
|---------|--------------|-------------|---------------------|
| 1st | December 2001 | 1.0 (February 2002) | N/A |
| 2nd | December 2002 | 1.1 (April 2003) | |
| 3rd | June 2005 | 2.0 (January 2006) | See below (link) |
| 4th | June 2006 | None, revision of 3rd edition | (link) |
| 5th | December 2010 | 4.0 (April 2010) | See below (link) |
| 6th | June 2012 | None, revision of 5th edition | (link) |
However, only 2 editions contained significant updates; they are explained in more detail below:

3rd Edition:
- Support for generic types and methods (see ‘How generics were added to .NET’)
- New IL instructions - `ldelem`, `stelem` and `unbox.any`
- Added the `constrained.`, `no.` and `readonly.` IL instruction prefixes
- Brand new ‘namespaces’ (with corresponding types) - `System.Collections.Generics`, `System.Threading.Parallel`
- New types added, including `Action`, `Nullable` and `ThreadStaticAttribute`
- Type-forwarding added

5th Edition:
- Semantics of ‘variance’ redefined, became a core feature
- Multiple types added or updated, including `System.Action`, `System.MulticastDelegate` and `System.WeakReference`
- `System.Math` and `System.Double` modified to better conform to IEEE
Microsoft Specific Implementation
Another interesting aspect to look at is the Microsoft specific implementation details and notes. The following links are to pdf documents that are modified versions of the 4th edition:
They all contain multiple occurrences of text like this ‘Implementation Specific (Microsoft)’:
Finally, if you want to find out more there’s a book available (affiliate link):
Fri, 6 Apr 2018, 12:00 am
Exploring the internals of the .NET Runtime
I recently appeared on Herding Code and Stackify ‘Developer Things’ podcasts and in both cases, the first question asked was ‘how do you figure out the internals of the .NET runtime’?
This post is an attempt to articulate that process, in the hope that it might be useful to others.
Here are my suggested steps:
- Decide what you want to investigate
- See if someone else has already figured it out (optional)
- Read the ‘Book of the Runtime’
- Build from the source
- Debugging
- Verify against .NET Framework (optional)
Note: As with all these types of lists, just because it worked for me doesn’t mean that it will for everyone. So, ‘your mileage may vary’.
Step One - Decide what you want to investigate
For me, this means working out what question I’m trying to answer, for example here are some previous posts I’ve written:
(it just goes to show, you don’t always need fancy titles!)
I put this as ‘Step 1’ because digging into .NET internals isn’t quick or easy work, some of my posts take weeks to research, so I need to have a motivation to keep me going, something to focus on. In addition, the CLR isn’t a small run-time, there’s a lot in there, so just blindly trying to find your way around it isn’t easy! That’s why having a specific focus helps, looking at one feature or section at a time is more manageable.
The very first post where I followed this approach was Strings and the CLR - a Special Relationship. I’d previously spent some time looking at the CoreCLR source and I knew a bit about how `Strings` in the CLR worked, but not all the details. During the research of that post I then found more and more areas of the CLR that I didn’t understand and the rest of my blog grew from there (delegates, arrays, fixed keyword, type loader, etc).
Aside: I think this is generally applicable, if you want to start blogging, but you don’t think you have enough ideas to sustain it, I’d recommend that you start somewhere and other ideas will follow.
Another tip is to look at HackerNews or /r/programming for posts about the ‘internals’ of other runtimes, e.g. Java, Ruby, Python, Go etc, then write the equivalent post about the CLR. One of my most popular posts A Hitchhikers Guide to the CoreCLR Source Code was clearly influenced by equivalent articles!
Finally, for more help with learning, ‘figuring things out’ and explaining them to others, I recommend that you read anything by Julia Evans. Start with Blogging principles I use and So you want to be a wizard (also available as a zine), then work your way through all the other posts related to blogging or writing.
I’ve been hugely influenced, for the better, by Julia’s approach to blogging.
Step Two - See if someone else has already figured it out (optional)
I put this in as ‘optional’, because it depends on your motivation. If you are trying to understand .NET internals for your own education, then feel free to write about whatever you want. If you are trying to do it to also help others, I’d recommend that you first see what’s already been written about the subject. If, once you’ve done that, you still think there is something new or different that you can add, then go ahead, but I try not to just re-hash what is already out there.
To see what’s already been written, you can start with Resources for Learning about .NET Internals or peruse the ‘Internals’ tag on this blog. Another really great resource is all the answers by Hans Passant on StackOverflow, he is prolific and amazingly knowledgeable, here’s some examples to get you started:
Step Three - Read the ‘Book of the Runtime’
You won’t get far in investigating .NET internals without coming across the ‘Book of the Runtime’ (BOTR) which is an invaluable resource, even Scott Hanselman agrees!
It was written by the .NET engineering team, for the .NET engineering team, as per this HackerNews comment:
Having worked for 7 years on the .NET runtime team, I can attest that the BOTR is the official reference. It was created as documentation for the engineering team, by the engineering team. And it was (supposed to be) kept up to date any time a new feature was added or changed.
However, just a word of warning, this means that it’s an in-depth, non-trivial document and hard to understand when you are first learning about a particular topic. Several of my blog posts have consisted of the following steps:
- Read the BOTR chapter on ‘Topic X’
- Understand about 5% of what I read
- Go away and learn more (read the source code, read other resources, etc)
- GOTO ‘Step 1’, understanding more this time!
Related to this, the source code itself is often as helpful as the BOTR due to the extensive comments, for example this one describing the rules for prestubs really helped me out. The downside of the source code comments is that they are a bit harder to find, whereas the BOTR is all in one place.
Step Four - Build from the source
However, at some point, just reading about the internals of the CLR isn’t enough, you actually need to ‘get your hands dirty’ and see it in action. Now that the CoreCLR is open source it’s very easy to build it yourself and then once you’ve done that, there are even more docs to help you out if you are building on different OSes, want to debug, test CoreCLR in conjunction with CoreFX, etc.
But why is building from source useful?
Because it lets you build a Debug/Diagnostic version of the runtime that gives you lots of additional information that isn’t available in the Release/Retail builds. For instance you can view JIT Dumps using `COMPlus_JitDump=...`; however, this is just one of many `COMPlus_XXX` settings you can use, there are 100s available.
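For example, assuming a locally-built Debug runtime, a session to dump the JIT’s output for a single method might look like this (Windows cmd syntax, and the method name filter is illustrative):

```bat
REM only dump the JIT's work for methods named 'Main'
SET COMPlus_JitDump=Main
corerun HelloWorld.dll
```

The dump shows the IL, the JIT’s internal trees and the final machine code for the matching method(s).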
However, even more useful is the ability to turn on diagnostic logging for a particular area of the CLR. For instance, let’s imagine that we want to find out more about `AppDomains` and how they work under-the-hood; we can use the following logging configuration settings:
SET COMPLUS_LogEnable=1
SET COMPLUS_LogToFile=1
SET COMPLUS_LogFacility=02000000
SET COMPLUS_LogLevel=A
Where `LogFacility` is set to `LF_APPDOMAIN`; there are many other values you can provide as a HEX bit-mask and the full list is available in the source code. If you set these variables and then run an app, you will get a log output like this one. Once you have the log you can very easily search around in the code to find where the messages came from; for instance, here are all the places that `LF_APPDOMAIN` is logged. This is a great technique to find your way into a section of the CLR that you aren’t familiar with, and I’ve used it many times to great effect.
Step Five - Debugging
For me, the biggest boon of Microsoft open-sourcing .NET is that you can discover so much more about the internals without having to resort to ‘old school’ debugging with WinDBG. But there still comes a time when it’s useful to step through the code line-by-line to see what’s going on. The added advantage of having the source code is that you can build a copy locally and then debug through that using Visual Studio, which is slightly easier than WinDBG.
I always leave debugging to last, as it can be time-consuming and I only find it helpful when I already know where to set a breakpoint, i.e. I already know which part of the code I want to step through. I once tried to blindly step through the source of the CLR whilst it was starting up and it was very hard to see what was going on, as I’ve said before the CLR is a complex runtime, there are many things happening, so stepping through lots of code, line-by-line can get tricky.
Step Six - Verify against .NET Framework
I put this final step in because the .NET CLR source available on GitHub is the ‘.NET Core’ version of the runtime, which isn’t the same as the full/desktop .NET Framework that’s been around for years. So you may need to verify the behavior matches, if you want to understand the internals ‘as they were’, not just ‘as they will be’ going forward. For instance .NET Core has removed the ability to create App Domains as a way to provide isolation but interestingly enough the internal class lives on!
To verify the behaviour, your main option is to debug the CLR using WinDBG. Beyond that, you can resort to looking at the ‘Rotor’ source code (roughly the same as .NET Framework 2.0), or petition Microsoft to release the .NET Framework source code (probably not going to happen)!
However, low-level internals don’t change all that often, so more often than not the way things behave in the CoreCLR is the same as they’ve always worked.
Resources
Finally, for your viewing pleasure, here are a few talks related to ‘.NET Internals’:
Discuss this post on /r/programming or /r/dotnet
Fri, 23 Mar 2018, 12:00 am
How generics were added to .NET
Discuss this post on HackerNews and /r/programming
Before we dive into the technical details, let’s start with a quick history lesson, courtesy of Don Syme who worked on adding generics to .NET and then went on to design and implement F#, which is a pretty impressive set of achievements!!
Background and History
- 1999 Initial research, design and planning
- 1999 First ‘white paper’ published
- 2001 C# Language Design Specification created
- 2001 Research paper published
- 2004 Work completed and all bugs fixed
Update: Don Syme pointed out another research paper related to .NET generics, Combining Generics, Precompilation and Sharing Between Software Based Processes (pdf)
To give you an idea of how these events fit into the bigger picture, here are the dates of .NET Framework releases, up to 2.0, which was the first version to have generics:

| Version number | CLR version | Release date |
|----------------|-------------|--------------|
| 1.0            | 1.0         | 2002-02-13   |
| 1.1            | 1.1         | 2003-04-24   |
| 2.0            | 2.0         | 2005-11-07   |
Aside from the historical perspective, what I find most fascinating is just how much the addition of generics in .NET was due to the work done by Microsoft Research, from .NET/C# Generics History:
It was only through the total dedication of Microsoft Research, Cambridge during 1998-2004, to doing a complete, high quality implementation in both the CLR (including NGEN, debugging, JIT, AppDomains, concurrent loading and many other aspects), and the C# compiler, that the project proceeded.
He then goes on to say:
What would the cost of inaction have been? What would the cost of failure have been? No generics in C# 2.0? No LINQ in C# 3.0? No TPL in C# 4.0? No Async in C# 5.0? No F#? Ultimately, an erasure model of generics would have been adopted, as for Java, since the CLR team would never have pursued a in-the-VM generics design without external help.
Wow, C# and .NET would look very different without all these features!!
The ‘Gyro’ Project - Generics for Rotor
Unfortunately there isn’t a publicly accessible version of the .NET 1.0 and 2.0 source code, so we can’t go back and look at the changes that were made (if I’m wrong, please let me know as I’d love to read it).
However, we do have the next best thing, the ‘Gyro’ project in which the equivalent changes were made to the ‘Shared Source Common Language Implementation’ (SSCLI) code base (a.k.a ‘Rotor’). As an aside, if you want to learn more about the Rotor code base I really recommend the excellent book by Ted Neward, which you can download from his blog.
Gyro 1.0 was released in 2003, which implies that it was created after the work had been done in the real .NET Framework source code; I assume that Microsoft Research wanted to publish the ‘Rotor’ implementation so it could be studied more widely. Gyro is also referenced in one of Don Syme’s posts, from Some History: 2001 “GC#” research project draft, from the MSR Cambridge team:
With Dave Berry’s help we later published a version of the corresponding code as the “Gyro” variant of the “Rotor” CLI implementation.
The rest of this post will look at how generics were implemented in the Rotor source code.
Note: There are some significant differences between the Rotor source code and the real .NET framework. Most notably the JIT and GC are completely different implementations (due to licensing issues, listen to DotNetRocks show 360 - Ted Neward and Joel Pobar on Rotor 2.0 for more info). However, the Rotor source does give us an accurate idea about how other core parts of the CLR are implemented, such as the Type-System, Debugger, AppDomains and the VM itself. It’s interesting to compare the Rotor source with the current CoreCLR source and see how much of the source code layout and class names have remained the same.
Implementation
To make things easier for anyone who wants to follow-along, I created a GitHub repo that contains the Rotor code for .NET 1.0 and then checked in the Gyro source code on top, which means that you can see all the changes in one place:
The first thing you notice in the Gyro source is that all the files contain this particular piece of legalese:
; By using this software in any fashion, you are agreeing to be bound by the
; terms of this license.
;
+; This file contains modifications of the base SSCLI software to support generic
+; type definitions and generic methods. These modifications are for research
+; purposes. They do not commit Microsoft to the future support of these or
+; any similar changes to the SSCLI or the .NET product. -- 31st October, 2002.
+;
; You must not remove this notice, or any other, from this software.
It’s funny that they needed to add the line ‘They do not commit Microsoft to the future support of these or any similar changes to the SSCLI or the .NET product’, even though they were just a few months away from doing just that!!
Components (Directories) with the most changes
To see where the work was done, let’s start with a high-level view, showing the directories with a significant amount of changes (> 1% of the total changes):
$ git diff --dirstat=lines,1 464bf98 2714cca
0.1% bcl/
14.4% csharp/csharp/sccomp/
9.1% debug/di/
11.9% debug/ee/
2.1% debug/inc/
1.9% debug/shell/
2.5% fjit/
21.1% ilasm/
1.5% ildasm/
1.2% inc/
1.4% md/compiler/
29.9% vm/
Note: fjit is the “Fast JIT” compiler, i.e. the version released with Rotor, which was significantly different to the one available in the full .NET Framework.
The full output from git diff --dirstat=lines,0 is available here and the output from git diff --stat is here.
0.1% bcl/ is included only to show that very few C# code changes were needed; these were mostly plumbing code to expose the underlying C++ methods, plus changes to the various ToString() methods to include generic type information, e.g. ‘Class[int,double]’. However there are 2 more significant ones:
- bcl/system/reflection/emit/opcodes.cs (diff)
- bcl/system/reflection/emit/signaturehelper.cs (diff) - Adds the ability to parse method metadata that contains generic-related information, such as methods with generic parameters.
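As an aside, the end result of those ToString() changes survives into modern .NET, where generic type information is included whenever a constructed type is printed. A quick illustration using today’s APIs (not the Rotor-era code):

```csharp
using System;
using System.Collections.Generic;

public static class ToStringDemo
{
    public static void Main()
    {
        // The arity (`2) and the type arguments are part of the name:
        // "System.Collections.Generic.Dictionary`2[System.String,System.Int32]"
        Console.WriteLine(typeof(Dictionary<string, int>));
    }
}
```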
Files with the most changes
Next, we’ll take a look at the specific classes/files that had the most changes, as this gives us a really good idea about where the complexity was.
| Added | Deleted | Total Changes | File (click to go directly to the diff)  |
|-------|---------|---------------|-------------------------------------------|
| 1794  | 323     | 1471          | debug/di/module.cpp                        |
| 1418  | 337     | 1081          | vm/class.cpp                               |
| 1335  | 308     | 1027          | vm/jitinterface.cpp                        |
| 1616  | 888     | 728           | debug/ee/debugger.cpp                      |
| 741   | 46      | 695           | csharp/csharp/sccomp/symmgr.cpp            |
| 693   | 0       | 693           | vm/genmeth.cpp                             |
| 999   | 362     | 637           | csharp/csharp/sccomp/clsdrec.cpp           |
| 926   | 321     | 605           | csharp/csharp/sccomp/fncbind.cpp           |
| 559   | 0       | 559           | vm/typeparse.cpp                           |
| 605   | 156     | 449           | vm/siginfo.cpp                             |
| 417   | 29      | 388           | vm/method.hpp                              |
| 642   | 255     | 387           | fjit/fjit.cpp                              |
| 379   | 0       | 379           | vm/jitinterfacegen.cpp                     |
| 3045  | 2672    | 373           | ilasm/parseasm.cpp                         |
| 465   | 94      | 371           | vm/class.h                                 |
| 515   | 163     | 352           | debug/inc/cordb.h                          |
| 339   | 0       | 339           | vm/generics.cpp                            |
| 733   | 418     | 315           | csharp/csharp/sccomp/parser.cpp            |
| 471   | 169     | 302           | debug/shell/dshell.cpp                     |
| 382   | 88      | 294           | csharp/csharp/sccomp/import.cpp            |
Components of the Runtime
Now we’ll look at individual components in more detail so we can get an idea of how different parts of the runtime had to change to accommodate generics.
Type System changes
Not surprisingly, the bulk of the changes are in the Virtual Machine (VM) component of the CLR and relate to the ‘Type System’. Obviously adding ‘parameterised types’ to a type system that didn’t already have them requires wide-ranging and significant changes, which are shown in the list below:
- vm/class.cpp (diff) - Allows the type system to distinguish between open and closed generic types and provides APIs for working with them, such as IsGenericVariable() and GetGenericTypeDefinition()
- vm/genmeth.cpp (diff) - Contains the bulk of the functionality to make ‘generic methods’ possible, i.e. MyMethod<T, U>(T item, U filter), including the work done to enable ‘shared instantiation’ of generic methods
- vm/typeparse.cpp (diff) - Changes needed to allow generic types to be looked up by name, i.e. ‘MyClass[System.Int32]’
- vm/siginfo.cpp (diff) - Adds the ability to work with ‘generic-related’ method signatures
- vm/method.hpp (diff) and vm/method.cpp (diff) - Provide the runtime with generics-related methods such as IsGenericMethodDefinition(), GetNumGenericMethodArgs() and GetNumGenericClassArgs()
- vm/generics.cpp (diff) - All the completely new ‘generics’-specific code is in here, mostly related to ‘shared instantiation’, which is explained below
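The open/closed distinction these changes introduced is easy to see from today’s reflection APIs. A small sketch using the modern public surface (the Rotor-era internals used slightly different names):

```csharp
using System;
using System.Collections.Generic;

public static class OpenClosedDemo
{
    public static void Main()
    {
        Type open = typeof(Dictionary<,>);             // open generic type
        Type closed = typeof(Dictionary<string, int>); // closed (fully instantiated)

        Console.WriteLine(open.IsGenericTypeDefinition);              // True
        Console.WriteLine(closed.IsGenericTypeDefinition);            // False
        Console.WriteLine(closed.GetGenericTypeDefinition() == open); // True

        // The 'arguments' of the open type are still type variables
        Console.WriteLine(open.GetGenericArguments()[0].IsGenericParameter); // True
    }
}
```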
The main place that the implementation of generics in the CLR differs from the JVM is that they are ‘fully reified’ instead of using ‘type erasure’; this was possible because the CLR designers were willing to break backwards compatibility, whereas the JVM had been around longer, so I assume that this was a much less appealing option. For more discussion on this issue see Erasure vs reification and Reified Generics for Java. Update: this HackerNews discussion is also worth a read.
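The practical difference is visible from C# itself: because generics are reified, each instantiation is a distinct runtime type and the type arguments are fully recoverable via reflection, whereas under erasure both would collapse to the same raw type:

```csharp
using System;
using System.Collections.Generic;

public static class ReifiedDemo
{
    public static void Main()
    {
        // Distinct runtime types, unlike Java where both erase to a raw List
        Console.WriteLine(typeof(List<int>) == typeof(List<string>)); // False

        // The type argument is preserved at runtime
        Console.WriteLine(typeof(List<int>).GetGenericArguments()[0]); // System.Int32
    }
}
```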
The specific changes made to the .NET Intermediate Language (IL) op-codes can be seen in inc/opcode.def (diff); in essence the following 3 instructions were added
In addition the IL Assembler tool (ILASM) needed significant changes, as did its counterpart the IL Disassembler (ILDASM), so they could handle the additional instructions.
There is also a whole section titled ‘Support for Polymorphism in IL’ that explains these changes in greater detail in Design and Implementation of Generics for the .NET Common Language Runtime
Shared Instantiations
From Design and Implementation of Generics for the .NET Common Language Runtime
Two instantiations are compatible if for any parameterized class its
compilation at these instantiations gives rise to identical code and
other execution structures (e.g. field layout and GC tables), apart
from the dictionaries described below in Section 4.4. In particular,
all reference types are compatible with each other, because the
loader and JIT compiler make no distinction for the purposes of
field layout or code generation. On the implementation for the Intel
x86, at least, primitive types are mutually incompatible, even
if they have the same size (floats and ints have different parameter
passing conventions). That leaves user-defined struct types, which
are compatible if their layout is the same with respect to garbage
collection i.e. they share the same pattern of traced pointers
From a comment with more info:
// For an generic type instance return the representative within the class of
// all type handles that share code. For example,
// --> ,
// --> ,
// --> ,
// --> ,
// -->
//
// If the code for the type handle is not shared then return
// the type handle itself.
In addition, this comment explains the work that needs to take place to allow shared instantiations when working with generic methods.
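The rule in the quote above boils down to: map every reference-type argument to a single shared representative and leave value types alone. Here’s a rough sketch of that idea (the helper name is invented, and the real CLR logic is more subtle, e.g. user-defined structs can also share code when their GC layout matches):

```csharp
using System;

public static class SharingSketch
{
    // Hypothetical helper (not actual CLR code): returns the representative
    // type argument used for code sharing. All reference types map to a single
    // representative, so List<string> and List<Uri> can share one compiled
    // body, whereas List<int> and List<double> each get their own.
    public static Type GetRepresentative(Type typeArg) =>
        typeArg.IsValueType ? typeArg : typeof(object);

    public static void Main()
    {
        Console.WriteLine(SharingSketch.GetRepresentative(typeof(string))); // System.Object (shared)
        Console.WriteLine(SharingSketch.GetRepresentative(typeof(Uri)));    // System.Object (shared)
        Console.WriteLine(SharingSketch.GetRepresentative(typeof(int)));    // System.Int32 (unshared)
    }
}
```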
Update: If you want more info on the ‘code-sharing’ that takes places, I recommend reading these 4 posts:
Compiler and JIT Changes
It seems like almost every part of the compiler had to change to accommodate generics, which is not surprising given that they touch so many parts of the code we write: Types, Classes and Methods. Some of the biggest changes were:
- csharp/csharp/sccomp/clsdrec.cpp - +999 -363 - (diff)
- csharp/csharp/sccomp/emitter.cpp - +347 -127 - (diff)
- csharp/csharp/sccomp/fncbind.cpp - +926 -321 - (diff)
- csharp/csharp/sccomp/import.cpp - +382 -88 - (diff)
- csharp/csharp/sccomp/parser.cpp - +733 -418 - (diff)
- csharp/csharp/sccomp/symmgr.cpp - +741 -46 - (diff)
In the ‘just-in-time’ (JIT) compiler extra work was needed because it’s responsible for implementing the additional IL instructions. The bulk of these changes took place in fjit.cpp (diff) and fjitdef.h (diff).
Finally, a large amount of work was done in vm/jitinterface.cpp (diff) to enable the JIT to access the extra information it needed to emit code for generic methods.
Debugger Changes
Last, but by no means least, a significant amount of work was done to ensure that the debugger could understand and inspect generic types. It goes to show just how much inside information a debugger needs to have about the type system of a managed language.
- debug/ee/debugger.cpp (diff)
- debug/ee/debugger.h (diff)
- debug/di/module.cpp (diff)
- debug/di/rsthread.cpp (diff)
- debug/shell/dshell.cpp (diff)
Further Reading
If you want even more information about generics in .NET, there are also some very useful design docs available (included in the Gyro source code download):
Also Pre-compilation for .NET Generics by Andrew Kennedy & Don Syme (pdf) is an interesting read
Fri, 2 Mar 2018, 12:00 am
Resources for Learning about .NET Internals
It all started with a tweet, which seemed to resonate with people:
If you like reading my posts on .NET internals, you'll like all these other blogs. So I've put them together in a thread for you!!
— Matt Warren (@matthewwarren)
January 12, 2018
The aim was to list blogs that specifically cover .NET internals at a low-level or to put it another way, blogs that answer the question how does feature ‘X’ work, under-the-hood. The list includes either typical posts for that blog, or just some of my favourites!
Note: for a wider list of .NET and performance related blogs see Awesome .NET Performance by Adam Sitnik
I wouldn’t recommend reading through the entire list, at least not in one go, your brain will probably melt. Pick some posts/topics that interest you and start with those.
Finally, bear in mind that some of the posts are over 10 years old, so there’s a chance that things have changed since then (however, in my experience, the low-level parts of the CLR are more stable). If you want to double-check the latest behaviour, your best option is to read the source!
These blogs are all written by non-Microsoft employees (AFAICT), or if they do work for Microsoft, they don’t work directly on the CLR. If I’ve missed any interesting blogs out, please let me know!
Special mention goes to Sasha Goldshtein, he’s been blogging about this longer than anyone!!
Update: I missed out a few blogs and learnt about some new ones:
Honourable mention goes to .NET Type Internals - From a Microsoft CLR Perspective on CodeProject, it’s a great article!!
Book of the Runtime (BotR)
The BotR deserves its own section (thanks to svick for reminding me about it).
If you haven’t heard of the BotR before, there’s a nice FAQ that explains what it is:
The Book of the Runtime is a set of documents that describe components in the CLR and BCL. They are intended to focus more on architecture and invariants and not an annotated description of the codebase.
It was originally created within Microsoft in ~2007, including this document. Developers were responsible to document their feature areas. This helped new devs joining the team and also helped share the product architecture across the team.
To find your way around it, I recommend starting with the table of contents and then diving in.
Note: It’s written for developers working on the CLR, so it’s not an introductory document. I’d recommend reading some of the other blog posts first, then referring to the BotR once you have the basic knowledge. For instance many of my blog posts started with me reading a chapter from the BotR, not fully understanding it, going away and learning some more, writing up what I found and then pointing people to the relevant BotR page for more information.
Microsoft Engineers
The blogs below are written by the actual engineers who worked on, designed or managed various parts of the CLR, so they give a deep insight (again, if I’ve missed any blogs out, please let me know):
- Maoni’s WebLog - CLR Garbage Collector by Maoni Stephens
- cbrumme’s WebLog by Christopher Brumme
- A blog on coding, .NET, .NET Compact Framework and life in general.. by Abhinaba Basu
- Joel Pobar’s CLR weblog - CLR Program Manager: Reflection, LCG, Generics and the type system.. by Joel Pobar
- CLR Profiling API Blog - Info about the Common Language Runtime’s Profiling API by David Broman (slightly niche, but still worth a read)
- Yun Jin’s WebLog - CLR internals, Rotor code explanation, CLR debugging tips, trivial debugging notes, .NET programming pitfalls by Yun Jin
- JIT, NGen, and other Managed Code Generation Stuff - Details about RyuJIT stuff of all sort.. by various
- Distributed Matters - Troubleshooting issues in technologies available to developers for building distributed applications by Carlo
- B# .NET Blog - BART DE SMET’S on-line blog (0X2B | ~0X2B, THAT’S THE QUESTION) by Bart De Smet
- Nate McMaster’s blog by Nate McMaster
Books
Finally, if you prefer reading off-line there are some decent books that discuss .NET Internals (Note: all links are Amazon Affiliate links):
I own copies of all the books listed above and have read them cover-to-cover; they’re fantastic resources.
I’ve also recently been recommended the 2 books below, they look good and certainly the authors know their stuff, but I haven’t read them yet:
*New Release*
Discuss this post on HackerNews and /r/programming
Mon, 22 Jan 2018, 12:00 am
A look back at 2017
I’ve now been blogging consistently for over 2 years (~2 posts per month) and I decided it was time for my first ‘retrospective’ post.
Warning: this post contains a large amount of humble brags, if you’ve come here to read about ‘.NET internals’ you’d better check back in a few weeks, when normal service will be resumed!
Overall Stats
Firstly, let’s look at my Google Analytics stats for 2017, showing Page Views and Sessions:
Which clearly shows that I took a bit of a break during the summer! But I still managed over 800K page views, mostly because I was fortunate enough to end up on the front page of HackerNews a few times!
As a comparison, here’s what ‘2017 v 2016’ looks like:
This is cool because it shows a nice trend: more people read my blog posts in 2017 than in 2016 (but I have no idea if it will continue in 2018?!)
Most Read Posts
Next, here are my top 10 most read posts. Surprisingly enough, my most read post was literally just a list with 68 entries in it!!
| Post                                                                | Page Views |
|---------------------------------------------------------------------|------------|
| The 68 things the CLR does before executing a single line of your code | 101,382 |
| A Hitchhikers Guide to the CoreCLR Source Code                      | 61,169     |
| A DoS Attack against the C# Compiler                                | 50,884     |
| Analysing C# code on GitHub with BigQuery                           | 40,165     |
| Adding a new Bytecode Instruction to the CLR                        | 39,101     |
| Open Source .NET – 3 years later                                    | 36,316     |
| How do .NET delegates work?                                         | 36,047     |
| Lowering in the C# Compiler (and what happens when you misuse it)   | 34,375     |
| How the .NET Runtime loads a Type                                   | 32,813     |
| DotNetAnywhere: An Alternative .NET Runtime                         | 26,140     |
Traffic Sources
I was going to do a write-up on where/how I get my blog traffic, but instead I’d encourage you to read 6 Years of Thoughts on Programming by Henrik Warne as his experience exactly matches mine. But in summary, getting onto the front-page of HackerNews drives a lot of traffic to your site/blog.
Finally, a big thanks to everyone who has read, commented on or shared my blogs posts, it means a lot!!
Sun, 31 Dec 2017, 12:00 am
Open Source .NET – 3 years later
A little over 3 years ago Microsoft announced that they were open sourcing large parts of the .NET framework and as Scott Hanselman said in his Connect 2016 keynote, the community has been contributing in a significant way:
This post forms part of an on-going series, if you want to see how things have changed over time you can check out the previous ones:
In addition, I’ve recently done a talk covering this subject, the slides are below:
Microsoft & open source a 'brave new world' - CORESTART 2.0 from
Matt Warren
Historical Perspective
Now that we are 3 years down the line, it’s interesting to go back and see what the aims were when it all started. If you want to know more about this, I recommend watching the 2 Channel 9 videos below, made by the Microsoft Engineers involved in the process:
It hasn’t always been plain sailing, it’s fair to say that there have been a few bumps along the way (I guess that’s what happens if you get to see “how the sausage gets made”), but I think that we’ve ended up in a good place.
During the past 3 years there have been a few notable events that I think are worth mentioning:
Repository activity over time
But onto the data. First we are going to look at an overview of the level of activity in each repo, by looking at the total number of ‘Issues’ (created) or ‘Pull Requests’ (closed) per month (yay sparklines FTW!!). If you are interested in how I got the data, see the previous post because the process is the same.
Note: Numbers in black are from the most recent month, with the red dot showing the lowest and the green dot the highest previous value. You can toggle between Issues and Pull Requests by clicking on the buttons, hover over individual sparklines to get a tooltip showing the per/month values and click on the project name to take you to the GitHub page for that repository.
Issues
Pull Requests
This data gives a good indication of how healthy different repos are: are they growing over time, or staying the same? You can also see the different levels of activity each repo has and how they compare to other ones.
Whilst it’s clear that Visual Studio Code is way ahead of all the other repos in terms of ‘Issues’, it’s interesting to see that the .NET-only ones have the most ‘Pull-Requests’, notably CoreFX (Base Class Libraries), Roslyn (Compiler) and CoreCLR (Runtime).
Next we will look at the total participation from the last 3 years, i.e. November 2014 to November 2017. All Pull Requests and Issues are treated equally, so a large PR counts the same as one that fixes a spelling mistake. Whilst this isn’t ideal, it’s the simplest way to get an idea of the Microsoft/Community split.
Note: You can hover over the bars to get the actual numbers, rather than percentages.
Issues:
Microsoft
Community
Pull Requests:
Microsoft
Community
Finally we can see the ‘per-month’ data from the last 3 years, i.e. November 2014 to November 2017.
Note: You can inspect different repos by selecting them from the pull-down list, but be aware that the y-axis on the graphs are re-scaled, so the maximum value will change each time.
Issues:
Microsoft
Community
Pull Requests:
Microsoft
Community
Summary
It’s clear that the community continues to be invested in the .NET-related, Open Source repositories, contributing significantly and for a sustained period of time. I think this is good for all .NET developers, whether you contribute to OSS or not, having .NET be a thriving, Open Source product has many benefits!
Discuss this post on Hacker News and /r/programming
Tue, 19 Dec 2017, 12:00 am
A look at the internals of 'Tiered JIT Compilation' in .NET Core
The .NET runtime (CLR) has predominantly used a just-in-time (JIT) compiler to convert your executable into machine code (leaving aside ahead-of-time (AOT) scenarios for the time being), as the official Microsoft docs say:
At execution time, a just-in-time (JIT) compiler translates the MSIL into native code. During this compilation, code must pass a verification process that examines the MSIL and metadata to find out whether the code can be determined to be type safe.
But how does that process actually work?
The same docs give us a bit more info:
JIT compilation takes into account the possibility that some code might never be called during execution. Instead of using time and memory to convert all the MSIL in a PE file to native code, it converts the MSIL as needed during execution and stores the resulting native code in memory so that it is accessible for subsequent calls in the context of that process. The loader creates and attaches a stub to each method in a type when the type is loaded and initialized. When a method is called for the first time, the stub passes control to the JIT compiler, which converts the MSIL for that method into native code and modifies the stub to point directly to the generated native code. Therefore, subsequent calls to the JIT-compiled method go directly to the native code.
Simple really!! However if you want to know more, the rest of this post will explore this process in detail.
In addition, we will look at a new feature that is making its way into the Core CLR, called ‘Tiered Compilation’. This is a big change for the CLR, up till now .NET methods have only been JIT compiled once, on their first usage. Tiered compilation is looking to change that, allowing methods to be re-compiled into a more optimised version much like the Java Hotspot compiler.
How it works
But before we look at future plans, how does the current CLR allow the JIT to transform a method from IL to native code? Well, they say ‘a picture speaks a thousand words’:
Before the method is JITed
After the method has been JITed
The main things to note are:
- The CLR has put in a ‘precode’ and ‘stub’ to divert the initial method call to the PreStubWorker() method (which ultimately calls the JIT). These are hand-written assembly code fragments consisting of only a few instructions.
- Once the method has been JITed into ‘native code’, a stable entry point is created. For the rest of the life-time of the method the CLR guarantees that this won’t change, so the rest of the run-time can depend on it remaining stable.
- The ‘temporary entry point’ doesn’t go away; it’s still available because there may be other methods that are expecting to call it. However the associated ‘precode fixup’ has been re-written or ‘back patched’ to point to the newly created ‘native code’ instead of PreStubWorker().
- The CLR doesn’t change the address of the call instruction in the method that called the method being JITted, it only changes the address inside the ‘precode’. But because all method calls in the CLR go via a precode, the 2nd time the newly JITed method is called, the call will end up at the ‘native code’.
For reference, the ‘stable entry point’ is the same memory location as the IntPtr that is returned when you call the RuntimeMethodHandle.GetFunctionPointer() method.
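You can observe that stability from C# itself. The sketch below (my own experiment, not taken from the CLR source) fetches the function pointer before and after the method has been JITed and checks that it hasn’t moved:

```csharp
using System;
using System.Reflection;

public static class EntryPointDemo
{
    public static void Hello() { }

    // Per the CLR guarantee described above, the stable entry point for a
    // method should be the same before and after the method is JITed.
    public static bool EntryPointIsStable()
    {
        MethodInfo mi = typeof(EntryPointDemo).GetMethod(nameof(Hello));
        IntPtr before = mi.MethodHandle.GetFunctionPointer();
        Hello(); // trigger JIT compilation of the method
        IntPtr after = mi.MethodHandle.GetFunctionPointer();
        return before == after;
    }

    public static void Main() => Console.WriteLine(EntryPointIsStable());
}
```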
If you want to see this process in action for yourself, you can either re-compile the CoreCLR source and add the relevant debug information as I did or just use WinDbg and follow the steps in this excellent blog post (for more on the same topic see ‘Advanced Call Processing in the CLR’ and Vance Morrison’s excellent write-up ‘Digging into interface calls in the .NET Framework: Stub-based dispatch’).
Finally, the different parts of the Core CLR source code that are involved are listed below:
Note: this post isn’t going to look at how the JIT itself works, if you are interested in that take a look as this excellent overview written by one of the main developers.
JIT and Execution Engine (EE) Interaction
To make all this work the JIT and the EE have to work together. To get an idea of what is involved, take a look at this comment describing the rules that determine which type of precode the JIT can use. All this info is stored in the EE as it’s the only place that has full knowledge of what a method does, so the JIT has to ask which mode to work in.
In addition, the JIT has to ask the EE what the address of a function’s entry point is; this is done via the following methods:
Precode and Stubs
There are different types or ‘precode’ available, ‘FIXUP’, ‘REMOTING’ or ‘STUB’, you can see the rules for which one is used in MethodDesc::GetPrecodeType(). In addition, because they are such a low-level mechanism, they are implemented differently across CPU architectures, from a comment in the code:
There two implementation options for temporary entrypoints:
(1) Compact entrypoints. They provide as dense entrypoints as possible, but can’t be patched to point to the final code. The call to unjitted method is indirect call via slot.
(2) Precodes. The precode will be patched to point to the final code eventually, thus the temporary entrypoint can be embedded in the code.
The call to unjitted method is direct call to direct jump.
We use (1) for x86 and (2) for 64-bit to get the best performance on each platform. For ARM (1) is used.
There’s also a whole lot more information about ‘precode’ available in the BOTR.
Finally, it turns out that you can’t go very far into the internals of the CLR without coming across ‘stubs’ (or ‘trampolines’, ‘thunks’, etc), for instance they’re used in
Tiered Compilation
Before we go any further I want to point out that Tiered Compilation is very much work-in-progress. As an indication, to get it working you currently have to set an environment variable called COMPLUS_EXPERIMENTAL_TieredCompilation. It appears that the current work is focussed on the infrastructure to make it possible (i.e. CLR changes); then I assume that there has to be a fair amount of testing and performance analysis before it’s enabled by default.
If you want to learn about the goals of the feature and how it fits into the wider process of ‘code versioning’, I recommend reading the excellent design docs, including the future roadmap possibilities.
To give an indication of what has been involved so far, there has been work going on in the:
If you want to follow along you can take a look at the related issues/PRs, here are the main ones to get you started:
There is also some nice background information available in Introduce a tiered JIT and if you want to understand how it will eventually make use of changes in the JIT (‘MinOpts’), take a look at Low Tier Back-Off and JIT: enable aggressive inline policy for Tier1.
History - ReJIT
As an quick historical aside, you have previously been able to get the CLR to re-JIT a method for you, but it only worked with the Profiling APIs, which meant you had to write some C/C++ COM code to make it happen! In addition ReJIT only allowed the method to be re-compiled at the same level, so it wouldn’t ever produce more optimised code. It was mostly meant to help monitoring or profiling tools.
How it works
Finally, how does it work? Again, let’s look at some diagrams. Firstly, as a recap, let’s take a look at how things end up once a method has been JITed, with tiered compilation turned off (the same diagram as above):
Now, as a comparison, here’s what the same stage looks like with tiered compilation enabled:
The main difference is that tiered compilation has forced the method call to go through another level of indirection, the ‘pre stub’. This is to make it possible to count the number of times the method is called, then once it has hit the threshold (currently 30), the ‘pre stub’ is re-written to point to the ‘optimised native code’ instead:
Note that the original ‘native code’ is still available, so if needed the changes can be reverted and the method call can go back to the unoptimised version.
Using a counter
We can see a bit more detail about the counter in this comment from prestub.cpp:
/*************************** CALL COUNTER ***********************/
// If we are counting calls for tiered compilation, leave the prestub
// in place so that we can continue intercepting method invocations.
// When the TieredCompilationManager has received enough call notifications
// for this method only then do we back-patch it.
BOOL fCanBackpatchPrestub = TRUE;
#ifdef FEATURE_TIERED_COMPILATION
BOOL fEligibleForTieredCompilation = IsEligibleForTieredCompilation();
if (fEligibleForTieredCompilation)
{
    CallCounter * pCallCounter = GetCallCounter();
    fCanBackpatchPrestub = pCallCounter->OnMethodCalled(this);
}
#endif
In essence the ‘stub’ calls back into the TieredCompilationManager until ‘tiered compilation’ is triggered; once that happens, the ‘stub’ is ‘back-patched’ to stop it being called any more.
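The shape of that counting logic can be sketched in a few lines of C# (names and structure invented for illustration; the real CallCounter lives in the CLR’s C++ code and is considerably more involved):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the counting logic described above: the prestub
// stays in place and counts calls; only once the threshold is reached may
// the prestub be back-patched to point at the optimised code.
public class CallCounterSketch
{
    private const int Threshold = 30; // the current threshold mentioned in the post
    private readonly Dictionary<string, int> _callCounts = new Dictionary<string, int>();

    // Mirrors the role of CallCounter::OnMethodCalled(): returns true when
    // the prestub can be back-patched (i.e. the method is 'hot' enough).
    public bool OnMethodCalled(string method)
    {
        _callCounts.TryGetValue(method, out int count);
        _callCounts[method] = ++count;
        return count >= Threshold;
    }
}
```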
Why not ‘Interpreted’?
If you’re wondering why tiered compilation doesn’t have an interpreted mode, you’re not alone, I asked the same question (for more info see my previous post on the .NET Interpreter)
And the answer I got was:
There’s already an Interpreter available, or is it not considered suitable for production code?
Its a fine question, but you guessed correctly - the interpreter is not in good enough shape to run production code as-is. There are also some significant issues if you want debugging and profiling tools to work (which we do). Given enough time and effort it is all solvable, it just isn’t the easiest place to start.
How different is the overhead between non-optimised and optimised JITting?
On my machine non-optimized jitting used about ~65% of the time that optimized jitting took for similar IL input sizes, but of course I expect results will vary by workload and hardware. Getting this first step checked in should make it easier to collect better measurements.
But that’s from a few months ago, maybe Mono’s New .NET Interpreter will change things, who knows?
Why not LLVM?
Finally, why aren’t they using LLVM to compile the code? From Introduce a tiered JIT (comment):
There were (and likely still are) significant differences in the LLVM support needed for the CLR versus what is needed for Java, both in GC and in EH, and in the restrictions one must place on the optimizer. To cite just one example: the CLRs GC currently cannot tolerate managed pointers that point off the end of objects. Java handles this via a base/derived paired reporting mechanism. We’d either need to plumb support for this kind of paired reporting into the CLR or restrict LLVM’s optimizer passes to never create these kinds of pointers. On top of that, the LLILC jit was slow and we weren’t sure ultimately what kind of code quality it might produce.
So, figuring out how LLILC might fit into a potential multi-tier approach that did not yet exist seemed (and still seems) premature. The idea for now is to get tiering into the framework and use RyuJit for the second-tier jit. As we learn more, we may discover there is indeed room for higher tier jits, or, at least, understand better what else we need to do before such things make sense.
There is more background info in Introduce a tiered JIT
Summary
One of my favourite side-effects of Microsoft making .NET Open Source and developing out in the open is that we can follow along with work-in-progress features. It’s great being able to download the latest code, try them out and see how they work under-the-hood, yay for OSS!!
Discuss this post on Hacker News
Fri, 15 Dec 2017, 12:00 am
Exploring the BBC micro:bit Software Stack
If you grew up in the UK and went to school during the 1980s or 1990s there’s a good chance that this picture brings back fond memories:
(image courtesy of Classic Acorn)
I’d imagine that for a large number of computer programmers (currently in their 30s) the BBC Micro was their first experience of programming. If this applies to you and you want a trip down memory lane, have a read of Remembering: The BBC Micro and The BBC Micro in my education.
Programming the classic Turtle was done in Logo, with code like this:
FORWARD 100
LEFT 90
FORWARD 100
LEFT 90
FORWARD 100
LEFT 90
FORWARD 100
LEFT 90
Of course, once you knew what you were doing, you would re-write it like so:
REPEAT 4 [FORWARD 100 LEFT 90]
BBC micro:bit
The original Micro was launched as an education tool, as part of the BBC’s Computer Literacy Project and by most accounts was a big success. As a follow-up, in March 2016 the micro:bit was launched as part of the BBC’s ‘Make it Digital’ initiative and 1 million devices were given out to schools and libraries in the UK to ‘help develop a new generation of digital pioneers’ (i.e. get them into programming!)
Aside: I love the difference in branding across 30 years, ‘BBC Micro’ became ‘BBC micro:bit’ (you must include the colon) and ‘Computer Literacy Project’ changed to the ‘Make it Digital Initiative’.
A few weeks ago I walked into my local library, picked up a nice starter kit and then spent a fun few hours watching my son play around with it (I’m worried about how quickly he picked up the basics of programming, I think I might be out of a job in a few years time!!)
However once he’d gone to bed it was all mine! The result of my ‘playing around’ is this post, in it I will be exploring the software stack that makes up the micro:bit, what’s in it, what it does and how it all fits together.
If you want to learn about how to program the micro:bit, its hardware or anything else, take a look at this excellent list of resources.
Slightly off-topic, but if you enjoy reading source code you might like these other posts:
BBC micro:bit Software Stack
If we take a high-level view of the stack, it divides up into 3 discrete software components that all sit on top of the hardware itself:
If you would like to build this stack for yourself take a look at the Building with Yotta guide. I also found this post describing The First Video Game on the BBC micro:bit [probably] very helpful.
Runtimes
There are several high-level runtimes available, these are useful because they let you write code in a language other than C/C++ or even create programs by dragging blocks around on a screen. The main ones that I’ve come across are below (see ‘Programming’ for a full list):
They both work in a similar way: the user’s code (Python or TypeScript) is bundled up along with the C/C++ code of the runtime itself and then the entire binary (hex) file is deployed to the micro:bit. When the device starts up, the runtime then looks for the user’s code at a known location in memory and starts interpreting it.
Update It turns out that I was wrong about the Microsoft PXT, it actually compiles your TypeScript program to native code, very cool! Interestingly, they did it that way because:
Compared to a typical dynamic JavaScript engine, PXT compiles code statically, giving rise to significant time and space performance improvements:
- user programs are compiled directly to machine code, and are never in any byte-code form that needs to be interpreted; this results in much faster execution than a typical JS interpreter
- there is no RAM overhead for user-code - all code sits in flash; in a dynamic VM there are usually some data-structures representing code
- due to lack of boxing for small integers and static class layout the memory consumption for objects is around half the one you get in a dynamic VM (not counting the user-code structures mentioned above)
- while there is some runtime support code in PXT, it’s typically around 100KB smaller than a dynamic VM, bringing down flash consumption and leaving more space for user code
The execution time, RAM and flash consumption of PXT code is as a rule of thumb 2x of compiled C code, making it competitive to write drivers and other user-space libraries.
Memory Layout
Just before we go onto the other parts of the software stack I want to take a deeper look at the memory layout. This is important because memory is so constrained on the micro:bit: there is only 16KB of RAM. To put that into perspective, we’ll use the calculation from this StackOverflow question How many bytes of memory is a tweet?
Twitter uses UTF-8 encoded messages. UTF-8 code points can be up to four octets long, making the maximum message size 140 x 4 = 560 8-bit bytes.
If we re-calculate for the newer, longer tweets 280 x 4 = 1,120 bytes. So we could only fit 10 tweets into the available RAM on the micro:bit (it turns out that only ~11K out of the total 16K is available for general use). Which is why it’s worth using a custom version of atoi() to save 350 bytes of RAM!
The memory layout is specified by the linker at compile-time using NRF51822.ld, there is a sample output available if you want to take a look. Because it’s done at compile-time you run into build errors such as “region RAM overflowed with stack” if you configure it incorrectly.
The table below shows the memory layout from the ‘no SD’ version of a ‘Hello World’ app, i.e. with the maximum amount of RAM available as the Bluetooth (BLE) Soft-Device (SD) support has been removed. By comparison with BLE enabled, you instantly have 8K less RAM available, so things start to get tight!
| Name | Start Address | End Address | Size | Percentage |
|------|---------------|-------------|------|------------|
| .data | 0x20000000 | 0x20000098 | 152 bytes | 0.93% |
| .bss | 0x20000098 | 0x20000338 | 672 bytes | 4.10% |
| Heap (mbed) | 0x20000338 | 0x20000b38 | 2,048 bytes | 12.50% |
| Empty | 0x20000b38 | 0x20003800 | 11,464 bytes | 69.97% |
| Stack | 0x20003800 | 0x20004000 | 2,048 bytes | 12.50% |
For more info on the column names see the Wikipedia pages for .data and .bss as well as text, data and bss: Code and Data Size Explained
As a comparison there is a nice image of the micro:bit RAM Layout in this article. It shows what things look like when running MicroPython and you can clearly see the main Python heap in the centre taking up all the remaining space.
Sitting in the stack below the high-level runtime is the device abstraction layer (DAL). Created at Lancaster University in the UK, it’s made up of the following components:
- core - High-level components, such as Device, Font, HeapAllocator, Listener and Fiber, often implemented on-top of 1 or more driver classes
- types - Helper types such as ManagedString, Image, Event and PacketBuffer
- drivers - For control of a specific hardware component, such as Accelerometer, Button, Compass, Display, Flash, IO, Serial and Pin
- bluetooth
- asm - Just 4 functions are implemented in assembly: swap_context, save_context, save_register_context and restore_register_context. As the names suggest, they handle the ‘context switching’ necessary to make the MicroBit Fiber scheduler work
The image below shows the distribution of ‘Lines of Code’ (LOC); as you can see, the majority of the code is in the drivers and bluetooth components.
In addition to providing nice helper classes for working with the underlying devices, the DAL provides the Fiber abstraction to allow asynchronous functions to work. This is useful because you can asynchronously display text on the LED display and your code won’t block whilst it’s scrolling across the screen. In addition, the Fiber class is used to handle the interrupts that signal when the buttons on the micro:bit are pushed. This comment from the code clearly lays out what the Fiber scheduler does:
This lightweight, non-preemptive scheduler provides a simple threading mechanism for two main purposes:
1) To provide a clean abstraction for application languages to use when building async behaviour (callbacks).
2) To provide ISR decoupling for EventModel events generated in an ISR context.
Finally, the high-level classes MicroBit.cpp and MicroBit.h are housed in the microbit repository. These classes define the API of the MicroBit runtime and set up the default configuration, as shown in the Constructor of MicroBit.cpp:
/**
* Constructor.
*
* Create a representation of a MicroBit device, which includes member variables
* that represent various device drivers used to control aspects of the micro:bit.
*/
MicroBit::MicroBit() :
serial(USBTX, USBRX),
resetButton(MICROBIT_PIN_BUTTON_RESET),
storage(),
i2c(I2C_SDA0, I2C_SCL0),
messageBus(),
display(),
buttonA(MICROBIT_PIN_BUTTON_A, MICROBIT_ID_BUTTON_A),
buttonB(MICROBIT_PIN_BUTTON_B, MICROBIT_ID_BUTTON_B),
buttonAB(MICROBIT_ID_BUTTON_A,MICROBIT_ID_BUTTON_B, MICROBIT_ID_BUTTON_AB),
accelerometer(i2c),
compass(i2c, accelerometer, storage),
compassCalibrator(compass, accelerometer, display),
thermometer(storage),
io(MICROBIT_ID_IO_P0,MICROBIT_ID_IO_P1,MICROBIT_ID_IO_P2,
MICROBIT_ID_IO_P3,MICROBIT_ID_IO_P4,MICROBIT_ID_IO_P5,
MICROBIT_ID_IO_P6,MICROBIT_ID_IO_P7,MICROBIT_ID_IO_P8,
MICROBIT_ID_IO_P9,MICROBIT_ID_IO_P10,MICROBIT_ID_IO_P11,
MICROBIT_ID_IO_P12,MICROBIT_ID_IO_P13,MICROBIT_ID_IO_P14,
MICROBIT_ID_IO_P15,MICROBIT_ID_IO_P16,MICROBIT_ID_IO_P19,
MICROBIT_ID_IO_P20),
bleManager(storage),
radio(),
ble(NULL)
{
...
}
The software at the bottom of the stack is making use of the ARM mbed OS which is:
.. an open-source embedded operating system designed for the “things” in the Internet of Things (IoT). mbed OS includes the features you need to develop a connected product using an ARM Cortex-M microcontroller.
mbed OS provides a platform that includes:
- Security foundations.
- Cloud management services.
- Drivers for sensors, I/O devices and connectivity.
mbed OS is modular, configurable software that you can customize to your device to reduce memory requirements by excluding unused software.
We can see this from the layout of its source: it’s based around common components, which can be combined with a hal (Hardware Abstraction Layer) and a target specific to the hardware you are running on. More specifically, the micro:bit uses the yotta target bbc-microbit-classic-gcc, but it can also use other targets as needed.
For reference, here are the files from the common section of mbed that are used by the micro:bit-dal:
And here are the hardware specific files, targeting the NORDIC - MCU NRF51822:
End-to-end (or top-to-bottom)
Finally, let’s look at a few examples of how the different components within the stack are used in specific scenarios.
Writing to the Display
Storing files on the Flash memory
- microbit-dal
- mbed-classic - Allows low-level control of the hardware, such as writing to the flash itself, either directly or via the SoftDevice (SD) layer
In addition, this comment from MicroBitStorage.h gives a nice overview of how the file system is implemented on-top of the raw flash storage:
* The first 8 bytes are reserved for the KeyValueStore struct which gives core
* information such as the number of KeyValuePairs in the store, and whether the
* store has been initialised.
*
* After the KeyValueStore struct, KeyValuePairs are arranged contiguously until
* the end of the block used as persistent storage.
*
* |-------8-------|--------48-------|-----|---------48--------|
* | KeyValueStore | KeyValuePair[0] | ... | KeyValuePair[N-1] |
* |---------------|-----------------|-----|-------------------|
Summary
All-in-all the micro:bit is a very nice piece of kit and hopefully will achieve its goal ‘to help develop a new generation of digital pioneers’. However, it also has a really nice software stack, one that is easy to understand and find your way around.
Further Reading
I’ve got nothing to add that isn’t already included in this excellent, comprehensive list of resources, thanks Carlos for putting it together!!
Discuss this post on Hacker News or /r/microbit
Tue, 28 Nov 2017, 12:00 am
Microsoft & Open Source a 'Brave New World' - CORESTART 2.0
Recently I was fortunate enough to be invited to the CORESTART 2.0 conference to give a talk on Microsoft & Open Source a ‘Brave New World’. It was a great conference, well organised by Tomáš Herceg and the teams from .NET College and Riganti and I had a great time.
I encourage you to attend next year’s ‘Update’ conference if you can, and as a bonus you’ll get to see the sights of Prague! Including the Head of Franz Kafka as well as the amazing buildings, castles and bridges that all the guide-books will tell you about!
I’ve not been ‘invited’ to speak at a conference before, so I wasn’t sure what to expect, but there was a great audience and they seemed happy to learn about the Open Source projects that Microsoft are running and what is being done to encourage us (the ‘Community’) to contribute.
The slides for my talk are embedded below and you can also ‘watch’ the entire recording (audio and slides only, no video).
Microsoft & open source a 'brave new world' - CORESTART 2.0 from
Matt Warren
Talk Outline
But if you don’t fancy sitting through the whole thing, you can read the summary below and jump straight to the relevant parts
Before
[jump to slide] [direct video link]
During
[jump to slide] [direct video link]
After
[jump to slide] [direct video link]
What Now?
[jump to slide] [direct video link]
Domino Chain Reaction
Finally, if you’re wondering what the section on ‘Domino Chain Reaction’ is all about, you’ll have to listen to that part of the talk, but the video itself is embedded below:
(Based on actual research, see The Curious Mathematics of Domino Chain Reactions)
Tue, 14 Nov 2017, 12:00 am
A DoS Attack against the C# Compiler
Generics in C# are certainly very useful and I find it amazing that we almost didn’t get them:
What would the cost of inaction have been? What would the cost of failure have been? No generics in C# 2.0? No LINQ in C# 3.0? No TPL in C# 4.0? No Async in C# 5.0? No F#? Ultimately, an erasure model of generics would have been adopted, as for Java, since the CLR team would never have pursued a in-the-VM generics design without external help.
So a big thanks is due to Don Syme and the rest of the team at Microsoft Research in Cambridge!
But as well as being useful, I also find some usages of generics mind-bending, for instance I’m still not sure what this code actually means or how to explain it in words:
class Blah<T> where T : Blah<T>
As always, reading an Eric Lippert post helps a lot, but even he recommends against using this specific ‘circular’ pattern.
Recently I spoke at the CORESTART 2.0 conference in Prague, giving a talk on ‘Microsoft and Open-Source – A ‘Brave New World’. Whilst I was there I met the very knowledgeable Jiri Cincura, who blogs at tabs ↹ over ␣ ␣ ␣ spaces. He was giving a great talk on ‘C# 7.1 and 7.2 features’, but also shared with me an excellent code snippet that he called ‘Crazy Class’:
class Class<A, B>
{
    class Inner : Class<Inner, Inner>
    {
        Inner.Inner.Inner.Inner.Inner.Inner.Inner.Inner.Inner inner;
    }
}
He said:
this is the class that takes crazy amount of time to compile. You can add more Inner.Inner.Inner...
to make it even longer (and also generic parameters).
After a bit of digging around I found that someone else had noticed this, see the StackOverflow question Why does field declaration with duplicated nested type in generic class results in huge source code increase? Helpfully the ‘accepted answer’ explains what is going on:
When you combine these two, the way you have done, something interesting happens. The type Outer.Inner is not the same type as Outer.Inner.Inner. Outer.Inner is a subclass of Outer while Outer.Inner.Inner is a subclass of Outer, which we established before as being different from Outer.Inner. So Outer.Inner.Inner and Outer.Inner are referring to different types.
When generating IL, the compiler always uses fully qualified names for types. You have cleverly found a way to refer to types with names whose lengths grow at exponential rates. That is why, as you increase the generic arity of Outer or add additional levels .Y to the field field in Inner, the output IL size and compile time grow so quickly.
Clear? Good!!
You probably have to be Jon Skeet, Eric Lippert or a member of the C# Language Design Team (yay, ‘Matt Warren’) to really understand what’s going on here, but that doesn’t stop the rest of us having fun with the code!!
I can’t think of any reason why you’d actually want to write code like this, so please don’t!! (or at least if you do, don’t blame me!!)
For a simple idea of what’s actually happening, let’s take this code (with only 2 ‘Levels’):
class Class<A, B>
{
    class Inner : Class<Inner, Inner>
    {
        Inner.Inner inner;
    }
}
The ‘decompiled’ version actually looks like this:
internal class Class<A, B>
{
    private class Inner : Class<Class<A, B>.Inner, Class<A, B>.Inner>
    {
        private Class<Class<A, B>.Inner, Class<A, B>.Inner>.Inner inner;
    }
}
Wow, no wonder things go wrong quickly!!
Exponential Growth
Firstly let’s check the claim of exponential growth; if you don’t remember your Big O notation you can also think of this as O(very, very bad)!!
To test this out, I’m going to compile the code above, but vary the ‘level’ each time by adding a new .Inner, so ‘Level 5’ looks like this:
Inner.Inner.Inner.Inner.Inner inner;
‘Level 6’ like this, and so on
Inner.Inner.Inner.Inner.Inner.Inner inner;
We then get the following results:
| Level | Compile Time (secs) | Working set (KB) | Binary Size (Bytes) |
|-------|---------------------|------------------|---------------------|
| 5 | 1.15 | 54,288 | 135,680 |
| 6 | 1.22 | 59,500 | 788,992 |
| 7 | 2.00 | 70,728 | 4,707,840 |
| 8 | 6.43 | 121,852 | 28,222,464 |
| 9 | 33.23 | 405,472 | 169,310,208 |
| 10 | 202.10 | 2,141,272 | CRASH |
If we look at these results in graphical form, it’s very obvious what’s going on
(the dotted lines are a ‘best fit’ trend-line and they are exponential)
If I compile the code with dotnet build (version 2.0.0), things go really wrong at ‘Level 10’ and the compiler throws an error (full stack trace):
System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values.
Which looks similar to Internal compiler error when creating Portable PDB files #3866.
However your mileage may vary; when I ran the code in Visual Studio 2015 it threw an OutOfMemoryException instead and then promptly restarted itself!! I assume this is because VS is a 32-bit application and it runs out of memory before things can go really wrong!
Mono Compiler
As a comparison, here are the results from the Mono compiler, thanks to Egor Bogatov for putting them together.
| Level | Compile Time (secs) | Memory Usage (Bytes) |
|-------|---------------------|----------------------|
| 5 | 0.480 | 134,144 |
| 6 | 0.502 | 786,944 |
| 7 | 0.745 | 4,706,304 |
| 8 | 2.053 | 28,220,928 |
| 9 | 10.134 | 169,308,672 |
| 10 | 57.307 | 1,015,835,136 |
At ‘Level 10’ it produced a 968.78 MB binary!!
Profiling the Compiler
Finally, I want to look at just where the compiler is spending all its time. From the results above we saw that it was taking over 3 minutes to compile a simple program, with a peak memory usage of 2.14 GB, so what was it actually doing??
Well, clearly there are lots of Types involved and the compiler seems happy for you to write this code, so I guess it needs to figure it all out. Once it’s done that, it then needs to write all this Type metadata out to a .dll or .exe, which can be 100’s of MB in size.
At a high-level the profiling summary produced by VS looks like this (click for full-size image):
However if we take a closer look, we can see the ‘hot-path’ is inside the SerializeTypeReference(..) method in Compilers/Core/Portable/PEWriter/MetadataWriter.cs
Summary
I’m a bit torn about this, it is clearly an ‘abuse’ of generics!!
In some ways I think that it shouldn’t be fixed; it seems better that the compiler discourages you from writing code like this, rather than making it possible!!
So if it takes 3 mins to compile your code, allocates 2GB of memory and then crashes, take that as a warning!!
Discuss this post on Hacker News, /r/programming and /r/csharp
The post A DoS Attack against the C# Compiler first appeared on my blog Performance is a Feature!
Wed, 8 Nov 2017, 12:00 am
DotNetAnywhere: An Alternative .NET Runtime
Recently I was listening to the excellent DotNetRocks podcast and they had Steven Sanderson (of Knockout.js fame) talking about ‘WebAssembly and Blazor’.
In case you haven’t heard about it, Blazor is an attempt to bring .NET to the browser, using the magic of WebAssembly. If you want more info, Scott Hanselman has done a nice write-up of the various .NET/WebAssembly projects.
However, as much as the mention of WebAssembly was pretty cool, what interested me even more was how Blazor was using DotNetAnywhere as the underlying .NET runtime. This post will look at what DotNetAnywhere is, what you can do with it and how it compares to the full .NET framework.
DotNetAnywhere
Firstly it’s worth pointing out that DotNetAnywhere (DNA) is designed to be a fully compliant .NET runtime, which means that it can run .NET dlls/exes that have been compiled to run against the full framework. On top of that (at least in theory) it supports all the following .NET runtime features, which is a pretty impressive list!
- Generics
- Garbage collection and finalization
- Weak references
- Full exception handling - try/catch/finally
- PInvoke
- Interfaces
- Delegates
- Events
- Nullable types
- Single-dimensional arrays
- Multi-threading
In addition there is some partial support for Reflection
- Very limited read-only reflection
- typeof(), .GetType(), Type.Name, Type.Namespace, Type.IsEnum(), .ToString() only
Finally, there are a few features that are currently unsupported:
- Attributes
- Most reflection
- Multi-dimensional arrays
- Unsafe code
There are various bugs or missing functionality that might prevent your code running under DotNetAnywhere, however several of these have been fixed since Blazor came along, so it’s worth checking against the Blazor version of DotNetAnywhere.
At this point in time the original DotNetAnywhere repo is no longer active (the last sustained activity was in Jan 2012), so it seems that any future development or bug fixes will likely happen in the Blazor repo. If you have ever fixed something in DotNetAnywhere, consider sending a PR there to help the effort.
Update: In addition there are other forks with various bug fixes and enhancements:
Source Code Layout
What I find most impressive about the DotNetAnywhere runtime is that it was developed by one person and is less than 40,000 lines of code!! For comparison, the .NET framework Garbage Collector is almost 37,000 lines on its own (more info available in my previous post A Hitchhikers Guide to the CoreCLR Source Code).
This makes DotNetAnywhere an ideal learning resource!
Firstly, let’s take a look at the Top-10 largest source files, to see where the complexity is:
Native Code - 17,710 lines in total
| LOC | File |
|-----|------|
| 3,164 | JIT_Execute.c |
| 1,778 | JIT.c |
| 1,109 | PInvoke_CaseCode.h |
| 630 | Heap.c |
| 618 | MetaData.c |
| 563 | MetaDataTables.h |
| 517 | Type.c |
| 491 | MetaData_Fill.c |
| 467 | MetaData_Search.c |
| 452 | JIT_OpCodes.h |
Managed Code - 28,783 lines in total
| LOC | File |
|-----|------|
| 2,393 | corlib/System.Globalization/CalendricalCalculations.cs |
| 2,314 | corlib/System/NumberFormatter.cs |
| 1,582 | System.Drawing/System.Drawing/Pens.cs |
| 1,443 | System.Drawing/System.Drawing/Brushes.cs |
| 1,405 | System.Core/System.Linq/Enumerable.cs |
| 745 | corlib/System/DateTime.cs |
| 693 | corlib/System.IO/Path.cs |
| 632 | corlib/System.Collections.Generic/Dictionary.cs |
| 598 | corlib/System/String.cs |
| 467 | corlib/System.Text/StringBuilder.cs |
Main areas of functionality
Next, lets look at the key components in DotNetAnywhere as this gives us a really good idea about what you need to implement a .NET compatible runtime. Along the way, we will also see how they differ from the implementation found in Microsoft’s .NET Framework.
Reading .NET dlls
The first thing DotNetAnywhere has to do is read/understand/parse the .NET Metadata and Code that’s contained in a .dll/.exe. This all takes place in MetaData.c, primarily within the LoadSingleTable(..) function. By adding some debugging code, I was able to get a summary of all the different types of Metadata that are read in from a typical .NET dll, it’s quite an interesting list:
MetaData contains 1 Assemblies (MD_TABLE_ASSEMBLY)
MetaData contains 1 Assembly References (MD_TABLE_ASSEMBLYREF)
MetaData contains 0 Module References (MD_TABLE_MODULEREF)
MetaData contains 40 Type References (MD_TABLE_TYPEREF)
MetaData contains 13 Type Definitions (MD_TABLE_TYPEDEF)
MetaData contains 14 Type Specifications (MD_TABLE_TYPESPEC)
MetaData contains 5 Nested Classes (MD_TABLE_NESTEDCLASS)
MetaData contains 11 Field Definitions (MD_TABLE_FIELDDEF)
MetaData contains 0 Field RVA's (MD_TABLE_FIELDRVA)
MetaData contains 2 Properties (MD_TABLE_PROPERTY)
MetaData contains 59 Member References (MD_TABLE_MEMBERREF)
MetaData contains 2 Constants (MD_TABLE_CONSTANT)
MetaData contains 35 Method Definitions (MD_TABLE_METHODDEF)
MetaData contains 5 Method Specifications (MD_TABLE_METHODSPEC)
MetaData contains 4 Method Semantics (MD_TABLE_PROPERTY)
MetaData contains 0 Method Implementations (MD_TABLE_METHODIMPL)
MetaData contains 22 Parameters (MD_TABLE_PARAM)
MetaData contains 2 Interface Implementations (MD_TABLE_INTERFACEIMPL)
MetaData contains 0 Implementation Maps? (MD_TABLE_IMPLMAP)
MetaData contains 2 Generic Parameters (MD_TABLE_GENERICPARAM)
MetaData contains 1 Generic Parameter Constraints (MD_TABLE_GENERICPARAMCONSTRAINT)
MetaData contains 22 Custom Attributes (MD_TABLE_CUSTOMATTRIBUTE)
MetaData contains 0 Security Info Items? (MD_TABLE_DECLSECURITY)
For more information on the Metadata see Introduction to CLR metadata, Anatomy of a .NET Assembly – PE Headers and the ECMA specification itself.
Executing .NET IL
Another large piece of functionality within DotNetAnywhere is the ‘Just-in-Time’ Compiler (JIT), i.e. the code that is responsible for executing the IL. This takes place initially in JIT_Execute.c and then JIT.c. The main ‘execution loop’ is in the JITit(..) function, which contains an impressive 1,374 lines of code and over 200 case statements within a single switch!!
Taking a higher level view, the overall process that it goes through looks like this:
Where the .NET IL Op-Codes (CIL_XXX) are defined in CIL_OpCodes.h and the DotNetAnywhere JIT Op-Codes (JIT_XXX) are defined in JIT_OpCodes.h
Interestingly enough, the JIT is the only place in DotNetAnywhere that uses assembly code, and even then it’s only for win32. It is used to allow a ‘jump’ or a goto to labels in the C source code, so as IL instructions are executed control never actually leaves the JITit(..) function; it is just moved around without having to make a full method call.
#ifdef __GNUC__
#define GET_LABEL(var, label) var = &&label
#define GO_NEXT() goto **(void**)(pCurOp++)
#else
#ifdef WIN32
#define GET_LABEL(var, label) \
{ __asm mov edi, label \
__asm mov var, edi }
#define GO_NEXT() \
{ __asm mov edi, pCurOp \
__asm add edi, 4 \
__asm mov pCurOp, edi \
__asm jmp DWORD PTR [edi - 4] }
#endif
#endif
Differences with the .NET Framework
In the full .NET framework all IL code is turned into machine code by the Just-in-Time Compiler (JIT) before being executed by the CPU.
However, as we’ve already seen, DotNetAnywhere ‘interprets’ the IL, instruction-by-instruction, and even though it’s done in a file called JIT.c, no machine code is emitted, so the naming seems strange!?
Maybe it’s just a difference of perspective, but it’s not clear to me at what point you move from ‘interpreting’ code to ‘JITting’ it; even after reading the following links I’m still not sure!! (can someone enlighten me?)
Garbage Collector
All the code for the DotNetAnywhere Garbage Collector (GC) is contained in Heap.c and is a very readable 600 lines of code. To give you an overview of what it does, here is the list of functions that it exposes:
void Heap_Init();
void Heap_SetRoots(tHeapRoots *pHeapRoots, void *pRoots, U32 sizeInBytes);
void Heap_UnmarkFinalizer(HEAP_PTR heapPtr);
void Heap_GarbageCollect();
U32 Heap_NumCollections();
U32 Heap_GetTotalMemory();
HEAP_PTR Heap_Alloc(tMD_TypeDef *pTypeDef, U32 size);
HEAP_PTR Heap_AllocType(tMD_TypeDef *pTypeDef);
void Heap_MakeUndeletable(HEAP_PTR heapEntry);
void Heap_MakeDeletable(HEAP_PTR heapEntry);
tMD_TypeDef* Heap_GetType(HEAP_PTR heapEntry);
HEAP_PTR Heap_Box(tMD_TypeDef *pType, PTR pMem);
HEAP_PTR Heap_Clone(HEAP_PTR obj);
U32 Heap_SyncTryEnter(HEAP_PTR obj);
U32 Heap_SyncExit(HEAP_PTR obj);
HEAP_PTR Heap_SetWeakRefTarget(HEAP_PTR target, HEAP_PTR weakRef);
HEAP_PTR* Heap_GetWeakRefAddress(HEAP_PTR target);
void Heap_RemovedWeakRefTarget(HEAP_PTR target);
Differences with the .NET Framework
However, like the JIT/Interpreter, the GC has some fundamental differences when compared to the .NET Framework
Conservative Garbage Collection
Firstly, DotNetAnywhere implements what is known as a Conservative GC. In simple terms this means that it does not know (for sure) which areas of memory are actually references/pointers to objects and which are just random numbers (that happen to look like a memory address). In the Microsoft .NET Framework the JIT calculates this information and stores it in the GCInfo structure so the GC can make use of it, but DotNetAnywhere doesn’t do this.
Instead, during the Mark phase the GC gets all the available ‘roots’, but it will consider all memory addresses within an object as ‘potential’ references (hence it is ‘conservative’). It then has to look up each possible reference to see if it really points to an ‘object reference’. It does this by keeping track of all memory/heap references in a balanced binary search tree (ordered by memory address), which looks something like this:
However, this means that all object references have to be stored in the binary tree when they are allocated, which adds some overhead to allocation. In addition, extra memory is needed: 20 bytes per heap entry. We can see this by looking at the tHeapEntry data structure (all pointers are 4 bytes, U8 = 1 byte and padding is ignored); tHeapEntry *pLink[2] is the extra data that is needed just to enable the binary tree lookup.
struct tHeapEntry_ {
// Left/right links in the heap binary tree
tHeapEntry *pLink[2];
// The 'level' of this node. Leaf nodes have lowest level
U8 level;
// Used to mark that this node is still in use.
// If this is set to 0xff, then this heap entry is undeletable.
U8 marked;
// Set to 1 if the Finalizer needs to be run.
// Set to 2 if this has been added to the Finalizer queue
// Set to 0 when the Finalizer has been run (or there is no Finalizer in the first place)
// Only set on types that have a Finalizer
U8 needToFinalize;
// unused
U8 padding;
// The type in this heap entry
tMD_TypeDef *pTypeDef;
// Used for locking sync, and tracking WeakReference that point to this object
tSync *pSync;
// The user memory
U8 memory[0];
};
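The 20-byte figure can be checked by summing the fields above. Here's a minimal sketch, assuming the 32-bit layout just described (4-byte pointers, 1-byte U8); the constants and helper function are illustrative, not DotNetAnywhere code:

```c
#include <assert.h>

/* Sketch: per-object header overhead in DotNetAnywhere, assuming the
   32-bit layout described above (pointers = 4 bytes, U8 = 1 byte). */
enum { PTR = 4, U8_SIZE = 1 };

static unsigned heap_entry_overhead(void)
{
    unsigned pLink    = 2 * PTR;     /* tHeapEntry *pLink[2] - binary-tree links */
    unsigned flags    = 4 * U8_SIZE; /* level, marked, needToFinalize, padding   */
    unsigned pTypeDef = PTR;         /* the type in this heap entry              */
    unsigned pSync    = PTR;         /* locking sync / WeakReference tracking    */
    return pLink + flags + pTypeDef + pSync; /* 20 bytes before user memory      */
}
```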
But why does DotNetAnywhere work like this? Fortunately Chris Bacon, the author of DotNetAnywhere, explains:
Mind you, the whole heap code really needs a rewrite to reduce per-object memory overhead, and to remove the need for the binary tree of allocations. Not really thinking of a generational GC, that would probably add to much code. This was something I vaguely intended to do, but never got around to.
The current heap code was just the simplest thing to get GC working quickly. The very initial implementation did no GC at all. It was beautifully fast, but ran out of memory rather too quickly.
For more info on ‘Conservative’ and ‘Precise’ GCs see:
GC only does ‘Mark-Sweep’, it doesn’t Compact
Another area in which the GC behaviour differs is that it doesn’t do any compaction of memory after it’s cleaned up, as Steve Sanderson found out when working on Blazor:
.. During server-side execution we don’t actually need to pin anything, because there’s no interop outside .NET. During client-side execution, everything is (in effect) pinned regardless, because DNA’s GC only does mark-sweep - it doesn’t have any compaction phase.
In addition, when an object is allocated DotNetAnywhere just makes a call to malloc(); the code that does this is in the Heap_Alloc(..) function. So there is no concept of the ‘Generations’ or ‘Segments’ that you have in the .NET Framework GC, i.e. no ‘Gen 0’, ‘Gen 1’ or ‘Large Object Heap’.
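The conservative "is this word really an object reference?" lookup described earlier can be sketched as follows. DotNetAnywhere keeps a balanced binary search tree of heap entries ordered by address; a sorted array with a binary search (below) is an illustrative stand-in with the same ordering idea, not the real implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a conservative GC's pointer test: given a word found inside an
   object during the Mark phase, decide whether it is the address of a
   tracked allocation or just a number that looks like one. */
static int is_object_reference(uintptr_t candidate,
                               const uintptr_t *heap_entries, size_t n)
{
    size_t lo = 0, hi = n;           /* binary search over sorted addresses */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (heap_entries[mid] == candidate) return 1; /* a tracked object   */
        if (heap_entries[mid] < candidate)  lo = mid + 1;
        else                                hi = mid;
    }
    return 0; /* just a random number that happened to look like an address */
}
```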
Threading Model
Finally, let’s take a look at the threading model, which is fundamentally different from the one found in the .NET Framework.
Differences with the .NET Framework
Whilst DotNetAnywhere will happily create new threads and execute them for you, it’s only providing the illusion of true multi-threading. In reality it only runs on one thread, but context switches between the different threads that your program creates:
You can see this in action in the code below (from the Thread_Execute() function); note the call to JIT_Execute(..) with numInst set to 100:
for (;;) {
U32 minSleepTime = 0xffffffff;
I32 threadExitValue;
status = JIT_Execute(pThread, 100);
switch (status) {
....
}
}
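The loop above can be sketched as a round-robin scheduler running each managed thread for a quantum of instructions on a single OS thread. The `Thread` struct and `run_quantum()` below are illustrative stand-ins for DotNetAnywhere's thread state and `JIT_Execute(..)`, not the real API:

```c
#include <assert.h>

/* Sketch of DotNetAnywhere-style cooperative scheduling: one OS thread runs
   each managed thread for a quantum of ~100 IL instructions, then switches. */
#define QUANTUM 100

typedef struct {
    int remaining; /* IL instructions left before this thread exits */
    int finished;
} Thread;

/* Stand-in for JIT_Execute(pThread, 100): run up to QUANTUM instructions. */
static void run_quantum(Thread *t)
{
    int step = t->remaining < QUANTUM ? t->remaining : QUANTUM;
    t->remaining -= step;
    if (t->remaining == 0) t->finished = 1;
}

/* Round-robin over all threads until every one has exited;
   returns how many quanta were executed in total. */
static int scheduler(Thread *threads, int count)
{
    int quanta = 0;
    for (;;) {
        int alive = 0;
        for (int i = 0; i < count; i++) {
            if (threads[i].finished) continue;
            run_quantum(&threads[i]);
            quanta++;
            alive = 1;
        }
        if (!alive) return quanta;
    }
}
```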
An interesting side-effect is that the threading code in the DotNetAnywhere corlib implementation is really simple. For instance, the internal implementation of the Interlocked.CompareExchange() function looks like the following; note the lack of synchronisation that you would normally expect:
tAsyncCall* System_Threading_Interlocked_CompareExchange_Int32(
PTR pThis_, PTR pParams, PTR pReturnValue) {
U32 *pLoc = INTERNALCALL_PARAM(0, U32*);
U32 value = INTERNALCALL_PARAM(4, U32);
U32 comparand = INTERNALCALL_PARAM(8, U32);
*(U32*)pReturnValue = *pLoc;
if (*pLoc == comparand) {
*pLoc = value;
}
return NULL;
}
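The reason this is safe is that only one OS thread ever executes managed code, and a context switch can only happen between quanta, so the read-compare-write sequence is never interrupted. A sketch of the same compare-exchange semantics in plain C (illustrative, not the runtime's code):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of Interlocked.CompareExchange semantics. On DotNetAnywhere this
   needs no atomic instructions or locks, because managed "threads" only
   switch between IL quanta and can never preempt this function mid-way. */
static uint32_t compare_exchange(uint32_t *location, uint32_t value,
                                 uint32_t comparand)
{
    uint32_t original = *location; /* the original value is always returned */
    if (original == comparand)
        *location = value;         /* swap only when the comparand matches  */
    return original;
}
```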
Benchmarks
As a simple test, I ran some benchmarks from The Computer Language Benchmarks Game - binary-trees, using the simplest C# version
Note: DotNetAnywhere was designed to run on low-memory devices, so it was not meant to have the same performance as the full .NET Framework. Please bear that in mind when looking at the results!!
.NET Framework, 4.6.1 - 0.36 seconds
Invoked=TestApp.exe 15
stretch tree of depth 16 check: 131071
32768 trees of depth 4 check: 1015808
8192 trees of depth 6 check: 1040384
2048 trees of depth 8 check: 1046528
512 trees of depth 10 check: 1048064
128 trees of depth 12 check: 1048448
32 trees of depth 14 check: 1048544
long lived tree of depth 15 check: 65535
Exit code : 0
Elapsed time : 0.36
Kernel time : 0.06 (17.2%)
User time : 0.16 (43.1%)
page fault # : 6604
Working set : 25720 KB
Paged pool : 187 KB
Non-paged pool : 24 KB
Page file size : 31160 KB
DotNetAnywhere - 54.39 seconds
Invoked=dna TestApp.exe 15
stretch tree of depth 16 check: 131071
32768 trees of depth 4 check: 1015808
8192 trees of depth 6 check: 1040384
2048 trees of depth 8 check: 1046528
512 trees of depth 10 check: 1048064
128 trees of depth 12 check: 1048448
32 trees of depth 14 check: 1048544
long lived tree of depth 15 check: 65535
Total execution time = 54288.33 ms
Total GC time = 36857.03 ms
Exit code : 0
Elapsed time : 54.39
Kernel time : 0.02 (0.0%)
User time : 54.15 (99.6%)
page fault # : 5699
Working set : 15548 KB
Paged pool : 105 KB
Non-paged pool : 8 KB
Page file size : 13144 KB
So clearly DotNetAnywhere doesn’t perform as well in this benchmark (0.36 seconds v 54 seconds). However, if we look at other benchmarks from the same site, it performs a lot better. It seems that DotNetAnywhere has a significant overhead when allocating objects (a class), which is less obvious when using structs.
| | Benchmark 1 (using classes) | Benchmark 2 (using structs) |
|---|---|---|
| Elapsed Time (secs) | 3.1 | 2.0 |
| GC Collections | 96 | 67 |
| Total GC time (msecs) | 983.59 | 439.73 |
Finally, I really want to thank Chris Bacon, DotNetAnywhere is a great code base and gives a fantastic insight into what needs to happen for a .NET runtime to work.
Discuss this post on Hacker News and /r/programming
The post DotNetAnywhere: An Alternative .NET Runtime first appeared on my blog Performance is a Feature!
CodeProject
Thu, 19 Oct 2017, 12:00 am
Analysing C# code on GitHub with BigQuery
Just over a year ago Google made all the open source code on GitHub available for querying within BigQuery and as if that wasn’t enough you can run a terabyte of queries each month for free!
So in this post I am going to be looking at all the C# source code on GitHub and what we can find out from it. Handily, a smaller C#-only dataset called fh-bigquery:github_extracts.contents_net_cs has been made available (in BigQuery you are charged per byte read); it has:
- 5,885,933 unique ‘.cs’ files
- 792,166,632 lines of code (LOC)
- 37.17 GB of data
Which is a pretty comprehensive set of C# source code!
The rest of this post will attempt to answer the following questions:
- Tabs or Spaces?
- #regions: ‘should be banned’ or ‘okay in some cases’?
- ‘K&R’ or ‘Allman’, where do C# devs like to put their braces?
- Do C# developers like writing functional code?
Then moving onto some less controversial C# topics:
- Which using statements are most widely used?
- What NuGet packages are most often included in a .NET project?
- How many lines of code (LOC) are in a typical C# file?
- What is the most widely thrown Exception?
- ‘async/await all the things’ or not?
- Do C# developers like using the var keyword? (Updated)
Before we end up looking at repositories, not just individual C# files:
- What is the most popular repository with C# code in it?
- Just how many files should you have in a repository?
- What are the most popular C# class names?
- ‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?
If you want to try the queries for yourself (or find my mistakes), all of them are available in this gist. There’s a good chance that my regular expressions miss out some edge-cases, after all Regular Expressions: Now You Have Two Problems:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Tabs or Spaces?
In the entire data-set there are 5,885,933 files, but here we only include ones that have more than 10 lines starting with a tab or a space
| Tabs | Tabs % | Spaces | Spaces % | Total |
|---|---|---|---|---|
| 799,055 | 17.15% | 3,859,528 | 82.85% | 4,658,583 |
Clearly, C# developers (on GitHub) prefer Spaces over Tabs, let the endless debates continue!! (I think some of this can be explained by the fact that Visual Studio uses ‘spaces’ by default)
If you want to see how C# compares to other programming languages, take a look at 400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?.
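The classification behind the table above can be sketched as a simple scan of each file's lines, keeping only files with more than 10 indented lines (the threshold from the text); the return codes and function are illustrative, a stand-in for the BigQuery query rather than its actual logic:

```c
#include <assert.h>

/* Sketch: classify a file as tab- or space-indented by majority vote over
   lines whose first character is '\t' or ' ', requiring more than 10
   indented lines before making a call. */
enum { UNDECIDED = 0, TABS = 1, SPACES = 2 };

static int classify_indentation(const char **lines, int count)
{
    int tabs = 0, spaces = 0;
    for (int i = 0; i < count; i++) {
        if (lines[i][0] == '\t')     tabs++;
        else if (lines[i][0] == ' ') spaces++;
    }
    if (tabs + spaces <= 10) return UNDECIDED; /* too few indented lines */
    return tabs > spaces ? TABS : SPACES;
}
```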
#regions: ‘should be banned’ or ‘okay in some cases’?
It turns out that there are an impressive 712,498 C# files (out of 5.8 million) that contain at least one #region statement (query used); that’s just over 12%. (I’m hoping that a lot of those files have been auto-generated by a tool!)
‘K&R’ or ‘Allman’, where do C# devs like to put their braces?
C# developers overwhelmingly prefer putting an opening brace { on its own line (query used):
| separate line | same line | same line (initializer) | total (with brace) | total (all code) |
|---|---|---|---|---|
| 81,306,320 (67%) | 40,044,603 (33%) | 3,631,947 (2.99%) | 121,350,923 (15.32%) | 792,166,632 |
(‘same line initializers’ include code like new { Name = "", .. } and new [] { 1, 2, 3.. })
Do C# developers like writing functional code?
This is slightly unscientific, but I wanted to see how widely the Lambda Operator =>
is used in C# code (query). Yes, I know, if you want to write functional code on .NET you really should use F#, but C# has become more ‘functional’ over the years and I wanted to see how much code was taking advantage of that.
Here’s the raw percentiles:
| Percentile | % of lines using lambdas |
|---|---|
| 10 | 0.51 |
| 25 | 1.14 |
| 50 | 2.50 |
| 75 | 5.26 |
| 90 | 9.95 |
| 95 | 14.29 |
| 99 | 28.00 |
So we can say that:
- 50% of all the C# code on GitHub uses => on 2.50% (or less) of their lines.
- 10% of all C# files have lambdas on almost 1 in 10 of their lines (9.95%).
- 5% use => on 1 in 7 lines (14.29%).
- 1% of files have lambdas on over 1 in 4 lines (28%) of their lines of code, that’s pretty impressive!
Which using statements are most widely used?
Now on to something a bit more substantial: what are the most widely used using statements in C# code?
The top 10 looks like this (the full results are available):
| using statement | count |
|---|---|
| using System.Collections.Generic; | 1,780,646 |
| using System; | 1,477,019 |
| using System.Linq; | 1,319,830 |
| using System.Text; | 902,165 |
| using System.Threading.Tasks; | 628,195 |
| using System.Runtime.InteropServices; | 431,867 |
| using System.IO; | 407,848 |
| using System.Runtime.CompilerServices; | 338,686 |
| using System.Collections; | 289,867 |
| using System.Reflection; | 218,369 |
However, as was pointed out, the top 5 are included by default when you add a new file in Visual Studio and many people wouldn’t remove them. The same applies to ‘System.Runtime.InteropServices’ and ‘System.Runtime.CompilerServices’, which are included in ‘AssemblyInfo.cs’ by default.
So if we adjust the list to take account of this, the top 10 looks like so:
| using statement | count |
|---|---|
| using System.IO; | 407,848 |
| using System.Collections; | 289,867 |
| using System.Reflection; | 218,369 |
| using System.Diagnostics; | 201,341 |
| using System.Threading; | 179,168 |
| using System.ComponentModel; | 160,681 |
| using System.Web; | 160,323 |
| using System.Windows.Forms; | 137,003 |
| using System.Globalization; | 132,113 |
| using System.Drawing; | 127,033 |
Finally, an interesting list is the top 10 using statements that aren’t in System, Microsoft or Windows namespaces:
| using statement | count |
|---|---|
| using NUnit.Framework; | 119,463 |
| using UnityEngine; | 117,673 |
| using Xunit; | 99,099 |
| using Newtonsoft.Json; | 81,675 |
| using Newtonsoft.Json.Linq; | 29,416 |
| using Moq; | 23,546 |
| using UnityEngine.UI; | 20,355 |
| using UnityEditor; | 19,937 |
| using Amazon.Runtime; | 18,941 |
| using log4net; | 17,297 |
What NuGet packages are most often included in a .NET project?
It turns out that there is also a separate dataset containing all the ‘packages.config’ files on GitHub, it’s called contents_net_packages_config and has 104,808 entries. By querying this we can see that Json.Net is the clear winner!!
| package | count |
|---|---|
| Newtonsoft.Json | 45,055 |
| Microsoft.Web.Infrastructure | 16,022 |
| Microsoft.AspNet.Razor | 15,109 |
| Microsoft.AspNet.WebPages | 14,495 |
| Microsoft.AspNet.Mvc | 14,236 |
| EntityFramework | 14,191 |
| Microsoft.AspNet.WebApi.Client | 13,480 |
| Microsoft.AspNet.WebApi.Core | 12,210 |
| Microsoft.Net.Http | 11,625 |
| jQuery | 10,646 |
| Microsoft.Bcl.Build | 10,641 |
| Microsoft.Bcl | 10,349 |
| NUnit | 10,341 |
| Owin | 9,681 |
| Microsoft.Owin | 9,202 |
| Microsoft.AspNet.WebApi.WebHost | 9,007 |
| WebGrease | 8,743 |
| Microsoft.AspNet.Web.Optimization | 8,721 |
| Microsoft.AspNet.WebApi | 8,179 |
How many lines of code (LOC) are in a typical C# file?
Are C# developers prone to creating huge files that go on for 1000’s of lines? Well, some are, but fortunately it’s the minority of us!!
Note the Y-axis is ‘lines of code’ and is logarithmic, the raw data is available.
Oh dear, Uncle Bob isn’t going to be happy; whilst 96% of the files have 509 LOC or less, the other 4% don’t!! From Clean Code:
And in case you’re wondering, here’s the Top 10 longest C# files!!
| File | Lines |
|---|---|
| MarMot/Input/test.marmot.cs | 92663 |
| src/CodenameGenerator/WordRepos/LastNamesRepository.cs | 88810 |
| cs_inputtest/cs_02_7000.cs | 63004 |
| cs_inputtest/cs_02_6000.cs | 54004 |
| src/ML NET20/Utility/UserName.cs | 52014 |
| MWBS/Dictionary/DefaultWordDictionary.cs | 48912 |
| Sources/Accord.Math/Matrix/Matrix.Comparisons1.Generated.cs | 48407 |
| UrduProofReader/UrduLibs/Utils.cs | 48255 |
| cs_inputtest/cs_02_5000.cs | 45004 |
| css/style.cs | 44366 |
What is the most widely thrown Exception?
There are a few interesting results in this query; for instance, who knew that so many ApplicationExceptions were thrown, and NotSupportedException being so high up the list is a bit worrying!!
| Exception | count |
|---|---|
| throw new ArgumentNullException | 699,526 |
| throw new ArgumentException | 361,616 |
| throw new NotImplementedException | 340,361 |
| throw new InvalidOperationException | 260,792 |
| throw new ArgumentOutOfRangeException | 160,640 |
| throw new NotSupportedException | 110,019 |
| throw new HttpResponseException | 74,498 |
| throw new ValidationException | 35,615 |
| throw new ObjectDisposedException | 31,129 |
| throw new ApplicationException | 30,849 |
| throw new UnauthorizedException | 21,133 |
| throw new FormatException | 19,510 |
| throw new SerializationException | 17,884 |
| throw new IOException | 15,779 |
| throw new IndexOutOfRangeException | 14,778 |
| throw new NullReferenceException | 12,372 |
| throw new InvalidDataException | 12,260 |
| throw new ApiException | 11,660 |
| throw new InvalidCastException | 10,510 |
‘async/await all the things’ or not?
The addition of the async and await keywords to the C# language makes writing asynchronous code much easier:
public async Task<int> GetDotNetCountAsync()
{
    // Suspends GetDotNetCountAsync() to allow the caller (the web server)
    // to accept another request, rather than blocking on this one.
    var html = await _httpClient.GetStringAsync("http://dotnetfoundation.org");
    return Regex.Matches(html, ".NET").Count;
}
But how much is it used? Using the query below:
SELECT Count(*) count
FROM
[fh-bigquery:github_extracts.contents_net_cs]
WHERE
REGEXP_MATCH(content, r'\sasync\s|\sawait\s')
I found that there are 218,643 files (out of 5,885,933) that have at least one usage of async or await in them.
Do C# developers like using the var keyword?
Originally I found only 130,590 files with at least one usage of the var keyword, i.e. less than they use async and await.
Update: thanks to jairbubbles for pointing out that my var regex was wrong and supplying a fixed version! In fact C# developers use var more than they use async and await: there are 1,457,154 files that have at least one usage of the var keyword.
Just how many files should you have in a repository?
90% of the repositories (that have any C# files) have 95 files or less. 95% have 170 files or less and 99% have 535 files or less.
(again the Y-axis (# files) is logarithmic)
The top 10 largest repositories, by number of C# files are shown below:
| Repository | # Files |
|---|---|
| https://github.com/xen2/mcs | 23389 |
| https://github.com/mater06/LEGOChimaOnlineReloaded | 14241 |
| https://github.com/Microsoft/referencesource | 13051 |
| https://github.com/dotnet/corefx | 10652 |
| https://github.com/apo-j/Projects_Working | 10185 |
| https://github.com/Microsoft/CodeContracts | 9338 |
| https://github.com/drazenzadravec/nequeo | 8060 |
| https://github.com/ClearCanvas/ClearCanvas | 7946 |
| https://github.com/mwilliamson-firefly/aws-sdk-net | 7860 |
| https://github.com/151706061/MacroMedicalSystem | 7765 |
What is the most popular repository with C# code in it?
This time we are going to look at the most popular repositories (based on GitHub ‘stars’) that contain at least 50 C# files (query used):
| repo | stars | files |
|---|---|---|
| https://github.com/grpc/grpc | 11075 | 237 |
| https://github.com/dotnet/coreclr | 8576 | 6503 |
| https://github.com/dotnet/roslyn | 8422 | 6351 |
| https://github.com/facebook/yoga | 8046 | 73 |
| https://github.com/bazelbuild/bazel | 7123 | 132 |
| https://github.com/dotnet/corefx | 7115 | 10652 |
| https://github.com/SeleniumHQ/selenium | 7024 | 512 |
| https://github.com/Microsoft/WinObjC | 6184 | 81 |
| https://github.com/qianlifeng/Wox | 5674 | 207 |
| https://github.com/Wox-launcher/Wox | 5674 | 142 |
| https://github.com/ShareX/ShareX | 5336 | 766 |
| https://github.com/Microsoft/Windows-universal-samples | 5130 | 1501 |
| https://github.com/NancyFx/Nancy | 3701 | 957 |
| https://github.com/chocolatey/choco | 3432 | 248 |
| https://github.com/JamesNK/Newtonsoft.Json | 3340 | 650 |
Interesting that the top spot is a Google Repository! (the C# files in it are sample code for using the GRPC library from .NET)
What are the most popular C# class names?
Assuming that I got the regex correct, the most popular C# class names are the following:
| Class name | Count |
|---|---|
| class C | 182480 |
| class Program | 163462 |
| class Test | 50593 |
| class Settings | 40841 |
| class Resources | 39345 |
| class A | 34687 |
| class App | 28462 |
| class B | 24246 |
| class Startup | 18238 |
| class Foo | 15198 |
Yay for Foo, just sneaking into the Top 10!!
‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?
Finally, let’s look at the different file names used; as with the using statements, they are dominated by the default ones used in the Visual Studio templates:
| File | Count |
|---|---|
| AssemblyInfo.cs | 386822 |
| Program.cs | 105280 |
| Resources.Designer.cs | 40881 |
| Settings.Designer.cs | 35392 |
| App.xaml.cs | 21928 |
| Global.asax.cs | 16133 |
| Startup.cs | 14564 |
| HomeController.cs | 13574 |
| RouteConfig.cs | 11278 |
| MainWindow.xaml.cs | 11169 |
Discuss this post on Hacker News and /r/csharp
As always, if you’ve read this far your present is yet more blog posts to read, enjoy!!
How BigQuery Works (only put in at the end of the blog post)
BigQuery analysis of other Programming Languages
The post Analysing C# code on GitHub with BigQuery first appeared on my blog Performance is a Feature!
CodeProject
Thu, 12 Oct 2017, 12:00 am
A look at the internals of 'boxing' in the CLR
It’s a fundamental part of .NET and can often happen without you knowing, but how does it actually work? What is the .NET Runtime doing to make boxing possible?
Note: this post won’t be discussing how to detect boxing, how it can affect performance or how to remove it (speak to Ben Adams about that!). It will only be talking about how it works.
As an aside, if you like reading about CLR internals you may find these other posts interesting:
Boxing in the CLR Specification
Firstly it’s worth pointing out that boxing is mandated by the CLR specification ‘ECMA-335’, so the runtime has to provide it:
This means that there are a few key things that the CLR needs to take care of, which we will explore in the rest of this post.
Creating a ‘boxed’ Type
The first thing that the runtime needs to do is create the corresponding reference type (‘boxed type’) for any struct that it loads. You can see this in action right at the beginning of the ‘Method Table’ creation, where it first checks if it’s dealing with a ‘Value Type’, then behaves accordingly. So the ‘boxed type’ for any struct is created up front, when your .dll is imported, and is then ready to be used by any ‘boxing’ that happens during program execution.
The comment in the linked code is pretty interesting, as it reveals some of the low-level details the runtime has to deal with:
// Check to see if the class is a valuetype; but we don't want to mark System.Enum
// as a ValueType. To accomplish this, the check takes advantage of the fact
// that System.ValueType and System.Enum are loaded one immediately after the
// other in that order, and so if the parent MethodTable is System.ValueType and
// the System.Enum MethodTable is unset, then we must be building System.Enum and
// so we don't mark it as a ValueType.
CPU-specific code-generation
But to see what happens during program execution, let’s start with a simple C# program. The code below creates a custom struct or Value Type, which is then ‘boxed’ and ‘unboxed’:
public struct MyStruct
{
public int Value;
}
var myStruct = new MyStruct();
// boxing
var boxed = (object)myStruct;
// unboxing
var unboxed = (MyStruct)boxed;
This gets turned into the following IL code, in which you can see the box and unbox.any IL instructions:
L_0000: ldloca.s myStruct
L_0002: initobj TestNamespace.MyStruct
L_0008: ldloc.0
L_0009: box TestNamespace.MyStruct
L_000e: stloc.1
L_000f: ldloc.1
L_0010: unbox.any TestNamespace.MyStruct
Runtime and JIT code
So what does the JIT do with these IL op-codes? Well, in the normal case, it wires up and then inlines the optimised, hand-written assembly-code versions of the ‘JIT Helper Methods’ provided by the runtime. The links below take you to the relevant lines of code in the CoreCLR source:
Interestingly enough, the only other ‘JIT Helper Methods’ that get this special treatment are object, string and array allocations, which goes to show just how performance-sensitive boxing is.
In comparison, there is only one helper method for ‘unboxing’, called JIT_Unbox(..), which falls back to JIT_Unbox_Helper(..) in the uncommon case and is wired up here (CORINFO_HELP_UNBOX to JIT_Unbox). The JIT will also inline the helper call in the common case, to save the cost of a method call, see Compiler::impImportBlockCode(..).
Note that the ‘unbox helper’ only fetches a reference/pointer to the ‘boxed’ data, it has to then be put onto the stack. As we saw above, when the C# compiler does unboxing it uses the ‘Unbox_Any’ op-code not just the ‘Unbox’ one, see Unboxing does not create a copy of the value for more information.
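Put together, box/unbox amounts to something like the following sketch: boxing copies the struct's bytes into a fresh heap object after a type header, while the unbox helper only type-checks and yields a pointer to those bytes (the copy back to the stack is a separate step). `BoxedObj` and both functions are illustrative, not the actual CoreCLR helpers:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of boxing: a heap object holding a type header plus a copy of the
   struct's data (an int-sized struct here, for simplicity). */
typedef struct {
    const void *method_table; /* identifies the boxed type */
    int payload;              /* the struct's copied bytes */
} BoxedObj;

static BoxedObj *jit_box(const void *mt, int value)
{
    BoxedObj *o = malloc(sizeof *o); /* the heap allocation is the cost */
    o->method_table = mt;
    o->payload = value;              /* copy the struct into the box    */
    return o;
}

/* Sketch of the unbox helper: type-check, then return a pointer INTO the
   object - no copy is made here, mirroring what the text describes. */
static int *jit_unbox(const void *expected_mt, BoxedObj *o)
{
    return o->method_table == expected_mt ? &o->payload : NULL;
}
```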
Unboxing Stub Creation
As well as ‘boxing’ and ‘unboxing’ a struct, the runtime also needs to help out during the time that a type remains ‘boxed’. To see why, let’s extend MyStruct and override the ToString() method, so that it displays the current Value:
public struct MyStruct
{
public int Value;
public override string ToString()
{
return "Value = " + Value.ToString();
}
}
Now, if we look at the ‘Method Table’ the runtime creates for the boxed version of MyStruct (remember, value types have no ‘Method Table’), we can see something strange going on. Note that there are 2 entries for MyStruct::ToString, one of which I’ve labelled as an ‘Unboxing Stub’:
Method table summary for 'MyStruct':
Number of static fields: 0
Number of instance fields: 1
Number of static obj ref fields: 0
Number of static boxed fields: 0
Number of declared fields: 1
Number of declared methods: 1
Number of declared non-abstract methods: 1
Vtable (with interface dupes) for 'MyStruct':
Total duplicate slots = 0
SD: MT::MethodIterator created for MyStruct (TestNamespace.MyStruct).
slot 0: MyStruct::ToString 0x000007FE41170C10 (slot = 0) (Unboxing Stub)
slot 1: System.ValueType::Equals 0x000007FEC1194078 (slot = 1)
slot 2: System.ValueType::GetHashCode 0x000007FEC1194080 (slot = 2)
slot 3: System.Object::Finalize 0x000007FEC14A30E0 (slot = 3)
slot 5: MyStruct::ToString 0x000007FE41170C18 (slot = 4)
Wed, 2 Aug 2017, 12:00 am
Memory Usage Inside the CLR
Have you ever wondered where and why the .NET Runtime (CLR) allocates memory? I don’t mean the ‘managed’ memory that your code allocates, e.g. via new MyClass(..), and the Garbage Collector (GC) then cleans up. I mean the memory that the CLR itself allocates: all the internal data structures that it needs to make it possible for your code to run.
Note: just to clarify, this post will not be telling you how to analyse the memory usage of your own code; for that I recommend using one of the excellent .NET profilers available, such as dotMemory by JetBrains or the ANTS Memory Profiler from Redgate (I’ve personally used both and they’re great).
The high-level view
Fortunately there’s a fantastic tool that makes it very easy for us to get an overview of memory usage within the CLR itself. It’s called VMMap and it’s part of the excellent Sysinternals Suite.
For this post I will just be using a simple HelloWorld program, so that we can observe what the CLR does in the simplest possible scenario; obviously things may look a bit different in a more complex app.
Firstly, let’s look at the data over time, in 1-second intervals. The HelloWorld program just prints to the Console and then waits until a key is pressed, so once the memory usage has reached its peak it remains there till the program exits. (Click for a larger version)
However, to get a more detailed view, we will now look at the snapshot from 2 seconds into the timeline, when the memory usage has stabilised.
Note: If you want to find out more about memory usage in general, but also specifically how measure it in .NET applications, I recommend reading this excellent series of posts by Sasha Goldshtein
Also, if like me you always get the different types of memory mixed-up, please read this Stackoverflow answer first What is private bytes, virtual bytes, working set?
‘Image’ Memory
Now we’ve seen the high-level view, let’s take a closer look at the individual chunks, the largest of which is labelled Image, which according to the VMMap help page (see here for info on all memory types):
… represents an executable file such as a .exe or .dll and has been loaded into a process by the image loader. It does not include images mapped as data files, which would be included in the Mapped File memory type. Image mappings can include shareable memory like code. When data regions, like initialized data, is modified, additional private memory is created in the process.
At this point, it’s worth pointing out a few things:
- This memory takes up a large amount of the total process memory because I’m using a simple HelloWorld program; in other types of programs it wouldn’t dominate the memory usage as much.
- I was using a DEBUG version of the CoreCLR, so the CLR-specific files System.Private.CoreLib.dll, coreclr.dll, clrjit.dll and CoreRun.exe may well be larger than if they were compiled in RELEASE mode.
- Some of this memory is potentially ‘shared’ with other processes; compare the numbers in the ‘Total WS’, ‘Private WS’, ‘Shareable WS’ and ‘Shared WS’ columns to see this in action.
‘Managed Heaps’ created by the Garbage Collector
The next largest usage of memory is the GC itself; it pre-allocates several heaps that it can then give out whenever your program allocates an object, for example via code such as new MyClass() or new byte[].
The main thing to note about the image above is that you can clearly see the different heaps; there is 256 MB allocated for the Generations (Gen 0, 1, 2) and 128 MB for the ‘Large Object Heap’. In addition, note the difference between the amounts in the Size and Committed columns. Only the Committed memory is actually being used; the total Size is what the GC pre-allocates or reserves up front from the address space.
If you’re interested, the rules for heap, or more specifically segment, sizes are helpfully explained in the Microsoft Docs; simply put, they vary depending on the GC mode (Workstation v Server), whether the process is 32/64-bit, and the number of CPUs.
Internal CLR ‘Heap’ memory
However, the part that I’m going to look at for the rest of this post is the memory that is allocated by the CLR itself, that is, the unmanaged memory that it uses for all its internal data structures.
But if we just look at the VMMap UI view, it doesn’t really tell us that much!
However, using the excellent PerfView tool we can capture the full call-stack of any memory allocations, that is any calls to VirtualAlloc() or RtlAllocateHeap() (obviously these functions only apply when running the CoreCLR on Windows). If we do this, PerfView gives us the following data (yes, it’s not pretty, but it’s very powerful!!)
So let’s explore this data in more detail.
Notable memory allocations
There are a few places where the CLR allocates significant chunks of memory up-front and then uses them through its lifetime, they are listed below:
- GC related allocations (see gc.cpp):
  - Mark List - 1,052,672 Bytes (1,028 K) in WKS::make_mark_list(..), used during the ‘mark’ phase of the GC, see Back To Basics: Mark and Sweep Garbage Collection
  - Card Table - 397,312 Bytes (388 K) in WKS::gc_heap::make_card_table(..), see Marking the ‘Card Table’
  - Overall Heap Creation/Allocation - 204,800 Bytes (200 K) in WKS::gc_heap::make_gc_heap(..)
  - S.O.H Segment creation - 65,536 Bytes (64 K) in WKS::gc_heap::allocate(..), triggered by the first object allocation
  - L.O.H Segment creation - 65,536 Bytes (64 K) in WKS::gc_heap::allocate_large_object(..), triggered by the first ‘large’ object allocation
- Handle Table - 20,480 Bytes (20 K) in HndCreateHandleTable(..)
- Stress Log - 4,194,304 Bytes (4,096 K) in StressLog::Initialize(..). Only if the ‘stress log’ is activated, see this comment for more info
- ‘Watson’ error reporting - 65,536 Bytes (64 K) in EEStartupHelper routine
- Virtual Call Stub Manager - 36,864 Bytes (36 K) in VirtualCallStubManager::InitStatic(), which in turn creates the DispatchCache. See ‘Virtual Stub Dispatch’ in the BOTR for more info
- Debugger Heap and Control-Block - 28,672 Bytes (28K) (only if debugging support is needed) in DebuggerHeap::Init(..) and DebuggerRCThread::Init(..), both called via InitializeDebugger(..)
Execution Engine Heaps
However, another technique that it uses is to allocate ‘heaps’, often 64K at a time, and then perform individual allocations within the heaps as needed. These heaps are split up by individual use-cases, the most common being for ‘frequently accessed’ data and its counterpart, data that is ‘rarely accessed’; see the explanation in this comment in loaderallocator.hpp for more. This is done to ensure that the CLR retains control over any memory allocations and can therefore prevent ‘fragmentation’.
These heaps are together known as ‘Loader Heaps’ as explained in Drill Into .NET Framework Internals to See How the CLR Creates Runtime Objects (wayback machine version):
LoaderHeaps
LoaderHeaps are meant for loading various runtime CLR artifacts and optimization artifacts that live for the lifetime of the domain. These heaps grow by predictable chunks to minimize fragmentation. LoaderHeaps are different from the garbage collector (GC) Heap (or multiple heaps in case of a symmetric multiprocessor or SMP) in that the GC Heap hosts object instances while LoaderHeaps hold together the type system. Frequently accessed artifacts like MethodTables, MethodDescs, FieldDescs, and Interface Maps get allocated on a HighFrequencyHeap, while less frequently accessed data structures, such as EEClass and ClassLoader and its lookup tables, get allocated on a LowFrequencyHeap. The StubHeap hosts stubs that facilitate code access security (CAS), COM wrapper calls, and P/Invoke.
One of the main places you see this high/low-frequency of access is in the heart of the Type system, where different data items are either classified as ‘hot’ (high-frequency) or ‘cold’ (low-frequency), from the ‘Key Data Structures’ section of the BOTR page on ‘Type Loader Design’:
EEClass
MethodTable data are split into “hot” and “cold” structures to improve working set and cache utilization. MethodTable itself is meant to only store “hot” data that are needed in program steady state. EEClass stores “cold” data that are typically only needed by type loading, JITing or reflection. Each MethodTable points to one EEClass.
Further to this, listed below are some specific examples of when each heap type is used:
All the general ‘Loader Heaps’ listed above are allocated in the LoaderAllocator::Init(..) function (link to actual code); the executable and stub heaps have the ‘executable’ flag set, all the rest don’t. The size of these heaps is configured in this code; they ‘reserve’ different amounts up front, but they all have a ‘commit’ size that is equivalent to one OS ‘page’.
In addition to the ‘general’ heaps, there are some others that are specifically used by the Virtual Stub Dispatch mechanism. They are known as the indcell_heap, cache_entry_heap, lookup_heap, dispatch_heap and resolve_heap; they’re allocated in this code, using the specified commit/reserve sizes.
Finally, if you’re interested in the mechanics of how the heaps actually work take a look at LoaderHeap.cpp.
JIT Memory Usage
Last, but by no means least, there is one other component in the CLR that extensively allocates memory and that is the JIT. It does so in 2 main scenarios:
- ‘Transient’ or temporary memory needed when it’s doing the job of converting IL code into machine code
- ‘Permanent’ memory used when it needs to emit the ‘machine code’ for a method
‘Transient’ Memory
This is needed by the JIT when it is doing the job of converting IL code into machine code for the current CPU architecture. This memory is only needed whilst the JIT is running and can be re-used/discarded later, it is used to hold the internal JIT data structures (e.g. Compiler
, BasicBlock
, GenTreeStmt
, etc).
For example, take a look at the following code from Compiler::fgValueNumber():
...
// Allocate the value number store.
assert(fgVNPassesCompleted > 0 || vnStore == nullptr);
if (fgVNPassesCompleted == 0)
{
CompAllocator* allocator = new (this, CMK_ValueNumber) CompAllocator(this, CMK_ValueNumber);
vnStore = new (this, CMK_ValueNumber) ValueNumStore(this, allocator);
}
...
The line vnStore = new (this, CMK_ValueNumber) ... ends up calling the specialised new operator defined in compiler.hpp (code shown below), which, as per the comment, uses a custom ‘Arena Allocator’ that is implemented in /src/jit/alloc.cpp
/*****************************************************************************
* operator new
*
* Note that compGetMem is an arena allocator that returns memory that is
* not zero-initialized and can contain data from a prior allocation lifetime.
* it also requires that 'sz' be aligned to a multiple of sizeof(int)
*/
inline void* __cdecl operator new(size_t sz, Compiler* context, CompMemKind cmk)
{
sz = AlignUp(sz, sizeof(int));
assert(sz != 0 && (sz & (sizeof(int) - 1)) == 0);
return context->compGetMem(sz, cmk);
}
This technique (of overriding the new
operator) is used in lots of places throughout the CLR, for instance there is a generic one implemented in the CLR Host.
‘Permanent’ Memory
The last type of memory that the JIT uses is ‘permanent’ memory to store the JITted machine code; this is done via calls to Compiler::compGetMem(..), starting from Compiler::compCompile(..) via the call-stack shown below. Note that as before this uses a custom ‘Arena Allocator’ that is implemented in /src/jit/alloc.cpp
+ clrjit!ClrAllocInProcessHeap
+ clrjit!ArenaAllocator::allocateHostMemory
+ clrjit!ArenaAllocator::allocateNewPage
+ clrjit!ArenaAllocator::allocateMemory
+ clrjit!Compiler::compGetMem
+ clrjit!emitter::emitGetMem
+ clrjit!emitter::emitAllocInstr
+ clrjit!emitter::emitNewInstrTiny
+ clrjit!emitter::emitIns_R_R
+ clrjit!emitter::emitInsBinary
+ clrjit!CodeGen::genCodeForStoreLclVar
+ clrjit!CodeGen::genCodeForTreeNode
+ clrjit!CodeGen::genCodeForBBlist
+ clrjit!CodeGen::genGenerateCode
+ clrjit!Compiler::compCompile
Real-world example
Finally, to prove that this investigation matches with more real-world scenarios, we can see similar memory usage breakdowns in this GitHub issue: [Question] Reduce memory consumption of CoreCLR
Yes, we have profiled several Xamarin GUI applications on Tizen Mobile.
Typical profile of CoreCLR’s memory on the GUI applications is the following:
- Mapped assembly images - 4.2 megabytes (50%)
- JIT-compiler’s memory - 1.7 megabytes (20%)
- Execution engine - about 1 megabyte (11%)
- Code heap - about 1 megabyte (11%)
- Type information - about 0.5 megabyte (6%)
- Objects heap - about 0.2 megabyte (2%)
Discuss this post on HackerNews
Further Reading
See the links below for additional information on ‘Loader Heaps’
The post Memory Usage Inside the CLR first appeared on my blog Performance is a Feature!
CodeProject
Mon, 10 Jul 2017, 12:00 am
How the .NET Runtime loads a Type
It is something we take for granted every time we run a .NET program, but it turns out that loading a Type or class
is a fairly complex process.
So how does the .NET Runtime (CLR) actually load a Type?
If you want the tl;dr it’s done carefully, cautiously and step-by-step
Ensuring Type Safety
One of the key requirements of a ‘Managed Runtime’ is providing Type Safety, but what does it actually mean? From the MSDN page on Type Safety and Security
Type-safe code accesses only the memory locations it is authorized to access. (For this discussion, type safety specifically refers to memory type safety and should not be confused with type safety in a broader respect.) For example, type-safe code cannot read values from another object’s private fields. It accesses types only in well-defined, allowable ways.
So in effect, the CLR has to ensure your Types/Classes are well-behaved and following the rules.
Compiler prevents you from creating an ‘abstract’ class
But let’s look at a more concrete example, using the C# code below
public abstract class AbstractClass
{
public AbstractClass() { }
}
public class NormalClass : AbstractClass
{
public NormalClass() { }
}
public static void Main(string[] args)
{
var test = new AbstractClass();
}
The compiler quite rightly refuses to compile this and gives the following error, because abstract
classes can’t be created, you can only inherit from them.
error CS0144: Cannot create an instance of the abstract class or interface
'ConsoleApplication.AbstractClass'
So that’s all well and good, but the CLR can’t rely on all code being created via a well-behaved compiler, or in fact via a compiler at all. So it has to check for and prevent any attempt to create an abstract
class.
Writing IL code by hand
One way to circumvent the compiler is to write IL code by hand using the IL Assembler tool (ILAsm) which will do almost no checks on the validity of the IL you give it.
For instance the IL below is the equivalent of writing var test = new AbstractClass();
(if the C# compiler would let us):
.method public hidebysig static void Main(string[] args) cil managed
{
.entrypoint
.maxstack 1
.locals init (
[0] class ConsoleApplication.NormalClass class2)
// System.InvalidOperationException: Instances of abstract classes cannot be created.
newobj instance void ConsoleApplication.AbstractClass::.ctor()
stloc.0
ldloc.0
callvirt instance class [mscorlib]System.Type [mscorlib]System.Object::GetType()
callvirt instance string [mscorlib]System.Reflection.MemberInfo::get_Name()
call void [mscorlib]Internal.Console::WriteLine(string)
ret
}
Fortunately the CLR has got this covered and will throw an InvalidOperationException
when you execute the code. This is due to this check which is hit when the JIT compiles the newobj
IL instruction.
Creating Types at run-time
One other way that you can attempt to create an abstract
class is at run-time, using reflection (thanks to this blog post for giving me some tips on other ways of creating Types).
This is shown in the code below:
var abstractType = Type.GetType("ConsoleApplication.AbstractClass");
Console.WriteLine(abstractType.FullName);
// System.MissingMethodException: Cannot create an abstract class.
var abstractInstance = Activator.CreateInstance(abstractType);
The compiler is completely happy with this, it doesn’t do anything to prevent or warn you and nor should it. However when you run the code, it will throw an exception, strangely enough a MissingMethodException
this time, but it does the job!
The call stack is below:
One final way (unless I’ve missed some out?) is to use GetUninitializedObject(..)
in the FormatterServices class like so:
public static object CreateInstance(Type type)
{
var constructor = type.GetConstructor(new Type[0]);
if (constructor == null && !type.IsValueType)
{
throw new NotSupportedException(
"Type '" + type.FullName + "' doesn't have a parameterless constructor");
}
var emptyInstance = FormatterServices.GetUninitializedObject(type);
if (constructor == null)
return null;
return constructor.Invoke(emptyInstance, new object[0]) ?? emptyInstance;
}
var abstractType = Type.GetType("ConsoleApplication.AbstractClass");
Console.WriteLine(abstractType.FullName);
// System.MemberAccessException: Cannot create an abstract class.
var abstractInstance = CreateInstance(abstractType);
Again the run-time stops you from doing this, however this time it decides to throw a MemberAccessException?
This happens via the following call stack:
Further Type-Safety Checks
These checks are just one example of what the runtime has to validate when creating types; there are many more things it has to deal with. For instance you can’t:
Loading Types ‘step-by-step’
So we’ve seen that the CLR has to do multiple checks when it’s loading types, but why does it have to load them ‘step-by-step’?
Well in a nutshell, it’s because of circular references and recursion, particularly when dealing with generic types. If we take the code below from section ‘2.1 Load Levels’ in Type Loader Design (BotR):
class A<T> : C<B<T>>
{ }
class B<T> : C<A<T>>
{ }
class C<T>
{ }
These are valid types and class A
depends on class B
and vice versa. So we can’t load A
until we know that B
is valid, but we can’t load B
, until we’re sure that A
is valid, a classic deadlock!!
How does the run-time get round this, well from the same BotR page:
The loader initially creates the structure(s) representing the type and initializes them with data that can be obtained without loading other types. When this “no-dependencies” work is done, the structure(s) can be referred from other places, usually by sticking pointers to them into another structures. After that the loader progresses in incremental steps and fills the structure(s) with more and more information until it finally arrives at a fully loaded type. In the above example, the base types of A and B will be approximated by something that does not include the other type, and substituted by the real thing later.
(there is also some more info here)
So it loads types in stages, step-by-step, ensuring each dependant type has reached the same stage before continuing. These ‘Class Load’ stages are shown in the image below and explained in detail in this very helpful source-code comment (Yay for Open-Sourcing the CoreCLR!!)
The different levels are handled in the ClassLoader::DoIncrementalLoad(..) method, which contains the switch
statement that deals with them all in turn.
However this is part of a bigger process, which controls loading an entire file, also known as a Module or Assembly in .NET terminology. The entire process for that is handled by another dispatch loop (switch statement), that works with the FileLoadLevel enum (definition). So in reality the whole process for loading an Assembly looks like this (the loading of one or more Types happens as sub-steps once the Module has reached the FILE_LOADED stage)
- FILE_LOAD_CREATE - DomainFile ctor()
- FILE_LOAD_BEGIN - Begin()
- FILE_LOAD_FIND_NATIVE_IMAGE - FindNativeImage()
- FILE_LOAD_VERIFY_NATIVE_IMAGE_DEPENDENCIES - VerifyNativeImageDependencies()
- FILE_LOAD_ALLOCATE - Allocate()
- FILE_LOAD_ADD_DEPENDENCIES - AddDependencies()
- FILE_LOAD_PRE_LOADLIBRARY - PreLoadLibrary()
- FILE_LOAD_LOADLIBRARY - LoadLibrary()
- FILE_LOAD_POST_LOADLIBRARY - PostLoadLibrary()
- FILE_LOAD_EAGER_FIXUPS - EagerFixups()
- FILE_LOAD_VTABLE_FIXUPS - VtableFixups()
- FILE_LOAD_DELIVER_EVENTS - DeliverSyncEvents()
- FILE_LOADED - FinishLoad()
- CLASS_LOAD_BEGIN
- CLASS_LOAD_UNRESTOREDTYPEKEY
- CLASS_LOAD_UNRESTORED
- CLASS_LOAD_APPROXPARENTS
- CLASS_LOAD_EXACTPARENTS
- CLASS_DEPENDENCIES_LOADED
- CLASS_LOADED
- FILE_LOAD_VERIFY_EXECUTION - VerifyExecution()
- FILE_ACTIVE - Activate()
We can see this in action if we build a Debug version of the CoreCLR and enable the relevant configuration knobs. For a simple ‘Hello World’ program we get the log output shown below, where LOADER:
messages correspond to FILE_LOAD_XXX
stages and PHASEDLOAD:
messages indicate which CLASS_LOAD_XXX
step we are on.
You can also see some of the other events that happen at the same time, these include creation of static
variables (STATICS:
), thread-statics (THREAD STATICS:
) and PreStubWorker
which indicates methods being prepared for the JITter.
-------------------------------------------------------------------------------------------------------
This is NOT the full output, it's only the parts that reference 'Program.exe' and its modules/classes
-------------------------------------------------------------------------------------------------------
PEImage: Opened HMODULE C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe
StoreFile: Add cached entry (000007FE65174540) with PEFile 000000000040D6E0
Assembly C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe: bits=0x2
LOADER: 439e30:***Program* >>>Load initiated, LOADED/LOADED
LOADER: 0000000000439E30:***Program* loading at level BEGIN
LOADER: 0000000000439E30:***Program* loading at level FIND_NATIVE_IMAGE
LOADER: 0000000000439E30:***Program* loading at level VERIFY_NATIVE_IMAGE_DEPENDENCIES
LOADER: 0000000000439E30:***Program* loading at level ALLOCATE
STATICS: Allocating statics for module Program
Loaded pModule: "C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe".
Module Program: bits=0x2
STATICS: Allocating 72 bytes for precomputed statics in module C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe in LoaderAllocator 000000000043AA18
StoreFile (StoreAssembly): Add cached entry (000007FE65174F28) with PEFile 000000000040D6E0
Completed Load Level ALLOCATE for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 0000000000439E30:***Program* loading at level ADD_DEPENDENCIES
Completed Load Level ADD_DEPENDENCIES for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 0000000000439E30:***Program* loading at level PRE_LOADLIBRARY
LOADER: 0000000000439E30:***Program* loading at level LOADLIBRARY
LOADER: 0000000000439E30:***Program* loading at level POST_LOADLIBRARY
LOADER: 0000000000439E30:***Program* loading at level EAGER_FIXUPS
LOADER: 0000000000439E30:***Program* loading at level VTABLE FIXUPS
LOADER: 0000000000439E30:***Program* loading at level DELIVER_EVENTS
DRCT::IsReady - wait(0x100)=258, GetLastError() = 42424
DRCT::IsReady - wait(0x100)=258, GetLastError() = 42424
D::LA: Load Assembly Asy:0x000000000040D8C0 AD:0x0000000000439E30 which:C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe
Completed Load Level DELIVER_EVENTS for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 0000000000439E30:***Program* loading at level LOADED
Completed Load Level LOADED for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 439e30:***Program* Load initiated, ACTIVE/ACTIVE
LOADER: 0000000000439E30:***Program* loading at level VERIFY_EXECUTION
LOADER: 0000000000439E30:***Program* loading at level ACTIVE
Completed Load Level ACTIVE for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 439e30:***Program*
Thu, 15 Jun 2017, 12:00 am
Lowering in the C# Compiler (and what happens when you misuse it)
Turns out that what I’d always thought of as “Compiler magic” or “Syntactic sugar” is actually known by the technical term ‘Lowering’ and the C# compiler (a.k.a Roslyn) uses it extensively.
But what is it? Well this quote from So You Want To Write Your Own Language? gives us some idea:
Lowering
One semantic technique that is obvious in hindsight (but took Andrei Alexandrescu to point out to me) is called “lowering.” It consists of, internally, rewriting more complex semantic constructs in terms of simpler ones. For example, while loops and foreach loops can be rewritten in terms of for loops. Then, the rest of the code only has to deal with for loops. This turned out to uncover a couple of latent bugs in how while loops were implemented in D, and so was a nice win. It’s also used to rewrite scope guard statements in terms of try-finally statements, etc. Every case where this can be found in the semantic processing will be win for the implementation.
– by Walter Bright (author of the D programming language)
But if you’re still not sure what it means, have a read of Eric Lippert’s post on the subject, Lowering in language design, which contains this quote:
A common technique along the way though is to have the compiler “lower” from high-level language features to low-level language features in the same language.
As an aside, if you like reading about the Roslyn compiler source you may like these other posts that I’ve written:
What does ‘Lowering’ look like?
The C# compiler has used lowering for a while; one of the oldest and most recognised examples is when this code:
using System.Collections.Generic;
public class C {
public IEnumerable<int> M()
{
foreach (var value in new [] { 1, 2, 3, 4, 5 })
{
yield return value;
}
}
}
is turned into this
public class C
{
    [CompilerGenerated]
    private sealed class <M>d__0 : IEnumerable<int>, IEnumerable, IEnumerator<int>, IDisposable, IEnumerator
    {
        private int <>1__state;
        private int <>2__current;
        private int <>l__initialThreadId;
        public C <>4__this;
        private int[] <>s__1;
        private int <>s__2;
        private int <value>5__3;

        int IEnumerator<int>.Current
        {
            [DebuggerHidden]
            get
            {
                return this.<>2__current;
            }
        }

        object IEnumerator.Current
        {
            [DebuggerHidden]
            get
            {
                return this.<>2__current;
            }
        }

        [DebuggerHidden]
        public <M>d__0(int <>1__state)
        {
            this.<>1__state = <>1__state;
            this.<>l__initialThreadId = Environment.CurrentManagedThreadId;
        }

        [DebuggerHidden]
        void IDisposable.Dispose()
        {
        }

        bool IEnumerator.MoveNext()
        {
            int num = this.<>1__state;
            if (num != 0)
            {
                if (num != 1)
                {
                    return false;
                }
                this.<>1__state = -1;
                this.<>s__2++;
            }
            else
            {
                this.<>1__state = -1;
                this.<>s__1 = new int[] { 1, 2, 3, 4, 5 };
                this.<>s__2 = 0;
            }
            if (this.<>s__2 >= this.<>s__1.Length)
            {
                this.<>s__1 = null;
                return false;
            }
            this.<value>5__3 = this.<>s__1[this.<>s__2];
            this.<>2__current = this.<value>5__3;
            this.<>1__state = 1;
            return true;
        }

        [DebuggerHidden]
        void IEnumerator.Reset()
        {
            throw new NotSupportedException();
        }

        [DebuggerHidden]
        IEnumerator<int> IEnumerable<int>.GetEnumerator()
        {
            C.<M>d__0 <M>d__;
            if (this.<>1__state == -2 && this.<>l__initialThreadId == Environment.CurrentManagedThreadId)
            {
                this.<>1__state = 0;
                <M>d__ = this;
            }
            else
            {
                <M>d__ = new C.<M>d__0(0);
                <M>d__.<>4__this = this.<>4__this;
            }
            return <M>d__;
        }

        [DebuggerHidden]
        IEnumerator IEnumerable.GetEnumerator()
        {
            return this.System.Collections.Generic.IEnumerable<int>.GetEnumerator();
        }
    }

    [IteratorStateMachine(typeof(C.<M>d__0))]
    public IEnumerable<int> M()
    {
        C.<M>d__0 expr_07 = new C.<M>d__0(-2);
        expr_07.<>4__this = this;
        return expr_07;
    }
}
Yikes, I’m glad we don’t have to write that code ourselves!! There’s an entire state-machine in there, built to allow our original code to be halted/resumed each time round the loop (at the ‘yield’ statement).
The C# compiler and ‘Lowering’
But it turns out that the Roslyn compiler does a lot more ‘lowering’ than you might think. If you take a look at the code under ‘/src/Compilers/CSharp/Portable/Lowering’ (VB.NET equivalent here), you see the following folders:
Which correspond to some C# language features you might be familiar with, such as ‘lambdas’, i.e. x => x.Name > 5
, ‘iterators’ used by yield
(above) and the async
keyword.
However if we look a bit deeper, under the ‘LocalRewriter’ folder we can see lots more scenarios that we might never have considered ‘lowering’, such as:
So a big thank-you is due to all the past and present C# language developers and designers, they did all this work for us. Imagine that C# didn’t have all these high-level features, we’d be stuck writing them by hand.
It would be like writing Java :-)
What happens when you misuse it
But of course the real fun part is ‘misusing’ or outright ‘abusing’ the compiler. So I set up a little Twitter competition to see just how much ‘lowering’ we could get the compiler to do for us (i.e. the highest ratio of ‘input’ lines of code to ‘output’ lines).
It had the following rules (see this gist for more info):
- You can have as many lines as you want within method
M()
- No single line can be longer than 100 chars
- To get your score, divide the ‘# of expanded lines’ by the ‘# of original line(s)’
- Based on the default output formatting of https://sharplab.io, no re-formatting allowed!!
- But you can format the input however you want, i.e. make use of the full 100 chars
- Must compile with no warnings on https://sharplab.io (allows C# 7 features)
- But doesn’t have to do anything sensible when run
- You cannot modify the code that is already there, i.e.
public class C {}
and public void M()
- Cannot just add
async
to public void M()
, that’s too easy!!
- You can add new
using ...
declarations, these do not count towards the line count
For instance with the following code (interactive version available on sharplab.io):
using System;
public class C {
public void M() {
Func<string> test = () => "blah"?.ToString();
}
}
This counts as 1 line of original code (only code inside method M() is counted). It expands to 23 lines (again, only lines of code inside the braces ({, }) of class C are counted), giving a total score of 23 (23 / 1):
....
public class C
{
    [CompilerGenerated]
    [Serializable]
    private sealed class <>c
    {
        public static readonly C.<>c <>9;
        public static Func<string> <>9__0_0;

        static <>c()
        {
            // Note: this type is marked as 'beforefieldinit'.
            C.<>c.<>9 = new C.<>c();
        }

        internal string <M>b__0_0()
        {
            return "blah";
        }
    }

    public void M()
    {
        if (C.<>c.<>9__0_0 == null)
        {
            C.<>c.<>9__0_0 = new Func<string>(C.<>c.<>9.<M>b__0_0);
        }
    }
}
Results
The first place entry was the following entry from Schabse Laks, which contains 9 lines-of-code inside the M()
method:
using System.Linq;
using Y = System.Collections.Generic.IEnumerable<dynamic>;
public class C {
public void M() {
((Y)null).Select(async x => await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await x.x()());
}
}
this expands to an impressive 7964 lines of code (yep, you read that right!!) for a score of 885 (7964 / 9). The main trick he figured out was that adding more lines to the input increased the score, i.e. it scales superlinearly. Although if you take things too far the compiler bails out with a pretty impressive error message:
error CS8078: An expression is too long or complex to compile
Here are the top 6 results:
| Submitter | Entry | Score |
|-----------|-------|-------|
| Schabse Laks | link | 885 (7964 / 9) |
| Andrey Dyatlov | link | 778 (778 / 1) |
| alrz | link | 755 (755 / 1) |
| Andy Gocke * | link | 633 (633 / 1) |
| Jared Parsons * | link | 461 (461 / 1) |
| Jonathan Chambers | link | 384 (384 / 1) |

\* = member of the Roslyn compiler team (they’re not disqualified, but maybe they should have some kind of handicap applied to ‘even out’ the playing field?)
Honourable mentions
However there were some other entries that whilst they didn’t make it into the Top 6, are still worth a mention due to the ingenuity involved:
Discuss this post on HackerNews, /r/programming or /r/csharp (whichever takes your fancy!!)
The post Lowering in the C# Compiler (and what happens when you misuse it) first appeared on my blog Performance is a Feature!
CodeProject
Thu, 25 May 2017, 12:00 am
Adding a new Bytecode Instruction to the CLR
Now that the CoreCLR is open-source we can do fun things, for instance find out if it’s possible to add a new IL (Intermediate Language) instruction to the runtime.
TL;DR it turns out that it’s easier than you might think!! Here are the steps you need to go through:
Update: turns out that I wasn’t the only person to have this idea, see Beachhead implements new opcode on CLR JIT for another implementation by Kouji Matsui.
Step 0
But first a bit of background information. Adding a new IL instruction to the CLR is a pretty rare event; the last time it was done for real was in .NET 2.0, when support for generics was added. This is part of the reason why .NET code has good backwards-compatibility, from Backward compatibility and the .NET Framework 4.5:
The .NET Framework 4.5 and its point releases (4.5.1, 4.5.2, 4.6, 4.6.1, 4.6.2, and 4.7) are backward-compatible with apps that were built with earlier versions of the .NET Framework. In other words, apps and components built with previous versions will work without modification on the .NET Framework 4.5.
Side note: The .NET framework did break backwards compatibility when moving from 1.0 to 2.0, precisely so that support for generics could be added deep into the runtime, i.e. with support in the IL. Java took a different decision, I guess because it had been around longer, breaking backwards-compatibility was a bigger issue. See the excellent blog post Comparing Java and C# Generics for more info.
Step 1
For this exercise I plan to add a new IL instruction (op-code) to the CoreCLR runtime and because I’m a raving narcissist (not really, see below) I’m going to name it after myself. So let me introduce the matt
IL instruction, that you can use like so:
.method private hidebysig static int32 TestMattOpCodeMethod(int32 x, int32 y)
cil managed noinlining
{
.maxstack 2
ldarg.0
ldarg.1
matt // yay, my name as an IL op-code!!!!
ret
}
But because I’m actually a bit British (i.e. I don’t like to ‘blow my own trumpet’), I’m going to make the matt
op-code almost completely pointless, it’s going to do exactly the same thing as calling Math.Max(x, y)
, i.e. just return the largest of the 2 numbers.
The other reason for naming it matt
is that I’d really like someone to make a version of the C# (Roslyn) compiler that allows you to write code like this:
Console.WriteLine("{0} m@ {1} = {2}", 1, 7, 1 m@ 7); // prints '1 m@ 7 = 7'
I definitely want the m@
operator to be a thing (pronounced ‘matt’, not ‘m-at’), maybe the other ‘Matt Warren’ who works at Microsoft on the C# Language Design Team can help out!! Seriously though, if anyone reading this would like to write a similar blog post, showing how you’d add the m@
operator to the Roslyn compiler, please let me know I’d love to read it.
Update: Thanks to Marcin Juraszek (@mmjuraszek) you can now use the m@
in a C# program, see Adding Matt operator to Roslyn - Syntax, Lexer and Parser, Adding Matt operator to Roslyn - Binder and Adding Matt operator to Roslyn - Emitter for the full details.
Now we’ve defined the op-code, the first step is to ensure that the run-time and tooling can recognise it. In particular we need the IL Assembler (a.k.a ilasm
) to be able to take the IL code above (TestMattOpCodeMethod(..)
) and produce a .NET executable.
As the .NET runtime source code is nicely structured (+1 to the runtime devs), to make this possible we only need to make changes in opcode.def:
--- a/src/inc/opcode.def
+++ b/src/inc/opcode.def
@@ -154,7 +154,7 @@ OPDEF(CEE_NEWOBJ, "newobj", VarPop, Pu
OPDEF(CEE_CASTCLASS, "castclass", PopRef, PushRef, InlineType, IObjModel, 1, 0xFF, 0x74, NEXT)
OPDEF(CEE_ISINST, "isinst", PopRef, PushI, InlineType, IObjModel, 1, 0xFF, 0x75, NEXT)
OPDEF(CEE_CONV_R_UN, "conv.r.un", Pop1, PushR8, InlineNone, IPrimitive, 1, 0xFF, 0x76, NEXT)
-OPDEF(CEE_UNUSED58, "unused", Pop0, Push0, InlineNone, IPrimitive, 1, 0xFF, 0x77, NEXT)
+OPDEF(CEE_MATT, "matt", Pop1+Pop1, Push1, InlineNone, IPrimitive, 1, 0xFF, 0x77, NEXT)
OPDEF(CEE_UNUSED1, "unused", Pop0, Push0, InlineNone, IPrimitive, 1, 0xFF, 0x78, NEXT)
OPDEF(CEE_UNBOX, "unbox", PopRef, PushI, InlineType, IPrimitive, 1, 0xFF, 0x79, NEXT)
OPDEF(CEE_THROW, "throw", PopRef, Push0, InlineNone, IObjModel, 1, 0xFF, 0x7A, THROW)
I just picked the first available unused
slot and added matt
in there. It’s defined as Pop1+Pop1
because it takes 2 values from the stack as input and Push1
because after it has executed, a single result is pushed back onto the stack.
Note: all the changes I made are available in one-place on GitHub if you’d rather look at them like that.
Once this change was done ilasm
will successfully assemble the test code file HelloWorld.il
that contains TestMattOpCodeMethod(..)
as shown above:
λ ilasm /EXE /OUTPUT=HelloWorld.exe -NOLOGO HelloWorld.il
Assembling 'HelloWorld.il' to EXE --> 'HelloWorld.exe'
Source file is ANSI
Assembled method HelloWorld::Main
Assembled method HelloWorld::TestMattOpCodeMethod
Creating PE file
Emitting classes:
Class 1: HelloWorld
Emitting fields and methods:
Global
Class 1 Methods: 2;
Resolving local member refs: 1 -> 1 defs, 0 refs, 0 unresolved
Emitting events and properties:
Global
Class 1
Resolving local member refs: 0 -> 0 defs, 0 refs, 0 unresolved
Writing PE file
Operation completed successfully
Step 2
However at this point the matt
op-code isn’t actually executed, at runtime the CoreCLR just throws an exception because it doesn’t know what to do with it. As a first (simpler) step, I just wanted to make the .NET Interpreter work, so I made the following changes to wire it up:
--- a/src/vm/interpreter.cpp
+++ b/src/vm/interpreter.cpp
@@ -2726,6 +2726,9 @@ void Interpreter::ExecuteMethod(ARG_SLOT* retVal, __out bool* pDoJmpCall, __out
case CEE_REM_UN:
BinaryIntOp();
break;
+ case CEE_MATT:
+ BinaryArithOp();
+ break;
case CEE_AND:
BinaryIntOp();
break;
--- a/src/vm/interpreter.hpp
+++ b/src/vm/interpreter.hpp
@@ -298,10 +298,14 @@ void Interpreter::BinaryArithOpWork(T val1, T val2)
{
res = val1 / val2;
}
- else
+ else if (op == BA_Rem)
{
res = RemFunc(val1, val2);
}
+ else if (op == BA_Matt)
+ {
+ res = MattFunc(val1, val2);
+ }
}
and then I added the methods that would actually implement the interpreted code:
--- a/src/vm/interpreter.cpp
+++ b/src/vm/interpreter.cpp
@@ -10801,6 +10804,26 @@ double Interpreter::RemFunc(double v1, double v2)
return fmod(v1, v2);
}
+INT32 Interpreter::MattFunc(INT32 v1, INT32 v2)
+{
+ return v1 > v2 ? v1 : v2;
+}
+
+INT64 Interpreter::MattFunc(INT64 v1, INT64 v2)
+{
+ return v1 > v2 ? v1 : v2;
+}
+
+float Interpreter::MattFunc(float v1, float v2)
+{
+ return v1 > v2 ? v1 : v2;
+}
+
+double Interpreter::MattFunc(double v1, double v2)
+{
+ return v1 > v2 ? v1 : v2;
+}
So fairly straight-forward and the bonus is that at this point the matt
operator is fully operational, you can actually write IL using it and it will run (interpreted only).
Step 3
However not everyone wants to re-compile the CoreCLR just to enable the Interpreter, so I want to also make it work for real via the Just-in-Time (JIT) compiler.
The full changes to make this work were spread across multiple files, but were mostly housekeeping so I won’t include them all here; check out the full diff if you’re interested. But the significant parts are below:
--- a/src/jit/importer.cpp
+++ b/src/jit/importer.cpp
@@ -11112,6 +11112,10 @@ void Compiler::impImportBlockCode(BasicBlock* block)
oper = GT_UMOD;
goto MATH_MAYBE_CALL_NO_OVF;
+ case CEE_MATT:
+ oper = GT_MATT;
+ goto MATH_MAYBE_CALL_NO_OVF;
+
MATH_MAYBE_CALL_NO_OVF:
ovfl = false;
MATH_MAYBE_CALL_OVF:
--- a/src/vm/jithelpers.cpp
+++ b/src/vm/jithelpers.cpp
@@ -341,6 +341,14 @@ HCIMPL2(UINT32, JIT_UMod, UINT32 dividend, UINT32 divisor)
HCIMPLEND
/*********************************************************************/
+HCIMPL2(INT32, JIT_Matt, INT32 x, INT32 y)
+{
+ FCALL_CONTRACT;
+ return x > y ? x : y;
+}
+HCIMPLEND
+
+/*********************************************************************/
HCIMPL2_VV(INT64, JIT_LDiv, INT64 dividend, INT64 divisor)
{
FCALL_CONTRACT;
In summary, these changes mean that during the JIT’s ‘Morph phase’ the IL containing the matt
op code is converted from:
fgMorphTree BB01, stmt 1 (before)
[000004] ------------ ▌ return int
[000002] ------------ │ ┌──▌ lclVar int V01 arg1
[000003] ------------ └──▌ m@ int
[000001] ------------ └──▌ lclVar int V00 arg0
into this:
fgMorphTree BB01, stmt 1 (after)
[000004] --C--+------ ▌ return int
[000003] --C--+------ └──▌ call help int HELPER.CORINFO_HELP_MATT
[000001] -----+------ arg0 in rcx ├──▌ lclVar int V00 arg0
[000002] -----+------ arg1 in rdx └──▌ lclVar int V01 arg1
Note the call to HELPER.CORINFO_HELP_MATT
When this is finally compiled into assembly code it ends up looking like so:
// Assembly listing for method HelloWorld:TestMattOpCodeMethod(int,int):int
// Emitting BLENDED_CODE for X64 CPU with AVX
// optimized code
// rsp based frame
// partially interruptible
// Final local variable assignments
//
// V00 arg0 [V00,T00] ( 3, 3 ) int -> rcx
// V01 arg1 [V01,T01] ( 3, 3 ) int -> rdx
// V02 OutArgs [V02 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
//
// Lcl frame size = 40
G_M9261_IG01:
4883EC28 sub rsp, 40
G_M9261_IG02:
E8976FEB5E call CORINFO_HELP_MATT
90 nop
G_M9261_IG03:
4883C428 add rsp, 40
C3 ret
I’m not entirely sure why there is a nop
instruction in there, but it works, which is the main thing!!
Step 4
In the CLR you can also dynamically emit code at runtime using the methods that sit under the ‘System.Reflection.Emit’ namespace, so the last task is to add the OpCodes.Matt
field and have it emit the correct values for the matt
op-code.
--- a/src/mscorlib/src/System/Reflection/Emit/OpCodes.cs
+++ b/src/mscorlib/src/System/Reflection/Emit/OpCodes.cs
@@ -139,6 +139,7 @@ internal enum OpCodeValues
Castclass = 0x74,
Isinst = 0x75,
Conv_R_Un = 0x76,
+ Matt = 0x77,
Unbox = 0x79,
Throw = 0x7a,
Ldfld = 0x7b,
@@ -1450,6 +1451,16 @@ private OpCodes()
(0
Fri, 19 May 2017, 12:00 am
Arrays and the CLR - a Very Special Relationship
A while ago I wrote about the ‘special relationship’ that exists between Strings and the CLR; well, it turns out that Arrays and the CLR have an even deeper one, the type of closeness where you hold hands on your first meeting.
As an aside, if you like reading about CLR internals you may find these other posts interesting:
Fundamental to the Common Language Runtime (CLR)
Arrays are such a fundamental part of the CLR that they are included in the ECMA specification, to make it clear that the runtime has to implement them:
In addition, there are several IL (Intermediate Language) instructions that specifically deal with arrays:
newarr
- Create a new array with elements of type etype.
ldelem.ref
- Load the element at index onto the top of the stack as an O. The type of the O is the same as the element type of the array pushed on the CIL stack.
stelem
- Replace array element at index with the value on the stack (also
stelem.i
, stelem.i1
, stelem.i2
, stelem.r4
etc)
ldlen
- Push the length (of type native unsigned int) of array on the stack.
This makes sense because arrays are the building blocks of so many other data types; you want them to be available, well defined and efficient in a modern high-level language like C#. Without arrays you can’t have lists, dictionaries, queues, stacks, trees, etc., as they’re all built on top of arrays, which provide low-level access to contiguous pieces of memory in a type-safe way.
Memory and Type Safety
This memory and type-safety is important because without it .NET couldn’t be described as a ‘managed runtime’ and you’d be left having to deal with the types of issues you get when you are writing code in a more low-level language.
More specifically, the CLR provides the following protections when you are using arrays (from the section on Memory and Type Safety in the BOTR ‘Intro to the CLR’ page):
While a GC is necessary to ensure memory safety, it is not sufficient. The GC will not prevent the program from indexing off the end of an array or accessing a field off the end of an object (possible if you compute the field’s address using a base and offset computation). However, if we do prevent these cases, then we can indeed make it impossible for a programmer to create memory-unsafe programs.
While the common intermediate language (CIL) does have operators that can fetch and set arbitrary memory (and thus violate memory safety), it also has the following memory-safe operators and the CLR strongly encourages their use in most programming:
- Field-fetch operators (LDFLD, STFLD, LDFLDA) that fetch (read), set and take the address of a field by name.
- Array-fetch operators (LDELEM, STELEM, LDELEMA) that fetch, set and take the address of an array element by index. All arrays include a tag specifying their length. This facilitates an automatic bounds check before each access.
Also, from the section on Verifiable Code - Enforcing Memory and Type Safety in the same BOTR page
In practice, the number of run-time checks needed is actually very small. They include the following operations:
- Casting a pointer to a base type to be a pointer to a derived type (the opposite direction can be checked statically)
- Array bounds checks (just as we saw for memory safety)
- Assigning an element in an array of pointers to a new (pointer) value. This particular check is only required because CLR arrays have liberal casting rules (more on that later…)
However you don’t get this protection for free, there’s a cost to pay:
Note that the need to do these checks places requirements on the runtime. In particular:
- All memory in the GC heap must be tagged with its type (so the casting operator can be implemented). This type information must be available at runtime, and it must be rich enough to determine if casts are valid (e.g., the runtime needs to know the inheritance hierarchy). In fact, the first field in every object on the GC heap points to a runtime data structure that represents its type.
- All arrays must also have their size (for bounds checking).
- Arrays must have complete type information about their element type.
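The requirements above can be sketched in a few lines of C++. This is purely illustrative (none of these names come from the CoreCLR source): an array object whose first field points at its type information, which stores its length, and which bounds-checks every access, just as the runtime does for `ldelem`/`stelem`/`ldlen`:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the runtime's per-type data structure
struct MethodTable { const char* elementTypeName; };

class ManagedArray {
    const MethodTable* m_pMT;  // first field: runtime type information
    std::size_t m_length;      // length stored so bounds can be checked
    std::vector<int> m_data;   // element storage (int elements, for simplicity)
public:
    ManagedArray(const MethodTable* mt, std::size_t len)
        : m_pMT(mt), m_length(len), m_data(len, 0) {}

    // ldelem-style access: the bounds check happens before every load
    int Get(std::size_t index) const {
        if (index >= m_length)
            throw std::out_of_range("IndexOutOfRangeException");
        return m_data[index];
    }

    // stelem-style access: same check before every store
    void Set(std::size_t index, int value) {
        if (index >= m_length)
            throw std::out_of_range("IndexOutOfRangeException");
        m_data[index] = value;
    }

    std::size_t Length() const { return m_length; }              // ldlen
    const char* ElementType() const { return m_pMT->elementTypeName; }
};
```

Out-of-range access raises an exception rather than reading arbitrary memory, which is exactly the memory-safety guarantee being described.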
Implementation Details
It turns out that large parts of the internal implementation of arrays are best described as magic; this Stack Overflow comment from Marc Gravell sums it up nicely:
Arrays are basically voodoo. Because they pre-date generics, yet must allow on-the-fly type-creation (even in .NET 1.0), they are implemented using tricks, hacks, and sleight of hand.
Yep that’s right, arrays were parametrised (i.e. generic) before generics even existed. That means you could create arrays such as int[] and string[] long before you were able to write List&lt;int&gt; or List&lt;string&gt;, which only became possible in .NET 2.0.
Special helper classes
All this magic or sleight of hand is made possible by 2 things:
- The CLR breaking all the usual type-safety rules
- A special array helper class called
SZArrayHelper
But first, the ‘why’: why were all these tricks needed? From .NET Arrays, IList, Generic Algorithms, and what about STL?:
When we were designing our generic collections classes, one of the things that bothered me was how to write a generic algorithm that would work on both arrays and collections. To drive generic programming, of course we must make arrays and generic collections as seamless as possible. It felt that there should be a simple solution to this problem that meant you shouldn’t have to write the same code twice, once taking an IList&lt;T&gt; and again taking a T[]. The solution that dawned on me was that arrays needed to implement our generic IList&lt;T&gt;. We made arrays in V1 implement the non-generic IList, which was rather simple due to the lack of strong typing with IList and our base class for all arrays (System.Array). What we needed was to do the same thing in a strongly typed way for IList&lt;T&gt;.
But it was only done for the common case, i.e. ‘single dimensional’ arrays:
There were some restrictions here though – we didn’t want to support multidimensional arrays since IList&lt;T&gt; only provides single dimensional accesses. Also, arrays with non-zero lower bounds are rather strange, and probably wouldn’t mesh well with IList&lt;T&gt;, where most people may iterate from 0 to the return from the Count property on that IList. So, instead of making System.Array implement IList&lt;T&gt;, we made T[] implement IList&lt;T&gt;. Here, T[] means a single dimensional array with 0 as its lower bound (often called an SZArray internally, but I think Brad wanted to promote the term ‘vector’ publically at one point in time), and the element type is T. So Int32[] implements IList&lt;Int32&gt;, and String[] implements IList&lt;String&gt;.
Also, this comment from the array source code sheds some further light on the reasons:
//----------------------------------------------------------------------------------
// Calls to (IList)(array).Meth are actually implemented by SZArrayHelper.Meth
// This workaround exists for two reasons:
//
// - For working set reasons, we don't want insert these methods in the array
// hierachy in the normal way.
// - For platform and devtime reasons, we still want to use the C# compiler to
// generate the method bodies.
//
// (Though it's questionable whether any devtime was saved.)
//
// ....
//----------------------------------------------------------------------------------
So it was done for convenience and efficiency, as they didn’t want every instance of System.Array
to carry around all the code for the IEnumerable&lt;T&gt;
and IList&lt;T&gt;
implementations.
This mapping takes place via a call to GetActualImplementationForArrayGenericIListOrIReadOnlyListMethod(..), which wins the prize for the best method name in the CoreCLR source!! It’s responsible for wiring up the corresponding method from the SZArrayHelper class, i.e. IList&lt;T&gt;.Count
-> SZArrayHelper.Count
or if the method is part of the IEnumerator&lt;T&gt;
interface, the SZGenericArrayEnumerator&lt;T&gt; is used.
But this has the potential to cause security holes, as it breaks the normal C# type system guarantees, specifically regarding the this
pointer. To illustrate the problem, here’s the source code of the Count
property, note the call to JitHelpers.UnsafeCast
:
internal int get_Count()
{
//! Warning: "this" is an array, not an SZArrayHelper. See comments above
//! or you may introduce a security hole!
T[] _this = JitHelpers.UnsafeCast&lt;T[]&gt;(this);
return _this.Length;
}
Yikes, it has to remap this
to be able to call Length
on the correct object!!
And just in case those comments aren’t enough, there is a very strongly worded comment at the top of the class that further spells out the risks!!
Generally all this magic is hidden from you, but occasionally it leaks out. For instance, if you run the code below, SZArrayHelper
will show up in the StackTrace
and TargetSite
properties of the NotSupportedException
:
try {
int[] someInts = { 1, 2, 3, 4 };
IList&lt;int&gt; collection = someInts;
// Throws NotSupportedException 'Collection is read-only'
collection.Clear();
} catch (NotSupportedException nsEx) {
Console.WriteLine("{0} - {1}", nsEx.TargetSite.DeclaringType, nsEx.TargetSite);
Console.WriteLine(nsEx.StackTrace);
}
Removing Bounds Checks
The runtime also provides support for arrays in more conventional ways, the first of which is related to performance. Array bounds checks are all well and good when providing memory-safety, but they have a cost, so where possible the JIT removes any checks that it knows are redundant.
It does this by calculating the range of values that a for
loop accesses and comparing those to the actual length of the array. If it determines that there is never an attempt to access an item outside the permissible bounds of the array, the run-time checks are then removed.
For more information, the links below take you to the areas of the JIT source code that deal with this:
And if you are really keen, take a look at this gist that I put together to explore the scenarios where bounds checks are ‘removed’ and ‘not removed’.
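The core of that reasoning can be sketched very simply. This is illustrative only (the names are mine, not the CoreCLR API): the JIT proves a value range for the loop’s index variable, and if that range lies entirely within [0, arrayLength) then the per-access bounds check can never fail, so it is redundant and can be removed:

```cpp
// Illustrative only, not the CoreCLR API: the proven range of values an
// index variable can take, expressed as the half-open interval [low, high).
struct IndexRange {
    long low;   // smallest value the index can take
    long high;  // one past the largest value the index can take
};

// A bounds check inside the loop is redundant (safe to elide) when every
// possible index value already lies inside the array.
bool BoundsCheckIsRedundant(const IndexRange& idx, long arrayLength) {
    return idx.low >= 0 && idx.high <= arrayLength;
}
```

For example, `for (int i = 0; i < a.Length; i++)` gives the range [0, Length), so the check is elided; change the condition to `i <= a.Length` and the range becomes [0, Length], so the checks have to stay.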
Allocating an array
Another task that the runtime helps with is allocating arrays, using hand-written assembly code so the methods are as optimised as possible, see:
Run-time treats arrays differently
Finally, because arrays are so intertwined with the CLR, there are lots of places in which they are dealt with as a special-case. For instance a search for ‘IsArray()’ in the CoreCLR source returns over 60 hits, including:
- The method table for an array is built differently
- When you call
ToString()
on an array, you get special formatting, i.e. ‘System.Int32[]’ or ‘MyClass[,]’
So yes, it’s fair to say that arrays and the CLR have a Very Special Relationship
Further Reading
As always, here are some more links for your enjoyment!!
Array source code references
The post Arrays and the CLR - a Very Special Relationship first appeared on my blog Performance is a Feature!
CodeProject
Mon, 8 May 2017, 12:00 am
The CLR Thread Pool 'Thread Injection' Algorithm
If you’re near London at the end of April, I’ll be speaking at ProgSCon 2017 on Microsoft and Open-Source – A ‘Brave New World’. ProgSCon is 1-day conference, with talks covering an eclectic range of topics, you’ll learn lots!!
As part of a never-ending quest to explore the CoreCLR source code I stumbled across the intriguingly titled ‘HillClimbing.cpp’ source file. This post explains what it does and why.
What is ‘Hill Climbing’
It turns out that ‘Hill Climbing’ is a general technique, from the Wikipedia page on the Hill Climbing Algorithm:
In computer science, hill climbing is a mathematical optimization technique which belongs to the family of local search. It is an iterative algorithm that starts with an arbitrary solution to a problem, then attempts to find a better solution by incrementally changing a single element of the solution. If the change produces a better solution, an incremental change is made to the new solution, repeating until no further improvements can be found.
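The definition above can be turned into a few lines of code. This sketch shows the general optimisation technique only (it has nothing to do with the CoreCLR implementation): start from an arbitrary solution, try a small change in each direction, and keep the change only if the objective improves.

```cpp
// Generic hill climbing: maximise 'objective' by repeatedly taking small
// uphill steps from the current solution 'x'.
template <typename F>
double HillClimb(F objective, double x, double step, int maxIters) {
    for (int i = 0; i < maxIters; i++) {
        double current = objective(x);
        if (objective(x + step) > current)
            x += step;            // an uphill move to the right
        else if (objective(x - step) > current)
            x -= step;            // an uphill move to the left
        else
            step /= 2.0;          // no improvement either way: refine the step
    }
    return x;
}
```

For example, maximising f(x) = -(x - 3)² starting from x = 0 climbs steadily to x ≈ 3, the top of the ‘hill’. Note the classic weakness too: the search can get stuck on a local maximum, which is one reason the ThreadPool version layers statistics on top.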
But in the context of the CoreCLR, ‘Hill Climbing’ (HC) is used to control the rate at which threads are added to the Thread Pool, from the MSDN page on ‘Parallel Tasks’:
Thread Injection
The .NET thread pool automatically manages the number of worker threads in the pool. It adds and removes threads according to built-in heuristics. The .NET thread pool has two main mechanisms for injecting threads: a starvation-avoidance mechanism that adds worker threads if it sees no progress being made on queued items and a hill-climbing heuristic that tries to maximize throughput while using as few threads as possible.
…
A goal of the hill-climbing heuristic is to improve the utilization of cores when threads are blocked by I/O or other wait conditions that stall the processor
….
The .NET thread pool has an opportunity to inject threads every time a work item completes or at 500 millisecond intervals, whichever is shorter. The thread pool uses this opportunity to try adding threads (or taking them away), guided by feedback from previous changes in the thread count. If adding threads seems to be helping throughput, the thread pool adds more; otherwise, it reduces the number of worker threads. This technique is called the hill-climbing heuristic.
For more specifics on what the algorithm is doing, you can read the research paper Optimizing Concurrency Levels in the .NET ThreadPool published by Microsoft, although if you want a brief outline of what it’s trying to achieve, this summary from the paper is helpful:
In addition the controller should have:
- short settling times so that cumulative throughput is maximized
- minimal oscillations since changing control settings incurs overheads that reduce throughput
- fast adaptation to changes in workloads and resource characteristics.
So maximise throughput, don’t add and then remove threads too quickly, but still adapt fast to changing work-loads; simple really!!
As an aside, after reading (and re-reading) the research paper I found it interesting that a considerable amount of it was dedicated to testing, as the following excerpt shows:
In fact the approach to testing was considered so important that they wrote an entire follow-up paper that discusses it, see Configuring Resource Managers Using Model Fuzzing.
Why is it needed?
Because, in short, just adding new threads doesn’t always increase throughput and ultimately having lots of threads has a cost. As this comment from Eric Eilebrecht, one of the authors of the research paper explains:
Throttling thread creation is not only about the cost of creating a thread; it’s mainly about the cost of having a large number of running threads on an ongoing basis. For example:
- More threads means more context-switching, which adds CPU overhead. With a large number of threads, this can have a significant impact.
- More threads means more active stacks, which impacts data locality. The more stacks a CPU is having to juggle in its various caches, the less effective those caches are.
The advantage of more threads than logical processors is, of course, that we can keep the CPU busy if some of the threads are blocked, and so get more work done. But we need to be careful not to “overreact” to blocking, and end up hurting performance by having too many threads.
Or in other words, from Concurrency - Throttling Concurrency in the CLR 4.0 ThreadPool
As opposed to what may be intuitive, concurrency control is about throttling and reducing the number of work items that can be run in parallel in order to improve the worker ThreadPool throughput (that is, controlling the degree of concurrency is preventing work from running).
So the algorithm was designed with all these criteria in mind and was then tested over a large range of scenarios, to ensure it actually worked! This is why it’s often said that you should just leave the .NET ThreadPool alone and not try to tinker with it. It’s been heavily tested to work across multiple situations and it was designed to adapt over time, so it should have you covered! (although of course, there are times when it doesn’t work perfectly!!)
The Algorithm in Action
As the source is now available, we can actually play with the algorithm and try it out in a few scenarios to see what it does. It needs very few dependencies and therefore all the relevant code is contained in the following files:
(For comparison, there’s an implementation of the same algorithm in the Mono source code)
I have a project up on my GitHub page that allows you to test the hill-climbing algorithm in a self-contained console app. If you’re interested you can see the changes/hacks I had to do to get it building, although in the end it was pretty simple! (Update: kudos to Christian Klutz, who ported my self-contained app to C#, nice job!!)
The algorithm is controlled via the following HillClimbing_XXX
settings:
| Setting | Default Value | Notes |
|---------|---------------|-------|
| HillClimbing_WavePeriod | 4 | |
| HillClimbing_TargetSignalToNoiseRatio | 300 | |
| HillClimbing_ErrorSmoothingFactor | 1 | |
| HillClimbing_WaveMagnitudeMultiplier | 100 | |
| HillClimbing_MaxWaveMagnitude | 20 | |
| HillClimbing_WaveHistorySize | 8 | |
| HillClimbing_Bias | 15 | The ‘cost’ of a thread. 0 means drive for increased throughput regardless of thread count; higher values bias more against higher thread counts |
| HillClimbing_MaxChangePerSecond | 4 | |
| HillClimbing_MaxChangePerSample | 20 | |
| HillClimbing_MaxSampleErrorPercent | 15 | |
| HillClimbing_SampleIntervalLow | 10 | |
| HillClimbing_SampleIntervalHigh | 200 | |
| HillClimbing_GainExponent | 200 | The exponent to apply to the gain, times 100. 100 means to use linear gain, higher values will enhance large moves and damp small ones |
Because I was using the code in a self-contained console app, I just hard-coded the default values into the source, but in the CLR it appears that you can modify these values at runtime.
Working with the Hill Climbing code
There are several things I discovered when implementing a simple test app that works with the algorithm:
- The calculation is triggered by calling the function
HillClimbingInstance.Update(currentThreadCount, sampleDuration, numCompletions, &threadAdjustmentInterval)
and the return value is the new ‘maximum thread count’ that the algorithm is proposing.
- It calculates the desired number of threads based on the ‘current throughput’, which is the ‘# of tasks completed’ (
numCompletions
) during the current time-period (sampleDuration
in seconds).
- It also takes the current thread count (
currentThreadCount
) into consideration.
- The core calculations (excluding error handling and house-keeping) are only just over 100 LOC, so it’s not too hard to follow.
- It works on the basis of ‘transitions’ (
HillClimbingStateTransition
), first Warmup
, then Stabilizing
and will only recommend a new value once it’s moved into the ClimbingMove
state.
- The real .NET Thread Pool only increases the thread-count by one thread every 500 milliseconds. It keeps doing this until the ‘# of threads’ has reached the amount that the hill-climbing algorithm suggests. See ThreadpoolMgr::ShouldAdjustMaxWorkersActive() and ThreadpoolMgr::AdjustMaxWorkersActive() for the code that handles this.
- If it hasn’t got enough samples to do a ‘statistically significant’ calculation, the algorithm will indicate this via the
threadAdjustmentInterval
variable. This means that you should not call HillClimbingInstance.Update(..)
until another threadAdjustmentInterval
milliseconds have elapsed. (link to source code that calculates this)
- The current thread count is only decreased when threads complete their current task. At that point the current count is compared to the desired amount and if necessary a thread is ‘retired’
- The algorithm only returns values that respect the limits specified by ThreadPool.SetMinThreads(..) and ThreadPool.SetMaxThreads(..) (link to the code that handles this)
- In addition, it will only recommend increasing the thread count if the CPU Utilization is below 95%
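Putting the observations above together, here is a hypothetical sketch of the feedback loop; the names (`TinyHillClimber`, `Update`, the throughput model) are all mine, not the real `ThreadpoolMgr`/`HillClimbing` API. The controller compares the latest throughput sample with the previous one and nudges the recommended thread count up or down by a single thread, mirroring the real pool’s one-thread-at-a-time injection:

```cpp
#include <algorithm>

// Hypothetical, greatly simplified controller: if the last change improved
// throughput, keep climbing (add a thread); otherwise back off (retire one).
class TinyHillClimber {
    double m_lastThroughput = -1.0;   // throughput seen at the previous sample
public:
    int Update(int currentThreads, double throughput) {
        int next;
        if (throughput > m_lastThroughput)
            next = currentThreads + 1;                  // still improving
        else
            next = std::max(1, currentThreads - 1);     // got worse: retire one
        m_lastThroughput = throughput;
        return next;                                    // new target thread count
    }
};
```

Driving this against a toy machine model where throughput rises until the (say) 4 cores are saturated and then falls off due to context-switch overhead, the recommended count climbs to the core count and then oscillates gently around it, which is qualitatively the behaviour shown in the paper’s graphs.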
First let’s look at the graphs that were published in the research paper from Microsoft (Optimizing Concurrency Levels in the .NET ThreadPool):
They clearly show the thread-pool adapting the number of threads (up and down) as the throughput changes, so it appears the algorithm is doing what it promises.
Next is a similar image produced using the self-contained test app I wrote. My test app only pretends to add/remove threads based on the results of the Hill Climbing algorithm, so it’s only an approximation of the real behaviour, but it does provide a nice way to see it in action outside of the CLR.
In this simple scenario, the work-load that we are asking the thread-pool to do is just moving up and then down (click for full-size image):
Finally, we’ll look at what the algorithm does in a more noisy scenario, here the current ‘work load’ randomly jumps around, rather than smoothly changing:
So with a combination of a very detailed MSDN article, an easy-to-read research paper and, most significantly, having the source code available, we are able to get an understanding of what the .NET Thread Pool is doing ‘under-the-hood’!
References
- Concurrency - Throttling Concurrency in the CLR 4.0 ThreadPool (I recommend reading this article before reading the research papers)
- Optimizing Concurrency Levels in the .NET ThreadPool: A case study of controller design and implementation
- Configuring Resource Managers Using Model Fuzzing: A Case Study of the .NET Thread Pool
- MSDN page on ‘Parallel Tasks’ (see section on ‘Thread Injection’)
- Patent US20100083272 - Managing pools of dynamic resources
Further Reading
- Erika Parsons and Eric Eilebrecht - CLR 4 - Inside the Thread Pool - Channel 9
- New and Improved CLR 4 Thread Pool Engine (Work-stealing and Local Queues)
- .NET CLR Thread Pool Internals (compares the new Hill Climbing algorithm, to the previous algorithm used in the Legacy Thread Pool)
- CLR thread pool injection, stuttering problems
- Why the CLR 2.0 SP1’s threadpool default max thread count was increased to 250/CPU
- Use a more dependable policy for thread pool thread injection (CoreCLR GitHub Issue)
- Use a more dependable policy for thread pool thread injection (CoreFX GitHub Issue)
- ThreadPool Growth: Some Important Details
- .NET’s ThreadPool Class - Behind The Scenes (Based on SSCLI source, not CoreCLR)
- CLR Execution Context (in Russian, but Google Translate does a reasonable job)
- Thread Pool + Task Testing (by Ben Adams)
- The Injector: A new Executor for Java (an improved thread-injector for the Java Thread Pool)
Discuss this post on Hacker News and /r/programming
The post The CLR Thread Pool 'Thread Injection' Algorithm first appeared on my blog Performance is a Feature!
CodeProject
Thu, 13 Apr 2017, 12:00 am
The .NET IL Interpreter
Whilst writing a previous blog post I stumbled across the .NET Interpreter, tucked away in the source code. Although, if I’d made even the smallest amount of effort to look for it, I’d have easily found it via the GitHub ‘magic’ file search:
Usage Scenarios
Before we look at how to use it and what it does, it’s worth pointing out that the Interpreter is not really meant for production code. As far as I can tell, its main purpose is to allow you to get the CLR up and running on a new CPU architecture. Without the interpreter you wouldn’t be able to test any C# code until you had a fully functioning JIT that could emit machine code for you. For instance see ‘[ARM32/Linux] Initial bring up of FEATURE_INTERPRETER’ and ‘[aarch64] Enable the interpreter on linux as well’.
Also it’s missing a few key features, most notably debugging support; that is, you can’t debug through C# code that has been interpreted, although you can of course debug the interpreter itself. From ‘Tiered Compilation step 1’:
…. - the interpreter is not in good enough shape to run production code as-is. There are also some significant issues if you want debugging and profiling tools to work (which we do).
You can see an example of this in ‘Interpreter: volatile ldobj appears to have incorrect semantics?’ (thanks to alexrp for telling me about this issue). There are also a fair number of TODO
comments in the code, although I haven’t verified what (if any) specific C# code breaks due to the missing functionality.
However, I think another really useful scenario for the Interpreter is to help you learn about the inner workings of the CLR. It’s only 8,000 lines long, it’s all in one file and, most significantly, it’s written in C++. The code that the CLR/JIT uses when compiling for real is spread across several files (the JIT on its own is over 200,000 L.O.C, spread across 100’s of files) and there are large amounts hand-written in raw assembly.
In theory the Interpreter should work in the same way as the full runtime, albeit not as optimised. This means that it’s much simpler, and those of us who aren’t CLR and/or assembly experts have a chance of working out what’s going on!
Enabling the Interpreter
The Interpreter is disabled by default, so you have to build the CoreCLR from source to make it work (it used to be the fallback for ARM64 but that’s no longer the case); here’s the diff of the changes you need to make:
--- a/src/inc/switches.h
+++ b/src/inc/switches.h
@@ -233,5 +233,8 @@
#define FEATURE_STACK_SAMPLING
#endif // defined (ALLOW_SXS_JIT)
+// Let's test the .NET Interpreter!!
+#define FEATURE_INTERPRETER
+
#endif // !defined(CROSSGEN_COMPILE)
You also need to set some environment variables; the ones that I used are in the table below. For the full list, take a look at Host Configuration Knobs and search for ‘Interpreter’.
| Name | Description |
|------|-------------|
| Interpret | Selectively uses the interpreter to execute the specified methods |
| InterpreterDoLoopMethods | If set, don’t check for loops, start by interpreting all methods |
| InterpreterPrintPostMortem | Prints summary information about the execution to the console |
| DumpInterpreterStubs | Prints all interpreter stubs that are created to the console |
| TraceInterpreterEntries | Logs entries to interpreted methods to the console |
| TraceInterpreterIL | Logs individual instructions of interpreted methods to the console |
| TraceInterpreterVerbose | Logs interpreter progress with detailed messages to the console |
| TraceInterpreterJITTransition | Logs when the interpreter determines a method should be JITted |
To test out the Interpreter, I will be using the code below:
public static void Main(string[] args)
{
var max = 1000 * 1000;
if (args.Length > 0)
int.TryParse(args[0], out max);
var timer = Stopwatch.StartNew();
for (int i = 1; i
Thu, 30 Mar 2017, 12:00 am
A Hitchhikers Guide to the CoreCLR Source Code
photo by Alan O’Rourke
Just over 2 years ago Microsoft open-sourced the entire .NET framework; this post attempts to provide a ‘Hitchhikers Guide’ to the source-code found in the CoreCLR GitHub repository.
To make it easier for you to get to the information you’re interested in, this post is split into several parts
It’s worth pointing out that .NET Developers have provided 2 excellent glossaries, the CoreCLR one and the CoreFX one, so if you come across any unfamiliar terms or abbreviations, check these first. Also there is extensive documentation available and if you are interested in the low-level details I really recommend checking out the ‘Book of the Runtime’ (BotR).
Overall Stats
If you take a look at the repository on GitHub, it shows the following stats for the entire repo
But most of the C# code is test code, so if we just look under /src
(i.e. ignore any code under /tests
) there is the following mix of source file types (i.e. no ‘.txt’, ‘.dat’, etc):
- 2,012 .cpp
- 1,183 .h
- 956 .cs
- 113 .inl
- 98 .hpp
- 51 .S
- 43 .py
- 42 .asm
- 24 .idl
- 20 .c
So by far the majority of the code is written in C++, but there is still also a fair amount of C# code (all under ‘mscorlib’). Clearly there are low-level parts of the CLR that have to be written in C++ or Assembly code because they need to be ‘close to the metal’ or have high performance, but it’s interesting that there are large parts of the runtime written in managed code itself.
Note: All stats/lists in the post were calculated using commit 51a6b5c from the 9th March 2017.
Compared to ‘Rotor’
As a comparison, here’s what the stats for ‘Rotor’, the Shared Source CLI, looked like back in October 2002. Rotor was ‘Shared Source’, not truly ‘Open Source’, so it didn’t have the same community involvement as the CoreCLR.
Note: SSCLI aka ‘Rotor’ includes the fx or base class libraries (BCL), but the CoreCLR doesn’t as they are now hosted separately in the CoreFX GitHub repository
For reference, the equivalent stats for the CoreCLR source in March 2017 look like this:
- Packaged as 61.2 MB .zip archive
- Over 10.8 million lines of code (2.6 million of source code, under \src)
- 24,485 Files (7,466 source)
- 6,626 C# (956 source)
- 2,074 C and C++
- 3,701 IL
- 93 Assembler
- 43 Python
- 6 Perl
- Over 8.2 million lines of test code
- Build output expands to over 1.2 G with tests
- Product binaries 342 MB
- Test binaries 909 MB
Top 10 lists
These lists are mostly just for fun, but they do give some insights into the code-base and how it’s structured.
Top 10 Largest Files
You might have heard about the mammoth source file that is gc.cpp, which is so large that GitHub refuses to display it.
But it turns out it’s not the only large file in the source, there are also several files in the JIT that are around 20K LOC. However it seems that all the large files are C++ source code, so if you’re only interested in C# code, you don’t have to worry!!
| File | # Lines of Code | Type | Location |
|------|-----------------|------|----------|
| gc.cpp | 37,037 | .cpp | \src\gc\ |
| flowgraph.cpp | 24,875 | .cpp | \src\jit\ |
| codegenlegacy.cpp | 21,727 | .cpp | \src\jit\ |
| importer.cpp | 18,680 | .cpp | \src\jit\ |
| morph.cpp | 18,381 | .cpp | \src\jit\ |
| isolationpriv.h | 18,263 | .h | \src\inc\ |
| cordebug.h | 18,111 | .h | \src\pal\prebuilt\inc\ |
| gentree.cpp | 17,177 | .cpp | \src\jit\ |
| debugger.cpp | 16,975 | .cpp | \src\debug\ee\ |
Top 10 Longest Methods
The large methods aren’t actually that hard to find, because they all have #pragma warning(disable:21000)
before them, to keep the compiler happy! There are ~40 large methods in total; here’s the ‘Top 10’:
| Method | # Lines of Code |
|--------|-----------------|
| MarshalInfo::MarshalInfo(Module* pModule, | 1,507 |
| void gc_heap::plan_phase (int condemned_gen_number) | 1,505 |
| void CordbProcess::DispatchRCEvent() | 1,351 |
| void DbgTransportSession::TransportWorker() | 1,238 |
| LPCSTR Exception::GetHRSymbolicName(HRESULT hr) | 1,216 |
| BOOL Disassemble(IMDInternalImport *pImport, BYTE *ILHeader,… | 1,081 |
| bool Debugger::HandleIPCEvent(DebuggerIPCEvent * pEvent) | 1,050 |
| void LazyMachState::unwindLazyState(LazyMachState* baseState… | 901 |
| VOID ParseNativeType(Module* pModule, | 886 |
| VOID StubLinkerCPU::EmitArrayOpStub(const ArrayOpScript* pAr… | 839 |
Top 10 files with the Most Commits
Finally, let's look at which files have been changed the most since the initial commit on GitHub back in January 2015 (ignoring 'merge' commits):
| File | # Commits |
|------|-----------|
| src\jit\morph.cpp | 237 |
| src\jit\compiler.h | 231 |
| src\jit\importer.cpp | 196 |
| src\jit\codegenxarch.cpp | 190 |
| src\jit\flowgraph.cpp | 171 |
| src\jit\compiler.cpp | 161 |
| src\jit\gentree.cpp | 157 |
| src\jit\lower.cpp | 147 |
| src\jit\gentree.h | 137 |
| src\pal\inc\pal.h | 136 |
High-level Overview
Next we’ll take a look at how the source code is structured and what are the main components.
They say “A picture is worth a thousand words”, so below is a treemap with the source code files grouped by colour into the top-level sections they fall under. You can hover over an individual box to get more detailed information and can click on the different radio buttons to toggle the sizing (LOC/Files/Commits)
Notes and Observations
- The ‘# Commits’ only represent the commits made on GitHub, in the 2 1/2 years since the CoreCLR was open-sourced. So they are skewed to the recent work and don’t represent changes made over the entire history of the CLR. However it’s interesting to see which components have had more ‘churn’ in the last few years (i.e ‘jit’) and which have been left alone (e.g. ‘pal’)
- From the number of LOC/files it’s clear to see what the significant components are within the CoreCLR source, e.g ‘vm’, ‘jit’, ‘pal’ & ‘mscorlib’ (these are covered in detail in the next part of this post)
- In the ‘VM’ section it’s interesting to see how much code is generic ~650K LOC and how much is per-CPU architecture 25K LOC for ‘i386’, 16K for ‘amd64’, 14K for ‘arm’ and 7K for ‘arm64’. This suggests that the code is nicely organised so that the per-architecture work is minimised and cleanly separated out.
- It’s surprising (to me) that the ‘GC’ section is as small as it is, I always thought of the GC is a very complex component, but there is way more code in the ‘debugger’ and the ‘pal’.
- Likewise, I never really appreciated the complexity if the ‘JIT’, it’s the 2nd largest component, comprising over 370K LOC.
If you’re interested, this raw numbers for the code under ‘/src’ are available in this gist and for the code under ‘/tests/src’ in this gist.
Deep Dive into Individual Areas
As the source code is well organised, the top-level folders (under /src) correspond to the logical components within the CoreCLR. We’ll start off by looking at the most significant components, i.e. the ‘Debugger’, ‘Garbage Collector’ (GC), ‘Just-in-Time compiler’ (JIT), ‘mscorlib’ (all the C# code), ‘Platform Adaptation Layer’ (PAL) and the CLR ‘Virtual Machine’ (VM).
The ‘mscorlib’ folder contains all the C# code within the CoreCLR, so it’s the place that most C# developers would start looking if they wanted to contribute. For this reason it deserves it’s own treemap, so we can see how it’s structured:
So by far the bulk of the code is at the 'top-level', i.e. directly in the 'System' namespace; this contains the fundamental types that have to exist for the CLR to run, such as:
- `AppDomain`, `WeakReference`, `Type`, `Array`, `Delegate`, `Object`, `String`
- `Boolean`, `Byte`, `Char`, `Int16`, `Int32`, etc
- `Tuple`, `Span`, `ArraySegment`, `Attribute`, `DateTime`
Where possible the CoreCLR is written in C#, because of the benefits that 'managed code' brings, so there is a significant amount of code within the 'mscorlib' section. Note that anything under here is not externally exposed; when you write C# code that runs against the CoreCLR, you actually access everything through the CoreFX, which then type-forwards to the CoreCLR where appropriate.
I don’t know the rules for what lives in CoreCLR v CoreFX, but based on what I’ve read on various GitHub issues, it seems that over time, more and more code is moving from CoreCLR -> CoreFX.
However the managed C# code is often deeply entwined with unmanaged C++, for instance several types are implemented across multiple files, e.g.
From what I understand this is done for performance reasons; any code that is perf sensitive will end up being implemented in C++ (or even Assembly), unless the JIT can suitably optimise the C# code.
Code shared with CoreRT
Recently there has been a significant amount of work done to move more and more code over into the 'shared partition'. This is the area of the CoreCLR source code that is shared with CoreRT ('the .NET Core runtime optimized for AOT compilation'). Because certain classes are implemented in both runtimes, they've ensured that the work isn't duplicated and any fixes are shared in both locations. You can see how this works by looking at the links below:
Other parts of mscorlib
All the other sections of mscorlib line up with namespaces available in the .NET runtime and contain functionality that most C# devs will have used at one time or another. The largest ones in there are shown below (click to go directly to the source code):
- System.Reflection and System.Reflection.Emit - `FieldInfo`, `PropertyInfo`, `MethodInfo`, `AssemblyBuilder`, `TypeBuilder`, `MethodBuilder`, `ILGenerator`
- System.Globalization - `CultureInfo`, `CalendarInfo`, `DateTimeParse`, `JulianCalendar`, `HebrewCalendar`
- System.Threading and System.Threading.Tasks - `Thread`, `Timer`, `Semaphore`, `Mutex`, `AsyncLocal`, `Task`, `Task<T>`, `CancellationToken`
- System.Runtime.CompilerServices and System.Runtime.InteropServices - `Unsafe`, `[CallerFilePath]`, `[CallerLineNumber]`, `[CallerMemberName]`, `GCHandle`, `[LayoutKind]`, `[MarshalAs(..)]`, `[StructLayout(LayoutKind ..)]`
- System.Diagnostics - `Assert`, `Debugger`, `StackTrace`
- System.Text - `StringBuilder`, `ASCIIEncoding`, `UTF8Encoding`, `UnicodeEncoding`
- System.Collections
- System.Collections.Generic
- System.IO - `Stream`, `MemoryStream`, `File`, `TextReader`, `TextWriter`
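As a small illustration of two of those APIs (the caller-info attributes and `StringBuilder`), here's a hedged sketch; `CallerInfoDemo` and `Describe` are made-up names, not anything from mscorlib:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Text;

public class CallerInfoDemo
{
    // The C# compiler fills in the optional parameters at each call site,
    // so no stack-walking or Reflection happens at runtime.
    public static string Describe(string message,
                                  [CallerMemberName] string member = "",
                                  [CallerLineNumber] int line = 0)
    {
        return new StringBuilder()
            .Append(member).Append(':').Append(line)
            .Append(" - ").Append(message)
            .ToString();
    }

    public static void Main()
    {
        // Prints something like "Main:24 - hello" (member name and line
        // number are baked in by the compiler, not looked up at runtime)
        Console.WriteLine(Describe("hello"));
    }
}
```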
The VM, not surprisingly, is the largest component of the CoreCLR, with over 640K L.O.C spread across 576 files, and it contains the guts of the runtime. The bulk of the code is OS and CPU independent and written in C++, however there is also a significant amount of architecture-specific assembly code, see the section ‘CPU Architecture-specific code’ for more info.
The VM contains the main start-up routine of the entire runtime EEStartupHelper()
in ceemain.cpp, see ‘The 68 things the CLR does before executing a single line of your code’ for all the details. In addition it provides the following functionality:
- Type System
- Loading types/classes
- Threading
- Exception Handling and Stack Walking
- Fundamental Types
- Generics
- An entire Interpreter (yes .NET can run interpreted!!)
- Function calling mechanisms (see BotR for more info)
- Stubs (used for virtual dispatch and delegates amongst other things)
- Event Tracing
- Profiler
- P/Invoke
- Reflection
CPU Architecture-specific code
All the architecture-specific code is kept separately in several sub-folders, amd64, arm, arm64 and i386. For example here’s the various implementations of the WriteBarrier
function used by the GC:
Before we look at the actual source code, it's worth looking at the different 'flavours' of the JIT that are available:
Fortunately one of the Microsoft developers has clarified which one should be used:
Here’s my guidance on how non-MS contributors should think about contributing to the JIT: If you want to help advance the state of the production code-generators for .NET, then contribute to the new RyuJIT x86/ARM32 backend. This is our long term direction. If instead your interest is around getting the .NET Core runtime working on x86 or ARM32 platforms to do other things, by all means use and contribute bug fixes if necessary to the LEGACY_BACKEND paths in the RyuJIT code base today to unblock yourself. We do run testing on these paths today in our internal testing infrastructure and will do our best to avoid regressing it until we can replace it with something better. We just want to make sure that there will be no surprises or hard feelings for when the time comes to remove them from the code-base.
JIT Phases
The JIT has almost 90 source files, but fortunately they correspond to the different phases it goes through, so it's not too hard to find your way around. Using the table from 'Phases of RyuJIT', I added the right-hand column so you can jump to the relevant source file(s):
| Phase | IR Transformations | File |
|-------|--------------------|------|
| Pre-import | `Compiler->lvaTable` created and filled in for each user argument and variable. BasicBlock list initialized. | compiler.hpp |
| Importation | `GenTree` nodes created and linked in to Statements, and Statements into BasicBlocks. Inlining candidates identified. | importer.cpp |
| Inlining | The IR for inlined methods is incorporated into the flowgraph. | inline.cpp and inlinepolicy.cpp |
| Struct Promotion | New lvlVars are created for each field of a promoted struct. | morph.cpp |
| Mark Address-Exposed Locals | lvlVars with references occurring in an address-taken context are marked. This must be kept up-to-date. | compiler.hpp |
| Morph Blocks | Performs localized transformations, including mandatory normalization as well as simple optimizations. | morph.cpp |
| Eliminate Qmarks | All `GT_QMARK` nodes are eliminated, other than simple ones that do not require control flow. | compiler.cpp |
| Flowgraph Analysis | `BasicBlock` predecessors are computed, and must be kept valid. Loops are identified, and normalized, cloned and/or unrolled. | flowgraph.cpp |
| Normalize IR for Optimization | lvlVar reference counts are set, and must be kept valid. Evaluation order of `GenTree` nodes (`gtNext`/`gtPrev`) is determined, and must be kept valid. | compiler.cpp and lclvars.cpp |
| SSA and Value Numbering Optimizations | Computes liveness (`bbLiveIn` and `bbLiveOut` on `BasicBlocks`), and dominators. Builds SSA for tracked lvlVars. Computes value numbers. | liveness.cpp |
| Loop Invariant Code Hoisting | Hoists expressions out of loops. | optimizer.cpp |
| Copy Propagation | Copy propagation based on value numbers. | copyprop.cpp |
| Common Subexpression Elimination (CSE) | Elimination of redundant subexpressions based on value numbers. | optcse.cpp |
| Assertion Propagation | Utilizes value numbers to propagate and transform based on properties such as non-nullness. | assertionprop.cpp |
| Range analysis | Eliminate array index range checks based on value numbers and assertions. | rangecheck.cpp |
| Rationalization | Flowgraph order changes from `FGOrderTree` to `FGOrderLinear`. All `GT_COMMA`, `GT_ASG` and `GT_ADDR` nodes are transformed. | rationalize.cpp |
| Lowering | Register requirements are fully specified (`gtLsraInfo`). All control flow is explicit. | lower.cpp, lowerarm.cpp, lowerarm64.cpp and lowerxarch.cpp |
| Register allocation | Registers are assigned (`gtRegNum` and/or `gtRsvdRegs`), and the number of spill temps calculated. | regalloc.cpp and register_arg_convention.cpp |
| Code Generation | Determines frame layout. Generates code for each `BasicBlock`. Generates prolog & epilog code for the method. Emits EH, GC and Debug info. | codegenarm.cpp, codegenarm64.cpp, codegencommon.cpp, codegenlegacy.cpp, codegenlinear.cpp and codegenxarch.cpp |
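Some of these phases can be observed from the C# side even though the work happens inside the JIT. For instance, 'Range analysis' is the phase that lets the JIT drop per-element array bounds checks when it can prove the index is safe. A hedged sketch (`RangeCheckDemo` is an illustrative name; the elimination itself is only visible in the generated assembly, not in the C#):

```csharp
using System;

public class RangeCheckDemo
{
    // The JIT's range analysis can prove 0 <= i < data.Length here,
    // so the per-element array bounds check can be eliminated.
    public static int SumFast(int[] data)
    {
        int sum = 0;
        for (int i = 0; i < data.Length; i++)
            sum += data[i];
        return sum;
    }

    // Here the JIT can't prove count <= data.Length, so each access
    // generally keeps its bounds check.
    public static int SumWithChecks(int[] data, int count)
    {
        int sum = 0;
        for (int i = 0; i < count; i++)
            sum += data[i];
        return sum;
    }

    public static void Main()
    {
        var data = new[] { 1, 2, 3, 4 };
        Console.WriteLine(SumFast(data));          // 10
        Console.WriteLine(SumWithChecks(data, 4)); // 10
    }
}
```

Both methods compute the same result; the difference only shows up in the machine code the JIT emits.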
The PAL provides an OS independent layer to give access to common low-level functionality such as:
As .NET was originally written to run on Windows, all the APIs look very similar to the Win32 APIs. However for non-Windows platforms they are actually implemented using the functionality available on that OS. For example this is what PAL code to read/write a file looks like:
int main(int argc, char *argv[])
{
    WCHAR src[4] = {'f', 'o', 'o', '\0'};
    WCHAR dest[4] = {'b', 'a', 'r', '\0'};
    WCHAR dir[5] = {'/', 't', 'm', 'p', '\0'};
    HANDLE h;
    DWORD b;
    PAL_Initialize(argc, (const char**)argv);
    SetCurrentDirectoryW(dir);
    h = CreateFileW(src, GENERIC_WRITE, FILE_SHARE_READ, NULL, CREATE_NEW, 0, NULL);
    WriteFile(h, "Testing\n", 8, &b, NULL);
    CloseHandle(h);
    CopyFileW(src, dest, FALSE);
    DeleteFileW(src);
    PAL_Terminate();
    return 0;
}
The PAL does contain some per-CPU assembly code, but it’s only for very low-level functionality, for instance here’s the different implementations of the DebugBreak
function:
The GC is clearly a very complex piece of code, lying right at the heart of the CLR, so for more information about what it does I recommend reading the BotR entry on ‘Garbage Collection Design’ and if you’re interested I’ve also written several blog posts looking at its functionality.
However from a source code point-of-view the GC is pretty simple, it’s spread across just 19 .cpp files, but the bulk of the work is in gc.cpp (raw version) all ~37K L.O.C of it!!
If you want to get deeper into the GC code (warning, it’s pretty dense), a good way to start is to search for the occurrences of various ETW
events that are fired as the GC moves through the phases outlined in the BotR post above, these events are listed below:
- `FireEtwGCTriggered(..)`
- `FireEtwGCAllocationTick_V1(..)`
- `FireEtwGCFullNotify_V1(..)`
- `FireEtwGCJoin_V2(..)`
- `FireEtwGCMarkWithType(..)`
- `FireEtwGCPerHeapHistory_V3(..)`
- `FireEtwGCGlobalHeapHistory_V2(..)`
- `FireEtwGCCreateSegment_V1(..)`
- `FireEtwGCFreeSegment_V1(..)`
- `FireEtwBGCAllocWaitBegin(..)`
- `FireEtwBGCAllocWaitEnd(..)`
- `FireEtwBGCDrainMark(..)`
- `FireEtwBGCRevisit(..)`
- `FireEtwBGCOverflow(..)`
- `FireEtwPinPlugAtGCTime(..)`
- `FireEtwGCCreateConcurrentThread_V1(..)`
- `FireEtwGCTerminateConcurrentThread_V1(..)`
But the GC doesn’t work in isolation, it also requires help from the Execute Engine (EE), this is done via the GCToEEInterface
which is implemented in gcenv.ee.cpp.
Local GC and GC Sample
Finally, there are 2 other ways you can get into the GC code and understand what it does.
Firstly there is a GC sample that lets you use the full GC independent of the rest of the runtime. It shows you how to 'create type layout information in format that the GC expects', 'implement fast object allocator and write barrier' and 'allocate objects and work with GC handles', all in under 250 LOC!!
Also worth mentioning is the 'Local GC' project, which is an ongoing effort to decouple the GC from the rest of the runtime, they even have a dashboard so you can track its progress. Currently the GC code is too intertwined with the runtime and vice versa, so 'Local GC' is aiming to break that link by providing a set of clear interfaces, GCToOSInterface
and GCToEEInterface
. This will help with the CoreCLR cross-platform efforts, making the GC easier to port to new OSes.
The CLR is a ‘managed runtime’ and one of the significant components it provides is a advanced debugging experience, via Visual Studio or WinDBG. This debugging experience is very complex and I’m not going to go into it in detail here, however if you want to learn more I recommend you read ‘Data Access Component (DAC) Notes’.
But what does the source look like, how is it laid out? Well the a several main sub-components under the top-level /debug
folder:
- daccess - this provides the 'Data Access Component' (DAC) functionality as outlined in the BotR page linked to above. The DAC is an abstraction layer over the internal structures in the runtime, which the debugger uses to inspect objects/classes
- di - this contains the exposed APIs (or entry points) of the debugger, implemented by
CoreCLRCreateCordbObject(..)
in cordb.cpp
- ee - the section of debugger that works with the Execution Engine (EE) to do things like stack-walking
- inc - all the interfaces (.h) files that the debugger components implement
All the rest
As well as the main components, there are various other top-level folders in the source, the full list is below:
- binder
- The ‘binder’ is responsible for loading assemblies within a .NET program (except the mscorlib binder which is elsewhere). The ‘binder’ comprises low-level code that controls Assemblies, Application Contexts and the all-important Fusion Log for diagnosing why assemblies aren’t loading!
- classlibnative
- Code for native implementations of many of the core data types in the CoreCLR, e.g. Arrays, System.Object, String, decimal, float and double.
- Also includes all the native methods exposed in the ‘System.Environment’ namespace, e.g.
Environment.ProcessorCount
, Environment.TickCount
, Environment.GetCommandLineArgs()
, Environment.FailFast()
, etc
- coreclr
- corefx
- dlls
- gcdump and gcinfo
- Code that will write-out the
GCInfo
that is produced by the JIT to help the GC do its job. This GCInfo
includes information about the ‘liveness’ of variables within a section of code and whether the method is fully or partially interruptible, which enables the EE to suspend methods when the GC is working.
- ilasm
- IL (Intermediate Language) Assembler is a tool for converting IL code into a .NET executable, see the MSDN page for more info and usage examples.
- ildasm
- Tool for disassembling a .NET executable into the corresponding IL source code, again, see the MSDN page for info and usage examples.
- inc
- Header files that define the ‘interfaces’ between the sub-components that make up the CoreCLR. For example corjit.h covers all communication between the Execution Engine (EE) and the JIT, that is ‘EE -> JIT’ and corinfo.h is the interface going the other way, i.e. ‘JIT -> EE’
- ipcman
- Code that enables the ‘Inter-Process Communication’ (IPC) used in .NET (mostly legacy and probably not cross-platform)
- md
- The MetaData (md) code provides the ability to gather information about methods, classes, types and assemblies and is what makes Reflection possible.
- nativeresources
- A simple tool that is responsible for converting/extracting resources from a Windows Resource File.
- palrt
- The PAL (Platform Adaptation Layer) Run-Time, contains specific parts of the PAL layer.
- scripts
- Several Python scripts for auto-generating various files in the source (e.g. ETW events).
- strongname
- ToolBox
- Contains 2 stand-alone tools
- SOS (son-of-strike) the CLR debugging extension that enables reporting of .NET specific information when using WinDBG
- SuperPMI which enables testing of the JIT without requiring the full Execution Engine (EE)
- tools
- unwinder
- Provides the low-level functionality to make it possible for the debugger and exception handling components to walk or unwind the stack. This is done via 2 functions,
GetModuleBase(..)
and GetFunctionEntry(..)
which are implemented in CPU architecture-specific code, see amd64, arm, arm64 and i386
- utilcode
- Shared utility code that is used by the VM, Debugger and JIT
- zap
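To tie one of these components back to everyday C#: the metadata managed by the 'md' component is exactly what Reflection reads. A minimal sketch (`MetadataDemo` is an illustrative name):

```csharp
using System;
using System.Reflection;

public class MetadataDemo
{
    public static void Main()
    {
        // Everything Reflection reports here - method names, signatures,
        // return types - is read from the metadata tables in the assembly.
        MethodInfo mi = typeof(string).GetMethod("Contains", new[] { typeof(string) });
        Console.WriteLine(mi.Name);             // Contains
        Console.WriteLine(mi.ReturnType.Name);  // Boolean
    }
}
```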
If you’ve read this far ‘So long and thanks for all the fish’ (YouTube)
Discuss this post on Hacker News and /r/programming
Thu, 23 Mar 2017, 12:00 am
The 68 things the CLR does before executing a single line of your code (*)
Because the CLR is a managed environment there are several components within the runtime that need to be initialised before any of your code can be executed. This post will take a look at the EE (Execution Engine) start-up routine and examine the initialisation process in detail.
(*) 68 is only a rough guide, it depends on which version of the runtime you are using, which features are enabled and a few other things
‘Hello World’
Imagine you have the simplest possible C# program, what has to happen before the CLR prints ‘Hello World’ out to the console?
using System;
namespace ConsoleApplication
{
public class Program
{
public static void Main(string[] args)
{
Console.WriteLine("Hello World!");
}
}
}
The code path into the EE (Execution Engine)
When a .NET executable runs, control gets into the EE via the following code path:
- _CorExeMain() (the external entry point)
- _CorExeMainInternal()
- EnsureEEStarted()
- EEStartup()
- EEStartupHelper()
(if you’re interested in what happens before this, i.e. how a CLR Host can start-up the runtime, see my previous post ‘How the dotnet CLI tooling runs your code’)
And so we end up in EEStartupHelper()
, which at a high-level does the following (from a comment in ceemain.cpp):
EEStartup is responsible for all the one time initialization of the runtime.
Some of the highlights of what it does include
- Creates the default and shared, appdomains.
- Loads mscorlib.dll and loads up the fundamental types (System.Object …)
The main phases in EE (Execution Engine) start-up routine
But let’s look at what it does in detail, the lists below contain all the individual function calls made from EEStartupHelper() (~500 L.O.C). To make them easier to understand, we’ll split them up into separate phases:
- Phase 1 - Set-up the infrastructure that needs to be in place before anything else can run
- Phase 2 - Initialise the core, low-level components
- Phase 3 - Start-up the low-level components, i.e. error handling, profiling API, debugging
- Phase 4 - Start the main components, i.e. Garbage Collector (GC), AppDomains, Security
- Phase 5 - Final setup and then notify other components that the EE has started
Note some items in the list below are only included if a particular feature is defined at build-time; these are indicated by the inclusion of an ifdef
statement. Also note that the links take you to the code for the function being called, not the line of code within EEStartupHelper()
.
Phase 1 - Set-up the infrastructure that needs to be in place before anything else can run
- Wire-up console handling - SetConsoleCtrlHandler(..) (
ifndef FEATURE_PAL
)
- Initialise the internal
SString
class (everything uses strings!) - SString::Startup()
- Make sure the configuration is set-up, so settings that control run-time options can be accessed - EEConfig::Setup() and InitializeHostConfigFile() (
#if !defined(CROSSGEN_COMPILE)
)
- Initialize Numa and CPU group information - NumaNodeInfo::InitNumaNodeInfo() and CPUGroupInfo::EnsureInitialized() (
#ifndef CROSSGEN_COMPILE
)
- Initialize global configuration settings based on startup flags - InitializeStartupFlags()
- Set-up the Thread Manager that gives the runtime access to the OS threading functionality (
StartThread()
, Join()
, SetThreadPriority()
etc) - InitThreadManager()
- Initialize Event Tracing (ETW) and fire off the CLR startup events - InitializeEventTracing() and ETWFireEvent(EEStartupStart_V1) (
#ifdef FEATURE_EVENT_TRACE
)
- Set-up the GS Cookie (Buffer Security Check) to help prevent buffer overruns - InitGSCookie()
- Create the data-structures needed to hold the ‘frames’ used for stack-traces - Frame::Init()
- Ensure initialization of Apphacks environment variables - GetGlobalCompatibilityFlags() (
#ifndef FEATURE_CORECLR
)
- Create the diagnostic and performance logs used by the runtime - InitializeLogging() (
#ifdef LOGGING
) and PerfLog::PerfLogInitialize() (#ifdef ENABLE_PERF_LOG
)
Phase 2 - Initialise the core, low-level components
- Write to the log
===================EEStartup Starting===================
- Ensure that the Runtime Library functions (that interact with ntdll.dll) are enabled - EnsureRtlFunctions() (
#ifndef FEATURE_PAL
)
- Set-up the global store for events (mutexes, semaphores) used for synchronisation within the runtime - InitEventStore()
- Create the Assembly Binding logging mechanism a.k.a Fusion - InitializeFusion() (
#ifdef FEATURE_FUSION
)
- Then initialize the actual Assembly Binder infrastructure - CCoreCLRBinderHelper::Init() which in turn calls AssemblyBinder::Startup() (
#ifdef FEATURE_FUSION
is NOT defined)
- Set-up the heuristics used to control Monitors, Crsts, and SimpleRWLocks - InitializeSpinConstants()
- Initialize the InterProcess Communication with COM (IPC) - InitializeIPCManager() (
#ifdef FEATURE_IPCMAN
)
- Set-up and enable Performance Counters - PerfCounters::Init() (
#ifdef ENABLE_PERF_COUNTERS
)
- Set-up the CLR interpreter - Interpreter::Initialize() (
#ifdef FEATURE_INTERPRETER
), it turns out that the CLR has a mode where your code is interpreted instead of compiled!
- Initialise the stubs that are used by the CLR for calling methods and triggering the JIT - StubManager::InitializeStubManagers(), also Stub::Init() and StubLinkerCPU::Init()
- Set up the core handle map, used to load assemblies into memory - PEImage::Startup()
- Startup the access checks options, used for granting/denying security demands on method calls - AccessCheckOptions::Startup()
- Startup the mscorlib binder (used for loading “known” types from mscorlib.dll) - MscorlibBinder::Startup()
- Initialize remoting, which allows out-of-process communication - CRemotingServices::Initialize() (
#ifdef FEATURE_REMOTING
)
- Set-up the data structures used by the GC for weak, strong and no-pin references - Ref_Initialize()
- Set-up the contexts used to proxy method calls across App Domains - Context::Initialize()
- Wire-up events that allow the EE to synchronise shut-down -
g_pEEShutDownEvent->CreateManualEvent(FALSE)
- Initialise the process-wide data structures used for reader-writer lock implementation - CRWLock::ProcessInit() (
#ifdef FEATURE_RWLOCK
)
- Initialize the debugger manager - CCLRDebugManager::ProcessInit() (
#ifdef FEATURE_INCLUDE_ALL_INTERFACES
)
- Initialize the CLR Security Attribute Manager - CCLRSecurityAttributeManager::ProcessInit() (
#ifdef FEATURE_IPCMAN
)
- Set-up the manager for Virtual call stubs - VirtualCallStubManager::InitStatic()
- Initialise the lock that the GC uses when controlling memory pressure - GCInterface::m_MemoryPressureLock.Init(CrstGCMemoryPressure)
- Initialize Assembly Usage Logger - InitAssemblyUsageLogManager() (
#ifndef FEATURE_CORECLR
)
Phase 3 - Start-up the low-level components, i.e. error handling, profiling API, debugging
- Set-up the App Domains used by the CLR - SystemDomain::Attach() (also creates the DefaultDomain and the SharedDomain by calling SystemDomain::CreateDefaultDomain() and SharedDomain::Attach())
- Start up the ECall interface, a private native calling interface used within the CLR - ECall::Init()
- Set-up the caches for the stubs used by
delegates
- COMDelegate::Init()
- Set-up all the global/static variables used by the EE itself - ExecutionManager::Init()
- Initialise Watson, for windows error reporting - InitializeWatson(fFlags) (
#ifndef FEATURE_PAL
)
- Initialize the debugging services, this must be done before any EE thread objects are created, and before any classes or modules are loaded - InitializeDebugger() (
#ifdef DEBUGGING_SUPPORTED
)
- Activate the Managed Debugging Assistants that the CLR provides - ManagedDebuggingAssistants::EEStartupActivation() (
ifdef MDA_SUPPORTED
)
- Initialise the Profiling API - ProfilingAPIUtility::InitializeProfiling() (
#ifdef PROFILING_SUPPORTED
)
- Initialise the exception handling mechanism - InitializeExceptionHandling()
- Install the CLR global exception filter - InstallUnhandledExceptionFilter()
- Ensure that the initial runtime thread is created - SetupThread() in turn calls SetupThread(..)
- Initialise the PreStub manager (PreStub’s trigger the JIT) - InitPreStubManager() and the corresponding helpers StubHelpers::Init()
- Initialise the COM Interop layer - InitializeComInterop() (
#ifdef FEATURE_COMINTEROP
)
- Initialise NDirect method calls (lazy binding of unmanaged P/Invoke targets) - NDirect::Init()
- Set-up the JIT Helper functions, so they are in place before the execution manager runs - InitJITHelpers1() and InitJITHelpers2()
- Initialise and set-up the SyncBlock cache - SyncBlockCache::Attach() and SyncBlockCache::Start()
- Create the cache used when walking/unwinding the stack - StackwalkCache::Init()
Phase 4 - Start the main components, i.e. Garbage Collector (GC), AppDomains, Security
- Start up security system, that handles Code Access Security (CAS) - Security::Start() which in turn calls SecurityPolicy::Start()
- Wire-up an event to allow synchronisation of AppDomain unloads - AppDomain::CreateADUnloadStartEvent()
- Initialise the ‘Stack Probes’ used to setup stack guards InitStackProbes() (
#ifdef FEATURE_STACK_PROBE
)
- Initialise the GC and create the heaps that it uses - InitializeGarbageCollector()
- Initialise the tables used to hold the locations of pinned objects - InitializePinHandleTable()
- Inform the debugger about the DefaultDomain, so it can interact with it - SystemDomain::System()->PublishAppDomainAndInformDebugger(..) (
#ifdef DEBUGGING_SUPPORTED
)
- Initialise the existing OOB Assembly List (no idea?) - ExistingOobAssemblyList::Init() (
#ifndef FEATURE_CORECLR
)
- Actually initialise the System Domain (which contains mscorlib), so that it can start executing - SystemDomain::System()->Init()
Phase 5 Final setup and then notify other components that the EE has started
- Tell the profiler we’ve stated up - SystemDomain::NotifyProfilerStartup() (
#ifdef PROFILING_SUPPORTED
)
- Pre-create a thread to handle AppDomain unloads - AppDomain::CreateADUnloadWorker() (
#ifndef CROSSGEN_COMPILE
)
- Set a flag to confirm that ‘initialisation’ of the EE succeeded -
g_fEEInit = false
- Load the System Assemblies (‘mscorlib’) into the Default Domain - SystemDomain::System()->DefaultDomain()->LoadSystemAssemblies()
- Set-up all the shared static variables (and
String.Empty
) in the Default Domain - SystemDomain::System()->DefaultDomain()->SetupSharedStatics(), they are all contained in the internal class SharedStatics.cs
- Set-up the stack sampler feature, that identifies ‘hot’ methods in your code - StackSampler::Init() (
#ifdef FEATURE_STACK_SAMPLING
)
- Perform any once-only SafeHandle initialization - SafeHandle::Init() (
#ifndef CROSSGEN_COMPILE
)
- Set flags to indicate that the CLR has successfully started -
g_fEEStarted = TRUE
, g_EEStartupStatus = S_OK
and hr = S_OK
- Write to the log
===================EEStartup Completed===================
Once this is all done, the CLR is now ready to execute your code!!
Executing your code
Your code will be executed (after first being ‘JITted’) via the following code flow:
- CorHost2::ExecuteAssembly()
- Assembly::ExecuteMainMethod()
- RunMain() (in assembly.cpp)
The CLR provides a huge amount of log information if you create a debug build and then enable the right environment variables. The links below take you to the various logs produced when running a simple 'hello world' program (shown at the top of this post), they give you a pretty good idea of the different things that the CLR is doing behind-the-scenes.
The post The 68 things the CLR does before executing a single line of your code (*) first appeared on my blog Performance is a Feature!
Tue, 7 Feb 2017, 12:00 am
How do .NET delegates work?
Delegates are a fundamental part of the .NET runtime and whilst you rarely create them directly, they are there under-the-hood every time you use a lambda in LINQ (=>
) or a Func
/Action
to make your code more functional. But how do they actually work and what's going on in the CLR when you use them?
IL of delegates and/or lambdas
Let’s start with a small code sample like this:
public delegate string SimpleDelegate(int x);
class DelegateTest
{
static int Main()
{
// create an instance of the class
DelegateTest instance = new DelegateTest();
instance.name = "My instance";
// create a delegate
SimpleDelegate d1 = new SimpleDelegate(instance.InstanceMethod);
// call 'InstanceMethod' via the delegate (compiler turns this into 'd1.Invoke(5)')
string result = d1(5); // returns "My instance: 5"
}
string InstanceMethod(int i)
{
return string.Format("{0}: {1}", name, i);
}
}
If you were to take a look at the IL of the SimpleDelegate
class, the ctor
and Invoke
methods look like so:
[MethodImpl(0, MethodCodeType=MethodCodeType.Runtime)]
public SimpleDelegate(object @object, IntPtr method);
[MethodImpl(0, MethodCodeType=MethodCodeType.Runtime)]
public virtual string Invoke(int x);
It turns out that this behaviour is mandated by the spec, from ECMA 335 Standard - Common Language Infrastructure (CLI):
So the internal implementation of a delegate, the part responsible for calling a method, is created by the runtime. This is because there needs to be complete control over those methods, delegates are a fundamental part of the CLR, any security issues, performance overhead or other inefficiencies would be a big problem.
Methods that are created in this way are technically known as EEImpl
methods (i.e. implemented by the 'Execution Engine'), from the 'Book of the Runtime' (BOTR) section 'Method Descriptor - Kinds of MethodDescs':
EEImpl
Delegate methods whose implementation is provided by the runtime (Invoke, BeginInvoke, EndInvoke). See ECMA 335 Partition II - Delegates.
There’s also more information available in these two excellent articles .NET Type Internals - From a Microsoft CLR Perspective (section on ‘Delegates’) and Understanding .NET Delegates and Events, By Practice (section on ‘Internal Delegates Representation’)
How the runtime creates delegates
Inlining of delegate ctors
So we’ve seen that the runtime has responsibility for creating the bodies of delegate methods, but how is this done? It starts by wiring up the delegate constructor (ctor), as per the BOTR page on ‘method descriptors’:
FCall
Internal methods implemented in unmanaged code. These are methods marked with MethodImplAttribute(MethodImplOptions.InternalCall) attribute, delegate constructors and tlbimp constructors.
At runtime this happens when the JIT compiles a method that contains IL code for creating a delegate. In Compiler::fgOptimizeDelegateConstructor(..), the JIT firstly obtains a reference to the correct delegate ctor, which in the simple case is CtorOpened(Object target, IntPtr methodPtr, IntPtr shuffleThunk) (link to C# code), before finally wiring up the ctor, inlining it if possible for maximum performance.
Creation of the delegate Invoke() method
But what’s more interesting is the process that happens when creating the Invoke() method, using a technique involving ‘stubs’ of code (raw assembly) that know how to locate the information about the target method and can jump control to it. These ‘stubs’ are actually used in a wide variety of scenarios, for instance during Virtual Method Dispatch and also by the JITter (when a method is first called it hits a ‘pre-code stub’ that causes the method to be JITted, the ‘stub’ is then replaced by a call to the JITted ‘native code’).
In the particular case of delegates, these stubs are referred to as ‘shuffle thunks’. This is because part of the work they have to do is to ‘shuffle’ the arguments that are passed into the Invoke() method, so that they are in the correct place (stack/register) by the time the ‘target’ method is called.
To understand what’s going on, it’s helpful to look at the following diagram taken from the BOTR page on Method Descriptors and Precode stubs. The ‘shuffle thunks’ we are discussing are a particular case of a ‘stub’ and sit in the corresponding box in the diagram:
How ‘shuffle thunks’ are set-up
So let’s look at the code flow for the delegate we created in the sample at the beginning of this post, specifically an ‘open’ delegate, calling an instance method (if you are wondering about the difference between open and closed delegates, have a read of ‘Open Delegates vs. Closed Delegates’).
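To make the open/closed distinction concrete, here is a small sketch (my own illustration, not code from the original post) that creates both kinds of delegate against the InstanceMethod from the sample at the top, using Delegate.CreateDelegate(..):

```csharp
using System;

public delegate string SimpleDelegate(int x);                    // 'closed' over an instance
public delegate string OpenDelegate(DelegateTest target, int x); // 'open', takes 'this' explicitly

public class DelegateTest
{
    public string name;
    public string InstanceMethod(int i) { return string.Format("{0}: {1}", name, i); }

    public static void Main()
    {
        var instance = new DelegateTest { name = "My instance" };

        // Closed delegate: 'instance' is captured in the delegate's _target field,
        // so the caller only supplies the int argument
        var closed = (SimpleDelegate)Delegate.CreateDelegate(
            typeof(SimpleDelegate), instance, "InstanceMethod");
        Console.WriteLine(closed(5));         // "My instance: 5"

        // Open delegate: no target is captured, the caller passes the 'this'
        // pointer explicitly as the first argument - this is the case that
        // needs a 'shuffle thunk' (row #2 of the 'DELEGATE KINDS TABLE' later on)
        var open = (OpenDelegate)Delegate.CreateDelegate(
            typeof(OpenDelegate), typeof(DelegateTest).GetMethod("InstanceMethod"));
        Console.WriteLine(open(instance, 7)); // "My instance: 7"
    }
}
```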
We start off in the impImportCall() method, deep inside the .NET JIT, triggered when a ‘call’ op-code for a delegate is encountered; it then goes through the following functions:
- Compiler::impImportCall(..)
- Compiler::fgOptimizeDelegateConstructor(..)
- COMDelegate::GetDelegateCtor(..)
- COMDelegate::SetupShuffleThunk
- StubCacheBase::Canonicalize(..)
- ShuffleThunkCache::CompileStub()
- EmitShuffleThunk (specific assembly code for different CPU architectures)
Below is the code from the arm64 version (chosen because it’s the shortest one of the three!). You can see that it emits assembly code to fetch the real target address from MethodPtrAux, loops through the method arguments and puts them in the correct register (i.e. ‘shuffles’ them into place) and finally emits a tail-call jump to the target method associated with the delegate.
VOID StubLinkerCPU::EmitShuffleThunk(ShuffleEntry *pShuffleEntryArray)
{
// On entry x0 holds the delegate instance. Look up the real target address stored in the MethodPtrAux
// field and save it in x9. Tailcall to the target method after re-arranging the arguments
// ldr x9, [x0, #offsetof(DelegateObject, _methodPtrAux)]
EmitLoadStoreRegImm(eLOAD, IntReg(9), IntReg(0), DelegateObject::GetOffsetOfMethodPtrAux());
//add x11, x0, DelegateObject::GetOffsetOfMethodPtrAux() - load the indirection cell into x11 used by ResolveWorkerAsmStub
EmitAddImm(IntReg(11), IntReg(0), DelegateObject::GetOffsetOfMethodPtrAux());
for (ShuffleEntry* pEntry = pShuffleEntryArray; pEntry->srcofs != ShuffleEntry::SENTINEL; pEntry++)
{
if (pEntry->srcofs & ShuffleEntry::REGMASK)
{
// If source is present in register then destination must also be a register
_ASSERTE(pEntry->dstofs & ShuffleEntry::REGMASK);
EmitMovReg(IntReg(pEntry->dstofs & ShuffleEntry::OFSMASK), IntReg(pEntry->srcofs & ShuffleEntry::OFSMASK));
}
else if (pEntry->dstofs & ShuffleEntry::REGMASK)
{
// source must be on the stack
_ASSERTE(!(pEntry->srcofs & ShuffleEntry::REGMASK));
EmitLoadStoreRegImm(eLOAD, IntReg(pEntry->dstofs & ShuffleEntry::OFSMASK), RegSp, pEntry->srcofs * sizeof(void*));
}
else
{
// source must be on the stack
_ASSERTE(!(pEntry->srcofs & ShuffleEntry::REGMASK));
// dest must be on the stack
_ASSERTE(!(pEntry->dstofs & ShuffleEntry::REGMASK));
EmitLoadStoreRegImm(eLOAD, IntReg(8), RegSp, pEntry->srcofs * sizeof(void*));
EmitLoadStoreRegImm(eSTORE, IntReg(8), RegSp, pEntry->dstofs * sizeof(void*));
}
}
// Tailcall to target
// br x9
EmitJumpRegister(IntReg(9));
}
Other functions that call SetupShuffleThunk(..)
The other places in code that also emit these ‘shuffle thunks’ are listed below. They are used in the various scenarios where a delegate is explicitly created, e.g. via Delegate.CreateDelegate(..).
Different types of delegates
Now that we’ve looked at how one type of delegate works (#2 ‘Instance open non-virt’ in the table below), it will be helpful to see the other different types that the runtime deals with. From the very informative DELEGATE KINDS TABLE in the CLR source:
| # | delegate type | `_target` | `_methodPtr` | `_methodPtrAux` |
|---|---------------|-----------|--------------|-----------------|
| 1 | Instance closed | ‘this’ ptr | target method | null |
| 2 | Instance open non-virt | delegate | shuffle thunk | target method |
| 3 | Instance open virtual | delegate | Virtual-stub dispatch | method id |
| 4 | Static closed | first arg | target method | null |
| 5 | Static closed (special sig) | delegate | specialSig thunk | target method |
| 6 | Static opened | delegate | shuffle thunk | target method |
| 7 | Secure | delegate | call thunk | MethodDesc (frame) |

Note: The columns map to the internal fields of a delegate (from System.Delegate)
So we’ve (deliberately) looked at the simple case, but the more complex scenarios all work along similar lines, just using different and more stubs/thunks as needed e.g. ‘virtual-stub dispatch’ or ‘call thunk’.
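Two of these layouts are observable from ordinary C# via the public Target and Method properties (which surface the internal _target field). A quick sketch of my own to illustrate rows #1 and #4:

```csharp
using System;

class DelegateFieldsDemo
{
    static string StaticMethod(int x) { return "static:" + x; }
    string InstanceMethod(int x) { return "instance:" + x; }

    static void Main()
    {
        // #4 'Static closed': there is no 'this' to capture, so Target is null
        Func<int, string> staticDel = StaticMethod;
        Console.WriteLine(staticDel.Target == null);                 // True

        // #1 'Instance closed': the 'this' ptr is captured in Target (_target)
        var obj = new DelegateFieldsDemo();
        Func<int, string> instanceDel = obj.InstanceMethod;
        Console.WriteLine(ReferenceEquals(instanceDel.Target, obj)); // True
        Console.WriteLine(instanceDel.Method.Name);                  // InstanceMethod
    }
}
```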
Delegates are special!!
As well as being responsible for creating delegates, the runtime also treats delegates specially, to enforce security and/or type-safety. You can see how this is implemented in the links below.
In MethodTableBuilder.cpp:
In ClassCompat.cpp:
Discuss this post in /r/programming and /r/csharp
Other links:
If you’ve read this far, good job!!
As a reward, below are some extra links that cover more than you could possibly want to know about delegates!!
General Info:
Internal Delegate Info
Debugging delegates
The post How do .NET delegates work? first appeared on my blog Performance is a Feature!
CodeProject
Wed, 25 Jan 2017, 12:00 am
Analysing Pause times in the .NET GC
Over the last few months there have been several blog posts looking at GC pauses in different programming languages or runtimes. It all started with a post looking at the latency of the Haskell GC, next came a follow-up that compared Haskell, OCaml and Racket, followed by Go GC in Theory and Practice, before a final post looking at the situation in Erlang.
After reading all these posts I wanted to see how the .NET GC compares to the other runtime implementations.
The posts above all use a similar test program to exercise the GC, based on the message-bus scenario that Pusher initially described. Fortunately Franck Jeannin had already started work on a .NET version, so this blog post will make use of that.
At the heart of the test is the following code:
var worst = TimeSpan.Zero;
for (var i = 0; i < msgCount; i++)
{
    var sw = Stopwatch.StartNew();
    pushMessage(array, i);
    sw.Stop();
    if (sw.Elapsed > worst)
    {
        worst = sw.Elapsed;
    }
}
private static unsafe void pushMessage(byte[][] array, int id)
{
array[id % windowSize] = createMessage(id);
}
The full code is available
So we are creating a ‘message’ (that is actually a byte[1024]) and then putting it into a data structure (byte[][]). This is repeated 10 million times (msgCount), but at any one time there are only 200,000 (windowSize) messages in memory, because we overwrite old ‘messages’ as we go along.
We are timing how long it takes to add the message to the array, which should be a very quick operation. It’s not guaranteed that this time will always equate to GC pauses, but it’s pretty likely. However we can also double check the actual GC pause times by using the excellent PerfView tool, to give us more confidence.
Workstation GC vs. Server GC
Unlike the Java GC that is very configurable, the .NET GC really only gives you a few options:
- Workstation
- Server
- Concurrent/Background
So we will be comparing the Server and Workstation modes, but as we want to reduce pauses we are going to always leave Concurrent/Background mode enabled.
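You can verify at runtime which of these modes your process actually ended up with, via System.Runtime.GCSettings (a small check of my own, not from the original post):

```csharp
using System;
using System.Runtime;

class GcModeCheck
{
    static void Main()
    {
        // True when <gcServer enabled="true"/> is set in the app.config
        Console.WriteLine("Server GC:    " + GCSettings.IsServerGC);
        // 'Interactive' means Concurrent/Background GC is enabled (the default)
        Console.WriteLine("Latency mode: " + GCSettings.LatencyMode);
    }
}
```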
As outlined in the excellent post Understanding different GC modes with Concurrency Visualizer, the 2 modes are optimised for different things (emphasis mine):
Workstation GC is designed for desktop applications to minimize the time spent in GC. In this case GC will happen more frequently but with shorter pauses in application threads. Server GC is optimized for application throughput in favor of longer GC pauses. Memory consumption will be higher, but application can process greater volume of data without triggering garbage collection.
Therefore Workstation mode should give us shorter pauses than Server mode and the results bear this out; below is a graph of the pause times at different percentiles, recorded by HdrHistogram.NET (click for full-size image):
Note that the X-axis scale is logarithmic, the Workstation (WKS) pauses starts increasing at the 99.99%’ile, whereas the Server (SVR) pauses only start at the 99.9999%’ile, although they have a larger maximum.
Another way of looking at the results is the table below. Here we can clearly see that Workstation has a lot more GC pauses, although the max is smaller. But more significantly, the total GC pause time is much higher and as a result the overall/elapsed time is twice as long (WKS vs. SVR).
Workstation GC (Concurrent) vs. Server GC (Background) (On .NET 4.6 - Array tests - all times in milliseconds)
| GC Mode | Max GC Pause | # GC Pauses | Total GC Pause Time | Elapsed Time | Peak Working Set (MB) |
|---------|-------------:|------------:|--------------------:|-------------:|----------------------:|
| Workstation - 1 | 28.0 | 1,797 | 10,266.2 | 21,688.3 | 550.37 |
| Workstation - 2 | 23.2 | 1,796 | 9,756.6 | 21,018.2 | 543.50 |
| Workstation - 3 | 19.3 | 1,800 | 9,676.0 | 21,114.6 | 531.24 |
| Server - 1 | 104.6 | 7 | 646.4 | 7,062.2 | 2,086.39 |
| Server - 2 | 107.2 | 7 | 664.8 | 7,096.6 | 2,092.65 |
| Server - 3 | 106.2 | 6 | 558.4 | 7,023.6 | 2,058.12 |
Therefore, if you only care about reducing the maximum pause time then Workstation mode is a suitable option, but you will experience more GC pauses overall and so the throughput of your application will be reduced. In addition, the working set is higher for Server mode as it allocates 1 heap per CPU.
Fortunately in .NET we have the choice of which mode we want to use, according to the fantastic article Modern garbage collection the GO runtime has optimised for pause time only:
The reality is that Go’s GC does not really implement any new ideas or research. As their announcement admits, it is a straightforward concurrent mark/sweep collector based on ideas from the 1970s. It is notable only because it has been designed to optimise for pause times at the cost of absolutely every other desirable characteristic in a GC. Go’s tech talks and marketing materials don’t seem to mention any of these tradeoffs, leaving developers unfamiliar with garbage collection technologies to assume that no such tradeoffs exist, and by implication, that Go’s competitors are just badly engineered piles of junk.
Max GC Pause Time compared to Amount of Live Objects
To investigate things further, let’s look at how the maximum pause times vary with the number of live objects. If you refer back to the sample code, we will still be allocating 10,000,000 messages (msgCount), but we will vary the amount that are kept around at any one time by changing the windowSize value. Here are the results (click for full-size image):
So you can clearly see that the max pause time is proportional (linearly?) to the amount of live objects, i.e. the amount of objects that survive the GC. Why is this the case? Well, to get a bit more info we will again use PerfView to help us. If you compare the 2 tables below, you can see that the ‘Promoted MB’ is drastically different; a lot more memory is promoted when we have a larger windowSize, so the GC has more work to do and as a result the ‘Pause MSec’ times go up.
GC Events by Time - windowSize = 100,000 (all times are in msec)

| GC Index | Gen | Pause MSec | Gen0 Alloc MB | Peak MB | After MB | Promoted MB | Gen0 MB | Gen1 MB | Gen2 MB | LOH MB |
|---------:|-----|-----------:|--------------:|--------:|---------:|------------:|--------:|--------:|--------:|-------:|
| 2 | 1N | 39.443 | 1,516.354 | 1,516.354 | 108.647 | 104.831 | 0.000 | 107.200 | 0.031 | 1.415 |
| 3 | 0N | 38.516 | 1,651.466 | 0.000 | 215.847 | 104.800 | 0.000 | 214.400 | 0.031 | 1.415 |
| 4 | 1N | 42.732 | 1,693.908 | 1,909.754 | 108.647 | 104.800 | 0.000 | 107.200 | 0.031 | 1.415 |
| 5 | 0N | 35.067 | 1,701.012 | 1,809.658 | 215.847 | 104.800 | 0.000 | 214.400 | 0.031 | 1.415 |
| 6 | 1N | 54.424 | 1,727.380 | 1,943.226 | 108.647 | 104.800 | 0.000 | 107.200 | 0.031 | 1.415 |
| 7 | 0N | 35.208 | 1,603.832 | 1,712.479 | 215.847 | 104.800 | 0.000 | 214.400 | 0.031 | 1.415 |
Full PerfView output
GC Events by Time - windowSize = 400,000 (all times are in msec)

| GC Index | Gen | Pause MSec | Gen0 Alloc MB | Peak MB | After MB | Promoted MB | Gen0 MB | Gen1 MB | Gen2 MB | LOH MB |
|---------:|-----|-----------:|--------------:|--------:|---------:|------------:|--------:|--------:|--------:|-------:|
| 2 | 0N | 10.319 | 76.170 | 76.170 | 76.133 | 68.983 | 0.000 | 72.318 | 0.000 | 3.815 |
| 3 | 1N | 47.192 | 666.089 | 0.000 | 708.556 | 419.231 | 0.000 | 704.016 | 0.725 | 3.815 |
| 4 | 0N | 145.347 | 1,023.369 | 1,731.925 | 868.610 | 419.200 | 0.000 | 864.070 | 0.725 | 3.815 |
| 5 | 1N | 190.736 | 1,278.314 | 2,146.923 | 433.340 | 419.200 | 0.000 | 428.800 | 0.725 | 3.815 |
| 6 | 0N | 150.689 | 1,235.161 | 1,668.501 | 862.140 | 419.200 | 0.000 | 857.600 | 0.725 | 3.815 |
| 7 | 1N | 214.465 | 1,493.290 | 2,355.430 | 433.340 | 419.200 | 0.000 | 428.800 | 0.725 | 3.815 |
| 8 | 0N | 148.816 | 1,055.470 | 1,488.810 | 862.140 | 419.200 | 0.000 | 857.600 | 0.725 | 3.815 |
| 9 | 1N | 225.881 | 1,543.345 | 2,405.485 | 433.340 | 419.200 | 0.000 | 428.800 | 0.725 | 3.815 |
| 10 | 0N | 148.292 | 1,077.176 | 1,510.516 | 862.140 | 419.200 | 0.000 | 857.600 | 0.725 | 3.815 |
| 11 | 1N | 225.917 | 1,610.319 | 2,472.459 | 433.340 | 419.200 | 0.000 | 428.800 | 0.725 | 3.815 |
Full PerfView output
Going ‘off-heap’
Finally, if we really want to eradicate GC pauses in .NET, we can go off-heap. To do that we can write unsafe code like this:
var dest = array[id % windowSize];
IntPtr unmanagedPointer = Marshal.AllocHGlobal(dest.Length);
byte* bytePtr = (byte*)unmanagedPointer;
// Get the raw data into the bytePtr (byte *)
// in reality this would come from elsewhere, e.g. a network packet
// but for the test we'll just cheat and populate it in a loop
for (int i = 0; i < dest.Length; i++)
{
    bytePtr[i] = (byte)i;
}
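One thing the snippet glosses over: unmanaged memory is never touched by the GC, so it must be freed explicitly. Here is a minimal, self-contained sketch of my own showing the full allocate/use/free cycle (it requires compiling with /unsafe):

```csharp
using System;
using System.Runtime.InteropServices;

class OffHeapDemo
{
    static unsafe void Main()
    {
        // allocate 1024 bytes from the unmanaged heap - invisible to the GC
        IntPtr unmanagedPointer = Marshal.AllocHGlobal(1024);
        try
        {
            byte* bytePtr = (byte*)unmanagedPointer;
            for (int i = 0; i < 1024; i++)
                bytePtr[i] = (byte)(i % 256);
            Console.WriteLine(bytePtr[42]); // 42
        }
        finally
        {
            // unlike managed arrays, *we* are responsible for freeing this memory
            Marshal.FreeHGlobal(unmanagedPointer);
        }
    }
}
```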
Fri, 13 Jan 2017, 12:00 am
Why Exceptions should be Exceptional
According to the NASA ‘Near Earth Object Program’ asteroid ‘101955 Bennu (1999 RQ36)’ has a Cumulative Impact Probability of 3.7e-04, i.e. there is a 1 in 2,700 (0.0370%) chance of Earth impact, but more reassuringly there is a 99.9630% chance the asteroid will miss the Earth completely!
But how does this relate to exceptions in the .NET runtime, well let’s take a look at the official .NET Framework Design Guidelines for Throwing Exceptions (which are based on the excellent book Framework Design Guidelines: Conventions, Idioms, and Patterns for Reusable .NET Libraries)
So exceptions should be exceptional, unusual or rare, much like an asteroid strike!!
.NET Framework TryXXX() Pattern
In .NET, the recommended way to avoid exceptions in normal code flow is to use the TryXXX() pattern. As pointed out in the guideline section on Exceptions and Performance, rather than writing code like this, which has to catch the exception when the input string isn’t a valid integer:
try
{
int result = int.Parse("IANAN");
Console.WriteLine(result);
}
catch (FormatException fEx)
{
Console.WriteLine(fEx);
}
You should instead use the TryXXX API, in the following pattern:
int result;
if (int.TryParse("IANAN", out result))
{
// SUCCESS!!
Console.WriteLine(result);
}
else
{
// FAIL!!
}
Fortunately large parts of the .NET runtime use this pattern for non-exceptional events, such as parsing a string, creating a URL or adding an item to a Concurrent Dictionary.
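The same pattern is easy to follow in your own APIs; here is a hypothetical TryGetHeader(..) sketch (the method and its inputs are made up for illustration, not part of any real library):

```csharp
using System.Collections.Generic;

static class MessageParser
{
    // returns false instead of throwing when the header is missing -
    // callers can then treat 'not found' as normal, non-exceptional control flow
    public static bool TryGetHeader(
        IDictionary<string, string> headers, string name, out string value)
    {
        if (headers != null && headers.TryGetValue(name, out value))
            return true;

        value = null;
        return false;
    }
}
```

Callers then write `if (MessageParser.TryGetHeader(headers, "Host", out host)) { ... }` rather than wrapping the lookup in a try/catch.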
So onto the performance costs, I was inspired to write this post after reading this tweet from Clemens Vasters:
I also copied/borrowed a large amount of ideas from the excellent post ‘The Exceptional Performance of Lil’ Exception’ by Java performance guru Aleksey Shipilëv (this post is in essence the .NET version of his post, which focuses exclusively on exceptions in the JVM)
So let's start with the full results (click for full-size image):
(Full Benchmark Code and Results)
Rare exceptions v Error Code Handling
Up front I want to be clear that nothing in this post is meant to contradict the best-practices outlined in the .NET Framework Guidelines (above), in fact I hope that it actually backs them up!
| Method | Mean | StdErr | StdDev | Scaled |
|--------|-----:|-------:|-------:|-------:|
| ErrorCodeWithReturnValue | 1.4472 ns | 0.0088 ns | 0.0341 ns | 1.00 |
| RareExceptionStackTrace | 22.0401 ns | 0.0292 ns | 0.1132 ns | 15.24 |
| RareExceptionMediumStackTrace | 61.8835 ns | 0.0609 ns | 0.2279 ns | 42.78 |
| RareExceptionDeepStackTrace | 115.3692 ns | 0.1795 ns | 0.6953 ns | 79.76 |
Here we can see that as long as you follow the guidance and ‘DO NOT use exceptions for the normal flow of control’ then they are actually not that costly. I mean yes, they’re 15 times slower than using error codes, but we’re only talking about 22 nanoseconds, i.e. 22 billionths of a second, you have to be throwing exceptions frequently for it to be noticeable. For reference, here’s what the code for the first 2 results looks like:
public struct ResultAndErrorCode<T>
{
    public T Result;
    public int ErrorCode;
}

[Benchmark(Baseline = true)]
public ResultAndErrorCode<string> ErrorCodeWithReturnValue()
{
    var result = new ResultAndErrorCode<string>();
    result.Result = null;
    result.ErrorCode = 5;
    return result;
}
[Benchmark]
public string RareExceptionStackTrace()
{
try
{
RareLevel20(); // call the last level directly, for a shallow stack trace
return null; //Prevent Error CS0161: not all code paths return a value
}
catch (InvalidOperationException ioex)
{
// Force collection of a full StackTrace
return ioex.StackTrace;
}
}
Where the ‘RareLevelXX()’ functions look like this (i.e. an exception will only be triggered once for every 2,700 times it’s called):
[MethodImpl(MethodImplOptions.NoInlining)]
private static void RareLevel1() { RareLevel2(); }
[MethodImpl(MethodImplOptions.NoInlining)]
private static void RareLevel2() { RareLevel3(); }
... // several layers left out!!
[MethodImpl(MethodImplOptions.NoInlining)]
private static void RareLevel19() { RareLevel20(); }
[MethodImpl(MethodImplOptions.NoInlining)]
private static void RareLevel20()
{
counter++;
// will *rarely* happen (1 in 2700)
if (counter % chanceOfAsteroidHit == 1)
throw new InvalidOperationException("Deep Stack Trace - Rarely triggered");
}
Therefore RareExceptionMediumStackTrace() just calls RareLevel10() to get a medium stack trace and RareExceptionDeepStackTrace() calls RareLevel1() which triggers the full/deep one (the full benchmark code is available).
Stack traces
Now that we’ve seen the cost of calling exceptions rarely, we’re going to look at the effect the stack trace depth has on performance. Here are the full, raw results:
| Method | Mean | StdErr | StdDev | Gen 0 | Allocated |
|--------|-----:|-------:|-------:|------:|----------:|
| Exception-Message | 9,187.9417 ns | 13.4824 ns | 48.6117 ns | - | 148 B |
| Exception-TryCatch | 9,253.0215 ns | 13.2496 ns | 51.3154 ns | - | 148 B |
| ExceptionMedium-Message | 14,911.7999 ns | 20.2448 ns | 78.4078 ns | - | 916 B |
| ExceptionMedium-TryCatch | 15,158.0940 ns | 147.4210 ns | 737.1049 ns | - | 916 B |
| ExceptionDeep-Message | 19,166.3524 ns | 30.0539 ns | 116.3984 ns | - | 916 B |
| ExceptionDeep-TryCatch | 19,581.6743 ns | 208.3895 ns | 833.5579 ns | - | 916 B |
| CachedException-StackTrace | 29,354.9344 ns | 34.8932 ns | 135.1407 ns | - | 1.82 kB |
| Exception-StackTrace | 30,178.7152 ns | 41.0362 ns | 158.9327 ns | - | 1.93 kB |
| ExceptionMedium-StackTrace | 100,121.7951 ns | 129.0631 ns | 499.8591 ns | 0.1953 | 15.71 kB |
| ExceptionDeep-StackTrace | 154,569.3454 ns | 205.2174 ns | 794.8034 ns | 3.6133 | 27.42 kB |
Note: in these tests we are triggering an exception every time a method is called; these aren’t the rare cases that we measured previously.
Exception handling without collecting the full StackTrace
First we are going to look at the results measuring the scenario where we don’t explicitly collect the StackTrace after the exception is caught, so the benchmark code looks like this:
[Benchmark]
public string ExceptionMessage()
{
try
{
Level20(); // start *all* the way down the stack
return null; //Prevent Error CS0161: not all code paths return a value
}
catch (InvalidOperationException ioex)
{
// Only get the simple message from the Exception
// (don't trigger a StackTrace collection)
return ioex.Message;
}
}
In the following graphs, shallow stack traces are in blue bars, medium in orange and deep stacks are shown in green
So we clearly see there is an extra cost for exception handling that increases the deeper the stack trace goes. This is because when an exception is thrown the runtime needs to search up the stack until it hits a method that can handle it. The further it has to look up the stack, the more work it has to do.
Exception handling including collection of the full StackTrace
Now for the final results, in which we explicitly ask the run-time to (lazily) fetch the full stack trace, by accessing the StackTrace property. The code looks like this:
[Benchmark]
public string ExceptionStackTrace()
{
try
{
Level20(); // start *all* the way down the stack
return null; //Prevent Error CS0161: not all code paths return a value
}
catch (InvalidOperationException ioex)
{
// Force collection of a full StackTrace
return ioex.StackTrace;
}
}
Finally we see that fetching the entire stack trace (via StackTrace) dominates the performance of just handling the exception (i.e. only accessing the exception message). But again, the deeper the stack trace, the higher the cost.
So thank goodness we’re in the .NET world, where huge stack traces are rare. Over in Java-land they have to deal with nonsense like this (click to see the full-res version!!):
Conclusion
- Rare or Exceptional exceptions are not hugely expensive and they should always be the preferred way of error handling in .NET
- If you have code that is expected to fail often (such as parsing a string into an integer), use the TryXXX() pattern
- The deeper the stack trace, the more work that has to be done, so the more overhead there is when catching/handling exceptions
- This is even more true if you are also fetching the entire stack trace, via the StackTrace property. So if you don’t need it, don’t fetch it.
Discuss this post in /r/programming and /r/csharp
Further Reading
Exception Cost: When to throw and when not to, a classic post on the subject by ‘.NET Perf Guru’ Rico Mariani.
The stack trace of a StackTrace!! The full call-stack that the CLR goes through when fetching the data for the Exception StackTrace property.
The post Why Exceptions should be Exceptional first appeared on my blog Performance is a Feature!
CodeProject
Tue, 20 Dec 2016, 12:00 am
Why is reflection slow?
It’s common knowledge that reflection in .NET is slow, but why is that the case? This post aims to figure that out by looking at what reflection does under-the-hood.
CLR Type System Design Goals
But first it’s worth pointing out that part of the reason reflection isn’t fast is that it was never designed to have high-performance as one of its goals, from Type System Overview - ‘Design Goals and Non-goals’:
Goals
- Accessing information needed at runtime from executing (non-reflection) code is very fast.
- Accessing information needed at compilation time for generating code is straightforward.
- The garbage collector/stackwalker is able to access necessary information without taking locks, or allocating memory.
- Minimal amounts of types are loaded at a time.
- Minimal amounts of a given type are loaded at type load time.
- Type system data structures must be storable in NGEN images.
Non-Goals
- All information in the metadata is directly reflected in the CLR data structures.
- All uses of reflection are fast.
and along the same lines, from Type Loader Design - ‘Key Data Structures’:
EEClass
MethodTable data are split into “hot” and “cold” structures to improve working set and cache utilization. MethodTable itself is meant to only store “hot” data that are needed in program steady state. EEClass stores “cold” data that are typically only needed by type loading, JITing or reflection. Each MethodTable points to one EEClass.
How does Reflection work?
So we know that ensuring reflection was fast was not a design goal, but what is it doing that takes the extra time?
Well, there are several things happening; to illustrate them, let's look at the managed and unmanaged call-stack that a reflection call goes through.
- System.Reflection.RuntimeMethodInfo.Invoke(..) - source code link
- calling System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(..)
- System.RuntimeMethodHandle.PerformSecurityCheck(..) - link
- calling System.GC.KeepAlive(..)
- System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(..) - link
- calling stub for System.RuntimeMethodHandle.InvokeMethod(..)
- stub for System.RuntimeMethodHandle.InvokeMethod(..) - link
Even if you don’t click the links and look at the individual C#/cpp methods, you can intuitively tell that there’s a lot of code being executed along the way. To give you an example, the final method, where the bulk of the work is done, System.RuntimeMethodHandle.InvokeMethod, is over 400 LOC!
This is a nice overview, but what specifically is it doing?
Before you can invoke a field/property/method via reflection you have to get the FieldInfo/PropertyInfo/MethodInfo handle for it, using code like this:
Type t = typeof(Person);
FieldInfo m = t.GetField("Name");
As shown in the previous section there’s a cost to this, because the relevant meta-data has to be fetched, parsed, etc. Interestingly enough, the runtime helps us by keeping an internal cache of all the fields/properties/methods. This cache is implemented by the RuntimeTypeCache class and one example of its usage is in the RuntimeMethodInfo class.
You can see the cache in action by running the code in this gist, which appropriately enough uses reflection to inspect the runtime internals!
Before you have done any reflection to obtain a FieldInfo, the code in the gist will print this:
Type: ReflectionOverhead.Program
Reflection Type: System.RuntimeType (BaseType: System.Reflection.TypeInfo)
m_fieldInfoCache is null, cache has not been initialised yet
But once you’ve fetched even just one field, then the following will be printed:
Type: ReflectionOverhead.Program
Reflection Type: System.RuntimeType (BaseType: System.Reflection.TypeInfo)
RuntimeTypeCache: System.RuntimeType+RuntimeTypeCache,
m_cacheComplete = True, 4 items in cache
[0] - Int32 TestField1 - Private
[1] - System.String TestField2 - Private
[2] - Int32 <TestProperty1>k__BackingField - Private
[3] - System.String TestField3 - Private, Static
where ReflectionOverhead.Program looks like this:
class Program
{
private int TestField1;
private string TestField2;
private static string TestField3;
private int TestProperty1 { get; set; }
}
This means that repeated calls to GetField or GetFields are cheaper, as the runtime only has to filter the pre-existing list that’s already been created. The same applies to GetMethod and GetProperty; when you call them for the first time the MethodInfo or PropertyInfo cache is built.
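One observable effect of this cache: repeated lookups hand back the same objects rather than allocating new ones each time. A small sketch of my own:

```csharp
using System;
using System.Reflection;

class CacheDemo
{
    private int TestField1;

    static void Main()
    {
        var flags = BindingFlags.Instance | BindingFlags.NonPublic;
        FieldInfo first  = typeof(CacheDemo).GetField("TestField1", flags);
        FieldInfo second = typeof(CacheDemo).GetField("TestField1", flags);

        // both calls return the same cached FieldInfo instance; only the
        // (cheaper) filtering of the already-built list is repeated
        Console.WriteLine(ReferenceEquals(first, second)); // True
    }
}
```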
Argument Validation and Error Handling
But once you’ve obtained the MethodInfo, there’s still a lot of work to be done when you call Invoke on it. Imagine you wrote some code like this:
PropertyInfo stringLengthField =
typeof(string).GetProperty("Length",
BindingFlags.Instance | BindingFlags.Public);
var length = stringLengthField.GetGetMethod().Invoke(new Uri("http://example.com"), new object[0]);
If you run it you would get the following exception:
System.Reflection.TargetException: Object does not match target type.
at System.Reflection.RuntimeMethodInfo.CheckConsistency(..)
at System.Reflection.RuntimeMethodInfo.InvokeArgumentsCheck(..)
at System.Reflection.RuntimeMethodInfo.Invoke(..)
at System.Reflection.RuntimePropertyInfo.GetValue(..)
This is because we have obtained the PropertyInfo for the Length property on the String class, but invoked it with an Uri object, which is clearly the wrong type!
In addition to this, there also has to be validation of any arguments you pass through to the method you are invoking. To make argument passing work, reflection APIs take a parameter that is an array of object’s, one per argument. So if you use reflection to call the method Add(int x, int y), you would invoke it by calling methodInfo.Invoke(.., new object[] { 5, 6 }). At run-time, checks need to be carried out on the number and types of the values passed in, in this case to ensure that there are 2 and that they are both int’s. One down-side of all this work is that it often involves boxing, which has an additional cost, but hopefully this will be minimised in the future.
Security Checks
The other main task that is happening along the way is multiple security checks. For instance, it turns out that you aren’t allowed to use reflection to call just any method you feel like. There are some restricted or ‘Dangerous Methods’, that can only be called by trusted .NET framework code. In addition to a black-list, there are also dynamic security checks depending on the current Code Access Security permissions that have to be checked during invocation.
How much does Reflection cost?
So now that we know what reflection is doing behind-the-scenes, it’s a good time to look at what it costs us. Please note that these benchmarks compare reading/writing a property directly vs. via reflection. In .NET, properties are actually a pair of Get/Set methods that the compiler generates for us; however, when the property has just a simple backing field the .NET JIT inlines the method call for performance reasons. This means that using reflection to access a property shows reflection in the worst possible light, but it was chosen as it’s the most common use-case, showing up in ORMs, Json serialisation/deserialisation libraries and object mapping tools.
Below are the raw results as they are displayed by BenchmarkDotNet, followed by the same results displayed in 2 separate tables. (full Benchmark code is available)
Reading a Property (‘Get’)
| Method | Mean | StdErr | Scaled | Bytes Allocated/Op |
|--------|-----:|-------:|-------:|-------------------:|
| GetViaProperty | 0.2159 ns | 0.0047 ns | 1.00 | 0.00 |
| GetViaDelegate | 1.8903 ns | 0.0082 ns | 8.82 | 0.00 |
| GetViaILEmit | 2.9236 ns | 0.0067 ns | 13.64 | 0.00 |
| GetViaCompiledExpressionTrees | 12.3623 ns | 0.0200 ns | 57.65 | 0.00 |
| GetViaFastMember | 35.9199 ns | 0.0528 ns | 167.52 | 0.00 |
| GetViaReflectionWithCaching | 125.3878 ns | 0.2017 ns | 584.78 | 0.00 |
| GetViaReflection | 197.9258 ns | 0.2704 ns | 923.08 | 0.01 |
| GetViaDelegateDynamicInvoke | 842.9131 ns | 1.2649 ns | 3,931.17 | 419.04 |
Writing a Property (‘Set’)
| Method | Mean | StdErr | Scaled | Bytes Allocated/Op |
|--------|-----:|-------:|-------:|-------------------:|
| SetViaProperty | 1.4043 ns | 0.0200 ns | 6.55 | 0.00 |
| SetViaDelegate | 2.8215 ns | 0.0078 ns | 13.16 | 0.00 |
| SetViaILEmit | 2.8226 ns | 0.0061 ns | 13.16 | 0.00 |
| SetViaCompiledExpressionTrees | 10.7329 ns | 0.0221 ns | 50.06 | 0.00 |
| SetViaFastMember | 36.6210 ns | 0.0393 ns | 170.79 | 0.00 |
| SetViaReflectionWithCaching | 214.4321 ns | 0.3122 ns | 1,000.07 | 98.49 |
| SetViaReflection | 287.1039 ns | 0.3288 ns | 1,338.99 | 115.63 |
| SetViaDelegateDynamicInvoke | 922.4618 ns | 2.9192 ns | 4,302.17 | 390.99 |
So we can clearly see that regular reflection code (GetViaReflection and SetViaReflection) is considerably slower than accessing the property directly (GetViaProperty and SetViaProperty). But what about the other results? Let's explore those in more detail.
Setup
First we start with a TestClass that looks like this:
public class TestClass
{
    public TestClass(String data)
    {
        Data = data;
    }

    private string data;
    private string Data
    {
        get { return data; }
        set { data = value; }
    }
}
and the following common code, that all the options can make use of:
// Setup code, done only once
TestClass testClass = new TestClass("A String");
Type @class = testClass.GetType();
BindingFlags bindingFlags = BindingFlags.Instance |
                            BindingFlags.NonPublic |
                            BindingFlags.Public;
Regular Reflection
First we use regular reflection code, which acts as our starting point and the ‘worst case’:
[Benchmark]
public string GetViaReflection()
{
    PropertyInfo property = @class.GetProperty("Data", bindingFlags);
    return (string)property.GetValue(testClass, null);
}
Option 1 - Cache PropertyInfo
Next up, we can gain a small speed boost by keeping a reference to the PropertyInfo, rather than fetching it each time. But we’re still much slower than accessing the property directly, which demonstrates that there is a considerable cost in the ‘invocation’ part of reflection.
// Setup code, done only once
PropertyInfo cachedPropertyInfo = @class.GetProperty("Data", bindingFlags);

[Benchmark]
public string GetViaReflectionWithCaching()
{
    return (string)cachedPropertyInfo.GetValue(testClass, null);
}
Option 2 - Use FastMember
Here we make use of Marc Gravell’s excellent Fast Member library, which as you can see is very simple to use!
// Setup code, done only once
TypeAccessor accessor = TypeAccessor.Create(@class, allowNonPublicAccessors: true);

[Benchmark]
public string GetViaFastMember()
{
    return (string)accessor[testClass, "Data"];
}
Note that it’s doing something slightly different to the other options. It creates a TypeAccessor that allows access to all the properties on a type, not just one. But the downside is that, as a result, it takes longer to run: internally it first has to get the delegate for the property you requested (in this case ‘Data’) before fetching its value. However this overhead is pretty small; FastMember is still way faster than reflection and it’s very easy to use, so I recommend you take a look at it first.
This option and all subsequent ones convert the reflection code into a delegate that can be directly invoked without the overhead of reflection every time, hence the speed boost! It’s worth pointing out, though, that the creation of a delegate has a cost (see ‘Further Reading’ for more info). So in short, the speed boost comes from doing the expensive work once (security checks, etc.) and storing a strongly typed delegate that we can use again and again with little overhead. You wouldn’t use these techniques if you were only doing reflection once, but then if you’re only doing it once it wouldn’t be a performance bottleneck, so you wouldn’t care if it was slow!
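To make the ‘do the expensive work once’ idea concrete, here’s a minimal sketch of the caching pattern that ORMs and serialisers typically use. Note that GetterCache, its key format and this cut-down TestClass are illustrative names of mine, not code from the benchmark suite:

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;
using System.Reflection;

public class TestClass
{
    public TestClass(string data) { Data = data; }
    private string Data { get; set; }
}

// Hypothetical helper: pay the expression-compilation cost once per
// (type, property) pair, then re-use the strongly typed delegate.
public static class GetterCache
{
    private static readonly Dictionary<string, Func<object, object>> Cache =
        new Dictionary<string, Func<object, object>>();

    public static Func<object, object> For(Type type, string propertyName)
    {
        string key = type.FullName + "." + propertyName;
        if (!Cache.TryGetValue(key, out Func<object, object> getter))
        {
            PropertyInfo property = type.GetProperty(propertyName,
                BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic);

            // Expensive, one-off work: build and compile an expression tree
            ParameterExpression instance = Expression.Parameter(typeof(object), "instance");
            getter = Expression.Lambda<Func<object, object>>(
                Expression.TypeAs(
                    Expression.Property(Expression.Convert(instance, type), property),
                    typeof(object)),
                instance).Compile();

            Cache[key] = getter;
        }
        return getter; // cheap from here on
    }
}

public class Program
{
    public static void Main()
    {
        var getter = GetterCache.For(typeof(TestClass), "Data");
        Console.WriteLine((string)getter(new TestClass("A String"))); // prints "A String"
    }
}
```

The first call per property pays roughly the ‘Compiled Expression Trees’ setup cost; every call after that is just a dictionary lookup plus a delegate invocation.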
The reason that reading a property via a delegate isn’t as fast as reading it directly is that the .NET JIT won’t inline a delegate method call like it will do with a property access. So with a delegate we have to pay the cost of a method call, which direct access doesn’t.
Option 3 - Create a Delegate
In this option we use the CreateDelegate function to turn our PropertyInfo into a regular delegate:
// Setup code, done only once
PropertyInfo property = @class.GetProperty("Data", bindingFlags);
Func<TestClass, String> getDelegate =
    (Func<TestClass, String>)Delegate.CreateDelegate(
        typeof(Func<TestClass, String>),
        property.GetGetMethod(nonPublic: true));

[Benchmark]
public string GetViaDelegate()
{
    return getDelegate(testClass);
}
The drawback is that you need to know the concrete type at compile-time, i.e. the Func<TestClass, String> part in the code above (and no, you can’t use Func<Object, Object>; if you do, it’ll throw an exception!). In the majority of situations when you are doing reflection you don’t have this luxury, otherwise you wouldn’t be using reflection in the first place, so it’s not a complete solution.
For a very interesting/mind-bending way to get round this, see the MagicMethodHelper code in the fantastic blog post from Jon Skeet, ‘Making Reflection fly and exploring delegates’, or read on for Options 4 and 5 below.
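You can see the compile-time-type limitation for yourself with a small self-contained sketch (my own demo code, not from the benchmark suite): the concrete Func<TestClass, string> binds fine, while Func<object, object> makes CreateDelegate throw an ArgumentException:

```csharp
using System;
using System.Reflection;

class TestClass
{
    public TestClass(string data) { Data = data; }
    private string Data { get; set; }
}

class Program
{
    static void Main()
    {
        MethodInfo getMethod = typeof(TestClass)
            .GetProperty("Data", BindingFlags.Instance | BindingFlags.NonPublic)
            .GetGetMethod(nonPublic: true);

        // The concrete delegate type binds fine...
        var typedGetter = (Func<TestClass, string>)Delegate.CreateDelegate(
            typeof(Func<TestClass, string>), getMethod);
        Console.WriteLine(typedGetter(new TestClass("A String"))); // prints "A String"

        // ...but an 'object'-based delegate type does not bind, because
        // there is no reference conversion from object to TestClass, so
        // CreateDelegate throws an ArgumentException
        try
        {
            Delegate.CreateDelegate(typeof(Func<object, object>), getMethod);
        }
        catch (ArgumentException)
        {
            Console.WriteLine("Func<object, object> cannot be bound");
        }
    }
}
```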
Option 4 - Compiled Expression Trees
Here we also generate a delegate, but the difference is that we can pass in an object, so we get round the limitation of ‘Option 3’. We make use of the .NET Expression tree API that allows dynamic code generation:
// Setup code, done only once
PropertyInfo property = @class.GetProperty("Data", bindingFlags);
ParameterExpression instance = Expression.Parameter(typeof(object), "instance");
UnaryExpression instanceCast =
    !property.DeclaringType.IsValueType ?
        Expression.TypeAs(instance, property.DeclaringType) :
        Expression.Convert(instance, property.DeclaringType);
Func<object, object> GetDelegate =
    Expression.Lambda<Func<object, object>>(
        Expression.TypeAs(
            Expression.Call(instanceCast, property.GetGetMethod(nonPublic: true)),
            typeof(object)),
        instance)
    .Compile();

[Benchmark]
public string GetViaCompiledExpressionTrees()
{
    return (string)GetDelegate(testClass);
}
Full code for the Expression-based approach is available in the blog post ‘Faster Reflection using Expression Trees’.
Option 5 - Dynamic code-gen with IL Emit
Finally we come to the lowest-level approach, emitting raw IL, although ‘with great power comes great responsibility’:
// Setup code, done only once
PropertyInfo property = @class.GetProperty("Data", bindingFlags);
Emit<Func<TestClass, string>> getterEmiter = Emit<Func<TestClass, string>>
    .NewDynamicMethod("GetTestClassDataProperty")
    .LoadArgument(0)
    .CastClass(@class)
    .Call(property.GetGetMethod(nonPublic: true))
    .Return();
Func<TestClass, string> getter = getterEmiter.CreateDelegate();

[Benchmark]
public string GetViaILEmit()
{
    return getter(testClass);
}
Using Expression trees (as shown in Option 4) doesn’t give you as much flexibility as emitting IL opcodes directly, although it does prevent you from emitting invalid code! Because of this, if you ever find yourself needing to emit IL, I really recommend using the excellent Sigil library, as it gives better error messages when you get things wrong!
Conclusion
The take-away is that if (and only if) you find yourself with a performance issue when using reflection, there are several different ways you can make it faster. These speed gains are achieved by obtaining a delegate that allows you to access the Property/Field/Method directly, without all the overhead of going via reflection every time.
Discuss this post in /r/programming and /r/csharp
Further Reading
For reference, below is the call-stack or code-flow that the runtime goes through when Creating a Delegate
Delegate CreateDelegate(Type type, MethodInfo method)
Delegate CreateDelegate(Type type, MethodInfo method, bool throwOnBindFailure)
Delegate CreateDelegateInternal(RuntimeType rtType, RuntimeMethodInfo rtMethod, Object firstArgument, DelegateBindingFlags flags, ref StackCrawlMark stackMark)
Delegate UnsafeCreateDelegate(RuntimeType rtType, RuntimeMethodInfo rtMethod, Object firstArgument, DelegateBindingFlags flags)
bool BindToMethodInfo(Object target, IRuntimeMethodInfo method, RuntimeType methodType, DelegateBindingFlags flags);
FCIMPL5(FC_BOOL_RET, COMDelegate::BindToMethodInfo, Object* refThisUNSAFE, Object* targetUNSAFE, ReflectMethodObject *pMethodUNSAFE, ReflectClassBaseObject *pMethodTypeUNSAFE, int flags)
COMDelegate::BindToMethod(DELEGATEREF *pRefThis, OBJECTREF *pRefFirstArg, MethodDesc *pTargetMethod, MethodTable *pExactMethodType, BOOL fIsOpenDelegate, BOOL fCheckSecurity)
The post Why is reflection slow? first appeared on my blog Performance is a Feature!
CodeProject
Wed, 14 Dec 2016, 12:00 am
Research papers in the .NET source
This post is completely inspired by (or ‘copied from’ depending on your point of view) a recent post titled JAVA PAPERS (also see the HackerNews discussion). However, instead of looking at Java and the JVM, I’ll be looking at references to research papers in the .NET language, runtime and compiler source code.
If I’ve missed any that you know of, please leave a comment below!
Note: I’ve deliberately left out links to specifications, standards documents or RFCs, instead concentrating only on research papers.
Abstract
The red-black tree model for implementing balanced search trees, introduced by Guibas and Sedgewick thirty years ago, is now found throughout our computational infrastructure. Red-black trees are described in standard textbooks and are the underlying data structure for symbol-table implementations within C++, Java, Python, BSD Unix, and many other modern systems. However, many of these implementations have sacrificed some of the original design goals (primarily in order to develop an effective implementation of the delete operation, which was incompletely specified in the original paper), so a new look is worthwhile.
In this paper, we describe a new variant of red-black trees that meets many of the original design goals and leads to substantially simpler code for insert/delete, less than one-fourth as much code as in implementations in common use.
Abstract
We present a new class of resizable sequential and concurrent hash map algorithms directed at both uni-processor and multicore machines. The new hopscotch algorithms are based on a novel hopscotch multi-phased probing and displacement technique that has the flavors of chaining, cuckoo hashing, and linear probing, all put together, yet avoids the limitations and overheads of these former approaches. The resulting algorithms provide tables with very low synchronization overheads and high cache hit ratios.
In a series of benchmarks on a state-of-the-art 64-way Niagara II multicore machine, a concurrent version of hopscotch proves to be highly scalable, delivering in some cases 2 or even 3 times the throughput of today’s most efficient concurrent hash algorithm, Lea’s ConcurrentHashMap from java.concurr.util. Moreover, in tests on both Intel and Sun uni-processor machines, a sequential version of hopscotch consistently outperforms the most effective sequential hash table algorithms including cuckoo hashing and bounded linear probing.
The most interesting feature of the new class of hopscotch algorithms is that they continue to deliver good performance when the hash table is more than 90% full, increasing their advantage over other algorithms as the table density grows.
Abstract
Method inlining is considered to be one of the most important optimizations in a compiler. However, a poor inlining heuristic can lead to significant degradation of a program’s running time. Therefore, it is important that an inliner has an effective heuristic that controls whether a method is inlined or not. An important component of any inlining heuristic are the features that characterize the inlining decision. These features often correspond to the caller method and the callee methods. However, it is not always apparent what the most important features are for this problem or the relative importance of these features. Compiler writers developing inlining heuristics may exclude critical information that can be obtained during each inlining decision. In this paper, we use a machine learning technique, namely neuro-evolution [18], to automatically induce effective inlining heuristics from a set of features deemed to be useful for inlining. Our learning technique is able to induce novel heuristics that significantly outperform manually-constructed inlining heuristics. We evaluate the heuristic constructed by our neuro-evolutionary technique within the highly tuned Java HotSpot server compiler and the Maxine VM C1X compiler, and we are able to obtain speedups of up to 89% and 114%, respectively. In addition, we obtain an average speedup of almost 9% and 11% for the Java HotSpot VM and Maxine VM, respectively. However, the output of neuro-evolution, a neural network, is not human readable. We show how to construct more concise and readable heuristics in the form of decision trees that perform as well as our neuro-evolutionary approach.
Abstract
Procedural languages are generally well understood. Their foundations have been cast in calculi that prove useful in matters of implementation and semantics. So far, an analogous understanding has not emerged for object-oriented languages. In this book the authors take a novel approach to the understanding of object-oriented languages by introducing object calculi and developing a theory of objects around them. The book covers both the semantics of objects and their typing rules, and explains a range of object-oriented concepts, such as self, dynamic dispatch, classes, inheritance, prototyping, subtyping, covariance and contravariance, and method specialization. Researchers and graduate students will find this an important development of the underpinnings of object-oriented programming.
Abstract
We present an optimized implementation of the linear scan register allocation algorithm for Sun Microsystems’ Java HotSpot™ client compiler. Linear scan register allocation is especially suitable for just-in-time compilers because it is faster than the common graph-coloring approach and yields results of nearly the same quality. Our allocator improves the basic linear scan algorithm by adding more advanced optimizations: It makes use of lifetime holes, splits intervals if the register pressure is too high, and models register constraints of the target architecture with fixed intervals. Three additional optimizations move split positions out of loops, remove register-to-register moves and eliminate unnecessary spill stores. Interval splitting is based on use positions, which also capture the kind of use and whether an operand is needed in a register or not. This avoids the reservation of a scratch register. Benchmark results prove the efficiency of the linear scan algorithm: While the compilation speed is equal to the old local register allocator that is part of the Sun JDK 5.0, integer benchmarks execute about 15% faster. Floating-point benchmarks show the high impact of the Intel SSE2 extensions on the speed of numeric Java applications: With the new SSE2 support enabled, SPECjvm98 executes 25% faster compared with the current Sun JDK 5.0.
Abstract
Pattern matching of algebraic data types (ADTs) is a standard feature in typed functional programming languages, but it is well known that it interacts poorly with abstraction. While several partial solutions to this problem have been proposed, few have been implemented or used. This paper describes an extension to the .NET language F# called active patterns, which supports pattern matching over abstract representations of generic heterogeneous data such as XML and term structures, including where these are represented via object models in other .NET languages. Our design is the first to incorporate both ad hoc pattern matching functions for partial decompositions and “views” for total decompositions, and yet remains a simple and lightweight extension. We give a description of the language extension along with numerous motivating examples. Finally we describe how this feature would interact with other reasonable and related language extensions: existential types quantified at data discrimination tags, GADTs, and monadic generalizations of pattern matching.
Abstract
The problem of searching the set of keys in a file to find a key which is closest to a given query key is discussed. After “closest,” in terms of a metric on the key space, is suitably defined, three file structures are presented together with their corresponding search algorithms, which are intended to reduce the number of comparisons required to achieve the desired result. These methods are derived using certain inequalities satisfied by metrics and by graph-theoretic concepts. Some empirical results are presented which compare the efficiency of the methods.
For reference, the links below take you straight to the GitHub searches, so you can take a look yourself:
Research produced by work on the .NET Runtime or Compiler
But what about the other way round, are there instances of work being done in .NET that is then turned into a research paper? Well it turns out there is, the first example I came across was from a tweet by Joe Duffy:
(As an aside, I recommend checking out Joe Duffy’s blog, it contains lots of information about Midori the research project to build a managed OS!)
Abstract
There has been considerable interest in using control theory to build web servers, database managers, and other systems. We claim that the potential value of using control theory cannot be realized in practice without a methodology that addresses controller design, testing, and tuning. Based on our experience with building a controller for the .NET thread pool, we develop a methodology that: (a) designs for extensibility to integrate diverse control techniques, (b) scales the test infrastructure to enable running a large number of test cases, (c) constructs test cases for which the ideal controller performance is known a priori so that the outcomes of test cases can be readily assessed, and (d) tunes controller parameters to achieve good results for multiple performance metrics. We conclude by discussing how our methodology can be extended, especially to designing controllers for distributed systems.
Abstract
A key challenge for concurrent programming is that side-effects (memory operations) in one thread can affect the behavior of another thread. In this paper, we present a type system to restrict the updates to memory to prevent these unintended side-effects. We provide a novel combination of immutable and unique (isolated) types that ensures safe parallelism (race freedom and deterministic execution). The type system includes support for polymorphism over type qualifiers, and can easily create cycles of immutable objects. Key to the system’s flexibility is the ability to recover immutable or externally unique references after violating uniqueness without any explicit alias tracking. Our type system models a prototype extension to C# that is in active use by a Microsoft team. We describe their experiences building large systems with this extension. We prove the soundness of the type system by an embedding into a program logic.
Abstract
The Microsoft .NET Common Language Runtime provides a shared type system, intermediate language and dynamic execution environment for the implementation and inter-operation of multiple source languages. In this paper we extend it with direct support for parametric polymorphism (also known as generics), describing the design through examples written in an extended version of the C# programming language, and explaining aspects of implementation by reference to a prototype extension to the runtime. Our design is very expressive, supporting parameterized types, polymorphic static, instance and virtual methods, “F-bounded” type parameters, instantiation at pointer and value types, polymorphic recursion, and exact run-time types. The implementation takes advantage of the dynamic nature of the runtime, performing just-in-time type specialization, representation-based code sharing and novel techniques for efficient creation and use of run-time types. Early performance results are encouraging and suggest that programmers will not need to pay an overhead for using generics, achieving performance almost matching hand-specialized code.
Abstract
The security of the .NET programming model is studied from the standpoint of fully abstract compilation of C#. A number of failures of full abstraction are identified, and fixes described. The most serious problems have recently been fixed for version 2.0 of the .NET Common Language Runtime.
Abstract
Concurrent garbage collection is highly attractive for real-time systems, because offloading the collection effort from the executing threads allows faster response, allowing for extremely short deadlines at the microseconds level. Concurrent collectors also offer much better scalability over incremental collectors. The main problem with concurrent real-time collectors is their complexity. The first concurrent real-time garbage collector that can support fine synchronization, STOPLESS, has recently been presented by Pizlo et al. In this paper, we propose two additional (and different) algorithms for concurrent real-time garbage collection: CLOVER and CHICKEN. Both collectors obtain reduced complexity over the first collector STOPLESS, but need to trade a benefit for it. We study the algorithmic strengths and weaknesses of CLOVER and CHICKEN and compare them to STOPLESS. Finally, we have implemented all three collectors on the Bartok compiler and runtime for C# and we present measurements to compare their efficiency and responsiveness.
Abstract
We present STOPLESS: a concurrent real-time garbage collector suitable for modern multiprocessors running parallel multithreaded applications. Creating a garbage-collected environment that supports real-time on modern platforms is notoriously hard, especially if real-time implies lock-freedom. Known real-time collectors either restrict the real-time guarantees to uniprocessors only, rely on special hardware, or just give up supporting atomic operations (which are crucial for lock-free software). STOPLESS is the first collector that provides real-time responsiveness while preserving lock-freedom, supporting atomic operations, controlling fragmentation by compaction, and supporting modern parallel platforms.
STOPLESS is adequate for modern languages such as C# or Java. It was implemented on top of the Bartok compiler and runtime for C# and measurements demonstrate high responsiveness (a factor of a 100 better than previously published systems), virtually no pause times, good mutator utilization, and acceptable overheads.
Finally, a full list of MS Research publications related to ‘programming languages and software engineering’ is available if you want to explore more of this research yourself.
Discuss this post on Hacker News
Mon, 12 Dec 2016, 12:00 am
Open Source .NET – 2 years later
A little over 2 years ago Microsoft announced that they were open sourcing large parts of the .NET framework and as Scott Hanselman said in his recent Connect keynote, the community has been contributing in a significant way:
You can see some more detail on this number in the talk ‘What’s New in the .NET Platform’ by Scott Hunter:
This post aims to give more context to those numbers and allow you to explore patterns and trends across different repositories.
Repository activity over time
First we are going to see an overview of the level of activity in each repo, by looking at the total number of ‘Issues’ (created) or ‘Pull Requests’ (closed) per month. (Yay sparklines FTW!!)
Note: Numbers in black are from the most recent month, with red showing the lowest and green the highest previous value. You can toggle between Issues and Pull Requests by clicking on the buttons, hover over individual sparklines to get a tooltip showing the per/month values and click on the project name to take you to the GitHub page for that repository.
Issues
Pull Requests
The main trend I see across all repos is that there’s a sustained level of activity for the entire 2 years; things didn’t start with a bang and then tail off. In addition, many (but not all) repos show increased activity month-by-month. For instance the PR’s in CoreFX or the Issues in Visual Studio Code (vscode) are clear examples of this: their best months have been the most recent.
Finally one interesting ‘story’ that jumps out of this data is the contrasting levels of activity (PR’s) across the dnx, cli and msbuild repositories, as highlighted in the image below:
If you don’t know the full story, initially all the cmd-line tooling was known as dnx, but in RC2 was migrated to .NET Core CLI. You can see this on the chart, activity in the dnx repo decreased at the same time that work in cli ramped up.
Following that, in May this year, the whole idea of having ‘project.json’ files was abandoned in favour of sticking with ‘msbuild’. You can see this change happen towards the right of the chart: there is a marked increase in the msbuild repo activity as any improvements that had been done in cli were ported over.
But the main question I want to answer is:
How much Community involvement has there been since Microsoft open sourced large parts of the .NET framework?
(See my previous post to see how things looked after one year)
To do this we need to look at who opened the Issue or created the Pull Request (PR) and specifically if they worked for Microsoft or not. This is possible because (almost) all Microsoft employees have indicated where they work on their GitHub profile, for instance:
There are some notable exceptions, e.g. @shanselman clearly works at Microsoft, but it’s easy enough to allow for cases like this. Before you ask, I only analysed this data, I did not keep a copy of it stored in MongoDB to sell to recruiters!!
This data represents the total participation from the last 2 years, i.e. November 2014 to October 2016. All Pull Requests and Issues are treated equally, so a large PR counts the same as one that fixes a spelling mistake. Whilst this isn’t ideal, it’s the simplest way to get an idea of the Microsoft/Community split.
Note: You can hover over the bars to get the actual numbers, rather than percentages.
Issues:
Microsoft
Community
Pull Requests:
Microsoft
Community
The general pattern these graphs show is that the Community is more likely to open an Issue than submit a PR, which I guess isn’t that surprising given the relative amount of work involved. However it’s clear that the Community is still contributing a considerable amount of work; for instance, the CoreCLR repo only has 21% of PRs from the Community, but that still accounts for almost 900!
There are a few interesting cases that jump out here, for instance Roslyn gets 35% of its issues from the Community, but only 6% of its PR’s, clearly getting code into the compiler is a tough task. Likewise it doesn’t seem like the Community is that interested in submitting code to msbuild, although it does have my favourite PR ever:
Finally we can see the ‘per-month’ data from the last 2 years, i.e. November 2014 to October 2016.
Note: You can inspect different repos by selecting them from the pull-down list, but be aware that the y-axis on the graphs are re-scaled, so the maximum value will change each time.
Issues:
Microsoft
Community
Pull Requests:
Microsoft
Community
Whilst not every repo is growing month-by-month, the majority are and those that aren’t at least show sustained contributions across 2 years.
Summary
I think that it’s clear to see that the Community has got on-board with the new open-source Microsoft, producing a sustained level of contributions over the last 2 years; let’s hope it continues!
Discuss this post in /r/programming
The post Open Source .NET – 2 years later first appeared on my blog Performance is a Feature!
Wed, 23 Nov 2016, 12:00 am
How does the 'fixed' keyword work?
Well it turns out that it’s a really nice example of collaboration between the main parts of the .NET runtime, here’s a list of all the components involved:
Now you could argue that all of these are required to execute any C# code, but what’s interesting about the fixed keyword is that they all have a specific part to play.
Compiler
To start with, let’s look at one of the most basic scenarios for using the fixed keyword: directly accessing the contents of a C# string (taken from a Roslyn unit test).
using System;

unsafe class C
{
    static unsafe void Main()
    {
        fixed (char* p = "hello")
        {
            Console.WriteLine(*p);
        }
    }
}
Which the compiler then turns into the following IL:
// Code size 34 (0x22)
.maxstack 2
.locals init (char* V_0, //p
pinned string V_1)
IL_0000: nop
IL_0001: ldstr "hello"
IL_0006: stloc.1
IL_0007: ldloc.1
IL_0008: conv.i
IL_0009: stloc.0
IL_000a: ldloc.0
IL_000b: brfalse.s IL_0015
IL_000d: ldloc.0
IL_000e: call "int System.Runtime.CompilerServices.RuntimeHelpers.OffsetToStringData.get"
IL_0013: add
IL_0014: stloc.0
IL_0015: nop
IL_0016: ldloc.0
IL_0017: ldind.u2
IL_0018: call "void System.Console.WriteLine(char)"
IL_001d: nop
IL_001e: nop
IL_001f: ldnull
IL_0020: stloc.1
IL_0021: ret
Note the pinned string V_1 that the compiler has created for us: it’s made a hidden local variable that holds a reference to the object we are using in the fixed statement, which in this case is the string “hello”. The purpose of this pinned local variable will be explained in a moment.
It’s also emitted a call to the OffsetToStringData getter method (from System.Runtime.CompilerServices.RuntimeHelpers), which we will cover in more detail when we discuss the CLR’s role.
However, as an aside, the compiler is also performing an optimisation for us. Normally it would wrap the fixed statement in a finally block to ensure the pinned local variable is nulled out after control leaves the scope, but in this case it has determined that it can leave out the finally statement entirely, from LocalRewriter_FixedStatement.cs in the Roslyn source:
// In principle, the cleanup code (i.e. nulling out the pinned variables) is always
// in a finally block. However, we can optimize finally away (keeping the cleanup
// code) in cases where both of the following are true:
// 1) there are no branches out of the fixed statement; and
// 2) the fixed statement is not in a try block (syntactic or synthesized).
if (IsInTryBlock(node) || HasGotoOut(rewrittenBody))
{
...
}
What is this pinned identifier?
Let’s start by looking at the authoritative source, from Standard ECMA-335 Common Language Infrastructure (CLI)
II.7.1.2 pinned
The signature encoding for pinned shall appear only in signatures that describe local variables (§II.15.4.1.3). While a method with a pinned local variable is executing, the VES shall not relocate the object to which the local refers. That is, if the implementation of the CLI uses a garbage collector that moves objects, the collector shall not move objects that are referenced by an active pinned local variable.
[Rationale: If unmanaged pointers are used to dereference managed objects, these objects shall be pinned. This happens, for example, when a managed object is passed to a method designed to operate with unmanaged data. end rationale]
VES = Virtual Execution System
CLI = Common Language Infrastructure
CTS = Common Type System
But if you prefer an explanation in a more human-readable form (i.e. not from a spec), then this extract from the book .NET IL Assembler by Serge Lidin is helpful:
(Also available on Google Books)
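As a quick aside, you can observe the same ‘don’t move this object’ guarantee outside of a fixed statement by pinning manually with GCHandle. This sketch is my own illustration, not from the spec:

```csharp
using System;
using System.Runtime.InteropServices;

class PinDemo
{
    static void Main()
    {
        byte[] buffer = { 1, 2, 3 };

        // GCHandle is the long-lived equivalent of a 'pinned' local:
        // while the handle is allocated, the GC will not move 'buffer'.
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            IntPtr p = handle.AddrOfPinnedObject();
            Console.WriteLine(Marshal.ReadByte(p)); // prints "1" (first element)
        }
        finally
        {
            handle.Free(); // un-pin as soon as possible, pinning hurts the GC
        }
    }
}
```

A fixed statement gives the same guarantee, but scoped to the block and with much lower overhead, because the JIT just marks the local as pinned rather than allocating a handle.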
CLR
Arguably the CLR has the easiest job to do (if you accept that it exists as a separate component from the JIT and GC): its job is to provide the offset of the raw string data via the OffsetToStringData method that is emitted by the compiler.
Now you might be thinking that this method does some complex calculations to determine the exact offset, but nope, it’s hard-coded!! (I told you that Strings and the CLR have a Special Relationship):
public static int OffsetToStringData
{
    // This offset is baked in by string indexer intrinsic, so there is no harm
    // in getting it baked in here as well.
    [System.Runtime.Versioning.NonVersionable]
    get {
        // Number of bytes from the address pointed to by a reference to
        // a String to the first 16-bit character in the String. Skip
        // over the MethodTable pointer, & String length. Of course, the
        // String reference points to the memory after the sync block, so
        // don't count that.
        // This property allows C#'s fixed statement to work on Strings.
        // On 64 bit platforms, this should be 12 (8+4) and on 32 bit 8 (4+4).
#if BIT64
        return 12;
#else // 32
        return 8;
#endif // BIT64
    }
}
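The arithmetic behind those hard-coded values is simple enough to sketch (my own illustration): an object reference points just past the sync block, so the offset to the character data only has to skip the MethodTable pointer and the 4-byte length field:

```csharp
using System;

class OffsetDemo
{
    static void Main()
    {
        // MethodTable pointer (pointer-sized) + Int32 string length;
        // the sync block sits *before* the address the reference points
        // to, so it doesn't contribute to the offset.
        int offset = IntPtr.Size + sizeof(int);

        Console.WriteLine(offset); // 12 on a 64-bit process, 8 on 32-bit
    }
}
```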
JITter
For the fixed keyword to work, the role of the JITter is to provide information to the GC/Runtime about the lifetimes of variables within a method and, in particular, whether they are pinned locals. It does this via the GCInfo data it creates for every method:
To see this in action we have to enable the correct magic flags and then we will see the following:
Compiling 0 ConsoleApplication.Program::Main, IL size = 30, hsh=0x8d66958e
; Assembly listing for method ConsoleApplication.Program:Main(ref)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 arg0 [V00 ] ( 0, 0 ) ref -> zero-ref
; V01 loc0 [V01,T00] ( 5, 4 ) long -> rcx
; V02 loc1 [V02 ] ( 3, 3 ) ref -> [rsp+0x20] must-init pinned
; V03 tmp0 [V03,T01] ( 2, 4 ) long -> rcx
; V04 OutArgs [V04 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
;
; Lcl frame size = 40
G_M27250_IG01:
000000 4883EC28 sub rsp, 40
000004 33C0 xor rax, rax
000006 4889442420 mov qword ptr [rsp+20H], rax
G_M27250_IG02:
00000B 488B0C256830B412 mov rcx, gword ptr [12B43068H] 'hello'
000013 48894C2420 mov gword ptr [rsp+20H], rcx
000018 488B4C2420 mov rcx, gword ptr [rsp+20H]
00001D 4885C9 test rcx, rcx
000020 7404 je SHORT G_M27250_IG03
000022 4883C10C add rcx, 12
G_M27250_IG03:
000026 0FB709 movzx rcx, word ptr [rcx]
000029 E842FCFFFF call System.Console:WriteLine(char)
00002E 33C0 xor rax, rax
000030 4889442420 mov gword ptr [rsp+20H], rax
G_M27250_IG04:
000035 4883C428 add rsp, 40
000039 C3 ret
; Total bytes of code 58, prolog size 11 for method ConsoleApplication.Program:Main(ref)
; ============================================================
Set code length to 58.
Set Outgoing stack arg area size to 32.
Stack slot id for offset 32 (0x20) (sp) (pinned, untracked) = 0.
Defining 1 call sites:
Offset 0x29, size 5.
See how the section titled “Final local variable assignments” indicates that the V02 loc1
variable is must-init pinned
and then down at the bottom it has this text:
Stack slot id for offset 32 (0x20) (sp) (pinned, untracked) = 0.
Aside: The JIT has also done some extra work for us and optimised away the call to OffsetToStringData
by inlining it as the assembly code add rcx, 12
. On a slightly related note, previously the fixed
keyword prevented a method from being inlined, but recently that changed, see Support inlining method with pinned locals for the full details.
Garbage Collector
Finally we come to the GC which has an important “role to play”, or “not to play” depending on which way you look at it.
In effect the GC has to get out of the way and leave the pinned local variable alone for the life-time of the method. Normally the GC is concerned about which objects are live or dead so that it knows what it has to clean up. But with pinned objects it has to go one step further, not only must it not clean up the object, but it must not move it around. Generally the GC likes to relocate objects around during the Compact Phase to make memory allocations cheap, but pinning prevents that as the object is being accessed via a pointer and therefore its memory address has to remain the same.
There is a great visual explanation of what that looks like from the excellent presentation CLR: Garbage Collection Inside Out by Maoni Stephens (click for full-sized version):
Note how the pinned blocks (marked with a ‘P’) have remained where they are, forcing the Gen 0/1/2 segments to start at awkward locations. This is why pinning too many objects and keeping them pinned for too long can cause GC overhead, it has to perform extra book-keeping and work around them.
In reality, when using the fixed
keyword, your object will only remain pinned for a short period of time, i.e. until control leaves the scope. But if you are pinning an object via the GCHandle
class then the lifetime could be longer.
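As a sketch of that longer-lived variant, pinning via `GCHandle` keeps the object fixed in memory until `Free()` is called:

```csharp
using System;
using System.Runtime.InteropServices;

class Program
{
    static void Main()
    {
        var buffer = new byte[256];
        // Pin the array so the GC cannot relocate it; unlike 'fixed', the
        // pin lasts until the handle is explicitly freed, so forgetting
        // Free() keeps the object pinned indefinitely.
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            IntPtr address = handle.AddrOfPinnedObject();
            Console.WriteLine(address != IntPtr.Zero); // True
        }
        finally
        {
            handle.Free();
        }
    }
}
```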
So to finish, let’s get the final word on pinning from Maoni Stephens, from Using GC Efficiently – Part 3 (read the blog post for more details):
When you do need to pin, here are some things to keep in mind:
- Pinning for a short time is cheap.
- Pinning an older object is not as harmful as pinning a young object.
- Creating pinned buffers that stay together instead of scattered around. This way you create fewer holes.
Summary
So that’s it, simple really!!
All the main parts of the .NET runtime do their bit and we get to use a handy feature that lets us drop-down and perform some bare-metal coding!!
Discuss this post in /r/programming
Further Reading
If you’ve read this far, you might find some of these links useful:
The post How does the 'fixed' keyword work? first appeared on my blog Performance is a Feature!
CodeProject
Wed, 26 Oct 2016, 12:00 am
Adding a verb to the dotnet CLI tooling
The dotnet
CLI tooling comes with several built-in commands such as build
, run
and test
, but it turns out it’s possible to add your own verb to that list.
Arbitrary commands
From Intro to .NET Core CLI - Design
The way the dotnet
driver finds the command it is instructed to run using dotnet {command}
is via a convention; any executable that is placed in the PATH and is named dotnet-{command}
will be available to the driver. For example, when you install the CLI toolchain there will be an executable called dotnet-build
in your PATH; when you run dotnet build
, the driver will run the dotnet-build
executable. All of the arguments following the command are passed to the command being invoked. So, in the invocation of dotnet build --native
, the --native
switch will be passed to dotnet-build
executable that will do some action based on it (in this case, produce a single native binary).
This is also the basics of the current extensibility model of the toolchain. Any executable found in the PATH named in this way, that is as dotnet-{command}
, will be invoked by the dotnet
driver.
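A minimal custom verb is then just a console app whose executable name follows the convention; for example (the `dotnet-hello` name here is hypothetical):

```csharp
using System;

// Compiled to an executable named 'dotnet-hello' and placed on the PATH,
// this would be invoked by running 'dotnet hello <args>'; everything after
// the verb is passed straight through as args.
class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("Hello from a custom dotnet verb!");
        Console.WriteLine("Arguments passed through: {0}", string.Join(" ", args));
    }
}
```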
Fun fact: This means that it’s actually possible to make a dotnet go
command! You just need to make a copy of go.exe
and rename it to dotnet-go.exe
Yay dotnet go
(I know, completely useless, but fun none-the-less)!!
(and yes before you ask, you can also make dotnet dotnet
work, but please don’t do that!!)
With regards to documentation, there’s further information in the ‘Adding a Command’ section of the Developer Guide. Also the source code of the dotnet test
command is a really useful reference and helped me out several times.
Before I go any further I just want to acknowledge the 2 blog posts listed below. They show you how to build a custom command that compresses all the images in the current directory and how to make it available to the dotnet
tooling as a NuGet package:
However they don’t explain how to interact with the current project or access its output. This is what I wanted to do, so this post will pick up where those posts left off.
Any effective dotnet
verb needs to know about the project it is running in and helpfully those kind developers at Microsoft have created some useful classes that will parse and examine a project.json
file (available in the Microsoft.DotNet.ProjectModel NuGet package). It’s pretty simple to work with, just a few lines of code and you’re able to access the entire Project model:
Project project;
var currentDirectory = Directory.GetCurrentDirectory();
if (ProjectReader.TryGetProject(currentDirectory, out project))
{
if (project.Files.SourceFiles.Any())
{
Console.WriteLine("Files:");
foreach (var file in project.Files.SourceFiles)
Console.WriteLine(" {0}", file.Replace(currentDirectory, ""));
}
if (project.Dependencies.Any())
{
Console.WriteLine("Dependencies:");
foreach (var dependency in project.Dependencies)
{
Console.WriteLine(" {0} - Line:{1}, Column:{2}",
dependency.SourceFilePath.Replace(currentDirectory, ""),
dependency.SourceLine,
dependency.SourceColumn);
}
}
...
}
Building a Project
In addition to knowing about the current project, we need to ensure it successfully builds before we can do anything else with it. Fortunately this is also simple thanks to the Microsoft.DotNet.Cli.Utils NuGet package (along with further help from Microsoft.DotNet.ProjectModel
which provides the BuildWorkspace
):
// Create a workspace
var workspace = new BuildWorkspace(ProjectReaderSettings.ReadFromEnvironment());
// Fetch the ProjectContexts
var projectPath = project.ProjectFilePath;
var runtimeIdentifiers =
RuntimeEnvironmentRidExtensions.GetAllCandidateRuntimeIdentifiers();
var projectContexts = workspace.GetProjectContextCollection(projectPath)
.EnsureValid(projectPath)
.FrameworkOnlyContexts
.Select(c => workspace.GetRuntimeContext(c, runtimeIdentifiers))
.ToList();
// Setup the build arguments
var projectContextToBuild = projectContexts.First();
var cmdArgs = new List<string>
{
projectPath,
"--configuration", "Release",
"--framework", projectContextToBuild.TargetFramework.ToString()
};
// Build!!
Console.WriteLine("Building Project for {0}", projectContextToBuild.RuntimeIdentifier);
var result = Command.CreateDotNet("build", cmdArgs).Execute();
Console.WriteLine("Build {0}", result.ExitCode == 0 ? "SUCCEEDED" : "FAILED");
When this runs you get the familiar dotnet build
output if it successfully builds or any error/diagnostic messages if not.
Integrating with BenchmarkDotNet
Now that we know the project has produced an .exe or .dll, we can finally wire-up BenchmarkDotNet and get it to execute the benchmarks for us:
try
{
Console.WriteLine("Running BenchmarkDotNet");
var benchmarkAssemblyPath =
projectContextToBuild.GetOutputPaths(config).RuntimeFiles.Assembly;
var benchmarkAssembly =
AssemblyLoadContext.Default.LoadFromAssemblyPath(benchmarkAssemblyPath);
Console.WriteLine("Successfully loaded: {0}\n", benchmarkAssembly);
var switcher = new BenchmarkSwitcher(benchmarkAssembly);
var summary = switcher.Run(args);
}
catch (Exception ex)
{
Console.WriteLine("Error running BenchmarkDotNet");
Console.WriteLine(ex);
}
Because BenchmarkDotNet is a command-line tool we don’t actually need to do much work. It’s just a case of creating a BenchmarkSwitcher
, giving it a reference to the dll that contains the benchmarks and then passing in the command line arguments. BenchmarkDotNet will then do the rest of the work for us!
However if you need to parse command line arguments yourself I’d recommend re-using the existing helper classes as they make life much easier and will ensure that your tool fits in with the dotnet
tooling ethos.
The final result
Finally, to test it out, we’ll use a simple test app from the BenchmarkDotNet Getting Started Guide, with the following in the project.json file (note the added tools
section):
{
"version": "1.0.0-*",
"buildOptions": {
"emitEntryPoint": true
},
"dependencies": {
"Microsoft.NETCore.App": {
"type": "platform",
"version": "1.0.0-rc2-3002702"
},
"BenchmarkDotNet": "0.9.9"
},
"frameworks": {
"netcoreapp1.0": {
"imports": "dnxcore50"
}
},
"tools": {
"BenchmarkCommand": "1.0.0"
}
}
Then after doing a dotnet restore
, we can finally run our new dotnet benchmark
command:
λ dotnet benchmark --class Md5VsSha256
Building Project - BenchmarkCommandTest
Project BenchmarkCommandTest (.NETCoreApp,Version=v1.0) will be compiled because expected outputs are missing
Compiling BenchmarkCommandTest for .NETCoreApp,Version=v1.0
Compilation succeeded.
0 Warning(s)
0 Error(s)
Time elapsed 00:00:00.9760886
Build SUCCEEDED
Running BenchmarkDotNet
C:\Projects\BenchmarkCommandTest\bin\Release\netcoreapp1.0\BenchmarkCommandTest.dll
Successfully loaded: BenchmarkCommandTest, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
Target type: Md5VsSha256
// ***** BenchmarkRunner: Start *****
// Found benchmarks:
// Md5VsSha256_Sha256
// Md5VsSha256_Md5
// Validating benchmarks:
// **************************
// Benchmark: Md5VsSha256_Sha256
// *** Generate ***
// Result = Success
// BinariesDirectoryPath = C:\Projects\BDN.Auto\binaries
// *** Build ***
// Result = Success
// *** Execute ***
// Launch: 1
// Benchmark Process Environment Information:
// CLR=CORE, Arch=64-bit ? [RyuJIT]
// GC=Concurrent Workstation
...
If you’ve used BenchmarkDotNet before you’ll recognise its output, if not, its output is all the lines starting with //
. A final note, currently the Console colours from the command aren’t displayed, but that should be fixed sometime soon, which is great because BenchmarkDotNet looks way better in full-colour!!
Discuss this post in /r/csharp
The post Adding a verb to the dotnet CLI tooling first appeared on my blog Performance is a Feature!
Mon, 3 Oct 2016, 12:00 am
Optimising LINQ
What’s the problem with LINQ?
As outlined by Joe Duffy, LINQ introduces inefficiencies in the form of hidden allocations, from The ‘premature optimization is evil’ myth:
To take an example of a technology that I am quite supportive of, but that makes writing inefficient code very easy, let’s look at LINQ-to-Objects. Quick, how many inefficiencies are introduced by this code?
int[] Scale(int[] inputs, int lo, int hi, int c) {
    var results = from x in inputs
                  where (x >= lo) && (x <= hi)
                  select (x * c);
    return results.ToArray();
}
Thu, 29 Sep 2016, 12:00 am
Compact strings in the CLR
In the CLR strings are stored as a sequence of UTF-16 code units, i.e. an array of char
items. So if we have the string ‘testing’, in memory it looks like this:
But look at all those zero’s, wouldn’t it be more efficient if it could be stored like this instead?
Now this is a contrived example, clearly not all strings are simple ASCII
text that can be compacted this way. Also, even though I’m an English speaker, I’m well aware that there are other languages with character sets that can only be expressed in Unicode
. However it turns out that even in a fully internationalised modern web-application, there are still a large number of strings that could be expressed as ASCII
, such as:
So there is still an overall memory saving if the CLR provided an implementation that stored some strings in a more compact encoding that only takes 1 byte per character (ASCII
or even ISO-8859-1 (Latin-1)
) and the rest as Unicode
(2 bytes per character).
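The potential saving is easy to see with the built-in encoders; for an ASCII-only string the UTF-16 representation needs exactly twice the bytes:

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        string s = "testing";
        int utf16Bytes = Encoding.Unicode.GetByteCount(s); // 2 bytes per char, as the CLR stores it
        int asciiBytes = Encoding.ASCII.GetByteCount(s);   // 1 byte per char in a compact encoding
        Console.WriteLine("{0} vs {1}", utf16Bytes, asciiBytes); // prints "14 vs 7"
    }
}
```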
Aside: If you are wondering “Why does C# use UTF-16 for strings?” Eric Lippert has a great post on this exact subject and Jon Skeet has something interesting to say about the subject in “Of Memory and Strings”
Real-world data
In theory this is all well and good, but what about in practice, what about a real-world example?
Well Nick Craver, a developer at Stack Overflow, was kind enough to run my Heap Analyser tool on one of their memory dumps:
.NET Memory Dump Heap Analyser - created by Matt Warren - github.com/mattwarren
Found CLR Version: v4.6.1055.00
...
Overall 30,703,367 "System.String" objects take up 4,320,235,704 bytes (4,120.10 MB)
Of this underlying byte arrays (as Unicode) take up 3,521,948,162 bytes (3,358.79 MB)
Remaining data (object headers, other fields, etc) is 798,287,542 bytes (761.31 MB), at 26 bytes per object
Actual Encoding that the "System.String" could be stored as (with corresponding data size)
3,347,868,352 bytes are ASCII
5,078,902 bytes are ISO-8859-1 (Latin-1)
169,000,908 bytes are Unicode (UTF-16)
Total: 3,521,948,162 bytes (expected: 3,521,948,162)
Compression Summary:
1,676,473,627 bytes Compressed (to ISO-8859-1 (Latin-1))
169,000,908 bytes Uncompressed (as Unicode/UTF-16)
30,703,367 bytes EXTRA to enable compression (one byte field, per "System.String" object)
Total: 1,876,177,902 bytes, compared to 3,521,948,162 before compression
(The full output is available)
Here we can see that there are over 30 million strings in memory, taking up 4,120 MB out of a total heap size of 13,232 MB (just over 30%).
Furthermore we can see that the raw data used by the strings (excluding the CLR Object headers) takes up 3,358 MB when encoded as Unicode
. However if the relevant strings were compacted to ASCII
/Latin-1
only 1,789 MB would be needed to store them, a pretty impressive saving!
A proposal for compact strings in the CLR
I learnt about the idea of “Compact Strings” when reading about how they were implemented in Java and so I put together a proposal for an implementation in the CLR (isn’t .NET OSS Great!!).
Turns out that Vance Morrison (Performance Architect on the .NET Runtime Team) has been thinking about the same idea for quite a while:
To answer @mattwarren question on whether changing the internal representation of a string has been considered before, the short answer is YES. In fact it has been a pet desire of mine for probably over a decade now.
He also confirmed that they’ve done their homework and found that a significant amount of strings could be compacted:
What was clear now and has held true for quite sometime is that:
Typical apps have 20% of their GC heap as strings. Most of the 16 bit characters have 0 in their upper byte. Thus you can save 10% of typical heaps by encoding in various ways that eliminate these pointless upper bytes.
It’s worth reading his entire response if you are interested in the full details of the proposal, including the trade-offs, benefits and drawbacks.
Implementation details
At a high-level the proposal would allow strings to be stored in 2 formats:
- Regular - i.e. Unicode encoded, as they are currently stored by the CLR
- Compact - ASCII, ISO-8859-1 (Latin-1) or even another format
When you create a string, the constructor would determine the most efficient encoding and encode the data in that format. The format used would then be stored in a field, so that the encoding is always known (CLR strings are immutable). That means that each method within the string class can use this field to determine how it operates, for instance the pseudo-code for the Equals
method is shown below:
public bool Equals(string other)
{
if (this.type != other.type)
return false;
if (type == ASCII)
return StringASCII.Equals(this, other);
else
return StringLatinUTF16.Equals(this, other);
}
This shows a nice property of having strings in two formats; some operations can be short-circuited, because we know that strings stored in different encodings won’t be the same.
Advantages
Disadvantages
- Makes some operations slower due to the extra
if (type == ...)
check needed
- Breaks the
fixed
keyword, as well as COM and P/Invoke interop that relies on the current string layout/format
- If very few strings in the application can be compacted, this will have an overhead for no gain
Next steps
In his reply Vance Morrison highlighted that solving the issue with the fixed
keyword was a first step, because that has a hard dependency on the current string layout. Once that’s done the real work of making large, sweeping changes to the CLR can be done:
The main challenge is dealing with fixed, but there is also frankly at least a few man-months of simply dealing with the places in the runtime where we took a dependency on the layout of string (in the runtime, interop, and things like stringbuilder, and all the uses of ‘fixed’ in corefx).
Thus it IS doable, but it is at least moderately expensive (man months), and the payoff is non-trivial but not huge.
So stay tuned, one day we might have a more compact, more efficient implementation of strings in the CLR, yay!!
Further Reading
Discuss this post on /r/programming
Mon, 19 Sep 2016, 12:00 am
Subverting .NET Type Safety with 'System.Runtime.CompilerServices.Unsafe'
In which we use System.Runtime.CompilerServices.Unsafe
, a generic API (“type-safe” but still “unsafe”), and mess with the C# Type System!
The post covers the following topics:
What it is and why it’s useful
The XML documentation comments for System.Runtime.CompilerServices.Unsafe
state that it:
Contains generic, low-level functionality for manipulating pointers.
But we can get a better understanding of what it is by looking at the actual API definition from the current NuGet package (4.0.0):
// Contains generic, low-level functionality for manipulating pointers.
public static class Unsafe
{
// Casts the given object to the specified type.
public static T As<T>(object o) where T : class
// Returns a pointer to the given by-ref parameter.
public static void* AsPointer<T>(ref T value);
// Copies a value of type T to the given location.
public static void Copy<T>(void* destination, ref T source);
// Copies a value of type T to the given location.
public static void Copy<T>(ref T destination, void* source);
// Copies bytes from the source address to the destination address.
public static void CopyBlock(void* destination, void* source, uint byteCount);
// Initializes a block of memory at the given location with a given initial value.
public static void InitBlock(void* startAddress, byte value, uint byteCount);
// Reads a value of type T from the given location.
public static T Read<T>(void* source);
// Returns the size of an object of the given type parameter.
public static int SizeOf<T>();
// Writes a value of type T to the given location.
public static void Write<T>(void* destination, T value);
}
Note: I edited the XML doc-comments for brevity, the full versions are available in the source. There are also some additional methods that have been added to the API, but to make use of them you have to use a version of the C# compiler with support for ref returns and locals.
However this doesn’t really tell us why it’s useful; to get some background on that we can look at the GitHub issue “Provide a generic API to read from and write to a pointer”:
So at a high-level the goals of the System.Runtime.CompilerServices.Unsafe
library are to:
- Provide a safer way of writing low-level
unsafe
code
- Without this library you have to resort to
fixed
and pointer manipulation, which can be error prone
- Allow access to functionality that can’t be expressed in C#, but is possible in IL
- Save developers from having to repeatedly write the same
unsafe
code
It’s also worth pointing out that the library is primarily for use with a Value Type (int, float, etc) rather than a class
or Reference type. You can use it with classes, however you have to pin them first, so they don’t move about in memory whilst you are working with the pointer.
Update: It was pointed out to me that Niels wrote an initial implementation of this library in a separate project, before Microsoft made their own version.
How it works
Because the library allows access to functionality that can’t be expressed in C#, it has to be written in raw IL, which is then compiled by a custom build-step. As an example we will look at the AsPointer
method, which has the following signature:
public static void* AsPointer<T>(ref T value)
The IL for this is shown below, note how the ref
keyword becomes &
in IL and the generic type parameter T
is expressed as !!T
:
.method public hidebysig static void* AsPointer<T>(!!T& 'value') cil managed aggressiveinlining
{
.custom instance void System.Runtime.Versioning.NonVersionableAttribute::.ctor() = ( 01 00 00 00 )
.maxstack 1
ldarg.0
conv.u
ret
} // end of method Unsafe::AsPointer
Here we can see that it’s making use of the conv.u
IL instruction. For reference the explanation of this, along with some of the other op codes used by the library are shown below:
- Conv_U - Converts the value on top of the evaluation stack to unsigned native int, and extends it to native int.
- Ldobj - Copies the value type object pointed to by an address to the top of the evaluation stack.
- Stobj - Copies a value of a specified type from the evaluation stack into a supplied memory address.
After searching around I found several other places in the .NET Runtime that make use of raw IL in this way:
Code samples
There’s a nice set of unit tests that show the main use-cases for the library, for instance here is how to use Unsafe.Write(..)
to directly change the value of an int
via a pointer.
[Fact]
public static unsafe void WriteInt32()
{
int value = 10;
int* address = (int*)Unsafe.AsPointer(ref value);
int expected = 20;
Unsafe.Write(address, expected);
Assert.Equal(expected, value);
Assert.Equal(expected, *address);
Assert.Equal(expected, Unsafe.Read<int>(address));
}
You can write something similar by manipulating pointers directly, but it’s not as straightforward (unless you are familiar with C or C++)
int value = 10;
int* ptr = &value;
*ptr = 30;
Console.WriteLine(value); // prints "30"
For a more real-world use case, the code below shows how you can access a KeyValuePair
directly as a byte []
(taken from a GitHub discussion):
var dt = new KeyValuePair<int, int>[2];
ref byte asRefByte = ref Unsafe.As<KeyValuePair<int, int>, byte>(ref dt[0]);
fixed (byte * ptr = &asRefByte)
{
// Treat the KeyValuePair as if it were a byte []
...
}
(this example is based on the StackOverflow question: “Get unsafe pointer to array of KeyValuePair in C#”)
Tricks you can do with it
Despite providing you with a nice strongly-typed API, you still have to mark your code as unsafe
, which is a bit of a give-away that you can use it to do things that normal C# can’t!
Breaking immutability
Strings in C# are immutable and the runtime goes to great lengths to ensure you can’t bypass this behaviour. However under-the-hood the String data is just bytes which can be manipulated, indeed the runtime does this manipulation itself inside the StringBuilder
class.
So using Unsafe.Write(..)
we can modify the contents of a String - yay!! However it needs to be pointed out that this code will potentially break the behaviour of the String class in many subtle ways, so don’t ever use it in a real application!!
var text = "ABCDEFGHIJKLMNOPQRSTUVWXKZ";
Console.WriteLine("String Length {0}", text.Length); // prints 26
Console.WriteLine("Text: \"{0}\"", text); // "ABCDEFGHIJKLMNOPQRSTUVWXKZ"
var pinnedText = GCHandle.Alloc(text, GCHandleType.Pinned);
char* textAddress = (char*)pinnedText.AddrOfPinnedObject().ToPointer();
// Make an immutable string think that it is shorter than it actually is!!!
Unsafe.Write(textAddress - 2, 5);
Console.WriteLine("String Length {0}", text.Length); // prints 5
Console.WriteLine("Text: \"{0}\"", text); // prints "ABCDE
// change the 2nd character 'B' to '@'
Unsafe.Write(textAddress + 1, '@');
Console.WriteLine("Text: \"{0}\"", text); // prints "A@CDE
pinnedText.Free();
Messing with the CLR type-system
But we can go even further than that and do a really nasty trick to completely defeat the CLR type-system. This code is horrible and could potentially break the CLR in several ways, so as before don’t ever use it in a real application!!
int intValue = 5;
float floatValue = 5.0f;
object boxedInt = (object)intValue, boxedFloat = (object)floatValue;
var pinnedFloat = GCHandle.Alloc(boxedFloat, GCHandleType.Pinned);
var pinnedInt = GCHandle.Alloc(boxedInt, GCHandleType.Pinned);
int* floatAddress = (int*)pinnedFloat.AddrOfPinnedObject().ToPointer();
int* intAddress = (int*)pinnedInt.AddrOfPinnedObject().ToPointer();
Console.WriteLine("Type: {0}, Value: {1}", boxedInt.GetType().FullName, boxedInt);
// Make an int think it's a float!!!
int floatType = Unsafe.Read<int>(floatAddress - 1);
Unsafe.Write(intAddress - 1, floatType);
Console.WriteLine("Type: {0}, Value: {1}", boxedInt.GetType().FullName, boxedInt);
pinnedFloat.Free();
pinnedInt.Free();
Which prints out:
Type: System.Int32, Value: 5
Type: System.Single, Value: 7.006492E-45
Yep, we’ve managed to convince an int
(Int32) type that it’s actually a float
(Single) and behave like one instead!!
This works by overwriting the Method Table pointer for the int
, with the same value as the float
one. So when it looks up its type or prints out its value, it uses the float
methods instead! Thanks to @Porges for the example that motivated this, his code does the same thing using fixed
instead.
Using it safely
Despite the library requiring you to annotate your code with unsafe
, there are still some safe or maybe more accurately safer ways to use it!
Fortunately one of the main .NET runtime developers provided a nice list of what you can and can’t do:
But as with all unsafe
code, you’re asking the runtime to let you do things that you are normally prevented from doing, things that it normally saves you from, so you have to be careful!
Discuss this post in /r/csharp or /r/programming
The post Subverting .NET Type Safety with 'System.Runtime.CompilerServices.Unsafe' first appeared on my blog Performance is a Feature!
Wed, 14 Sep 2016, 12:00 am
Analysing .NET Memory Dumps with CLR MD
If you’ve ever spent time debugging .NET memory dumps in WinDBG you will be familiar with the commands shown below, which aren’t always the most straight-forward to work with!
However back in May 2013 Microsoft released the CLR MD library, describing it as:
… a set of advanced APIs for programmatically inspecting a crash dump of a .NET program much in the same way as the SOS Debugging Extensions (SOS). It allows you to write automated crash analysis for your applications and automate many common debugger tasks.
This post explores some of the things you can achieve by instead using CLR MD, a C# library which is now available as a NuGet Package. If you’re interested the full source code for all the examples is available.
Getting started with CLR MD
This post isn’t meant to serve as a Getting Started guide; there’s already a great set of Tutorials linked from the project README that serve that purpose:
However we will be looking at what else CLR MD allows you to achieve.
I’ve previously written about the Garbage Collector, so the first thing that we’ll do is see what GC related information we can obtain. The .NET GC creates 1 or more Heaps, depending on the number of CPU cores available and the mode it is running in (Server/Workstation). These heaps are in turn made up of several Segments, for the different Generations (Gen0/Ephemeral, Gen1, Gen2 and Large). Finally it’s worth pointing out that the GC initially Reserves the memory it wants, but only Commits it when it actually needs to. So using the code shown here, we can iterate through the different GC Heaps, printing out the information about their individual Segments as we go:
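That iteration looks roughly like the following (a sketch against the CLR MD API of the era, not the post’s actual code; the dump path is a placeholder):

```csharp
using System;
using System.Linq;
using Microsoft.Diagnostics.Runtime; // the CLR MD NuGet package

class GCHeapInfo
{
    static void Main()
    {
        // Placeholder path - point this at your own memory dump file
        using (DataTarget target = DataTarget.LoadCrashDump(@"C:\Dumps\MyApp.dmp"))
        {
            ClrRuntime runtime = target.ClrVersions.First().CreateRuntime();
            // Walk every segment of every GC heap in the dump
            foreach (ClrSegment segment in runtime.GetHeap().Segments)
            {
                string kind = segment.IsEphemeral ? "Ephemeral (Gen 0/1)"
                            : segment.IsLarge ? "Large Object" : "Gen 2";
                Console.WriteLine("{0} segment: {1:N0} bytes committed, {2:N0} bytes reserved",
                    kind,
                    segment.CommittedEnd - segment.Start,
                    segment.ReservedEnd - segment.CommittedEnd);
            }
        }
    }
}
```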
Analysing String usage
But knowing what’s inside those heaps is more useful, as David Fowler nicely summed up in a tweet, strings often significantly contribute to memory usage:
Now we could analyse the memory dump to produce a list of the most frequently occurring strings, as Nick Craver did with a memory dump from the App Pool of a Stack Overflow server (click for larger image):
However we’re going to look more closely at the actual contents of the strings and in particular analyse what the underlying encoding is, i.e. ASCII
, ISO-8859-1 (Latin-1)
or Unicode
.
By default the .NET string Encoder, instead of giving an error, replaces any characters it can’t convert with ‘�’ (which is known as the Unicode Replacement Character). So we will need to force it to throw an exception. This means we can detect the most compact encoding possible, by trying to convert the raw string data to ASCII
, ISO-8859-1 (Latin-1)
and then Unicode
(sequence of UTF-16 code units) in turn. To see this in action, below is the code from the IsASCII(..)
function:
private static Encoding asciiEncoder = Encoding.GetEncoding(
Encoding.ASCII.EncodingName,
EncoderFallback.ExceptionFallback,
DecoderFallback.ExceptionFallback);
private static bool IsASCII(string text, out byte[] textAsBytes)
{
var unicodeBytes = Encoding.Unicode.GetBytes(text);
try
{
textAsBytes = Encoding.Convert(Encoding.Unicode, asciiEncoder, unicodeBytes);
return true;
}
catch (EncoderFallbackException /*efEx*/)
{
textAsBytes = null;
return false;
}
}
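The Latin-1 check follows the same pattern; here is a hypothetical `IsLatin1` companion, assuming the same exception-fallback trick works for the ISO-8859-1 encoder:

```csharp
using System;
using System.Text;

class EncodingCheck
{
    // Same idea as IsASCII(..): an ISO-8859-1 encoder configured to throw
    // on unmappable characters, instead of silently substituting them.
    private static readonly Encoding latin1Encoder = Encoding.GetEncoding(
        "ISO-8859-1",
        EncoderFallback.ExceptionFallback,
        DecoderFallback.ExceptionFallback);

    public static bool IsLatin1(string text, out byte[] textAsBytes)
    {
        var unicodeBytes = Encoding.Unicode.GetBytes(text);
        try
        {
            textAsBytes = Encoding.Convert(Encoding.Unicode, latin1Encoder, unicodeBytes);
            return true;
        }
        catch (EncoderFallbackException)
        {
            textAsBytes = null;
            return false;
        }
    }

    static void Main()
    {
        byte[] bytes;
        Console.WriteLine(IsLatin1("caf\u00E9", out bytes));    // 'é' fits in Latin-1 -> True
        Console.WriteLine(IsLatin1("\u4E16\u754C", out bytes)); // CJK characters don't -> False
    }
}
```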
Next we run this on a memory dump of Visual Studio with the HeapStringAnalyser source code solution loaded and get the following output:
The most interesting part is reproduced below:
Overall 145,872 "System.String" objects take up 12,391,286 bytes (11.82 MB)
Of this underlying byte arrays (as Unicode) take up 10,349,078 bytes (9.87 MB)
Remaining data (object headers, other fields, etc) are 2,042,208 bytes (1.95 MB), at 14 bytes per object
Actual Encoding that the "System.String" could be stored as (with corresponding data size)
10,339,638 bytes ( 145,505 strings) as ASCII
3,370 bytes ( 65 strings) as ISO-8859-1 (Latin-1)
6,070 bytes ( 302 strings) as Unicode
Total: 10,349,078 bytes
So in this case we can see that out of the 145,872 string objects in memory, 145,505 of them could actually be stored as ASCII
, a further 65 as ISO-8859-1 (Latin-1)
and only 302 need the full Unicode
encoding.
Additional resources
Hopefully this post has demonstrated that CLR MD is a powerful tool, if you want to find out more please refer to the links below:
The post Analysing .NET Memory Dumps with CLR MD first appeared on my blog Performance is a Feature!
CodeProject
Tue, 6 Sep 2016, 12:00 am
Analysing Optimisations in the Wire Serialiser
Recently Roger Johansson wrote a post titled Wire – Writing one of the fastest .NET serializers, describing the optimisations that were implemented to make Wire as fast as possible. He also followed up that post with a set of benchmarks, showing how Wire compares to other .NET serialisers:
Using BenchmarkDotNet, this post will analyse the individual optimisations and show how much faster each change is. For reference, the full list of optimisations in the original blog post are:
- Looking up value serializers by type
- Looking up types when deserializing
- Byte buffers, allocations and GC
- Clever allocations
- Boxing, Unboxing and Virtual calls
- Fast creation of empty objects
Looking up value serializers by type
This optimisation changed code like this:
public ValueSerializer GetSerializerByType(Type type)
{
    ValueSerializer serializer;
    if (_serializers.TryGetValue(type, out serializer))
        return serializer;

    //more code to build custom type serializers.. ignore for now.
}
into this:
public ValueSerializer GetSerializerByType(Type type)
{
    if (ReferenceEquals(type.GetTypeInfo().Assembly, ReflectionEx.CoreAssembly))
    {
        if (type == TypeEx.StringType) //we simply keep a reference to each primitive type
            return StringSerializer.Instance;
        if (type == TypeEx.Int32Type)
            return Int32Serializer.Instance;
        if (type == TypeEx.Int64Type)
            return Int64Serializer.Instance;
        ...
    }
}
So it has replaced a dictionary lookup with an if statement. In addition it is caching the Type instance of known types, rather than calculating them every time. As you can see, the optimisation pays off in some circumstances but not in others, so it’s not a clear win. It depends on where the type is in the list of if statements. If it’s near the beginning (e.g. System.String) it’ll be quicker than if it’s near the end (e.g. System.Byte[]), which makes sense as all the other comparisons have to be done first.
Full benchmark code and results
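To make the comparison concrete, here’s a minimal, self-contained sketch of the two lookup strategies (the serializer values are hypothetical stand-ins for Wire’s ValueSerializer instances):

```csharp
using System;
using System.Collections.Generic;

static class SerializerLookup
{
    // hypothetical stand-ins for Wire's ValueSerializer singletons
    const string StringSerializer = "StringSerializer";
    const string Int32Serializer = "Int32Serializer";

    // original approach: a dictionary lookup
    static readonly Dictionary<Type, string> _serializers = new Dictionary<Type, string>
    {
        { typeof(string), StringSerializer },
        { typeof(int), Int32Serializer },
    };

    public static string ViaDictionary(Type type)
    {
        string serializer;
        _serializers.TryGetValue(type, out serializer);
        return serializer;
    }

    // optimised approach: cached Type references compared in an if-chain,
    // fastest for types near the top, slower for types near the bottom
    static readonly Type StringType = typeof(string);
    static readonly Type Int32Type = typeof(int);

    public static string ViaIfChain(Type type)
    {
        if (type == StringType) return StringSerializer;
        if (type == Int32Type) return Int32Serializer;
        return null; // fall back to the slow path for unknown types
    }
}
```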
Looking up types when deserializing
The 2nd optimisation works by removing all unnecessary memory allocations. It did this by:
- Using a custom struct (value type) rather than a class
- Pre-calculating a hash code once, rather than each time a comparison is needed
- Doing string comparisons with the raw byte[], rather than deserialising to a string
Full benchmark code and results
Note: these results nicely demonstrate how BenchmarkDotNet can show you memory allocations as well as the time taken.
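The three ideas above can be sketched together in one small value type (an illustration of the pattern, not Wire’s actual code):

```csharp
using System;

// A struct key wrapping the raw serialised bytes: the hash code is computed
// once in the constructor, and equality is checked byte-by-byte without
// ever materialising a string
struct RawNameKey : IEquatable<RawNameKey>
{
    readonly byte[] _bytes;
    readonly int _hashCode;

    public RawNameKey(byte[] bytes)
    {
        _bytes = bytes;
        int hash = 17;
        foreach (var b in bytes)
            hash = hash * 31 + b; // simple polynomial hash, calculated up-front
        _hashCode = hash;
    }

    public override int GetHashCode()
    {
        return _hashCode; // no re-computation on every lookup
    }

    public bool Equals(RawNameKey other)
    {
        if (_bytes.Length != other._bytes.Length) return false;
        for (int i = 0; i < _bytes.Length; i++)
            if (_bytes[i] != other._bytes[i]) return false;
        return true;
    }
}
```

Because the struct implements IEquatable&lt;RawNameKey&gt;, EqualityComparer&lt;RawNameKey&gt;.Default can call Equals(..) directly, avoiding the boxing that the default struct comparison would otherwise cause.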
Interestingly they hadn’t actually removed all memory allocations, as the comparisons between OptimisedLookup and OptimisedLookupCustomComparer show. To fix this I sent a P.R. which removes the unnecessary boxing, by using a custom comparer rather than the default struct comparer.
Byte buffers, allocations and GC
Again removing unnecessary memory allocations was key in this optimisation, most of which can be seen in the NoAllocBitConverter. Clearly serialisation spends a lot of time converting from the in-memory representation of an object to the serialised version, i.e. a byte[]. So several tricks were employed to ensure that temporary memory allocations were either removed completely or, if that wasn’t possible, done by re-using a buffer from a pool rather than allocating a new one each time (see “Buffer recycling”).
Full benchmark code and results
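A minimal sketch of the buffer-recycling idea (this is an illustration, not Wire’s actual pool):

```csharp
using System;
using System.Collections.Concurrent;

// Rent a byte[] from a pool rather than allocating a fresh one for every
// write; returning it afterwards means the next caller re-uses the same array
static class BufferPool
{
    static readonly ConcurrentQueue<byte[]> _pool = new ConcurrentQueue<byte[]>();
    const int BufferSize = 4096;

    public static byte[] Rent()
    {
        byte[] buffer;
        // only allocate when the pool is empty
        return _pool.TryDequeue(out buffer) ? buffer : new byte[BufferSize];
    }

    public static void Return(byte[] buffer)
    {
        _pool.Enqueue(buffer);
    }
}
```

On modern .NET the same idea is available out-of-the-box via System.Buffers.ArrayPool&lt;byte&gt;.Shared.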
Clever allocations
This optimisation is perhaps the most interesting, because it’s implemented by creating a custom data structure, tailored to the specific needs of Wire. So, rather than using the default .NET dictionary, they implemented FastTypeUShortDictionary. In essence this data structure optimises for having only 1 item, but falls back to a regular dictionary when it grows larger. To see this in action, here is the code from the TryGetValue(..) method:
public bool TryGetValue(Type key, out ushort value)
{
    switch (_length)
    {
        case 0:
            value = 0;
            return false;
        case 1:
            if (key == _firstType)
            {
                value = _firstValue;
                return true;
            }
            value = 0;
            return false;
        default:
            return _all.TryGetValue(key, out value);
    }
}
Like we’ve seen before, the performance gains aren’t clear-cut. For instance it depends on whether FastTypeUShortDictionary contains the item you are looking for (Hit v Miss), but generally it is faster:
Full benchmark code and results
Boxing, Unboxing and Virtual calls
This optimisation is based on the widely used trick that I imagine almost all .NET serialisers employ. For a serialiser to be generic, it has to be able to handle any type of object that is passed to it. Therefore the first thing it does is use reflection to find the public fields/properties of that object, so that it knows the data it has to serialise. Doing reflection like this time-and-time again gets expensive, so the way to get round it is to do the reflection once and then use dynamic code generation to compile a delegate that you can then call again and again.
If you are interested in how to implement this, see the Wire compiler source or this Stack Overflow question. As shown in the results below, compiling code dynamically is much faster than reflection and only a little bit slower than if you read/write the property directly in C# code:
Full benchmark code and results
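As a sketch of the technique (the names here are hypothetical, this isn’t Wire’s actual compiler code), here’s how you can reflect over a property once and compile a re-usable getter delegate with an expression tree:

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

class Person { public string Name { get; set; } }

static class GetterCompiler
{
    // reflect once, compile once, then invoke the delegate many times
    public static Func<object, object> CompileGetter(Type type, string propertyName)
    {
        PropertyInfo property = type.GetProperty(propertyName);

        // build: (object instance) => (object)((T)instance).Property
        ParameterExpression instance = Expression.Parameter(typeof(object), "instance");
        Expression body = Expression.Convert(
            Expression.Property(Expression.Convert(instance, type), property),
            typeof(object));

        return Expression.Lambda<Func<object, object>>(body, instance).Compile();
    }
}
```

The compiled delegate can then be cached (e.g. per-type) and called repeatedly without paying the reflection cost again.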
Fast creation of empty objects
The final optimisation trick used is also based on dynamic code creation, but this time it is purely dealing with creating empty objects. Again this is something that a serialiser does many times, so any optimisations or savings are worth it.
Basically the benchmark is comparing code like this:
FormatterServices.GetUninitializedObject(type);
with dynamically generated code, based on Expression trees:
var newExpression = ExpressionEx.GetNewExpression(typeToUse);
Func<object> optimisation = Expression.Lambda<Func<object>>(newExpression).Compile();
However this trick only works if the constructor of the type being created is empty, otherwise it has to fall back to the slow version. But as shown in the results below, we can see that the optimisation is a clear win and worth implementing:
Full benchmark code and results
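Putting the two approaches side-by-side in a runnable sketch (the NoCtorArgs type is hypothetical):

```csharp
using System;
using System.Linq.Expressions;
using System.Runtime.Serialization;

class NoCtorArgs { public int Value = 42; } // parameterless ctor runs field initialisers

static class ObjectFactory
{
    // fast path: compile a delegate around Expression.New,
    // only valid when 'type' has a parameterless constructor
    public static Func<object> CompileFactory(Type type)
    {
        return Expression.Lambda<Func<object>>(Expression.New(type)).Compile();
    }

    // fallback: no constructor (or field initialiser) runs at all
    public static object Uninitialized(Type type)
    {
        return FormatterServices.GetUninitializedObject(type);
    }
}
```

Note the observable difference: the compiled factory runs the constructor (so Value is 42), whereas GetUninitializedObject hands back zeroed memory (Value is 0).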
Summary
So it’s obvious that Roger Johansson and Szymon Kulec (who also contributed performance improvements) know their optimisations and as a result they have steadily made the Wire serialiser faster, which makes it an interesting project to learn from.
The post Analysing Optimisations in the Wire Serialiser first appeared on my blog Performance is a Feature!
CodeProject
Tue, 23 Aug 2016, 12:00 am
Preventing .NET Garbage Collections with the TryStartNoGCRegion API
Pauses are a known problem in runtimes that have a Garbage Collector (GC), such as Java or .NET. GC pauses can last several milliseconds, during which your application is blocked or suspended. One way you can alleviate the pauses is to modify your code so that it doesn’t allocate, i.e. so the GC has nothing to do. But this can require a lot of work and you really have to understand the runtime, as many allocations are hidden.
Another technique is to temporarily suspend the GC during a critical region of your code where you don’t want any pauses, and then start it up again afterwards. This is exactly what the TryStartNoGCRegion API (added in .NET 4.6) allows you to do.
From the MSDN docs:
Attempts to disallow garbage collection during the execution of a critical path if a specified amount of memory is available.
TryStartNoGCRegion in Action
To see how the API works, I ran some simple tests using the .NET GC Workstation mode, on a 32-bit CPU. The tests simply call TryStartNoGCRegion and then verify how much memory can be allocated before a collection happens. The code is available if you want to try it out for yourself.
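The basic pattern the tests follow looks like this (a simplified sketch, not the actual test code):

```csharp
using System;
using System.Runtime;

static class NoGCDemo
{
    public static bool RunCriticalSection()
    {
        // ask the GC to reserve 15 MB up-front; returns false if it can't
        if (!GC.TryStartNoGCRegion(15 * 1024 * 1024))
            return false;

        try
        {
            // while GCSettings.LatencyMode == GCLatencyMode.NoGCRegion,
            // allocations below the requested budget won't trigger a collection
            var buffer = new byte[3 * 1024 * 1024];
            GC.KeepAlive(buffer);
            return GCSettings.LatencyMode == GCLatencyMode.NoGCRegion;
        }
        finally
        {
            // hand control back to the GC; throws if the region already ended
            GC.EndNoGCRegion();
        }
    }
}
```

Checking GCSettings.LatencyMode is how the test output above distinguishes the NoGCRegion and Interactive modes.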
Test 1: Regular allocation, TryStartNoGCRegion not called
You can see that a garbage collection happens after the 2nd allocation (indicated by “**”):
Prevent GC: False, Over Allocate: False
Allocated: 3.00 MB, Mode: Interactive, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated: 6.00 MB, Mode: Interactive, Gen0: 1, Gen1: 1, Gen2: 1, **
Allocated: 9.00 MB, Mode: Interactive, Gen0: 1, Gen1: 1, Gen2: 1,
Allocated: 12.00 MB, Mode: Interactive, Gen0: 1, Gen1: 1, Gen2: 1,
Allocated: 15.00 MB, Mode: Interactive, Gen0: 1, Gen1: 1, Gen2: 1,
Test 2: TryStartNoGCRegion(..) with size set to 15MB
Here we see that despite allocating the same amount as in the first test, no garbage collections are triggered during the run.
Prevent GC: True, Over Allocate: False
TryStartNoGCRegion: Size=15 MB (15,360 K or 15,728,640 bytes) SUCCEEDED
Allocated: 3.00 MB, Mode: NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated: 6.00 MB, Mode: NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated: 9.00 MB, Mode: NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated: 12.00 MB, Mode: NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated: 15.00 MB, Mode: NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Test 3: TryStartNoGCRegion(..) with a size of 15MB, but allocating more than 15MB
Finally we see that once we’ve allocated more than the size we asked for, the mode switches from NoGCRegion to Interactive and garbage collections can now happen.
Prevent GC: True, Over Allocate: True
TryStartNoGCRegion: Size=15 MB (15,360 K or 15,728,640 bytes) SUCCEEDED
Allocated: 3.00 MB, Mode: NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated: 6.00 MB, Mode: NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated: 9.00 MB, Mode: NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated: 12.00 MB, Mode: NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated: 15.00 MB, Mode: NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated: 18.00 MB, Mode: NoGCRegion, Gen0: 0, Gen1: 0, Gen2: 0,
Allocated: 21.00 MB, Mode: Interactive, Gen0: 1, Gen1: 1, Gen2: 1, **
Allocated: 24.00 MB, Mode: Interactive, Gen0: 1, Gen1: 1, Gen2: 1,
Allocated: 27.00 MB, Mode: Interactive, Gen0: 2, Gen1: 2, Gen2: 2, **
Allocated: 30.00 MB, Mode: Interactive, Gen0: 2, Gen1: 2, Gen2: 2,
So this shows that at least in the simple test we’ve done, the API works as advertised. As long as you don’t subsequently allocate more memory than you asked for, no Garbage Collections will take place.
Object Size
However there are a few caveats when using TryStartNoGCRegion, the first of which is that you are required to know, up-front, the total size in bytes of the objects you will be allocating. As we’ve seen previously, if you allocate more than totalSize bytes, the No GC Region will no longer be active and it will then be possible for garbage collections to happen.
It’s not straightforward to get the size of an object in .NET; it’s a managed runtime and it tries its best to hide that sort of detail from you. To further complicate matters, it varies depending on the CPU architecture and even the version of the runtime.
But you do have a few options:
- Guess?!
- Search on Stack Overflow
- Start-up WinDbg and use the !objsize command on a memory dump of your process
- Get an estimate using the technique that Jon Skeet proposes
- Use DotNetEx, which relies on inspecting the internal fields of the CLR object
Personally I would go with a variation of 3), use WinDbg, but automate it using the excellent CLRMD C# library.
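Option 4, the estimation technique, can be sketched as follows; note that this is a rough heuristic, not an exact measurement:

```csharp
using System;

static class SizeEstimator
{
    // Allocate many instances and divide the growth of the managed heap by the
    // count; full collections before and after reduce (but don't eliminate) noise
    public static long Estimate(Func<object> factory, int count = 100000)
    {
        var keepAlive = new object[count]; // allocated before the baseline measurement
        long before = GC.GetTotalMemory(forceFullCollection: true);
        for (int i = 0; i < count; i++)
            keepAlive[i] = factory();
        long after = GC.GetTotalMemory(forceFullCollection: true);
        GC.KeepAlive(keepAlive);
        return (after - before) / count;
    }
}
```

For example, estimating a bare object should come out close to the minimum object size (12 bytes on 32-bit, 24 bytes on 64-bit).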
Segment Size
Update: It turns out that I completely missed the section on segment sizes on the MSDN page, thanks to Maoni for pointing this out to me. In the section on “Generations” there is the following chart (which fortunately correlates with my findings below):
However even when you know how many bytes will be allocated within the No GC Region, you still need to ensure that it’s less than the maximum amount allowed, because if you specify a value too large an ArgumentOutOfRangeException is thrown. From the MSDN docs (emphasis mine):
The amount of memory in bytes to allocate without triggering a garbage collection. It must be less than or equal to the size of an ephemeral segment. For information on the size of an ephemeral segment, see the “Ephemeral generations and segments” section in the Fundamentals of Garbage Collection article.
However if you visit the linked article on GC Fundamentals, it has no exact figure for the size of an ephemeral segment; it does however have this stark warning:
Important
The size of segments allocated by the garbage collector is implementation-specific and is subject to change at any time, including in periodic updates. Your app should never make assumptions about or depend on a particular segment size, nor should it attempt to configure the amount of memory available for segment allocations.
Excellent, that’s very helpful!?
So let me get this straight: to prevent TryStartNoGCRegion from throwing an exception, we have to pass in a totalSize value that isn’t larger than the size of an ephemeral segment, but we’re not allowed to know the actual value of an ephemeral segment, in case we assume too much!!
So where does that leave us?
Well fortunately it’s possible to figure out the size of an ephemeral or Small Object Heap (SOH) segment using either VMMap, or the previously mentioned CLRMD library (code sample available).
Here are the results I got with the .NET Framework 4.6.1, running on a 4 Core (HT) - Intel® Core™ i7-4800MQ, i.e. Environment.ProcessorCount = 8. If you click on the links for each row heading, you can see the full breakdown as reported by VMMap.
GC Mode       CPU Arch   SOH Segment   LOH Segment   Initial GC Size   Largest No GC Region totalSize value
Workstation   32-bit     16 MB         16 MB         32 MB             16 MB
Workstation   64-bit     256 MB        128 MB        384 MB            244 MB
Server        32-bit     32 MB         16 MB         384 MB            256 MB
Server        64-bit     2,048 MB      256 MB        18,423 MB         16,384 MB
The final column is the largest totalSize value that can be passed into TryStartNoGCRegion(long totalSize); this was found by experimentation/trial-and-error.
Note: The main difference between Server and Workstation is that in Workstation mode there is only one heap, whereas in Server mode there is one heap per logical CPU.
TryStartNoGCRegion under-the-hood
What’s nice is that the entire feature is in a single Github commit, so it’s easy to see what code changes were made:
Around half of the files modified (listed below) are the changes needed to set up the plumbing and error handling involved in adding an API to the System.GC class. They also give an interesting overview of what’s involved in having the external C# code talk to the internal C++ code in the CLR (click on a link to go directly to the diff):
The rest of the changes are where the actual work takes place, with all the significant heavy-lifting happening in gc.cpp:
TryStartNoGCRegion Implementation
When you call TryStartNoGCRegion the following things happen:
- The maximum required heap sizes are calculated, based on the totalSize parameter passed in. These calculations take place in gc_heap::prepare_for_no_gc_region
- If the current heaps aren’t large enough to accommodate the new value, they are re-sized. To achieve this a full collection is triggered (see GCHeap::StartNoGCRegion)
Note: Due to the way the GC uses segments, it won’t always allocate memory. It will however ensure that it reserves the maximum amount of memory required, so that it can be committed when actually needed.
Then the next time the GC wants to perform a collection it checks:
- Is the current mode set to No GC Region?
- Can we stay in the No GC Region mode? This is done by calling gc_heap::should_proceed_for_no_gc(), which performs a sanity-check to ensure that we haven’t allocated more than the # of bytes we asked for when TryStartNoGCRegion was set up
If 1) and 2) are both true then a collection does not take place, because the GC knows that it has already reserved enough memory to fulfil future allocations, so it doesn’t need to clean up any existing garbage to make space.
Further Reading:
The post Preventing .NET Garbage Collections with the TryStartNoGCRegion API first appeared on my blog Performance is a Feature!
CodeProject
Tue, 16 Aug 2016, 12:00 am
GC Pauses and Safe Points
GC pauses are a popular topic; if you do a Google search, you’ll see lots of articles explaining how to measure and, more importantly, how to reduce them. The issue is that in most runtimes that have a GC, allocating objects is a quick operation, but at some point in time the GC will need to clean up all the garbage, and to do this it has to pause the entire runtime (unless you happen to be using Azul’s pauseless GC for Java).
The GC needs to pause the entire runtime so that it can move objects around as part of its compaction phase. If these objects were being referenced by code that was simultaneously executing then all sorts of bad things would happen. So the GC can only make these changes when it knows that no other code is running, hence the need to pause the entire runtime.
GC Flow
In a previous post I demonstrated how you can use ETW Events to visualise what the .NET Garbage Collector (GC) is doing. That post included the following GC flow for a Foreground/Blocking Collection (info taken from the excellent blog post by Maoni Stephens the main developer on the .NET GC):
GCSuspendEE_V1
GCSuspendEEEnd_V1
Mon, 8 Aug 2016, 12:00 am
How the dotnet CLI tooling runs your code
Just over a week ago the official 1.0 release of .NET Core was announced, the release includes:
the .NET Core runtime, libraries and tools and the ASP.NET Core libraries.
However alongside a completely new, revamped, xplat version of the .NET runtime, the development experience has been changed, with the dotnet based tooling now available (Note: the tooling itself is currently still in preview and it’s expected to be RTM later this year).
So you can now write:
dotnet new
dotnet restore
dotnet run
and at the end you’ll get the following output:
Hello World!
It’s the dotnet CLI (Command Line Interface) tooling that is the focus of this post, and more specifically how it actually runs your code, although if you want a tl;dr version see this tweet from @citizenmatt:
Traditional way of running .NET executables
As a brief reminder, .NET executables can’t be run directly (they’re just IL, not machine code), therefore the Windows OS has always needed to do a few tricks to execute them, from CLR via C#:
After Windows has examined the EXE file’s header to determine whether to create a 32-bit process, a 64-bit process, or a WoW64 process, Windows loads the x86, x64, or IA64 version of MSCorEE.dll into the process’s address space.
…
Then, the process’ primary thread calls a method defined inside MSCorEE.dll. This method initializes the CLR, loads the EXE assembly, and then calls its entry point method (Main). At this point, the managed application is up and running.
New way of running .NET executables
dotnet run
So how do things work now that we have the new CoreCLR and the CLI tooling? Firstly, to understand what is going on under-the-hood, we need to set a few environment variables (COREHOST_TRACE and DOTNET_CLI_CAPTURE_TIMING) so that we get a more verbose output:
Here, amongst all the pretty ASCII-art, we can see that dotnet run actually executes the following cmd:
dotnet exec --additionalprobingpath C:\Users\matt\.nuget\packages c:\dotnet\bin\Debug\netcoreapp1.0\myapp.dll
Note: this is what happens when running a Console Application. The CLI tooling supports other scenarios, such as self-hosted web sites, which work differently.
dotnet exec and corehost
Up to this point everything was happening within managed code, however once dotnet exec is called we jump over to unmanaged code within the corehost application. In addition several other .dlls are loaded, the last of which is the CoreCLR runtime itself (click to go to the main source file for each module):
The main task that the corehost application performs is to calculate and locate all the dlls needed to run the application, along with their dependencies. The full output is available, but in summary it processes:
There are so many individual files because the CoreCLR operates on a “pay-for-play” model, from Motivation Behind .NET Core:
By factoring the CoreFX libraries and allowing individual applications to pull in only those parts of CoreFX they require (a so-called “pay-for-play” model), server-based applications built with ASP.NET 5 can minimize their dependencies.
Finally, once all the housekeeping is done, control is handed off by corehost, but not before the following properties are set to control the execution of the CoreCLR itself:
- TRUSTED_PLATFORM_ASSEMBLIES =
- Paths to 235 .dlls (99 managed, 136 native), from
C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.0.0-rc2-3002702
- APP_PATHS =
c:\dotnet\bin\Debug\netcoreapp1.0
- APP_NI_PATHS =
c:\dotnet\bin\Debug\netcoreapp1.0
- NATIVE_DLL_SEARCH_DIRECTORIES =
C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.0.0-rc2-3002702
c:\dotnet\bin\Debug\netcoreapp1.0
- PLATFORM_RESOURCE_ROOTS =
c:\dotnet\bin\Debug\netcoreapp1.0
C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.0.0-rc2-3002702
- AppDomainCompatSwitch =
UseLatestBehaviorWhenTFMNotSpecified
- APP_CONTEXT_BASE_DIRECTORY =
c:\dotnet\bin\Debug\netcoreapp1.0
- APP_CONTEXT_DEPS_FILES =
c:\dotnet\bin\Debug\netcoreapp1.0\dotnet.deps.json
C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.0.0-rc2-3002702\Microsoft.NETCore.App.deps.json
- FX_DEPS_FILE =
C:\Program Files\dotnet\shared\Microsoft.NETCore.App\1.0.0-rc2-3002702\Microsoft.NETCore.App.deps.json
Note: You can also run your app by invoking corehost.exe directly with the following command:
corehost.exe C:\dotnet\bin\Debug\netcoreapp1.0\myapp.dll
Executing a .NET Assembly
At last we get to the point at which the .NET dll/assembly is loaded and executed, via the code shown below, taken from unixinterface.cpp:
hr = host->SetStartupFlags(startupFlags);
IfFailRet(hr);
hr = host->Start();
IfFailRet(hr);
hr = host->CreateAppDomainWithManager(
appDomainFriendlyNameW,
// Flags:
// APPDOMAIN_ENABLE_PLATFORM_SPECIFIC_APPS
// - By default CoreCLR only allows platform neutral assembly to be run. To allow
// assemblies marked as platform specific, include this flag
//
// APPDOMAIN_ENABLE_PINVOKE_AND_CLASSIC_COMINTEROP
// - Allows sandboxed applications to make P/Invoke calls and use COM interop
//
// APPDOMAIN_SECURITY_SANDBOXED
// - Enables sandboxing. If not set, the app is considered full trust
//
// APPDOMAIN_IGNORE_UNHANDLED_EXCEPTION
// - Prevents the application from being torn down if a managed exception is unhandled
//
APPDOMAIN_ENABLE_PLATFORM_SPECIFIC_APPS |
APPDOMAIN_ENABLE_PINVOKE_AND_CLASSIC_COMINTEROP |
APPDOMAIN_DISABLE_TRANSPARENCY_ENFORCEMENT,
NULL, // Name of the assembly that contains the AppDomainManager implementation
NULL, // The AppDomainManager implementation type name
propertyCount,
propertyKeysW,
propertyValuesW,
(DWORD *)domainId);
This is making use of the ICLRRuntimeHost Interface, which is part of the COM based hosting API for the CLR. Despite the file name, it is actually from the Windows version of the CLI tooling. In the xplat world of the CoreCLR the hosting API that was originally written for Unix has been replicated across all the platforms so that a common interface is available for any tools that want to use it, see the following GitHub issues for more information:
And that’s it, your .NET code is now running, simple really!!
The post How the dotnet CLI tooling runs your code first appeared on my blog Performance is a Feature!
CodeProject
Mon, 4 Jul 2016, 12:00 am
Visualising the .NET Garbage Collector
As part of an ongoing attempt to learn more about how a real-life Garbage Collector (GC) works (see part 1) and after being inspired by Julia Evans’ excellent post gzip + poetry = awesome, I spent some time writing a tool to enable a live visualisation of the .NET GC in action.
The output from the tool is shown below, click to Play/Stop (direct link to gif). The full source is available if you want to take a look.
Capturing GC Events in .NET
Fortunately there is a straight-forward way to capture the raw GC related events, using the excellent TraceEvent library that provides a wrapper over the underlying ETW Events the .NET GC outputs.
It’s as simple as writing code like this:
session.Source.Clr.GCAllocationTick += allocationData =>
{
if (ProcessIdsUsedInRuns.Contains(allocationData.ProcessID) == false)
return;
totalBytesAllocated += allocationData.AllocationAmount;
Console.Write(".");
};
Here we are wiring up a callback that fires each time a GCAllocationTick event is received; other events that are available include GCStart, GCEnd, GCSuspendEEStart, GCRestartEEStart and many more.
As well as outputting a visualisation of the raw events, they are also aggregated so that a summary can be produced:
Memory Allocations:
1,065,720 bytes currently allocated
1,180,308,804 bytes have been allocated in total
GC Collections:
16 in total (12 excluding B/G)
2 - generation 0
9 - generation 1
1 - generation 2
4 - generation 2 (B/G)
Time in GC: 1,300.1 ms (108.34 ms avg)
Time under test: 3,853 ms (33.74 % spent in GC)
Total GC Pause time: 665.9 ms
Largest GC Pause: 75.99 ms
GC Pauses
Most of the visualisation and summary information is relatively easy to calculate, however the timings for the GC pauses are not always straight-forward. Since .NET 4.5 the Server GC has had 2 main modes available: the new Background GC mode and the existing Foreground/Non-Concurrent one. The .NET Workstation GC has had a Background GC mode since .NET 4.0 and a Concurrent mode before that.
The main benefit of the Background mode is that it reduces GC pauses, or more specifically it reduces the time that the GC has to suspend all the user threads running inside the CLR. The problem with these “stop-the-world” pauses, as they are also known, is that during this time your application can’t continue with whatever it was doing and if the pauses last long enough users will notice.
As you can see in the image below (courtesy of the .NET Blog), with the newer Background mode in .NET 4.5 the time during which user-threads are suspended is much smaller (the dark blue arrows). They only need to be suspended for part of the GC process, not the entire duration.
Foreground (Blocking) GC flow
So calculating the pauses for a Foreground GC (this means all Gen 0/1 GCs and full blocking GCs) is relatively straightforward, using the info from the excellent blog post by Maoni Stephens the main developer on the .NET GC:
GCSuspendEE_V1
GCSuspendEEEnd_V1
Mon, 20 Jun 2016, 12:00 am
Strings and the CLR - a Special Relationship
Strings and the Common Language Runtime (CLR) have a special relationship, but it’s a bit different (and way less political) than the UK US special relationship that is often talked about.
This relationship means that Strings can do things that aren’t possible in the C# code that you and I can write and they also get a helping hand from the runtime to achieve maximum performance, which makes sense when you consider how ubiquitous they are in .NET applications.
String layout in memory
Firstly strings differ from any other data type in the CLR (other than arrays) in that their size isn’t fixed. Normally the .NET GC knows the size of an object when it’s being allocated, because it’s based on the size of the fields/properties within the object and they don’t change. However in .NET a string object doesn’t contain a pointer to the actual string data stored elsewhere on the heap. Instead that raw data, the actual bytes that make up the text, is contained within the string object itself. That means that the memory representation of a string looks like this:
The benefit is that this gives excellent memory locality and ensures that when the CLR wants to access the raw string data it doesn’t have to do another pointer lookup. For more information, see the Stack Overflow questions “Where does .NET place the String value?” and Jon Skeet’s excellent post on strings.
Whereas if you were to implement your own string class, like so:
public class MyString
{
    int Length;
    byte[] Data;
}
It would look like this in memory:
In this case, the actual string data would be held in the byte[], located elsewhere in memory, and would therefore require a pointer reference and lookup to locate it.
This is summarised nicely in the excellent BOTR, in the mscorlib section:
The managed mechanism for calling into native code must also support the special managed calling convention used by String’s constructors, where the constructor allocates the memory used by the object (instead of the typical convention where the constructor is called after the GC allocates memory).
Implemented in un-managed code
Despite the String class being a managed C# source file, large parts of it are implemented in un-managed code, that is in C++ or even assembly. For instance there are 15 methods in String.cs that have no method body and are marked as extern with [MethodImplAttribute(MethodImplOptions.InternalCall)] applied to them. This indicates that their implementations are provided elsewhere by the runtime. Again from the mscorlib section of the BOTR (emphasis mine):
We have two techniques for calling into the CLR from managed code. FCall allows you to call directly into the CLR code, and provides a lot of flexibility in terms of manipulating objects, though it is easy to cause GC holes by not tracking object references correctly. QCall allows you to call into the CLR via the P/Invoke, and is much harder to accidentally mis-use than FCall. FCalls are identified in managed code as extern methods with the MethodImplOptions.InternalCall bit set. QCalls are static extern methods that look like regular P/Invokes, but to a library called “QCall”.
Types with a Managed/Unmanaged Duality
A consequence of Strings being implemented in unmanaged and managed code is that they have to be defined in both and those definitions must be kept in sync:
Certain managed types must have a representation available in both managed & native code. You could ask whether the canonical definition of a type is in managed code or native code within the CLR, but the answer doesn’t matter – the key thing is they must both be identical. This will allow the CLR’s native code to access fields within a managed object in a very fast, easy to use manner. There is a more complex way of using essentially the CLR’s equivalent of Reflection over MethodTables & FieldDescs to retrieve field values, but this probably doesn’t perform as well as you’d like, and it isn’t very usable. For commonly used types, it makes sense to declare a data structure in native code & attempt to keep the two in sync.
So in String.cs we can see:
//NOTE NOTE NOTE NOTE
//These fields map directly onto the fields in an EE StringObject.
//See object.h for the layout.
[NonSerialized]private int m_stringLength;
[NonSerialized]private char m_firstChar;
Which corresponds to the following in object.h
private:
DWORD m_StringLength;
WCHAR m_Characters[0];
Fast String Allocations
In a typical .NET program, one of the most common ways that you would allocate strings dynamically is via either StringBuilder or String.Format (which uses StringBuilder under the hood).
So you may have some code like this:
var builder = new StringBuilder();
...
builder.Append(valueX);
...
builder.Append("Some text");
...
var text = builder.ToString();
or
var text = string.Format("{0}, {1}", valueX, valueY);
Then, when the StringBuilder ToString() method is called, it internally calls FastAllocateString on the String class, which is declared like so:
[System.Security.SecurityCritical] // auto-generated
[MethodImplAttribute(MethodImplOptions.InternalCall)]
internal extern static String FastAllocateString(int length);
This method is marked as extern and has the [MethodImplAttribute(MethodImplOptions.InternalCall)] attribute applied, and as we saw earlier this implies it will be implemented in un-managed code by the CLR. It turns out that the call stack eventually ends up in a hand-written assembly function, called AllocateStringFastMP_InlineGetThread, from JitHelpers_InlineGetThread.asm.
This also shows something else we talked about earlier: the assembly code is actually allocating the memory needed for the string, based on the required length that was passed in by the calling code.
LEAF_ENTRY AllocateStringFastMP_InlineGetThread, _TEXT
; We were passed the number of characters in ECX
; we need to load the method table for string from the global
mov r9, [g_pStringClass]
; Instead of doing elaborate overflow checks, we just limit the number of elements
; to (LARGE_OBJECT_SIZE - 256)/sizeof(WCHAR) or less.
; This will avoid all overflow problems, as well as making sure
; big string objects are correctly allocated in the big object heap.
cmp ecx, (ASM_LARGE_OBJECT_SIZE - 256)/2
jae OversizedString
mov edx, [r9 + OFFSET__MethodTable__m_BaseSize]
; Calculate the final size to allocate.
; We need to calculate baseSize + cnt*2,
; then round that up by adding 7 and anding ~7.
lea edx, [edx + ecx*2 + 7]
and edx, -8
PATCHABLE_INLINE_GETTHREAD r11, AllocateStringFastMP_InlineGetThread__PatchTLSOffset
mov r10, [r11 + OFFSET__Thread__m_alloc_context__alloc_limit]
mov rax, [r11 + OFFSET__Thread__m_alloc_context__alloc_ptr]
add rdx, rax
cmp rdx, r10
ja AllocFailed
mov [r11 + OFFSET__Thread__m_alloc_context__alloc_ptr], rdx
mov [rax], r9
mov [rax + OFFSETOF__StringObject__m_StringLength], ecx
ifdef _DEBUG
call DEBUG_TrialAllocSetAppDomain_NoScratchArea
endif ; _DEBUG
ret
OversizedString:
AllocFailed:
jmp FramedAllocateString
LEAF_END AllocateStringFastMP_InlineGetThread, _TEXT
There is also a less optimised version called AllocateStringFastMP, from JitHelpers_Slow.asm. The reason for the different versions is explained in jitinterfacegen.cpp; at run-time the decision as to which one to use is made depending on the state of the Thread-local storage:
// These are the fastest(?) versions of JIT helpers as they have the code to
// GetThread patched into them that does not make a call.
EXTERN_C Object* JIT_TrialAllocSFastMP_InlineGetThread(CORINFO_CLASS_HANDLE typeHnd_);
EXTERN_C Object* JIT_BoxFastMP_InlineGetThread (CORINFO_CLASS_HANDLE type, void* unboxedData);
EXTERN_C Object* AllocateStringFastMP_InlineGetThread (CLR_I4 cch);
EXTERN_C Object* JIT_NewArr1OBJ_MP_InlineGetThread (CORINFO_CLASS_HANDLE arrayTypeHnd_, INT_PTR size);
EXTERN_C Object* JIT_NewArr1VC_MP_InlineGetThread (CORINFO_CLASS_HANDLE arrayTypeHnd_, INT_PTR size);
// This next set is the fast version that invoke GetThread but is still faster
// than the VM implementation (i.e. the "slow" versions).
EXTERN_C Object* JIT_TrialAllocSFastMP(CORINFO_CLASS_HANDLE typeHnd_);
EXTERN_C Object* JIT_TrialAllocSFastSP(CORINFO_CLASS_HANDLE typeHnd_);
EXTERN_C Object* JIT_BoxFastMP (CORINFO_CLASS_HANDLE type, void* unboxedData);
EXTERN_C Object* JIT_BoxFastUP (CORINFO_CLASS_HANDLE type, void* unboxedData);
EXTERN_C Object* AllocateStringFastMP (CLR_I4 cch);
EXTERN_C Object* AllocateStringFastUP (CLR_I4 cch);
Optimised String Length
The final example of the “special relationship” is shown by how the string Length property is optimised by the run-time. Finding the length of a string is a very common operation, and because .NET strings are immutable it can also be very quick: the value is calculated once and then cached.
As we can see in the comment from String.cs, the CLR ensures that this is true by implementing it in such a way that the JIT can optimise for it:
// Gets the length of this string
//
/// This is a EE implemented function so that the JIT can recognise is specially
/// and eliminate checks on character fetches in a loop like:
/// for(int i = 0; i < str.Length; i++) str[i]
/// The actually code generated for this will be one instruction and will be inlined.
//
// Spec#: Add postcondition in a contract assembly. Potential perf problem.
public extern int Length {
[System.Security.SecuritySafeCritical] // auto-generated
[MethodImplAttribute(MethodImplOptions.InternalCall)]
get;
}
This code is implemented in stringnative.cpp, which in turn calls GetStringLength():
FCIMPL1(INT32, COMString::Length, StringObject* str) {
FCALL_CONTRACT;
FC_GC_POLL_NOT_NEEDED();
if (str == NULL)
FCThrow(kNullReferenceException);
FCUnique(0x11);
return str->GetStringLength();
}
FCIMPLEND
Which is a simple method call that the JIT can inline:
DWORD GetStringLength() { LIMITED_METHOD_DAC_CONTRACT; return( m_StringLength );}
Why have a special relationship?
In one word: performance. Strings are widely used in .NET programs and therefore need to be as optimised, space-efficient and cache-friendly as possible. That’s why the CLR developers have gone to such great lengths to make this happen, including implementing methods in assembly and ensuring that the JIT can optimise code as much as possible.
Interestingly enough, one of the .NET developers recently made a comment about this on a GitHub issue. In response to a query asking why more string functions weren’t implemented in managed code, they said:
We have looked into this in the past and moved everything that could be moved without significant perf loss. Moving more depends on having pretty good managed optimizations for all coreclr architectures.
This makes sense to consider only once RyuJIT or better codegen is available for all architectures that coreclr runs on (x86, x64, arm, arm64).
Discuss this post on Hacker News or /r/programming
The post Strings and the CLR - a Special Relationship first appeared on my blog Performance is a Feature!
CodeProject
Tue, 31 May 2016, 12:00 am
Adventures in Benchmarking - Performance Golf
Recently Nick Craver, one of the developers at Stack Overflow, has been tweeting snippets of code from their source; the other week the following code was posted:
A daily screenshot from the Stack Overflow codebase (checking strings for tokens without allocations). #StackCode pic.twitter.com/sDPqviHgD0
— Nick Craver (@Nick_Craver)
April 20, 2016
This code is an optimised version of what you would normally write, specifically written to ensure that it doesn’t allocate memory. Stack Overflow have previously encountered issues with large pauses caused by the .NET GC, so it appears that, where appropriate, they make a concerted effort to write code that doesn’t needlessly allocate.
I also have to give Nick credit for making me aware of the term “Performance Golf”; I’d heard of Code Golf, but not the Performance version.
Aside: If you want to see the full discussion and the code for all the different entries, take a look at this gist. Also, for a really in-depth explanation of what the fastest version is actually doing, I really recommend checking out Kevin Montrose’s blog post “An Optimisation Exercise”; there are some very cool tricks in there, although by this point he is basically writing C/C++ code rather than anything you would recognise as C#!
In this post I’m not going to concentrate too much on this particular benchmark, but instead I’m going to use it as an example of what I believe a good benchmarking library should provide for you. Full disclaimer, I’m one of the authors of BenchmarkDotNet, so I admit I might be biased!
I think that a good benchmarking tool should offer the following features:
Benchmark Scaffolding
By using BenchmarkDotNet, or indeed any benchmarking tool, you can just get on with the business of actually writing the benchmark and not worry about any of the mechanics of accurately measuring the code. This is important because often when someone has posted an optimisation and accompanying benchmark on Stack Overflow, several of the comments then point out why their measurements are inaccurate or plain wrong.
In the case of BenchmarkDotNet, it’s as simple as adding a [Benchmark]
attribute to the methods that you want to benchmark and then a few lines of code to launch the run:
[Benchmark(Baseline = true)]
public bool StringSplit()
{
var tokens = Value.Split(delimeter);
foreach (var token in tokens)
{
if (token == Match)
return true;
}
return false;
}
static void Main(string[] args)
{
var summary = BenchmarkRunner.Run();
}
It also offers a few more tools for advanced scenarios, for instance you can decorate a field/property with the [Params]
attribute like so:
[Params("Foo;Bar",
"Foo;FooBar;Whatever",
"Bar;blaat;foo",
"blaat;foo;Bar",
"foo;Bar;Blaat",
"foo;FooBar;Blaat",
"Bar1;Bar2;Bar3;Bar4;Bar",
"Bar1;Bar2;Bar3;Bar4;NoMatch",
"Foo;FooBar;Whatever",
"Some;Other;Really;Interesting;Tokens")]
public string Value { get; set; }
and then each benchmark will be run multiple times, with Value set to the different strings. This gives you a really easy way of trying out benchmarks across different inputs. For instance, some methods were consistently fast, whereas others performed badly on inputs that were a worst-case scenario for them.
Diagnose what is going on
If you state that the aim of optimising your code is to “check a string for tokens, without allocations”, you would really like to be able to verify whether that is true. I’ve previously written about how BenchmarkDotNet can give you this information, and in this case we get the following results (click for full-size image):
So you can see that the ContainTokenFransBouma benchmark isn’t allocation-free, which in this scenario is a problem.
Consistent, Reliable and Clear Results
Another important aspect is that you should be able to rely on the results. Part of this is trusting the tool and hopefully people will come to trust BenchmarkDotNet over time.
You should also be able to get clear results. So, as well as providing a text-based results table that you can easily paste into a GitHub issue or Stack Overflow answer, BenchmarkDotNet will produce several graphs using the R statistics and graphing library. Sometimes a wall of text isn’t the easiest thing to interpret, but colourful graphs can help (click for full-size image).
Here we can see that the original ContainsToken code is “slower” in some scenarios (although it’s worth pointing out that the Y-axis is in nanoseconds).
Summary
Would I recommend writing code like any of these optimisations for normal day-to-day scenarios? No.
Without exception the optimised versions of the code are less readable, harder to debug and probably contain more errors. Certainly, by the time you get to the fastest version you are no longer writing recognisable C# code, it’s basically C++/C masquerading as C#.
However, for the purposes of learning, a bit of fun or just because you like a spot of competition, then it’s fine. Just make sure you use a decent tool that lets you get on with the fun part of writing the most optimised code possible!
Mon, 16 May 2016, 12:00 am
Coz: Finding Code that Counts with Causal Profiling - An Introduction
A while ago I came across an interesting and very readable paper titled “Coz: Finding Code that Counts with Causal Profiling” that was presented at SOSP 2015 (where it received a Best Paper Award). This post is my attempt to provide an introduction to Causal Profiling for anyone who doesn’t want to go through the entire paper.
What is “Causal Profiling”
Here’s the explanation from the paper itself:
Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts, and quantifies their potential impact. Causal profiling works by running performance experiments during program execution.
Each experiment calculates the impact of any potential optimization by virtually speeding up code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running that line faster, thus “virtually” speeding it up.
Or if you prefer, below is an image from the paper explaining what it does (click to enlarge)
The key part is that it tries to find the effect that speeding up a given block of code has on the overall running time of the program. But being able to speed up arbitrary pieces of code is very hard, and if the authors could do that they would be better off making lots of money selling code-optimisation tools. So instead of speeding up a given piece of code, they artificially slow down all the other code that is running at the same time, which has exactly the same relative effect.
In the diagram above Coz is trying to determine the effect that optimising the code in block f would have on the overall runtime. Instead of making f run quicker, as shown in part (b), they instead make g run slower by inserting pauses, see part (c). Coz is then able to infer that the speed-up seen in (c) would have the same relative effect if f were to run faster; therefore the “Actual Speedup” shown in (b) is possible.
Unfortunately Coz doesn’t tell you how to speed up your code, that’s left up to you, but it does tell you which parts of the code you should focus on to get the best overall improvements. Or another way of saying it is, Coz tells you:
If you speed up a given block of code by this much, the program will run this much faster
Existing profilers
In the paper, the authors argue that existing profilers only tell you about:
- Frequently executed code (# of calls)
- Code that runs for a long time (% of total time)
What they don’t help you with is finding important code in parallel programs and this is the problem that Coz solves. The (contrived) example they give is:
void a() { // ~6.7 seconds
    for(volatile size_t x=0; x<2000000000; x++) {}
}
Wed, 30 Mar 2016, 12:00 am
Adventures in Benchmarking - Method Inlining
In a previous post I looked at how you can use BenchmarkDotNet to help diagnose why one benchmark is running slower than another. The post outlined how ETW Events are used to give you an accurate measurement of the # of Bytes allocated and the # of GC Collections per benchmark.
Inlining
In addition to memory allocation, BenchmarkDotNet can also give you information about which methods were inlined by the JITter. Inlining is the process by which code is copied from one function (the inlinee) directly into the body of another function (the inliner). The reason for this is to save the overhead of a method call and the associated work that needs to be done when control is passed from one method to another.
To see this in action we are going to run the following benchmark:
[Benchmark]
public int Calc()
{
return WithoutStarg(0x11) + WithStarg(0x12);
}
private static int WithoutStarg(int value)
{
return value;
}
private static int WithStarg(int value)
{
if (value
Wed, 9 Mar 2016, 12:00 am
Adventures in Benchmarking - Memory Allocations
For a while now I’ve been involved in the Open Source BenchmarkDotNet library, along with Andrey Akinshin, the project owner. Our goal has been to produce a .NET Benchmarking library that is:
- Accurate
- Easy-to-use
- Helpful
First and foremost we do everything we can to ensure that BenchmarkDotNet gives you accurate measurements, everything else is just “sprinkles on the sundae”. That is, without accurate measurements, a benchmarking library is pretty useless, especially one that displays results in nanoseconds.
But once point 1) has been dealt with, point 2) is a bit more subjective. Using BenchmarkDotNet involves little more than adding a [Benchmark] attribute to your method and then running it as per the step-by-step guide in the GitHub README. I’ll let you decide if that is easy-to-use or not, but again it’s something we strive for. Once you’re done with the “Getting Started” guide, there is also a complete set of Tutorial Benchmarks available, as well as some more real-world examples for you to take a look at.
Being “Helpful”
But this post isn’t going to be a general BenchmarkDotNet tutorial, instead I’m going to focus on some of the specific tools that it gives you to diagnose what is going on in a benchmark, or to put it another way, to help you answer the question “Why is Benchmark A slower than Benchmark B?”
String Concat vs StringBuilder
Let’s start with a simple benchmark:
public class Framework_StringConcatVsStringBuilder
{
[Params(1, 2, 3, 4, 5, 10, 15, 20)]
public int Loops;
[Benchmark]
public string StringConcat()
{
string result = string.Empty;
for (int i = 0; i
{
if (statsPerProcess.ContainsKey(gcData.ProcessID))
{
var genCounts = statsPerProcess[gcData.ProcessID].GenCounts;
if (gcData.Depth >= 0 && gcData.Depth
Wed, 17 Feb 2016, 12:00 am
Technically Speaking - Anniversary Mentoring
I’ve been reading the excellent Technically Speaking newsletter for a while now, and when they announced they would be running a mentoring program I jumped at the chance and applied straight away. The idea was that each applicant had to set themselves speaking goals or identify areas they wanted to improve, and then, if selected, @techspeakdigest would set you up with a mentor.
I was fortunate enough to be chosen and assigned to Cate, one of the authors of the newsletter, who is also a prolific conference speaker. As part of the scheme I had to identify the areas that I wanted to improve during the hour-long mentoring session, which for me were:
- Turning an outline into a good abstract.
- Tips for getting a talk accepted via a CFP submission
I’ve previously done some talks and they seemed to be well received, but I wanted to expand the range of topics I talked about and try and speak at some other conferences.
Writing a Good Abstract
At the start of the session Cate looked through an existing submission and offered some advice, which started with the initial comment of:
Good idea, not well pitched
She then went on to offer some really great tips about what conferences are looking for and how I could develop my abstract. I’ve put the rest of my notes below and left them as I wrote them down, so they are a bit jumbled, but they reflect what happened during the conversation!
Tips for an abstract (after reading mine):
- Be pragmatic, too much “one true way” can put people off. Maybe a bit too opinionated.
- Don’t tie your talk to just one library, might alienate people too much.
Talk outline/structure
- Explain - what does it mean to write faster code
- Situate - optimisation - what is it? how do you do it? benchmark, etc
- Apply - specific examples
Other suggestions
- If listeners (or the conference organisation committee) agree with your assumptions, they might be more likely to choose your pitch
- Be careful about being too specific in the abstract
- Don’t put too much in the abstract; leave some specifics out
- Be compelling, but a little bit vague
Finally, as well as offering general advice, Cate also took the time to help me re-write an existing abstract I’d put together. I’ve included the “before” and “after” below, so you can see the difference. Whilst it’s hard to see someone pick apart what you’ve written, I do agree that the “after” reads much better and sounds more compelling than the “before”!
Before
Microbenchmarks and Optimisations
We all want to write faster code right, but how do we know it really is faster, how do we measure it correctly?
During this talk we will look at what mistakes to avoid when benchmarking .NET code and how to do it accurately. Along the way we will also discover some surprising code optimisations and explore why they are happening
After
Where the Wild Things Are - Finding Performance Problems Before They Bite You
You don’t want to prematurely optimize, but sometimes you want to optimize, the question is - where to start? Benchmarking can help you figure out what your application is doing and where performance problems could arise - allowing you to find (and fix!) them before your customers do.
If you aren’t already benchmarking your code this talk will offer some starting points. We’ll look at how to accurately benchmark in .NET and things to avoid. Along the way we’ll also discover some surprising code optimisations!
The End Result
After the mentoring with Cate took place I was accepted to talk at ProgSCon London 2016, so obviously the tips and re-write of my abstract made a big difference!!
So thanks to Chiu-Ki Chan and Cate for producing Technically Speaking every week, it’s certainly helped me out!
Tue, 16 Feb 2016, 12:00 am
Learning How Garbage Collectors Work - Part 1
This series is an attempt to learn more about how a real-life “Garbage Collector” (GC) works internally, i.e. not so much “what it does”, but “how it does it” at a low level. I will mostly be concentrating on the .NET GC, because I’m a .NET developer and also because it’s recently been Open Sourced, so we can actually look at the code.
Note: If you do want to learn about what a GC does, I really recommend the talk Everything you need to know about .NET memory by Ben Emmett, it’s a fantastic talk that uses lego to explain what the .NET GC does (the slides are also available)
Well, trying to understand what the .NET GC does by looking at the source was my original plan, but if you go and take a look at the code on GitHub you will be presented with the message “This file has been truncated,…”:
This is because the file is 36,915 lines long and 1.19MB in size! Now before you send a PR to Microsoft that chops it up into smaller bits, you might want to read this discussion on reorganizing gc.cpp. It turns out you are not the only one who’s had that idea and your PR will probably be rejected, for some specific reasons.
Goals of the GC
So, as I’m not going to be able to read and understand a 36 KLOC .cpp source file any time soon, instead I tried a different approach and started off by looking through the excellent Book-of-the-Runtime (BOTR) section on the “Design of the Collector”. This very helpfully lists the following goals of the .NET GC (emphasis mine):
The GC strives to manage memory extremely efficiently and require very little effort from people who write managed code. Efficient means:
- GCs should occur often enough to avoid the managed heap containing a significant amount (by ratio or absolute count) of unused but allocated objects (garbage), and therefore use memory unnecessarily.
- GCs should happen as infrequently as possible to avoid using otherwise useful CPU time, even though frequent GCs would result in lower memory usage.
- A GC should be productive. If GC reclaims a small amount of memory, then the GC (including the associated CPU cycles) was wasted.
- Each GC should be fast. Many workloads have low latency requirements.
- Managed code developers shouldn’t need to know much about the GC to achieve good memory utilization (relative to their workload).
- The GC should tune itself to satisfy different memory usage patterns.
So there are some interesting points in there; in particular, they twice included the goal of ensuring developers don’t have to know much about the GC to make it efficient. This is probably one of the main differences between the .NET and Java GC implementations, as explained in an answer to the Stack Overflow question “.Net vs Java Garbage Collector”:
A difference between Oracle’s and Microsoft’s GC implementation ‘ethos’ is one of configurability.
Oracle provides a vast number of options (at the command line) to tweak aspects of the GC or switch it between different modes. Many options are of the -X or -XX form, to indicate their lack of support across different versions or vendors. The CLR by contrast provides next to no configurability; your only real option is the use of the server or client collectors, which optimise for throughput versus latency respectively.
.NET GC Sample
So now we have an idea about what the goals of the GC are, let’s take a look at how it goes about things. Fortunately those nice developers at Microsoft released a GC Sample that shows you, at a basic level, how you can use the full .NET GC in your own code. After building the sample (and finding a few bugs in the process), I was able to get a simple, single-threaded Workstation GC up and running.
What’s interesting about the sample application is that it clearly shows you what actions the .NET Runtime has to perform to make the GC work. So for instance, at a high-level the runtime needs to go through the following process to allocate an object:
AllocateObject(..)
- See below for the code and explanation of the allocation process
CreateGlobalHandle(..)
- If we want to store the object in a “strong handle/reference”, as opposed to a “weak” one. In C# code this would typically be a static variable. This is what tells the GC that the object is referenced, so that it knows it shouldn’t be cleaned up when a GC collection happens.
ErectWriteBarrier(..)
- For more information see “Marking the Card Table” below
Allocating an Object
AllocateObject(..)
code from GCSample.cpp
Object * AllocateObject(MethodTable * pMT)
{
alloc_context * acontext = GetThread()->GetAllocContext();
Object * pObject;
size_t size = pMT->GetBaseSize();
uint8_t* result = acontext->alloc_ptr;
uint8_t* advance = result + size;
if (advance <= acontext->alloc_limit)
{
acontext->alloc_ptr = advance;
pObject = (Object *)result;
}
else
{
pObject = GCHeap::GetGCHeap()->Alloc(acontext, size, 0);
if (pObject == NULL)
return NULL;
}
pObject->RawSetMethodTable(pMT);
return pObject;
}
To understand what’s going on here, the BOTR again comes in handy as it gives us a clear overview of the process, from “Design of Allocator”:
When the GC gives out memory to the allocator, it does so in terms of allocation contexts. The size of an allocation context is defined by the allocation quantum.
- Allocation contexts are smaller regions of a given heap segment that are each dedicated for use by a given thread. On a single-processor (meaning 1 logical processor) machine, a single context is used, which is the generation 0 allocation context.
- The Allocation quantum is the size of memory that the allocator allocates each time it needs more memory, in order to perform object allocations within an allocation context. The allocation quantum is typically 8k and the average size of managed objects is around 35 bytes, enabling a single allocation quantum to be used for many object allocations.
This shows how it is possible for the .NET GC to make allocating an object (or memory) such a cheap operation. Because of all the work that has been done in the background, the majority of the time an object allocation is just a case of incrementing a pointer by the number of bytes needed to hold the new object. This is what the code in the first half of the AllocateObject(..) function (above) is doing: it bumps up the free-space pointer (acontext->alloc_ptr) and gives out a pointer to the newly created space in memory.
It’s only when the current allocation context doesn’t have enough space that things get more complicated and potentially more expensive. At this point GCHeap::GetGCHeap()->Alloc(..) is called, which may in turn trigger a GC collection before a new allocation context can be provided.
Finally, it’s worth looking at the goals that the allocator was designed to achieve, again from the BOTR:
- Triggering a GC when appropriate: The allocator triggers a GC when the allocation budget (a threshold set by the collector) is exceeded or when the allocator can no longer allocate on a given segment. The allocation budget and managed segments are discussed in more detail later.
- Preserving object locality: Objects allocated together on the same heap segment will be stored at virtual addresses close to each other.
- Efficient cache usage: The allocator allocates memory in allocation quantum units, not on an object-by-object basis. It zeroes out that much memory to warm up the CPU cache because there will be objects immediately allocated in that memory. The allocation quantum is usually 8k.
- Efficient locking: The thread affinity of allocation contexts and quantums guarantee that there is only ever a single thread writing to a given allocation quantum. As a result, there is no need to lock for object allocations, as long as the current allocation context is not exhausted.
- Memory integrity: The GC always zeroes out the memory for newly allocated objects to prevent object references pointing at random memory.
- Keeping the heap crawlable: The allocator makes sure to make a free object out of left over memory in each allocation quantum. For example, if there is 30 bytes left in an allocation quantum and the next object is 40 bytes, the allocator will make the 30 bytes a free object and get a new allocation quantum.
One of the interesting items this highlights is an advantage of GC systems, namely that you get efficient CPU cache usage or good object locality because memory is allocated in units. This means that objects created one after the other (on the same thread), will sit next to each other in memory.
Marking the “Card Table”
The 3rd part of the process of allocating an object was a call to ErectWriteBarrier(..), which looks like this:
inline void ErectWriteBarrier(Object ** dst, Object * ref)
{
// if the dst is outside of the heap (unboxed value classes) then we simply exit
if (((uint8_t*)dst < g_lowest_address) || ((uint8_t*)dst >= g_highest_address))
return;
if ((uint8_t*)ref >= g_ephemeral_low && (uint8_t*)ref < g_ephemeral_high)
{
// volatile is used here to prevent fetch of g_card_table from being reordered
// with g_lowest/highest_address check above.
uint8_t* pCardByte = (uint8_t *)*(volatile uint8_t **)(&g_card_table) +
card_byte((uint8_t *)dst);
if(*pCardByte != 0xFF)
*pCardByte = 0xFF;
}
}
Now, explaining what is going on here is probably an entire post on its own, and fortunately other people have already done the work for me; if you are interested in finding out more, take a look at the links at the end of this post.
But in summary, the card-table is an optimisation that allows the GC to collect a single Generation (e.g. Gen 0), but still know about objects that are referenced from other, older generations. For instance, if you had an array myArray = new MyClass[100] that was in Gen 1, and you wrote the code myArray[5] = new MyClass(), a write barrier would be set up to indicate that the MyClass object was referenced by a given section of Gen 1 memory.
Then, when the GC wants to perform the mark phase for Gen 0, it uses the card-table to tell it which memory section(s) of the other Generations it needs to look in to find all the live objects. This way it can find references from those older objects to the ones stored in Gen 0. It’s a space/time tradeoff: each card-table entry represents a 4KB section of memory, so the GC still has to scan through that 4KB chunk, but that’s better than having to scan the entire contents of the Gen 1 memory when it wants to carry out a Gen 0 collection.
If it didn’t do this extra check (via the card-table), then any Gen 0 objects that were only referenced by older objects (i.e. those in Gen 1/2) would not be considered “live” and would then be collected. See the image below for what this looks like in practice:
Image taken from Back To Basics: Generational Garbage Collection
GC and Execution Engine Interaction
The final part of the GC sample that I will be looking at is the way in which the GC interacts with the .NET Runtime Execution Engine (EE). The EE is responsible for actually running or coordinating all the low-level things that the .NET runtime needs to do, such as creating threads and reserving memory, so it acts as an interface to the OS, via Windows and Unix implementations.
To understand this interaction between the GC and the EE, it’s helpful to look at all the functions the GC expects the EE to make available:
void SuspendEE(GCToEEInterface::SUSPEND_REASON reason)
void RestartEE(bool bFinishedGC)
void GcScanRoots(promote_func* fn, int condemned, int max_gen, ScanContext* sc)
void GcStartWork(int condemned, int max_gen)
void AfterGcScanRoots(int condemned, int max_gen, ScanContext* sc)
void GcBeforeBGCSweepWork()
void GcDone(int condemned)
bool RefCountedHandleCallbacks(Object * pObject)
bool IsPreemptiveGCDisabled(Thread * pThread)
void EnablePreemptiveGC(Thread * pThread)
void DisablePreemptiveGC(Thread * pThread)
void SetGCSpecial(Thread * pThread)
alloc_context * GetAllocContext(Thread * pThread)
bool CatchAtSafePoint(Thread * pThread)
void AttachCurrentThread()
void GcEnumAllocContexts (enum_alloc_context_func* fn, void* param)
void SyncBlockCacheWeakPtrScan(HANDLESCANPROC, uintptr_t, uintptr_t)
void SyncBlockCacheDemote(int /*max_gen*/)
void SyncBlockCachePromotionsGranted(int /*max_gen*/)
If you want to see how the .NET Runtime performs these “tasks”, you can take a look at the real implementation. However, in the GC Sample these methods are mostly stubbed out as no-ops. To get an idea of the flow of the GC during a collection, I added a simple print(..) statement to each one; when I then ran the GC Sample I got the following output:
SuspendEE(SUSPEND_REASON = 1)
GcEnumAllocContexts(..)
GcStartWork(condemned = 0, max_gen = 2)
GcScanRoots(condemned = 0, max_gen = 2)
AfterGcScanRoots(condemned = 0, max_gen = 2)
GcScanRoots(condemned = 0, max_gen = 2)
GcDone(condemned = 0)
RestartEE(bFinishedGC = TRUE)
Which fortunately corresponds nicely with the GC phases for WKS GC with concurrent GC off as outlined in the BOTR:
- User thread runs out of allocation budget and triggers a GC.
- GC calls SuspendEE to suspend managed threads.
- GC decides which generation to condemn.
- Mark phase runs.
- Plan phase runs and decides if a compacting GC should be done.
- If so relocate and compact phase runs. Otherwise, sweep phase runs.
- GC calls RestartEE to resume managed threads.
- User thread resumes running.
If you want to find out any more information about Garbage Collectors, here is a list of useful links:
- General
- Marking the Card Table
GC Sample Code Layout (for reference)
GC Sample Code (under \sample)
- GCSample.cpp
- gcenv.h
- gcenv.ee.cpp
- gcenv.windows.cpp
- gcenv.unix.cpp
GC Sample Environment (under \env)
- common.cpp
- common.h
- etmdummy.h
- gcenv.base.h
- gcenv.ee.h
- gcenv.interlocked.h
- gcenv.interlocked.inl
- gcenv.object.h
- gcenv.os.h
- gcenv.structs.h
- gcenv.sync.h
GC Code (top-level folder)
- gc.cpp (36,911 lines long!!)
- gc.h
- gccommon.cpp
- gcdesc.h
- gcee.cpp
- gceewks.cpp
- gcimpl.h
- gcrecord.h
- gcscan.cpp
- gcscan.h
- gcsvr.cpp
- gcwks.cpp
- handletable.h
- handletable.inl
- handletablecache.cpp
- handletablecore.cpp
- handletablepriv.h
- handletablescan.cpp
- objecthandle.cpp
- objecthandle.h
The post Learning How Garbage Collectors Work - Part 1 first appeared on my blog Performance is a Feature!
CodeProject
Thu, 4 Feb 2016, 12:00 am
Open Source .NET – 1 year later - Now with ASP.NET
In the previous post I looked at the community involvement in the year since Microsoft open-sourced large parts of the .NET framework.
As a follow-up I’m going to repeat that analysis, but this time focussing on the repositories that sit under the ASP.NET umbrella project:
- MVC - Model view controller framework for building dynamic web sites with clean separation of concerns, including the merged MVC, Web API, and Web Pages w/ Razor.
- DNX - The DNX (a .NET Execution Environment) contains the code required to bootstrap and run an application, including the compilation system, SDK tools, and the native CLR hosts.
- EntityFramework - Microsoft’s recommended data access technology for new applications in .NET.
- KestrelHttpServer - A web server for ASP.NET 5 based on libuv.
Methodology
In the first part I classified the Issues/PRs as Owner, Collaborator or Community. However, this turned out to have some problems, as was pointed out to me in the comments. There are several people who are not Microsoft employees, but have been made “Collaborators” due to their extensive contributions to a particular repository, for instance @kangaroo and @benpye.
To address this, I decided to change to just the following 2 categories: “Microsoft” and “Community”.
This is possible because (almost) all Microsoft employees have indicated where they work on their GitHub profile, for instance:
There are some notable exceptions, e.g. @shanselman clearly works at Microsoft, but it’s easy enough to allow for cases like this.
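The classification rule can be sketched as follows; the company-string check and the manual override list are my assumptions about how the methodology would be coded up, not code from the original analysis:

```csharp
using System;
using System.Linq;

static class Classifier
{
    // Manual overrides for profiles that don't state the employer (hypothetical list)
    static readonly string[] KnownMicrosoftLogins = { "shanselman" };

    public static string Classify(string login, string company)
    {
        if (KnownMicrosoftLogins.Contains(login))
            return "Microsoft";
        // (almost) all Microsoft employees indicate where they work on their profile
        if (company != null &&
            company.IndexOf("Microsoft", StringComparison.OrdinalIgnoreCase) >= 0)
            return "Microsoft";
        return "Community";
    }
}
```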
Results
So after all this analysis, what results did I get? Well, overall the Community involvement accounts for just over 60% of the “Issues Created” and 33% of the “Merged Pull Requests (PRs)”. However, the number of PRs is skewed by Entity Framework, which has a much higher involvement from Microsoft employees; if it is ignored, the Community proportion of PRs increases to 44%.
Issues Created (Nov 2013 - Dec 2015)
| Project | Microsoft | Community | Total |
|---------|-----------|-----------|-------|
| aspnet/MVC | 716 | 1380 | 2096 |
| aspnet/dnx | 897 | 1206 | 2103 |
| aspnet/EntityFramework | 1066 | 1427 | 2493 |
| aspnet/KestrelHttpServer | 89 | 176 | 265 |
| Total | 2768 | 4189 | 6957 |
Merged Pull Requests (Nov 2013 - Dec 2015)
| Project | Microsoft | Community | Total |
|---------|-----------|-----------|-------|
| aspnet/MVC | 385 | 228 | 613 |
| aspnet/dnx | 406 | 368 | 774 |
| aspnet/EntityFramework | 937 | 225 | 1162 |
| aspnet/KestrelHttpServer | 69 | 88 | 157 |
| Total | 1798 | 909 | 2706 |
Note: I included the Kestrel Http Server because it is an interesting case. Currently the #1 contributor is not a Microsoft employee, it is Ben Adams, who is doing a great job of improving the memory usage and in the process helping Kestrel handle more and more requests per/second.
By looking at the results over time, you can see that there is a clear and sustained Community involvement (the lighter section of the bars) over the past 2 years (Nov 2013 - Dec 2015) and it doesn’t look like it’s going to stop.
Issues Per Month - By Submitter (click for full-size image)
In addition, whilst the Community involvement is easier to see in the Issues per/month, it is still visible in the Merged PRs and again it looks like it has been sustained over the 2 years.
Merged Pull Request Per Month - By Submitter (click for full-size image)
Total Number of People Contributing
It’s also interesting to look at the total number of different people who contributed to each project. By doing this you get a real sense of the size of the Community contribution: it’s not just a small number of people doing a lot of work, it’s spread across a large number of people.
This table shows the number of different GitHub users (per project) who opened an Issue or created a PR that was Merged:
| Project | Microsoft | Community | Total |
|---------|-----------|-----------|-------|
| aspnet/MVC | 39 | 395 | 434 |
| aspnet/dnx | 46 | 421 | 467 |
| aspnet/EntityFramework | 31 | 570 | 601 |
| aspnet/KestrelHttpServer | 22 | 95 | 117 |
| Total | 138 | 1481 | 1619 |
FSharp
In the comments of my first post, Isaac Abraham correctly pointed out:
parts of .NET have been open source for quite a bit more than a year – the F# compiler and FSharp.Core have been for quite a while now.
So, to address this, I will take a quick look at the main FSharp repositories:
As Isaac explained, their relationship is:
… visualfsharp is the Microsoft-owned repo Visual F#. The other is the community owned one. The former one feeds directly into tools like Visual F# tooling in Visual Studio etc.; the latter feeds into things like Xamarin etc. There’s a (slightly out of date) diagram that explains the relationship, and this is another useful resource http://fsharp.github.io/.
FSharp - Issues Created (Dec 2010 - Dec 2015)
| Project | Microsoft | Community | Total |
|---------|-----------|-----------|-------|
| fsharp/fsharp | 9 | 312 | 321 |
| microsoft/visualfsharp | 161 | 367 | 528 |
| Total | 170 | 679 | 849 |
FSharp - Merged Pull Requests (May 2011 - Dec 2015)
| Project | Microsoft | Community | Total |
|---------|-----------|-----------|-------|
| fsharp/fsharp | 27 | 134 | 161 |
| microsoft/visualfsharp | 36 | 33 | 69 |
| Total | 63 | 167 | 230 |
Conclusion
I think that it’s fair to say that the Community has responded to Microsoft making more and more of their code Open Source. There have been a significant number of Community contributions, across several projects, over a decent amount of time. Whilst you could argue that it took Microsoft a long time to open source their code, it seems that .NET developers are happy they have done it, as shown by a sizeable Community response.
The post Open Source .NET – 1 year later - Now with ASP.NET first appeared on my blog Performance is a Feature!
Fri, 15 Jan 2016, 12:00 am
Open Source .NET – 1 year later
A little over a year ago Microsoft announced that they were open sourcing large parts of the .NET framework. At the time Scott Hanselman did a nice analysis of the source, using Microsoft Power BI. Inspired by this and now that a year has passed, I wanted to try and answer the question:
How much Community involvement has there been since Microsoft open sourced large parts of the .NET framework?
I will be looking at the 3 following projects, as they are all highly significant parts of the .NET ecosystem and are also some of the most active/starred/forked projects within the .NET Foundation:
- Roslyn - The .NET Compiler Platform (“Roslyn”) provides open-source C# and Visual Basic compilers with rich code analysis APIs.
- CoreCLR - the .NET Core runtime, called CoreCLR, and the base library, called mscorlib. It includes the garbage collector, JIT compiler, base .NET data types and many low-level classes.
- CoreFX the .NET Core foundational libraries, called CoreFX. It includes classes for collections, file systems, console, XML, async and many others.
Available Data
GitHub itself has some nice graphs built-in, for instance you can see the Commits per Month over an entire year:
Also you can get a nice dashboard showing the Monthly Pulse
However to answer the question above, I needed more data. Fortunately GitHub provides a really comprehensive API, which combined with the excellent Octokit.net library and the brilliant LINQPad, meant I was able to easily get all the data I needed. Here’s a sample LINQPad script if you want to start playing around with the API yourself.
However, knowing the “# of Issues” or “Merged Pull Requests” per/month on its own isn’t that useful; it doesn’t tell us anything about who created the issue or submitted the PR. Fortunately GitHub classifies users into categories, for instance in the image below from Roslyn Issue #670 we can see what type of user posted each comment: an “Owner”, a “Collaborator” or blank, which signifies a “Community” member, i.e. someone who (AFAICT) doesn’t work at Microsoft.
Results
So now that we can get the data we need, what results do we get?
Total Issues - By Submitter
| Project | Owner | Collaborator | Community | Total |
|---------|-------|--------------|-----------|-------|
| Roslyn | 481 | 1867 | 1596 | 3944 |
| CoreCLR | 86 | 298 | 487 | 871 |
| CoreFX | 334 | 911 | 735 | 1980 |
| Total | 901 | 3076 | 2818 | 6795 |
Here you can see that the Owners and Collaborators do in some cases dominate, e.g. in Roslyn where almost 60% of the issues were opened by them. But in other cases the Community is very active, especially in CoreCLR where Community members are opening more issues than Owners/Collaborators combined. Part of the reason for this is the nature of the different repositories, CoreCLR is the most visible part of the .NET framework as it encompasses most of the libraries that .NET developers would use on a day-to-day basis, so it’s not surprising that the Community has lots of suggestions for improvements or bug fixes. In addition, the CoreCLR has been around for a much longer time and so the Community has had more time to use it and find out the parts it doesn’t like. Whereas Roslyn is a much newer project so there has been less time to use it, plus finding bugs in a compiler is by its nature harder to do.
Total Merged Pull Requests - By Submitter
| Project | Owner | Collaborator | Community | Total |
|---------|-------|--------------|-----------|-------|
| Roslyn | 465 | 2093 | 118 | 2676 |
| CoreCLR | 378 | 567 | 201 | 1146 |
| CoreFX | 516 | 1409 | 464 | 2389 |
| Total | 1359 | 4069 | 783 | 6211 |
However if we look at Merged Pull Requests, we can see that the overall amount of Community contributions across the 3 projects is much lower, only accounting for roughly 12%. This however isn’t that surprising; there’s a much higher bar for getting a pull request accepted. Firstly, if the project is using this mechanism, you have to pick an issue that is “up for grabs”, then you have to get any API changes through a review, and finally you have to address any compatibility/performance/correctness issues that come up during the code review itself. So actually 12% is a pretty good result, as there is a non-trivial amount of work involved in getting your PR merged, especially considering most Community members will be working in their spare time.
Update: I was wrong about the “up for grabs” requirement, see this comment from David Kean and this tweet for more information. “Up for grabs” is a guideline and meant to help new users, but it is not a requirement, you can submit PRs for issues that don’t have that label.
Finally if you look at the amount per/month (see the 2 graphs below, click for larger images), it’s hard to pick up any definite trends or say if the Community is definitely contributing more or less over time. But you can say that over a year the Community has consistently contributed and it doesn’t look like that contribution is going to end. It is not just an initial burst that only happened straight after the projects were open sourced, it is a sustained level of contributions over an entire year.
Issues Per Month - By Submitter
Merged Pull Request Per Month - By Submitter
Top 20 Issue Labels
The last thing that I want to do whilst I have the data is to take a look at the most popular Issue Labels and see what they tell us about the type of work that has been going on since the 3 projects were open sourced.
Here are a few observations about the results:
- Having CodeGen so high on the list is not that surprising considering that RyuJIT - the next-gen .NET JIT Compiler - was only released 2 years ago. However, it’s a bit worrying that there were so many issues, especially considering that some of them have severe consequences, as the devs at Stack Overflow found out! (On a related note, if you want to find out lots of low-level details about what the JIT does, just take a look at all the issues that @MikeDN has commented on; unbelievably for someone with that much knowledge, he doesn’t actually work on the product itself, or even at Microsoft!!)
- It’s nice to see that all 3 projects have lots of “Up for Grabs” issues, see Roslyn, CoreCLR and CoreFX, plus the Community seems to be grabbing them back!
- Finally, I love the fact that Performance and Optimisation are being taken seriously, after all Performance is a Feature!!
Discuss on /r/programming and Hacker News
The post Open Source .NET – 1 year later first appeared on my blog Performance is a Feature!
Tue, 8 Dec 2015, 12:00 am
The Stack Overflow Tag Engine – Part 3
This is the part 3 of a mini-series looking at what it might take to build the Stack Overflow Tag Engine, if you haven’t read part 1 or part 2, I recommend reading them first.
Complex boolean queries
One of the most powerful features of the Stack Overflow Tag Engine is that it allows you to do complex boolean queries against multiple Tag, for instance:
A simple way of implementing this is to write code like that below, which makes use of a HashSet to let us efficiently do lookups to see whether a particular question should be included or excluded.
var result = new List<Question>(pageSize);
var andHashSet = new HashSet<int>(queryInfo[tag2]);
foreach (var id in queryInfo[tag1])
{
    if (result.Count >= pageSize)
        break;
    baseQueryCounter++;
    if (questions[id].Tags.Any(t => tagsToExclude.Contains(t)))
    {
        excludedCounter++;
    }
    else if (andHashSet.Remove(id))
    {
        if (itemsSkipped >= skip)
            result.Add(questions[id]);
        else
            itemsSkipped++;
    }
}
The main problem is that we have to scan through all the ids for tag1 until we have enough matches, i.e. foreach (var id in queryInfo[tag1]). In addition, we have to initially load up the HashSet with all the ids for tag2, so that we can check for matches. So this method takes longer as we skip more and more questions, i.e. for larger values of skip or if there are a large number of tagsToExclude (i.e. “Ignored Tags”), see Part 2 for more information.
Bitmaps
So can we do any better? Well yes, there is a fairly established mechanism for doing these types of queries, known as Bitmap indexes. To use these you have to pre-calculate an index in which each bit is set to 1 to indicate a match and 0 otherwise. In our scenario this looks like so:
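Building the (uncompressed) per-tag bitmaps can be sketched like this; the shapes and names are illustrative, not the real Tag Engine code:

```csharp
using System;
using System.Collections.Generic;

static class BitmapIndex
{
    // One bitmap per tag: bit q is set to 1 if question q has that tag
    public static Dictionary<string, byte[]> Build(List<string[]> questionTags)
    {
        int numBytes = (questionTags.Count + 7) / 8;
        var bitmaps = new Dictionary<string, byte[]>();
        for (int q = 0; q < questionTags.Count; q++)
        {
            foreach (var tag in questionTags[q])
            {
                if (!bitmaps.TryGetValue(tag, out var bitmap))
                    bitmaps[tag] = bitmap = new byte[numBytes];
                bitmap[q / 8] |= (byte)(1 << (q % 8));   // set bit q
            }
        }
        return bitmaps;
    }
}
```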
Then it is just a case of doing the relevant bitwise operations against the bits (a byte at a time). For example, if you want to get the questions that have the C# AND Java Tags, you do something like the following (the variable names here are illustrative):
for (int i = 0; i < csharpBitmap.Length; i++)
{
    // AND the two tag bitmaps together, a byte at a time
    result[i] = (byte)(csharpBitmap[i] & javaBitmap[i]);
}
{0000000000000000000000000000000000000000000000000000000000010001}
31 0x00
1 words
[ 0]= 2199023255552, 1 bits set ->
{0000000000000000000000100000000000000000000000000000000000000000}
18 0x01
1 words
[ 0]= 64, 1 bits set ->
{0000000000000000000000000000000000000000000000000000000001000000}
48 0x01
3 words
[ 0]= 1048576, 1 bits set ->
{0000000000000000000000000000000000000000000100000000000000000000}
[ 1]= 9007199254740992, 1 bits set ->
{0000000000100000000000000000000000000000000000000000000000000000}
[ 2]= 9007199304740992, 13 bits set ->
{0000000000100000000000000000000000000010111110101111000010000000}
131 0x00
1 words
[ 0]= 536870912, 1 bits set ->
{0000000000000000000000000000000000100000000000000000000000000000}
....
To give an idea of the space savings that can be achieved, the table below shows the size in bytes for compressed Bitmaps that have varying numbers of individual bits set to 1 (for comparison, uncompressed Bitmaps are 1,000,000 bytes or 0.95MB):
| # Bits Set | Size in Bytes |
|------------|---------------|
| 1 | 24 |
| 10 | 168 |
| 25 | 408 |
| 50 | 808 |
| 100 | 1,608 |
| 200 | 3,208 |
| 400 | 6,408 |
| 800 | 12,808 |
| 1,600 | 25,608 |
| 32,000 | 512,008 |
| 64,000 | 1,000,008 |
| 128,000 | 1,000,008 |
As you can see, it’s not until we get over 64,000 bits (62,016 to be precise) that we match the size of the regular Bitmaps. Note: in these tests I was setting the bits with an evenly spaced distribution across the entire range of 8 million possible bits. The compression is also dependent on which bits are set, so this is a worst case; the more the bits are clumped together (within the same byte), the more it will be compressed.
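To see why clumping matters, here is a toy run-length encoder over the bitmap's bytes. The real compression is a word-aligned hybrid scheme, not this; it's just to illustrate the principle:

```csharp
using System.Collections.Generic;

static class ToyRle
{
    // Collapse runs of identical bytes into (value, length) pairs;
    // long runs of 0x00 are what make sparse bitmaps tiny
    public static List<KeyValuePair<byte, int>> Compress(byte[] bitmap)
    {
        var runs = new List<KeyValuePair<byte, int>>();
        foreach (var b in bitmap)
        {
            if (runs.Count > 0 && runs[runs.Count - 1].Key == b)
                runs[runs.Count - 1] = new KeyValuePair<byte, int>(b, runs[runs.Count - 1].Value + 1);
            else
                runs.Add(new KeyValuePair<byte, int>(b, 1));
        }
        return runs;
    }
}
```

Four bits spread across four different bytes produce 8 runs, whereas the same four bits packed into one byte produce just 2, which mirrors the worst-case behaviour of the evenly spaced distribution in the table above.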
So over the entire Stack Overflow data set of 32,000 Tags, the Bitmaps compress down to an impressive 1.17GB, compared to 149GB uncompressed!
Results
But do queries against compressed Bitmaps actually perform faster than the naive queries using HashSets (see code above)? Well yes they do, and in some cases the difference is significant.
As you can see below, for AND NOT queries they are much faster, especially compared to the worst case, where the regular/naive code takes over 150 ms and the compressed Bitmap code takes ~5 ms (the x-axis is # of excluded/skipped questions and the y-axis is time in milliseconds).
For reference, there are 194,384 questions tagged with .net and 528,490 tagged with jquery.
To ensure I’m being fair, I should point out that the compressed Bitmap queries are slower for OR queries, as shown below. But note the scale: they take ~5 ms compared to ~1-2 ms for the regular queries, so the compressed Bitmap queries are still fast! The nice thing about the compressed Bitmap queries is that they take the same amount of time regardless of how many questions we skip, whereas the regular queries get slower as the # of excluded/skipped questions increases.
If you are interested the results for all the query types are available:
Further Reading
Future Posts
But there’s still more things to implement, in future posts I hope to cover the following:
- Currently my implementation doesn’t play nicely with the Garbage Collector and it does lots of allocations. I will attempt to replicate the “no-allocations” rule that Stack Overflow have after their battle with the .NET GC
In October, we had a situation where a flood of crafted requests were causing high resource utilization on our Tag Engine servers, which is our internal application for associating questions and tags in a high-performance way.
The post The Stack Overflow Tag Engine – Part 3 first appeared on my blog Performance is a Feature!
Thu, 29 Oct 2015, 12:00 am
The Stack Overflow Tag Engine – Part 2
I’ve added a Resources and Speaking page to my site, check them out if you want to learn more. There’s also a video available of my NDC London 2014 talk “Performance is a Feature!”.
Recap of Stack Overflow Tag Engine
This is the long-delayed part 2 of a mini-series looking at what it might take to build the Stack Overflow Tag Engine, if you haven’t read part 1, I recommend reading it first.
Since the first part was published, Stack Overflow published a nice performance report, giving some more stats on the Tag Engine Servers. As you can see they run the Tag Engine on some pretty powerful servers, but only have a peak CPU usage of 10%, which means there’s plenty of overhead available. It’s a nice way of being able to cope with surges in demand or busy times of the day.
Ignored Tag Preferences
In part 1, I only really covered the simple things, i.e. a basic search for all the questions that contain a given tag, along with multiple sort orders (by score, view count, etc). But the real Tag Engine does much more than that, for instance:
What is he talking about here? Well any time you do a tag search, after the actual search has been done per-user exclusions can then be applied. These exclusions are configurable and allow you to set “Ignored Tags”, i.e. tags that you don’t want to see questions for. Then when you do a search, it will exclude these questions from the results.
Note: it will let you know if there were questions excluded due to your preferences, which is a pretty nice user-experience. If that happens, you get this message: (it can also be configured so that matching questions are greyed out instead):
Now most people probably have just a few exclusions and maybe 10’s at most, but fortunately @leppie a Stack Overflow power-user got in touch with me and shared his list of preferences.
You’ll need to scroll across to appreciate the full extent of this list, but here are some statistics to help you:
- It contains 3,753 items, of which 210 are wildcards (e.g. cocoa* or *hibernate*)
- The tags and wildcards expand to 7,677 tags in total (out of a possible 30,529 tags)
- There are 6,428,251 questions (out of 7,990,787) that have at least one of the 7,677 tags in them!
Wildcards
If you want to see the wildcard expansion in action you can visit the url’s below:
- *java*
- [facebook-javascript-sdk] [java] [java.util.scanner] [java-7] [java-8] [javabeans] [javac] [javadoc] [java-ee] [java-ee-6] [javafx] [javafx-2] [javafx-8] [java-io] [javamail] [java-me] [javascript] [javascript-events] [javascript-objects] [java-web-start]
- .net*
- [.net] [.net-1.0] [.net-1.1] [.net-2.0] [.net-3.0] [.net-3.5] [.net-4.0] [.net-4.5] [.net-4.5.2] [.net-4.6] [.net-assembly] [.net-cf-3.5] [.net-client-profile] [.net-core] [.net-framework-version] [.net-micro-framework] [.net-reflector] [.net-remoting] [.net-security] [.nettiers]
Now a simple way of doing these matches is the following, i.e. loop through the wildcards and compare each one with every single tag to see if it could be expanded to match that tag (IsActualMatch(..) is a simple method that does a basic string StartsWith, EndsWith or Contains as appropriate):
var expandedTags = new HashSet<string>();
foreach (var tagToExpand in wildcardsToExpand)
{
    if (IsWildCard(tagToExpand))
    {
        var rawTagPattern = tagToExpand.Replace("*", "");
        foreach (var tag in allTags.Keys)
        {
            if (IsActualMatch(tag, tagToExpand, rawTagPattern))
                expandedTags.Add(tag);
        }
    }
    else if (allTags.ContainsKey(tagToExpand))
    {
        expandedTags.Add(tagToExpand);
    }
}
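The two helper methods aren't shown in the post; given the description above, a plausible implementation might look like this (my sketch, not the original code):

```csharp
using System;

static class WildcardHelpers
{
    public static bool IsWildCard(string tag) =>
        tag.StartsWith("*") || tag.EndsWith("*");

    // rawTagPattern is the wildcard with the '*'s stripped off
    public static bool IsActualMatch(string tag, string wildcard, string rawTagPattern)
    {
        if (wildcard.StartsWith("*") && wildcard.EndsWith("*"))
            return tag.Contains(rawTagPattern);     // e.g. *hibernate*
        if (wildcard.EndsWith("*"))
            return tag.StartsWith(rawTagPattern);   // e.g. cocoa*
        return tag.EndsWith(rawTagPattern);         // e.g. *sdk
    }
}
```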
This works fine with a few wildcards, but it’s not very efficient. Even on a relatively small data-set containing 32,000 tags, it’s slow when expanding the 210 wildcards, taking over a second. After chatting to a few of the Stack Overflow developers on Twitter, they consider a Tag Engine query that takes longer than 500 milliseconds to be slow, so a second just to apply the wildcards is unacceptable.
Trigram Index
So can we do any better? Well it turns out that there is a really nice technique for doing Regular Expression Matching with a Trigram Index, which was used in Google Code Search. I’m not going to explain all the details, the linked page has a very readable explanation. But basically what you do is create an inverted index of the tags and search the index instead. That way you aren’t affected so much by the number of wildcards, because you are only searching via an index rather than doing a full scan over the whole list of tags.
For instance when using Trigrams, the tags are initially split into 3 letter chunks, for instance the expansion for the tag javascript is shown below (‘_’ is added to denote the start/end of a word):
_ja, jav, ava, vas, asc, scr, cri, rip, ipt, pt_
Next you create an index of all the tags as trigrams and include the position of the tag they came from, so that you can reference back to it later:
- _ja -> { 0, 5, 6 }
- jav -> { 0, 5, 12 }
- ava -> { 0, 5, 6 }
- va_ -> { 0, 5, 11, 13 }
- _ne -> { 1, 10, 12 }
- net -> { 1, 10, 12, 15 }
- …
For example, if you want to match any tags that contain java anywhere in the tag, i.e. a *java* wildcard query, you fetch the index values for jav and ava, which gives you (from above) these 2 matching index items:
- jav -> { 0, 5, 12 }
- ava -> { 0, 5, 6 }
and you now know that the tags with index 0 and 5 are the only matches, because they are the only ones that have both jav and ava (6 and 12 don’t have both).
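Putting those steps into code, a minimal trigram index might look like this. It's my sketch, not the Tag Engine's code; note that trigram hits are only candidates, so each one is confirmed with a real Contains check:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TrigramIndex
{
    // '_' marks the start/end of a tag, as in the expansion shown above
    public static Dictionary<string, List<int>> Build(string[] tags)
    {
        var index = new Dictionary<string, List<int>>();
        for (int i = 0; i < tags.Length; i++)
        {
            var padded = "_" + tags[i] + "_";
            for (int j = 0; j + 3 <= padded.Length; j++)
            {
                var trigram = padded.Substring(j, 3);
                if (!index.TryGetValue(trigram, out var ids))
                    index[trigram] = ids = new List<int>();
                if (ids.Count == 0 || ids[ids.Count - 1] != i)
                    ids.Add(i);   // posting lists stay sorted and de-duplicated
            }
        }
        return index;
    }

    // A *pattern* (Contains) query: intersect the posting lists of the
    // pattern's trigrams, then confirm each candidate with a real Contains
    public static List<string> MatchContains(string[] tags,
        Dictionary<string, List<int>> index, string pattern)
    {
        IEnumerable<int> candidates = null;
        for (int j = 0; j + 3 <= pattern.Length; j++)
        {
            if (!index.TryGetValue(pattern.Substring(j, 3), out var ids))
                return new List<string>();
            candidates = candidates == null ? ids : candidates.Intersect(ids);
        }
        return candidates.Where(i => tags[i].Contains(pattern))
                         .Select(i => tags[i]).ToList();
    }
}
```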
Results
On my laptop I get the results shown below, where Contains is the naive way shown above and Regex is an attempt to make it faster by using compiled Regex queries (which was actually slower):
Expanded to 7,677 tags (Contains), took 721.51 ms
Expanded to 7,677 tags (Regex), took 1,218.69 ms
Expanded to 7,677 tags (Trigrams), took 54.21 ms
As you can see, the inverted index using Trigrams is a clear winner. If you are interested, the source code is available on GitHub.
In this post I showed one way that the Tag Engine could implement wildcards matching. As I don’t work at Stack Overflow there’s no way of knowing if they use the same method or not, but at the very least my method is pretty quick!
Future Posts
But there’s still more things to implement, in future posts I hope to cover the following:
In October, we had a situation where a flood of crafted requests were causing high resource utilization on our Tag Engine servers, which is our internal application for associating questions and tags in a high-performance way.
The post The Stack Overflow Tag Engine – Part 2 first appeared on my blog Performance is a Feature!
Wed, 19 Aug 2015, 12:00 am
The Stack Overflow Tag Engine – Part 1
I’ve added a Resources and Speaking page to my site, check them out if you want to learn more.
Stack Overflow Tag Engine
I first heard about the Stack Overflow Tag engine of doom when I read about their battle with the .NET Garbage Collector. If you haven’t heard of it before I recommend reading the previous links and then this interesting case-study on technical debt.
But if you’ve ever visited Stack Overflow you will have used it, maybe without even realising. It powers the pages under stackoverflow.com/questions/tagged
, for instance you can find the questions tagged .NET, C# or Java and you get a page like this (note the related tags down the right-hand side):
Tag API
As well as simple searches, you can also tailor the results with more complex queries (you may need to be logged into the site for these links to work), so you can search for:
It’s worth noting that all these searches take your personal preferences into account. So if you have asked to have any tags excluded, questions containing these tags are filtered out. You can see your preferences by going to your account page and clicking on Preferences, the Ignored Tags are then listed at the bottom of the page. Apparently some power-users on the site have 100’s of ignored tags, so dealing with these is a non-trivial problem.
Publicly available Question Data set
As I said, I wanted to see what was involved in building a version of the Tag Engine. Fortunately, data from all the Stack Exchange sites is available to download. To keep things simple I just worked with the posts (not their entire history of edits), so I downloaded stackoverflow.com-Posts.7z (warning: direct link to a 5.7 GB file), which appears to contain data up to the middle of September 2014. To give an idea of what is in the data set, a typical question looks like the .xml below. For the Tag Engine we only need the items highlighted in red, because it is only providing an index into the actual questions themselves, so we ignore any content and just look at the meta-data.
Below is the output of the code that runs on start-up and processes the data. You can see there are just over 7.9 million questions in the data set, taking up just over 2GB of memory when read into a List<Question>.
Took 00:00:31.623 to DE-serialise 7,990,787 Stack Overflow Questions, used 2136.50 MB
Took 00:01:14.229 (74,229 ms) to group all the tags, used 2799.32 MB
Took 00:00:34.148 (34,148 ms) to create all the "related" tags info, used 362.57 MB
Took 00:01:31.662 (91,662 ms) to sort the 191,025 arrays
After SETUP - Using 4536.21 MB of memory in total
So it takes roughly 31 seconds to de-serialise the data from disk (yay protobuf-net!) and another 3 1/2 minutes to process and sort it. At the end we are using roughly 4.5GB of memory.
Max LastActivityDate 14/09/2014 03:07:29
Min LastActivityDate 18/08/2008 03:34:29
Max CreationDate 14/09/2014 03:06:45
Min CreationDate 31/07/2008 21:42:52
Max Score 8596 (Id 11227809)
Min Score -147
Max ViewCount 1917888 (Id 184618)
Min ViewCount 1
Max AnswerCount 518 (Id 184618)
Min AnswerCount 0
Yes that’s right, there is actually a Stack Overflow question with 1.9 million views; not surprisingly it’s locked for editing, but it’s also considered “not constructive”! The same question also has 518 answers, the most of any on the site, and if you’re wondering, the question with the highest score has an impressive 8192 votes and is titled Why is processing a sorted array faster than an unsorted array?
Creating an Index
So what does the index actually look like? Well, it’s basically a series of sorted lists (List<int>) that contain offsets into the main List<Question> that holds all the Question data. Or in a diagram, something like this:
Note: This is very similar to the way that Lucene indexes data.
It turns out that the code to do this isn’t that complex:
// start with a copy of the main array, with Id's in order, { 0, 1, 2, 3, 4, 5, ..... }
tagsByLastActivityDate = new Dictionary<string, int[]>(groupedTags.Count);
var byLastActivityDate = tag.Value.Positions.ToArray();
Array.Sort(byLastActivityDate, comparer.LastActivityDate);
Where the comparer is as simple as the following (note that it is sorting the byLastActivityDate array, using the values in the questions array to determine the sort order):
public int LastActivityDate(int x, int y)
{
if (questions[y].LastActivityDate == questions[x].LastActivityDate)
return CompareId(x, y);
// Compare LastActivityDate DESCENDING, i.e. most recent is first
return questions[y].LastActivityDate.CompareTo(questions[x].LastActivityDate);
}
So once we’ve created the sorted lists on the left and right of the diagram above (Last Edited and Score), we can just traverse them in order to get the indexes of the Questions. For instance, if we walk through the Score array in order (1, 2, .., 7, 8), collecting the Id’s as we go, we end up with { 8, 4, 3, 5, 6, 1, 2, 7 }, which are the array indexes for the corresponding Questions. The code to do this is the following, taking account of the pageSize and skip values:
var result = queryInfo[tag]
.Skip(skip)
.Take(pageSize)
.Select(i => questions[i])
.ToList();
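Putting the pieces together, here's a tiny end-to-end version of the idea, using plain scores in place of the full Question objects (the data and names are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class IndexDemo
{
    // Build the "sorted index" (positions into the questions list, sorted by
    // Score DESCENDING) and return one page of it, as in the diagram above
    public static List<int> TopScorePage(int[] scores, int skip, int pageSize)
    {
        // start with { 0, 1, 2, ... }, i.e. the positions in their natural order
        var byScore = Enumerable.Range(0, scores.Length).ToArray();
        // sort the positions using the values from the main array
        Array.Sort(byScore, (x, y) => scores[y].CompareTo(scores[x]));
        // paging is then just a walk over the sorted index
        return byScore.Skip(skip).Take(pageSize).ToList();
    }
}
```

For example, `TopScorePage(new[] { 5, 9, 1, 7 }, skip: 1, pageSize: 2)` walks past the top-scored position and returns the next two positions, which can then be used to look up the actual questions.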
Once that’s all done, I ended up with an API that you can query in the browser. Note that the timing is the time taken on the server-side, but it is correct, basic queries against a single tag are lightning quick!
Next time
Now that the basic index is setup, next time I’ll be looking at how to handle:
- Complex boolean queries
.net or jquery- and c#
- Power users who have 100’s of excluded tags
and anything else that I come up with in the meantime.
The post The Stack Overflow Tag Engine – Part 1 first appeared on my blog Performance is a Feature!
Sat, 1 Nov 2014, 12:00 am
The Art of Benchmarking (Updated 2014-09-23)
tl;dr
Benchmarking is hard; it’s very easy to end up “not measuring what you think you are measuring”
Update (2014-09-23): Sigh - I made a pretty big mistake in these benchmarks, fortunately Reddit user zvrba corrected me:
Yep, can’t argue with that, see Results and Resources below for the individual updates.
Intro to Benchmarks
To start with, let’s clarify what types of benchmarks we are talking about. Below is a table from the DEVOXX talk by Aleksey Shipilev, who works on the Java Micro-benchmarking Harness (JMH)
- kilo: > 1000 s, Linpack
- ????: 1…1000 s, SPECjvm2008, SPECjbb2013
- milli: 1…1000 ms, SPECjvm98, SPECjbb2005
- micro: 1…1000 us, single webapp request
- nano: 1…1000 ns, single operations
- pico: 1…1000 ps, pipelining
He then goes on to say:
- Millibenchmarks are not really hard
- Microbenchmarks are challenging, but OK
- Nanobenchmarks are the damned beasts!
- Picobenchmarks…
This post is talking about micro and nano benchmarks, that is ones where the code we are measuring takes microseconds or nanoseconds to execute.
First attempt
Let’s start with a nice example available from Stack Overflow:
static void Profile(string description, int iterations, Action func)
{
    // clean up
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
    // warm up
    func();
    var watch = Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
    {
        func(); // ... code being profiled
    }
    watch.Stop();
    Console.WriteLine("{0} Time Elapsed {1} ms", description, watch.ElapsedMilliseconds);
}
Now there are a lot of good things that this code sample is doing:
- Eliminating the overhead of the .NET GC (as much as possible), by making sure it has run before the timing takes place
- Calling the function that is being profiled, outside the timing loop, so that the overhead of the .NET JIT Compiler isn’t included in the benchmark itself. The first time a function is called the JITter steps in and converts the code from IL into machine code, so that it can actually be executed by the CPU.
- Using Stopwatch rather than DateTime.Now; Stopwatch is a high-precision timer with low overhead, DateTime.Now isn’t!
- Running a lot of iterations of the code (100,000’s), to give an accurate measurement
Now far be it from me to criticise a highly voted Stack Overflow answer, but that’s exactly what I’m going to do! I should add that for a whole range of scenarios the Stack Overflow code is absolutely fine, but it does have its limitations. There are several situations where this code doesn’t work, because it fails to actually profile the code you want it to.
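One classic way a benchmarking loop can end up “not measuring what you think you are measuring” (this is my general illustration of the problem, not necessarily the exact mistake discussed in the Update above): if the computed result is never consumed, the JIT is free to optimise the work away, so you end up timing an (almost) empty loop. A defensive sketch:

```csharp
using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        const int iterations = 1000000;

        // BAD: the result of Math.Sqrt is discarded, so the JIT may
        // eliminate the call entirely and we time an empty loop
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            Math.Sqrt(123.456);
        }
        watch.Stop();
        Console.WriteLine("Discarded result: {0} ms", watch.ElapsedMilliseconds);

        // BETTER: accumulate and print the result, so the work
        // cannot be removed as dead code
        double sum = 0;
        watch.Restart();
        for (int i = 0; i < iterations; i++)
        {
            sum += Math.Sqrt(123.456);
        }
        watch.Stop();
        Console.WriteLine("Consumed result: {0} ms (sum = {1})",
            watch.ElapsedMilliseconds, sum);
    }
}
```

Always returning or printing the accumulated result is a cheap way to keep the optimiser honest.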
Baseline benchmark
But first let’s take a step back and look at the simplest possible case, with all the code inside the function. We’re going to measure the time that Math.Sqrt(..) takes to execute, nice and simple:
static void ProfileDirect(string description, int iterations)
{
// clean up
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
// warm up
Math.Sqrt(123.456);
var watch = new Stopwatch();
watch.Start();
for (int i = 0; i < iterations; i++)
{
    Math.Sqrt(123.456);
}
watch.Stop();
Console.WriteLine("{0} Time Elapsed {1} ms", description, watch.ElapsedMilliseconds);
}
Fri, 19 Sep 2014, 12:00 am
Stack Overflow - performance lessons (part 2)
In Part 1 I looked at some of the more general performance issues that can be learnt from Stack Overflow (the team/product), in Part 2 I’m looking at some of the examples of coding performance lessons.
Please don’t take these blog posts as blanket recommendations of techniques that you should go away and apply to your code base. They are specific optimisations that you can use if you want to squeeze every last drop of performance out of your CPU.
Also, don’t optimise anything unless you have measured and profiled first; otherwise you will probably optimise the wrong thing!
Battles with the .NET Garbage Collector
I first learnt about the performance work done at Stack Overflow (the site/company) when I read the post on their battles with the .NET Garbage Collector (GC). If you haven’t read it, the short summary is that they were experiencing page load times that would suddenly spike to 100’s of msecs, compared to the normal sub-10 msecs they were used to. After investigating for a few days they narrowed the problem down to the behaviour of the GC. GC pauses are a real issue and even the new modes available in .NET 4.5 don’t fully eliminate them, see my previous investigation for more information.
One thing to remember is that to make this all happen, they needed the following items in place:
- Monitoring in production - these issues would only show up under load, once the application had been running for a while, so they would be very hard to recreate in staging or during development.
- Multiple measurements - they recorded both ASP.NET and IIS web server response times and were able to cross-reference them (see image below).
- Storing outliers - these spikes rarely happened so having detailed metrics was needed, averages hide too much information.
- Good knowledge of the .NET GC - according to the article, it took them 3 weeks to identify and fix this issue “So Marc and I set off on a 3 week adventure to resolve the memory pressure.”
You can read all the gory details of the fix and the follow-up in the posts below, but the tl;dr is that they removed all the work that the .NET Garbage Collector had to do, thus eliminating the pauses:
Jil - A fast JSON (de)serializer, with a number of somewhat crazy optimization tricks.
But if you think that the struct-based code they wrote is crazy, their JSON serialisation library, Jil, takes things to a new level. This is all in the pursuit of maximum performance and, based on their benchmarks, it seems to be working!
Note: protobuf-net is a binary serialisation library that doesn’t support JSON; it’s only included as a baseline:
For instance, instead of writing code like this
public T Serialise<T>(string json, bool isJSONP)
{
if (isJSONP)
{
// code to handle JSONP
}
else
{
// code to handle regular JSON
}
}
They write code like this, which is a classic memory/speed trade-off.
public ISerialiser GetSerialiser(bool isJSONP)
{
if (isJSONP)
return new SerialiserWithJSONP();
else
return new Serialiser();
}
public class SerialiserWithJSONP : ISerialiser
{
private T Serialise<T>(string json)
{
// code to handle JSONP
}
}
public class Serialiser : ISerialiser
{
private T Serialise<T>(string json)
{
// code to handle regular JSON
}
}
This means that during serialisation there doesn’t need to be any “feature switches”; they just emit the different versions of the code at creation time and, based on the options you specify, hand you the correct one. Of course the classes (SerialiserWithJSONP and Serialiser in this case) are dynamically created just once and then cached for later re-use, so the cost of the dynamic code generation is only paid once.
By doing this the code plays nicely with CPU branch prediction, because it has a nice predictable pattern that the CPU can easily work with. It also has the added benefit of making the methods smaller, which may make them candidates for in-lining by the .NET JITter.
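The “create once per option-set, cache, and hand out” pattern described above can be sketched like this (a simplified stand-in: the two hand-written classes and the cache key are my illustration, whereas Jil’s real implementation emits the IL for each variant dynamically):

```csharp
using System;
using System.Collections.Concurrent;

public interface ISerialiser
{
    string Serialise(object value);
}

public class SerialiserWithJSONP : ISerialiser
{
    // stand-in for the emitted JSONP variant of the code
    public string Serialise(object value) => "/**/callback(" + value + ")";
}

public class PlainSerialiser : ISerialiser
{
    // stand-in for the emitted plain-JSON variant
    public string Serialise(object value) => value.ToString();
}

public static class SerialiserCache
{
    // each variant is created once per option-set and then re-used,
    // so the (in the real library, expensive) creation cost is paid once
    static readonly ConcurrentDictionary<bool, ISerialiser> cache =
        new ConcurrentDictionary<bool, ISerialiser>();

    public static ISerialiser Get(bool isJSONP) =>
        cache.GetOrAdd(isJSONP, jsonp => jsonp
            ? (ISerialiser)new SerialiserWithJSONP()
            : new PlainSerialiser());
}

public class Program
{
    public static void Main()
    {
        var a = SerialiserCache.Get(true);
        var b = SerialiserCache.Get(true);
        Console.WriteLine(ReferenceEquals(a, b)); // same cached instance
        Console.WriteLine(SerialiserCache.Get(false).Serialise(42));
        Console.WriteLine(a.Serialise(42));
    }
}
```

Once the correct serialiser is handed out, its hot path contains no `if (isJSONP)` checks at all, which is exactly what keeps the branch predictor happy.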
For more examples of the optimisations used, see the links below:
- Jil - Marginal Gains
On top of this, they measure everything to ensure that the optimisations actually work! These tests are all run as unit-tests, allowing easy generation of the results; take a look at ReorderMembers for instance.
Note: All the times are in milliseconds, but timed over 1000’s of runs, not per call.
| Feature name | Original | Improved | Difference |
|---|---|---|---|
| ReorderMembers | 2721 | 2712 | 9 |
| SkipNumberFormatting | 166 | 163 | 3 |
| UseCustomIntegerToString | 589 | 339 | 250 |
| SkipDateTimeMathMethods | 108 | 100 | 8 |
| UseCustomISODateFormatting | 399 | 269 | 130 |
| UseFastLists | 277 | 267 | 10 |
| UseFastArrays | 486 | 469 | 17 |
| UseFastGuids | 744 | 304 | 440 |
| AllocationlessDictionaries | 134 | 127 | 7 |
| PropagateConstants | 77 | 35 | 42 |
| AlwaysUseCharBufferForStrings | 63 | 56 | 7 |
| UseHashWhenMatchingMembers | 141 | 131 | 10 |
| DynamicDeserializer_UseFastNumberParsing | 94 | 51 | 43 |
| DynamicDeserializer_UseFastIntegerConversion | 131 | 131 | 2 |
| UseHashWhenMatchingEnums | 38 | 10 | 28 |
| UseCustomWriteIntUnrolledSigned | 2182 | 1765 | 417 |
This is very similar to the “Marginal Gains” approach that worked so well for British Cycling in the last Olympics:
There’s fitness and conditioning, of course, but there are other things that might seem on the periphery, like sleeping in the right position, having the same pillow when you are away and training in different places.
Do you really know how to clean your hands? Without leaving the bits between your fingers? If you do things like that properly, you will get ill a little bit less.
“They’re tiny things but if you clump them together it makes a big difference.”
Summary
All-in-all there is a lot to be learnt from the code and blog posts that have come from Stack Overflow developers; I’m glad they’ve shared everything so openly. Also, having such a high-profile website running on .NET stops the argument that .NET is inherently slow.
The post Stack Overflow - performance lessons (part 2) first appeared on my blog Performance is a Feature!
CodeProject
Fri, 5 Sep 2014, 12:00 am
Stack Overflow - performance lessons (part 1)
This post is part of a semi-regular series, you can find the other entries here and here
Before diving into any of the technical or coding aspects of performance, it is really important to understand that the main lesson to take-away from Stack Overflow (the team/product) is that they take performance seriously. You can see this from the blog post that Jeff Atwood wrote, it’s a part of their culture and has been from the beginning:
But anyone can come up with a catchy line like “Performance is a Feature!!”, it only means something if you actually carry it out. Well it’s clear that Stack Overflow have done just this, not only is it a Top 100 website, but they’ve done the whole thing with very few servers and several of those are running at only 15% of their capacity, so they can scale up if needed and/or deal with large traffic bursts.
Update (2/9/2014 9:25:35 AM): Nick Craver tweeted me to say that the High Scalability post is a bad summarisation (apparently they have got things wrong before), so take what it says with a grain of salt!
Aside: If you want even more information about their set-up, I definitely recommend reading the Hacker News discussion and this post from Nick Craver, one of the Stack Overflow developers.
Interestingly they have gone for scale-up rather than scale-out, building their own servers instead of using cloud hosting. The reason for this? Simply to get better performance!
Why do I choose to build and colocate servers? Primarily to achieve maximum performance. That’s the one thing you consistently just do not get from cloud hosting solutions unless you are willing to pay a massive premium, per month, forever: raw, unbridled performance….
It’s also worth noting that they are even prepared to sacrifice the ability to unit test their code, because it gives them better performance.
- Garbage collection driven programming. SO goes to great lengths to reduce garbage collection costs, skipping practices like TDD, avoiding layers of abstraction, and using static methods. While extreme, the result is highly performing code. When you’re doing hundreds of millions of objects in a short window, you can actually measure pauses in the app domain while GC runs. These have a pretty decent impact on request performance.
Now, this isn’t for everyone and even suggesting that unit testing isn’t needed or useful tends to produce strong reactions. But you can see that they are making an informed trade-off and they are prepared to go against the conventional wisdom (“write code that is unit-testing friendly”), because it gives them the extra performance they want. One caveat is that they are in a fairly unique position: they have passionate users that are willing to act as beta-testers, so having fewer unit tests might not harm them; not everyone has that option!
- To get around garbage collection problems, only one copy of a class used in templates are created and kept in a cache. Everything is measured, including GC operation, from statistics it is known that layers of indirection increase GC pressure to the point of noticeable slowness.
For a more detailed discussion on why this approach to coding can make a difference to GC pressure, see here and here.
Sharing and doing everything out in the open
Another non-technical lesson is that Stack Overflow are committed to doing things out in the open and sharing what they create as code or lessons-learnt blog posts. Their list of open source projects includes:
- MiniProfiler - which gives developers an overview of where the time is being spent when a page renders (front-end, back-end, database, etc)
- Dapper - developed because Entity Framework imposed too large an overhead when materialising the results of a SQL query into POCO’s
- Jil - a newly released JSON serialisation library, developed so that they can get the best possible performance. JSON parsing and serialisation must be a very common operation across their web-servers, so shaving off microseconds from the existing libraries is justified.
- TagServer - a custom .NET service that was written to make the complex tag searches quicker than they would be if done directly in SQL Server.
- Opserver - fully featured monitoring tool, giving their operation engineers a deep-insight into what their servers are doing in production.
All these examples show that they are not afraid to write their own tools when the existing ones aren’t up to scratch, don’t have the features they need or don’t give the performance they require.
Measure, profile and display
As shown by the development of Opserver, they care about measuring performance accurately, even (or especially) in production. Take a look at the images below and you can see not only the detailed level of information they keep, but how it is displayed in a way that makes it easy to see what is going on (there are also more screenshots available).
Finally, I really like their guidelines for achieving good observability in a production system. They serve as a really good check-list of things you need to do if you want to have any chance of knowing what your system is up to in production. I would imagine these steps, and the resulting screens they designed into Opserver, have been built up over several years of monitoring and fixing issues on the Stack Overflow sites, so they are battle-hardened!
5 Steps to Achieving Good Observability:
In order to achieve good observability an SRE team (often in conjunction with the rest of the organization) needs to do the following steps.
- Instrument your systems by publishing metrics and events
- Gather those metrics and events in a queryable data store(s)
- Make that data readily accessible
- Highlight metrics that are, or are trending towards abnormal or out of bounds behavior
- Establish the resources to drill down into abnormal or out of bounds behavior
Next time
Next time I’ll look at some concrete examples of performance lessons for the open source projects that SO have set-up, including the crazy tricks they use in Jil, their JSON serialisation library.
The post Stack Overflow - performance lessons (part 1) first appeared on my blog Performance is a Feature!
CodeProject
Mon, 1 Sep 2014, 12:00 am
How to mock sealed classes and static methods
Typemock & JustMock are 2 commercially available mocking tools that let you achieve something that should be impossible. Unlike all other mocking frameworks, they let you mock sealed classes, static and non-virtual methods, but how do they do this?
Dynamic Proxies
Firstly it’s worth covering how regular mocking frameworks work with virtual methods or interfaces. Suppose you have a class you want to mock, like so:
public class TestingMocking
{
public virtual void MockMe()
{
..
}
}
At runtime the framework will generate a mocked class like the one below. As it inherits from TestingMocking, you can use it in place of your original class, but the mocked method will be called instead.
public class DynamicProxy : TestingMocking
{
public override void MockMe()
{
..
}
}
This is achieved using the DynamicMethod class available in System.Reflection.Emit, this blog post contains a nice overview and Bill Wagner has put together a more complete example that gives you a better idea of what is involved. I found that once you discover dynamic code generation is possible, you realise that it is used everywhere, for instance:
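As a minimal illustration of the mechanism (this is my own toy example, not the code a mocking framework would actually generate), here is a DynamicMethod built with System.Reflection.Emit that returns a constant:

```csharp
using System;
using System.Reflection.Emit;

class Demo
{
    static void Main()
    {
        // emit the IL for a method equivalent to:
        //   static int FortyTwo() { return 42; }
        var method = new DynamicMethod("FortyTwo", typeof(int), Type.EmptyTypes);
        var il = method.GetILGenerator();
        il.Emit(OpCodes.Ldc_I4, 42); // push the constant 42 onto the stack
        il.Emit(OpCodes.Ret);        // return it

        // bind the emitted IL to a callable delegate
        var fortyTwo = (Func<int>)method.CreateDelegate(typeof(Func<int>));
        Console.WriteLine(fortyTwo());
    }
}
```

A mocking framework does the same kind of emission, but generates a whole subclass whose overridden methods dispatch to your configured mock behaviour instead of returning a constant.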
BTW if you ever find yourself needing to dynamically emit IL code, I’d recommend using the Sigil library that was created by some of the developers at StackOverflow. It takes away a lot of the pain associated with writing and debugging IL.
However dynamically generated proxies always run into the limitation that you can’t override non-virtual methods, and they can’t do anything with static methods or sealed classes (i.e. classes that can’t be inherited).
.NET Profiling API and JITCompilationStarted() Method
How Typemock and JustMock achieve what they do is hinted at in a StackOverflow answer by a Typemock employee and is also discussed in this blog post. But they only talk about the solution, I wanted to actually write a small proof-of-concept myself, to see what is involved.
To start with the .NET profiling API is what makes this possible, but a word of warning, it is a C++ API and it requires you to write a COM component to be able to interact with it, you can’t work with it from C#. To get started I used the excellent profiler demo project from Shaun Wilde. If you want to learn more about the profiling API and in particular how you can use it to re-write methods, I really recommend looking at this code step-by-step and also reading the accompanying slides.
By using the profiling API and in particular the JITCompilationStarted method, we are able to modify the IL of any method being run by the CLR (user code or the .NET runtime), before the JITer compiles it to machine code and it is executed. This means that we can modify a method that originally looks like this:
public sealed class ClassToMock
{
public static int StaticMethodToMock()
{
Console.WriteLine("StaticMethodToMock called, returning 42");
return 42;
}
}
So that instead it does this:
public sealed class ClassToMock
{
public static int StaticMethodToMock()
{
// Inject the IL to do this instead!!
if (Mocked.ShouldMock("Profilier.ClassToMock.StaticMethodToMock"))
return Mocked.MockedMethod();
Console.WriteLine("StaticMethodToMock called, returning 42");
return 42;
}
}
For reference, the original IL looks like this:
IL_0000 ( 0) nop
IL_0001 ( 1) ldstr (70)00023F //"StaticMethodToMockWhatWeWantToDo called, returning 42"
IL_0006 ( 6) call (06)000006 //call Console.WriteLine(..)
IL_000B (11) nop
IL_000C (12) ldc.i4.s 2A //return 42;
IL_000E (14) stloc.0
IL_000F (15) br IL_0014
IL_0014 (20) ldloc.0
IL_0015 (21) ret
and after code injection, it ends up like this:
IL_0000 ( 0) ldstr (70)000135
IL_0005 ( 5) call (0A)00001B //call ShouldMock(string methodNameAndPath)
IL_000A (10) brfalse.s IL_0012
IL_000C (12) call (0A)00001C //call MockedMethod()
IL_0011 (17) ret
IL_0012 (18) nop
IL_0013 (19) ldstr (70)00023F //"StaticMethodToMockWhatWeWantToDo called, returning 42"
IL_0018 (24) call (06)000006 //call Console.WriteLine(..)
IL_001D (29) nop
IL_001E (30) ldc.i4.s 2A //return 42;
IL_0020 (32) stloc.0
IL_0021 (33) br IL_0026
IL_0026 (38) ldloc.0
IL_0027 (39) ret
And that is the basics of how you can modify any .NET method; it seems relatively simple when you know how! In my simple demo I just add in the relevant IL so that a mocked method is called instead, you can see the C++ code needed to achieve this here. Of course in reality it’s much more complicated; my simple demo only deals with a very simplistic scenario, a static method that returns an int. The commercial products that do this are way more powerful and have to deal with all the issues that you can encounter when you are re-writing code at the IL level, for instance if you aren’t careful you get exceptions like this:
Running the demo code
If you want to run my demo, you need to open the solution file under step5_main_injected_method_object_array and set “ProfilerHost” as the “Start-up Project” (right-click on the project in VS) before you run. When you run it, you should see something like this:
You can see the C# code that controls the mocking below. At the moment the API in the demo is fairly limited, it only lets you turn mocking on/off and set the value that is returned from the mocked method.
static void Main(string[] args)
{
// Without mocking enabled (the default)
Console.WriteLine(new string('#', 90));
Console.WriteLine("Calling ClassToMock.StaticMethodToMock() (a static method in a sealed class)");
var result = ClassToMock.StaticMethodToMock();
Console.WriteLine("Result: " + result);
Console.WriteLine(new string('#', 90) + "\n");
// With mocking enabled, doesn't call the static method, calls mocked version instead
Console.WriteLine(new string('#', 90));
Mocked.SetReturnValue = 1;
Console.WriteLine("Turning ON mocking of Profilier.ClassToMock.StaticMethodToMock");
Mocked.Configure("ProfilerTarget.ClassToMock.StaticMethodToMock", mockMethod: true);
Console.WriteLine("Calling ClassToMock.StaticMethodToMock() (a static method in a sealed class)");
result = ClassToMock.StaticMethodToMock();
Console.WriteLine("Result: " + result);
Console.WriteLine(new string('#', 90) + "\n");
}
Other Uses for IL re-writing
Again once you learn about this mechanism, you realise that it is used in lots of places, for instance
Discuss on /r/csharp
The post How to mock sealed classes and static methods first appeared on my blog Performance is a Feature!
CodeProject
Thu, 14 Aug 2014, 12:00 am
Know thy .NET object memory layout (Updated 2014-09-03)
Apologies to Nitsan Wakart, from whom I shamelessly stole the title of this post!
tl;dr
The .NET port of HdrHistogram can control the field layout within a class, using the same technique that the original Java code does.
Recently I’ve spent some time porting HdrHistogram from Java to .NET, it’s been great to learn a bit more about Java and get a better understanding of some low-level code. In case you’re not familiar with it, the goals of HdrHistogram are to:
- Provide an accurate mechanism for measuring latency at a full-range of percentiles (99.9%, 99.99% etc)
- Minimising the overhead needed to perform the measurements, so as to not impact your application
You can find a full explanation of what it does and how point 1) is achieved in the project readme.
Minimising overhead
But it’s the 2nd of the points that I’m looking at in this post, by answering the question
How does HdrHistogram minimise its overhead?
But first it makes sense to start with the why; well, it turns out it’s pretty simple. HdrHistogram is meant for measuring low-latency applications; if it had a large overhead or caused the GC to do extra work, then it would negatively affect the performance of the application it was meant to be measuring.
Also imagine for a minute that HdrHistogram took 1/10,000th of a second (0.1 milliseconds or 100,000 nanoseconds) to record a value. If this was the case you could only hope to accurately record events lasting down to a millisecond (1/1,000th of a second), anything faster would not be possible as the overhead of recording the measurement would take up too much time.
As it is HdrHistogram is much faster than that, so we don’t have to worry! From the readme:
Measurements show value recording times as low as 3-6 nanoseconds on modern (circa 2012) Intel CPUs.
So how does it achieve this, well it does a few things:
- It doesn't do any memory allocations when storing a value, all allocations are done up front when you create the histogram. Upon creation you have to specify the range of measurements you would like to record and the precision. For instance if you want to record timings covering the range from 1 nanosecond (ns) to 1 hour (3,600,000,000,000 ns), with 3 decimal places of resolution, you would do the following:
Histogram histogram = new Histogram(3600000000000L, 3);
- Uses a few low-level tricks to ensure that storing a value can be done as fast as possible. For instance putting the value in the right bucket (array location) is a constant lookup (no searching required) and on top of that it makes use of some nifty bit-shifting to ensure it happens as fast as possible.
- Implements a slightly strange class-hierarchy to ensure that fields are laid out in the right location. If you look at the source you have AbstractHistogram and then the seemingly redundant class AbstractHistogramBase, why split the fields up like that? Well the comments give it away a little bit, it's due to false-sharing
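The “constant lookup, no searching required” idea in point 2 can be illustrated with a simplified power-of-two bucketing scheme (this is not HdrHistogram’s exact layout, just the principle; the real code also uses a leading-zero-count trick so that even this small loop disappears):

```csharp
using System;

class Buckets
{
    // bucket = position of the highest set bit, found with bit-shifts;
    // values 1, 2-3, 4-7, 8-15, ... land in buckets 0, 1, 2, 3, ...
    // so placing a value needs no search through bucket boundaries
    static int BucketIndex(long value)
    {
        int index = 0;
        while ((value >>= 1) != 0)
            index++;
        return index;
    }

    static void Main()
    {
        Console.WriteLine(BucketIndex(1));    // bucket 0
        Console.WriteLine(BucketIndex(7));    // bucket 2
        Console.WriteLine(BucketIndex(1000)); // bucket 9 (512 <= 1000 < 1024)
    }
}
```

Because the bucket is derived directly from the bit pattern of the value, recording stays allocation-free and takes a handful of nanoseconds, regardless of the range being covered.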
False sharing
Update (2014-09-03): As pointed out by Nitsan in the comments, I got the wrong end of the stick with this entire section. It’s not about false-sharing at all, it’s the opposite, I’ll quote him to make sure I get it right this time!
The effort made in HdrHistogram towards controlling field ordering is not about False Sharing but rather towards ensuring certain fields are more likely to be loaded together as they are clumped together, thus avoiding a potential extra read miss.
So what is false sharing, to find out more I recommend reading Martin Thompson’s excellent post and this equally good one from Nitsan Wakart. But if you’re too lazy to do that, it’s summed up by the image below (from Martin’s post).
Image from the Mechanical Sympathy blog
The problem is that a CPU pulls data into its cache in lines, even if your code only wants to read a single variable/field. If 2 threads are reading from 2 fields (X and Y in the image) that are next to each other in memory, the CPU running a thread will invalidate the cache of the other CPU when it pulls in a line of memory. This invalidation costs time and in high-performance situations can slow down your program.
The opposite is also true, you can gain performance by ensuring that fields you know are accessed in succession are located together in memory. This means that once the first field is pulled into the CPU cache, subsequent accesses will be cheaper as the fields will be “Hot”. It is this scenario HdrHistogram is trying to achieve, but how do you know that fields in a .NET object are located together in memory?
Analysing the memory layout of a .NET Object
To do this you need to drop down into the debugger and use the excellent SOS or Son-of-Strike extension. This is because the .NET JITter is free to reorder fields as it sees fit, so the order you put the fields in your class does not determine the order they end up in. The JITter changes the layout to minimise the space needed for the object and to make sure that fields are aligned on byte boundaries, it does this by packing them in the most efficient way.
To test out the difference between the Histogram with a class-hierarchy and without, the following code was written (you can find HistogramAllInOneClass in this gist):
Histogram testHistogram = new Histogram(3600000000000L, 3);
HistogramAllInOneClass combinedHistogram = new HistogramAllInOneClass();
Debugger.Launch();
GC.KeepAlive(combinedHistogram); // put a breakpoint on this line
GC.KeepAlive(testHistogram);
Then to actually test it, you need to perform the following steps:
- Set the build to Release and x86
- Build the test and then launch your .exe from OUTSIDE Visual Studio (VS), i.e. by double-clicking on it in Windows Explorer. You must not be debugging in VS when it starts up, otherwise the .NET JITter won't perform any optimisations.
- When the "Just-In-Time Debugger" prompt pops up, select the instance of VS that is already opened (not a NEW one)
- Then check "Manually choose the debugging engines." and click "Yes"
- Finally make sure "Managed (...)", "Native" AND "Managed Compatibility Mode" are checked
Once the debugger has connected back to VS, you can type the following commands in the “Immediate Window”:
.load sos
!DumpStackObjects
!DumpObj <ADDRESS>
(where ADDRESS is the value from the "Object" column in Step 2.)
If all that works, you will end up with an output like below:
Update (2014-09-03)
Since first writing this blog post, I came across a really clever technique for getting the offsets of fields in code, something that I initially thought was impossible. The full code to achieve this comes from the Jil JSON serialiser and was written to ensure that it accessed fields in the most efficient order.
It is based on a very clever trick, it dynamically emits IL code, making use of the Ldflda instruction. This is code you could not write in C#, but are able to write directly in IL.
The
ldflda instruction pushes the address of a field located in an object onto the stack. The object must be on the stack as an object reference (type O), a managed pointer (type &), an unmanaged pointer (type native int), a transient pointer (type *), or an instance of a value type. The use of an unmanaged pointer is not permitted in verifiable code. The object's field is specified by a metadata token that must refer to a field member.
By putting this code into my project, I was able to verify that it gives exactly the same field offsets that you can see when using the SOS technique (above). So it’s a nice technique and the only option if you want to get this information without having to drop-down into a debugger.
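For structs (not classes) there is a much simpler, built-in alternative: Marshal.OffsetOf. It can’t replace the Ldflda trick for ordinary classes, and it reports the marshalled layout (which is not guaranteed to match the JITter’s managed layout), but it is handy for sanity-checking explicit layouts. A small sketch with a made-up struct:

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct Sample
{
    public long A;  // offset 0
    public int B;   // offset 8
    public long C;  // offset 16 (B is padded out to the next 8-byte boundary)
}

class Program
{
    static void Main()
    {
        Console.WriteLine(Marshal.OffsetOf(typeof(Sample), "A"));
        Console.WriteLine(Marshal.OffsetOf(typeof(Sample), "B"));
        Console.WriteLine(Marshal.OffsetOf(typeof(Sample), "C"));
    }
}
```

Note how the int is padded so that the following long stays 8-byte aligned, the same kind of packing decision the JITter makes for class fields.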
Results
After all these steps we end up with the results shown in the images below, where the rows are ordered by the “Offset” value.
AbstractHistogramBase.cs -> AbstractHistogram.cs -> Histogram.cs
You can see that with the class hierarchy in place, the fields remain grouped as we want them to (shown by the orange/green/blue highlighting). What is interesting is that the JITter has still rearranged fields within a single group, preferring to put Int64 (long) fields before Int32 (int) fields in this case. This is seen by comparing the ordering of the “Field” column with the “Offset” one, where the values in the “Field” column represent the original ordering of the fields as they appear in the source code.
However when we put all the fields in a single class, we lose the grouping:
Equivalent fields all in one class
Alternative Technique
To achieve the same effect you can use the StructLayout attribute, but this requires that you calculate all the offsets yourself, which can be cumbersome:
[StructLayout(LayoutKind.Explicit, Size = 28, CharSet = CharSet.Ansi)]
public class HistogramAllInOneClass
{
// "Cold" accessed fields. Not used in the recording code path:
[FieldOffset(0)]
internal long identity;
[FieldOffset(8)]
internal long highestTrackableValue;
[FieldOffset(16)]
internal long lowestTrackableValue;
[FieldOffset(24)]
internal int numberOfSignificantValueDigits;
...
}
If you are interested, the full results of this test are available
The post Know thy .NET object memory layout (Updated 2014-09-03) first appeared on my blog Performance is a Feature!
CodeProject
Fri, 4 Jul 2014, 12:00 am
Measuring the impact of the .NET Garbage Collector - An Update
tl;dr
Measuring performance accurately is hard. But it is a whole lot easier if someone with experience takes the time to explain your mistakes to you!!
This is an update to my previous post, if you haven’t read that, you might want to go back and read it first.
After I published that post, Gil Tene (@GilTene) the author of jHiccup, was kind enough to send me an email pointing out a few things I got wrong! It’s great that he took the time to do this and so (with his permission), I’m going to talk through his comments.
Firstly he pointed out that the premise for my investigation wasn’t in line with what jHiccup reports. So instead of answering the question:
what % of pauses do what?
jHiccup answers a different question:
what % of my operations will see what minimum possible latency levels?
He also explained that I wasn’t measuring only GC pauses. This was something which I alluded to in my post, but didn’t explicitly point out.
...I suspect that your current data is somewhat contaminated by hiccups that are not GC pauses (normal blips of 2+ msec due to scheduling, etc.). Raising the 2 msec recording threshold (e.g. to 5 or 10msec) may help with that, but then you may miss some actual GC pauses in your report. There isn't really a good way around this, since "very short" GC pauses and "other system noise" overlap in magnitude.
So in summary, it is better to describe my tests as measuring any pauses in a program, not just GC pauses. Again quoting from Gil:
Over time (and based on experience), I think you may find that just using the jHiccup approach of
"whatever is stopping my apps from running" will become natural, and that you'll stop analyzing the pure "what percent of GC pauses do what" question (if you think about it, the answer to that question is meaningless to applications).
This is so true, it really doesn’t matter what is slowing your app down or causing the user to experience unacceptable pauses. What matters is finding out if and how often this is happening and then doing something about it.
Tweaks made
He also suggested some tweaks to make to the code (emphasis mine):
- Record everything (good and bad):
Your current code only records pauses (measurements above 2msec). To report from a "% of operations" viewpoint, you need to record everything, unconditionally. As you probably see in jHiccup, what I record as hiccups is the measured time minus the expected sleep time. Recording everything will have the obvious effect of shifting the percentile levels to the right.
- Correct for coordinated omission.
My "well trained" eye sees clear evidence of coordinated omission in your current charts (which is fine for "what % of pauses" question, but not for a "what % of operations" question): any vertical jumps in latency on a percentile chart are a strong indication of coordinated omission. While it is possible to have such jumps be "valid" and happening without coordinated omission in cases where the concurrently measured transactions are "either fast or slow, without blocking anything else" (e.g. a web page takes either 5msec or 250msec, and never any other number in between), these are very rare in the wild, and never happen in a jHiccup-like measurement. Then, whenever you see a 200 msec measurement, it also means that you "should have seen" measurements with the values 198, 196, 194, ... 4, but never got a chance to.
Based on these 2 suggestions, the code to record the timings becomes the following:
var timer = new Stopwatch();
var sleepTimeInMsecs = 1;
while (true)
{
timer.Restart();
Thread.Sleep(sleepTimeInMsecs);
timer.Stop();
// Record the pause (using the old method, for comparison)
if (timer.ElapsedMilliseconds > 2)
_oldhistogram.recordValue(timer.ElapsedMilliseconds);
// more accurate method, correct for coordinated omission
_histogram.recordValueWithExpectedInterval(
timer.ElapsedMilliseconds - sleepTimeInMsecs, 1);
}
To see what difference this made to the graphs I re-ran the test, this time just in Server GC mode. You can see the changes on the graph below, the dotted lines are the original (inaccurate) mode and the solid lines show the results after they have been corrected for coordinated omission.
Correcting for Coordinated Omission
This is an interesting subject and after becoming aware of it, I’ve spent some time reading up on it and trying to understand it more deeply. One way to comprehend it, is to take a look at the code in HdrHistogram that handles it:
recordCountAtValue(count, value);
if (expectedIntervalBetweenValueSamples <= 0)
return;
for (long missingValue = value - expectedIntervalBetweenValueSamples;
missingValue >= expectedIntervalBetweenValueSamples;
missingValue -= expectedIntervalBetweenValueSamples)
{
recordCountAtValue(count, missingValue);
}
As you can see, it back-fills all the missing values, stepping down from the value actually being stored in increments of the expected interval.
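The same back-filling logic can be sketched in a few lines of Python (illustrative only; the real HdrHistogram stores counts in pre-allocated buckets rather than a dictionary):

```python
def record_with_expected_interval(histogram, value, expected_interval):
    # Record the observed value itself
    histogram[value] = histogram.get(value, 0) + 1
    if expected_interval <= 0:
        return
    # Back-fill the samples coordinated omission hid: value - interval,
    # value - 2*interval, ... down to the expected interval itself
    missing = value - expected_interval
    while missing >= expected_interval:
        histogram[missing] = histogram.get(missing, 0) + 1
        missing -= expected_interval

hist = {}
record_with_expected_interval(hist, 10, 2)  # one observed 10 ms pause
print(sorted(hist))  # [2, 4, 6, 8, 10]
```

A single observed 10 ms pause, with a 2 ms expected interval, also gets credited with the 8, 6, 4 and 2 ms measurements that were never taken because the measuring thread was stalled.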
It is comforting to know that I’m not alone in making this mistake, the authors of Disruptor and log4j2 both made the same mistake when measuring percentiles in their high-performance code.
Finally if you want some more information on Coordinated Omission and the issue it is trying to prevent, take a look at this post from the Java Advent calendar (you need to scroll down past the calendar to see the actual post). The main point is that without correcting for it, you will be getting inaccurate percentile values, which kind-of defeats the point of making accurate performance measurements in the first place!
The post Measuring the impact of the .NET Garbage Collector - An Update first appeared on my blog Performance is a Feature!
Mon, 23 Jun 2014, 12:00 am
Measuring the impact of the .NET Garbage Collector
There is an update to this post, based on feedback I received.
In my last post I talked about the techniques that the Roslyn team used to minimise the effect of the Garbage Collector (GC). Firstly I guess it’s worth discussing what the actual issue is.
GC Pauses and Latency
In early versions of the .NET CLR, garbage collection was a “Stop the world” event, i.e. before a GC could happen all the threads in your program had to be brought to a safe place and suspended. If your ASP.NET MVC app was in the middle of serving a request, it would not complete until after the GC finished and the latency for that user would be much higher than normal. This is exactly the issue that Stackoverflow ran into a few years ago, in their battles with the .NET Garbage Collector. If you look at the image below (from that blog post), you can see the spikes in response times of over 1 second, caused by Gen 2 collections.
However in the .NET framework 4.5 there were enhancements to the GC brought in that can help mitigate these (emphasis mine)
The new background server GC in the .NET Framework 4.5 offloads
much of the GC work associated with a full blocking collection to dedicated background GC threads that can run concurrently with user code, resulting in
much shorter (less noticeable) pauses. One customer reported a 70% decrease in GC pause times.
But as you can see from the quote, this doesn’t get rid of pauses completely, it just minimises them. Even the SustainedLowLatency mode isn’t enough, “The collector tries to perform only generation 0, generation 1, and concurrent generation 2 collections. Full blocking collections may still occur if the system is under memory pressure.” If you want a full understanding of the different modes, you can see some nice diagrams on this MSDN page.
I’m not in any way being critical or dismissive of these improvements. GC is a really hard engineering task: you need to detect and clean up the unused memory of a program whilst it’s running, ensuring that you don’t affect its correctness in any way and making sure you add as little overhead as possible. Take a look at this video for some idea of what’s involved. The .NET GC is a complex and impressive piece of engineering, but there are still some scenarios where it can introduce pauses to your program.
Aside: In the Java world there is a commercial Pauseless Garbage Collector available from Azul Systems. It uses a patented technique to offer “Predictable, consistent garbage collection (GC) behavior” and “Predictable, consistent application response times”, but there doesn’t seem to be anything like that in the .NET space.
Detecting GC Pauses
But how do you detect GC pauses? Well, the first thing to do is take a look at the properties of the process using the excellent Process Explorer tool from Sysinternals (imagine Task Manager on steroids). It will give you a summary like the one below; the number of Gen 0/1/2 Collections and % Time in GC are the most interesting values to look at.
But the limitation of this is that it has no context: what % of time in GC is too high? How many Gen 2 collections are too many? What effect does GC actually have on your program, in terms of pauses that a customer will experience?
jHiccup and HdrHistogram
To gain a better understanding, I’ve used some of the ideas from the excellent jHiccup Java tool. Very simply, it starts a new thread in which the following code runs:
var timer = new Stopwatch();
while (true)
{
timer.Restart();
Thread.Sleep(1);
timer.Stop();
// allow a little bit of leeway
if (timer.ElapsedMilliseconds > 2)
{
// Record the pause
_histogram.recordValue(timer.ElapsedMilliseconds);
}
}
Any pauses that this thread experiences will also be seen by the other threads running in the program and whilst these pauses aren’t guaranteed to be caused by the GC, it’s the most likely culprit.
Note: this uses the .NET port of the Java HdrHistogram, a full explanation of what HdrHistogram offers and how it works is available in the Readme. But the summary is that it offers a non-intrusive way of collecting samples in a histogram, so that you can then produce a graph of the 50%/99%/99.9%/99.99% percentiles. It does this by allocating all the memory it needs up front, so after start-up it performs no allocations during usage. The benefit of recording full percentile information like this is that you get a much fuller view of any outlying values, compared to just recording a simple average.
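The point about percentiles giving a fuller view than an average is easy to demonstrate; a quick Python sketch (with made-up numbers) shows how a few rare stalls barely move the mean but completely dominate the tail:

```python
# 985 "normal" 1 ms samples plus 15 rare 250 ms stalls (made-up numbers)
pauses = [1] * 985 + [250] * 15

n = len(pauses)
mean = sum(pauses) / n
# nearest-rank 99th percentile (integer arithmetic to avoid float rounding)
p99 = sorted(pauses)[(99 * n) // 100 - 1]

print(mean, p99)  # 4.735 250
```

An average of under 5 ms sounds fine, yet 1 in every 100 operations is seeing a 250 ms pause, which is exactly the kind of outlier the histogram approach surfaces.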
To trigger garbage collection, the test program also runs several threads, each executing the code below. In a loop, each thread creates a large string and a byte array, to simulate what a web server might be doing when generating a response to a web request (for instance de-serialising some JSON and creating an HTML page). Then, to ensure that the objects are kept around long enough, they are both put into a Least Recently Used (LRU) cache that holds the 2,000 most recent items.
processingThreads[i] = new Thread(() =>
{
var threadCounter = 0;
while (true)
{
var text = new string((char)random.Next(start, end + 1), 1000);
stringCache.Set(text.GetHashCode(), text);
// Use 80K, If we are > 85,000 bytes = LOH and we don't want these there
var bytes = new byte[80 * 1024];
random.NextBytes(bytes);
bytesCache.Set(bytes.GetHashCode(), bytes);
threadCounter++;
Thread.Sleep(1); // So we don't thrash the CPU!!!!
}
});
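The LRU cache itself isn’t shown in the post; a minimal sketch of one (hypothetical, the names are my own, not the test program’s actual implementation) is just an ordered map that evicts the oldest entry once it goes over capacity:

```python
from collections import OrderedDict

class LruCache:
    """Minimal LRU cache: keeps at most `capacity` most-recently-set items."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def set(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)     # refresh recency
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least-recently-used

cache = LruCache(2)
cache.set("a", 1); cache.set("b", 2); cache.set("c", 3)
print(list(cache._items))  # ['b', 'c'] -- 'a' was evicted
```

Holding the 2,000 newest items this way keeps objects alive just long enough to be promoted out of Gen 0, which is what makes the test put realistic pressure on the GC.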
Test Results
The test was left running for 10 mins, in each of the following GC modes:
- Workstation Batch (non-concurrent)
- Workstation Interactive (concurrent)
- Server Batch (non-concurrent)
- Server Interactive (concurrent)
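For reference, these modes are selected in the application’s configuration file: Server versus Workstation via the gcServer element, and concurrent (Interactive) versus non-concurrent (Batch) via gcConcurrent. A minimal app.config for Server Interactive mode looks like:

```xml
<configuration>
  <runtime>
    <gcServer enabled="true"/>
    <gcConcurrent enabled="true"/>
  </runtime>
</configuration>
```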
The results are below; you can clearly see that the Server modes offer lower pauses than the Workstation modes, and that Interactive (concurrent) mode is also an improvement over Batch mode. The graph shows pause times on the Y axis (so lower is better) and the X axis plots the percentiles, scaled logarithmically.
If we take a closer look at just the 99th percentile, i.e. the value that 99 out of 100 pauses fall below, the difference is even clearer. Here you can see that the Workstation modes have pauses of up to 25 milliseconds, compared to 10 milliseconds for the Server modes.
SustainedLowLatency Mode
As a final test, the program was run using the new SustainedLowLatency mode, to see what effect that has. In the graph below you can see this offers lower pause times, although it isn’t able to sustain these for an unlimited period of time. After 10 minutes we start to see longer pauses compared to those we saw when running the test for just 5 minutes.
It’s worth noting that there is a trade-off to take into account when using this mode, SustainedLowLatency mode is:
For applications that have time-sensitive operations for a contained but potentially longer duration of time during which interruptions from the garbage collector could be disruptive. For example, applications that need quick response times as market data changes during trading hours.
This mode results in a larger managed heap size than other modes. Because it does not compact the managed heap, higher fragmentation is possible. Ensure that sufficient memory is available.
All the data used in these tests can be found in the spreadsheet GC Pause Times - comparison
Discuss on the csharp sub-reddit
Discuss on Hacker News
The post Measuring the impact of the .NET Garbage Collector first appeared on my blog Performance is a Feature!
Wed, 18 Jun 2014, 12:00 am
Roslyn code base - performance lessons (part 2)
In my previous post, I talked about some of the general performance lessons that can be learnt from the Roslyn project. This post builds on that and looks at specific examples from the code base.
Generally the performance gains within Roslyn come down to one thing:
Ensuring the garbage collector does the least possible amount of work
.NET is a managed language and one of the features it provides is memory management, via the garbage collector (GC). However, GC doesn’t come for free: it has to find and inspect all the live objects (and their descendants) in the “mark” phase, before cleaning up any dead objects in the “sweep” phase.
This is backed up by the guidance provided for contributing to Roslyn, from the Coding Conventions section:
- Avoid allocations in compiler hot paths:
- Avoid LINQ.
- Avoid using foreach over collections that do not have a struct enumerator.
- Consider using an object pool. There are many usages of object pools in the compiler to see an example.
It’s interesting to see LINQ specifically called out, I think it’s great and it does allow you to write much more declarative and readable code, in fact I’d find it hard to write C# code without it. But behind the scenes there are lots of hidden allocations going on and they are not always obvious. If you don’t believe me, have a go at Joe Duffy’s quiz (about 1/2 way through the blog post).
Techniques used
There are several techniques used in the Roslyn code base that either minimise or eliminate allocations, thus giving the GC less work to do. One important characteristic they all share is that they are only applied to “Hot Paths” within the code. Optimising prematurely is never recommended, nor is using optimisations on parts of your code that are rarely exercised. You need to measure, identify the bottlenecks and understand the hot paths through your code, before you apply any optimisations.
Avoiding allocations altogether
Within the .NET framework there are many methods that cause allocations, for instance String.Trim(..) or any LINQ methods. To combat this we can find several examples where code was specifically re-written, for example:
// PERF: Avoid calling string.Trim() because that allocates a new substring
// PERF: Expansion of "assemblies.Any(a => a.NamespaceNames.Contains(namespaceName))" to avoid allocating a lambda.
// PERF: Beware ImmutableArray.Builder.Sort allocates a Comparer wrapper object
Another good lesson is that each improvement is annotated with a “// PERF:” comment to explain the reasoning, I guess this is to prevent another developer coming along and re-factoring the code to something more readable (at the expense of performance).
Object pooling with a Cache
Another strategy used is object pooling: rather than newing up objects each time, old ones are re-used. Again this helps relieve pressure on the GC, as fewer objects are allocated and the ones that are stick around for a while (often the lifetime of the program). This is a sweet-spot for the .NET GC, as per the advice from Rico Mariani’s excellent Garbage Collector Basics and Performance Hints:
Too Many Almost-Long-Life Objects
Finally, perhaps the biggest pitfall of the generational garbage collector is the creation of many objects, which are neither exactly temporary nor are they exactly long-lived. These objects can cause a lot of trouble, because they will not be cleaned up by a gen0 collection (the cheapest), as they will still be necessary, and they might even survive a gen1 collection because they are still in use, but they soon die after that.
We can see how this was handled in Roslyn in the code below from StringBuilderPool, that makes use of the more generic ObjectPool infrastructure and helper classes. Obviously it was such a widely used pattern that they built a generic class to handle the bulk of the work, making it easy to write an implementation for a specific type, including StringBuilder, Dictionary, HashSet and Stream.
internal static class StringBuilderPool
{
public static StringBuilder Allocate()
{
return SharedPools.Default().AllocateAndClear();
}
public static void Free(StringBuilder builder)
{
SharedPools.Default().ClearAndFree(builder);
}
public static string ReturnAndFree(StringBuilder builder)
{
SharedPools.Default().ForgetTrackedObject(builder);
return builder.ToString();
}
}
Having a class like this makes sense, as a large part of compiling is parsing and building strings. Not only do they use a StringBuilder to save lots of temporary String allocations, but they also re-use the StringBuilder objects themselves, saving the GC the work of having to clean them up.
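The mechanics of such a pool are simple; a toy Python version (illustrative only, it shares nothing with Roslyn’s actual ObjectPool beyond the allocate/free shape) might look like:

```python
class ObjectPool:
    """Toy object pool: hands out recycled instances instead of
    allocating fresh ones, relieving GC pressure on hot paths."""
    def __init__(self, factory):
        self._factory = factory
        self._free = []

    def allocate(self):
        # Re-use a freed instance if one is available, else allocate
        return self._free.pop() if self._free else self._factory()

    def free(self, obj):
        obj.clear()          # reset state before recycling
        self._free.append(obj)

pool = ObjectPool(list)  # pooling lists as a stand-in for StringBuilder
a = pool.allocate(); a.append("hello")
pool.free(a)
b = pool.allocate()
print(b is a, b)  # True [] -- same instance, reset
```

The key property is the last line: the second allocation hands back the very same object, emptied, so nothing new was created for the GC to track.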
Interestingly enough this technique has also been used inside the .NET framework itself, you can see this in the code below from StringBuilderCache.cs. Again, the comment shows that the optimisation was debated and a trade-off between memory usage and efficiency was weighed up.
internal static class StringBuilderCache
{
// The value 360 was chosen in discussion with performance experts as a compromise between using
// as little memory (per thread) as possible and still covering a large part of short-lived
// StringBuilder creations on the startup path of VS designers.
private const int MAX_BUILDER_SIZE = 360;
[ThreadStatic]
private static StringBuilder CachedInstance;
public static StringBuilder Acquire(int capacity = StringBuilder.DefaultCapacity)
{
if (capacity <= MAX_BUILDER_SIZE)
{
StringBuilder sb = CachedInstance;
if (sb != null && capacity <= sb.Capacity)
{
CachedInstance = null;
sb.Clear();
return sb;
}
}
return new StringBuilder(capacity);
}
}
Tue, 10 Jun 2014, 12:00 am
Roslyn code base - performance lessons (part 1)
At Build 2014 Microsoft open-sourced their next-generation C#/VB.NET compiler, called Roslyn. The project is hosted on CodePlex and you can even browse the source, using the new Reference Source browser, which is itself powered by Roslyn (that’s some crazy meta-recursion going on there!).
Easter Eggs
There’s also some nice info available, for instance you can get a summary of the number of lines of code, files etc, you can also list the projects and assemblies.
ProjectCount=50
DocumentCount=4,366
LinesOfCode=2,355,329
BytesOfCode=96,850,461
DeclaredSymbols=124,312
DeclaredTypes=6,649
PublicTypes=2,076
That’s ~2.3 million lines of code, across over 4300 files! (HT to Slaks for pointing out this functionality)
Being part of the process
If you are in any way interested in new C# language features or just want to find out how a compiler is built, this is really great news. On top of this, not only have Microsoft open sourced the code, the entire process is there for everyone to see. You can get a peek behind the scenes of the C# Design Meetings, debate possible new features with some of the designers and best of all, they seem genuinely interested in getting community feedback.
Taking performance seriously
But what I find really interesting is the performance lessons that can be learned. As outlined in this post, performance is something they take seriously. It’s not really surprising: the new compiler can’t afford to be slower than the old C++ one, and developers are pretty demanding customers, so any performance issues would be noticed and complained about.
To give you an idea of what’s involved, here’s the list of scenarios that they measure the performance against.
- Build timing of small, medium, and (very) large solutions
- Typing speed when working in the above solutions, including “goldilocks” tests where we slow the typing entry to the speed of a human being
- IDE feature speed (navigation, rename, formatting, pasting, find all references, etc…)
- Peak memory usage for the above solutions
- All of the above for multiple configurations of CPU cores and available memory
And to make sure that they have accurate measurements and that they know as soon as performance has degraded (emphasis mine):
These are all assessed & reported daily, so that we can identify & repair any check-in that introduced a regression as soon as possible, before it becomes entrenched. Additionally, we don’t just check for the average time elapsed on a given metric; we also assess the 98th & 99.9th percentiles, because we want good performance all of the time, not just some of the time.
There’s lots of information about why just using averages is a bad idea, particularly when dealing with response times, so it’s good to see that they are using percentiles as well. Running performance tests as part of their daily builds and tracking those numbers over time is a really good example of taking performance seriously; performance testing wasn’t left till the end, as an after-thought.
I’ve worked on projects where the performance targets were at best vague and ensuring they were met was left till right at the end, after all the features had been implemented. It’s much harder to introduce performance testing at that point; we wouldn’t do that with unit testing, so why do it with performance testing?
This ties in with the Stack Overflow mantra:
Performance is a feature
Next time I’ll be looking at specific examples of performance enhancements made in the code base and what problems they are trying to solve.
The post Roslyn code base - performance lessons (part 1) first appeared on my blog Performance is a Feature!
Thu, 5 Jun 2014, 12:00 am