Analysing .NET start-up time with Flamegraphs
Recently I gave a talk at the NYAN Conference called ‘From ‘dotnet run’ to ‘hello world’:
In the talk I demonstrate how you can use PerfView to analyse where the .NET Runtime is spending it’s time during start-up:
From 'dotnet run' to 'hello world' from
Matt Warren
This post is a step-by-step guide to that demo.
Code Sample
For this exercise I delibrately only look at what the .NET Runtime is doing during program start-up, so I ensure the minimum amount of user code is runing, hence the following ‘Hello World’:
using System;
namespace HelloWorld
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Hello World!");
Console.WriteLine("Press to exit");
Console.ReadLine();
}
}
}
The Console.ReadLine()
call is added because I want to ensure the process doesn’t exit whilst PerfView is still collecting data.
Data Collection
PerfView is a very powerful program, but not the most user-friendly of tools, so I’ve put togerther a step-by-step guide:
- Download and run a recent version of ‘PerfView.exe’
- Click ‘Run a command’ or (Alt-R’) and “collect data while the command is running”
- Ensure that you’ve entered values for:
- “Command”
- “Current Dir”
- Tick ‘Cpu Samples’ if it isn’t already selected
- Set ‘Max Collect Sec’ to 15 seconds (because our ‘HelloWorld’ app never exits, we need to ensure PerfView stops collecting data at some point)
- Ensure that ‘.NET Symbol Collection’ is selected
- Hit ‘Run Command
If you then inspect the log you can see that it’s collecting data, obtaining symbols and then finally writing everything out to a .zip file. Once the process is complete you should see the newly created file in the left-hand pane of the main UI, in this case it’s called ‘PerfViewData.etl.zip’
Data Processing
Once you have your ‘.etl.zip’ file, double-click on it and you will see a tree-view with all the available data. Now, select ‘CPU Stacks’ and you’ll be presented with a view like this:
Notice there’s alot of ‘?’ characters in the list, this means that PerfView is not able to work out the method names as it hasn’t resolved the necessary symbols for the Runtime dlls. Lets fix that:
- Open ‘CPU Stacks’
- In the list, select the ‘HelloWorld’ process (PerfView collects data machine-wide)
- In the ‘GroupPats’ drop-down, select ‘[no grouping]’
- Optional, change the ‘Symbol Path’ from the default to something else
- In the ‘By name’ tab, hit ‘Ctrl+A’ to select all the rows
- Right-click and select ‘Lookup Symbols’ (or just hit ‘Alt+S’)
Now the ‘CPU Stacks’ view should look something like this:
Finally, we can get the data we want:
- Select the ‘Flame Graph’ tab
- Change ‘GroupPats’ to one of the following for a better flame graph:
- [group module entries] {%}!=>module $1
- [group class entries] {%!*}.%(=>class $1;{%!*}::=>class $1
- Change ‘Fold%’ to a higher number, maybe 3%, to get rid of any thin bars (any higher and you start to loose information)
Now, at this point I actually recommend exporting the PerfView data into a format that can be loaded into https://speedscope.app/ as it gives you a much better experience. To do this click File -> Save View As and then in the ‘Save as type’ box select Speed Scope Format. Once that’s done you can ‘browse’ that file at speedscope.app, or if you want you can just take a look at one I’ve already created.
Note: If you’ve never encountered ‘flamegraphs’ before, I really recommend reading this excellent explanation by Julia Evans:
perf & flamegraphs pic.twitter.com/duzWs2hoLT
— 🔎Julia Evans🔍 (@b0rk)
December 26, 2017
Anaylsis of .NET Runtime Startup
Finally, we can answer our original question:
Where does the .NET Runtime spend time during start-up?
Here’s the data from the flamegraph summarised as text, with links the corresponding functions in the ‘.NET Core Runtime’ source code:
- Entire Application - 100% - 233.28ms
- Everything except
helloworld!wmain
- 21%
helloworld!wmain
- 79% - 184.57ms
hostpolicy!create_hostpolicy_context
- 30% - 70.92ms here
hostpolicy!create_coreclr
- 22% - 50.51ms here
coreclr!CorHost2::Start
- 9% - 20.98ms here
coreclr!CorHost2::CreateAppDomain
- 10% - 23.52ms here
hostpolicy!runapp
- 20% - 46.20ms here, ends up calling into Assembly::ExecuteMainMethod
here
coreclr!RunMain
- 9.9% - 23.12ms here
coreclr!RunStartupHooks
- 8.1% - 19.00ms here
hostfxr!resolve_frameworks_for_app
- 3.4% - 7.89ms here
So, the main places that the runtime spends time are:
- 30% of total time is spent Launching the runtime, controlled via the ‘host policy’, which mostly takes place in
hostpolicy!create_hostpolicy_context
(30% of total time)
- 22% of time is spend on Initialisation of the runtime itself and the initial (and only) AppDomain it creates, this can be see in
CorHost2::Start
(native) and CorHost2::CreateAppDomain
(managed). For more info on this see The 68 things the CLR does before executing a single line of your code
- 20% was used JITting and executing the
Main
method in our ‘Hello World’ code sample, this started in Assembly::ExecuteMainMethod
above.
To confirm the last point, we can return to PerfView and take a look at the ‘JIT Stats Summary’ it produces. From the main menu, under ‘Advanced Group’ -> ‘JIT Stats’ we see that 23.1 ms or 9.1% of the total CPU time was spent JITing:
Tue, 3 Mar 2020, 12:00 am
Under the hood of "Default Interface Methods"
Background
‘Default Interface Methods’ (DIM) sometimes referred to as ‘Default Implementations in Interfaces’, appeared in C# 8. In case you’ve never heard of the feature, here’s some links to get you started:
Also, there are quite a few other blogs posts discussing this feature, but as you can see opinion is split on whether it’s useful or not:
But this post isn’t about what they are, how you can use them or if they’re useful or not. Instead we will be exploring how ‘Default Interface Methods’ work under-the-hood, looking at what the .NET Core Runtime has to do to make them work and how the feature was developed.
Table of Contents
Development Timeline and PRs
First of all, there are a few places you can go to get a ‘high-level’ understanding of what was done:
Initial work, Prototype and Timeline
Interesting PR’s done after the prototype (newest -> oldest)
Once the prototype was merged in, there was additional feature work done to ensure that DIM’s worked across different scenarios:
Bug fixes done since the Prototype (newest -> oldest)
In addition, there were various bugs fixes done to ensure that existing parts of the CLR played nicely with DIMs:
Possible future work
Finally, there’s no guarantee if or when this will be done, but here are the remaining issues associated with the project:
Default Interface Methods ‘in action’
Now that we’ve seen what was done, let’s look at what that all means, starting with this code that simply demonstrates ‘Default Interface Methods’ in action:
interface INormal {
void Normal();
}
interface IDefaultMethod {
void Default() => WriteLine("IDefaultMethod.Default");
}
class CNormal : INormal {
public void Normal() => WriteLine("CNormal.Normal");
}
class CDefault : IDefaultMethod {
// Nothing to do here!
}
class CDefaultOwnImpl : IDefaultMethod {
void IDefaultMethod.Default() => WriteLine("CDefaultOwnImpl.IDefaultMethod.Default");
}
// Test out the Normal/DefaultMethod Interfaces
INormal iNormal = new CNormal();
iNormal.Normal(); // prints "CNormal.Normal"
IDefaultMethod iDefault = new CDefault();
iDefault.Default(); // prints "IDefaultMethod.Default"
IDefaultMethod iDefaultOwnImpl = new CDefaultOwnImpl();
iDefaultOwnImpl.Default(); // prints "CDefaultOwnImpl.IDefaultMethod.Default"
The first way we can understand how they are implemented is by using Type.GetInterfaceMap(Type)
(which actually had to be fixed to work with DIMs), this can be done with code like this:
private static void ShowInterfaceMapping(Type @implemetation, Type @interface) {
InterfaceMapping map = @implemetation.GetInterfaceMap(@interface);
Console.WriteLine($"{map.TargetType}: GetInterfaceMap({map.InterfaceType})");
for (int counter = 0; counter TestApp.CNormal::Normal (different)
MethodHandle 0x7FF993916A80 --> MethodHandle 0x7FF993916B10
FunctionPtr 0x7FF99385FC50 --> FunctionPtr 0x7FF993861880
TestApp.CDefault: GetInterfaceMap(TestApp.IDefaultMethod)
TestApp.IDefaultMethod::Default --> TestApp.IDefaultMethod::Default (same)
MethodHandle 0x7FF993916BD8 --> MethodHandle 0x7FF993916BD8
FunctionPtr 0x7FF99385FC78 --> FunctionPtr 0x7FF99385FC78
TestApp.CDefaultOwnImpl: GetInterfaceMap(TestApp.IDefaultMethod)
TestApp.IDefaultMethod::Default --> TestApp.CDefaultOwnImpl::TestApp.IDefaultMethod.Default (different)
MethodHandle 0x7FF993916BD8 --> MethodHandle 0x7FF993916D10
FunctionPtr 0x7FF99385FC78 --> FunctionPtr 0x7FF9938663A0
So here we can see that in the case of IDefaultMethod
interface on the CDefault
class the interface and method implementations are the same. As you can see, in the other scenarios the interface method maps to a different method implementation.
But lets look at bit lower, making use of WinDBG and the SOS extension to get a peek into the internal ‘data structures’ that the runtime uses.
First, lets take a look at the MethodTable
(dumpmt
) for the INormal
interface:
> dumpmt -md 00007ff8bcc31dd8
EEClass: 00007FF8BCC2C420
Module: 00007FF8BCC0F788
Name: TestApp.INormal
mdToken: 0000000002000002
File: C:\DefaultInterfaceMethods\TestApp\bin\Debug\netcoreapp3.0\TestApp.dll
BaseSize: 0x0
ComponentSize: 0x0
Slots in VTable: 1
Number of IFaces in IFaceMap: 0
--------------------------------------
MethodDesc Table
Entry MethodDesc JIT Name
00007FF8BCB70580 00007FF8BCC31DC8 NONE TestApp.INormal.Normal()
So we can see that the interface has an entry for the Normal()
method, as expected, but lets look in more detail at the MethodDesc
(dumpmd
):
> dumpmd 00007FF8BCC31DC8
Method Name: TestApp.INormal.Normal()
Class: 00007ff8bcc2c420
MethodTable: 00007ff8bcc31dd8
mdToken: 0000000006000001
Module: 00007ff8bcc0f788
IsJitted: no
Current CodeAddr: ffffffffffffffff
Version History:
ILCodeVersion: 0000000000000000
ReJIT ID: 0
IL Addr: 0000000000000000
CodeAddr: 0000000000000000 (MinOptJitted)
NativeCodeVersion: 0000000000000000
So whilst the method exists in the interface definition, it’s clear that the method has not been jitted (IsJitted: no
) and in fact it never will, as it can never be executed.
Now lets compare that output with the one for the IDefaultMethod
interface, again the MethodTable
(dumpmt
) and the MethodDesc
(dumpmd
):
> dumpmt -md 00007ff8bcc31e68
EEClass: 00007FF8BCC2C498
Module: 00007FF8BCC0F788
Name: TestApp.IDefaultMethod
mdToken: 0000000002000003
File: C:\DefaultInterfaceMethods\TestApp\bin\Debug\netcoreapp3.0\TestApp.dll
BaseSize: 0x0
ComponentSize: 0x0
Slots in VTable: 1
Number of IFaces in IFaceMap: 0
--------------------------------------
MethodDesc Table
Entry MethodDesc JIT Name
00007FF8BCB70590 00007FF8BCC31E58 JIT TestApp.IDefaultMethod.Default()
> dumpmd 00007FF8BCC31E58
Method Name: TestApp.IDefaultMethod.Default()
Class: 00007ff8bcc2c498
MethodTable: 00007ff8bcc31e68
mdToken: 0000000006000002
Module: 00007ff8bcc0f788
IsJitted: yes
Current CodeAddr: 00007ff8bcb765c0
Version History:
ILCodeVersion: 0000000000000000
ReJIT ID: 0
IL Addr: 0000000000000000
CodeAddr: 00007ff8bcb765c0 (MinOptJitted)
NativeCodeVersion: 0000000000000000
Here we see something very different, the MethodDesc
entry in the MethodTable
actually has jitted, executable code associated with it.
Enabling Methods on an Interface
So we’ve seen that ‘default interface methods’ are wired up by the runtime, but how does that happen?
Firstly, it’s very illuminating to look at the initial prototype of the feature in CoreCLR PR #10505, because we can understand at the lowest level what the feature is actually enabling, from /src/vm/classcompat.cpp:
Here we see why DIM didn’t require any changes to the .NET ‘Intermediate Language’ (IL) op-codes, instead they are enabled by relaxing a previous restriction. Before this change, you weren’t able to add ‘virtual, non-abstract’ or ‘non-virtual’ methods to an interface:
- “Virtual Non-Abstract Interface Method.” (
BFA_VIRTUAL_NONAB_INT_METHOD
)
- “Nonvirtual Instance Interface Method.” (
BFA_NONVIRT_INST_INT_METHOD
)
This ties in with the proposed changes to the ECMA-335 specification, from the ‘Default interface methods’ design doc:
The major changes are:
- Interfaces are now allowed to have instance methods (both virtual and non-virtual). Previously we only allowed abstract virtual methods.
- Interfaces obviously still can’t have instance fields.
- Interface methods are allowed to MethodImpl other interface methods the interface requires (but we require the MethodImpls to be final to keep things simple) - i.e. an interface is allowed to provide (or override) an implementation of another interface’s method
However, just allowing ‘virtual, non-abstract’ or ‘non-virtual’ methods to exist on an interface is only the start, the runtime then needs to allow code to call those methods and that is far harder!
Resolving the Method Dispatch
In .NET, since version 2.0, all interface methods calls have taken place via a mechanism known as Virtual Stub Dispatch:
Virtual stub dispatching (VSD) is the technique of using stubs for virtual method invocations instead of the traditional virtual method table. In the past, interface dispatch required that interfaces had process-unique identifiers, and that every loaded interface was added to a global interface virtual table map. This requirement meant that all interfaces and all classes that implemented interfaces had to be restored at runtime in NGEN scenarios, causing significant startup working set increases. The motivation for stub dispatching was to eliminate much of the related working set, as well as distribute the remaining work throughout the lifetime of the process.
Although it is possible for VSD to dispatch both virtual instance and interface method calls, it is currently used only for interface dispatch.
For more information I recommend reading the section on C#’s slotmaps in the excellent article on ‘Interface Dispatch’ by Lukas Atkinson.
So, to make DIM work, the runtime has to wire up any ‘default methods’, so that they integrate with the ‘virtual stub dispatch’ mechanism. We can see this in action by looking at the call stack from the hand-crafted assembly stub (ResolveWorkerAsmStub
) all the way down to FindDefaultInterfaceImplementation(..)
which finds the correct method, given an interface (pInterfaceMD
) and the default method to call (pInterfaceMT
):
- coreclr.dll!MethodTable::FindDefaultInterfaceImplementation(MethodDesc *pInterfaceMD, MethodTable *pInterfaceMT, MethodDesc **ppDefaultMethod, int allowVariance, int throwOnConflict) Line 6985 C++
- coreclr.dll!MethodTable::FindDispatchImpl(unsigned int typeID, unsigned int slotNumber, DispatchSlot *pImplSlot, int throwOnConflict) Line 6851 C++
- coreclr.dll!MethodTable::FindDispatchSlot(unsigned int typeID, unsigned int slotNumber, int throwOnConflict) Line 7251 C++
- coreclr.dll!VirtualCallStubManager::Resolver(MethodTable *pMT, DispatchToken token, OBJECTREF *protectedObj, unsigned __int64 *ppTarget, int throwOnConflict) Line 2208 C++
- coreclr.dll!VirtualCallStubManager::ResolveWorker(StubCallSite *pCallSite, OBJECTREF *protectedObj, DispatchToken token, VirtualCallStubManager::StubKind stubKind) Line 1874 C++
- coreclr.dll!VSD_ResolveWorker(TransitionBlock *pTransitionBlock, unsigned __int64 siteAddrForRegisterIndirect, unsigned __int64 token, unsigned __int64 flags) Line 1683 C++
- coreclr.dll!ResolveWorkerAsmStub() Line 42 Unknown
If you want to explore the call-stack in more detail, you can follow the links below:
ResolveWorkerAsmStub
here
VSD_ResolveWorker(..)
here
VirtualCallStubManager::ResolveWorker(..)
here
VirtualCallStubManager::Resolver(..)
here
MethodTable::FindDispatchSlot(..)
here
[MethodTable::FindDispatchImpl(..)
here or here
- Finally ending up in
MethodTable::FindDefaultInterfaceImplementation(..)
here
Analysis of FindDefaultInterfaceImplementation(..)
So the code in FindDefaultInterfaceImplementation(..)
is at the heart of the feature, but what does it need to do and how does it do it? This list from Finalize override lookup algorithm #12753 gives us some idea of the complexity:
- properly detect diamond shape positive case (where I4 overrides both I2/I3 which both overrides I1) by keep tracking of a current list of best candidates. I went for the simplest algorithm and didn’t build any complex graph / DFS since the majority case the list of interfaces would be small, and interface dispatch cache would ensure majority of cases we don’t need to redo the (slow) dispatch. If needed we can revisit this to make it a proper topological sort.
- VerifyVirtualMethodsImplemented now properly validates default interface scenarios - it is happy if there is at least one implementation and early returns. It doesn’t worry about conflicting overrides, for performance reasons.
- NotSupportedException thrown in conflicting override scenario now has a proper error message
- properly supports GVM when detecting method impl overrides
- Revisited code that adds method impl for interfaces. added proper methodimpl validation and ensure methodimpl are virtual and final (and throw exception if it is not final)
- Added test scenario with method that has multiple method impl. found and fixed a bug where the slot array is not big enough when building method impls for interfaces.
In addition, the ‘two-pass’ algorithm was implemented in Implement two pass algorithm for variant interface dispatch #21355, which contains an interesting discussion of the edge-cases that need to be handled.
So onto the code, this is the high-level view of the algorithm:
- Which actually starts in
MethodTable::FindDispatchImpl(..)
here, where FindDefaultInterfaceImplementation
can be called twice:
- First time to try and find an ‘exact match’ (
allowVariance
=false)
- Then if that fails, it’s called again to try and find a ‘variant match’ (
allowVariance
=true)
- The entire
FindDefaultInterfaceImplementation
method is here, it’s fairly straight-forward and relatively easy to understand, plus there’s only ~270 LOC and they’re all very well commented. The high-level algorithm is the following:
- Walk interface from derived class to parent class here, this is a straight-forward implementation that may me revisited if it doesn’t scale well
- Then scan through each class looking for a match:
- an ‘exact match’
- a ‘generic variance match’, i.e. the interfaces match via ‘casting’, but ultimately have the same
TypeDef
- a ‘more specific interface’ that matches, this match is made more complicated by the fact that ‘generic instantiations’ are involved
- a ‘more specific interface’ matches, but without generics involved, so much simpler to calculate
- If the previous step produced a match, double-check that it is the most specific interface match seen so far, by keeping a ‘candidates list’ and classifying each scenario as:
- a ‘tie’ which is ignored, i.e. a ‘variant match’ on the same type
- a ‘more specific’ match, which is used to update the ‘candidates list’
- a ‘less-specific’ match, so no need to carry on with this candidate
- Finally, a scan is done to see if there are any conflicts here, which is acceptable when
allowVariance=true
, but otherwise throws an exception
- That’s it, the ‘best-candidate’ is then returned to the caller (assuming there is one)
Diamond Inheritance Problem
Finally, the ‘diamond inheritance problem’ was mentioned in a few of the PRs/Issues related to the feature, but what is it?
A good place to starts is one of the test cases, diamondshape.cs. However there’s a more concise example in the C#8 Language Proposal:
interface IA
{
void M();
}
interface IB : IA
{
override void M() { WriteLine("IB"); }
}
class Base : IA
{
void IA.M() { WriteLine("Base"); }
}
class Derived : Base, IB // allowed?
{
static void Main()
{
Ia a = new Derived();
a.M(); // what does it do?
}
}
So the issue is which of the matching interface methods should be used, in this case IB.M()
or Base.IA.M()
? The resolution, as outlined in the C#8 language proposal was to use the most specific override:
Closed Issue: Confirm the draft spec, above, for most specific override as it applies to mixed classes and interfaces (a class takes priority over an interface). See https://github.com/dotnet/csharplang/blob/master/meetings/2017/LDM-2017-04-19.md#diamonds-with-classes.
Which ties in with the ‘more-specific’ and ‘less-specific’ steps we saw in the outline of FindDefaultInterfaceImplementation
above.
Summary
So there you have it, an entire feature delivered end-to-end, yay for .NET (Core) being open source! Thanks to the runtime engineers for making their Issues and PRs easy to follow and for adding such great comments to their code! Also kudos to the language designers for making their proposals and meeting notes available for all to see (e.g. LDM-2017-04-19).
Whether you think they are useful or not, it’s hard to argue that ‘Default Interface Methods’ aren’t well designed and well implemented.
But what makes it even more unique feature is that it required the compiler and runtime teams working together to make it possible!
Wed, 19 Feb 2020, 12:00 am
Research based on the .NET Runtime
Over the last few years, I’ve come across more and more research papers based, in some way, on the ‘Common Language Runtime’ (CLR).
So armed with Google Scholar and ably assisted by Semantic Scholar, I put together the list below.
Note: I put the papers into the following categories to make them easier to navigate (papers in each category are sorted by date, newest -> oldest):
- Using the .NET Runtime as a case-study
- to prove its correctness, study how it works or analyse its behaviour
- Research carried out by Microsoft Research, the research subsidiary of Microsoft.
- “It was formed in 1991, with the intent to advance state-of-the-art computing and solve difficult world problems through technological innovation in collaboration with academic, government, and industry researchers” (according to Wikipedia)
- Papers based on the Mono Runtime
- a ‘Cross-Platform, open-source .NET framework’
- Using ‘Rotor’, real name ‘Shared Source CLI (SSCLI)’
- from Wikipedia “Microsoft provides the Shared Source CLI as a reference CLI implementation suitable for educational use”
Any papers I’ve missed? If so, please let me know in the comments or on Twitter
- .NET Runtime as a Case-Study
- Pitfalls of C# Generics and Their Solution Using Concepts (Belyakova & Mikhalkovich, 2015)
- Efficient Compilation of .NET Programs for Embedded Systems (Sallenaveab & Ducournaub, 2011)
- Type safety of C# and .Net CLR (Fruja, 2007)
- Modeling the .NET CLR Exception Handling Mechanism for a Mathematical Analysis (Fruja & Börger, 2006)
- Analysis of the .NET CLR Exception Handling Mechanism (Fruja & Börger, 2005)
- A Modular Design for the Common Language Runtime (CLR) Architecture (Fruja, 2005)
- Cross-language Program Slicing in the .NET Framework (Pócza, Biczó & Porkoláb, 2005)
- Design and Implementation of a high-level multi-language . NET Debugger (Strein, 2005)
- A High-Level Modular Definition of the Semantics of C# (Börger, Fruja, Gervasi & Stärk, 2004)
- An ASM Specification of C# Threads and the .NET Memory Model (Stärk and Börger, 2004)
- Common Language Runtime : a new virtual machine (Ferreira, 2004)
- JVM versus CLR: a comparative study (Singer, 2003)
- Runtime Code Generation with JVM And CLR (Sestoft, 2002)
- Microsoft Research
- Project Snowflake: Non-blocking safe manual memory management in .NET (Parkinson, Vaswani, Costa, Deligiannis, Blankstein, McDermott, Balkind & Vytiniotis, 2017)
- Simple, Fast and Safe Manual Memory Management (Kedia, Costa, Vytiniotis, Parkinson, Vaswani & Blankstein, 2017)
- Uniqueness and Reference Immutability for Safe Parallelism (Gordon, Parkinson, Parsons, Bromfield & Duffy, 2012)
- A study of concurrent real-time garbage collectors (Pizlo, Petrank & Steensgaard, 2008)
- Optimizing concurrency levels in the. net threadpool: A case study of controller design and implementation (Hellerstein, Morrison & Eilebrecht, 2008)
- Stopless: a real-time garbage collector for multiprocessors. (Pizlo, Frampton, Petrank & Steensgaard, 2007)
- Securing the .NET Programming Model (Kennedy, 2006)
- Combining Generics, Pre-compilation and Sharing Between Software-Based Processes (Syme & Kennedy, 2004)
- Formalization of Generics for the .NET Common Language Runtime (Yu, Kennedy & Syme, 2004)
- Runtime Verification of .NET Contracts (Barnett & Schulte, 2003)
- Design and Implementation of Generics for the .NET Common Language Runtime (Kennedy & Syme, 2001)
- Typing a Multi-Language Intermediate Code (Gordon & Syme, 2001)
- Mono Runtime
- Static and Dynamic Analysis of Android Malware and Goodware Written with Unity Framework (Shim, Lim, Cho, Han & Park, 2018)
- Reducing startup time of a deterministic virtualizing runtime environment (Däumler & Werner, 2013)
- Detecting Clones Across Microsoft .NET Programming Languages (Al-Omari, Keivanloo, Roy & Rilling, 2012)
- Language-independent sandboxing of just-in-time compilation and self-modifying code (Ansel & Marchenko, 2012)
- VMKit: a Substrate for Managed Runtime Environments (Geoffray, Thomas, Lawall, Muller & Folliot, 2010)
- MMC: the Mono Model Checker (Ruys & Aan de Brugh, 2007)
- Numeric performance in C, C# and Java (Sestoft, 2007)
- [Mono versus .Net: A Comparative Study of Performance for Distributed Processing. (Blajian, Eggen, Eggen & Pitts, 2006)]()
- Mono versus .Net: A Comparative Study of Performance for Distributed Processing. (Blajian, Eggen, Eggen & Pitts, 2006)
- Automated detection of performance regressions: the mono experience (Kalibera, Bulej & Tuma, 2005)
- Shared Source Common Language Infrastructure (SSCLI) - a.k.a ‘Rotor’
- Efficient virtual machine support of runtime structural reflection (Ortina, Redondoa & Perez-Schofield, 2009)
- Extending the SSCLI to Support Dynamic Inheritance (Redondo, Ortin & Perez-Schofield, 2008)
- Sampling profiler for Rotor as part of optimizing compilation system (Chilingarova & Safonov, 2006)
- To JIT or not to JIT: The effect of code-pitching on the performance of .NET framework (Anthony, Leung & Srisa-an, 2005)
- Adding structural reflection to the SSCLI (Ortin, Redondo, Vinuesa & Lovelle, 2005)
- Static Analysis for Identifying and Allocating Clusters of Immortal Objects (Ravindar & Srikant, 2005)
- An Optimizing Just-InTime Compiler for Rotor (Trindade & Silva, 2005)
- Software Interactions into the SSCLI platform (Charfi & Emsellem, 2004)
- Experience Integrating a New Compiler and a New Garbage Collector Into Rotor (Anderson, Eng, Glew, Lewis, Menon & Stichnoth, 2004)
.NET Runtime as a Case-Study
Abstract
In comparison with Haskell type classes and C ++ concepts, such object-oriented languages as C# and Java provide much limited mechanisms of generic programming based on F-bounded polymorphism. Main pitfalls of C# generics are considered in this paper. Extending C# language with concepts which can be simultaneously used with interfaces is proposed to solve the problems of generics; a design and translation of concepts are outlined.
Abstract
Compiling under the closed-world assumption (CWA) has been shown to be an appropriate way for implementing object-oriented languages such as Java on low-end embedded systems. In this paper, we explore the implications of using whole program optimizations such as Rapid Type Analysis (RTA) and coloring on programs targeting the .NET infrastructure. We extended RTA so that it takes into account .NET specific features such as (i) array covariance, a language feature also supported in Java, (ii) generics, whose specifications in .Net impacts type analysis and (iii) delegates, which encapsulate methods within objects. We also use an intraprocedural control flow analysis in addition to RTA . We eval-uated the optimizations that we implemented on programs written in C#. Preliminary results show a noticeable reduction of the code size, class hierarchy and polymorphism of the programs we optimize. Array covariance is safe in almost all cases, and some delegate calls can be implemented as direct calls.
Abstract
Type safety plays a crucial role in the security enforcement of any typed programming language. This thesis presents a formal proof of C#’s type safety. For this purpose, we develop an abstract
framework for C#, comprising formal specifications of the language’s grammar, of the statically correct programs, and of the static and operational semantics. Using this framework, we prove that C# is type-safe, by showing that the execution of statically correct C# programs does not lead to type errors.
Abstract
This work is part of a larger project which aims at establishing some important properties of C# and CLR by mathematical proofs. Examples are the correctness of the bytecode verifier of CLR, the type safety (along the lines of the first author’s correctness proof for the definite assignment rules) of C#, the correctness of a general compilation scheme.
Abstract
We provide a complete mathematical model for the exception handling mechanism of the Common Language Runtime (CLR), the virtual machine underlying the interpretation of .NET programs. The goal is to use this rigorous model in the corresponding part of the still-to-be-developed soundness proof for the CLR bytecode verifier.
Abstract
This paper provides a modular high-level design of the Common Language Runtime (CLR) architecture. Our design is given in terms of Abstract State Machines (ASMs) and takes the form of an interpreter. We describe the CLR as a hierarchy of eight submachines, which correspond to eight submodules into which the Common Intermediate Language (CIL) instruction set can be decomposed.
Abstract
Dynamic program slicing methods are very attractive for debugging because many statements can be ignored in the process of localizing a bug. Although language interoperability is a key concept in modern development platforms, current slicing techniques are still restricted to a single language. In this paper a cross-language dynamic program slicing technique is introduced for the .NET environment. The method is utilizing the CLR Debugging Services API, hence it can be applied to large multi-language applications.
Abstract
The Microsoft .NET Common Language Runtime (CLR) provides a low-level debugging application programmers interface (API), which can be used to implement traditional source code debuggers but can also be useful to implement other dynamic program introspection tools. This paper describes our experience in using this API for the implementation of a high-level debugger. The API is difficult to use from a technical point of view because it is implemented as a set of Component Object Model (COM) interfaces instead of a managed .NET API. Nevertheless, it is possible to implement a debugger in managed C# code using COM-interop. We describe our experience in taking this approach. We define a high-level debugging API and implement it in the C# language using COM-interop to access the low-level debugging API. Furthermore, we describe the integration of this high-level API in the multi-language development environment X-develop to enable source code debugging of .NET languages. This paper can be useful for anybody who wants to take the same approach to implement debuggers or other tools for dynamic program introspection.
Abstract
We propose a structured mathematical definition of the semantics of programs to provide a platform-independent interpreter view of the language for the programmer, which can also be used for a precise analysis of the ECMA standard of the language and as a reference model for teaching. The definition takes care to reflect directly and faithfully—as much as possible without becoming inconsistent or incomplete—the descriptions in the standard to become comparable with the corresponding models for Java in Stärk et al. (Java and Java Virtual Machine—Definition, Verification, Validation, Springer, Berlin, 2001) and to provide for implementors the possibility to check their basic design decisions against an accurate high-level model. The model sheds light on some of the dark corners of and on some critical differences between the ECMA standard and the implementations of the language.
Abstract
We present a high-level ASM model of C# threads and the .NET memory model. We focus on purely managed, fully portable threading features of C#. The sequential model interleaves the computation steps of the currently running threads and is suitable for uniprocessors. The parallel model addresses problems of true concurrency on multiprocessor systems. The models provide a sound basis for the development of multi-threaded applications in C#. The thread and memory models complete the abstract operational semantics of C# in.
Abstract
Virtual Machines provide a runtime execution platform combining bytecode portability with a performance close to native code. An overview of current approaches precedes an insight into Microsoft CLR (Common Language Runtime), comparing it to Sun JVM (Java Virtual Machine) and to a native execution environment (IA 32). A reference is also made to CLR in a Unix platform and to techniques on how CLR improves code execution.
Abstract
We present empirical evidence to demonstrate that there is little or no difference between the Java Virtual Machine and the .NET Common Language Runtime, as regards the compilation and execution of object-oriented programs. Then we give details of a case study that proves the superiority of the Common Language Runtime as a target for imperative programming language compilers (in particular GCC).
Abstract
Modern bytecode execution environments with optimizing just-in-time compilers, such as Sun’s Hotspot Java Virtual Machine, IBM’s Java Virtual Machine, and Microsoft’s Common Language Runtime, provide an infrastructure for generating fast code at runtime. Such runtime code generation can be used for efficient implementation of parametrized algorithms. More generally, with runtime code generation one can introduce an additional binding-time without performance loss. This permits improved performance and improved static correctness guarantees.
Microsoft Research
Abstract
Garbage collection greatly improves programmer productivity and ensures memory safety. Manual memory management on the other hand often delivers better performance but is typically unsafe and can lead to system crashes or security vulnerabilities. We propose integrating safe manual memory management with garbage collection in the .NET runtime to get the best of both worlds. In our design, programmers can choose between allocating objects in the garbage collected heap or the manual heap. All existing applications run unmodified, and without any performance degradation, using the garbage collected heap. Our programming model for manual memory management is flexible: although objects in the manual heap can have a single owning pointer, we allow deallocation at any program point and concurrent sharing of these objects amongst all the threads in the program. Experimental results from our .NET CoreCLR implementation on real-world applications show substantial performance gains especially in multithreaded scenarios: up to 3x savings in peak working sets and 2x improvements in runtime.
Abstract
Safe programming languages are readily available, but many applications continue to be written in unsafe languages, because the latter are more efficient. As a consequence, many applications continue to have exploitable memory safety bugs. Since garbage collection is a major source of inefficiency in the implementation of safe languages, replacing it with safe manual memory management would be an important step towards solving this problem.
Previous approaches to safe manual memory management use programming models based on regions, unique pointers, borrowing of references, and ownership types. We propose a much simpler programming model that does not require any of these concepts. Starting from the design of an imperative type safe language (like Java or C#), we just add a delete operator to free memory explicitly and an exception which is thrown if the program dereferences a pointer to freed memory. We propose an efficient implementation of this programming model that guarantees type safety. Experimental results from our implementation based on the C# native compiler show that this design achieves up to 3x reduction in peak working set and run time.
Abstract
A key challenge for concurrent programming is that side-effects (memory operations) in one thread can affect the behavior of another thread. In this paper, we present a type system to restrict the updates to memory to prevent these unintended side-effects. We provide a novel combination of immutable and unique (isolated) types that ensures safe parallelism (race freedom and deterministic execution). The type system includes support for polymorphism over type qualifiers, and can easily create cycles of immutable objects. Key to the system’s flexibility is the ability to recover immutable or externally unique references after violating uniqueness without any explicit alias tracking. Our type system models a prototype extension to C# that is in active use by a Microsoft team. We describe their experiences building large systems with this extension. We prove the soundness of the type system by an embedding into a program logic.
Abstract
Concurrent garbage collection is highly attractive for real-time systems, because offloading the collection effort from the executing threads allows faster response, allowing for extremely short deadlines at the microseconds level. Concurrent collectors also offer much better scalability over incremental collectors. The main problem with concurrent real-time collectors is their complexity. The first concurrent real-time garbage collector that can support fine synchronization, STOPLESS, has recently been presented by Pizlo et al. In this paper, we propose two additional (and different) algorithms for concurrent real-time garbage collection: CLOVER and CHICKEN. Both collectors obtain reduced complexity over the first collector STOPLESS, but need to trade a benefit for it. We study the algorithmic strengths and weaknesses of CLOVER and CHICKEN and compare them to STOPLESS. Finally, we have implemented all three collectors on the Bartok compiler and runtime for C# and we present measurements to compare their efficiency and responsiveness.
Abstract
This paper presents a case study of developing a hill climb-ing concurrency controller (HC 3) for the .NET ThreadPool. The intent of the case study is to provide insight into soft-ware considerations for controller design, testing, and imple-mentation. The case study is structured as a series of issues encountered and approaches taken to their resolution. Ex-amples of issues and approaches include: (a) addressing the need to combine a hill climbing control law with rule-based techniques by the use of hybrid control; (b) increasing the ef-ficiency and reducing the variability of the test environment by using resource emulation; and (c) effectively assessing design choices by using test scenarios for which the optimal concurrency level can be computed analytically and hence desired test results are known a priori. We believe that these issues and approaches have broad application to controllers for resource management of software systems.
Abstract
We present STOPLESS: a concurrent real-time garbage collector suitable for modern multiprocessors running parallel multithreaded applications. Creating a garbage-collected environment that sup- ports real-time on modern platforms is notoriously hard, especially if real-time implies lock-freedom. Known real-time collectors ei- ther restrict the real-time guarantees to uniprocessors only, rely on special hardware, or just give up supporting atomic operations (which are crucial for lock-free software). STOPLESS is the first collector that provides real-time responsiveness while preserving lock-freedom, supporting atomic operations, controlling fragmen- tation by compaction, and supporting modern parallel platforms. STOPLESS is adequate for modern languages such as C# or Java. It was implemented on top of the Bartok compiler and runtime for C# and measurements demonstrate high responsiveness (a factor of a 100 better than previously published systems), virtually no pause times, good mutator utilization, and acceptable overheads.
Abstract
The security of the .NET programming model is studied from the standpoint of fully abstract compilation of C#. A number of failures of full abstraction are identified, and fixes described. The most serious problems have recently been fixed for version 2.0 of the .NET Common Language Runtime.
Abstract
We describe problems that have arisen when combining the proposed design for generics for the Microsoft .NET Common Language Runtime (CLR) with two resource-related features supported by the Microsoft CLR implementation: application domains and pre-compilation. Application domains are “software based processes” and the interaction between application domains and generics stems from the fact that code and descriptors are generated on a pergeneric-instantiation basis, and thus instantiations consume resources which are preferably both shareable and recoverable. Pre-compilation runs at install-time to reduce startup overheads. This interacts with application domain unloading: compilation units may contain shareable generated instantiations. The paper describes these interactions and the different approaches that can be used to avoid or ameliorate the problems.
Abstract
We present a formalization of the implementation of generics in the .NET Common Language Runtime (CLR), focusing on two novel aspects of the implementation: mixed specialization and sharing, and efficient support for run-time types. Some crucial constructs used in the implementation are dictionaries and run-time type representations. We formalize these aspects type-theoretically in a way that corresponds in spirit to the implementation techniques used in practice. Both the techniques and the formalization also help us understand the range of possible implementation techniques for other languages, e.g., ML, especially when additional source language constructs such as run-time types are supported. A useful by-product of this study is a type system for a subset of the polymorphic IL proposed for the .NET CLR.
Abstract
We propose a method for implementing behavioral interface specifications on the .NET platform. Our interface specifications are expressed as executable model programs. Model programs can be run either as stand-alone simulations or used as contracts to check the conformance of an implementation class to its specification. We focus on the latter, which we call runtime verification.In our framework, model programs are expressed in the new specification language AsmL. We describe how AsmL can be used to describe contracts independently from any implementation language, how AsmL allows properties of component interaction to be specified using mandatory calls, and how AsmL is used to check the behavior of a component written in any of the .NET languages, such as VB, C#, or C++.
Abstract
The Microsoft .NET Common Language Runtime provides a shared type system, intermediate language and dynamic execution environment for the implementation and inter-operation of multiple source languages. In this paper we extend it with direct support for parametric polymorphism (also known as generics), describing the design through examples written in an extended version of the C# programming language, and explaining aspects of implementation by reference to a prototype extension to the runtime. Our design is very expressive, supporting parameterized types, polymorphic static, instance and virtual methods, “F-bounded” type parameters, instantiation at pointer and value types, polymorphic recursion, and exact run-time types. The implementation takes advantage of the dynamic nature of the runtime, performing justin-time type specialization, representation-based code sharing and novel techniques for efficient creation and use of run-time types. Early performance results are encouraging and suggest that programmers will not need to pay an overhead for using generics, achieving performance almost matching hand-specialized code.
Abstract
The Microsoft .NET Framework is a new computing architecture designed to support a variety of distributed applications and web-based services. .NET software components are typically distributed in an object-oriented intermediate language, Microsoft IL, executed by the Microsoft Common Language Runtime. To allow convenient multi-language working, IL supports a wide variety of high-level language constructs, including class-based objects, inheritance, garbage collection, and a security mechanism based on type safe execution. This paper precisely describes the type system for a substantial fragment of IL that includes several novel features: certain objects may be allocated either on the heap or on the stack; those on the stack may be boxed onto the heap, and those on the heap may be unboxed onto the stack; methods may receive arguments and return results via typed pointers, which can reference both the stack and the heap, including the interiors of objects on the heap. We present a formal semantics for the fragment. Our typing rules determine well-typed IL instruction sequences that can be assembled and executed. Of particular interest are rules to ensure no pointer into the stack outlives its target. Our main theorem asserts type safety, that well-typed programs in our IL fragment do not lead to untrapped execution errors. Our main theorem does not directly apply to the product. Still, the formal system of this paper is an abstraction of informal and executable specifications we wrote for the full product during its development. Our informal specification became the basis of the product team’s working specification of type-checking. The process of writing this specification, deploying the executable specification as a test oracle, and applying theorem proving techniques, helped us identify several security critical bugs during development.
Mono Runtime
Abstract
Unity is the most popular cross-platform development framework to develop games for multiple platforms such as Android, iOS, and Windows Mobile. While Unity developers can easily develop mobile apps for multiple platforms, adversaries can also easily build malicious apps based on the “write once, run anywhere” (WORA) feature. Even thoughmalicious apps were discovered among Android apps written with Unity framework (Unity apps), little research has been done on analysing the malicious apps. We propose static and dynamic reverse engineering techniques for malicious Unity apps. We first inspect the executable file format of a Unity app and present an effective static analysis technique of the Unity app. Then, we also propose a systematic technique to analyse dynamically the Unity app. Using the proposed techniques, the malware analyst can statically and dynamically analyse Java code, native code in C or C ++, and the Mono runtime layer where the C# code is running.
Abstract
Virtualized runtime environments like Java Virtual Machine (JVM) or Microsoft .NET’s Common Language Runtime (CLR) introduce additional challenges to real-time software development. Since applications for such environments are usually deployed in platform independent intermediate code, one issue is the timing of code transformation from intermediate code into native code. We have developed a solution for this problem, so that code transformation is suitable for real-time systems. It combines pre-compilation of intermediate code with the elimination of indirect references in native code. The gain of determinism comes with an increased application startup time. In this paper we present an optimization that utilizes an Ahead-of-Time compiler to reduce the startup time while keeping the real-time suitable timing behaviour. In an experiment we compare our approach with existing ones and demonstrate its benefits for certain application cases.
Abstract
The Microsoft .NET framework and its language family focus on multi-language development to support interoperability across several programming languages. The framework allows for the development of similar applications in different languages through the reuse of core libraries. As a result of such a multi-language development, the identification and trace ability of similar code fragments (clones) becomes a key challenge. In this paper, we present a clone detection approach for the .NET language family. The approach is based on the Common Intermediate Language, which is generated by the .NET compiler for the different languages within the .NET framework. In order to achieve an acceptable recall while maintaining the precision of our detection approach, we define a set of filtering processes to reduce noise in the raw data. We show that these filters are essential for Intermediate Language-based clone detection, without significantly affecting the precision of the detection approach. Finally, we study the quantitative and qualitative performance aspects of our clone detection approach. We evaluate the number of reported candidate clone-pairs, as well as the precision and recall (using manual validation) for several open source cross-language systems, to show the effectiveness of our proposed approach.
Abstract
When dealing with dynamic, untrusted content, such as on the Web, software behavior must be sandboxed, typically through use of a language like JavaScript. However, even for such specially-designed languages, it is difficult to ensure the safety of highly-optimized, dynamic language runtimes which, for efficiency, rely on advanced techniques such as Just-In-Time (JIT) compilation, large libraries of native-code support routines, and intricate mechanisms for multi-threading and garbage collection. Each new runtime provides a new potential attack surface and this security risk raises a barrier to the adoption of new languages for creating untrusted content. Removing this limitation, this paper introduces general mechanisms for safely and efficiently sandboxing software, such as dynamic language runtimes, that make use of advanced, low-level techniques like runtime code modification. Our language-independent sandboxing builds on Software-based Fault Isolation (SFI), a traditionally static technique. We provide a more flexible form of SFI by adding new constraints and mechanisms that allow safety to be guaranteed despite runtime code modifications. We have added our extensions to both the x86-32 and x86-64 variants of a production-quality, SFI-based sandboxing platform; on those two architectures SFI mechanisms face different challenges. We have also ported two representative language platforms to our extended sandbox: the Mono common language runtime and the V8 JavaScript engine. In detailed evaluations, we find that sandboxing slowdown varies between different benchmarks, languages, and hardware platforms. Overheads are generally moderate and they are close to zero for some important benchmark/platform combinations.
Abstract
Managed Runtime Environments (MREs), such as the JVM and the CLI, form an attractive environment for program execution, by providing portability and safety, via the use of a bytecode language and automatic memory management, as well as good performance, via just-in-time (JIT) compilation. Nevertheless, developing a fully featured MRE, including e.g. a garbage collector and JIT compiler, is a herculean task. As a result, new languages cannot easily take advantage of the benefits of MREs, and it is difficult to experiment with extensions of existing MRE based languages. This paper describes and evaluates VMKit, a first attempt to build a common substrate that eases the development of high-level MREs. We have successfully used VMKit to build two MREs: a Java Virtual Machine and a Common Language Runtime. We provide an extensive study of the lessons learned in developing this infrastructure, and assess the ease of implementing new MREs or MRE extensions and the resulting performance. In particular, it took one of the authors only one month to develop a Common Language Runtime using VMKit. VMKit furthermore has performance comparableto the well established open source MREs Cacao, Apache Harmony and Mono, and is 1.2 to 3 times slower than JikesRVM on most of the Dacapo benchmarks.
Abstract
The Mono Model Checker (mmc) is a software model checker for cil bytecode programs. mmc has been developed on the Mono platform. mmc is able to detect deadlocks and assertion violations in cil programs. The design of mmc is inspired by the Java PathFinder (jpf), a model checker for Java programs. The performance of mmc is comparable to jpf. This paper introduces mmc and presents its main architectural characteristics.
Abstract
We compare the numeric performance of C, C# and Java on three small cases.
Abstract
Microsoft has released .NET, a platform dependent standard for the C#,programming language. Sponsored by Ximian/Novell, Mono, the open source development platform based on the .NET framework, has been developed to be a platform independent version of the C#,programming environment. While .NET is platform dependent, Mono allows developers to build Linux and crossplatform applications. Mono’s .NET implementation is based on the ECMA standards for C#. This paper examines both of these programming environments with the goal of evaluating the performance characteristics of each. Testing is done with various algorithms. We also assess the trade-offs associated with using a cross-platform versus a platform.
Abstract
Engineering a large software project involves tracking the impact of development and maintenance changes on the software performance. An approach for tracking the impact is regression benchmarking, which involves automated benchmarking and evaluation of performance at regular intervals. Regression benchmarking must tackle the nondeterminism inherent to contemporary computer systems and execution environments and the impact of the nondeterminism on the results. On the example of a fully automated regression benchmarking environment for the mono open-source project, we show how the problems associated with nondeterminism can be tackled using statistical methods.
Shared Source Common Language Infrastructure (SSCLI) - a.k.a ‘Rotor’
Abstract
Increasing trends towards adaptive, distributed, generative and pervasive software have made object-oriented dynamically typed languages become increasingly popular. These languages offer dynamic software evolution by means of reflection, facilitating the development of dynamic systems. Unfortunately, this dynamism commonly imposes a runtime performance penalty. In this paper, we describe how to extend a production JIT-compiler virtual machine to support runtime object-oriented structural reflection offered by many dynamic languages. Our approach improves runtime performance of dynamic languages running on statically typed virtual machines. At the same time, existing statically typed languages are still supported by the virtual machine.
We have extended the .Net platform with runtime structural reflection adding prototype-based object-oriented semantics to the statically typed class-based model of .Net, supporting both kinds of programming languages. The assessment of runtime performance and memory consumption has revealed that a direct support of structural reflection in a production JIT-based virtual machine designed for statically typed languages provides a significant performance improvement for dynamically typed languages.
Abstract
This paper presents a step forward on a research trend focused on increasing runtime adaptability of commercial JIT-based virtual machines, describing how to include dynamic inheritance into this kind of platforms. A considerable amount of research aimed at improving runtime performance of virtual machines has converted them into the ideal support for developing different types of software products. Current virtual machines do not only provide benefits such as application interoperability, distribution and code portability, but they also offer a competitive runtime performance.
Since JIT compilation has played a very important role in improving runtime performance of virtual machines, we first extended a production JIT-based virtual machine to support efficient language-neutral structural reflective primitives of dynamically typed programming languages. This article presents the next step in our research work: supporting language-neutral dynamic inheritance for both statically and dynamically typed programming languages. Executing both kinds of programming languages over the same platform provides a direct interoperation between them.
Abstract
This paper describes a low-overhead self-tuning sampling-based runtime profiler integrated into SSCLI virtual machine. Our profiler estimates how “hot” a method is and builds a call context graph based on managed stack samples analysis. The frequency of sampling is tuned dynamically at runtime, based on the information of how often the same activation record appears on top of the stack. The call graph is presented as a novel Call Context Map (CC-Map) structure that combines compact representation and accurate information about the context. It enables fast extraction of data helpful in making compilation decisions, as well as fast placing data into the map. Sampling mechanism is integrated with intrinsic Rotor mechanisms of thread preemption and stack walk. A separate system thread is responsible for organizing data in the CC-Map. This thread gathers and stores samples quickly queued by managed threads, thus decreasing the time they must hold up their user-scheduled job
Abstract
The.NET Compact Framework is designed to be a highperformance virtual machine for mobile and embedded devices that operate on Windows CE (version 4.1 and later). It achieves fast execution time by compiling methods dynamically instead of using interpretation. Once compiled, these methods are stored in a portion of the heap called code-cache and can be reused quickly to satisfy future method calls. While code-cache provides a high-level of reusability, it can also use a large amount of memory. As a result, the Compact Framework provides a “code pitching ” mechanism that can be used to discard the previously compiled methods as needed. In this paper, we study the effect of code pitching on the overall performance and memory utilization of.NET applications. We conduct our experiments using Microsoft’s Shared-Source Common Language Infrastructure (SSCLI). We profile the access behavior of the compiled methods. We also experiment with various code-cache configurations to perform pitching. We find that programs can operate efficiently with a small code-cache without incurring substantial recompilation and execution overheads.
Abstract
Although dynamic languages are becoming widely used due to the flexibility needs of specific software prod- ucts, their major drawback is their runtime performance. Compiling the source program to an abstract machine’s intermediate language is the current technique used to obtain the best performance results. This intermediate code is then executed by a virtual machine developed as an interpreter. Although JIT adaptive optimizing com- pilation is currently used to speed up Java and .net intermediate code execution, this practice has not been em- ployed successfully in the implementation of dynamically adaptive platforms yet. We present an approach to improve the runtime performance of a specific set of structural reflective primitives, extensively used in adaptive software development. Looking for a better performance, as well as interaction with other languages, we have employed the Microsoft Shared Source CLI platform, making use of its JIT compiler. The SSCLI computational model has been enhanced with semantics of the prototype-based object-oriented com- putational model. This model is much more suitable for reflective environments. The initial assessment of per- formance results reveals that augmenting the semantics of the SSCLI model, together with JIT generation of native code, produces better runtime performance than the existing implementations.
Abstract
Long living objects lengthen the trace time which is a critical phase of the garbage collection process. However, it is possible to recognize object clusters i.e. groups of long living objects having approximately the same lifetime and treat them separately to reduce the load on the garbage collector and hence improve overall performance. Segregating objects this way leaves the heap for objects with shorter lifetimes and now a typical collection can nd more garbage than before. In this paper, we describe a compile time analysis strategy to identify object clusters in programs. The result of the compile time analysis is the set of allocation sites that contribute towards allocating objects belonging to such clusters. All such allocation sites are replaced by a new allocation method that allocates objects into the cluster area rather than the heap. This study was carried out for a concurrent collector which we developed for Rotor, Microsoft’s Shared Source Implementation of .NET. We analyze the performance of the program with combina- tions of the cluster and stack allocation optimizations. Our results show that the clustering optimization reduces the number of collections by 66.5% on average, even eliminating the need for collection in some programs. As a result, the total pause time reduces by 62.8% on average. Using both stack allocation and the cluster optimizations brings down the number of collections by 91.5% thereby improving the total pause time by 79.33%.
Abstract
The Shared Source CLI (SSCLI), also known as Rotor, is an implementation of the CLI released by Microsoft in source code. Rotor includes a single pass just-in-time compiler that generates non-optimized code for Intel IA-32 and IBM PowerPC processors. We extend Rotor with an optimizing justin-time compiler for IA-32. This compiler has three passes: control flow graph generation, data dependence graph generation and final code generation. Dominance relations in the control flow graph are used to detect natural loops. A number of optimizations are performed during the generation of the data dependence graph. During native code generation, the rich address modes of IA32 are used for instruction folding, reducing code size and usage of register names. Despite the overhead of three passes and optimizations, this compiler is only 1.4 to 1.9 times slower than the original SSCLI compiler and generates code that runs 6.4 to 10 times faster.
Abstract
By using an Interaction Specification Language (ISL), interactions between components can be expressed in a language independent way. At class level, interaction pattern specified in ISLrepresent model s of future interactions when applied on some component instances. The Interaction Server is in charge of managing the life cycle of interactions (interaction pattern registration and instantiation, destruction of interactions, merging). It acts as a central repository that keeps the global coherency of the adaptations realized on the component instances.The Interaction service allows creati ng interactions between heterogeneous components. Noah is an implementation of this Interaction Service. It can be thought as a dynamic aspect repository with a weaver that uses an aspect composition mechanism that insures commutable and associative adaptations. In this paper, we propose the implementation of the Interaction Service in the SSCLI. In contrast to other implementations such as Java where interaction management represents an additional layer, SSCLI enables us to integrate Interaction Management as in intrinsic part of the CLI runtime.
Abstract
Microsoft’s Rotor is a shared-source CLI implementation intended for use as a research platform. It is particularly attractive for research because of its complete implementation and extensive libraries, and because its modular design allows dierent implementations of certain components such as just-in-time compilers (JITs). Our group has independently developed our own high-performance JIT and garbage collector (GC) and wanted to take advantage of Rotor to experiment with these components in a CLI environment. In this paper, we describe our experience integrating these components into Rotor and evaluate the flexibility of Rotor’s design toward this goal. We found it easier to integrate our JIT than our GC because Rotor has a well-defined interface for the former but not the latter. However, our JIT integration still required significant changes to both Rotor and our JIT. For example, we modified Rotor to support multiple JITs. We also added support for a second JIT manager in Rotor, and implemented a new code manager compatible with our JIT. We had to change our JIT compiler to support Rotor’s calling conventions, helper functions, and exception model. Our GC integration was complicated by the many places in Rotor where components make assumptions about how its garbage collector is implemented, as well as Rotor’s lack of a well-defined GC interface. We also had to reconcile the dierent assumptions made by Rotor and our garbage collector about the layout of objects, virtual-method tables, and thread structures.
Fri, 25 Oct 2019, 12:00 am
"Stubs" in the .NET Runtime
As the saying goes:
“All problems in computer science can be solved by another level of indirection”
- David Wheeler
and it certainly seems like the ‘.NET Runtime’ Engineers took this advice to heart!
‘Stubs’, as they’re known in the runtime (sometimes ‘Thunks’), provide a level of indirection throughout the source code, there’s almost 500 mentions of them!
This post will explore what they are, how they work and why they’re needed.
Table of Contents
What are stubs?
In the context of the .NET Runtime, ‘stubs’ look something like this:
Call-site Callee
+--------------+ +---------+ +-------------+
| | | | | |
| +---------->+ Stub + - - - - ->+ |
| | | | | |
+--------------+ +---------+ +-------------+
So they sit between a method ‘call-site’ (i.e. code such as var result = Foo(..);
) and the ‘callee’ (where the method itself is implemented, the native/assembly code) and I like to think of them as doing tidy-up or fix-up work. Note that moving from the ‘stub’ to the ‘callee’ isn’t another full method call (hence the dotted line), it’s often just a single jmp
or call
assembly instruction, so the 2nd transition doesn’t involve all the same work that was initially done at the call-site (pushing/popping arguments into registers, increasing the stack space, etc).
The stubs themselves can be as simple as just a few assembly instructions or something more complicated, we’ll look at individual examples later on in this post.
Now, to be clear, not all method calls require a stub, if you’re doing a regular call to an static or instance method that just goes directly from the ‘call-site’ to the ‘callee’. But once you involve virtual methods, delegates or generics things get a bit more complicated.
Why are stubs needed?
There are several reasons that stubs need to be created by the runtime:
- Required Functionality
- For instance Delegates and Arrays must be provided but the runtime, their method bodies are not generated by the C#/F#/VB.NET compiler and neither do they exist in the Base-Class Libraries. This requirement is outlined in the ECMA 355 Spec, for instance ‘Partition I’ in section ‘8.9.1 Array types’ says:
Exact array types are created automatically by the VES when they are required. Hence, the operations on an array type are defined by the CTS. These generally are: allocating the array based on size and lower-bound information, indexing the array to read and write a value, computing the address of an element of the array (a managed pointer), and querying for the rank, bounds, and the total number of values stored in the array.
Likewise for delegates, which are covered in ‘I.8.9.3 Delegates’:
While, for the most part, delegates appear to be simply another kind of user-defined class, they are tightly controlled. The implementations of the methods are provided by the VES, not user code. The only additional members that can be defined on delegate types are static or instance methods.
- Performance
- Consistent method calls
- A final factor is that having ‘stubs’ makes the work of the JIT compiler easier. As we will see in the rest of the post, stubs deal with a variety of different types of method calls. This means the the JIT can generate more straightforward code for any given ‘call site’, because it (mostly) doesn’t care whats happening in the ‘callee’. If stubs didn’t exist, for a given method call the JIT would have to generate different code depending on whether generics where involved or not, if it was a virtual or non-virtual call, if it was going via a delegate, etc. Stubs abstact a lot of this behaviour away from the JIT, allowing it to deal with a more simple ‘Application Binary Interface’ (ABI).
CLR ‘Application Binary Interface’ (ABI)
Therefore, another way to think about ‘stubs’ is that they are part of what makes the CLR-specific ‘Application Binary Interface’ (ABI) work.
All code needs to work with the ABI or ‘calling convention’ of the CPU/OS that it’s running on, for instance by following the x86 calling convention, x64 calling convention or System V ABI. This applies across runtimes, for more on this see:
As an aside, if you want more information about ‘calling conventions’ here’s some links that I found useful:
However, on-top of what the CLR has to support due to the CPU/OS conventions, it also has it’s own extended ABI for .NET-specific use cases, including:
- “this” pointer:
The managed “this” pointer is treated like a new kind of argument not covered by the native ABI, so we chose to always pass it as the first argument in (AMD64) RCX
or (ARM, ARM64) R0
.
AMD64-only: Up to .NET Framework 4.5, the managed “this” pointer was treated just like the native “this” pointer (meaning it was the second argument when the call used a return buffer and was passed in RDX
instead of RCX
). Starting with .NET Framework 4.5, it is always the first argument.
- Generics or more specifically to handle ‘Shared generics’:
In cases where the code address does not uniquely identify a generic instantiation of a method, then a ‘generic instantiation parameter’ is required. Often the “this” pointer can serve dual-purpose as the instantiation parameter. When the “this” pointer is not the generic parameter, the generic parameter is passed as an additional argument..
- Hidden Parameters, covering ‘Stub dispatch’, ‘Fast Pinvoke’, ‘Calli Pinvoke’ and ‘Normal PInvoke’. For instance, here’s why ‘PInvoke’ has a hidden parameter:
Normal PInvoke - The VM shares IL stubs based on signatures, but wants the right method to show up in call stack and exceptions, so the MethodDesc for the exact PInvoke is passed in the (x86) EAX
/ (AMD64) R10
/ (ARM, ARM64) R12
(in the JIT: REG_SECRET_STUB_PARAM
). Then in the IL stub, when the JIT gets CORJIT_FLG_PUBLISH_SECRET_PARAM
, it must move the register into a compiler temp.
Not all of these scenarios need a stub, for instance the ‘this’ pointer is handled directly by the JIT, but many do as we’ll see in the rest of the post.
Stub Management
So we’ve seen why stubs are needed and what type of functionality they can provide. But before we look at all the specific examples that exist in the CoreCLR source, I just wanted to take some time to understand the common or shared concerns that apply to all stubs.
Stubs in the CLR are snippets of assembly code, but they have to be stored in memory and have their life-time managed. Also, they have to play nice with the debugger, from What Every CLR Developer Must Know Before Writing Code:
2.8 Is your code compatible with managed debugging?
- ..
- If you add a new stub (or way to call managed code), make sure that you can source-level step-in (F11) it under the debugger. The debugger is not psychic. A source-level step-in needs to be able to go from the source-line before a call to the source-line after the call, or managed code developers will be very confused. If you make that call transition be a giant 500 line stub, you must cooperate with the debugger for it to know how to step-through it. (This is what StubManagers are all about. See src\vm\stubmgr.h). Try doing a step-in through your new codepath under the debugger.
So every type of stub has a StubManager
which deals with the allocation, storage and lookup of the stubs. The lookup is significant, as it provides the mapping from an arbitrary memory address to the type of stub (if any) that created the code. As an example, here’s what the CheckIsStub_Internal(..)
method here and DoTraceStub(..)
method here look like for the DelegateInvokeStubManager
:
BOOL DelegateInvokeStubManager::CheckIsStub_Internal(PCODE stubStartAddress)
{
LIMITED_METHOD_DAC_CONTRACT;
bool fIsStub = false;
#ifndef DACCESS_COMPILE
#ifndef _TARGET_X86_
fIsStub = fIsStub || (stubStartAddress == GetEEFuncEntryPoint(SinglecastDelegateInvokeStub));
#endif
#endif // !DACCESS_COMPILE
fIsStub = fIsStub || GetRangeList()->IsInRange(stubStartAddress);
return fIsStub;
}
BOOL DelegateInvokeStubManager::DoTraceStub(PCODE stubStartAddress, TraceDestination *trace)
{
LIMITED_METHOD_CONTRACT;
LOG((LF_CORDB, LL_EVERYTHING, "DelegateInvokeStubManager::DoTraceStub called\n"));
_ASSERTE(CheckIsStub_Internal(stubStartAddress));
// If it's a MC delegate, then we want to set a BP & do a context-ful
// manager push, so that we can figure out if this call will be to a
// single multicast delegate or a multi multicast delegate
trace->InitForManagerPush(stubStartAddress, this);
LOG_TRACE_DESTINATION(trace, stubStartAddress, "DelegateInvokeStubManager::DoTraceStub");
return TRUE;
}
The code to initialise the various stub managers is here in SystemDomain::Attach()
and by working through the list we can get a sense of what each category of stub does (plus the informative comments in the code help!)
PrecodeStubManager
implemented here
- ‘Stub manager functions & globals’
DelegateInvokeStubManager
implemented here
- ‘Since we don’t generate delegate invoke stubs at runtime on IA64, we can’t use the StubLinkStubManager for these stubs. Instead, we create an additional DelegateInvokeStubManager instead.’
JumpStubStubManager
implemented here
- ‘Stub manager for jump stubs created by ExecutionManager::jumpStub() These are currently used only on the 64-bit targets IA64 and AMD64’
RangeSectionStubManager
implemented here
- ‘Stub manager for code sections. It forwards the query to the more appropriate stub manager, or handles the query itself.’
ILStubManager
implemented here
- ‘This is the stub manager for IL stubs’
InteropDispatchStubManager
implemented here
- ‘This is used to recognize GenericComPlusCallStub, VarargPInvokeStub, and GenericPInvokeCalliHelper.’
StubLinkStubManager
implemented here
ThunkHeapStubManager
implemented here
- ‘Note, the only reason we have this stub manager is so that we can recgonize UMEntryThunks for IsTransitionStub. ..’
TailCallStubManager
implemented here
- ‘This is the stub manager to help the managed debugger step into a tail call. It helps the debugger trace through JIT_TailCall().’ (from stubmgr.h)
ThePreStubManager
implemented here (in prestub.cpp)
- ‘The following code manages the PreStub. All method stubs initially use the prestub.’
VirtualCallStubManager
implemented here (in virtualcallstub.cpp)
Finally, we can also see the ‘StubManagers’ in action if we use the eeheap
SOS command to inspect the ‘heap dump’ of a .NET Process, as it helps report the size of the different ‘stub heaps’:
> !eeheap -loader
Loader Heap:
--------------------------------------
System Domain: 704fd058
LowFrequencyHeap: Size: 0x0(0)bytes.
HighFrequencyHeap: 002e2000(8000:1000) Size: 0x1000(4096)bytes.
StubHeap: 002ea000(2000:1000) Size: 0x1000(4096)bytes.
Virtual Call Stub Heap:
- IndcellHeap: Size: 0x0(0)bytes.
- LookupHeap: Size: 0x0(0)bytes.
- ResolveHeap: Size: 0x0(0)bytes.
- DispatchHeap: Size: 0x0(0)bytes.
- CacheEntryHeap: Size: 0x0(0)bytes.
Total size: 0x2000(8192)bytes
--------------------------------------
(output taken from .NET Generics and Code Bloat (or its lack thereof))
You can see that in this case the entire ‘stub heap’ is taking up 4096 bytes and in addition there are more in-depth statistics covering the heaps used by virtual call dispatch.
Types of stubs
The different stubs used by the runtime fall into 3 main categories:
Most stubs are wired up in MethodDesc::DoPrestub(..)
, in this section of code or this section for COM Interop. The stubs generated include the following (definitions taken from BOTR - ‘Kinds of MethodDescs’, also see enum MethodClassification
here):
- Instantiating in (
FEATURE_SHARE_GENERIC_CODE
, on by default) in MakeInstantiatingStubWorker(..)
here
- Used for less common IL methods that have generic instantiation or that do not have preallocated slot in method table.
- P/Invoke (a.k.a NDirect) in
GetStubForInteropMethod(..)
here
- P/Invoke methods. These are methods marked with DllImport attribute.
- FCall methods in
ECall::GetFCallImpl(..)
here
- Internal methods implemented in unmanaged code. These are methods marked with
MethodImplAttribute(MethodImplOptions.InternalCall)
attribute, delegate constructors and tlbimp constructors.
- Array methods in
GenerateArrayOpStub(..)
here
- Array methods whose implementation is provided by the runtime (Get, Set, Address)
- EEImpl in
PCODE COMDelegate::GetInvokeMethodStub(EEImplMethodDesc* pMD)
here
- Delegate methods, implementation provided by the runtime
- COM Interop (
FEATURE_COMINTEROP
, on by default) in GetStubForInteropMethod(..)
here
- COM interface methods. Since the non-generic interfaces can be used for COM interop by default, this kind is usually used for all interface methods.
- Unboxing in
Stub * MakeUnboxingStubWorker(MethodDesc *pMD)
here
Right, now lets look at the individual stub in more detail.
Precode
First up, we’ll take a look at ‘precode’ stubs, because they are used by all other types of stubs, as explained in the BotR page on Method Descriptors:
The precode is a small fragment of code used to implement temporary entry points and an efficient wrapper for stubs. Precode is a niche code-generator for these two cases, generating the most efficient code possible. In an ideal world, all native code dynamically generated by the runtime would be produced by the JIT. That’s not feasible in this case, given the specific requirements of these two scenarios. The basic precode on x86 may look like this:
mov eax,pMethodDesc // Load MethodDesc into scratch register
jmp target // Jump to a target
Efficient Stub wrappers: The implementation of certain methods (e.g. P/Invoke, delegate invocation, multi dimensional array setters and getters) is provided by the runtime, typically as hand-written assembly stubs. Precode provides a space-efficient wrapper over stubs, to multiplex them for multiple callers.
The worker code of the stub is wrapped by a precode fragment that can be mapped to the MethodDesc and that jumps to the worker code of the stub. The worker code of the stub can be shared between multiple methods this way. It is an important optimization used to implement P/Invoke marshalling stubs.
By providing a ‘pointer’ to the MethodDesc class, the precode allows any subsequent stub to have access to a lot of information about a method call and it’s containing Type
via the MethodTable (‘hot’) and EEClass (‘cold’) data structures. The MethodDesc data-structure is one of the most fundamental types in the runtime, hence why it has it’s own BotR page.
Each ‘precode’ is created in MethodDesc::GetOrCreatePrecode()
here and there are several different types as we can see in this enum
from /vm/precode.h:
enum PrecodeType {
PRECODE_INVALID = InvalidPrecode::Type,
PRECODE_STUB = StubPrecode::Type,
#ifdef HAS_NDIRECT_IMPORT_PRECODE
PRECODE_NDIRECT_IMPORT = NDirectImportPrecode::Type,
#endif // HAS_NDIRECT_IMPORT_PRECODE
#ifdef HAS_FIXUP_PRECODE
PRECODE_FIXUP = FixupPrecode::Type,
#endif // HAS_FIXUP_PRECODE
#ifdef HAS_THISPTR_RETBUF_PRECODE
PRECODE_THISPTR_RETBUF = ThisPtrRetBufPrecode::Type,
#endif // HAS_THISPTR_RETBUF_PRECODE
};
As always, the BotR page describes the different types in great detail, but in summary:
- StubPrecode - .. is the basic precode type. It loads MethodDesc into a scratch register and then jumps. It must be implemented for precodes to work. It is used as fallback when no other specialized precode type is available.
- FixupPrecode - .. is used when the final target does not require MethodDesc in scratch register. The FixupPrecode saves a few cycles by avoiding loading MethodDesc into the scratch register. The most common usage of FixupPrecode is for method fixups in NGen images.
- ThisPtrRetBufPrecode - .. is used to switch a return buffer and the this pointer for open instance delegates returning valuetypes. It is used to convert the calling convention of
MyValueType Bar(Foo x)
to the calling convention of MyValueType Foo::Bar()
.
- NDirectImportPrecode (a.k.a P/Invoke) - .. is used for lazy binding of unmanaged P/Invoke targets. This precode is for convenience and to reduce amount of platform specific plumbing.
Finally, to give you an idea of some real-world scenarios for ‘precode’ stubs, take a look at this comment from the DoesSlotCallPrestub(..)
method (AMD64):
// AMD64 has the following possible sequences for prestub logic:
// 1. slot -> temporary entrypoint -> prestub
// 2. slot -> precode -> prestub
// 3. slot -> precode -> jumprel64 (jump stub) -> prestub
// 4. slot -> precode -> jumprel64 (NGEN case) -> prestub
‘Just-in-time’ (JIT) and ‘Tiered’ Compilation
However, another piece of functionality that ‘precodes’ provide is related to ‘just-in-time’ (JIT) compilation, again from the BotR page:
Temporary entry points: Methods must provide entry points before they are jitted so that jitted code has an address to call them. These temporary entry points are provided by precode. They are a specific form of stub wrappers.
This technique is a lazy approach to jitting, which provides a performance optimization in both space and time. Otherwise, the transitive closure of a method would need to be jitted before it was executed. This would be a waste, since only the dependencies of taken code branches (e.g. if statement) require jitting.
Each temporary entry point is much smaller than a typical method body. They need to be small since there are a lot of them, even at the cost of performance. The temporary entry points are executed just once before the actual code for the method is generated.
So these ‘temporary entry points’ provide something concrete that can be referenced before a method has been JITted. They then trigger the JIT-compilation which does the job of generating the native code for a method. The entire process looks like this (dotted lines represent a pointer indirection, solid lines are a ‘control transfer’ e.g. a jmp/call assembly instruction):
Before JITing
Here we see the ‘temporary entry point’ pointing to the ‘fixup precode’, which ultimately calls into the PrestubWorker()
function here.
After JIting
Once the method has been JITted, we can see that the PrestubWorker
is now out of the picture and instead we have the native code for the function. In addition, there is now a ‘stable entry point’ that can be used by any other code that wants to execute the function. Also, we can see that the ‘fixup precode’ has been ‘backpatched’ to also point at the ‘native code’. For an idea of how this ‘back-patching’ works, see the StubPrecode ::SetTargetInterlocked(..)
method here (ARM64).
After JIting - Tiered Compilation
However, there is also another ‘after’ scenario, now that .NET Core has ‘Tiered Compilation’. Here we see that the ‘stable entry point’ still goes via the ‘fixup precode’, it doesn’t directly call into the ‘native code’. This is because ‘tiered compilation’ counts how many times a method is called and once it decides the method is ‘hot’, it re-compiles a more optimised version that will give better performance. This ‘call counting’ takes place in this code in MethodDesc::DoPrestub(..)
which calls into CodeVersionManager::PublishNonJumpStampVersionableCodeIfNecessary(..)
here and then if shouldCountCalls
is true, it ends up calling CallCounter::OnMethodCodeVersionCalledSubsequently(..)
here.
What’s been interesting to watch during the development of ‘tiered compilation’ is that (not surprisingly) there has been a significant amount of work to ensure that the extra level of indirection doesn’t make the entire process slower, for instance see Patch vtable slots and similar when tiering is enabled #21292.
Like all the other stubs, ‘precodes’ have different versions for different CPU architectures. As a reference, the list below contains links to all of them:
Precodes
(a.k.a ‘Precode Fixup Thunk’):
ThePreStub
:
PreStubWorker(..)
in /vm/prestub.cpp
MethodDesc::DoPrestub(..)
here
MethodDesc::DoBackpatch(..)
here
Finally, for even more information on the JITing process, see:
Stubs-as-IL
‘Stubs as IL’ actually describes several types of individual stubs, but what they all have in common is they’re generated from ‘Intermediate Language’ (IL) which is then compiled by the JIT, in exactly the same way it handles the code we write (after it’s first been compiled from C#/F#/VB.NET into IL by another compiler).
This makes sense, it’s far easier to write the IL once and then have the JIT worry about compiling it for different CPU architectures, rather than having to write raw assembly each time (for x86/x64/arm/etc). However all stubs were hand-written assembly in .NET Framework 1.0:
What you have described is how it actually works. The only difference is that the shuffle thunk is hand-emitted in assembly and not generated by the JIT for historic reasons. All stubs (including all interop stubs) were hand-emitted like this in .NET Framework 1.0. Starting with .NET Framework 2.0, we have been converting the stubs to be generated by the JIT (the runtime generates IL for the stub, and then the JIT compiles the IL as regular method). The shuffle thunk is one of the few remaining ones not converted yet. Also, we have the IL path on some platforms but not others - FEATURE_STUBS_AS_IL
is related to it.
In the CoreCLR source code, ‘stubs as IL’ are controlled by the feature flag FEATURE_STUBS_AS_IL
, with the following additional flags for each specific type:
StubsAsIL
ArrayStubAsIL
MulticastStubAsIL
On Windows
only some features are implemented with IL stubs, see this code, e.g. ‘ArrayStubAsIL’ is disabled on ‘x86’, but enabled elsewhere.
true
true
true
...
On Unix
they are all done in IL, regardless of CPU Arch, as this code shows:
...
true
true
true
Finally, here’s the complete list of stubs that can be implemented in IL from /vm/ilstubresolver.h:
enum ILStubType
{
Unassigned = 0,
CLRToNativeInteropStub,
CLRToCOMInteropStub,
CLRToWinRTInteropStub,
NativeToCLRInteropStub,
COMToCLRInteropStub,
WinRTToCLRInteropStub,
#ifdef FEATURE_ARRAYSTUB_AS_IL
ArrayOpStub,
#endif
#ifdef FEATURE_MULTICASTSTUB_AS_IL
MulticastDelegateStub,
#endif
#ifdef FEATURE_STUBS_AS_IL
SecureDelegateStub,
UnboxingILStub,
InstantiatingStub,
#endif
};
But the usage of IL stubs has grown over time and it seems that they are the preferred mechanism where possible as they’re easier to write and debug. See [x86/Linux] Enable FEATURE_ARRAYSTUB_AS_IL, Switch multicast delegate stub on Windows x64 to use stubs-as-il and Fix GenerateShuffleArray to support cyclic shuffles #26169 (comment) for more information.
P/Invoke, Reverse P/Invoke and ‘calli’
All these stubs have one thing in common, they allow a transition between ‘managed’ and ‘un-managed’ (or native) code. To make this safe and to preserve the guarantees that the .NET runtime provides, stubs are used every time the transition is made.
This entire process is outlined in great detail in the BotR page CLR ABI - PInvokes, from the ‘Per-call-site PInvoke work’ section:
- For direct calls, the JITed code sets
InlinedCallFrame->m_pDatum
to the MethodDesc of the call target.
- For JIT64, indirect calls within IL stubs sets it to the secret parameter (this seems redundant, but it might have changed since the per-frame initialization?).
- For JIT32 (ARM) indirect calls, it sets this member to the size of the pushed arguments, according to the comments. The implementation however always passed 0.
- For JIT64/AMD64 only: Next for non-IL stubs, the InlinedCallFrame is ‘pushed’ by setting
Thread->m_pFrame
to point to the InlinedCallFrame (recall that the per-frame initialization already set InlinedCallFrame->m_pNext
to point to the previous top). For IL stubs this step is accomplished in the per-frame initialization.
- The Frame is made active by setting
InlinedCallFrame->m_pCallerReturnAddress
.
- The code then toggles the GC mode by setting
Thread->m_fPreemptiveGCDisabled = 0
.
- Starting now, no GC pointers may be live in registers. RyuJit LSRA meets this requirement by adding special refPositon
RefTypeKillGCRefs
before unmanaged calls and special helpers.
- Then comes the actual call/PInvoke.
- The GC mode is set back by setting
Thread->m_fPreemptiveGCDisabled = 1
.
- Then we check to see if
g_TrapReturningThreads
is set (non-zero). If it is, we call CORINFO_HELP_STOP_FOR_GC
.
- For ARM, this helper call preserves the return register(s):
R0
, R1
, S0
, and D0
.
- For AMD64, the generated code must manually preserve the return value of the PInvoke by moving it to a non-volatile register or a stack location.
- Starting now, GC pointers may once again be live in registers.
- Clear the
InlinedCallFrame->m_pCallerReturnAddress
back to 0.
- For JIT64/AMD64 only: For non-IL stubs ‘pop’ the Frame chain by resetting
Thread->m_pFrame
back to InlinedCallFrame.m_pNext
.
Saving/restoring all the non-volatile registers helps by preventing any registers that are unused in the current frame from accidentally having a live GC pointer value from a parent frame. The argument and return registers are ‘safe’ because they cannot be GC refs. Any refs should have been pinned elsewhere and instead passed as native pointers.
For IL stubs, the Frame chain isn’t popped at the call site, so instead it must be popped right before the epilog and right before any jmp calls. It looks like we do not support tail calls from PInvoke IL stubs?
As you can see, quite a bit of the work is to keep the Garbage Collector (GC) happy. This makes sense because once execution moves into un-managed/native code the .NET runtime has no control over what’s happening, so it needs to ensure that the GC doesn’t clean up or move around objects that are being used in the native code. It achives this by constraining what the GC can do (on the current thread) from the time execution moves into un-managed code and keeps that in place until it returns back to the mamanged side.
On top of that, there needs to be support for allowing ‘stack walking’ or ‘unwinding, to allowing debugging and produce meaningful stack traces. This is done by setting up frames that are put in place when control transitions from managed -> un-managed, before being removed (‘popped’) when transitioning back. Here’s a list of the different scenarios that are covered, from /vm/frames.h:
This is the list of Interop stubs & transition helpers with information
regarding what (if any) Frame they used and where they were set up:
P/Invoke:
JIT inlined: The code to call the method is inlined into the caller by the JIT.
InlinedCallFrame is erected by the JITted code.
Requires marshaling: The stub does not erect any frames explicitly but contains
an unmanaged CALLI which turns it into the JIT inlined case.
Delegate over a native function pointer:
The same as P/Invoke but the raw JIT inlined case is not present (the call always
goes through an IL stub).
Calli:
The same as P/Invoke.
PInvokeCalliFrame is erected in stub generated by GenerateGetStubForPInvokeCalli
before calling to GetILStubForCalli which generates the IL stub. This happens only
the first time a call via the corresponding VASigCookie is made.
ClrToCom:
Late-bound or eventing: The stub is generated by GenerateGenericComplusWorker
(x86) or exists statically as GenericComPlusCallStub[RetBuffArg] (64-bit),
and it erects a ComPlusMethodFrame frame.
Early-bound: The stub does not erect any frames explicitly but contains an
unmanaged CALLI which turns it into the JIT inlined case.
ComToClr:
Normal stub:
Interpreted: The stub is generated by ComCall::CreateGenericComCallStub
(in ComToClrCall.cpp) and it erects a ComMethodFrame frame.
Prestub:
The prestub is ComCallPreStub (in ComCallableWrapper.cpp) and it erects a ComPrestubMethodFrame frame.
Reverse P/Invoke (used for C++ exports & fixups as well as delegates
obtained from function pointers):
Normal stub:
x86: The stub is generated by UMEntryThunk::CompileUMThunkWorker
(in DllImportCallback.cpp) and it is frameless. It calls directly
the managed target or to IL stub if marshaling is required.
non-x86: The stub exists statically as UMThunkStub and calls to IL stub.
Prestub:
The prestub is generated by GenerateUMThunkPrestub (x86) or exists statically
as TheUMEntryPrestub (64-bit), and it erects an UMThkCallFrame frame.
Reverse P/Invoke AppDomain selector stub:
The asm helper is IJWNOADThunkJumpTarget (in asmhelpers.asm) and it is frameless.
The P/Invoke IL stubs are wired up in the MethodDesc::DoPrestub(..)
method (note that P/Invoke is also known as ‘NDirect’), in addition they are also created here when being used for ‘COM Interop’. That code then calls into GetStubForInteropMethod(..)
in /vm/dllimport.cpp, before branching off to handle each case:
- P/Invoke calls into
NDirect::GetStubForILStub(..)
here
- Reverse P/Invoke calls into another overload of
NDirect::GetStubForILStub(..)
here
- COM Interop goes to
ComPlusCall::GetStubForILStub(..)
here in /vm/clrtocomcall.cpp
- EE implemented methods end up in
COMDelegate::GetStubForILStub(..)
here (for more info on EEImpl
methods see ‘Kinds of MethodDescs’)
There are also hand-written assembly stubs for the differents scenarios, such as JIT_PInvokeBegin
, JIT_PInvokeEnd
and VarargPInvokeStub
, these can be seen in the files below:
As an example, calli
method calls (see OpCodes.Calli) end up in GenericPInvokeCalliHelper
, which has a nice bit of ASCII art in the i386 version:
// stack layout at this point:
//
// | ... |
// | stack arguments | ESP + 16
// +----------------------+
// | VASigCookie* | ESP + 12
// +----------------------+
// | return address | ESP + 8
// +----------------------+
// | CALLI target address | ESP + 4
// +----------------------+
// | stub entry point | ESP + 0
// ------------------------
However, all these stubs can have an adverse impact on start-up time, see Large numbers of Pinvoke stubs created on startup for example. This impact has been mitigated by compiling the stubs ‘Ahead-of-Time’ (AOT) and storing them in the ‘Ready-to-Run’ images (replacement format for NGEN (Native Image Generator)). From R2R ilstubs:
IL stub generation for interop takes measurable time at startup, and it is possible to generate some of them in an ahead of time
This change introduces ahead of time R2R compilation of IL stubs
Related work was done in Enable R2R compilation/inlining of PInvoke stubs where no marshalling is required and PInvoke stubs for Unix platforms (‘Enables inlining of PInvoke stubs for Unix platforms’).
Finally, for even more information on the issues involved, see:
Marshalling
However, dealing with the ‘managed’ to ‘un-managed’ transition is only one part of the story. The other is that there are also stubs created to deal with the ‘marshalling’ of arguments between the 2 sides. This process of ‘Interop Marshalling’ is explained nicely in the Microsoft docs:
Interop marshaling governs how data is passed in method arguments and return values between managed and unmanaged memory during calls. Interop marshaling is a run-time activity performed by the common language runtime’s marshaling service.
Most data types have common representations in both managed and unmanaged memory. The interop marshaler handles these types for you. Other types can be ambiguous or not represented at all in managed memory.
Like many stubs in the CLR, the marshalling stubs have evolved over time. As we can read in the excellent post Improvements to Interop Marshaling in V4: IL Stubs Everywhere:
History
The 1.0 and 1.1 versions of the CLR had several different techniques for creating and executing these stubs that were each designed for marshaling different types of signatures. These techniques ranged from directly generated x86 assembly instructions for simple signatures to generating specialized ML (an internal marshaling language) and running them through an internal interpreter for the most complicated signatures. This system worked well enough – although not without difficulties – in 1.0 and 1.1 but presented us with a serious maintenance problem when 2.0, and its support for multiple processor architectures, came around.
That’s right, there was an internal interpreter built into early version of the .NET CLR that had the job of running the ‘marshalling language’ (ML) code!
However, it then goes on to explain why this process wasn’t sustainable:
We realized early in the process of adding 64 bit support to 2.0 that this approach was not sustainable across multiple architectures. Had we continued with the same strategy we would have had to create parallel marshaling infrastructures for each new architecture we supported (remember in 2.0 we introduced support for both x64 and IA64) which would, in addition to the initial cost, at least triple the cost of every new marshaling feature or bug fix. We needed one marshaling stub technology that would work on multiple processor architectures and could be efficiently executed on each one: enter IL stubs.
The solution was to implement all stubs using ‘Intermediate Language’ (IL) that is CPU-agnostic. Then the JIT-compiler is used to convert the IL into machine code for each CPU architecture, which makes sense because it’s exactly what the JIT is good at. Also worth noting is that this work still continues today, for instance see Implement struct marshalling via IL Stubs instead of via FieldMarshalers #26340.
Finally, there is a really nice investigation into the whole process in PInvoke: beyond the magic (also Compile time marshalling). What’s also nice is that you can use PerfView to see the stubs that the runtime generates.
Generics
It is reasonably well known that generics in .NET use ‘code sharing’ to save space. That is, given a generic method such as public void Insert(..)
, one method body of ‘native code’ will be created and shared by the instantiated types of Insert(..)
and Insert(..)
(assumning that Foo
and Bar
are references types), but different versions will be created for Insert(..)
and Insert(..)
(as int
/double
are value types). This is possible, for the reasons outlined by Jon Skeet in a StackOverflow question:
.. consider what the CLR needs to know about a type. It includes:
- The size of a value of that type (i.e. if you have a variable of some type, how much space will that memory need?)
- How to treat the value in terms of garbage collection: is it a reference to an object, or a value which may in turn contain other references?
For all reference types, the answers to these questions are the same. The size is just the size of a pointer, and the value is always just a reference (so if the variable is considered a root, the GC needs to recursively descend into it).
For value types, the answers can vary significantly.
But, this poses a problem. What about if the ‘shared’ method needs to do something specific for each type, like call typeof(T)
?
This whole issue is explained in these 2 great posts, which I really recommend you take the time to read:
I’m not going to repeat what they cover here, except to say that (not surprisingly) ‘stubs’ are used to solve this issue, in conjunction with a ‘hidden’ parameter. These stubs are known as ‘instantiating’ stubs and we can find out more about them in this comment:
Instantiating Stubs - Return TRUE if this is this a special stub used to implement an instantiated generic method or per-instantiation static method. The action of an instantiating stub is - pass on a MethodTable
or InstantiatedMethodDesc
extra argument to shared code
The different scenarios are handled in MakeInstantiatingStubWorker(..)
in /vm/prestub.cpp, you can see the check for HasMethodInstantiation
and the fall-back to a ‘per-instantiation static method’:
// It's an instantiated generic method
// Fetch the shared code associated with this instantiation
pSharedMD = pMD->GetWrappedMethodDesc();
_ASSERTE(pSharedMD != NULL && pSharedMD != pMD);
if (pMD->HasMethodInstantiation())
{
extraArg = pMD;
}
else
{
// It's a per-instantiation static method
extraArg = pMD->GetMethodTable();
}
Stub *pstub = NULL;
#ifdef FEATURE_STUBS_AS_IL
pstub = CreateInstantiatingILStub(pSharedMD, extraArg);
#else
CPUSTUBLINKER sl;
_ASSERTE(pSharedMD != NULL && pSharedMD != pMD);
sl.EmitInstantiatingMethodStub(pSharedMD, extraArg);
pstub = sl.Link(pMD->GetLoaderAllocator()->GetStubHeap());
#endif
As a reminder, FEATURE_STUBS_AS_IL
is defined for all Unix versions of the CoreCLR, but on Windows it’s only used with ARM64.
- When
FEATURE_STUBS_AS_IL
is defined, the code calls into CreateInstantiatingILStub(..)
here. To get an overview of what it’s doing, we can take a look at the steps called-out in the code comments:
// 1. Build the new signature
here
// 2. Emit the method body
here
// 2.2 Push the rest of the arguments for x86
here
// 2.3 Push the hidden context param
here
// 2.4 Push the rest of the arguments for not x86
here
// 2.5 Push the target address
here
// 2.6 Do the calli
here
- When
FEATURE_STUBS_AS_IL
is note defined, per CPU/OS versions of EmitInstantiatingMethodStub(..)
are used, they exist for:
In the last case, (EmitInstantiatingMethodStub(..)
on ARM), the stub shares code with the instantiating version of the unboxing stub, so the heavy-lifting is done in StubLinkerCPU::ThumbEmitCallWithGenericInstantiationParameter(..)
here. This method is over 400 lines for fairly complex code, althrough there is also a nice piece of ASCII art (for info on why this ‘complex’ case is needed see this comment):
// Complex case where we need to emit a new stack frame and copy the arguments.
// Calculate the size of the new stack frame:
//
// +------------+
// SP -> | | srcofs & ShuffleEntry::OFSMASK));
}
else if (pEntry->dstofs & ShuffleEntry::REGMASK)
{
// source must be on the stack
_ASSERTE(!(pEntry->srcofs & ShuffleEntry::REGMASK));
EmitLoadStoreRegImm(eLOAD, IntReg(pEntry->dstofs & ShuffleEntry::OFSMASK), RegSp, pEntry->srcofs * sizeof(void*));
}
else
{
// source must be on the stack
_ASSERTE(!(pEntry->srcofs & ShuffleEntry::REGMASK));
// dest must be on the stack
_ASSERTE(!(pEntry->dstofs & ShuffleEntry::REGMASK));
EmitLoadStoreRegImm(eLOAD, IntReg(9), RegSp, pEntry->srcofs * sizeof(void*));
EmitLoadStoreRegImm(eSTORE, IntReg(9), RegSp, pEntry->dstofs * sizeof(void*));
}
}
// Tailcall to target
// br x16
EmitJumpRegister(IntReg(16));
}
Unboxing
I’ve written about this type of ‘stub’ before in A look at the internals of ‘boxing’ in the CLR, but in summary the unboxing stub needs to handle steps 2) and 3) from the diagram below:
1. MyStruct: [0x05 0x00 0x00 0x00]
| Object Header | MethodTable | MyStruct |
2. MyStruct (Boxed): [0x40 0x5b 0x6f 0x6f 0xfe 0x7 0x0 0x0 0x5 0x0 0x0 0x0]
^
object 'this' pointer |
| Object Header | MethodTable | MyStruct |
3. MyStruct (Boxed): [0x40 0x5b 0x6f 0x6f 0xfe 0x7 0x0 0x0 0x5 0x0 0x0 0x0]
^
adjusted 'this' pointer |
Key to the diagram
- Original
struct
, on the stack
- The
struct
being boxed into an object
that lives on the heap
- Adjustment made to this pointer so
MyStruct::ToString()
will work
These stubs make is possible for ‘value types’ (structs) to override methods from System.Object
, such as ToString()
and GetHashCode()
. The fix-up is needed because structs don’t have an ‘object header’, but when they’re boxed into an Object
they do. So the stub has the job of moving or adjusting the ‘this’ pointer so that the code in the ToString()
method can work the same, regardless of whether it’s operating on a regular ‘struct’ or one that’s been boxed into an ‘object.
The unboxing stubs are created in MethodDesc::DoPrestub(..)
here, which in turn calls into MakeUnboxingStubWorker(..)
here
- when
FEATURE_STUBS_AS_IL
is disabled it then calls EmitUnboxMethodStub(..)
to create the stub, there are per-CPU versions:
- when
FEATURE_STUBS_AS_IL
is enabled is instead calls into CreateUnboxingILStubForSharedGenericValueTypeMethods(..)
here
For more information on some of the internal details of unboxing stubs and how they interact with ‘generic instantiations’ see this informative comment and one in the code for MethodDesc::FindOrCreateAssociatedMethodDesc(..)
here.
Arrays
As discussed at the beginning, the method bodies for arrays is provided by the runtime, that is the array access methods, ‘get’ and ‘set’, that allow var a = myArray[5]
and myArray[7] = 5
to work. Not surprisingly, these are done as stubs to allow them to be as small and efficient as possible.
Here is the flow for wiring up ‘array stubs’. It all starts up in MethodDesc::DoPrestub(..)
here:
- If
FEATURE_ARRAYSTUB_AS_IL
is defined (see ‘Stubs-as-IL’), it happens in GenerateArrayOpStub(ArrayMethodDesc* pMD)
here
- Then
ArrayOpLinker::EmitStub()
here, which is responsible for generating 3 types of stubs { ILSTUB_ARRAYOP_GET, ILSTUB_ARRAYOP_SET, ILSTUB_ARRAYOP_ADDRESS }
.
- Before calling
ILStubCache::CreateAndLinkNewILStubMethodDesc(..)
here
- Finally ending up in
JitILStub(..)
here
- When
FEATURE_ARRAYSTUB_AS_IL
isn’t defined, happens in another version of GenerateArrayOpStub(ArrayMethodDesc* pMD)
lower down
- Then
void GenerateArrayOpScript(..)
here
- Followed by a call to
StubCacheBase::Canonicalize(..)
here, that ends up in ArrayStubCache::CompileStub(..)
here.
- Eventually, we end up in
StubLinkerCPU::EmitArrayOpStub(..)
here, which does the heavy lifting (despite being under ‘\src\vm\i386' seems to support x86 and AMD64?)
I’m not going to include the code for the ‘stub-as-IL’ (ArrayOpLinker::EmitStub()
) or the assembly code (StubLinkerCPU::EmitArrayOpStub(..)
) versions of the array stubs because they’re both 100’s of lines long, dealing with type and bounds checking, computing address, multi-dimensional arrays and mode. But to give an idea of the complexities, take a look at this comment from StubLinkerCPU::EmitArrayOpStub(..)
here:
// Register usage
//
// x86 AMD64
// Inputs:
// managed array THIS_kREG (ecx) THIS_kREG (rcx)
// index 0 edx rdx
// index 1/value r8
// index 2/value r9
// expected element type for LOADADDR eax rax rdx
// Working registers:
// total (accumulates unscaled offset) edi r10
// factor (accumulates the slice factor) esi r11
Finally, these stubs are still being improved, for example see Use unsigned index extension in muldi-dimensional array stubs.
Tail Calls
The .NET runtime provides a nice optimisation when doing ‘tail calls’, that (amoung other things) will prevent StackoverflowExceptions
in recursive scenarios. For more on why these tail call optimisations are useful and how they work, take a look at:
In summary, a tail call optimisation allows the same stack frame to be re-used if in the caller, there is no work done after the function call to the callee (see Tail call JIT conditions (2007) for a more precise definition).
And why is this beneficial? From Tail Call Improvements in .NET Framework 4:
The primary reason for a tail call as an optimization is to improve data locality, memory usage, and cache usage. By doing a tail call the callee will use the same stack space as the caller. This reduces memory pressure. It marginally improves the cache because the same memory is reused for subsequent callers and thus can stay in the cache, rather than evicting some older cache line to make room for a new cache line.
To make this clear, the code below can benefit from the optimisation, because both functions return straight after calling each the other:
public static long Ping(int cnt, long val)
{
if (cnt-- == 0)
return val;
return Pong(cnt, val + cnt);
}
public static long Pong(int cnt, long val)
{
if (cnt-- == 0)
return val;
return Ping(cnt, val + cnt);
}
However, if the code was changed to the version below, the optimisation would no longer work because PingNotOptimised(..)
does some extra work between calling Pong(..)
and when it returns:
public static long PingNotOptimised(int cnt, long val)
{
if (cnt-- == 0)
return val;
var result = Pong(cnt, val + cnt);
result += 1; // prevents the Tail-call optimization
return result;
}
public static long Pong(int cnt, long val)
{
if (cnt-- == 0)
return val;
return PingNotOptimised(cnt, val + cnt);
}
You can see the difference in the code emitted by the JIT compiler for the different scenarios in SharpLab.
But where do the ‘tail call optimisation stubs’ come into play? Helpfully there is a tail call related design doc that explains, from ‘current way of handling tail-calls’:
Fast tail calls
These are tail calls that are handled directly by the jitter and no runtime cooperation is needed. They are limited to cases where:
- Return value and call target arguments are all either primitive types, reference types, or valuetypes with a single primitive type or reference type fields
- The aligned size of call target arguments is less or equal to aligned size of caller arguments
So, the stubs aren’t always needed, sometimes the work can be done by the JIT, if there scenario is simple enough.
However for the more complex cases, a ‘helper’ stub is needed:
Tail calls using a helper
Tail calls in cases where we cannot perform the call in a simple way are implemented using a tail call helper. Here is a rough description of how it works:
- For each tail call target, the jitter asks runtime to generate an assembler argument copying routine. This routine reads vararg list of arguments and places the arguments in their proper slots in the CONTEXT or on the stack. Together with the argument copying routine, the runtime also builds a list of offsets of references and byrefs for return value of reference type or structs returned in a hidden return buffer and for structs passed by ref. The gc layout data block is stored at the end of the argument copying thunk.
- At the time of the tail call, the caller generates a vararg list of all arguments of the tail called function and then calls
JIT_TailCall
runtime function. It passes it the copying routine address, the target address and the vararg list of the arguments.
- The
JIT_TailCall
then performs the following:
…
To see the rest of the steps that JIT_TailCall
takes you can read the design doc or if you’re really keen you can look at the code in /vm/jithelpers.cpp. Also, there’s a useful explanation of what it needs to handle in the JIT code, see here and here.
However, we’re just going to focus on the stubs, refered to as an ‘assembler argument copying routine’. Firstly, we can see that they have their own stub manager, TailCallStubManager
, which is implemented here and allows the stubs to play nicely with the debugger. Also interesting to look at is the TailCallFrame
here that is used to ensure that the ‘stack walker’ can work well with tail calls.
Now, onto the stubs themselves, the ‘copying routines’ are provided by the runtime via a call to CEEInfo::getTailCallCopyArgsThunk(..)
in /vm/jitinterface.cpp. This in turn calls the CPU specific versions of CPUSTUBLINKER::CreateTailCallCopyArgsThunk(..)
:
These routines have the complex and hairy job of dealing with the CPU registers and calling conventions. They achieve this by dynamicially emitting assembly instructions, to create a function that looks like the following pseudo code (X86 version):
// size_t CopyArguments(va_list args, (RCX)
// CONTEXT *pCtx, (RDX)
// DWORD64 *pvStack, (R8)
// size_t cbStack) (R9)
// {
// if (pCtx != NULL) {
// foreach (arg in args) {
// copy into pCtx or pvStack
// }
// }
// return ;
// }
In addition there is one other type of stub that is used. Known as the TailCallHelperStub
, they also come in per-CPU versions:
Going forward, there are several limitations of to this approach of using per-CPU stubs, as the design doc explains:
- It is expensive to port to new platforms
- Parsing the vararg list is not possible to do in a portable way on Unix. Unlike on Windows, the list is not stored a linear sequence of the parameter data bytes in memory. va_list on Unix is an opaque data type, some of the parameters can be in registers and some in the memory.
- Generating the copying asm routine needs to be done for each target architecture / platform differently. And it is also very complex, error prone and impossible to do on platforms where code generation at runtime is not allowed.
- It is slower than it has to be
- The parameters are copied possibly twice - once from the vararg list to the stack and then one more time if there was not enough space in the caller’s stack frame.
RtlRestoreContext
restores all registers from the CONTEXT
structure, not just a subset of them that is really necessary for the functionality, so it results in another unnecessary memory accesses.
- Stack walking over the stack frames of the tail calls requires runtime assistance.
Fortunately, it then goes into great depth discussing how a new approach could be implemented and how it would solve these issues. Even better, work has already started and we can follow along in Implement portable tailcall helpers #26418 (currently sitting at ‘31 of 55’ tasks completed, with over 50 files modified, it’s not a small job!).
Finally, for other PRs related to tail calls, see:
Virtual Stub Dispatch (VSD)
I’ve saved the best for last, ‘Virtual Stub Dispatch’ or VSD is such an in-depth topic, that it an entire BotR page devoted to it!! From the introduction:
Virtual stub dispatching (VSD) is the technique of using stubs for virtual method invocations instead of the traditional virtual method table. In the past, interface dispatch required that interfaces had process-unique identifiers, and that every loaded interface was added to a global interface virtual table map. This requirement meant that all interfaces and all classes that implemented interfaces had to be restored at runtime in NGEN scenarios, causing significant startup working set increases. The motivation for stub dispatching was to eliminate much of the related working set, as well as distribute the remaining work throughout the lifetime of the process.
It then goes on to say:
Although it is possible for VSD to dispatch both virtual instance and interface method calls, it is currently used only for interface dispatch.
So despite having the work ‘virtual’ in the title, it’s not actually used for C# methods with the virtual
modifier on them. However, if you look at the IL for interface methods you can see why they are also known as ‘virtual’.
Virtual Stub Dispatch is so complex, it actually has several different stub types, from /vm/virtualcallstub.h:
enum StubKind {
SK_UNKNOWN,
SK_LOOKUP, // Lookup Stubs are SLOW stubs that simply call into the runtime to do all work.
SK_DISPATCH, // Dispatch Stubs have a fast check for one type otherwise jumps to runtime. Works for monomorphic sites
SK_RESOLVE, // Resolve Stubs do a hash lookup before fallling back to the runtime. Works for polymorphic sites.
SK_VTABLECALL, // Stub that jumps to a target method using vtable-based indirections. Works for non-interface calls.
SK_BREAKPOINT
};
So there are the following types (these are links to the AMD64
versions, x86
versions are in /vm/i386/virtualcallstubcpu.hpp):
- Lookup Stubs:
// Virtual and interface call sites are initially setup to point at LookupStubs. This is because the runtime type of the pointer is not yet known, so the target cannot be resolved.
- Dispatch Stubs:
// Monomorphic and mostly monomorphic call sites eventually point to DispatchStubs. A dispatch stub has an expected type (expectedMT), target address (target) and fail address (failure). If the calling frame does in fact have the type be of the expected type, then control is transfered to the target address, the method implementation. If not, then control is transfered to the fail address, a fail stub (see below) where a polymorphic lookup is done to find the correct address to go to.
- There’s also specific versions, DispatchStubShort and DispatchStubLong, see this comment for why they are both needed.
- Resolve Stubs:
// Polymorphic call sites and monomorphic calls that fail end up in a ResolverStub. There is only one resolver stub built for any given token, even though there may be many call sites that use that token and many distinct types that are used in the calling call frames. A resolver stub actually has two entry points, one for polymorphic call sites and one for dispatch stubs that fail on their expectedMT test. There is a third part of the resolver stub that enters the ee when a decision should be made about changing the callsite.
- V-Table or Virtual Call Stubs
//These are jump stubs that perform a vtable-base virtual call. These stubs assume that an object is placed in the first argument register (this pointer). From there, the stub extracts the MethodTable pointer, followed by the vtable pointer, and finally jumps to the target method at a given slot in the vtable.
The below diagram shows the general control flow between these stubs
(Image from ‘Design of Virtual Stub Dispatch’)
Finally, if you want even more in-depth information see this comment.
However, these stubs come at a cost, which makes virtual method calls more expensive than direct ones. This is why de-virtualization is so important, i.e. the process of the .NET JIT detecting when a virtual call can instead be replaced by a direct one. There has been some work done in .NET Core to improve this, see Simple devirtualization #9230 which covers sealed
classes/methods and when the object type is known exactly. However there is still more to be done, as shown in JIT: devirtualization next steps #9908, where ‘5 of 23’ tasks have been completed.
Other Types of Stubs
This post is already way too long, so I don’t intend to offer any analysis of the following stubs. Instead I’ve just included some links to more information so you can read up on any that interest you!
‘Jump’ stubs
‘Function Pointer’ stubs
- ‘Function Pointer’ Stubs, see /vm/fptrstubs.cpp and /vm/fptrstubs.h
// FuncPtrStubs contains stubs that is used by GetMultiCallableAddrOfCode() if the function has not been jitted. Using a stub decouples ldftn from the prestub, so prestub does not need to be backpatched. This stub is also used in other places which need a function pointer
‘Thread Hijacking’ stubs
From the BotR page on ‘Threading’:
- If fully interruptable, it is safe to perform a GC at any point, since the thread is, by definition, at a safe point. It is reasonable to leave the thread suspended at this point (because it’s safe) but various historical OS bugs prevent this from working, because the CONTEXT retrieved earlier may be corrupt). Instead, the thread’s instruction pointer is overwritten, redirecting it to a stub that will capture a more complete CONTEXT, leave cooperative mode, wait for the GC to complete, reenter cooperative mode, and restore the thread to its previous state.
- If partially-interruptable, the thread is, by definition, not at a safe point. However, the caller will be at a safe point (method transition). Using that knowledge, the CLR “hijacks” the top-most stack frame’s return address (physically overwrite that location on the stack) with a stub similar to the one used for fully-interruptable code. When the method returns, it will no longer return to its actual caller, but rather to the stub (the method may also perform a GC poll, inserted by the JIT, before that point, which will cause it to leave cooperative mode and undo the hijack).
Done with the OnHijackTripThread
method in /vm/amd64/AsmHelpers.asm, which calls into OnHijackWorker(..)
in /vm/threadsuspend.cpp.
‘NGEN Fixup’ stubs
From CLR Inside Out - The Performance Benefits of NGen (2006):
Throughput of NGen-compiled code is lower than that of JIT-compiled code primarily for one reason: cross-assembly references. In JIT-compiled code, cross-assembly references can be implemented as direct calls or jumps since the exact addresses of these references are known at run time. For statically compiled code, however, cross-assembly references need to go through a jump slot that gets populated with the correct address at run time by executing a method pre-stub. The method pre-stub ensures, among other things, that the native images for assemblies referenced by that method are loaded into memory before the method is executed. The pre-stub only needs to be executed the first time the method is called; it is short-circuited out for subsequent calls. However, every time the method is called, cross-assembly references do need to go through a level of indirection. This is principally what accounted for the 5-10 percent drop in throughput for NGen-compiled code when compared to JIT-compiled code.
Also see the ‘NGEN’ section of the ‘jump stub’ design doc.
Stubs in the Mono Runtime
Mono refers to ‘Stubs’ as ‘Trampolines’ and they’re widely used in the source code.
The Mono docs have an excellent page all about ‘Trampolines’, that lists the following types:
- JIT Trampolines
- Virtual Call Trampolines
- Jump Trampolines
- Class Init Trampolines
- Generic Class Init Trampoline
- RGCTX Lazy Fetch Trampolines
- AOT Trampolines
- Delegate Trampolines
- Monitor Enter/Exit Trampolines
Also the docs page on Generic Sharing has some good, in-depth information.
Conclusion
So it turns out that ‘stubs’ are way more prevelant in the .NET Core Runtime that I imagined when I first started on this post. They are an interesting technique and they contain a fair amount of complexity. In addition, I only covered each stub in isolation, in reality many of them have to play nicely together, for instance imagine a delegate
calling a virtual
method that has generic
type parameters and you can see that things start to get complex! (that scenario might contain 3 seperate stubs, although they are also shared where possible). If you were then to add array
methods, P/Invoke
marshalling and un-boxing
to the mix, things get even more hairy and even more complex!
If anyone has read this far and wants a fun challenge, try and figure out what’s the most stubs you can force a single method call to go via! If you do, let me know in the comments or via twitter
Finally, by knowing where and when stubs are involved in our method calls, we can start to understand the overhead of each scenario. For instance, it explains why delegate
method calls are a bit slower than calling a method directly and why ‘de-virtualization’ is so important. Having the JIT be able to perform extra analysis to determine that a virtual call can be converted into a direct one skips an entire level of indirection, for more on this see:
Thu, 26 Sep 2019, 12:00 am
ASCII Art in .NET Code
Who doesn’t like a nice bit of ‘ASCII Art’? I know I certainly do!
To see what Matt’s CLR was all about you can watch the recording of my talk ‘From ‘dotnet run’ to ‘Hello World!’’ (from about ~24:30 in)
So armed with a trusty regex /\*(.*?)\*/|//(.*?)\r?\n|"((\\[^\n]|[^"\n])*)"|@("[^"]*")+
, I set out to find all the interesting ASCII Art used in source code comments in the following .NET related repositories:
- dotnet/CoreCLR - “the runtime for .NET Core. It includes the garbage collector, JIT compiler, primitive data types and low-level classes.”
- Mono - “open source ECMA CLI, C# and .NET implementation.”
- dotnet/CoreFX - “the foundational class libraries for .NET Core. It includes types for collections, file systems, console, JSON, XML, async and many others.”
- dotnet/Roslyn - “provides C# and Visual Basic languages with rich code analysis APIs”
- aspnet/AspNetCore - “a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.”
Note: Yes, I shamelessly ‘borrowed’ this idea from John Regehr, I was motivated to write this because his excellent post ‘Explaining Code using ASCII Art’ didn’t have any .NET related code in it!
If you’ve come across any interesting examples I’ve missed out, please let me know!
Table of Contents
To make the examples easier to browse, I’ve split them up into categories:
Dave Cutler
There’s no art in this one, but it deserves it’s own category as it quotes the amazing Dave Cutler who led the development of Windows NT. Therefore there’s no better person to ask a deep, technical question about how Thread Suspension works on Windows, from coreclr/src/vm/threadsuspend.cpp
// Message from David Cutler
/*
After SuspendThread returns, can the suspended thread continue to execute code in user mode?
[David Cutler] The suspended thread cannot execute any more user code, but it might be currently "running"
on a logical processor whose other logical processor is currently actually executing another thread.
In this case the target thread will not suspend until the hardware switches back to executing instructions
on its logical processor. In this case even the memory barrier would not necessarily work - a better solution
would be to use interlocked operations on the variable itself.
After SuspendThread returns, does the store buffer of the CPU for the suspended thread still need to drain?
Historically, we've assumed that the answer to both questions is No. But on one 4/8 hyper-threaded machine
running Win2K3 SP1 build 1421, we've seen two stress failures where SuspendThread returns while writes seem to still be in flight.
Usually after we suspend a thread, we then call GetThreadContext. This seems to guarantee consistency.
But there are places we would like to avoid GetThreadContext, if it's safe and legal.
[David Cutler] Get context delivers a APC to the target thread and waits on an event that will be set
when the target thread has delivered its context.
Chris.
*/
For more info on Dave Cutler, see this excellent interview ‘Internets of Interest #6: Dave Cutler on Dave Cutler’ or ‘The engineer’s engineer: Computer industry luminaries salute Dave Cutler’s five-decade-long quest for quality’
Syntax Trees
The inner workings of the .NET ‘Just-in-Time’ (JIT) Compiler have always been a bit of a mystery to me. But, having informative comments like this one from coreclr/src/jit/lsra.cpp go some way to showing what it’s doing
// For example, for this tree (numbers are execution order, lower is earlier and higher is later):
//
// +---------+----------+
// | GT_ADD (3) |
// +---------+----------+
// |
// / \
// / \
// / \
// +-------------------+ +----------------------+
// | x (1) | "tree" | y (2) |
// +-------------------+ +----------------------+
//
// generate this tree:
//
// +---------+----------+
// | GT_ADD (4) |
// +---------+----------+
// |
// / \
// / \
// / \
// +-------------------+ +----------------------+
// | GT_RELOAD (3) | | y (2) |
// +-------------------+ +----------------------+
// |
// +-------------------+
// | x (1) | "tree"
// +-------------------+
There’s also a more in-depth example in coreclr/src/jit/morph.cpp
Also from roslyn/src/Compilers/VisualBasic/Portable/Semantics/TypeInference/RequiredConversion.vb
'// These restrictions form a partial order composed of three chains: from less strict to more strict, we have:
'// [reverse chain] [None] < AnyReverse < ReverseReference < Identity
'// [middle chain] None < [Any,AnyReverse] < AnyConversionAndReverse < Identity
'// [forward chain] [None] < Any < ArrayElement < Reference < Identity
'//
'// = KEY:
'// / | \ = Identity
'// / | \ +r Reference
'// -r | +r -r ReverseReference
'// | +-any | +-any AnyConversionAndReverse
'// | /|\ +arr +arr ArrayElement
'// | / | \ | +any Any
'// -any | +any -any AnyReverse
'// \ | / none None
'// \ | /
'// none
'//
Timelines
This example from coreclr/src/vm/comwaithandle.cpp was unique! I didn’t find another example of ASCII Art used to illustrate time-lines, it’s a really novel approach.
// In case the CLR is paused inbetween a wait, this method calculates how much
// the wait has to be adjusted to account for the CLR Freeze. Essentially all
// pause duration has to be considered as "time that never existed".
//
// Two cases exists, consider that 10 sec wait is issued
// Case 1: All pauses happened before the wait completes. Hence just the
// pause time needs to be added back at the end of wait
// 0 3 8 10
// |-----------|###################|------>
// 5-sec pause
// ....................>
// Additional 5 sec wait
// |=========================>
//
// Case 2: Pauses ended after the wait completes.
// 3 second of wait was left as the pause started at 7 so need to add that back
// 0 7 10
// |---------------------------|###########>
// 5-sec pause 12
// ...................>
// Additional 3 sec wait
// |==================>
//
// Both cases can be expressed in the same calculation
// pauseTime: sum of all pauses that were triggered after the timer was started
// expDuration: expected duration of the wait (without any pauses) 10 in the example
// actDuration: time when the wait finished. Since the CLR is frozen during pause it's
// max of timeout or pause-end. In case-1 it's 10, in case-2 it's 12
Logic Tables
A sweet-spot for ASCII Art seems to be tables, there are so many examples. Starting with coreclr/src/vm/methodtablebuilder.cpp (bonus points for combining comments and code together!)
// | Base type
// Subtype | mdPrivateScope mdPrivate mdFamANDAssem mdAssem mdFamily mdFamORAssem mdPublic
// --------------+-------------------------------------------------------------------------------------------------------
/*mdPrivateScope | */ { { e_SM, e_NO, e_NO, e_NO, e_NO, e_NO, e_NO },
/*mdPrivate | */ { e_SM, e_YES, e_NO, e_NO, e_NO, e_NO, e_NO },
/*mdFamANDAssem | */ { e_SM, e_YES, e_SA, e_NO, e_NO, e_NO, e_NO },
/*mdAssem | */ { e_SM, e_YES, e_SA, e_SA, e_NO, e_NO, e_NO },
/*mdFamily | */ { e_SM, e_YES, e_YES, e_NO, e_YES, e_NSA, e_NO },
/*mdFamORAssem | */ { e_SM, e_YES, e_YES, e_SA, e_YES, e_YES, e_NO },
/*mdPublic | */ { e_SM, e_YES, e_YES, e_YES, e_YES, e_YES, e_YES } };
Also coreclr/src/jit/importer.cpp which shows how the JIT deals with boxing/un-boxing
/*
----------------------------------------------------------------------
| \ helper | | |
| \ | | |
| \ | CORINFO_HELP_UNBOX | CORINFO_HELP_UNBOX_NULLABLE |
| \ | (which returns a BYREF) | (which returns a STRUCT) |
| opcode \ | | |
|---------------------------------------------------------------------
| UNBOX | push the BYREF | spill the STRUCT to a local, |
| | | push the BYREF to this local |
|---------------------------------------------------------------------
| UNBOX_ANY | push a GT_OBJ of | push the STRUCT |
| | the BYREF | For Linux when the |
| | | struct is returned in two |
| | | registers create a temp |
| | | which address is passed to |
| | | the unbox_nullable helper. |
|---------------------------------------------------------------------
*/
Finally, there’s some other nice examples showing the rules for operator overloading in the C# (Roslyn) Compiler and which .NET data-types can be converted via the System.ToXXX()
functions.
Class Hierarchies
Of course, most IDE’s come with tools that will generate class-hierarchies for you, but it’s much nicer to see them in ASCII, from coreclr/src/vm/object.h
* COM+ Internal Object Model
*
*
* Object - This is the common base part to all COM+ objects
* | it contains the MethodTable pointer and the
* | sync block index, which is at a negative offset
* |
* +-- code:StringObject - String objects are specialized objects for string
* | storage/retrieval for higher performance
* |
* +-- BaseObjectWithCachedData - Object Plus one object field for caching.
* | |
* | +- ReflectClassBaseObject - The base object for the RuntimeType class
* | +- ReflectMethodObject - The base object for the RuntimeMethodInfo class
* | +- ReflectFieldObject - The base object for the RtFieldInfo class
* |
* +-- code:ArrayBase - Base portion of all arrays
* | |
* | +- I1Array - Base type arrays
* | | I2Array
* | | ...
* | |
* | +- PtrArray - Array of OBJECTREFs, different than base arrays because of pObjectClass
* |
* +-- code:AssemblyBaseObject - The base object for the class Assembly
There’s also an even larger one that I stumbled across when writing “Stack Walking” in the .NET Runtime.
Component Diagrams
When you have several different components in a code-base it’s always nice to see how they fit together. From coreclr/src/vm/codeman.h we can see how the top-level parts of the .NET JIT work together
ExecutionManager
|
+-----------+---------------+---------------+-----------+--- ...
| | | |
CodeType | CodeType |
| | | |
v v v v
+---------------+ +--------+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Size of previous chunk (if P = 1) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |P|
| Size of this chunk 1| +-+
mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
+- -+
| |
+- -+
| :
+- size - sizeof(size_t) available payload bytes -+
: |
chunk-> +- -+
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |1|
| Size of next chunk (may or may not be in use) | +-+
mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
And if it's free, it looks like this:
chunk-> +- -+
| User payload (must be in use, or we would have merged!) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |P|
| Size of this chunk 0| +-+
mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Next pointer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Prev pointer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| :
+- size - sizeof(struct chunk) unused bytes -+
: |
chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Size of this chunk |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |0|
| Size of next chunk (must be in use, or we would have merged)| +-+
mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| :
+- User payload -+
: |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|0|
+-+
Also, from corefx/src/Common/src/CoreLib/System/MemoryExtensions.cs we can see how overlapping memory regions are detected:
// Visually, the two sequences are located somewhere in the 32-bit
// address space as follows:
//
// [----------------------------------------------) normal address space
// 0 2³²
// [------------------) first sequence
// xRef xRef + xLength
// [--------------------------) . second sequence
// yRef . yRef + yLength
// : . . .
// : . . .
// . . .
// . . .
// . . .
// [----------------------------------------------) relative address space
// 0 . . 2³²
// [------------------) : first sequence
// x1 . x2 :
// -------------) [------------- second sequence
// y2 y1
State Machines
This comment from mono/benchmark/zipmark.cs gives a great over-view of the implementation of RFC 1951 - DEFLATE Compressed Data Format Specification
/*
* The Deflater can do the following state transitions:
*
* (1) -> INIT_STATE ----> INIT_FINISHING_STATE ---.
* / | (2) (5) |
* / v (5) |
* (3)| SETDICT_STATE ---> SETDICT_FINISHING_STATE |(3)
* \ | (3) | ,-------'
* | | | (3) /
* v v (5) v v
* (1) -> BUSY_STATE ----> FINISHING_STATE
* | (6)
* v
* FINISHED_STATE
* \_____________________________________/
* | (7)
* v
* CLOSED_STATE
*
* (1) If we should produce a header we start in INIT_STATE, otherwise
* we start in BUSY_STATE.
* (2) A dictionary may be set only when we are in INIT_STATE, then
* we change the state as indicated.
* (3) Whether a dictionary is set or not, on the first call of deflate
* we change to BUSY_STATE.
* (4) -- intentionally left blank -- :)
* (5) FINISHING_STATE is entered, when flush() is called to indicate that
* there is no more INPUT. There are also states indicating, that
* the header wasn't written yet.
* (6) FINISHED_STATE is entered, when everything has been flushed to the
* internal pending output buffer.
* (7) At any time (7)
*
*/
This might be pushing the definition of ‘state machine’ a bit far, but I wanted to include it because it shows just how complex ‘exception handling’ can be, from coreclr/src/jit/jiteh.cpp
// fgNormalizeEH: Enforce the following invariants:
//
// 1. No block is both the first block of a handler and the first block of a try. In IL (and on entry
// to this function), this can happen if the "try" is more nested than the handler.
//
// For example, consider:
//
// try1 ----------------- BB01
// | BB02
// |--------------------- BB03
// handler1
// |----- try2 ---------- BB04
// | | BB05
// | handler2 ------ BB06
// | | BB07
// | --------------- BB08
// |--------------------- BB09
//
// Thus, the start of handler1 and the start of try2 are the same block. We will transform this to:
//
// try1 ----------------- BB01
// | BB02
// |--------------------- BB03
// handler1 ------------- BB10 // empty block
// | try2 ---------- BB04
// | | BB05
// | handler2 ------ BB06
// | | BB07
// | --------------- BB08
// |--------------------- BB09
//
RFC’s and Specs
Next up, how the Kestrel web-server handles RFC 7540 - Hypertext Transfer Protocol Version 2 (HTTP/2).
Firstly, from aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2Frame.cs
/* https://tools.ietf.org/html/rfc7540#section-4.1
+-----------------------------------------------+
| Length (24) |
+---------------+---------------+---------------+
| Type (8) | Flags (8) |
+-+-------------+---------------+-------------------------------+
|R| Stream Identifier (31) |
+=+=============================================================+
| Frame Payload (0...) ...
+---------------------------------------------------------------+
*/
and then in aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2Frame.Headers.cs
/* https://tools.ietf.org/html/rfc7540#section-6.2
+---------------+
|Pad Length? (8)|
+-+-------------+-----------------------------------------------+
|E| Stream Dependency? (31) |
+-+-------------+-----------------------------------------------+
| Weight? (8) |
+-+-------------+-----------------------------------------------+
| Header Block Fragment (*) ...
+---------------------------------------------------------------+
| Padding (*) ...
+---------------------------------------------------------------+
*/
There are other notable examples in aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2FrameReader.cs and aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2FrameWriter.cs.
Also RFC 3986 - Uniform Resource Identifier (URI) is discussed in corefx/src/Common/src/System/Net/IPv4AddressHelper.Common.cs
Finally, RFC 7541 - HPACK: Header Compression for HTTP/2, is covered in aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/HPack/HPackDecoder.cs
// http://httpwg.org/specs/rfc7541.html#rfc.section.6.1
// 0 1 2 3 4 5 6 7
// +---+---+---+---+---+---+---+---+
// | 1 | Index (7+) |
// +---+---------------------------+
private const byte IndexedHeaderFieldMask = 0x80;
private const byte IndexedHeaderFieldRepresentation = 0x80;
// http://httpwg.org/specs/rfc7541.html#rfc.section.6.2.1
// 0 1 2 3 4 5 6 7
// +---+---+---+---+---+---+---+---+
// | 0 | 1 | Index (6+) |
// +---+---+-----------------------+
private const byte LiteralHeaderFieldWithIncrementalIndexingMask = 0xc0;
private const byte LiteralHeaderFieldWithIncrementalIndexingRepresentation = 0x40;
// http://httpwg.org/specs/rfc7541.html#rfc.section.6.2.2
// 0 1 2 3 4 5 6 7
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 0 | 0 | Index (4+) |
// +---+---+-----------------------+
private const byte LiteralHeaderFieldWithoutIndexingMask = 0xf0;
private const byte LiteralHeaderFieldWithoutIndexingRepresentation = 0x00;
// http://httpwg.org/specs/rfc7541.html#rfc.section.6.2.3
// 0 1 2 3 4 5 6 7
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 0 | 1 | Index (4+) |
// +---+---+-----------------------+
private const byte LiteralHeaderFieldNeverIndexedMask = 0xf0;
private const byte LiteralHeaderFieldNeverIndexedRepresentation = 0x10;
// http://httpwg.org/specs/rfc7541.html#rfc.section.6.3
// 0 1 2 3 4 5 6 7
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 1 | Max size (5+) |
// +---+---------------------------+
private const byte DynamicTableSizeUpdateMask = 0xe0;
private const byte DynamicTableSizeUpdateRepresentation = 0x20;
// http://httpwg.org/specs/rfc7541.html#rfc.section.5.2
// 0 1 2 3 4 5 6 7
// +---+---+---+---+---+---+---+---+
// | H | String Length (7+) |
// +---+---------------------------+
private const byte HuffmanMask = 0x80;
Dates & Times
It is pretty widely accepted that dates and times are hard and that’s reflected in the amount of comments explaining different scenarios. For example from corefx/src/Common/src/CoreLib/System/TimeZoneInfo.cs
// startTime and endTime represent the period from either the start of DST to the end and
// ***does not include*** the potentially overlapped times
//
// -=-=-=-=-=- Pacific Standard Time -=-=-=-=-=-=-
// April 2, 2006 October 29, 2006
// 2AM 3AM 1AM 2AM
// | +1 hr | | -1 hr |
// | | | |
// [========== DST ========>)
//
// -=-=-=-=-=- Some Weird Time Zone -=-=-=-=-=-=-
// April 2, 2006 October 29, 2006
// 1AM 2AM 2AM 3AM
// | -1 hr | | +1 hr |
// | | | |
// [======== DST ========>)
//
Also, from corefx/src/Common/src/CoreLib/System/TimeZoneInfo.Unix.cs we see some details on how ‘leap-years’ are handled:
// should be n Julian day format which we don't support.
//
// This specifies the Julian day, with n between 0 and 365. February 29 is counted in leap years.
//
// n would be a relative number from the begining of the year. which should handle if the
// the year is a leap year or not.
//
// In leap year, n would be counted as:
//
// 0 30 31 59 60 90 335 365
// |-------Jan--------|-------Feb--------|-------Mar--------|....|-------Dec--------|
//
// while in non leap year we'll have
//
// 0 30 31 58 59 89 334 364
// |-------Jan--------|-------Feb--------|-------Mar--------|....|-------Dec--------|
//
//
// For example if n is specified as 60, this means in leap year the rule will start at Mar 1,
// while in non leap year the rule will start at Mar 2.
//
// If we need to support n format, we'll have to have a floating adjustment rule support this case.
Finally, this comment from corefx/src/System.Runtime/tests/System/TimeZoneInfoTests.cs discusses invalid and ambiguous times that are covered in tests:
// March 26, 2006 October 29, 2006
// 2AM 3AM 2AM 3AM
// | +1 hr | | -1 hr |
// | | | |
// *========== DST ========>*
//
// * 00:59:59 Sunday March 26, 2006 in Universal converts to
// 01:59:59 Sunday March 26, 2006 in Europe/Amsterdam (NO DST)
//
// * 01:00:00 Sunday March 26, 2006 in Universal converts to
// 03:00:00 Sunday March 26, 2006 in Europe/Amsterdam (DST)
//
Stack Layouts
To finish off, I wanted to look at ‘stack layouts’ because they seem to be a favourite of the .NET/Mono Runtime Engineers, there’s sooo many examples!
First-up, x68
from coreclr/src/jit/lclvars.cpp (you can also see the x64, ARM and ARM64 versions).
* The frame is laid out as follows for x86:
*
* ESP frames
*
* | |
* |-----------------------|
* | incoming |
* | arguments |
* |-----------------------| stack_usage-4
* spilled regs
* ------------------- sp +
* MonoLMF structure optional
* ------------------- sp + cfg->arch.lmf_offset
* saved registers s0-s8
* ------------------- sp + cfg->arch.iregs_offset
* locals
* ------------------- sp + cfg->param_area
* param area outgoing
* ------------------- sp + MIPS_STACK_PARAM_OFFSET
* a0-a3 outgoing
* ------------------- sp
* red zone
*/
Finally, there’s another example covering [DLLImport]
callbacks and one more involving funclet frames in ARM64, I told you there were lots!!
The Rest
If you aren’t sick of ‘ASCII Art’ by now, here’s a few more examples for you to look at!!
- CoreCLR
- Roslyn
- CoreFX
- AspNetCore
- Mono
Thu, 25 Apr 2019, 12:00 am
Is C# a low-level language?
I’m a massive fan of everything Fabien Sanglard does, I love his blog and I’ve read both his books cover-to-cover (for more info on his books, check out the recent Hansleminutes podcast).
Recently he wrote an excellent post where he deciphered a postcard sized raytracer, un-packing the obfuscated code and providing a fantastic explanation of the maths involved. I really recommend you take the time to read it!
But it got me thinking, would it be possible to port that C++ code to C#?
Partly because in my day job I’ve been having to write a fair amount of C++ recently and I’ve realised I’m a bit rusty, so I thought this might help!
But more significantly, I wanted to get a better insight into the question is C# a low-level language?
A slightly different, but related question is how suitable is C# for ‘systems programming’? For more on that I really recommend Joe Duffy’s excellent post from 2013.
Line-by-line port
I started by simply porting the un-obfuscated C++ code line-by-line to C#. Turns out that this was pretty straight forward, I guess the story about C# being C++++ is true after all!!
Let’s look at an example, the main data structure in the code is a ‘vector’, here’s the code side-by-side, C++ on the left and C# on the right:
So there’s a few syntax differences, but because .NET lets you define your own ‘Value Types’ I was able to get the same functionality. This is significant because treating the ‘vector’ as a struct
means we can get better ‘data locality’ and the .NET Garbage Collector (GC) doesn’t need to be involved as the data will go onto the stack (probably, yes I know it’s an implementation detail).
For more info on structs
or ‘value types’ in .NET see:
In particular that last post form Eric Lippert contains this helpful quote that makes it clear what ‘value types’ really are:
Surely the most relevant fact about value types is not the implementation detail of how they are allocated, but rather the by-design semantic meaning of “value type”, namely that they are always copied “by value”. If the relevant thing was their allocation details then we’d have called them “heap types” and “stack types”. But that’s not relevant most of the time. Most of the time the relevant thing is their copying and identity semantics.
Now lets look at how some other methods look side-by-side (again C++ on the left, C# on the right), first up RayTracing(..)
:
Next QueryDatabase(..)
:
(see Fabien’s post for an explanation of what these 2 functions are doing)
But the point is that again, C# lets us very easily write C++ code! In this case what helps us out the most is the ref
keyword which lets us pass a value by reference. We’ve been able to use ref
in method calls for quite a while, but recently there’s been a effort to allow ref
in more places:
Now sometimes using ref
can provide a performance boost because it means that the struct
doesn’t need to be copied, see the benchmarks in Adam Sitniks post and Performance traps of ref locals and ref returns in C# for more information.
However what’s most important for this scenario is that it allows us to have the same behaviour in our C# port as the original C++ code. Although I want to point out that ‘Managed References’ as they’re known aren’t exactly the same as ‘pointers’, most notably you can’t do arithmetic on them, for more on this see:
So, it’s all well and good being able to port the code, but ultimately the performance also matters. Especially in something like a ‘ray tracer’ that can take minutes to run! The C++ code contains a variable called sampleCount
that controls the final quality of the image, with sampleCount = 2
it looks like this:
Which clearly isn’t that realistic!
However once you get to sampleCount = 2048
things look a lot better:
But, running with sampleCount = 2048
means the rendering takes a long time, so all the following results were run with it set to 2
, which means the test runs completed in ~1 minute. Changing sampleCount
only affects the number of iterations of the outermost loop of the code, see this gist for an explanation.
Results after a ‘naive’ line-by-line port
To be able to give a meaningful side-by-side comparison of the C++ and C# versions I used the time-windows tool that’s a port of the Unix time
command. My initial results looked this this:
C++ (VS 2017)
.NET Framework (4.7.2)
.NET Core (2.2)
Elapsed time (secs)
47.40
80.14
78.02
Kernel time
0.14 (0.3%)
0.72 (0.9%)
0.63 (0.8%)
User time
43.86 (92.5%)
73.06 (91.2%)
70.66 (90.6%)
page fault #
1,143
4,818
5,945
Working set (KB)
4,232
13,624
17,052
Paged pool (KB)
95
172
154
Non-paged pool
7
14
16
Page file size (KB)
1,460
10,936
11,024
So initially we see that the C# code is quite a bit slower than the C++ version, but it does get better (see below).
However lets first look at what the .NET JIT is doing for us even with this ‘naive’ line-by-line port. Firstly, it’s doing a nice job of in-lining the smaller ‘helper methods’, we can see this by looking at the output of the brilliant Inlining Analyzer tool (green overlay = inlined):
However, it doesn’t inline all methods, for example QueryDatabase(..)
is skipped because of it’s complexity:
Another feature that the .NET Just-In-Time (JIT) compiler provides is converting specific methods calls into corresponding CPU instructions. We can see this in action with the sqrt
wrapper function, here’s the original C# code (note the call to Math.Sqrt
):
// intnv square root
public static Vec operator !(Vec q) {
return q * (1.0f / (float)Math.Sqrt(q % q));
}
And here’s the assembly code that the .NET JIT generates, there’s no call to Math.Sqrt
and it makes use of the vsqrtsd
CPU instruction:
; Assembly listing for method Program:sqrtf(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 3, 3 ) float -> mm0
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M8216_IG01:
vzeroupper
G_M8216_IG02:
vcvtss2sd xmm0, xmm0
vsqrtsd xmm0, xmm0
vcvtsd2ss xmm0, xmm0
G_M8216_IG03:
ret
; Total bytes of code 16, prolog size 3 for method Program:sqrtf(float):float
; ============================================================
(to get this output you need to following these instructions, use the ‘Disasmo’ VS2019 Add-in or take a look at SharpLab.io)
These replacements are also known as ‘intrinsics’ and we can see the JIT generating them in the code below. This snippet just shows the mapping for AMD64
, the JIT also targets X86
, ARM
and ARM64
, the full method is here
bool Compiler::IsTargetIntrinsic(CorInfoIntrinsics intrinsicId)
{
#if defined(_TARGET_AMD64_) || (defined(_TARGET_X86_) && !defined(LEGACY_BACKEND))
switch (intrinsicId)
{
// AMD64/x86 has SSE2 instructions to directly compute sqrt/abs and SSE4.1
// instructions to directly compute round/ceiling/floor.
//
// TODO: Because the x86 backend only targets SSE for floating-point code,
// it does not treat Sine, Cosine, or Round as intrinsics (JIT32
// implemented those intrinsics as x87 instructions). If this poses
// a CQ problem, it may be necessary to change the implementation of
// the helper calls to decrease call overhead or switch back to the
// x87 instructions. This is tracked by #7097.
case CORINFO_INTRINSIC_Sqrt:
case CORINFO_INTRINSIC_Abs:
return true;
case CORINFO_INTRINSIC_Round:
case CORINFO_INTRINSIC_Ceiling:
case CORINFO_INTRINSIC_Floor:
return compSupports(InstructionSet_SSE41);
default:
return false;
}
...
}
As you can see, some methods are implemented like this, e.g. Sqrt
and Abs
, but for others the CLR instead uses the C++ runtime functions for instance powf
.
This entire process is explained very nicely in How is Math.Pow() implemented in .NET Framework?, but we can also see it in action in the CoreCLR source:
However, I wanted to see if my ‘naive’ line-by-line port could be improved, after some profiling I made two main changes:
- Remove in-line array initialisation
- Switch from
Math.XXX(..)
functions to the MathF.XXX()
counterparts.
These changes are explained in more depth below
Remove in-line array initialisation
For more information about why this is necessary see this excellent Stack Overflow answer from Andrey Akinshin complete with benchmarks and assembly code! It comes to the following conclusion:
Conclusion
- Does .NET caches hardcoded local arrays? Kind of: the Roslyn compiler put it in the metadata.
- Do we have any overhead in this case? Unfortunately, yes: JIT will copy the array content from the metadata for each invocation; it will work longer than the case with a static array. Runtime also allocates objects and produce memory traffic.
- Should we care about it? It depends. If it’s a hot method and you want to achieve a good level of performance, you should use a static array. If it’s a cold method which doesn’t affect the application performance, you probably should write “good” source code and put the array in the method scope.
You can see the change I made in this diff.
Using MathF functions instead of Math
Secondly and most significantly I got a big perf improvement by making the following changes:
#if NETSTANDARD2_1 || NETCOREAPP2_0 || NETCOREAPP2_1 || NETCOREAPP2_2 || NETCOREAPP3_0
// intnv square root
public static Vec operator !(Vec q) {
return q * (1.0f / MathF.Sqrt(q % q));
}
#else
public static Vec operator !(Vec q) {
return q * (1.0f / (float)Math.Sqrt(q % q));
}
#endif
As of ‘.NET Standard 2.1’ there are now specific float
implementations of the common maths functions, located in the System.MathF class. For more information on this API and it’s implementation see:
After these changes, the C# code is ~10% slower than the C++ version:
C++ (VS C++ 2017)
.NET Framework (4.7.2)
.NET Core (2.2) TC OFF
.NET Core (2.2) TC ON
Elapsed time (secs)
41.38
58.89
46.04
44.33
Kernel time
0.05 (0.1%)
0.06 (0.1%)
0.14 (0.3%)
0.13 (0.3%)
User time
41.19 (99.5%)
58.34 (99.1%)
44.72 (97.1%)
44.03 (99.3%)
page fault #
1,119
4,749
5,776
5,661
Working set (KB)
4,136
13,440
16,788
16,652
Paged pool (KB)
89
172
150
150
Non-paged pool
7
13
16
16
Page file size (KB)
1,428
10,904
10,960
11,044
TC = Tiered Compilation (I believe that it’ll be on by default in .NET Core 3.0)
For completeness, here’s the results across several runs:
Run
C++ (VS C++ 2017)
.NET Framework (4.7.2)
.NET Core (2.2) TC OFF
.NET Core (2.2) TC ON
TestRun-01
41.38
58.89
46.04
44.33
TestRun-02
41.19
57.65
46.23
45.96
TestRun-03
42.17
62.64
46.22
48.73
Note: the difference between .NET Core and .NET Framework is due to the lack of the MathF
API in .NET Framework v4.7.2, for more info see Support .Net Framework (4.8?) for netstandard 2.1.
However I’m sure that others can do better!
If you’re interested in trying to close the gap the C# code is available. For comparison, you can see the assembly produced by the C++ compiler courtesy of the brilliant Compiler Explorer.
Finally, if it helps, here’s the output from the Visual Studio Profiler showing the ‘hot path’ (after the perf improvement described above):
Is C# a low-level language?
Or more specifically:
What language features of C#/F#/VB.NET or BCL/Runtime functionality enable ‘low-level’* programming?
* yes, I know ‘low-level’ is a subjective term 😊
Note: Any C# developer is going to have a different idea of what ‘low-level’ means, these features would be taken for granted by C++ or Rust programmers.
Here’s the list that I came up with:
- ref returns and ref locals
- “tl;dr Pass and return by reference to avoid large struct copying. It’s type and memory safe. It can be even faster than
unsafe!
”
- Unsafe code in .NET
- “The core C# language, as defined in the preceding chapters, differs notably from C and C++ in its omission of pointers as a data type. Instead, C# provides references and the ability to create objects that are managed by a garbage collector. This design, coupled with other features, makes C# a much safer language than C or C++.”
- Managed pointers in .NET
- “There is, however, another pointer type in CLR – a managed pointer. It could be defined as a more general type of reference, which may point to other locations than just the beginning of an object.”
- C# 7 Series, Part 10: Span and universal memory management
- “
System.Span
is a stack-only type (ref struct
) that wraps all memory access patterns, it is the type for universal contiguous memory access. You can think the implementation of the Span contains a dummy reference and a length, accepting all 3 memory access types."
- Interoperability (C# Programming Guide)
- “The .NET Framework enables interoperability with unmanaged code through platform invoke services, the
System.Runtime.InteropServices
namespace, C++ interoperability, and COM interoperability (COM interop).”
However, I know my limitations and so I asked on twitter and got a lot more replies to add to the list:
- Ben Adams “Platform intrinsics (CPU instruction access)”
- Marc Gravell “SIMD via Vector (which mixes well with Span) is *fairly* low; .NET Core should (soon?) offer direct CPU intrinsics for more explicit usage targeting particular CPU ops"
- Marc Gravell “powerful JIT: things like range elision on arrays/spans, and the JIT using per-struct-T rules to remove huge chunks of code that it knows can’t be reached for that T, or on your particular CPU (BitConverter.IsLittleEndian, Vector.IsHardwareAccelerated, etc)”
- Kevin Jones “I would give a special shout-out to the
MemoryMarshal
and Unsafe
classes, and probably a few other things in the System.Runtime.CompilerServices
namespace.”
- Theodoros Chatzigiannakis “You could also include
__makeref
and the rest.”
- damageboy “Being able to dynamically generate code that fits the expected input exactly, given that the latter will only be known at runtime, and might change periodically?”
- Robert Haken “dynamic IL emission”
- Victor Baybekov “Stackalloc was not mentioned. Also ability to write raw IL (not dynamic, so save on a delegate call), e.g. to use cached
ldftn
and call them via calli
. VS2017 has a proj template that makes this trivial via extern methods + MethodImplOptions.ForwardRef + ilasm.exe rewrite.”
- Victor Baybekov “Also MethodImplOptions.AggressiveInlining “does enable ‘low-level’ programming” in a sense that it allows to write high-level code with many small methods and still control JIT behavior to get optimized result. Otherwise uncomposable 100s LOCs methods with copy-paste…”
- Ben Adams “Using the same calling conventions (ABI) as the underlying platform and p/invokes for interop might be more of a thing though?”
- Victor Baybekov “Also since you mentioned #fsharp - it does have
inline
keyword that does the job at IL level before JIT, so it was deemed important at the language level. C# lacks this (so far) for lambdas which are always virtual calls and workarounds are often weird (constrained generics).”
- Alexandre Mutel “new SIMD intrinsics, Unsafe Utility class/IL post processing (e.g custom, Fody…etc.). For C#8.0, upcoming function pointers…”
- Alexandre Mutel “related to IL, F# has support for direct IL within the language for example”
- OmariO “BinaryPrimitives. Low-level but safe.” (https://docs.microsoft.com/en-us/dotnet/api/system.buffers.binary.binaryprimitives?view=netcore-3.0)
- Kouji (Kozy) Matsui “How about native inline assembler? It’s difficult for how relation both toolchains and runtime, but can replace current P/Invoke solution and do inlining if we have it.”
- Frank A. Krueger “Ldobj, stobj, initobj, initblk, cpyblk.”
- Konrad Kokosa “Maybe Thread Local Storage? Fixed Size Buffers? unmanaged constraint and blittable types should be probably mentioned:)”
- Sebastiano Mandalà “Just my two cents as everything has been said: what about something as simple as struct layout and how padding and memory alignment and order of the fields may affect the cache line performance? It’s something I have to investigate myself too”
- Nino Floris “Constants embedding via readonlyspan, stackalloc, finalizers, WeakReference, open delegates, MethodImplOptions, MemoryBarriers, TypedReference, varargs, SIMD, Unsafe.AsRef can coerce struct types if layout matches exactly (used for a.o. TaskAwaiter and its version)"
So in summary, I would say that C# certainly lets you write code that looks a lot like C++ and in conjunction with the Runtime and Base-Class Libraries it gives you a lot of low-level functionality
Discuss this post on Hacker News, /r/programming, /r/dotnet or /r/csharp
Further Reading
The Unity ‘Burst’ Compiler:
Fri, 1 Mar 2019, 12:00 am
"Stack Walking" in the .NET Runtime
What is ‘stack walking’, well as always the ‘Book of the Runtime’ (BotR) helps us, from the relevant page:
The CLR makes heavy use of a technique known as stack walking (or stack crawling). This involves iterating the sequence of call frames for a particular thread, from the most recent (the thread’s current function) back down to the base of the stack.
The runtime uses stack walks for a number of purposes:
- The runtime walks the stacks of all threads during garbage collection, looking for managed roots (local variables holding object references in the frames of managed methods that need to be reported to the GC to keep the objects alive and possibly track their movement if the GC decides to compact the heap).
- On some platforms the stack walker is used during the processing of exceptions (looking for handlers in the first pass and unwinding the stack in the second).
- The debugger uses the functionality when generating managed stack traces.
- Various miscellaneous methods, usually those close to some public managed API, perform a stack walk to pick up information about their caller (such as the method, class or assembly of that caller).
The rest of this post will explore what ‘Stack Walking’ is, how it works and why so many parts of the runtime need to be involved.
Table of Contents
Where does the CLR use ‘Stack Walking’?
Before we dig into the ‘internals’, let’s take a look at where the runtime utilises ‘stack walking’, below is the full list (as of .NET Core CLR ‘Release 2.2’). All these examples end up calling into the Thread::StackWalkFrames(..)
method here and provide a callback
that is triggered whenever the API encounters a new section of the stack (see How to use it below for more info).
Common Scenarios
- Garbage Collection (GC)
- Exception Handling (unwinding)
- Exception Handling (resumption):
ExceptionTracker::FindNonvolatileRegisterPointers(..)
here -> callback
ExceptionTracker::RareFindParentStackFrame(..)
here -> callback
- Threads:
- Thread Suspension:
Debugging/Diagnostics
- Debugger
- Managed APIs (e.g
System.Diagnostics.StackTrace
)
- Managed code calls via an
InternalCall
(C#) here into DebugStackTrace::GetStackFramesInternal(..)
(C++) here
- Before ending up in
DebugStackTrace::GetStackFramesHelper(..)
here -> callback
- DAC (via by SOS) - Scan for GC ‘Roots’
- Profiling API
ProfToEEInterfaceImpl::ProfilerStackWalkFramesWrapper(..)
here -> callback
- Event Pipe (Diagnostics)
- CLR prints a Stack Trace (to the console/log, DEBUG builds only)
Obscure Scenarios
- Reflection
- Application (App) Domains (See ‘Stack Crawl Marks’ below)
SystemDomain::GetCallersMethod(..)
here (also GetCallersType(..)
and GetCallersModule(..)
) (callback)
SystemDomain::GetCallersModule(..)
here (callback)
- ‘Code Pitching’
- Extensible Class Factory (
System.Runtime.InteropServices.ExtensibleClassFactory
)
- Stack Sampler (unused?)
Stack Crawl Marks
One of the above scenarios deserves a closer look, but firstly why are ‘stack crawl marks’ used, from coreclr/issues/#21629 (comment):
Unfortunately, there is a ton of legacy APIs that were added during netstandard2.0 push whose behavior depend on the caller. The caller is basically passed in as an implicit argument to the API. Most of these StackCrawlMarks are there to support these APIs…
So we can see that multiple functions within the CLR itself need to have knowledge of their caller. To understand this some more, let’s look an example, the GetType(string typeName)
method. Here’s the flow from the externally-visible method all the way down to where the work is done, note how a StackCrawlMark
instance is passed through:
Type::GetType(string typeName)
implementation (Creates StackCrawlMark.LookForMyCaller
)
RuntimeType::GetType(.., ref StackCrawlMark stackMark)
implementation
RuntimeType::GetTypeByName(.., ref StackCrawlMark stackMark, ..)
implementation
extern void GetTypeByName(.., ref StackCrawlMark stackMark, ..)
definition (call into native code, i.e. [DllImport(JitHelpers.QCall, ..)]
)
RuntimeTypeHandle::GetTypeByName(.., QCall::StackCrawlMarkHandle pStackMark, ..)
implementation
TypeHandle TypeName::GetTypeManaged(.., StackCrawlMark* pStackMark, ..)
implementation
TypeHandle TypeName::GetTypeWorker(.. , StackCrawlMark* pStackMark, ..)
implementation
SystemDomain::GetCallersAssembly(StackCrawlMark *stackMark,..)
implementation
SystemDomain::GetCallersModule(StackCrawlMark* stackMark, ..)
implementation
SystemDomain::CallersMethodCallbackWithStackMark(..)
callback implementation
In addition the JIT (via the VM) has to ensure that all relevant methods are available in the call-stack, i.e. they can’t be removed:
However, the StackCrawlMark
feature is currently being cleaned up, so it may look different in the future:
Exception Handling
The place that most .NET Developers will run into ‘stack traces’ is when dealing with exceptions. I originally intended to also describe ‘exception handling’ here, but then I opened up /src/vm/exceptionhandling.cpp and saw that it contained over 7,000 lines of code!! So I decided that it can wait for a future post 😁.
However, if you want to learn more about the ‘internals’ I really recommend Chris Brumme’s post The Exception Model (2003) which is the definitive guide on the topic (also see his Channel9 Videos) and as always, the ‘BotR’ chapter ‘What Every (Runtime) Dev needs to Know About Exceptions in the Runtime’ is well worth a read.
Also, I recommend talking a look at the slides from the ‘Internals of Exceptions’ talk’ and the related post .NET Inside Out Part 2 — Handling and rethrowing exceptions in C# both by Adam Furmanek.
The ‘Stack Walking’ API
Now that we’ve seen where it’s used, let’s look at the ‘stack walking’ API itself. Firstly, how is it used?
How to use it
It’s worth pointing out that the only way you can access it from C#/F#/VB.NET code is via the StackTrace
class, only the runtime itself can call into Thread::StackWalkFrames(..)
directly. The simplest usage in the runtime is EventPipe::WalkManagedStackForThread(..)
(see here), which is shown below. As you can see it’s as simple as specifying the relevant flags, in this case ALLOW_ASYNC_STACK_WALK | FUNCTIONSONLY | HANDLESKIPPEDFRAMES | ALLOW_INVALID_OBJECTS
and then providing the callback, which in the EventPipe class is the StackWalkCallback
method (here)
bool EventPipe::WalkManagedStackForThread(Thread *pThread, StackContents &stackContents)
{
CONTRACTL
{
NOTHROW;
GC_NOTRIGGER;
MODE_ANY;
PRECONDITION(pThread != NULL);
}
CONTRACTL_END;
// Calling into StackWalkFrames in preemptive mode violates the host contract,
// but this contract is not used on CoreCLR.
CONTRACT_VIOLATION( HostViolation );
stackContents.Reset();
StackWalkAction swaRet = pThread->StackWalkFrames(
(PSTACKWALKFRAMESCALLBACK) &StackWalkCallback,
&stackContents,
ALLOW_ASYNC_STACK_WALK | FUNCTIONSONLY | HANDLESKIPPEDFRAMES | ALLOW_INVALID_OBJECTS);
return ((swaRet == SWA_DONE) || (swaRet == SWA_CONTINUE));
}
The StackWalkFrame(..)
function then does the heavy-lifting of actually walking the stack, before triggering the callback shown below. In this case it just records the ‘Instruction Pointer’ (IP/CP) and the ‘managed function’, which is an instance of the MethodDesc
obtained via the pCf->GetFunction()
call:
StackWalkAction EventPipe::StackWalkCallback(CrawlFrame *pCf, StackContents *pData)
{
CONTRACTL
{
NOTHROW;
GC_NOTRIGGER;
MODE_ANY;
PRECONDITION(pCf != NULL);
PRECONDITION(pData != NULL);
}
CONTRACTL_END;
// Get the IP.
UINT_PTR controlPC = (UINT_PTR)pCf->GetRegisterSet()->ControlPC;
if (controlPC == 0)
{
if (pData->GetLength() == 0)
{
// This happens for pinvoke stubs on the top of the stack.
return SWA_CONTINUE;
}
}
_ASSERTE(controlPC != 0);
// Add the IP to the captured stack.
pData->Append(controlPC, pCf->GetFunction());
// Continue the stack walk.
return SWA_CONTINUE;
}
How it works
Now onto the most interesting part, how to the runtime actually walks the stack. Well, first let’s understand what the stack looks like, from the ‘BotR’ page:
The main thing to note is that a .NET ‘stack’ can contain 3 types of methods:
- Managed - this represents code that started off as C#/F#/VB.NET, was turned into IL and then finally compiled to native code by the ‘JIT Compiler’.
- Unmanaged - completely native code that exists outside of the runtime, i.e. a OS function the runtime calls into or a user call via
P/Invoke
. The runtime only cares about transitions into or out of regular unmanaged code, is doesn’t care about the stack frame within it.
- Runtime Managed - still native code, but this is slightly different because the runtime case more about this code. For example there are quite a few parts of the Base-Class libraries that make use of
InternalCall
methods, for more on this see the ‘Helper Method’ Frames section later on.
So the ‘stack walk’ has to deal with these different scenarios as it proceeds. Now let’s look at the ‘code flow’ starting with the entry-point method StackWalkFrames(..)
:
Thread::StackWalkFrames(..)
here
- the entry-point function, the type of ‘stack walk’ can be controlled via these flags
Thread::StackWalkFramesEx(..)
here
- worker-function that sets up the
StackFrameIterator
, via a call to StackFrameIterator::Init(..)
here
StackFrameIterator::Next()
here, then hands off to the primary worker method StackFrameIterator::NextRaw()
here that does 5 things:
CheckForSkippedFrames(..)
here, deals with frames that may have been allocated inside a managed stack frame (e.g. an inlined p/invoke call).
UnwindStackFrame(..)
here, in-turn calls:
x64
- Thread::VirtualUnwindCallFrame(..)
here, then calls VirtualUnwindNonLeafCallFrame(..)
here or VirtualUnwindLeafCallFrame(..)
here. All of of these functions make use of the Windows API function RtlLookupFunctionEntry(..)
to do the actual unwinding.
x86
- ::UnwindStackFrame(..)
here, in turn calls UnwindEpilog(..)
here and UnwindEspFrame(..)
here. Unlike x64
, under x86
all the ‘stack-unwinding’ is done manually, within the CLR code.
PostProcessingForManagedFrames(..)
here, determines if the stack-walk is actually within a managed method rather than a native frame.
ProcessIp(..)
here has the job of looking up the current managed method (if any) based on the current instruction pointer (IP). It does this by calling into EECodeInfo::Init(..)
here and then ends up in one of:
EEJitManager::JitCodeToMethodInfo(..)
here, that uses a very cool looking data structure refereed to as a ‘nibble map’
NativeImageJitManager::JitCodeToMethodInfo(..)
here
ReadyToRunJitManager::JitCodeToMethodInfo(..)
here
ProcessCurrentFrame(..)
here, does some final house-keeping and tidy-up.
CrawlFrame::GotoNextFrame()
here
- in-turn calls
pFrame->Next()
here to walk through the ‘linked list’ of frames which drive the ‘stack walk’ (more on these ‘frames’ later)
StackFrameIterator::Filter()
here
When it gets a valid frame it triggers the callback in Thread::MakeStackwalkerCallback(..)
here and passes in a pointer to the current CrawlFrame
class defined here, this exposes methods such as IsFrameless()
, GetFunction()
and GetThisPointer()
. The CrawlFrame
actually represents 2 scenarios, based on the current IP:
- Native code, represented by a
Frame
class defined here, which we’ll discuss more in a moment.
- Managed code, well technically ‘managed code’ that was JITted to ‘native code’, so more accurately a managed stack frame. In this situation the
MethodDesc
class defined here is provided, you can read more about this key CLR data-structure in the corresponding BotR chapter.
See it ‘in Action’
Fortunately we’re able to turn on some nice diagnostics in a debug build of the CLR (COMPLUS_LogEnable
, COMPLUS_LogToFile
& COMPLUS_LogFacility
). With that in place, given C# code like this:
internal class Program {
private static void Main() {
MethodA();
}
[MethodImpl(MethodImplOptions.NoInlining)]
private void MethodA() {
MethodB();
}
[MethodImpl(MethodImplOptions.NoInlining)]
private void MethodB() {
MethodC();
}
[MethodImpl(MethodImplOptions.NoInlining)]
private void MethodC() {
var stackTrace = new StackTrace(fNeedFileInfo: true);
Console.WriteLine(stackTrace.ToString());
}
}
We get the output shown below, in which you can see the ‘stack walking’ process. It starts in InitializeSourceInfo
and CaptureStackTrace
which are methods internal to the StackTrace
class (see here), before moving up the stack MethodC
-> MethodB
-> MethodA
and then finally stopping in the Main
function. Along the way its does a ‘FILTER’ and ‘CONSIDER’ step before actually unwinding (‘finished unwind for …’):
TID 4740: STACKWALK starting with partial context
TID 4740: STACKWALK: [000] FILTER : EXPLICIT : PC= 00000000`00000000 SP= 00000000`00000000 Frame= 00000002`9977cc48 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: [001] CONSIDER: EXPLICIT : PC= 00000000`00000000 SP= 00000000`00000000 Frame= 00000002`9977cc48 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: [001] FILTER : EXPLICIT : PC= 00000000`00000000 SP= 00000000`00000000 Frame= 00000002`9977cc48 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: [002] CONSIDER: EXPLICIT : PC= 00000000`00000000 SP= 00000000`00000000 Frame= 00000002`9977cdd8 vtbl= 00007ffd`74995220
TID 4740: STACKWALK LazyMachState::unwindLazyState(ip:00007FFD7439C45C,sp:000000029977C338)
TID 4740: STACKWALK: [002] CALLBACK: EXPLICIT : PC= 00000000`00000000 SP= 00000000`00000000 Frame= 00000002`9977cdd8 vtbl= 00007ffd`74995220
TID 4740: STACKWALK HelperMethodFrame::UpdateRegDisplay cached ip:00007FFD72FE9258, sp:000000029977D300
TID 4740: STACKWALK: [003] CONSIDER: FRAMELESS: PC= 00007ffd`72fe9258 SP= 00000002`9977d300 method=InitializeSourceInfo
TID 4740: STACKWALK: [003] CALLBACK: FRAMELESS: PC= 00007ffd`72fe9258 SP= 00000002`9977d300 method=InitializeSourceInfo
TID 4740: STACKWALK: [004] about to unwind for 'InitializeSourceInfo', SP: 00000002`9977d300 , IP: 00007ffd`72fe9258
TID 4740: STACKWALK: [004] finished unwind for 'InitializeSourceInfo', SP: 00000002`9977d480 , IP: 00007ffd`72eeb671
TID 4740: STACKWALK: [004] CONSIDER: FRAMELESS: PC= 00007ffd`72eeb671 SP= 00000002`9977d480 method=CaptureStackTrace
TID 4740: STACKWALK: [004] CALLBACK: FRAMELESS: PC= 00007ffd`72eeb671 SP= 00000002`9977d480 method=CaptureStackTrace
TID 4740: STACKWALK: [005] about to unwind for 'CaptureStackTrace', SP: 00000002`9977d480 , IP: 00007ffd`72eeb671
TID 4740: STACKWALK: [005] finished unwind for 'CaptureStackTrace', SP: 00000002`9977d5b0 , IP: 00007ffd`72eeadd0
TID 4740: STACKWALK: [005] CONSIDER: FRAMELESS: PC= 00007ffd`72eeadd0 SP= 00000002`9977d5b0 method=.ctor
TID 4740: STACKWALK: [005] CALLBACK: FRAMELESS: PC= 00007ffd`72eeadd0 SP= 00000002`9977d5b0 method=.ctor
TID 4740: STACKWALK: [006] about to unwind for '.ctor', SP: 00000002`9977d5b0 , IP: 00007ffd`72eeadd0
TID 4740: STACKWALK: [006] finished unwind for '.ctor', SP: 00000002`9977d5f0 , IP: 00007ffd`14c620d3
TID 4740: STACKWALK: [006] CONSIDER: FRAMELESS: PC= 00007ffd`14c620d3 SP= 00000002`9977d5f0 method=MethodC
TID 4740: STACKWALK: [006] CALLBACK: FRAMELESS: PC= 00007ffd`14c620d3 SP= 00000002`9977d5f0 method=MethodC
TID 4740: STACKWALK: [007] about to unwind for 'MethodC', SP: 00000002`9977d5f0 , IP: 00007ffd`14c620d3
TID 4740: STACKWALK: [007] finished unwind for 'MethodC', SP: 00000002`9977d630 , IP: 00007ffd`14c62066
TID 4740: STACKWALK: [007] CONSIDER: FRAMELESS: PC= 00007ffd`14c62066 SP= 00000002`9977d630 method=MethodB
TID 4740: STACKWALK: [007] CALLBACK: FRAMELESS: PC= 00007ffd`14c62066 SP= 00000002`9977d630 method=MethodB
TID 4740: STACKWALK: [008] about to unwind for 'MethodB', SP: 00000002`9977d630 , IP: 00007ffd`14c62066
TID 4740: STACKWALK: [008] finished unwind for 'MethodB', SP: 00000002`9977d660 , IP: 00007ffd`14c62016
TID 4740: STACKWALK: [008] CONSIDER: FRAMELESS: PC= 00007ffd`14c62016 SP= 00000002`9977d660 method=MethodA
TID 4740: STACKWALK: [008] CALLBACK: FRAMELESS: PC= 00007ffd`14c62016 SP= 00000002`9977d660 method=MethodA
TID 4740: STACKWALK: [009] about to unwind for 'MethodA', SP: 00000002`9977d660 , IP: 00007ffd`14c62016
TID 4740: STACKWALK: [009] finished unwind for 'MethodA', SP: 00000002`9977d690 , IP: 00007ffd`14c61f65
TID 4740: STACKWALK: [009] CONSIDER: FRAMELESS: PC= 00007ffd`14c61f65 SP= 00000002`9977d690 method=Main
TID 4740: STACKWALK: [009] CALLBACK: FRAMELESS: PC= 00007ffd`14c61f65 SP= 00000002`9977d690 method=Main
TID 4740: STACKWALK: [00a] about to unwind for 'Main', SP: 00000002`9977d690 , IP: 00007ffd`14c61f65
TID 4740: STACKWALK: [00a] finished unwind for 'Main', SP: 00000002`9977d6d0 , IP: 00007ffd`742f9073
TID 4740: STACKWALK: [00a] FILTER : NATIVE : PC= 00007ffd`742f9073 SP= 00000002`9977d6d0
TID 4740: STACKWALK: [00b] CONSIDER: EXPLICIT : PC= 00007ffd`742f9073 SP= 00000002`9977d6d0 Frame= 00000002`9977de58 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: [00b] FILTER : EXPLICIT : PC= 00007ffd`742f9073 SP= 00000002`9977d6d0 Frame= 00000002`9977de58 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: [00c] CONSIDER: EXPLICIT : PC= 00007ffd`742f9073 SP= 00000002`9977d6d0 Frame= 00000002`9977e7e0 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: [00c] FILTER : EXPLICIT : PC= 00007ffd`742f9073 SP= 00000002`9977d6d0 Frame= 00000002`9977e7e0 vtbl= 00007ffd`74a105b0
TID 4740: STACKWALK: SWA_DONE: reached the end of the stack
To find out more, you can search for these diagnostic message in \vm\stackwalk.cpp, e.g. in Thread::DebugLogStackWalkInfo(..)
here
Unwinding ‘Native’ Code
As explained in this excellent article:
There are fundamentally two main ways to implement exception propagation in an ABI (Application Binary Interface):
-
“dynamic registration”, with frame pointers in each activation record, organized as a linked list. This makes stack unwinding fast at the expense of having to set up the frame pointer in each function that calls other functions. This is also simpler to implement.
-
“table-driven”, where the compiler and assembler create data structures alongside the program code to indicate which addresses of code correspond to which sizes of activation records. This is called “Call Frame Information” (CFI) data in e.g. the GNU tool chain. When an exception is generated, the data in this table is loaded to determine how to unwind. This makes exception propagation slower but the general case faster.
It turns out that .NET uses the ‘table-driven’ approach, for the reason explained in the ‘BotR’:
The exact definition of a frame varies from platform to platform and on many platforms there isn’t a hard definition of a frame format that all functions adhere to (x86 is an example of this). Instead the compiler is often free to optimize the exact format of frames. On such systems it is not possible to guarantee that a stackwalk will return 100% correct or complete results (for debugging purposes, debug symbols such as pdbs are used to fill in the gaps so that debuggers can generate more accurate stack traces).
This is not a problem for the CLR, however, since we do not require a fully generalized stack walk. Instead we are only interested in those frames that are managed (i.e. represent a managed method) or, to some extent, frames coming from unmanaged code used to implement part of the runtime itself. In particular there is no guarantee about fidelity of 3rd party unmanaged frames other than to note where such frames transition into or out of the runtime itself (i.e. one of the frame types we do care about).
Frames
To enable ‘unwinding’ of native code or more strictly the transitions ‘into’ and ‘out of’ native code, the CLR uses a mechanism of Frames
, which are defined in the source code here. These frames are arranged into a hierachy and there is one type of Frame
for each scenario, for more info on these individual Frames
take a look at the excellent source-code comments here.
- Frame (abstract/base class)
- GCFrame
- FaultingExceptionFrame
- HijackFrame
- ResumableFrame
- InlinedCallFrame
- HelperMethodFrame
- HelperMethodFrame_1OBJ
- HelperMethodFrame_2OBJ
- HelperMethodFrame_3OBJ
- HelperMethodFrame_PROTECTOBJ
- TransitionFrame
- StubHelperFrame
- SecureDelegateFrame
- FramedMethodFrame
- ComPlusMethodFrame
- PInvokeCalliFrame
- PrestubMethodFrame
- StubDispatchFrame
- ExternalMethodFrame
- TPMethodFrame
- UnmanagedToManagedFrame
- ComMethodFrame
- UMThkCallFrame
- ContextTransitionFrame
- TailCallFrame
- ProtectByRefsFrame
- ProtectValueClassFrame
- DebuggerClassInitMarkFrame
- DebuggerSecurityCodeMarkFrame
- DebuggerExitFrame
- DebuggerU2MCatchHandlerFrame
- FuncEvalFrame
- ExceptionFilterFrame
‘Helper Method’ Frames
But to make sense of this, let’s look at one type of Frame
, known as HelperMethodFrame
(above). This is used when .NET code in the runtime calls into C++ code to do the heavy-lifting, often for performance reasons. One example is if you call Environment.GetCommandLineArgs()
you end up in this code (C#), but note that it ends up calling an extern
method marked with InternalCall
:
[MethodImplAttribute(MethodImplOptions.InternalCall)]
private static extern string[] GetCommandLineArgsNative();
This means that the rest of the method is implemented in the runtime in C++, you can see how the method call is wired up, before ending up SystemNative::GetCommandLineArgs
here, which is shown below:
FCIMPL0(Object*, SystemNative::GetCommandLineArgs)
{
FCALL_CONTRACT;
PTRARRAYREF strArray = NULL;
HELPER_METHOD_FRAME_BEGIN_RET_1(strArray); // GetAppDomain());
}
delete [] argv;
HELPER_METHOD_FRAME_END(); // CountOfUnwindCodes]), sizeof(ULONG));
*pPersonalityRoutine = ExecutionManager::GetCLRPersonalityRoutineValue();
#elif defined(_TARGET_ARM64_)
*(LONG *)pUnwindInfo |= (1 End offset : 0x00004e (not in unwind data)
Version : 1
Flags : 0x00
SizeOfProlog : 0x07
CountOfUnwindCodes: 4
FrameRegister : none (0)
FrameOffset : N/A (no FrameRegister) (Value=0)
UnwindCodes :
CodeOffset: 0x07 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 11 * 8 + 8 = 96 = 0x60
CodeOffset: 0x03 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6)
CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rdi (7)
CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5)
Unwind Info:
>> Start offset : 0x00004e (not in unwind data)
>> End offset : 0x0000e2 (not in unwind data)
Version : 1
Flags : 0x00
SizeOfProlog : 0x07
CountOfUnwindCodes: 4
FrameRegister : none (0)
FrameOffset : N/A (no FrameRegister) (Value=0)
UnwindCodes :
CodeOffset: 0x07 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 5 * 8 + 8 = 48 = 0x30
CodeOffset: 0x03 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rsi (6)
CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rdi (7)
CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbp (5)
This ‘unwind info’ is then looked up during a ‘stack walk’ as explained in the How it works section above.
So next time you encounter a ‘stack trace’ remember that a lot of work went into making it possible!!
Further Reading
‘Stack Walking’ or ‘Stack Unwinding’ is a very large topic, so if you want to know more, here are some links to get you started:
Stack Unwinding (general)
Stack Unwinding (other runtimes)
In addition, it’s interesting to look at how other runtimes handles this process:
Mon, 21 Jan 2019, 12:00 am
Exploring the .NET Core Runtime (in which I set myself a challenge)
It seems like this time of year anyone with a blog is doing some sort of ‘advent calendar’, i.e. 24 posts leading up to Christmas. For instance there’s a F# one which inspired a C# one (C# copying from F#, that never happens 😉)
However, that’s a bit of a problem for me, I struggled to write 24 posts in my most productive year, let alone a single month! Also, I mostly blog about ‘.NET Internals’, a subject which doesn’t necessarily lend itself to the more ‘light-hearted’ posts you get in these ‘advent calendar’ blogs.
Until now!
Recently I’ve been giving a talk titled from ‘dotnet run’ to ‘hello world’, which attempts to explain everything that the .NET Runtime does from the point you launch your application till “Hello World” is printed on the screen:
From 'dotnet run' to 'hello world' from
Matt Warren
But as I was researching and presenting this talk, it made me think about the .NET Runtime as a whole, what does it contain and most importantly what can you do with it?
Note: this is mostly for informational purposes, for the recommended way of achieving the same thing, take a look at this excellent Deep-dive into .NET Core primitives by Nate McMaster.
In this post I will explore what you can do using only the code in the dotnet/coreclr repository and along the way we’ll find out more about how the runtime interacts with the wider .NET Ecosystem.
To makes things clearer, there are 3 challenges that will need to be solved before a simple “Hello World” application can be run. That’s because in the dotnet/coreclr repository there is:
- No compiler, that lives in dotnet/Roslyn
- No Framework Class Library (FCL) a.k.a. ‘dotnet/CoreFX’
- No
dotnet run
as it’s implemented in the dotnet/CLI repository
Building the CoreCLR
But before we even work through these ‘challenges’, we need to build the CoreCLR itself. Helpfully there is really nice guide available in ‘Building the Repository’:
The build depends on Git, CMake, Python and of course a C++ compiler. Once these prerequisites are installed
the build is simply a matter of invoking the ‘build’ script (build.cmd
or build.sh
) at the base of the repository.
The details of installing the components differ depending on the operating system. See the following pages based on your OS. There is no cross-building across OS (only for ARM, which is built on X64). You have to be on the particular platform to build that platform.
If you follow these steps successfully, you’ll end up with the following files (at least on Windows, other OSes may produce something slightly different):
No Compiler
First up, how do we get around the fact that we don’t have a compiler? After all we need some way of turing our simple “Hello World” code into a .exe?
namespace Hello_World
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Hello World!");
}
}
}
Fortunately we do have access to the ILASM tool (IL Assembler), which can turn Common Intermediate Language (CIL) into an .exe file. But how do we get the correct IL code? Well, one way is to write it from scratch, maybe after reading Inside NET IL Assembler and Expert .NET 2.0 IL Assembler by Serge Lidin (yes, amazingly, 2 books have been written about IL!)
Another, much easier way, is to use the amazing SharpLab.io site to do it for us! If you paste the C# code from above into it, you’ll get the following IL code:
.class private auto ansi ''
{
} // end of class
.class private auto ansi beforefieldinit Hello_World.Program
extends [mscorlib]System.Object
{
// Methods
.method private hidebysig static
void Main (
string[] args
) cil managed
{
// Method begins at RVA 0x2050
// Code size 11 (0xb)
.maxstack 8
IL_0000: ldstr "Hello World!"
IL_0005: call void [mscorlib]System.Console::WriteLine(string)
IL_000a: ret
} // end of method Program::Main
.method public hidebysig specialname rtspecialname
instance void .ctor () cil managed
{
// Method begins at RVA 0x205c
// Code size 7 (0x7)
.maxstack 8
IL_0000: ldarg.0
IL_0001: call instance void [mscorlib]System.Object::.ctor()
IL_0006: ret
} // end of method Program::.ctor
} // end of class Hello_World.Program
Then, if we save this to a file called ‘HelloWorld.il’ and run the cmd ilasm HelloWorld.il /out=HelloWorld.exe
, we get the following output:
Microsoft (R) .NET Framework IL Assembler. Version 4.5.30319.0
Copyright (c) Microsoft Corporation. All rights reserved.
Assembling 'HelloWorld.il' to EXE --> 'HelloWorld.exe'
Source file is ANSI
HelloWorld.il(38) : warning : Reference to undeclared extern assembly 'mscorlib'. Attempting autodetect
Assembled method Hello_World.Program::Main
Assembled method Hello_World.Program::.ctor
Creating PE file
Emitting classes:
Class 1: Hello_World.Program
Emitting fields and methods:
Global
Class 1 Methods: 2;
Emitting events and properties:
Global
Class 1
Writing PE file
Operation completed successfully
Nice, so part 1 is done, we now have our HelloWorld.exe
file!
No Base Class Library
Well, not exactly, one problem is that System.Console
lives in dotnet/corefx, in there you can see the different files that make up the implementation, such as Console.cs
, ConsolePal.Unix.cs
, ConsolePal.Windows.cs
, etc.
Fortunately, the nice CoreCLR developers included a simple Console
implementation in System.Private.CoreLib.dll
, the managed part of the CoreCLR, which was previously known as ‘mscorlib’ (before it was renamed). This internal version of Console
is pretty small and basic, but it provides enough for what we need.
To use this ‘workaround’ we need to edit our HelloWorld.il
to look like this (note the change from mscorlib
to System.Private.CoreLib
)
.class public auto ansi beforefieldinit C
extends [System.Private.CoreLib]System.Object
{
.method public hidebysig static void M () cil managed
{
.entrypoint
// Code size 11 (0xb)
.maxstack 8
IL_0000: ldstr "Hello World!"
IL_0005: call void [System.Private.CoreLib]Internal.Console::WriteLine(string)
IL_000a: ret
} // end of method C::M
...
}
Note: You can achieve the same thing with C# code instead of raw IL, by invoking the C# compiler with the following cmd-line:
csc -optimize+ -nostdlib -reference:System.Private.Corelib.dll -out:HelloWorld.exe HelloWorld.cs
So we’ve completed part 2, we are able to at least print “Hello World” to the screen without using the CoreFX repository!
Now this is a nice little trick, but I wouldn’t ever recommend writing real code like this. Compiling against System.Private.CoreLib
isn’t the right way of doing things. What the compiler normally does is compile against the publicly exposed surface area that lives in dotnet/corefx, but then at run-time a process called ‘Type-Forwarding’ is used to make that ‘reference’ implementation in CoreFX map to the ‘real’ implementation in the CoreCLR. For more on this entire process see The Rough History of Referenced Assemblies.
However, only a small amount of managed code (i.e. C#) actually exists in the CoreCLR, to show this, the directory tree for /dotnet/coreclr/src/System.Private.CoreLib is available here and the tree with all ~1280 .cs files included is here.
As a concrete example, if you look in CoreFX, you’ll see that the System.Reflection implementation is pretty empty! That’s because it’s a ‘partial facade’ that is eventually ‘type-forwarded’ to System.Private.CoreLib.
If you’re interested, the entire API that is exposed in CoreFX (but actually lives in CoreCLR) is contained in System.Runtime.cs. But back to our example, here is the code that describes all the GetMethod(..)
functions in the ‘System.Reflection’ API.
To learn more about ‘type forwarding’, I recommend watching ‘.NET Standard - Under the Hood’ (slides) by Immo Landwerth and there is also some more in-depth information in ‘Evolution of design time assemblies’.
But why is this code split useful, from the CoreFX README:
Runtime-specific library code (mscorlib) lives in the CoreCLR repo. It needs to be built and versioned in tandem with the runtime. The rest of CoreFX is agnostic of runtime-implementation and can be run on any compatible .NET runtime (e.g. CoreRT).
And from the other point-of-view, in the CoreCLR README:
By itself, the Microsoft.NETCore.Runtime.CoreCLR
package is actually not enough to do much. One reason for this is that the CoreCLR package tries to minimize the amount of the class library that it implements. Only types that have a strong dependency on the internal workings of the runtime are included (e.g, System.Object
, System.String
, System.Threading.Thread
, System.Threading.Tasks.Task
and most foundational interfaces).
Instead most of the class library is implemented as independent NuGet packages that simply use the .NET Core runtime as a dependency. Many of the most familiar classes (System.Collections
, System.IO
, System.Xml
and so on), live in packages defined in the dotnet/corefx repository.
One huge benefit of this approach is that Mono can share large amounts of the CoreFX code, as shown in this tweet:
How Mono reuses .NET Core sources for BCL (doesn't include runtime, tools, etc) according to my calculations 🙂 pic.twitter.com/8JCDxqwnNi
— Egor Bogatov (@EgorBo)
March 27, 2018
No Launcher
So far we’ve ‘compiled’ our code (well technically ‘assembled’ it) and we’ve been able to access a simple version of System.Console
, but how do we actually run our .exe
? Remember we can’t use the dotnet run
command because that lives in the dotnet/CLI repository (and that would be breaking the rules of this slightly contrived challenge!!).
Again, fortunately those clever runtime engineers have thought of this exact scenario and they built the very helpful corerun
application. You can read more about in Using corerun To Run .NET Core Application, but the td;dr is that it will only look for dependencies in the same folder as your .exe.
So, to complete the challenge, we can now run CoreRun HelloWorld.exe
:
# CoreRun HelloWorld.exe
Hello World!
Yay, the least impressive demo you’ll see this year!!
For more information on how you can ‘host’ the CLR in your application I recommend this excellent tutorial Write a custom .NET Core host to control the .NET runtime from your native code. In addition, the docs page on ‘Runtime Hosts’ gives a nice overview of the different hosts that are available:
The .NET Framework ships with a number of different runtime hosts, including the hosts listed in the following table.
Runtime Host
Description
ASP.NET
Loads the runtime into the process that is to handle the Web request. ASP.NET also creates an application domain for each Web application that will run on a Web server.
Microsoft Internet Explorer
Creates application domains in which to run managed controls. The .NET Framework supports the download and execution of browser-based controls. The runtime interfaces with the extensibility mechanism of Microsoft Internet Explorer through a mime filter to create application domains in which to run the managed controls. By default, one application domain is created for each Web site.
Shell executables
Invokes runtime hosting code to transfer control to the runtime each time an executable is launched from the shell.
Thu, 13 Dec 2018, 12:00 am
Open Source .NET – 4 years later
A little over 4 years ago Microsoft announced that they were open sourcing large parts of the .NET framework and as this slide from New Features in .NET Core and ASP.NET Core 2.1 shows, the community has been contributing in a significant way:
Side-note: This post forms part of an on-going series, if you want to see how things have changed over time you can check out the previous ones:
Runtime Changes
Before I look at the numbers, I just want to take a moment to look at the significant runtime changes that have taken place over the last 4 years. Partly because I really like looking at the ‘Internals’ of CoreCLR, but also because the runtime is the one repository that makes all the others possible, they rely on it!
To give some context, here’s the slides from a presentation I did called ‘From ‘dotnet run’ to ‘hello world’. If you flick through them you’ll see what components make up the CoreCLR code-base and what they do to make your application run.
From 'dotnet run' to 'hello world' from
Matt Warren
So, after a bit of digging through the 19,059 commits, 5,790 issues and the 8 projects, here’s the list of significant changes in the .NET Core Runtime (CoreCLR) over the last few years (if I’ve missed any out, please let me know!!):
Span
(more info)
ref-like
like types (to support Span
)
- Tiered Compilation (more info)
- Cross-platform (Unix, OS X, etc, see list of all ‘os-xxx’ labels)
- New CPU Architectures
- Hardware Intrinsics (project)
- Default Interface Methods (project)
- Performance Monitoring and Diagnostics (project)
- Ready-to-Run Images
- LocalGC (project)
- Unloadability (project)
So there’s been quite a few large, fundamental changes to the runtime since it’s been open-sourced.
Repository activity over time
But onto the data, first we are going to look at an overview of the level of activity in each repo, by analysing the total number of ‘Issues’ (created) or ‘Pull Requests’ (closed) per month. (Sparklines FTW!!). If you are interested in how I got the data, see the previous post because the process is the same.
Note: Numbers in black are from the most recent month, with the red dot showing the lowest and the green dot the highest previous value. You can toggle between Issues and Pull Requests by clicking on the buttons, hover over individual sparklines to get a tooltip showing the per/month values and click on the project name to take you to the GitHub page for that repository.
Issues
Pull Requests
This data gives a good indication of how healthy different repos are, are they growing over time, or staying the same. You can also see the different levels of activity each repo has and how they compare to other ones.
Whilst it’s clear that Visual Studio Code is way ahead of all the other repos (in ‘# of Issues’), it’s interesting to see that some of the .NET-only ones are still pretty large, notably CoreFX (base-class libraries), Roslyn (compiler) and CoreCLR (runtime).
Next will will look at the total participation from the last 4 years, i.e. November 2014 to November 2018. All Pull Requests and Issues are treated equally, so a large PR counts the same as one that fixes a speling mistake. Whilst this isn’t ideal it’s the simplest way to get an idea of the Microsoft/Community split. In addition, Community does include people paid by other companies to work on .NET Projects, for instance Samsung Engineers.
Note: You can hover over the bars to get the actual numbers, rather than percentages.
Issues:
Microsoft
Community
Pull Requests:
Microsoft
Community
Finally we can see the ‘per-month’ data from the last 4 years, i.e. November 2014 to November 2018.
Note: You can inspect different repos by selecting them from the pull-down list, but be aware that the y-axis on the graphs are re-scaled, so the maximum value will change each time.
Issues:
Microsoft
Community
Pull Requests:
Microsoft
Community
Summary
It’s clear that the community continues to be invested in the .NET-related, Open Source repositories, contributing significantly and for a sustained period of time. I think this is good for all .NET developers, whether you contribute to OSS or not, having .NET be a thriving, Open Source product has many benefits!
Tue, 4 Dec 2018, 12:00 am
A History of .NET Runtimes
Recently I was fortunate enough to chat with Chris Bacon who wrote DotNetAnywhere (an alternative .NET Runtime) and I quipped with him:
.. you’re probably one of only a select group(*) of people who’ve written a .NET runtime, that’s pretty cool!
* if you exclude people who were paid to work on one, i.e. Microsoft/Mono/Xamarin engineers, it’s a very select group.
But it got me thinking, how many .NET Runtimes are there? I put together my own list, then enlisted a crack team of highly-paid researchers, a.k.a my twitter followers:
#LazyWeb, fun Friday quiz, how many different .NET Runtimes are there? (that implement ECMA-335 https://t.co/76stuYZLrw)
- .NET Framework
- .NET Core
- Mono
- Unity
- .NET Compact Framework
- DotNetAnywhere
- Silverlight
What have I missed out?
— Matt Warren (@matthewwarren)
September 14, 2018
For the purposes of this post I’m classifying a ‘.NET Runtime’ as anything that implements the ECMA-335 Standard for .NET (more info here). I don’t know if there’s a more precise definition or even some way of officially veryifying conformance, but in practise it means that the runtimes can take a .NET exe/dll produced by any C#/F#/VB.NET compiler and run it.
Once I had the list, I made copious use of wikipedia (see the list of ‘References’) and came up with the following timeline:
Timeline maker
(If the interactive timeline isn’t working for you, take a look at this version)
If I’ve missed out any runtimes, please let me know!
To make the timeline a bit easier to understand, I put each runtime into one of the following categories:
-
Microsoft .NET Frameworks
-
Other Microsoft Runtimes
-
Mono/Xamarin Runtimes
-
'Ahead-of-Time' (AOT) Runtimes
-
Community Projects
-
Research Projects
The rest of the post will look at the different runtimes in more detail. Why they were created, What they can do and How they compare to each other.
Microsoft .NET Frameworks
The original ‘.NET Framework’ was started by Microsoft in the late 1990’s and has been going strong ever since. Recently they’ve changed course somewhat with the announcement of .NET Core, which is ‘open-source’ and ‘cross-platform’. In addition, by creating the .NET Standard they’ve provided a way for different runtimes to remain compatible:
.NET Standard is for sharing code. .NET Standard is a set of APIs that all .NET implementations must provide to conform to the standard. This unifies the .NET implementations and prevents future fragmentation.
As an aside, if you want more information on the ‘History of .NET’, I really recommend Anders Hejlsberg - What brought about the birth of the CLR? and this presentation by Richard Campbell who really knows how to tell a story!
(Also available as a podcast if you’d prefer and he’s working on a book covering the same subject. If you want to learn more about the history of the entire ‘.NET Ecosystem’ not just the Runtimes, check out ‘Legends of .NET’)
Other Microsoft Runtimes
But outside of the main general purpose ‘.NET Framework’, Microsoft have also released other runtimes, designed for specific scenarios.
.NET Compact Framework
The Compact (.NET CF) and Micro (.NET MF) Frameworks were both attempts to provide cut-down runtimes that would run on more constrained devices, for instance .NET CF:
… is designed to run on resource constrained mobile/embedded devices such as personal digital assistants (PDAs), mobile phones factory controllers, set-top boxes, etc. The .NET Compact Framework uses some of the same class libraries as the full .NET Framework and also a few libraries designed specifically for mobile devices such as .NET Compact Framework controls. However, the libraries are not exact copies of the .NET Framework; they are scaled down to use less space.
.NET Micro Framework
The .NET MF is even more constrained:
… for resource-constrained devices with at least 256 KB of flash and 64 KB of random-access memory (RAM). It includes a small version of the .NET Common Language Runtime (CLR) and supports development in C#, Visual Basic .NET, and debugging (in an emulator or on hardware) using Microsoft Visual Studio. NETMF features a subset of the .NET base class libraries (about 70 classes with about 420 methods),..
NETMF also features added libraries specific to embedded applications. It is free and open-source software released under Apache License 2.0.
If you want to try it out, Scott Hanselman did a nice write-up The .NET Micro Framework - Hardware for Software People.
Silverlight
Although now only in support mode (or ‘dead’/‘sunsetted’ depending on your POV), it’s interesting to go back to the original announcement and see what Silverlight was trying to do:
Silverlight is a cross platform, cross browser .NET plug-in that enables designers and developers to build rich media experiences and RIAs for browsers. The preview builds we released this week currently support Firefox, Safari and IE browsers on both the Mac and Windows.
Back in 2007, Silverlight 1.0 had the following features (it even worked on Linux!):
- Built-in codec support for playing VC-1 and WMV video, and MP3 and WMA audio within a browser…
- Silverlight supports the ability to progressively download and play media content from any web-server…
- Silverlight also optionally supports built-in media streaming…
- Silverlight enables you to create rich UI and animations, and blend vector graphics with HTML to create compelling content experiences…
- Silverlight makes it easy to build rich video player interactive experiences…
Mono/Xamarin Runtimes
Mono came about when Miguel de Icaza and others explored the possibility of making .NET work on Linux (from Mono early history):
Who came first is not an important question to me, because Mono to me is a means to an end: a technology to help Linux succeed on the desktop.
The same post also talks about how it started:
On the Mono side, the events were approximately like this:
As soon as the .NET documents came out in December 2000, I got really interested in the technology, and started where everyone starts: at the byte code interpreter, but I faced a problem: there was no specification for the metadata though.
The last modification to the early VM sources was done on January 22 2001, around that time I started posting to the .NET mailing lists asking for the missing information on the metadata file format.
…
About this time Sam Ruby was pushing at the ECMA committee to get the binary file format published, something that was not part of the original agenda. I do not know how things developed, but by April 2001 ECMA had published the file format.
Over time, Mono (now Xamarin) has branched out into wider areas. It runs on Android and iOS/Mac and was acquired by Microsoft in Feb 2016. In addition Unity & Mono/Xamarim have long worked together, to provide C# support in Unity and Unity is now a member of the .NET Foundation.
'Ahead-of-Time' (AOT) Runtimes
I wanted to include AOT runtimes as a seperate category, because traditionally .NET has been ‘Just-in-Time’ Compiled, but over time more and more ‘Ahead-of-Time’ compilation options have been available.
As far as I can tell, Mono was the first, with an ‘AOT’ mode since Aug 2006, but recently, Microsoft have released .NET Native and are they’re working on CoreRT - A .NET Runtime for AOT.
Community Projects
However, not all ‘.NET Runtimes’ were developed by Microsoft, or companies that they later acquired. There are some ‘Community’ owned ones:
- The oldest is DotGNU Portable.NET, which started at the same time as Mono, with the goal ‘to build a suite of Free Software tools to compile and execute applications for the Common Language Infrastructure (CLI)..’.
- Secondly, there is DotNetAnywhere, the work of just one person, Chris Bacon. DotNetAnywhere has the claim to fame that it provided the initial runtime for the Blazor project. However it’s also an excellent resource if you want to look at what makes up a ‘.NET Compatible-Runtime’ and don’t have the time to wade through the millions of lines-of-code that make up the CoreCLR!
- Next comes CosmosOS (GitHub project), which is not just a .NET Runtime, but a ‘Managed Operating System’. If you want to see how it achieves this I recommend reading through the excellent FAQ or taking a quick look under the hood. Another similar effort is SharpOS.
- Finally, I recently stumbled across CrossNet, which takes a different approach, it ‘parses .NET assemblies and generates unmanaged C++ code that can be compiled on any standard C++ compiler.’ Take a look at the overview docs and example of generated code to learn more.
Research Projects
Finally, onto the more esoteric .NET Runtimes. These are the Research Projects run by Microsoft, with the aim of seeing just how far can you extend a ‘managed runtime’, what can they be used for. Some of this research work has made it’s way back into commercial/shipping .NET Runtimes, for instance Span came from Midori.
Shared Source Common Language Infrastructure (SSCLI) (a.k.a ‘Rotor):
is Microsoft’s shared source implementation of the CLI, the core of .NET. Although the SSCLI is not suitable for commercial use due to its license, it does make it possible for programmers to examine the implementation details of many .NET libraries and to create modified CLI versions. Microsoft provides the Shared Source CLI as a reference CLI implementation suitable for educational use.
An interesting side-effect of releasing Rotor is that they were also able to release the ‘Gyro’ Project, which gives an idea of how Generics were added to the .NET Runtime.
Midori:
Midori was the code name for a managed code operating system being developed by Microsoft with joint effort of Microsoft Research. It had been reported to be a possible commercial implementation of the Singularity operating system, a research project started in 2003 to build a highly dependable operating system in which the kernel, device drivers, and applications are all written in managed code. It was designed for concurrency, and could run a program spread across multiple nodes at once. It also featured a security model that sandboxes applications for increased security. Microsoft had mapped out several possible migration paths from Windows to Midori. The operating system was discontinued some time in 2015, though many of its concepts were rolled into other Microsoft projects.
Midori is the project that appears to have led to the most ideas making their way back into the ‘.NET Framework’, you can read more about this in Joe Duffy’s excellent series Blogging about Midori
- A Tale of Three Safeties
- Objects as Secure Capabilities
- Asynchronous Everything
- Safe Native Code
- The Error Model
- Performance Culture
- 15 Years of Concurrency
Singularity (operating system) (also Singularity RDK)
Singularity is an experimental operating system (OS) which was built by Microsoft Research between 2003 and 2010. It was designed as a high dependability OS in which the kernel, device drivers, and application software were all written in managed code. Internal security uses type safety instead of hardware memory protection.
Last, but not least, there is Redhawk:
Codename for experimental minimal managed code runtime that evolved into CoreRT.
References
Below are the Wikipedia articles I referenced when creating the timeline:
Tue, 2 Oct 2018, 12:00 am
Fuzzing the .NET JIT Compiler
I recently came across the excellent ‘Fuzzlyn’ project, created as part of the ‘Language-Based Security’ course at Aarhus University. As per the project description Fuzzlyn is a:
… fuzzer which utilizes Roslyn to generate random C# programs
And what is a ‘fuzzer’, from the Wikipedia page for ‘fuzzing’:
Fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program.
Or in other words, a fuzzer is a program that tries to create source code that finds bugs in a compiler.
Massive kudos to the developers behind Fuzzlyn, Jakob Botsch Nielsen (who helped answer my questions when writing this post), Chris Schmidt and Jonas Larsen, it’s an impressive project!! (to be clear, I have no link with the project and can’t take any of the credit for it)
Compilation in .NET
But before we dive into ‘Fuzzlyn’ and what it does, we’re going to take a quick look at ‘compilation’ in the .NET Framework. When you write C#/VB.NET/F# code (delete as appropriate) and compile it, the compiler converts it into Intermediate Language (IL) code. The IL is then stored in a .exe or .dll, which the Common Language Runtime (CLR) reads and executes when your program is actually run. However it’s the job of the Just-in-Time (JIT) Compiler to convert the IL code into machine code.
Why is this relevant? Because Fuzzlyn works by comparing the output of a Debug and a Release version of a program and if they are different, there’s a bug! But it turns out that very few optimisations are actually done by the ‘Roslyn’ compiler, compared to what the JIT does, from Eric Lippert’s excellent post What does the optimize switch do? (2009)
The /optimize flag does not change a huge amount of our emitting and generation logic. We try to always generate straightforward, verifiable code and then rely upon the jitter to do the heavy lifting of optimizations when it generates the real machine code. But we will do some simple optimizations with that flag set. For example, with the flag set:
He then goes on to list the 15 things that the C# Compiler will optimise, before finishing with this:
That’s pretty much it. These are very straightforward optimizations; there’s no inlining of IL, no loop unrolling, no interprocedural analysis whatsoever. We let the jitter team worry about optimizing the heck out of the code when it is actually spit into machine code; that’s the place where you can get real wins.
So in .NET, very few of the techniques that an ‘Optimising Compiler’ uses are done at compile-time. They are almost all done at run-time by the JIT Compiler (leaving aside AOT scenarios for the time being).
For reference, most of the differences in IL are there to make the code easier to debug, for instance given this C# code:
public void M() {
foreach (var item in new [] { 1, 2, 3, 4 }) {
Console.WriteLine(item);
}
}
The differences in IL are shown below (‘Release’ on the left, ‘Debug’ on the right). As you can see there are a few extra nop
instructions to allow the debugger to ‘step-through’ more locations in the code, plus an extra local variable, which makes it easier/possible to see the value when debugging.
(click for larger image or you can view the ‘Release’ version and the ‘Debug’ version on the excellent SharpLab)
For more information on the differences in Release/Debug code-gen see the ‘Release (optimized)’ section in this doc on CodeGen Differences. Also, because Roslyn is open-source we can see how this is handled in the code:
This all means that the ‘Fuzzlyn’ project has actually been finding bugs in the .NET JIT, not in the Roslyn Compiler
(well, except this one Finally block belonging to unexecuted try runs anyway, which was fixed here)
How it works
At the simplest level, Fuzzlyn works by compiling and running a piece of randomly generated code in ‘Debug’ and ‘Release’ versions and comparing the output. If the 2 versions produce different results, then it’s a bug, specifically a bug in the optimisations that the JIT compiler has attempted.
The .NET JIT, known as ‘RyuJIT’, has several modes. It can produce fully optimised code that has the highest-performance, or in can produce more ‘debug’ friendly code that has no optimisations, but is much simpler. You can find out more about the different ‘optimisations’ that RyuJIT performs in this excellent tutorial, in this design doc or you can search through the code for usages of the ‘compDbgCode’ flag.
From a high-level Fuzzlyn goes through the following steps:
- Randomly generate a C# program
- Check if the code produces an error (Debug v. Release)
- Reduce the code to it’s simplest form
If you want to see this in action, I ran Fuzzlyn until it produced a randomly generated program with a bug. You can see the original source (6,802 LOC) and the reduced version (28 LOC). What’s interesting is that you can clearly see the buggy line-of-code in the original code, before it’s turned into a simplified version:
// Generated by Fuzzlyn v1.1 on 2018-08-22 15:19:26
// Seed: 14928117313359926641
// Reduced from 256.3 KiB to 0.4 KiB in 00:01:58
// Debug: Prints 0 line(s)
// Release: Prints 1 line(s)
public class Program
{
static short s_18;
static byte s_33 = 1;
static int[] s_40 = new int[]{0};
static short s_74 = 1;
public static void Main()
{
s_18 = -1;
// This comparision is the bug, in Debug it's False, in Release it's True
// However, '(ushort)(s_18 | 2L)' is 65,535 in Debug *and* Release
if (((ushort)(s_18 | 2L)
Tue, 28 Aug 2018, 12:00 am
Monitoring and Observability in the .NET Runtime
.NET is a managed runtime, which means that it provides high-level features that ‘manage’ your program for you, from Introduction to the Common Language Runtime (CLR) (written in 2007):
The runtime has many features, so it is useful to categorize them as follows:
- Fundamental features – Features that have broad impact on the design of other features. These include:
- Garbage Collection
- Memory Safety and Type Safety
- High level support for programming languages.
- Secondary features – Features enabled by the fundamental features that may not be required by many useful programs:
- Program isolation with AppDomains
- Program Security and sandboxing
- Other Features – Features that all runtime environments need but that do not leverage the fundamental features of the CLR. Instead, they are the result of the desire to create a complete programming environment. Among them are:
- Versioning
- Debugging/Profiling
- Interoperation
You can see that ‘Debugging/Profiling’, whilst not a Fundamental or Secondary feature, still makes it into the list because of a ‘desire to create a complete programming environment’.
The rest of this post will look at what Monitoring, Observability and Introspection features the Core CLR provides, why they’re useful and how it provides them.
To make it easier to navigate, the post is split up into 3 main sections (with some ‘extra-reading material’ at the end):
Diagnostics
Firstly we are going to look at the diagnostic information that the CLR provides, which has traditionally been supplied via ‘Event Tracing for Windows’ (ETW).
There is quite a wide range of events that the CLR provides related to:
- Garbage Collection (GC)
- Just-in-Time (JIT) Compilation
- Module and AppDomains
- Threading and Lock Contention
- and much more
For example this is where the AppDomain Load event is fired, this is the Exception Thrown event and here is the GC Allocation Tick event.
Perf View
If you want to see the ETW Events coming from your .NET program I recommend using the excellent PerfView tool and starting with these PerfView Tutorials or this excellent talk PerfView: The Ultimate .NET Performance Tool. PerfView is widely regarded because it provides invaluable information, for instance Microsoft Engineers regularly use it for performance investigations.
Common Infrastructure
However, in case it wasn’t clear from the name, ETW events are only available on Windows, which doesn’t really fit into the new ‘cross-platform’ world of .NET Core. You can use PerfView for Performance Tracing on Linux (via LTTng), but that is only the cmd-line collection tool, known as ‘PerfCollect’, the analysis and rich UI (which includes flamegraphs) is currently Windows only.
But if you do want to analyse .NET Performance Linux, there are some other approaches:
The 2nd link above discusses the new ‘EventPipe’ infrastructure that is being worked on in .NET Core (along with EventSources & EventListeners, can you spot a theme!), you can see its aims in Cross-Platform Performance Monitoring Design. At a high-level it will provide a single place for the CLR to push ‘events’ related to diagnostics and performance. These ‘events’ will then be routed to one or more loggers which may include ETW, LTTng, and BPF for example, with the exact logger being determined by which OS/Platform the CLR is running on. There is also more background information in .NET Cross-Plat Performance and Eventing Design that explains the pros/cons of the different logging technologies.
All the work being done on ‘Event Pipes’ is being tracked in the ‘Performance Monitoring’ project and the associated ‘EventPipe’ Issues.
Future Plans
Finally, there are also future plans for a Performance Profiling Controller which has the following goal:
The controller is responsible for control of the profiling infrastructure and exposure of performance data produced by .NET performance diagnostics components in a simple and cross-platform way.
The idea is for it to expose the following functionality via a HTTP server, by pulling all the relevant data from ‘Event Pipes’:
REST APIs
- Pri 1: Simple Profiling: Profile the runtime for X amount of time and return the trace.
- Pri 1: Advanced Profiling: Start tracing (along with configuration)
- Pri 1: Advanced Profiling: Stop tracing (the response to calling this will be the trace itself)
- Pri 2: Get the statistics associated with all EventCounters or a specified EventCounter.
Browsable HTML Pages
- Pri 1: Textual representation of all managed code stacks in the process.
- Provides an snapshot overview of what’s currently running for use as a simple diagnostic report.
- Pri 2: Display the current state (potentially with history) of EventCounters.
- Provides an overview of the existing counters and their values.
- OPEN ISSUE: I don’t believe the necessary public APIs are present to enumerate EventCounters.
I’m excited to see where the ‘Performance Profiling Controller’ (PPC?) goes, I think it’ll be really valuable for .NET to have this built-in to the CLR, it’s something that other runtimes have.
Profiling
Another powerful feature the CLR provides is the Profiling API, which is (mostly) used by 3rd party tools to hook into the runtime at a very low-level. You can find our more about the API in this overview, but at a high-level, it allows your to wire up callbacks that are triggered when:
- GC-related events happen
- Exceptions are thrown
- Assemblies are loaded/unloaded
- much, much more
Image from the BOTR page Profiling API – Overview
In addition is has other very power features. Firstly you can setup hooks that are called every time a .NET method is executed whether in the runtime or from users code. These callbacks are known as ‘Enter/Leave’ hooks and there is a nice sample that shows how to use them, however to make them work you need to understand ‘calling conventions’ across different OSes and CPU architectures, which isn’t always easy. Also, as a warning, the Profiling API is a COM component that can only be accessed via C/C++ code, you can’t use it from C#/F#/VB.NET!
Secondly, the Profiler is able to re-write the IL code of any .NET method before it is JITted, via the SetILFunctionBody() API. This API is hugely powerful and forms the basis of many .NET APM Tools, you can learn more about how to use it in my previous post How to mock sealed classes and static methods and the accompanying code.
ICorProfiler API
It turns out that the run-time has to perform all sorts of crazy tricks to make the Profiling API work, just look at what went into this PR Allow rejit on attach (for more info on ‘ReJIT’ see ReJIT: A How-To Guide).
The overall definition for all the Profiling API interfaces and callbacks is found in \vm\inc\corprof.idl (see Interface description language). But it’s divided into 2 logical parts, one is the Profiler -> ‘Execution Engine’ (EE) interface, known asICorProfilerInfo
:
// Declaration of class that implements the ICorProfilerInfo* interfaces, which allow the
// Profiler to communicate with the EE. This allows the Profiler DLL to get
// access to private EE data structures and other things that should never be exported
// outside of the EE.
Which is implemented in the following files:
The other main part is the EE -> Profiler callbacks, which are grouped together under the ICorProfilerCallback
interface:
// This module implements wrappers around calling the profiler's
// ICorProfilerCallaback* interfaces. When code in the EE needs to call the
// profiler, it goes through EEToProfInterfaceImpl to do so.
These callbacks are implemented across the following files:
Finally, it’s worth pointing out that the Profiler APIs might not work across all OSes and CPU-archs that .NET Core runs on, e.g. ELT call stub issues on Linux, see Status of CoreCLR Profiler APIs for more info.
Profiling v. Debugging
As a quick aside, ‘Profiling’ and ‘Debugging’ do have some overlap, so it’s helpful to understand what the different APIs provide in the context of the .NET Runtime, from CLR Debugging vs. CLR Profiling
Debugging
Debugging means different things to different people, for instance I asked on Twitter “what are the ways that you’ve debugged a .NET program” and got a wide range of different responses, although both sets of responses contain a really good list of tools and techniques, so they’re worth checking out, thanks #LazyWeb!
But perhaps this quote best sums up what Debugging really is 😊
Debugging is like being the detective in a crime movie where you are also the murderer.
— Filipe Fortes (@fortes)
November 10, 2013
The CLR provides a very extensive range of features related to Debugging, but why does it need to provide these services, the excellent post Why is managed debugging different than native-debugging? provides 3 reasons:
- Native debugging can be abstracted at the hardware level but managed debugging needs to be abstracted at the IL level
- Managed debugging needs a lot of information not available until runtime
- A managed debugger needs to coordinate with the Garbage Collector (GC)
So to give a decent experience, the CLR has to provide the higher-level debugging API known as ICorDebug
, which is shown in the image below of a ‘common debugging scenario’ from the BOTR:
In addition, there is a nice description of how the different parts interact in How do Managed Breakpoints work?:
Here’s an overview of the pipeline of components:
1) End-user
2) Debugger (such as Visual Studio or MDbg).
3) CLR Debugging Services (which we call "The Right Side"). This is the implementation of ICorDebug (in mscordbi.dll).
---- process boundary between Debugger and Debuggee ----
4) CLR. This is mscorwks.dll. This contains the in-process portion of the debugging services (which we call "The Left Side") which communicates directly with the RS in stage #3.
5) Debuggee's code (such as end users C# program)
ICorDebug API
But how is all this implemented and what are the different components, from CLR Debugging, a brief introduction:
All of .Net debugging support is implemented on top of a dll we call “The Dac”. This file (usually named mscordacwks.dll
) is the building block for both our public debugging API (ICorDebug
) as well as the two private debugging APIs: The SOS-Dac API and IXCLR.
In a perfect world, everyone would use ICorDebug
, our public debugging API. However a vast majority of features needed by tool developers such as yourself is lacking from ICorDebug
. This is a problem that we are fixing where we can, but these improvements go into CLR v.next, not older versions of CLR. In fact, the ICorDebug
API only added support for crash dump debugging in CLR v4. Anyone debugging CLR v2 crash dumps cannot use ICorDebug
at all!
(for an additional write-up, see SOS & ICorDebug)
The ICorDebug
API is actually split up into multiple interfaces, there are over 70 of them!! I won’t list them all here, but I will show the categories they fall into, for more info see Partition of ICorDebug where this list came from, as it goes into much more detail.
- Top-level: ICorDebug + ICorDebug2 are the top-level interfaces which effectively serve as a collection of ICorDebugProcess objects.
- Callbacks: Managed debug events are dispatched via methods on a callback object implemented by the debugger
- Process: This set of interfaces represents running code and includes the APIs related to eventing.
- Code / Type Inspection: Could mostly operate on a static PE image, although there are a few convenience methods for live data.
- Execution Control: Execution is the ability to “inspect” a thread’s execution. Practically, this means things like placing breakpoints (F9) and doing stepping (F11 step-in, F10 step-over, S+F11 step-out). ICorDebug’s Execution control only operates within managed code.
- Threads + Callstacks: Callstacks are the backbone of the debugger’s inspection functionality. The following interfaces are related to taking a callstack. ICorDebug only exposes debugging managed code, and thus the stacks traces are managed-only.
- Object Inspection: Object inspection is the part of the API that lets you see the values of the variables throughout the debuggee. For each interface, I list the “MVP” method that I think must succinctly conveys the purpose of that interface.
One other note, as with the Profiling APIs the level of support for the Debugging API varies across OS’s and CPU architectures. For instance, as of Aug 2018 there’s “no solution for Linux ARM of managed debugging and diagnostic”. For more info on ‘Linux’ support in general, see this great post Debugging .NET Core on Linux with LLDB and check-out the Diagnostics repository from Microsoft that has the goal of making it easier to debug .NET programs on Linux.
Finally, if you want to see what the ICorDebug
APIs look like in C#, take a look at the wrappers included in CLRMD library, include all the available callbacks (CLRMD will be covered in more depth, later on in this post).
SOS and the DAC
The ‘Data Access Component’ (DAC) is discussed in detail in the BOTR page, but in essence it provides ‘out-of-process’ access to the CLR data structures, so that their internal details can be read from another process. This allows a debugger (via ICorDebug
) or the ‘Son of Strike’ (SOS) extension to reach into a running instance of the CLR or a memory dump and find things like:
- all the running threads
- what objects are on the managed heap
- full information about a method, including the machine code
- the current ‘stack trace’
Quick aside, if you want an explanation of all the strange names and a bit of a ‘.NET History Lesson’ see this Stack Overflow answer.
The full list of SOS Commands is quite impressive and using it along-side WinDBG allows you a very low-level insight into what’s going on in your program and the CLR. To see how it’s implemented, lets take a look at the !HeapStat
command that gives you a summary of the size of different Heaps that the .NET GC is using:
(image from SOS: Upcoming release has a few new commands – HeapStat)
Here’s the code flow, showing how SOS and the DAC work together:
- SOS The full
!HeapStat
command (link)
- SOS The code in the
!HeapStat
command that deals with the ‘Workstation GC’ (link)
- SOS
GCHeapUsageStats(..)
function that does the heavy-lifting (link)
- Shared The
DacpGcHeapDetails
data structure that contains pointers to the main data in the GC heap, such as segments, card tables and individual generations (link)
- DAC
GetGCHeapStaticData
function that fills-out the DacpGcHeapDetails
struct (link)
- Shared the
DacpHeapSegmentData
data structure that contains details for an individual ‘segment’ with the GC Heap (link)
- DAC
GetHeapSegmentData(..)
that fills-out the DacpHeapSegmentData
struct (link)
3rd Party ‘Debuggers’
Because Microsoft published the debugging API it allowed 3rd parties to make use of the use of the ICorDebug
interfaces, here’s a list of some that I’ve come across:
Memory Dumps
The final area we are going to look at is ‘memory dumps’, which can be captured from a live system and analysed off-line. The .NET runtime has always had good support for creating ‘memory dumps’ on Windows and now that .NET Core is ‘cross-platform’, the are also tools available do the same on other OSes.
One of the issues with ‘memory dumps’ is that it can be tricky to get hold of the correct, matching versions of the SOS and DAC files. Fortunately Microsoft have just released the dotnet symbol
CLI tool that:
can download all the files needed for debugging (symbols, modules, SOS and DAC for the coreclr module given) for any given core dump, minidump or any supported platform’s file formats like ELF, MachO, Windows DLLs, PDBs and portable PDBs.
Finally, if you spend any length of time analysing ‘memory dumps’ you really should take a look at the excellent CLR MD library that Microsoft released a few years ago. I’ve previously written about what you can do with it, but in a nutshell, it allows you to interact with memory dumps via an intuitive C# API, with classes that provide access to the ClrHeap, GC Roots, CLR Threads, Stack Frames and much more. In fact, aside from the time needed to implemented the work, CLR MD could implement most (if not all) of the SOS commands.
But how does it work, from the announcement post:
The ClrMD managed library is a wrapper around CLR internal-only debugging APIs. Although those internal-only APIs are very useful for diagnostics, we do not support them as a public, documented release because they are incredibly difficult to use and tightly coupled with other implementation details of the CLR. ClrMD addresses this problem by providing an easy-to-use managed wrapper around these low-level debugging APIs.
By making these APIs available, in an officially supported library, Microsoft have enabled developers to build a wide range of tools on top of CLRMD, which is a great result!
So in summary, the .NET Runtime provides a wide-range of diagnostic, debugging and profiling features that allow a deep-insight into what’s going on inside the CLR.
Discuss this post on HackerNews, /r/programming or /r/csharp
Further Reading
Where appropriate I’ve included additional links that covers the topics discussed in this post.
General
ETW Events and PerfView:
Profiling API:
Debugging:
Memory Dumps:
Tue, 21 Aug 2018, 12:00 am
Presentations and Talks covering '.NET Internals'
I’m constantly surprised at just how popular resources related to ‘.NET Internals’ are, for instance take this tweet and the thread that followed:
If you like learning about '.NET Internals' here's a few talks/presentations I've watched that you might also like. First 'Writing High Performance Code in .NET' by Bart de Smet https://t.co/L5S9BsBlWe
— Matt Warren (@matthewwarren)
July 9, 2018
All I’d done was put together a list of Presentations/Talks (based on the criteria below) and people really seemed to appreciate it!!
Criteria
To keep things focussed, the talks or presentations:
- Must explain some aspect of the ‘internals’ of the .NET Runtime (CLR)
- i.e. something ‘under-the-hood’, the more ‘low-level’ the better!
- e.g. how the GC works, what the JIT does, how assemblies are structured, how to inspect what’s going on, etc
- Be entertaining and worth watching!
- i.e. worth someone giving up 40-50 mins of their time for
- this is hard when you’re talking about low-level details, not all speakers manage it!
- Needs to be a talk that I’ve watched myself and actually learnt something from
- i.e. I don’t just hope it’s good based on the speaker/topic
- Doesn’t have to be unique, fine if it overlaps with another talk
- it often helps having two people cover the same idea, from different perspectives
If you want more general lists of talks and presentations see Awesome talks and Awesome .NET Performance
List of Talks
Here’s the complete list of talks, including a few bonus ones that weren’t in the tweet:
- PerfView: The Ultimate .NET Performance Tool by Sasha Goldshtein
- Writing High Performance Code in .NET by Bart De Smet
- State of the .NET Performance by Adam Sitnik
- Let’s talk about microbenchmarking by Andrey Akinshin
- Safe Systems Programming in C# and .NET (summary) by Joe Duffy
- FlingOS - Using C# for an OS by Ed Nutting
- Maoni Stephens on .NET GC by Maoni Stephens
- What’s new for performance in .NET Core 2.0 by Ben Adams
- Open Source Hacking the CoreCLR by Geoff Norton
- .NET Core & Cross Platform by Matt Ellis
- .NET Core on Unix by Jan Vorlicek
- Multithreading Deep Dive by Gael Fraiteur
- Everything you need to know about .NET memory by Ben Emmett
I also added these 2 categories:
If I’ve missed any out, please let me know in the comments (or on twitter)
PerfView: The Ultimate .NET Performance Tool by Sasha Goldshtein (slides)
In fact, just watch all the talks/presentations that Sasha has done, they’re great!! For example Modern Garbage Collection in Theory and Practice and Making .NET Applications Faster
This talk is a great ‘how-to’ guide for PerfView, what it can do and how to use it (JIT stats, memory allocations, CPU profiling). For more on PerfView see this interview with it’s creator, Vance Morrison: Performance and PerfView.
Writing High Performance Code in .NET by Bart De Smet (he also has a some Pluralsight Courses on the same subject)
Features CLRMD, WinDBG, ETW Events and PerfView, plus some great ‘real world’ performance issues
State of the .NET Performance by Adam Sitnik (slides)
How to write high-perf code that plays nicely with the .NET GC, covering Span, Memory & ValueTask
Let’s talk about microbenchmarking by Andrey Akinshin (slides)
Primarily a look at how to benchmark .NET code, but along the way it demonstrates some of the internal behaviour of the JIT compiler (Andrey is the creator of BenchmarkDotNet)
Safe Systems Programming in C# and .NET (summary) by Joe Duffy (slides and blog)
Joe Duffy (worked on the Midori project) shows why C# is a good ‘System Programming’ language, including what low-level features it provides
FlingOS - Using C# for an OS by Ed Nutting (slides)
Shows what you need to do if you want to write and entire OS in C# (!!) The FlingOS project is worth checking out, it’s a great learning resource.
Maoni Stephens on .NET GC by Maoni Stephens who is the main (only?) .NET GC developer. In addition CLR 4.5 Server Background GC and .NET 4.5 in Practice: Bing are also worth a watch.
An in-depth Q&A on how the .NET GC works, why is does what it does and how to use it efficiently
What’s new for performance in .NET Core 2.0 by Ben Adams (slides)
Whilst it mostly focuses on performance, there is some great internal details on how the JIT generates code for ‘de-virtualisation’, ‘exception handling’ and ‘bounds checking’
Open Source Hacking the CoreCLR by Geoff Norton
Making .NET Core (the CoreCLR) work on OSX was mostly a ‘community contribution’, this talks is a ‘walk-through’ of what it took to make it happen
.NET Core & Cross Platform by Matt Ellis, one of the .NET Runtime Engineers (this one on how made .NET Core ‘Open Source’ is also worth a watch)
Discussion of the early work done to make CoreCLR ‘cross-platform’, including the build setup, ‘Platform Abstraction Layer’ (PAL) and OS differences that had to be accounted for
.NET Core on Unix by Jan Vorlicek a .NET Runtime Engineer (slides)
This talk discusses which parts of the CLR had to be changed to run on Unix, including exception handling, calling conventions, runtime suspension and the PAL
Multithreading Deep Dive by Gael Fraiteur (creator of PostSharp)
Takes a really in-depth look at the CLR memory-model and threading primitives
Everything you need to know about .NET memory by Ben Emmett (slides)
Explains how the .NET GC works using Lego! A very innovative and effective approach!!
Channel 9
The Channel 9 videos recorded by Microsoft deserve their own category, because there’s so much deep, technical information in them. This list is just a selection, including some of my favourites, there are many, many more available!!
Ones to watch
I can’t recommend these yet, because I haven’t watched them myself! (I can’t break my own rules!!).
But they all look really interesting and I will watch them as soon as I get a chance, so I thought they were worth including:
If this post causes you to go off and watch hours and hours of videos, ignoring friends, family and work for the next few weeks, Don’t Blame Me
Thu, 12 Jul 2018, 12:00 am
.NET JIT and CLR - Joined at the Hip
I’ve been digging into .NET Internals for a while now, but never really looked closely at how the ‘Just-in-Time’ (JIT) compiler works. In my mind, the interaction between the .NET Runtime and the JIT has always looked like this:
Nice and straight-forward, the CLR asks the JIT to compile some ‘Intermediate Language’ (IL) code into machine code and the JIT hands back the bytes when it’s done.
However, it turns out the interaction is much more complicated, in reality it looks more like this:
The JIT and the CLR’s ‘Execution Engine’ (EE) or ‘Virtual Machine’ (VM) work closely with one another, they really are ‘joined at the hip’.
The rest of this post will explore the interaction between the 2 components, how they work together and why they need to.
The JIT Compiler
As a quick aside, this post will not be talking about the internals of the JIT compiler itself, if you want to find out more about how that works I recommend reading the fantastic overview in the BOTR and this excellent tutorial, where this very helpful diagram comes from:
After all that, if you still want more, you can take a look at the ‘JIT’ section in the ‘Hitchhikers-Guide-to-the-CoreCLR-Source-Code’.
Components within the CLR
Before we go any further it’s helpful to discuss how the ‘Common Language Runtime’ (CLR) is actually composed. It’s actually made up of several different components including the VM/EE, JIT, GC and others. The treemap below shows the different areas of the source code, grouped by colour into the top-level sections they fall under. You can clearly see that the VM and JIT dominate as well as ‘mscorlib’ which is the only component written in C#.
You can hover over an individual box to get more detailed information and can click on the different radio buttons to toggle the sizing (LOC/Files/Commits)
Total L.O.C
# Files
# Commits
Note: This treemap is from my previous post ‘Hitchhikers-Guide-to-the-CoreCLR-Source-Code’ which was written over a year ago, so the exact numbers will have changed in the meantime.
You can also see these ‘components’ or ‘areas’ reflected in the classification scheme used for the CoreCLR GitHub issues (one difference is that area-CodeGen
is used instead of JIT
).
The CLR and the JIT Compiler
Onto the main subject, just how do the CLR and the JIT compiler work together to transform a method from IL to machine code? As always, the ‘Book of the Runtime’ is a good place to start, from the ‘Execution Environment and External Interface’ section of the RyuJIT Overview:
RyuJIT provides the just in time compilation service for the .NET runtime. The runtime itself is variously called the EE (execution engine), the VM (virtual machine) or simply the CLR (common language runtime). Depending upon the configuration, the EE and JIT may reside in the same or different executable files. RyuJIT implements the JIT side of the JIT/EE interfaces:
ICorJitCompiler
– this is the interface that the JIT compiler implements. This interface is defined in src/inc/corjit.h and its implementation is in src/jit/ee_il_dll.cpp. The following are the key methods on this interface:
compileMethod
is the main entry point for the JIT. The EE passes it a ICorJitInfo
object, and the “info” containing the IL, the method header, and various other useful tidbits. It returns a pointer to the code, its size, and additional GC, EH and (optionally) debug info.
getVersionIdentifier
is the mechanism by which the JIT/EE interface is versioned. There is a single GUID (manually generated) which the JIT and EE must agree on.
getMaxIntrinsicSIMDVectorLength
communicates to the EE the largest SIMD vector length that the JIT can support.
ICorJitInfo
– this is the interface that the EE implements. It has many methods defined on it that allow the JIT to look up metadata tokens, traverse type signatures, compute field and vtable offsets, find method entry points, construct string literals, etc. This bulk of this interface is inherited from ICorDynamicInfo
which is defined in src/inc/corinfo.h. The implementation is defined in src/vm/jitinterface.cpp.
So there are 2 main interfaces, ICorJitCompiler
which is implemented by the JIT compiler and allows the EE to control how a method is compiled. Second there is ICorJitInfo
which the EE implements to allow the JIT to request information it needs during compilation.
Let’s now look at these interfaces in more detail.
Firstly, we’ll examine ICorJitCompiler
, the interface exposed by the JIT. It’s actually pretty straight-forward and only contains 7 methods:
CorJitResult __stdcall compileMethod (..)
void clearCache()
BOOL isCacheCleanupRequired()
void ProcessShutdownWork(ICorStaticInfo* info)
void getVersionIdentifier(..)
unsigned getMaxIntrinsicSIMDVectorLength(..)
void setRealJit(..)
Of these, the most interesting one is compileMethod(..), which has the following signature:
virtual CorJitResult __stdcall compileMethod (
ICorJitInfo *comp, /* IN */
struct CORINFO_METHOD_INFO *info, /* IN */
unsigned /* code:CorJitFlag */ flags, /* IN */
BYTE **nativeEntry, /* OUT */
ULONG *nativeSizeOfCode /* OUT */
) = 0;
The EE provides the JIT with information about the method it wants compiled (CORINFO_METHOD_INFO
) as well as flags (CorJitFlag
) which control the:
- Level of optimisation
- Whether the code is compiled in
Debug
or Release
mode
- If the code needs to be ‘Profilable’ or support ‘Edit-and-Continue’
- Alignment of loops, i.e. should they be aligned on byte-boundaries
- If
SSE3
/SSE4
should be used
- and many other scenarios
The final parameter is a reference to the ICorJitInfo
interface, which is covered in the next section.
The APIs that the EE has to implement to work with the JIT are not simple, there are almost 180 functions or callbacks!!
Interface
Method Count
ICorJitHost
5
ICorJitInfo
19
ICorDynamicInfo
36
ICorStaticInfo
118
Total
178
Note: The links take you to the function ‘definitions’ for a given interface. Alternatively all the methods are listed together in this gist.
ICorJitHost
makes available ‘functionality that would normally be provided by the operating system’, predominantly the ability to allocate the ‘pages’ of memory that the JIT uses during compilation.
ICorJitInfo
(class ICorJitInfo : public ICorDynamicInfo
) contains more specific memory allocation routines, including ones for the ‘GC Info’ data, a ‘method/funclet’s unwind information’, ‘.rdata and .pdata for a method’ and the ‘exception handler blocks’.
ICorDynamicInfo
(class ICorDynamicInfo : public ICorStaticInfo
) provides data that can change from ‘invocation to invocation’, i.e. the JIT cannot cache the results of these method calls. It includes functions that provide:
- Thread Local Storage (TLS) index
- Function Entry Point (address)
- EE ‘helper functions’
- Address of a Field
- Constructor for a
delegate
- and much more
Finally, ICorStaticInfo
, which is further sub-divided up into more specific interfaces:
Interface
Method Count
ICorMethodInfo
28
ICorModuleInfo
9
ICorClassInfo
49
ICorFieldInfo
7
ICorDebugInfo
4
ICorArgInfo
4
ICorErrorInfo
7
Diagnostic methods
6
General methods
2
Misc methods
2
Total
118
Because the interface is nicely composed we can easily see what it provides. The bulk of the functions are concerned with information about a module
, class
, method
or field
. For instance the JIT can query the class size, GC layout and obtain the address of a field within a class. It can also learn about a method’s signature, find it’s parent class and get ‘exception handling’ information (the full list of methods are available in this gist).
These interfaces and the methods they contain give a nice insight into what information the JIT requests from the runtime and therefore what knowledge it requires when compiling a single method.
Now, let’s look at the end-to-end flow of a couple of these methods and see where they are implemented in the CoreCLR source code.
EE ➜ JIT getFunctionEntryPoint(..)
First we’ll look at a method where the EE provides information to the JIT:
JIT ➜ EE reportInliningDecision()
Next we’ll look at a scenario where the data flows from the JIT back to the EE:
Finally, I just want to cover the ‘SuperPMI’ tool that showed up in the previous 2 scenarios. What is this tool and what does it do? From the CoreCLR glossary:
SuperPMI - JIT component test framework (super fast JIT testing - it mocks/replays EE in EE-JIT interface)
So in a nutshell it allows JIT development and testing to be de-coupled from the EE, which is useful because we’ve just seen that the 2 components are tightly integrated.
But how does it work? From the README:
SuperPMI works in two phases: collection and playback. In the collection phase, the system is configured to collect SuperPMI data. Then, run any set of .NET managed programs. When these managed programs invoke the JIT compiler, SuperPMI gathers and captures all information passed between the JIT and its .NET host. In the playback phase, SuperPMI loads the JIT directly, and causes it to compile all the functions that it previously compiled, but using the collected data to provide answers to various questions that the JIT needs to ask. The .NET execution engine (EE) is not invoked at all.
This explains why there is a SuperPMI implementation for every method that is part of the JIT EE interface. SuperPMI needs to ‘record’ or ‘collect’ each interaction with the EE and store the information so that it can be ‘played back’ at a later time, when the EE isn’t present.
Discuss this post on Hacker News or /r/dotnet
Further Reading
As always, if you’ve read this far, here’s some further information that you might find useful:
Thu, 5 Jul 2018, 12:00 am
Tools for Exploring .NET Internals
Whether you want to look at what your code is doing ‘under-the-hood’ or you’re trying to see what the ‘internals’ of the CLR look like, there is a whole range of tools that can help you out.
To give ‘credit where credit is due’, this post is based on a tweet, so thanks to everyone who contributed to the list and if I’ve missed out any tools, please let me know in the comments below.
While you’re here, I’ve also written other posts that look at the ‘internals’ of the .NET Runtime:
Honourable Mentions
Firstly I’ll start by mentioning that Visual Studio has a great debugger and so does VSCode. Also there are lots of very good (commercial) .NET Profilers and Application Monitoring Tools available that you should also take a look at. For example I’ve recently been playing around with Codetrack and I’m very impressed by what it can do!
However, the rest of the post is going to look at some more single-use tools that give a even deeper insight into what is going on. As a added bonus they’re all ‘open-source’, so you can take a look at the code and see how they work!!
PerfView is simply an excellent tool and is the one that I’ve used most over the years. It uses ‘Event Tracing for Windows’ (ETW) Events to provide a deep insight into what the CLR is doing, as well as allowing you to profile Memory and CPU usage. It does have a fairly steep learning curve, but there are some nice tutorials to help you along the way and it’s absolutely worth the time and effort.
Also, if you need more proof of how useful it is, Microsoft Engineers themselves use it and many of the recent performance improvements in MSBuild were carried out after using PerfView to find the bottlenecks.
PerfView is built on-top of the Microsoft.Diagnostics.Tracing.TraceEvent library which you can use in your own tools. In addition, since it’s been open-sourced the community has contributed and it has gained some really nice features, including flame-graphs:
(Click for larger version)
SharpLab started out as a tool for inspecting the IL code emitted by the Roslyn compiler, but has now grown into much more:
SharpLab is a .NET code playground that shows intermediate steps and results of code compilation.
Some language features are thin wrappers on top of other features – e.g. using()
becomes try/catch
.
SharpLab allows you to see the code as compiler sees it, and get a better understanding of .NET languages.
If supports C#, Visual Basic and F#, but most impressive are the ‘Decompilation/Disassembly’ features:
There are currently four targets for decompilation/disassembly:
- C#
- Visual Basic
- IL
- JIT Asm (Native Asm Code)
That’s right, it will output the assembly code that the .NET JIT generates from your C#:
This tool gives you an insight into the memory layout of your .NET objects, i.e. it will show you how the JITter has decided to arrange the fields within your class
or struct
. This can be useful when writing high-performance code and it’s helpful to have a tool that does it for us because doing it manually is tricky:
There is no official documentation about fields layout because the CLR authors reserved the right to change it in the future. But knowledge about the layout can be helpful if you’re curious or if you’re working on a performance critical application.
How can we inspect the layout? We can look at a raw memory in Visual Studio or use !dumpobj
command in SOS Debugging Extension. These approaches are tedious and boring, so we’ll try to write a tool that will print an object layout at runtime.
From the example in the GitHub repo, if you use TypeLayout.Print()
with code like this:
public struct NotAlignedStruct
{
public byte m_byte1;
public int m_int;
public byte m_byte2;
public short m_short;
}
You’ll get the following output, showing exactly how the CLR will layout the struct
in memory, based on it’s padding and optimization rules.
Size: 12. Paddings: 4 (%33 of empty space)
|================================|
| 0: Byte m_byte1 (1 byte) |
|--------------------------------|
| 1-3: padding (3 bytes) |
|--------------------------------|
| 4-7: Int32 m_int (4 bytes) |
|--------------------------------|
| 8: Byte m_byte2 (1 byte) |
|--------------------------------|
| 9: padding (1 byte) |
|--------------------------------|
| 10-11: Int16 m_short (2 bytes) |
|================================|
TUNE is a really intriguing tool, as it says on the GitHub page, it’s purpose is to help you
… learn .NET internals and performance tuning by experiments with C# code.
You can find out more information about what it does in this blog post, but at a high-level it works like this:
- write a sample, valid C# script which contains at least one class with public method taking a single string parameter. It will be executed by hitting Run button. This script can contain as many additional methods and classes as you wish. Just remember that first public method from the first public class will be executed (with single parameter taken from the input box below the script). …
- after clicking Run button, the script will be compiled and executed. Additionally, it will be decompiled both to IL (Intermediate Language) and assembly code in the corresponding tabs.
- all the time Tune is running (including time during script execution) a graph with GC data is being drawn. It shows information about generation sizes and GC occurrences (illustrated as vertical lines with the number below indicating which generation has been triggered).
And looks like this:
(Click for larger version)
Finally, we’re going to look at a particular category of tools. Since .NET came out you’ve always been able to use WinDBG and the SOS Debugging Extension to get deep into the .NET runtime. However it’s not always the easiest tool to get started with and as this tweet says, it’s not always the most productive way to do things:
Besides how complex it is, the idea is to build better abstractions. Raw debugging at the low level is just usually too unproductive. That to me is the promise of ClrMD, that it lets us build specific extensions to extract quickly the right info
— Tomas Restrepo (@tomasrestrepo)
March 14, 2018
Fortunately Microsoft made the ClrMD library available (a.k.a Microsoft.Diagnostics.Runtime), so now anyone can write a tool that analyses memory dumps of .NET programs. You can find out even more info in the official blog post and I also recommend taking a look at ClrMD.Extensions that “.. provide integration with LINPad and to make ClrMD even more easy to use”.
I wanted to pull together a list of all the existing tools, so I enlisted twitter to help. Note to self: careful what you tweet, the WinDBG Product Manager might read your tweets and get a bit upset!!
Well this just hurts my feelings :(
— Andy Luhrs (@aluhrs13)
March 14, 2018
Most of these tools are based on ClrMD because it’s the easiest way to do things, however you can use the underlying COM interfaces directly if you want. Also, it’s worth pointing out that any tool based on ClrMD is not cross-platform, because ClrMD itself is Windows-only. For cross-platform options see Analyzing a .NET Core Core Dump on Linux
Finally, in the interest of balance, there have been lots of recent improvements to WinDBG and because it’s extensible there have been various efforts to add functionality to it:
Having said all that, onto the list:
- SuperDump (GitHub)
- msos (GitHub)
- Command-line environment a-la WinDbg for executing SOS commands without having SOS available.
- MemoScope.Net (GitHub)
- A tool to analyze .Net process memory Can dump an application’s memory in a file and read it later.
- The dump file contains all data (objects) and threads (state, stack, call stack). MemoScope.Net will analyze the data and help you to find memory leaks and deadlocks
- dnSpy (GitHub)
- .NET debugger and assembly editor
- You can use it to edit and debug assemblies even if you don’t have any source code available!!
- MemAnalyzer (GitHub)
- A command line memory analysis tool for managed code.
- Can show which objects use most space on the managed heap just like
!DumpHeap
from Windbg without the need to install and attach a debugger.
- DumpMiner (GitHub)
- UI tool for playing with ClrMD, with more features coming soon
- Trace CLI (GitHub)
- A production debugging and tracing tool
- Shed (GitHub)
- Shed is an application that allow to inspect the .NET runtime of a program in order to extract useful information. It can be used to inspect malicious applications in order to have a first general overview of which information are stored once that the malware is executed. Shed is able to:
- Extract all objects stored in the managed heap
- Print strings stored in memory
- Save the snapshot of the heap in a JSON format for post-processing
- Dump all modules that are loaded in memory
You can also find many other tools that make use of ClrMD, it was a very good move by Microsoft to make it available.
A few other tools that are also worth mentioning:
- DebugDiag
- The DebugDiag tool is designed to assist in troubleshooting issues such as hangs, slow performance, memory leaks or memory fragmentation, and crashes in any user-mode process (now with ‘CLRMD Integration’)
- SOSEX (might not be developed any more)
- … a debugging extension for managed code that begins to alleviate some of my frustrations with SOS
- VMMap from Sysinternals
Discuss this post on Hacker News or /r/programming
Fri, 15 Jun 2018, 12:00 am
CoreRT - A .NET Runtime for AOT
Firstly, what exactly is CoreRT? From its GitHub repo:
.. a .NET Core runtime optimized for AOT (ahead of time compilation) scenarios, with the accompanying .NET native compiler toolchain
The rest of this post will look at what that actually means.
Contents
- Existing .NET ‘AOT’ Implementations
- High-Level Overview
- The Compiler
- The Runtime
- ‘Hello World’ Program
- Limitations
- Further Reading
Existing .NET ‘AOT’ Implementations
However, before we look at what CoreRT is, it’s worth pointing out there are existing .NET ‘Ahead-of-Time’ (AOT) implementations that have been around for a while:
Mono
.NET Native (Windows 10/UWP apps only, a.k.a ‘Project N’)
So if there were existing implementations, why was CoreRT created? The official announcement gives us some idea:
If we want to shortcut this two-step compilation process and deliver a 100% native application on Windows, Mac, and Linux, we need an alternative to the CLR. The project that is aiming to deliver that solution with an ahead-of-time compilation process is called CoreRT.
The main difference is that CoreRT is designed to support .NET Core scenarios, i.e. .NET Standard, cross-platform, etc.
Also worth pointing out is that whilst .NET Native is a separate product, they are related and in fact “.NET Native shares many CoreRT parts”.
High-Level Overview
Because all the code is open source, we can very easily identify the main components and understand where the complexity is. Firstly lets look at where the most ‘lines of code’ are:
We clearly see that the majority of the code is written in C#, with only the Native component written in C++. The largest single component is System.Private.CoreLib which is all C# code, although there are other sub-components that contribute to it (‘System.Private.XXX’), such as System.Private.Interop (36,547 LOC), System.Private.TypeLoader (30,777) and System.Private.Reflection.Core (24,964). Other significant components are the ‘Intermediate Language (IL) Compiler’ and the Common code that is used re-used by everything else.
All these components are discussed in more detail below.
The Compiler
So whilst CoreRT is a run-time, it also needs a compiler to put everything together, from Intro to .NET Native and CoreRT:
.NET Native is a native toolchain that compiles CIL byte code to machine code (e.g. X64 instructions). By default, .NET Native (for .NET Core, as opposed to UWP) uses RyuJIT as an ahead-of-time (AOT) compiler, the same one that CoreCLR uses as a just-in-time (JIT) compiler. It can also be used with other compilers, such as LLILC, UTC for UWP apps and IL to CPP (an IL to textual C++ compiler we have built as a reference prototype).
But what does this actually look like in practice, as they say ‘a picture paints a thousand words’:
(Click for larger version)
To give more detail, the main compilation phases (started from \ILCompiler\src\Program.cs) are the following:
- Calculate the reachable modules/types/classes, i.e. the ‘compilation roots’ using the ILScanner.cs
- Allow for reflection, via an optional rd.xml file and generate the necessary metadata using ILCompiler.MetadataWriter
- Compile the IL using the specific back-end (generic/shared code is in Compilation.cs)
- Finally, write out the compiled methods using ObjectWriter which in turn uses LLVM under-the-hood
But it’s not just your code that ends up in the final .exe, along the way the CoreRT compiler also generates several ‘helper methods’ to cover the following scenarios:
Fortunately the compiler doesn’t blindly include all the code it finds, it is intelligent enough to only include code that’s actually used:
We don’t use ILLinker, but everything gets naturally treeshaken by the compiler itself (we start with compiling Main
/NativeCallable
exports and continue compiling other methods and generating necessary data structures as we go). If there’s a type or method that is not used, the compiler doesn’t even look at it.
The Runtime
All the user/helper code then sits on-top of the CoreRT runtime, from Intro to .NET Native and CoreRT:
CoreRT is the .NET Core runtime that is optimized for AOT scenarios, which .NET Native targets. This is a refactored and layered runtime. The base is a small native execution engine that provides services such as garbage collection(GC). This is the same GC used in CoreCLR. Many other parts of the traditional .NET runtime, such as the type system, are implemented in C#. We’ve always wanted to implement runtime functionality in C#. We now have the infrastructure to do that. In addition, library implementations that were built deep into CoreCLR, have also been cleanly refactored and implemented as C# libraries.
This last point is interesting, why is it advantageous to implement ‘runtime functionality in C#’? Well it turns out that it’s hard to do in an un-managed language because there’s some very subtle and hard-to-track-down ways that you can get it wrong:
Reliability and performance. The C/C++ code has to manually managed. It means that one has to be very careful to report all GC references to the GC. The manually managed code is both very hard to get right and it has performance overhead.
— Jan Kotas (@JanKotas7)
April 24, 2018
These are known as ‘GC Holes’ and the BOTR provides more detail on them. The author of that tweet is significant, Jan Kotas has worked on the .NET runtime for a long time, if he thinks something is hard, it really is!!
Runtime Components
As previously mentioned it’s a layered runtime, i.e made up of several, distinct components, as explained in this comment:
At the core of CoreRT, there’s a runtime that provides basic services for the code to run (think: garbage collection, exception handling, stack walking). This runtime is pretty small and mostly depends on C/C++ runtime (even the C++ runtime dependency is not a hard requirement as Jan pointed out - #3564). This code mostly lives in src/Native/Runtime, src/Native/gc, and src/Runtime.Base. It’s structured so that the places that do require interacting with the underlying platform (allocating native memory, threading, etc.) go through a platform abstraction layer (PAL). We have a PAL for Windows, Linux, and macOS, but others can be added.
And you can see the PAL Components in the following locations:
C# Code shared with CoreCLR
One interesting aspect of the CoreRT runtime is that wherever possible it shares code with the CoreCLR runtime, this is part of a larger effort to ensure that wherever possible code is shared across multiple repositories:
This directory contains the shared sources for System.Private.CoreLib. These are shared between dotnet/corert, dotnet/coreclr and dotnet/corefx.
The sources are synchronized with a mirroring tool that watches for new commits on either side and creates new pull requests (as @dotnet-bot) in the other repository.
Recently there has been a significant amount of work done to moved more and more code over into the ‘shared partition’ to ensure work isn’t duplicated and any fixes are shared across both locations. You can see how this works by looking at the links below:
What this means is that about 2/3 of the C# code in System.Private.CoreLib
is shared with CoreCLR
and only 1/3 is unique to CoreRT
:
Group
C# LOC (Files)
shared
170,106 (759)
src
96,733 (351)
Total
266,839 (1,110)
Native Code
Finally, whilst it is advantageous to write as much code as possible in C#, there are certain components that have to be written in C++, these include the GC (the majority of which is one file, gc.cpp which is almost 37,000 LOC!!), the JIT Interface, ObjWriter (based on LLVM) and most significantly the Core Runtime that contains code for activities like:
- Threading
- Stack Frame handling
- Debugging/Profiling
- Interfacing to the OS
- CPU specific helpers for:
- Exception handling
- GC Write Barriers
- Stubs/Thunks
- Optimised object allocation
‘Hello World’ Program
One of the first things people asked about CoreRT is “what is the size of a ‘Hello World’ app” and the answer is ~3.93 MB (if you compile in Release mode), but there is work being done to reduce this. At a ‘high-level’, the .exe that is produced looks like this:
Note the different colours correspond to the original format of a component, obviously the output is a single, native, executable file.
This file comes with a full .NET specific ‘base runtime’ or ‘class libraries’ (‘System.Private.XXX’) so you get a lot of functionality, it is not the absolute bare-minimum app. Fortunately there is a way to see what a ‘bare-minimum’ runtime would look like by compiling against the Test.CoreLib project included in the CoreRT source. By using this you end up with an .exe that looks like this:
But it’s so minimal that OOTB you can’t even write ‘Hello World’ to the console as there is no System.Console
type! After a bit of hacking I was able to build a version that did have a working Console
output (if you’re interested, this diff is available here). To make it work I had to include the following components:
So Test.CoreLib
really is a minimal runtime!! But the difference in size is dramatic, it shrinks down to 0.49 MB compared to 3.93 MB for the fully-featured runtime!
Type
Standard (bytes)
Test.CoreLib (bytes)
Difference
.data
163,840
36,864
-126,976
.managed
1,540,096
65,536
-1,474,560
.pdata
147,456
20,480
-126,976
.rdata
1,712,128
81,920
-1,630,208
.reloc
98,304
4,096
-94,208
.text
360,448
299,008
-61,440
rdata
98,304
4,096
-94,208
Total (bytes)
4,120,576
512,000
-3,608,576
Total (MB)
3.93
0.49
-3.44
These data sizes were obtained by using the Microsoft DUMPBIN tool and the /DISASM
cmd line switch (zip file of the full ouput), which produces the following summary (note: size values are in HEX):
Summary
28000 .data
178000 .managed
24000 .pdata
1A2000 .rdata
18000 .reloc
58000 .text
18000 rdata
Also contained in the output is the assembly code for a simple Hello World
method:
HelloWorld_HelloWorld_Program__Main:
0000000140004C50: 48 8D 0D 19 94 37 lea rcx,[__Str_Hello_World__E63BA1FD6D43904697343A373ECFB93457121E4B2C51AF97278C431E8EC85545]
00
0000000140004C57: 48 8D 05 DA C5 00 lea rax,[System_Console_System_Console__WriteLine_12]
00
0000000140004C5E: 48 FF E0 jmp rax
0000000140004C61: 90 nop
0000000140004C62: 90 nop
0000000140004C63: 90 nop
and if we dig further we can see the code for System.Console.WriteLine(..)
:
System_Console_System_Console__WriteLine_12:
0000000140011238: 56 push rsi
0000000140011239: 48 83 EC 20 sub rsp,20h
000000014001123D: 48 8B F1 mov rsi,rcx
0000000140011240: E8 33 AD FF FF call System_Console_System_Console__get_Out
0000000140011245: 48 8B C8 mov rcx,rax
0000000140011248: 48 8B D6 mov rdx,rsi
000000014001124B: 48 8B 00 mov rax,qword ptr [rax]
000000014001124E: 48 8B 40 68 mov rax,qword ptr [rax+68h]
0000000140011252: 48 83 C4 20 add rsp,20h
0000000140011256: 5E pop rsi
0000000140011257: 48 FF E0 jmp rax
000000014001125A: 90 nop
000000014001125B: 90 nop
Limitations
Missing Functionality
There have been some people who’ve successfully run complex apps using CoreRT, but, as it stands CoreRT is still an alpha product. At least according to the NuGet package ‘1.0.0-alpha-26529-02’ that the official samples instruct you to use and I’ve not seen any information about when a full 1.0 Release will be available.
So there is some functionality that is not yet implemented, e.g. F# Support, GC.GetMemoryInfo or canGetCookieForPInvokeCalliSig (a calli
to a p/invoke). For more information on this I recommend this entertaining presentation on Building Native Executables from .NET with CoreRT by Mark Rendle. In the 2nd half he chronicles all the issues that he ran into when he was trying to run an ASP.NET app under CoreRT (some of which may well be fixed now).
Reflection
But more fundamentally, because of the nature of AOT compilation, there are 2 main stumbling blocks that you may also run into Reflection and Runtime Code-Generation.
Firstly, if you want to use reflection in your code you need to tell the CoreRT compiler about the types you expect to reflect over, because by-default it only includes the types it knows about. You can do with by using a file called rd.xml
as shown here. Unfortunately this will always require manual intervention for the reasons explained in this issue. More information is available in this comment ‘…some details about CoreRT’s restriction on MakeGenericType and MakeGenericMethod’.
To make reflection work the compiler adds the required metadata to the final .exe using this process:
This would reuse the same scheme we already have for the RyuJIT codegen path:
- The compiler generates a blob of bytes that describes the metadata (namespaces, types, their members, their custom attributes, method parameters, etc.). The data is generated as a byte array in the ComputeMetadata method.
- The metadata gets embedded as a data blob into the executable image. This is achieved by adding the blob to a “ready to run header”. Ready to run header is a well known data structure that can be located by the code in the framework at runtime.
- The ready to run header along with the blobs it refers to is emitted into the final executable.
- At runtime, pointer to the byte array is located using the RhFindBlob API, and a parser is constructed over the array, to be used by the reflection stack.
Runtime Code-Generation
In .NET you often use reflection once (because it can be slow) followed by ‘dynamic’ or ‘runtime’ code-generation with Reflection.Emit(..)
. This technique is widely using in .NET libraries for Serialisation/Deserialisation, Dependency Injection, Object Mapping and ORM.
The issue is that ‘runtime’ code generation is problematic in an ‘AOT’ scenario:
ASP.NET dependency injection introduced dependency on Reflection.Emit in aspnet/DependencyInjection#630 unfortunately. It makes it incompatible with CoreRT.
We can make it functional in CoreRT AOT environment by introducing IL interpretter (#5011), but it would still perform poorly. The dependency injection framework is using Reflection.Emit on performance critical paths.
It would be really up to ASP.NET to provide AOT-friendly flavor that generates all code at build time instead of runtime to make this work well. It would likely help the startup without CoreRT as well.
I’m sure this will be solved one way or the other (see #5011), but at the moment it’s still ‘work-in-progress’.
Discuss this post on HackerNews and /r/dotnet
Further Reading
If you’ve got this far, here’s some other links that you might be interested in:
Thu, 7 Jun 2018, 12:00 am
Taking a look at the ECMA-335 Standard for .NET
It turns out that the .NET Runtime has a technical standard (or specification), known by its full name ECMA-335 - Common Language Infrastructure (CLI) (not to be confused with ECMA-334 which is the ‘C# Language Specification’). The latest update is the 6th edition from June 2012.
The specification or standard was written before .NET Core existed, so only applies to the .NET Framework, I’d be interested to know if there are any plans for an updated version?
The rest of this post will take a look at the standard, exploring the contents and investigating what we can learn from it (hint: lots of low-level details and information about .NET internals)
Why is it useful?
Having a standard means that different implementations, such as Mono and DotNetAnywhere can exist, from Common Language Runtime (CLR):
Compilers and tools are able to produce output that the common language runtime can consume because the type system, the format of metadata, and the runtime environment (the virtual execution system) are all defined by a public standard, the ECMA Common Language Infrastructure specification. For more information, see ECMA C# and Common Language Infrastructure Specifications.
and from the CoreCLR documentation on .NET Standards:
There was a very early realization by the founders of .NET that they were creating a new programming technology that had broad applicability across operating systems and CPU types and that advanced the state of the art of late 1990s (when the .NET project started at Microsoft) programming language implementation techniques. This led to considering and then pursuing standardization as an important pillar of establishing .NET in the industry.
The key addition to the state of the art was support for multiple programming languages with a single language runtime, hence the name Common Language Runtime. There were many other smaller additions, such as value types, a simple exception model and attributes. Generics and language integrated query were later added to that list.
Looking back, standardization was quite effective, leading to .NET having a strong presence on iOS and Android, with the Unity and Xamarin offerings, both of which use the Mono runtime. The same may end up being true for .NET on Linux.
The various .NET standards have been made meaningful by the collaboration of multiple companies and industry experts that have served on the working groups that have defined the standards. In addition (and most importantly), the .NET standards have been implemented by multiple commercial (ex: Unity IL2CPP, .NET Native) and open source (ex: Mono) implementors. The presence of multiple implementations proves the point of standardization.
As the last quote points out, the standard is not produced solely by Microsoft:
There is also a nice Wikipedia page that has some additional information.
What is in it?
At a high-level overview, the specification is divided into the following ‘partitions’ :
- I: Concepts and Architecture
- A great introduction to the CLR itself, explaining many of the key concepts and components, as well as the rationale behind them
- II: Metadata Definition and Semantics
- An explanation of the format of .NET dll/exe files, the different sections within them and how they’re laid out in-memory
- III: CIL Instruction Set
- A complete list of all the Intermediate Language (IL) instructions that the CLR understands, along with a detailed description of what they do and how to use them
- IV: Profiles and Libraries
- Describes the various different ‘Base Class libraries’ that make-up the runtime and how they are grouped into ‘Profiles’
- V: Binary Formats (Debug Interchange Format)
- An overview of ‘Portable CILDB files’, which give a way for additional debugging information to be provided
- VI: Annexes
- Annex A - Introduction
- Annex B - Sample programs
- Annex C - CIL assembler implementation
- Annex D - Class library design guidelines
- Annex E - Portability considerations
- Annex F - Imprecise faults
- Annex G - Parallel library
But, working your way through the entire specification is a mammoth task, generally I find it useful to just search for a particular word or phrase and locate the parts I need that way. However if you do want to read through one section, I recommend ‘Partition I: Concepts and Architecture’, at just over 100 pages it is much easier to fully digest! This section is a very comprehensive overview of the key concepts and components contained within the CLR and well worth a read.
Also, I’m convinced that the authors of the spec wanted to help out any future readers, so to break things up they included lots of very helpful diagrams:
For more examples see:
On top of all that, they also dropped in some Comic Sans 😀, just to make it clear when the text is only ‘informative’:
How has it changed?
The spec has been through 6th editions and it’s interesting to look at the changes over time:
Edition
Release Date
CLR Version
Significant Changes
1st
December 2001
1.0 (February 2002)
N/A
2nd
December 2002
1.1 (April 2003)
3rd
June 2005
2.0 (January 2006)
See below
(link)
4th
June 2006
None, revision of 3rd edition
(link)
5th
December 2010
4.0 (April 2010)
See below
(link)
6th
June 2012
None, revision of 5th edition
(link)
However, only 2 editions contained significant updates, they are explained in more detail below:
- Support for generic types and methods (see ‘How generics were added to .NET’)
- New IL instructions -
ldelem
, stelem
and unbox.any
- Added the
constrained.
, no.
and readonly.
IL instruction prefixes
- Brand new ‘namespaces’ (with corresponding types) -
System.Collections.Generics
, System.Threading.Parallel
- New types added, including
Action
, Nullable
and ThreadStaticAttribute
- Type-forwarding added
- Semantics of ‘variance’ redefined, became a core feature
- Multiple types added or updated, including
System.Action
, System.MulticastDelegate
and System.WeakReference
System.Math
and System.Double
modified to better conform to IEEE
Microsoft Specific Implementation
Another interesting aspect to look at is the Microsoft specific implementation details and notes. The following links are to pdf documents that are modified versions of the 4th edition:
They all contain multiple occurrences of text like this ‘Implementation Specific (Microsoft)’:
Finally, if you want to find out more there’s a book available (affiliate link):
Fri, 6 Apr 2018, 12:00 am
Exploring the internals of the .NET Runtime
I recently appeared on Herding Code and Stackify ‘Developer Things’ podcasts and in both cases, the first question asked was ‘how do you figure out the internals of the .NET runtime’?
This post is an attempt to articulate that process, in the hope that it might be useful to others.
Here are my suggested steps:
- Decide what you want to investigate
- See if someone else has already figured it out (optional)
- Read the ‘Book of the Runtime’
- Build from the source
- Debugging
- Verify against .NET Framework (optional)
Note: As with all these types of lists, just because it worked for me doesn’t mean that it will for everyone. So, ‘your milage may vary’.
Step One - Decide what you want to investigate
For me, this means working out what question I’m trying to answer, for example here are some previous posts I’ve written:
(it just goes to show, you don’t always need fancy titles!)
I put this as ‘Step 1’ because digging into .NET internals isn’t quick or easy work, some of my posts take weeks to research, so I need to have a motivation to keep me going, something to focus on. In addition, the CLR isn’t a small run-time, there’s a lot in there, so just blindly trying to find your way around it isn’t easy! That’s why having a specific focus helps, looking at one feature or section at a time is more manageable.
The very first post where I followed this approach was Strings and the CLR - a Special Relationship. I’d previously spent some time looking at the CoreCLR source and I knew a bit about how Strings
in the CLR worked, but not all the details. During the research of that post I then found more and more areas of the CLR that I didn’t understand and the rest of my blog grew from there (delegates, arrays, fixed keyword, type loader, etc).
Aside: I think this is generally applicable, if you want to start blogging, but you don’t think you have enough ideas to sustain it, I’d recommend that you start somewhere and other ideas will follow.
Another tip is to look at HackerNews or /r/programming for posts about the ‘internals’ of other runtimes, e.g. Java, Ruby, Python, Go etc, then write the equivalent post about the CLR. One of my most popular posts A Hitchhikers Guide to the CoreCLR Source Code was clearly influenced by equivalent articles!
Finally, for more help with learning, ‘figuring things out’ and explaining them to others, I recommend that you read anything by Julia Evans. Start with Blogging principles I use and So you want to be a wizard (also available as a zine), then work your way through all the other posts related to blogging or writing.
I’ve been hugely influenced, for the better, by Julia’s approach to blogging.
I put this in as ‘optional’, because it depends on your motivation. If you are trying to understand .NET internals for your own education, then feel-free to write about whatever you want. If you are trying to do it to also help others, I’d recommend that you first see what’s already been written about the subject. If, once you’ve done that you still think there is something new or different that you can add, then go ahead, but I try not to just re-hash what is already out there.
To see what’s already been written, you can start with Resources for Learning about .NET Internals or peruse the ‘Internals’ tag on this blog. Another really great resource is all the answers by Hans Passant on StackOverflow, he is prolific and amazingly knowledgeable, here’s some examples to get you started:
Step Three - Read the ‘Book of the Runtime’
You won’t get far in investigating .NET internals without coming across the ‘Book of the Runtime’ (BOTR) which is an invaluable resource, even Scott Hanselman agrees!
It was written by the .NET engineering team, for the .NET engineering team, as per this HackerNews comment:
Having worked for 7 years on the .NET runtime team, I can attest that the BOTR is the official reference. It was created as documentation for the engineering team, by the engineering team. And it was (supposed to be) kept up to date any time a new feature was added or changed.
However, just a word of warning, this means that it’s an in-depth, non-trivial document and hard to understand when you are first learning about a particular topic. Several of my blog posts have consisted of the following steps:
- Read the BOTR chapter on ‘Topic X’
- Understand about 5% of what I read
- Go away and learn more (read the source code, read other resources, etc)
- GOTO ‘Step 1’, understanding more this time!
Related to this, the source code itself is often as helpful as the BOTR due to the extensive comments, for example this one describing the rules for prestubs really helped me out. The downside of the source code comments is that they are bit harder to find, whereas the BOTR is all in one place.
Step Four - Build from the source
However, at some point, just reading about the internals of the CLR isn’t enough, you actually need to ‘get your hands’ dirty and see it in action. Now that the Core CLR is open source it’s very easy to build it yourself and then once you’ve done that, there are even more docs to help you out if you are building on different OSes, want to debug, test CoreCLR in conjunction with CoreFX, etc.
But why is building from source useful?
Because it lets you build a Debug/Diagnostic version of the runtime that gives you lots of additional information that isn’t available in the Release/Retails builds. For instance you can view JIT Dumps using COMPlus_JitDump=...
, however this is just one of many COMPlus_XXX
settings you can use, there are 100’s available.
However, even more useful is the ability to turn on diagnostic logging for a particular area of the CLR. For instance, lets imagine that we want to find out more about AppDomains
and how they work under-the-hood, we can use the following logging configuration settings:
SET COMPLUS_LogEnable=1
SET COMPLUS_LogToFile=1
SET COMPLUS_LogFacility=02000000
SET COMPLUS_LogLevel=A
Where LogFacility
is set to LF_APPDOMAIN
, there are many other values you can provide as a HEX bit-mask the full list is available in the source code. If you set these variables and then run an app, you will get a log output like this one. Once you have this log you can very easily search around in the code to find where the messages came from, for instance here are all the places that LF_APPDOMAIN
is logged. This is a great technique to find your way into a section of the CLR that you aren’t familiar with, I’ve used it many times to great effect.
Step Five - Debugging
For me, biggest boon of Microsoft open sourcing .NET is that you can discover so much more about the internals without having to resort to ‘old school’ debugging using WinDBG. But there still comes a time when it’s useful to step through the code line-by-line to see what’s going on. The added advantage of having the source code is that you can build a copy locally and then debug through that using Visual Studio which is slightly easier than WinDBG.
I always leave debugging to last, as it can be time-consuming and I only find it helpful when I already know where to set a breakpoint, i.e. I already know which part of the code I want to step through. I once tried to blindly step through the source of the CLR whilst it was starting up and it was very hard to see what was going on, as I’ve said before the CLR is a complex runtime, there are many things happening, so stepping through lots of code, line-by-line can get tricky.
Step Six - Verify against .NET Framework
I put this final step in because the .NET CLR source available on GitHub is the ‘.NET Core’ version of the runtime, which isn’t the same as the full/desktop .NET Framework that’s been around for years. So you may need to verify the behavior matches, if you want to understand the internals ‘as they were’, not just ‘as they will be’ going forward. For instance .NET Core has removed the ability to create App Domains as a way to provide isolation but interestingly enough the internal class lives on!
To verify the behaviour, your main option is to debug the CLR using WinDBG. Beyond that, you can resort to looking at the ‘Rotor’ source code (roughly the same as .NET Framework 2.0), or petition Microsoft the release the .NET Framework Source Code (probably not going to happen)!
However, low-level internals don’t change all that often, so more often than not the way things behave in the CoreCLR is the same as they’ve always worked.
Resources
Finally, for your viewing pleasure, here are a few talks related to ‘.NET Internals’:
Discuss this post on /r/programming or /r/dotnet
Fri, 23 Mar 2018, 12:00 am
How generics were added to .NET
Discuss this post on HackerNews and /r/programming
Before we dive into the technical details, let’s start with a quick history lesson, courtesy of Don Syme who worked on adding generics to .NET and then went on to design and implement F#, which is a pretty impressive set of achievements!!
Background and History
- 1999 Initial research, design and planning
- 1999 First ‘white paper’ published
- 2001 C# Language Design Specification created
- 2001 Research paper published
- 2004 Work completed and all bugs fixed
Update: Don Syme, pointed out another research paper related to .NET generics, Combining Generics, Precompilation and Sharing Between Software Based Processes (pdf)
To give you an idea of how these events fit into the bigger picture, here are the dates of .NET Framework Releases, up-to 2.0 which was the first version to have generics:
Version number
CLR version
Release date
1.0
1.0
2002-02-13
1.1
1.1
2003-04-24
2.0
2.0
2005-11-07
Aside from the historical perspective, what I find most fascinating is just how much the addition of generics in .NET was due to the work done by Microsoft Research, from .NET/C# Generics History:
It was only through the total dedication of Microsoft Research, Cambridge during 1998-2004, to doing a complete, high quality implementation in both the CLR (including NGEN, debugging, JIT, AppDomains, concurrent loading and many other aspects), and the C# compiler, that the project proceeded.
He then goes on to say:
What would the cost of inaction have been? What would the cost of failure have been? No generics in C# 2.0? No LINQ in C# 3.0? No TPL in C# 4.0? No Async in C# 5.0? No F#? Ultimately, an erasure model of generics would have been adopted, as for Java, since the CLR team would never have pursued a in-the-VM generics design without external help.
Wow, C# and .NET would look very different without all these features!!
The ‘Gyro’ Project - Generics for Rotor
Unfortunately there doesn’t exist a publicly accessible version of the .NET 1.0 and 2.0 source code, so we can’t go back and look at the changes that were made (if I’m wrong, please let me know as I’d love to read it).
However, we do have the next best thing, the ‘Gyro’ project in which the equivalent changes were made to the ‘Shared Source Common Language Implementation’ (SSCLI) code base (a.k.a ‘Rotor’). As an aside, if you want to learn more about the Rotor code base I really recommend the excellent book by Ted Neward, which you can download from his blog.
Gyro 1.0 was released in 2003 which implies that is was created after the work has been done in the real .NET Framework source code, I assume that Microsoft Research wanted to publish the ‘Rotor’ implementation so it could be studied more widely. Gyro is also referenced in one Don Syme’s posts, from Some History: 2001 “GC#” research project draft, from the MSR Cambridge team:
With Dave Berry’s help we later published a version of the corresponding code as the “Gyro” variant of the “Rotor” CLI implementation.
The rest of this post will look at how generics were implemented in the Rotor source code.
Note: There are some significant differences between the Rotor source code and the real .NET framework. Most notably the JIT and GC are completely different implementations (due to licensing issues, listen to DotNetRocks show 360 - Ted Neward and Joel Pobar on Rotor 2.0 for more info). However, the Rotor source does give us an accurate idea about how other core parts of the CLR are implemented, such as the Type-System, Debugger, AppDomains and the VM itself. It’s interesting to compare the Rotor source with the current CoreCLR source and see how much of the source code layout and class names have remained the same.
Implementation
To make things easier for anyone who wants to follow-along, I created a GitHub repo that contains the Rotor code for .NET 1.0 and then checked in the Gyro source code on top, which means that you can see all the changes in one place:
The first thing you notice in the Gyro source is that all the files contain this particular piece of legalese:
; By using this software in any fashion, you are agreeing to be bound by the
; terms of this license.
;
+; This file contains modifications of the base SSCLI software to support generic
+; type definitions and generic methods. These modifications are for research
+; purposes. They do not commit Microsoft to the future support of these or
+; any similar changes to the SSCLI or the .NET product. -- 31st October, 2002.
+;
; You must not remove this notice, or any other, from this software.
It’s funny that they needed to add the line ‘They do not commit Microsoft to the future support of these or any similar changes to the SSCLI or the .NET product’, even though they were just a few months away from doing just that!!
Components (Directories) with the most changes
To see where the work was done, lets start with a high-level view, showing the directories with a significant amount of changes (> 1% of the total changes):
$ git diff --dirstat=lines,1 464bf98 2714cca
0.1% bcl/
14.4% csharp/csharp/sccomp/
9.1% debug/di/
11.9% debug/ee/
2.1% debug/inc/
1.9% debug/shell/
2.5% fjit/
21.1% ilasm/
1.5% ildasm/
1.2% inc/
1.4% md/compiler/
29.9% vm/
Note: fjit
is the “Fast JIT” compiler, i.e the version released with Rotor, which was significantly different to one available in the full .NET framework.
The full output from git diff --dirstat=lines,0
is available here and the output from git diff --stat
is here.
0.1% bcl/
is included only to show that very little C# code changes were needed, these were mostly plumbing code to expose the underlying C++ methods and changes to the various ToString()
methods to include generic type information, e.g. ‘Class[int,double]
’. However there are 2 more significant ones:
bcl/system/reflection/emit/opcodes.cs
(diff)
bcl/system/reflection/emit/signaturehelper.cs
(diff)
- Add the ability to parse method metadata that contains generic related information, such as methods with generic parameters.
Files with the most changes
Next, we’ll take a look at the specific classes/files that had the most changes as this gives us a really good idea about where the complexity was
Added
Deleted
Total Changes
File (click to go directly to the diff)
1794
323
1471
debug/di/module.cpp
1418
337
1081
vm/class.cpp
1335
308
1027
vm/jitinterface.cpp
1616
888
728
debug/ee/debugger.cpp
741
46
695
csharp/csharp/sccomp/symmgr.cpp
693
0
693
vm/genmeth.cpp
999
362
637
csharp/csharp/sccomp/clsdrec.cpp
926
321
605
csharp/csharp/sccomp/fncbind.cpp
559
0
559
vm/typeparse.cpp
605
156
449
vm/siginfo.cpp
417
29
388
vm/method.hpp
642
255
387
fjit/fjit.cpp
379
0
379
vm/jitinterfacegen.cpp
3045
2672
373
ilasm/parseasm.cpp
465
94
371
vm/class.h
515
163
352
debug/inc/cordb.h
339
0
339
vm/generics.cpp
733
418
315
csharp/csharp/sccomp/parser.cpp
471
169
302
debug/shell/dshell.cpp
382
88
294
csharp/csharp/sccomp/import.cpp
Components of the Runtime
Now we’ll look at individual components in more detail so we can get an idea of how different parts of the runtime had to change to accommodate generics.
Type System changes
Not surprisingly the bulk of the changes are in the Virtual Machine (VM) component of the CLR and related to the ‘Type System’. Obviously adding ‘parameterised types’ to a type system that didn’t already have them requires wide-ranging and significant changes, which are shown in the list below:
vm/class.cpp
(diff
)
- Allow the type system to distinguish between open and closed generic types and provide APIs to allow working them, such as
IsGenericVariable()
and GetGenericTypeDefinition()
vm/genmeth.cpp
(diff)
- Contains the bulk of the functionality to make ‘generic methods’ possible, i.e.
MyMethod(T item, U filter)
, including to work done to enable ‘shared instantiation’ of generic methods
vm/typeparse.cpp
(diff)
- Changes needed to allow generic types to be looked-up by name, i.e. ‘
MyClass[System.Int32]
’
vm/siginfo.cpp
(diff)
- Adds the ability to work with ‘generic-related’ method signatures
vm/method.hpp
(diff) and vm/method.cpp
(diff)
- Provides the runtime with generic related methods such as
IsGenericMethodDefinition()
, GetNumGenericMethodArgs()
and GetNumGenericClassArgs()
vm/generics.cpp
(diff)
- All the completely new ‘generics’ specific code is in here, mostly related to ‘shared instantiation’ which is explained below
The main place that the implementation of generics in the CLR differs from the JVM is that they are ‘fully reified’ instead of using ‘type erasure’, this was possible because the CLR designers were willing to break backwards compatibility, whereas the JVM had been around longer so I assume that this was a much less appealing option. For more discussion on this issue see Erasure vs reification and Reified Generics for Java. Update: this HackerNews discussion is also worth a read.
The specific changes made to the .NET Intermediate Language (IL) op-codes can be seen in the inc/opcode.def
(diff), in essence the following 3 instructions were added
In addition the IL Assembler
tool (ILASM) needed significant changes as well as it’s counter part `IL Disassembler (ILDASM) so it could handle the additional instructions.
There is also a whole section titled ‘Support for Polymorphism in IL’ that explains these changes in greater detail in Design and Implementation of Generics for the .NET Common Language Runtime
Shared Instantiations
From Design and Implementation of Generics for the .NET Common Language Runtime
Two instantiations are compatible if for any parameterized class its
compilation at these instantiations gives rise to identical code and
other execution structures (e.g. field layout and GC tables), apart
from the dictionaries described below in Section 4.4. In particular,
all reference types are compatible with each other, because the
loader and JIT compiler make no distinction for the purposes of
field layout or code generation. On the implementation for the Intel
x86, at least, primitive types are mutually incompatible, even
if they have the same size (floats and ints have different parameter
passing conventions). That leaves user-defined struct types, which
are compatible if their layout is the same with respect to garbage
collection i.e. they share the same pattern of traced pointers
From a comment with more info:
// For an generic type instance return the representative within the class of
// all type handles that share code. For example,
// --> ,
// --> ,
// --> ,
// --> ,
// -->
//
// If the code for the type handle is not shared then return
// the type handle itself.
In addition, this comment explains the work that needs to take place to allow shared instantiations when working with generic methods.
Update: If you want more info on the ‘code-sharing’ that takes places, I recommend reading these 4 posts:
Compiler and JIT Changes
If seems like almost every part of the compiler had to change to accommodate generics, which is not surprising given that they touch so many parts of the code we write, Types
, Classes
and Methods
. Some of the biggest changes were:
csharp/csharp/sccomp/clsdrec.cpp
- +999 -363 - (diff)
csharp/csharp/sccomp/emitter.cpp
- +347 -127 - (diff)
csharp/csharp/sccomp/fncbind.cpp
- +926 -321 - (diff)
csharp/csharp/sccomp/import.cpp
- +382 - 88 - (diff)
csharp/csharp/sccomp/parser.cpp
- +733 -418 - (diff)
csharp/csharp/sccomp/symmgr.cpp
- +741 -46 - (diff)
In the ‘just-in-time’ (JIT) compiler extra work was needed because it’s responsible for implementing the additional ‘IL Instructions’. The bulk of these changes took place in fjit.cpp
(diff) and fjitdef.h
(diff).
Finally, a large amount of work was done in vm/jitinterface.cpp
(diff) to enable the JIT to access the extra information it needed to emit code for generic methods.
Debugger Changes
Last, but by no means least, a significant amount of work was done to ensure that the debugger could understand and inspect generics types. It goes to show just how much inside information a debugger needs to have of the type system in an managed language.
debug/ee/debugger.cpp
(diff)
debug/ee/debugger.h
(diff)
debug/di/module.cpp
(diff)
debug/di/rsthread.cpp
(diff)
debug/shell/dshell.cpp
(diff)
Further Reading
If you want even more information about generics in .NET, there are also some very useful design docs available (included in the Gyro source code download):
Also Pre-compilation for .NET Generics by Andrew Kennedy & Don Syme (pdf) is an interesting read
Fri, 2 Mar 2018, 12:00 am
Resources for Learning about .NET Internals
It all started with a tweet, which seemed to resonate with people:
If you like reading my posts on .NET internals, you'll like all these other blogs. So I've put them together in a thread for you!!
— Matt Warren (@matthewwarren)
January 12, 2018
The aim was to list blogs that specifically cover .NET internals at a low-level or to put it another way, blogs that answer the question how does feature ‘X’ work, under-the-hood. The list includes either typical posts for that blog, or just some of my favourites!
Note: for a wider list of .NET and performance related blogs see Awesome .NET Performance by Adam Sitnik
I wouldn’t recommend reading through the entire list, at least not in one go, your brain will probably melt. Picks some posts/topics that interest you and start with those.
Finally, bear in mind that some of the posts are over 10 years old, so there’s a chance that things have changed since then (however, in my experience, the low-levels parts of the CLR are more stable). If you want to double-check the latest behaviour, you’re best option is to read the source!
These blogs are all written by non-Microsoft employees (AFAICT), or if they do work for Microsoft, they don’t work directly on the CLR. If I’ve missed any interesting blogs out, please let me know!
Special mention goes to Sasha Goldshtein, he’s been blogging about this longer than anyone!!
Update: I missed out a few blogs and learnt about some new ones:
Honourable mention goes to .NET Type Internals - From a Microsoft CLR Perspective on CodeProject, it’s a great article!!
Book of the Runtime (BotR)
The BotR deserves it’s own section (thanks to svick to reminding me about it).
If you haven’t heard of the BotR before, there’s a nice FAQ that explains what it is:
The Book of the Runtime is a set of documents that describe components in the CLR and BCL. They are intended to focus more on architecture and invariants and not an annotated description of the codebase.
It was originally created within Microsoft in ~2007, including this document. Developers were responsible to document their feature areas. This helped new devs joining the team and also helped share the product architecture across the team.
To find your way around it, I recommend starting with the table of contents and then diving in.
Note: It’s written for developers working on the CLR, so it’s not an introductory document. I’d recommend reading some of the other blog posts first, then referring to the BotR once you have the basic knowledge. For instance many of my blog posts started with me reading a chapter from the BotR, not fully understanding it, going away and learning some more, writing up what I found and then pointing people to the relevant BotR page for more information.
Microsoft Engineers
The blogs below are written by the actual engineers who worked on, designed or managed various parts of the CLR, so they give a deep insight (again, if I’ve missed any blogs out, please let me know):
- Maoni’s WebLog - CLR Garbage Collector by Maoni Stephens
- cbrumme’s WebLog by Christopher Brumme
- A blog on coding, .NET, .NET Compact Framework and life in general.. by Abhinaba Basu
- Joel Pobar’s CLR weblog - CLR Program Manager: Reflection, LCG, Generics and the type system.. by Joel Pobar
- CLR Profiling API Blog - Info about the Common Language Runtime’s Profiling API by David Broman (slightly niche, but still worth a read)
- Yun Jin’s WebLog CLR internals, Rotor code explanation, CLR debugging tips, trivial debugging notes, .NET programming pitfalls by Yun Jin
- JIT, NGen, and other Managed Code Generation Stuff - Details about RyuJIT stuff of all sort.. by various
- Distributed Matters - Troubleshooting issues in technologies available to developers for building distributed applications by Carlo
- B# .NET Blog - BART DE SMET’S on-line blog (0X2B | ~0X2B, THAT’S THE QUESTION) by Bart De Smet
- Nate McMaster’s blog by Nate McMaster
Books
Finally, if you prefer reading off-line there are some decent books that discuss .NET Internals (Note: all links are Amazon Affiliate links):
All the books listed above I own copies of and I’ve read cover-to-cover, they’re fantastic resources.
I’ve also been recently recommend the 2 books below, they look good and certainly the authors know their stuff, but I haven’t read them yet:
*New Release*
Discuss this post on HackerNews and /r/programming
Mon, 22 Jan 2018, 12:00 am
A look back at 2017
I’ve now been blogging consistently for over 2 years (~2 times per/month) and I decided it was time for my first ‘retrospective’ post.
Warning this post contains a large amount of humble brags, if you’ve come here to read about ‘.NET internals’ you’d better check back in a few weeks, when normal service will be resumed!
Overall Stats
Firstly, lets looks at my Google Analytics stats for 2017, showing Page Views and Sessions:
Which clearly shows that I took a bit of a break during the summer! But I still managed over 800K page views, mostly because I was fortunate enough to end up on the front page of HackerNews a few times!
As a comparison, here’s what ‘2017 v 2016’ looks like:
This is cool because it shows a nice trend, more people read my blog posts in 2017 than in 2016 (but I have no idea if it will continue in 2018?!)
Most Read Posts
Next, here are my top 10 most read posts. Surprising enough my most read post was literally just a list with 68 entries in it!!
Post
Page Views
The 68 things the CLR does before executing a single line of your code
101,382
A Hitchhikers Guide to the CoreCLR Source Code
61,169
A DoS Attack against the C# Compiler
50,884
Analysing C# code on GitHub with BigQuery
40,165
Adding a new Bytecode Instruction to the CLR
39,101
Open Source .NET – 3 years later
36,316
How do .NET delegates work?
36,047
Lowering in the C# Compiler (and what happens when you misuse it)
34,375
How the .NET Runtime loads a Type
32,813
DotNetAnywhere: An Alternative .NET Runtime
26,140
Traffic Sources
I was going to do a write-up on where/how I get my blog traffic, but instead I’d encourage you to read 6 Years of Thoughts on Programming by Henrik Warne as his experience exactly matches mine. But in summary, getting onto the front-page of HackerNews drives a lot of traffic to your site/blog.
Finally, a big thanks to everyone who has read, commented on or shared my blogs posts, it means a lot!!
Sun, 31 Dec 2017, 12:00 am
Open Source .NET – 3 years later
A little over 3 years ago Microsoft announced that they were open sourcing large parts of the .NET framework and as Scott Hanselman said in his Connect 2016 keynote, the community has been contributing in a significant way:
This post forms part of an on-going series, if you want to see how things have changed over time you can check out the previous ones:
In addition, I’ve recently done a talk covering this subject, the slides are below:
Microsoft & open source a 'brave new world' - CORESTART 2.0 from
Matt Warren
Historical Perspective
Now that we are 3 years down the line, it’s interesting to go back and see what the aims were when it all started. If you want to know more about this, I recommend watching the 2 Channel 9 videos below, made by the Microsoft Engineers involved in the process:
It hasn’t always been plain sailing, it’s fair to say that there have been a few bumps along the way (I guess that’s what happens if you get to see “how the sausage gets made”), but I think that we’ve ended up in a good place.
During the past 3 years there have been a few notable events that I think are worth mentioning:
Repository activity over time
But onto the data, first we are going to look at an overview of the level of activity in each repo, by looking at the total number of ‘Issues’ (created) or ‘Pull Requests’ (closed) per month. (yay sparklines FTW!!). If you are interested in how I got the data, see the previous post because the process is the same.
Note: Numbers in black are from the most recent month, with the red dot showing the lowest and the green dot the highest previous value. You can toggle between Issues and Pull Requests by clicking on the buttons, hover over individual sparklines to get a tooltip showing the per/month values and click on the project name to take you to the GitHub page for that repository.
Issues
Pull Requests
This data gives a good indication of how healthy different repos are, are they growing over time, or staying the same. You can also see the different levels of activity each repo has and how they compare to other ones.
Whilst it’s clear that Visual Studio Code is way ahead of all the other repos in terms of ‘Issues’, it’s interesting to see that the .NET-only ones have the most ‘Pull-Requests’, notably CoreFX (Base Class Libraries), Roslyn (Compiler) and CoreCLR (Runtime).
Next will will look at the total participation from the last 3 years, i.e. November 2014 to November 2017. All Pull Requests are Issues are treated equally, so a large PR counts the same as one that fixes a spelling mistake. Whilst this isn’t ideal it’s the simplest way to get an idea of the Microsoft/Community split.
Note: You can hover over the bars to get the actual numbers, rather than percentages.
Issues:
Microsoft
Community
Pull Requests:
Microsoft
Community
Finally we can see the ‘per-month’ data from the last 3 years, i.e. November 2014 to November 2017.
Note: You can inspect different repos by selecting them from the pull-down list, but be aware that the y-axis on the graphs are re-scaled, so the maximum value will change each time.
Issues:
Microsoft
Community
Pull Requests:
Microsoft
Community
Summary
It’s clear that the community continues to be invested in the .NET-related, Open Source repositories, contributing significantly and for a sustained period of time. I think this is good for all .NET developers, whether you contribute to OSS or not, having .NET be a thriving, Open Source product has many benefits!
Discuss this post on Hacker News and /r/programming
Tue, 19 Dec 2017, 12:00 am
A look at the internals of 'Tiered JIT Compilation' in .NET Core
The .NET runtime (CLR) has predominantly used a just-in-time (JIT) compiler to convert your executable into machine code (leaving aside ahead-of-time (AOT) scenarios for the time being), as the official Microsoft docs say:
At execution time, a just-in-time (JIT) compiler translates the MSIL into native code. During this compilation, code must pass a verification process that examines the MSIL and metadata to find out whether the code can be determined to be type safe.
But how does that process actually work?
The same docs give us a bit more info:
JIT compilation takes into account the possibility that some code might never be called during execution. Instead of using time and memory to convert all the MSIL in a PE file to native code, it converts the MSIL as needed during execution and stores the resulting native code in memory so that it is accessible for subsequent calls in the context of that process. The loader creates and attaches a stub to each method in a type when the type is loaded and initialized. When a method is called for the first time, the stub passes control to the JIT compiler, which converts the MSIL for that method into native code and modifies the stub to point directly to the generated native code. Therefore, subsequent calls to the JIT-compiled method go directly to the native code.
Simple really!! However if you want to know more, the rest of this post will explore this process in detail.
In addition, we will look at a new feature that is making its way into the Core CLR, called ‘Tiered Compilation’. This is a big change for the CLR, up till now .NET methods have only been JIT compiled once, on their first usage. Tiered compilation is looking to change that, allowing methods to be re-compiled into a more optimised version much like the Java Hotspot compiler.
How it works
But before we look at future plans, how does the current CLR allow the JIT to transform a method from IL to native code? Well, they say ‘a pictures speaks a thousand words’
Before the method is JITed
After the method has been JITed
The main things to note are:
- The CLR has put in a ‘precode’ and ‘stub’ to divert the initial method call to the
PreStubWorker()
method (which ultimately calls the JIT). These are hand-written assembly code fragments consisting of only a few instructions.
- Once the method had been JITed into ‘native code’, a stable entry point it created. For the rest of the life-time of the method the CLR guarantees that this won’t change, so the rest of the run-time can depend on it remaining stable.
- The ‘temporary entry point’ doesn’t go away, it’s still available because there may be other methods that are expecting to call it. However the associated ‘precode fixup’ has been re-written or ‘back patched’ to point to the newly created ‘native code’ instead of
PreStubWorker()
.
- The CLR doesn’t change the address of the
call
instruction in the method that called the method being JITted, it only changes the address inside the ‘precode’. But because all method calls in the CLR go via a precode, the 2nd time the newly JITed method is called, the call will end up at the ‘native code’.
For reference, the ‘stable entry point’ is the same memory location as the IntPtr
that is returned when you call the RuntimeMethodHandle.GetFunctionPointer() method.
If you want to see this process in action for yourself, you can either re-compile the CoreCLR source and add the relevant debug information as I did or just use WinDbg and follow the steps in this excellent blog post (for more on the same topic see ‘Advanced Call Processing in the CLR’ and Vance Morrison’s excellent write-up ‘Digging into interface calls in the .NET Framework: Stub-based dispatch’).
Finally, the different parts of the Core CLR source code that are involved are listed below:
Note: this post isn’t going to look at how the JIT itself works, if you are interested in that take a look as this excellent overview written by one of the main developers.
JIT and Execution Engine (EE) Interaction
The make all this work the JIT and the EE have to work together, to get an idea of what is involved, take a look at this comment describing the rules that determine which type of precode the JIT can use. All this info is stored in the EE as it’s the only place that has the full knowledge of what a method does, so the JIT has to ask which mode to work in.
In addition, the JIT has to ask the EE what the address of a functions entry point is, this is done via the following methods:
Precode and Stubs
There are different types or ‘precode’ available, ‘FIXUP’, ‘REMOTING’ or ‘STUB’, you can see the rules for which one is used in MethodDesc::GetPrecodeType(). In addition, because they are such a low-level mechanism, they are implemented differently across CPU architectures, from a comment in the code:
There two implementation options for temporary entrypoints:
(1) Compact entrypoints. They provide as dense entrypoints as possible, but can’t be patched to point to the final code. The call to unjitted method is indirect call via slot.
(2) Precodes. The precode will be patched to point to the final code eventually, thus the temporary entrypoint can be embedded in the code.
The call to unjitted method is direct call to direct jump.
We use (1) for x86 and (2) for 64-bit to get the best performance on each platform. For ARM (1) is used.
There’s also a whole lot more information about ‘precode’ available in the BOTR.
Finally, it turns out that you can’t go very far into the internals of the CLR without coming across ‘stubs’ (or ‘trampolines’, ‘thunks’, etc), for instance they’re used in
Tiered Compilation
Before we go any further I want to point out that Tiered Compilation is very much work-in-progress. As an indication, to get it working you currently have to set an environment variable called COMPLUS_EXPERIMENTAL_TieredCompilation
. It appears that the current work is focussed on the infrastructure to make it possible (i.e. CLR changes), then I assume that there has to be a fair amount of testing and performance analysis before it’s enabled by default.
If you want to learn about the goals of the feature and how it fits into the wider process of ‘code versioning’, I recommend reading the excellent design docs, including the future roadmap possibilities.
To give an indications of what has been involved so far, there has been work going on in the:
If you want to follow along you can take a look at the related issues/PRs, here are the main ones to get you started:
There is also some nice background information available in Introduce a tiered JIT and if you want to understand how it will eventually makes use of changes in the JIT (‘MinOpts’), take a look at Low Tier Back-Off and JIT: enable aggressive inline policy for Tier1.
History - ReJIT
As an quick historical aside, you have previously been able to get the CLR to re-JIT a method for you, but it only worked with the Profiling APIs, which meant you had to write some C/C++ COM code to make it happen! In addition ReJIT only allowed the method to be re-compiled at the same level, so it wouldn’t ever produce more optimised code. It was mostly meant to help monitoring or profiling tools.
How it works
Finally, how does it work, again lets look at some diagrams. Firstly, as a recap, lets take a look at how things ends up once a method had been JITed, with tiered compilation turned off (the same diagram as above):
Now, as a comparison, here’s what the same stage looks like with tiered compilation enabled:
The main difference is that tiered compilation has forced the method call to go through another level of indirection, the ‘pre stub’. This is to make it possible to count the number of times the method is called, then once it has hit the threshold (currently 30), the ‘pre stub’ is re-written to point to the ‘optimised native code’ instead:
Note that the original ‘native code’ is still available, so if needed the changes can be reverted and the method call can go back to the unoptimised version.
Using a counter
We can see a bit more details about the counter in this comments from prestub.cpp:
/*************************** CALL COUNTER ***********************/
// If we are counting calls for tiered compilation, leave the prestub
// in place so that we can continue intercepting method invocations.
// When the TieredCompilationManager has received enough call notifications
// for this method only then do we back-patch it.
BOOL fCanBackpatchPrestub = TRUE;
#ifdef FEATURE_TIERED_COMPILATION
BOOL fEligibleForTieredCompilation = IsEligibleForTieredCompilation();
if (fEligibleForTieredCompilation)
{
CallCounter * pCallCounter = GetCallCounter();
fCanBackpatchPrestub = pCallCounter->OnMethodCalled(this);
}
#endif
In essence the ‘stub’ calls back into the TieredCompilationManager until the ‘tiered compilation’ is triggered, once that happens the ‘stub’ is ‘back-patched’ to stop it being called any more.
Why not ‘Interpreted’?
If you’re wondering why tiered compilation doesn’t have an interpreted mode, you’re not alone, I asked the same question (for more info see my previous post on the .NET Interpreter)
And the answer I got was:
There’s already an Interpreter available, or is it not considered suitable for production code?
Its a fine question, but you guessed correctly - the interpreter is not in good enough shape to run production code as-is. There are also some significant issues if you want debugging and profiling tools to work (which we do). Given enough time and effort it is all solvable, it just isn’t the easiest place to start.
How different is the overhead between non-optimised and optimised JITting?
On my machine non-optimized jitting used about ~65% of the time that optimized jitting took for similar IL input sizes, but of course I expect results will vary by workload and hardware. Getting this first step checked in should make it easier to collect better measurements.
But that’s from a few months ago, maybe Mono’s New .NET Interpreter will change things, who knows?
Why not LLVM?
Finally, why aren’t they using a LLVM to compile the code, from Introduce a tiered JIT (comment)
There were (and likely still are) significant differences in the LLVM support needed for the CLR versus what is needed for Java, both in GC and in EH, and in the restrictions one must place on the optimizer. To cite just one example: the CLRs GC currently cannot tolerate managed pointers that point off the end of objects. Java handles this via a base/derived paired reporting mechanism. We’d either need to plumb support for this kind of paired reporting into the CLR or restrict LLVM’s optimizer passes to never create these kinds of pointers. On top of that, the LLILC jit was slow and we weren’t sure ultimately what kind of code quality it might produce.
So, figuring out how LLILC might fit into a potential multi-tier approach that did not yet exist seemed (and still seems) premature. The idea for
now is to get tiering into the framework and use RyuJit for the second-tier jit. As we learn more, we may discover there is indeed room for higher tier jits, or, at least, understand better what else we need to do before such things make sense.
There is more background info in Introduce a tiered JIT
Summary
One of my favourite side-effects of Microsoft making .NET Open Source and developing out in the open is that we can follow along with work-in-progress features. It’s great being able to download the latest code, try them out and see how they work under-the-hood, yay for OSS!!
Discuss this post on Hacker News
Fri, 15 Dec 2017, 12:00 am
Exploring the BBC micro:bit Software Stack
If you grew up in the UK and went to school during the 1980’s or 1990’s there’s a good chance that this picture brings back fond memories:
(image courtesy of Classic Acorn)
I’d imagine that for a large amount of computer programmers (currently in their 30’s) the BBC Micro was their first experience of programming. If this applies to you and you want a trip down memory lane, have a read of Remembering: The BBC Micro and The BBC Micro in my education.
Programming the classic Turtle was done in Logo, with code like this:
FORWARD 100
LEFT 90
FORWARD 100
LEFT 90
FORWARD 100
LEFT 90
FORWARD 100
LEFT 90
Of course, once you knew what you were doing, you would re-write it like so:
REPEAT 4 [FORWARD 100 LEFT 90]
BBC micro:bit
The original Micro was launched as an education tool, as part of the BBC’s Computer Literacy Project and by most accounts was a big success. As a follow-up, in March 2016 the micro:bit was launched as part of the BBC’s ‘Make it Digital’ initiative and 1 million devices were given out to schools and libraries in the UK to ‘help develop a new generation of digital pioneers’ (i.e. get them into programming!)
Aside: I love the difference in branding across 30 years, ‘BBC Micro’ became ‘BBC micro:bit’ (you must include the colon) and ‘Computer Literacy Project’ changed to the ‘Make it Digital Initiative’.
A few weeks ago I walked into my local library, picked up a nice starter kit and then spent a fun few hours watching my son play around with it (I’m worried about how quickly he picked up the basics of programming, I think I might be out of a job in a few years time!!)
However once he’d gone to bed it was all mine! The result of my ‘playing around’ is this post, in it I will be exploring the software stack that makes up the micro:bit, what’s in it, what it does and how it all fits together.
If you want to learn about how to program the micro:bit, its hardware or anything else, take a look at this excellent list of resources.
Slightly off-topic, but if you enjoy reading source code you might like these other posts:
BBC micro:bit Software Stack
If we take a high-level view at the stack, it divides up into 3 discrete software components that all sit on top of the hardware itself:
If you would like to build this stack for yourself take a look at the Building with Yotta guide. I also found this post describing The First Video Game on the BBC micro:bit [probably] very helpful.
Runtimes
There are several high-level runtimes available, these are useful because they let you write code in a language other than C/C++ or even create programs by dragging blocks around on a screen. The main ones that I’ve come across are below (see ‘Programming’ for a full list):
They both work in a similar way, the users code (Python or TypeScript) is bundled up along with the C/C++ code of the runtime itself and then the entire binary (hex) file is deployed to the micro:bit. When the device starts up, the runtime then looks for the users code at a known location in memory and starts interpreting it.
Update It turns out that I was wrong about the Microsoft PXT, it actually compiles your TypeScript program to native code, very cool! Interestingly, they did it that way because:
Compared to a typical dynamic JavaScript engine, PXT compiles code statically, giving rise to significant time and space performance improvements:
- user programs are compiled directly to machine code, and are never in any byte-code form that needs to be interpreted; this results in much faster execution than a typical JS interpreter
- there is no RAM overhead for user-code - all code sits in flash; in a dynamic VM there are usually some data-structures representing code
- due to lack of boxing for small integers and static class layout the memory consumption for objects is around half the one you get in a dynamic VM (not counting the user-code structures mentioned above)
- while there is some runtime support code in PXT, it’s typically around 100KB smaller than a dynamic VM, bringing down flash consumption and leaving more space for user code
The execution time, RAM and flash consumption of PXT code is as a rule of thumb 2x of compiled C code, making it competitive to write drivers and other user-space libraries.
Memory Layout
Just before we go onto the other parts of the software stack I want to take a deeper look at the memory layout. This is important because memory is so constrained on the micro:bit, there is only 16KB of RAM. To put that into perspective, we’ll use the calculation from this StackOverflow question How many bytes of memory is a tweet?
Twitter uses UTF-8 encoded messages. UTF-8 code points can be up to six four octets long, making the maximum message size 140 x 4 = 560 8-bit bytes.
If we re-calculate for the newer, longer tweets 280 x 4 = 1,120 bytes. So we could only fit 10 tweets into the available RAM on the micro:bit (it turns out that only ~11K out of the total 16K is available for general use). Which is why it’s worth using a custom version of atoi() to save 350 bytes of RAM!
The memory layout is specified by the linker at compile-time using NRF51822.ld, there is a sample output available if you want to take a look. Because it’s done at compile-time you run into build errors such as “region RAM overflowed with stack” if you configure it incorrectly.
The table below shows the memory layout from the ‘no SD’ version of a ‘Hello World’ app, i.e. with the maximum amount of RAM available as the Bluetooth (BLE) Soft-Device (SD) support has been removed. By comparison with BLE enabled, you instantly have 8K less RAM available, so things start to get tight!
Name
Start Address
End Address
Size
Percentage
.data
0x20000000
0x20000098
152 bytes
0.93%
.bss
0x20000098
0x20000338
672 bytes
4.10%
Heap (mbed)
0x20000338
0x20000b38
2,048 bytes
12.50%
Empty
0x20000b38
0x20003800
11,464 bytes
69.97%
Stack
0x20003800
0x20004000
2,048 bytes
12.50%
For more info on the column names see the Wikipedia pages for .data and .bss as well as text, data and bss: Code and Data Size Explained
As a comparison there is a nice image of the micro:bit RAM Layout in this article. It shows what things look like when running MicroPython and you can clearly see the main Python heap in the centre taking up all the remaining space.
Sitting in the stack below the high-level runtime is the device abstraction layer (DAL), created at Lancaster University in the UK, it’s made up of 4 main components:
- core
- High-level components, such as
Device
, Font
, HeapAllocator
, Listener
and Fiber
, often implemented on-top of 1 or more driver
classes
- types
- Helper types such as
ManagedString
, Image
, Event
and PacketBuffer
- drivers
- For control of a specific hardware component, such as
Accelerometer
, Button
, Compass
, Display
, Flash
, IO
, Serial
and Pin
- bluetooth
- asm
- Just 4 functions are implemented in assembly, they are
swap_context
, save_context
, save_register_context
and restore_register_context
. As the names suggest, they handle the ‘context switching’ necessary to make the MicroBit Fiber scheduler work
The image below shows the distribution of ‘Lines of Code’ (LOC), as you can see the majority of the code is in the drivers
and bluetooth
components.
In addition to providing nice helper classes for working with the underlying devices, the DAL provides the Fiber
abstraction to allows asynchronous functions to work. This is useful because you can asynchronously display text on the LED display and your code won’t block whilst it’s scrolling across the screen. In addition the Fiber
class is used to handle the interrupts that signal when the buttons on the micro:bit are pushed. This comment from the code clearly lays out what the Fiber scheduler does:
This lightweight, non-preemptive scheduler provides a simple threading mechanism for two main purposes:
1) To provide a clean abstraction for application languages to use when building async behaviour (callbacks).
2) To provide ISR decoupling for EventModel events generated in an ISR context.
Finally the high-level classes MicroBit.cpp and MicroBit.h are housed in the microbit repository. These classes define the API of the MicroBit runtime and setup the default configuration, as shown in the Constructor
of MicroBit.cpp
:
/**
* Constructor.
*
* Create a representation of a MicroBit device, which includes member variables
* that represent various device drivers used to control aspects of the micro:bit.
*/
MicroBit::MicroBit() :
serial(USBTX, USBRX),
resetButton(MICROBIT_PIN_BUTTON_RESET),
storage(),
i2c(I2C_SDA0, I2C_SCL0),
messageBus(),
display(),
buttonA(MICROBIT_PIN_BUTTON_A, MICROBIT_ID_BUTTON_A),
buttonB(MICROBIT_PIN_BUTTON_B, MICROBIT_ID_BUTTON_B),
buttonAB(MICROBIT_ID_BUTTON_A,MICROBIT_ID_BUTTON_B, MICROBIT_ID_BUTTON_AB),
accelerometer(i2c),
compass(i2c, accelerometer, storage),
compassCalibrator(compass, accelerometer, display),
thermometer(storage),
io(MICROBIT_ID_IO_P0,MICROBIT_ID_IO_P1,MICROBIT_ID_IO_P2,
MICROBIT_ID_IO_P3,MICROBIT_ID_IO_P4,MICROBIT_ID_IO_P5,
MICROBIT_ID_IO_P6,MICROBIT_ID_IO_P7,MICROBIT_ID_IO_P8,
MICROBIT_ID_IO_P9,MICROBIT_ID_IO_P10,MICROBIT_ID_IO_P11,
MICROBIT_ID_IO_P12,MICROBIT_ID_IO_P13,MICROBIT_ID_IO_P14,
MICROBIT_ID_IO_P15,MICROBIT_ID_IO_P16,MICROBIT_ID_IO_P19,
MICROBIT_ID_IO_P20),
bleManager(storage),
radio(),
ble(NULL)
{
...
}
The software at the bottom of the stack is making use of the ARM mbed OS which is:
.. an open-source embedded operating system designed for the “things” in the Internet of Things (IoT). mbed OS includes the features you need to develop a connected product using an ARM Cortex-M microcontroller.
mbed OS provides a platform that includes:
- Security foundations.
- Cloud management services.
- Drivers for sensors, I/O devices and connectivity.
mbed OS is modular, configurable software that you can customize it to your device and to reduce memory requirements by excluding unused software.
We can see this from the layout of it’s source, it’s based around common
components, which can be combined with a hal
(Hardware Abstraction Layers) and a target
specific to the hardware you are running on.
More specifically the micro:bit uses the yotta target bbc-microbit-classic-gcc
, but it can also use others targets as needed.
For reference, here are the files from the common
section of mbed
that are used by the micro:bit-dal
:
And here are the hardware specific files, targeting the NORDIC - MCU NRF51822
:
End-to-end (or top-to-bottom)
Finally, lets look a few examples of how the different components within the stack are used in specific scenarios
Writing to the Display
Storing files on the Flash memory
- microbit-dal
- mbed-classic
- Allows low-level control of the hardware, such as writing to the flash itself either directly or via the SoftDevice (SD) layer
In addition, this comment from MicroBitStorage.h gives a nice overview of how the file system is implemented on-top of the raw flash storage:
* The first 8 bytes are reserved for the KeyValueStore struct which gives core
* information such as the number of KeyValuePairs in the store, and whether the
* store has been initialised.
*
* After the KeyValueStore struct, KeyValuePairs are arranged contiguously until
* the end of the block used as persistent storage.
*
* |-------8-------|--------48-------|-----|---------48--------|
* | KeyValueStore | KeyValuePair[0] | ... | KeyValuePair[N-1] |
* |---------------|-----------------|-----|-------------------|
Summary
All-in-all the micro:bit is a very nice piece of kit and hopefully will achieve its goal ‘to help develop a new generation of digital pioneers’. However, it also has a really nice software stack, one that is easy to understand and find your way around.
Further Reading
I’ve got nothing to add that isn’t already included in this excellent, comprehensive list of resources, thanks Carlos for putting it together!!
Discuss this post on Hacker News or /r/microbit
Tue, 28 Nov 2017, 12:00 am
Microsoft & Open Source a 'Brave New World' - CORESTART 2.0
Recently I was fortunate enough to be invited to the CORESTART 2.0 conference to give a talk on Microsoft & Open Source a ‘Brave New World’. It was a great conference, well organised by Tomáš Herceg and the teams from .NET College and Riganti and I had a great time.
I encourage you to attend next years ‘Update’ conference if you can and as bonus you’ll get to see the sights of Prague! Including the Head of Franz Kafka as well as the amazing buildings, castles and bridges that all the guide-books will tell you about!
I’ve not been ‘invited’ to speak at a conference before, so I wasn’t sure what to expect, but there was a great audience and they seemed happy to learn about the Open Source projects that Microsoft are running and what is being done to encourage us (the ‘Community’) to contribute.
The slides for my talk are embedded below and you can also ‘watch’ the entire recording (audio and slides only, no video).
Microsoft & open source a 'brave new world' - CORESTART 2.0 from
Matt Warren
Talk Outline
But if you don’t fancy sitting through the whole thing, you can read the summary below and jump straight to the relevant parts
Before
[jump to slide] [direct video link]
During
[jump to slide] [direct video link]
After
[jump to slide] [direct video link]
What Now?
[jump to slide] [direct video link]
Domino Chain Reaction
Finally, if you’re wondering what the section on ‘Domino Chain Reaction’ is all about, you’ll have to listen to that part of the talk, but the video itself is embedded below:
(Based on actual research, see The Curious Mathematics of Domino Chain Reactions)
Tue, 14 Nov 2017, 12:00 am
A DoS Attack against the C# Compiler
Generics in C# are certainly very useful and I find it amazing that we almost didn’t get them:
What would the cost of inaction have been? What would the cost of failure have been? No generics in C# 2.0? No LINQ in C# 3.0? No TPL in C# 4.0? No Async in C# 5.0? No F#? Ultimately, an erasure model of generics would have been adopted, as for Java, since the CLR team would never have pursued a in-the-VM generics design without external help.
So a big thanks is due to Don Syme and the rest of the team at Microsoft Research in Cambridge!
But as well as being useful, I also find some usages of generics mind-bending, for instance I’m still not sure what this code actually means or how to explain it in words:
class Blah where T : Blah
As always, reading an Eric Lippert post helps a lot, but even he recommends against using this specific ‘circular’ pattern.
Recently I spoke at the CORESTART 2.0 conference in Prague, giving a talk on ‘Microsoft and Open-Source – A ‘Brave New World’. Whilst I was there I met the very knowledgeable Jiri Cincura, who blogs at tabs ↹ over ␣ ␣ ␣ spaces. He was giving a great talk on ‘C# 7.1 and 7.2 features’, but also shared with me an excellent code snippet that he called ‘Crazy Class’:
class Class
{
class Inner : Class
{
Inner.Inner.Inner.Inner.Inner.Inner.Inner.Inner.Inner inner;
}
}
He said:
this is the class that takes crazy amount of time to compile. You can add more Inner.Inner.Inner...
to make it even longer (and also generic parameters).
After a big of digging around I found that someone else had noticed this, see the StackOverflow question Why does field declaration with duplicated nested type in generic class results in huge source code increase? Helpfully the ‘accepted answer’ explains what is going on:
When you combine these two, the way you have done, something interesting happens. The type Outer.Inner
is not the same type as Outer.Inner.Inner
. Outer.Inner
is a subclass of Outer
while Outer.Inner.Inner
is a subclass of Outer
, which we established before as being different from Outer.Inner
. So Outer.Inner.Inner
and Outer.Inner
are referring to different types.
When generating IL, the compiler always uses fully qualified names for types. You have cleverly found a way to refer to types with names whose lengths that grow at exponential rates. That is why as you increase the generic arity of Outer
or add additional levels .Y
to the field field
in Inner
the output IL size and compile time grow so quickly.
Clear? Good!!
You probably have to be Jon Skeet, Eric Lippert or a member of the C# Language Design Team (yay, ‘Matt Warren’) to really understand what’s going on here, but that doesn’t stop the rest of us having fun with the code!!
I can’t think of any reason why you’d actually want to write code like this, so please don’t!! (or at least if you do, don’t blame me!!)
For a simple idea of what’s actually happening, lets take this code (with only 2 ‘Levels’):
class Class
{
class Inner : Class
{
Inner.Inner inner;
}
}
The ‘decompiled’ version actually looks like this:
internal class Class
{
private class Inner : Class
{
private Class.Inner inner;
}
}
Wow, no wonder things go wrong quickly!!
Exponential Growth
Firstly let’s check the claim of exponential growth, if you don’t remember your Big O notation you can also think of this as O(very, very bad)
!!
To test this out, I’m going to compile the code above, but vary the ‘level’ each time by adding a new .Inner
, so ‘Level 5’ looks like this:
Inner.Inner.Inner.Inner.Inner inner;
‘Level 6’ like this, and so on
Inner.Inner.Inner.Inner.Inner.Inner inner;
We then get the following results:
Level
Compile Time (secs)
Working set (KB)
Binary Size (Bytes)
5
1.15
54,288
135,680
6
1.22
59,500
788,992
7
2.00
70,728
4,707,840
8
6.43
121,852
28,222,464
9
33.23
405,472
169,310,208
10
202.10
2,141,272
CRASH
If we look at these results in graphical form, it’s very obvious what’s going on
(the dotted lines are a ‘best fit’ trend-line and they are exponential)
If I compile the code with dotnet build
(version 2.0.0), things go really wrong at ‘Level 10’ and the compiler throws an error (full stack trace):
System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values.
Which looks similar to Internal compiler error when creating Portable PDB files #3866.
However your mileage may vary, when I ran the code in Visual Studio 2015 it threw an OutOfMemoryException
instead and then promptly restarted itself!! I assume this is because VS is a 32-bit application and it runs out of memory before it can go really wrong!
Mono Compiler
As a comparison, here are the results from the Mono compiler, thanks to Egor Bogatov for putting them together.
Level
Compile Time (secs)
Memory Usage (Bytes)
5
0.480
134,144
6
0.502
786,944
7
0.745
4,706,304
8
2.053
28,220,928
9
10.134
169,308,672
10
57.307
1,015,835,136
At ‘Level 10’ it produced a 968.78 Mb binary!!
Profiling the Compiler
Finally, I want to look at just where the compiler is spending all it’s time. From the results above we saw that it was taking over 3 minutes to compile a simple program, with a peak memory usage of 2.14 GB, so what was it actually doing??
Well clearly there’s lots of Types
involved and the Compiler seems happy for you to write this code, so I guess it needs to figure it all out. Once it’s done that, it then needs to write all this Type
metadata out to a .dll or .exe, which can be 100’s of MB in size.
At a high-level the profiling summary produce by VS looks like this (click for full-size image):
However if we take a bit of a close look, we can see the ‘hot-path’ is inside the SerializeTypeReference(..)
method in Compilers/Core/Portable/PEWriter/MetadataWriter.cs
Summary
I’m a bit torn about this, it is clearly an ‘abuse’ of generics!!
In some ways I think that it shouldn’t be fixed, it seems better that the compiler encourages you to not write code like this, rather than making is possible!!
So if it takes 3 mins to compile your code, allocates 2GB of memory and then crashes, take that as a warning!!
Discuss this post on Hacker News, /r/programming and /r/csharp
The post A DoS Attack against the C# Compiler first appeared on my blog Performance is a Feature!
CodeProject
Wed, 8 Nov 2017, 12:00 am
DotNetAnywhere: An Alternative .NET Runtime
Recently I was listening to the excellent DotNetRocks podcast and they had Steven Sanderson (of Knockout.js fame) talking about ‘WebAssembly and Blazor’.
In case you haven’t heard about it, Blazor is an attempt to bring .NET to the browser, using the magic of WebAssembly. If you want more info, Scott Hanselmen has done a nice write-up of the various .NET/WebAssembly projects.
However, as much as the mention of WebAssembly was pretty cool, what interested me even more how Blazor was using DotNetAnywhere as the underlying .NET runtime. This post will look at what DotNetAnywhere is, what you can do with it and how it compares to the full .NET framework.
DotNetAnywhere
Firstly it’s worth pointing out that DotNetAnywhere (DNA) is designed to be a fully compliant .NET runtime, which means that it can run .NET dlls/exes that have been compiled to run against the full framework. On top of that (at least in theory) it supports all the following .NET runtime features, which is a pretty impressive list!
- Generics
- Garbage collection and finalization
- Weak references
- Full exception handling - try/catch/finally
- PInvoke
- Interfaces
- Delegates
- Events
- Nullable types
- Single-dimensional arrays
- Multi-threading
In addition there is some partial support for Reflection
- Very limited read-only reflection
- typeof(), .GetType(), Type.Name, Type.Namespace, Type.IsEnum(), .ToString() only
Finally, there are a few features that are currently unsupported:
- Attributes
- Most reflection
- Multi-dimensional arrays
- Unsafe code
There are various bugs or missing functionality that might prevent your code running under DotNetAnywhere, however several of these have been fixed since Blazor came along, so it’s worth checking against the Blazor version of DotNetAnywhere.
At this point in time the original DotNetAnywhere repo is no longer active (the last sustained activity was in Jan 2012), so it seems that any future development or bugs fixes will likely happen in the Blazor repo. If you have ever fixed something in DotNetAnywhere, consider sending a P.R there, to help the effort.
Update: In addition there are other forks with various bug fixes and enhancements:
Source Code Layout
What I find most impressive about the DotNetAnywhere runtime is that it was developed by one person and is less that 40,000 lines of code!! For a comparison the .NET framework Garbage Collector is almost 37,000 lines on it’s own (more info available in my previous post A Hitchhikers Guide to the CoreCLR Source Code).
This makes DotNetAnywhere an ideal learning resource!
Firstly, lets take a look at the Top-10 largest source files, to see where the complexity is:
Native Code - 17,710 lines in total
LOC
File
3,164
JIT_Execute.c
1,778
JIT.c
1,109
PInvoke_CaseCode.h
630
Heap.c
618
MetaData.c
563
MetaDataTables.h
517
Type.c
491
MetaData_Fill.c
467
MetaData_Search.c
452
JIT_OpCodes.h
Managed Code - 28,783 lines in total
LOC
File
2393
corlib/System.Globalization/CalendricalCalculations.cs
2314
corlib/System/NumberFormatter.cs
1582
System.Drawing/System.Drawing/Pens.cs
1443
System.Drawing/System.Drawing/Brushes.cs
1405
System.Core/System.Linq/Enumerable.cs
745
corlib/System/DateTime.cs
693
corlib/System.IO/Path.cs
632
corlib/System.Collections.Generic/Dictionary.cs
598
corlib/System/String.cs
467
corlib/System.Text/StringBuilder.cs
Main areas of functionality
Next, lets look at the key components in DotNetAnywhere as this gives us a really good idea about what you need to implement a .NET compatible runtime. Along the way, we will also see how they differ from the implementation found in Microsoft’s .NET Framework.
Reading .NET dlls
The first thing DotNetAnywhere has to do is read/understand/parse the .NET Metadata and Code that’s contained in a .dll/.exe. This all takes place in MetaData.c, primarily within the LoadSingleTable(..) function. By adding some debugging code, I was able to get a summary of all the different types of Metadata that are read in from a typical .NET dll, it’s quite an interesting list:
MetaData contains 1 Assemblies (MD_TABLE_ASSEMBLY)
MetaData contains 1 Assembly References (MD_TABLE_ASSEMBLYREF)
MetaData contains 0 Module References (MD_TABLE_MODULEREF)
MetaData contains 40 Type References (MD_TABLE_TYPEREF)
MetaData contains 13 Type Definitions (MD_TABLE_TYPEDEF)
MetaData contains 14 Type Specifications (MD_TABLE_TYPESPEC)
MetaData contains 5 Nested Classes (MD_TABLE_NESTEDCLASS)
MetaData contains 11 Field Definitions (MD_TABLE_FIELDDEF)
MetaData contains 0 Field RVA's (MD_TABLE_FIELDRVA)
MetaData contains 2 Propeties (MD_TABLE_PROPERTY)
MetaData contains 59 Member References (MD_TABLE_MEMBERREF)
MetaData contains 2 Constants (MD_TABLE_CONSTANT)
MetaData contains 35 Method Definitions (MD_TABLE_METHODDEF)
MetaData contains 5 Method Specifications (MD_TABLE_METHODSPEC)
MetaData contains 4 Method Semantics (MD_TABLE_PROPERTY)
MetaData contains 0 Method Implementations (MD_TABLE_METHODIMPL)
MetaData contains 22 Parameters (MD_TABLE_PARAM)
MetaData contains 2 Interface Implementations (MD_TABLE_INTERFACEIMPL)
MetaData contains 0 Implementation Maps? (MD_TABLE_IMPLMAP)
MetaData contains 2 Generic Parameters (MD_TABLE_GENERICPARAM)
MetaData contains 1 Generic Parameter Constraints (MD_TABLE_GENERICPARAMCONSTRAINT)
MetaData contains 22 Custom Attributes (MD_TABLE_CUSTOMATTRIBUTE)
MetaData contains 0 Security Info Items? (MD_TABLE_DECLSECURITY)
For more information on the Metadata see Introduction to CLR metadata, Anatomy of a .NET Assembly – PE Headers and the ECMA specification itself.
Executing .NET IL
Another large piece of functionality within DotNetAnywhere is the ‘Just-in-Time’ Compiler (JIT), i.e. the code that is responsible for executing the IL, this takes place initially in JIT_Execute.c and then JIT.c. The main ‘execution loop’ is in the JITit(..) function which contains an impressive 1,374 lines of code and over 200 case
statements within a single switch
!!
Taking a higher level view, the overall process that it goes through looks like this:
Where the .NET IL Op-Codes (CIL_XXX
) are defined in CIL_OpCodes.h and the DotNetAnywhere JIT Op-Codes (JIT_XXX
) are defined in JIT_OpCodes.h
Interesting enough, the JIT is the only place in DotNetAnywhere that uses assembly code and even then it’s only for win32
. It is used to allow a ‘jump’ or a goto
to labels in the C source code, so as IL instructions are executed it never actually leaves the JITit(..)
function, control is just moved around without having to make a full method call.
#ifdef __GNUC__
#define GET_LABEL(var, label) var = &&label
#define GO_NEXT() goto **(void**)(pCurOp++)
#else
#ifdef WIN32
#define GET_LABEL(var, label) \
{ __asm mov edi, label \
__asm mov var, edi }
#define GO_NEXT() \
{ __asm mov edi, pCurOp \
__asm add edi, 4 \
__asm mov pCurOp, edi \
__asm jmp DWORD PTR [edi - 4] }
#endif
Differences with the .NET Framework
In the full .NET framework all IL code is turned into machine code by the Just-in-Time Compiler (JIT) before being executed by the CPU.
However as we’ve already seen, DotNetAnywhere ‘interprets’ the IL, instruction-by-instruction and even through it’s done in a file called JIT.c no machine code is emitted, so the naming seems strange!?
Maybe it’s just a difference of perspective, but it’s not clear to me at what point you move from ‘interpreting’ code to ‘JITting’ it, even after reading the following links I’m not sure!! (can someone enlighten me?)
Garbage Collector
All the code for the DotNetAnywhere Garbage Collector (GC) is contained in Heap.c and is a very readable 600 lines of code. To give you an overview of what it does, here is the list of functions that it exposes:
void Heap_Init();
void Heap_SetRoots(tHeapRoots *pHeapRoots, void *pRoots, U32 sizeInBytes);
void Heap_UnmarkFinalizer(HEAP_PTR heapPtr);
void Heap_GarbageCollect();
U32 Heap_NumCollections();
U32 Heap_GetTotalMemory();
HEAP_PTR Heap_Alloc(tMD_TypeDef *pTypeDef, U32 size);
HEAP_PTR Heap_AllocType(tMD_TypeDef *pTypeDef);
void Heap_MakeUndeletable(HEAP_PTR heapEntry);
void Heap_MakeDeletable(HEAP_PTR heapEntry);
tMD_TypeDef* Heap_GetType(HEAP_PTR heapEntry);
HEAP_PTR Heap_Box(tMD_TypeDef *pType, PTR pMem);
HEAP_PTR Heap_Clone(HEAP_PTR obj);
U32 Heap_SyncTryEnter(HEAP_PTR obj);
U32 Heap_SyncExit(HEAP_PTR obj);
HEAP_PTR Heap_SetWeakRefTarget(HEAP_PTR target, HEAP_PTR weakRef);
HEAP_PTR* Heap_GetWeakRefAddress(HEAP_PTR target);
void Heap_RemovedWeakRefTarget(HEAP_PTR target);
Differences with the .NET Framework
However, like the JIT/Interpreter, the GC has some fundamental differences when compared to the .NET Framework
Conservative Garbage Collection
Firstly DotNetAnywhere implements what is knows as a Conservative GC. In simple terms this means that is does not know (for sure) which areas of memory are actually references/pointers to objects and which are just a random number (that looks like a memory address). In the Microsoft .NET Framework the JIT calculates this information and stores it in the GCInfo structure so the GC can make use of it. But DotNetAnywhere doesn’t do this.
Instead, during the Mark
phase the GC gets all the available ‘roots’, but it will consider all memory addresses within an object as ‘potential’ references (hence it is ‘conservative’). It then has to lookup each possible reference, to see if it really points to an ‘object reference’. It does this by keeping track of all memory/heap references in a balanced binary search tree (ordered by memory address), which looks something like this:
However, this means that all objects references have to be stored in the binary tree when they are allocated, which adds some overhead to allocation. In addition extra memory is needed, 20 bytes per heap entry. We can see this by looking at the tHeapEntry
data structure (all pointers are 4 bytes, U8
= 1 byte and padding
is ignored), tHeapEntry *pLink[2]
is the extra data that is needed just to enable the binary tree lookup.
struct tHeapEntry_ {
// Left/right links in the heap binary tree
tHeapEntry *pLink[2];
// The 'level' of this node. Leaf nodes have lowest level
U8 level;
// Used to mark that this node is still in use.
// If this is set to 0xff, then this heap entry is undeletable.
U8 marked;
// Set to 1 if the Finalizer needs to be run.
// Set to 2 if this has been added to the Finalizer queue
// Set to 0 when the Finalizer has been run (or there is no Finalizer in the first place)
// Only set on types that have a Finalizer
U8 needToFinalize;
// unused
U8 padding;
// The type in this heap entry
tMD_TypeDef *pTypeDef;
// Used for locking sync, and tracking WeakReference that point to this object
tSync *pSync;
// The user memory
U8 memory[0];
};
But why does DotNetAnywhere work like this? Fortunately Chris Bacon the author of DotNetAnywhere explains
Mind you, the whole heap code really needs a rewrite to reduce per-object memory overhead, and to remove the need for the binary tree of allocations. Not really thinking of a generational GC, that would probably add to much code. This was something I vaguely intended to do, but never got around to.
The current heap code was just the simplest thing to get GC working quickly. The very initial implementation did no GC at all. It was beautifully fast, but ran out of memory rather too quickly.
For more info on ‘Conservative’ and ‘Precise’ GCs see:
GC only does ‘Mark-Sweep’, it doesn’t Compact
Another area in which the GC behaviour differs is that it doesn’t do any Compaction of memory after it’s cleaned up, as Steve Sanderson found out when working on Blazor
.. During server-side execution we don’t actually need to pin anything, because there’s no interop outside .NET. During client-side execution, everything is (in effect) pinned regardless, because DNA’s GC only does mark-sweep - it doesn’t have any compaction phase.
In addition, when an object is allocated DotNetAnywhere just makes a call to malloc(), see the code that does this is in the Heap_Alloc(..) function. So there is no concept of ‘Generations’ or ‘Segments’ that you have in the .NET Framework GC, i.e. no ‘Gen 0’, ‘Gen 1’, or ‘Large Object Heap’.
Threading Model
Finally, lets take a look at the threading model, which is fundamentally different from the one found in the .NET Framework.
Differences with the .NET Framework
Whilst DotNetAnywhere will happily create new threads and execute them for you, it’s only providing the illusion of true multi-threading. In reality it only runs on one thread, but context switches between the different threads that your program creates:
You can see this in action in the code below, (from the Thread_Execute() function), note the call to JIT_Execute(..)
with numInst
set to 100
:
for (;;) {
U32 minSleepTime = 0xffffffff;
I32 threadExitValue;
status = JIT_Execute(pThread, 100);
switch (status) {
....
}
}
An interesting side-effect is that the threading code in the DotNetAnywhere corlib
implementation is really simple. For instance the internal implementation of the Interlocked.CompareExchange()
function looks like the following, note the lack of synchronisation that you would normally expect:
tAsyncCall* System_Threading_Interlocked_CompareExchange_Int32(
PTR pThis_, PTR pParams, PTR pReturnValue) {
U32 *pLoc = INTERNALCALL_PARAM(0, U32*);
U32 value = INTERNALCALL_PARAM(4, U32);
U32 comparand = INTERNALCALL_PARAM(8, U32);
*(U32*)pReturnValue = *pLoc;
if (*pLoc == comparand) {
*pLoc = value;
}
return NULL;
}
Benchmarks
As a simple test, I ran some benchmarks from The Computer Language Benchmarks Game - binary-trees, using the simplest C# version
Note: DotNetAnywhere was designed to run on low-memory devices, so it was not meant to have the same performance as the full .NET Framework. Please bear that in mind when looking at the results!!
.NET Framework, 4.6.1 - 0.36 seconds
Invoked=TestApp.exe 15
stretch tree of depth 16 check: 131071
32768 trees of depth 4 check: 1015808
8192 trees of depth 6 check: 1040384
2048 trees of depth 8 check: 1046528
512 trees of depth 10 check: 1048064
128 trees of depth 12 check: 1048448
32 trees of depth 14 check: 1048544
long lived tree of depth 15 check: 65535
Exit code : 0
Elapsed time : 0.36
Kernel time : 0.06 (17.2%)
User time : 0.16 (43.1%)
page fault # : 6604
Working set : 25720 KB
Paged pool : 187 KB
Non-paged pool : 24 KB
Page file size : 31160 KB
DotNetAnywhere - 54.39 seconds
Invoked=dna TestApp.exe 15
stretch tree of depth 16 check: 131071
32768 trees of depth 4 check: 1015808
8192 trees of depth 6 check: 1040384
2048 trees of depth 8 check: 1046528
512 trees of depth 10 check: 1048064
128 trees of depth 12 check: 1048448
32 trees of depth 14 check: 1048544
long lived tree of depth 15 check: 65535
Total execution time = 54288.33 ms
Total GC time = 36857.03 ms
Exit code : 0
Elapsed time : 54.39
Kernel time : 0.02 (0.0%)
User time : 54.15 (99.6%)
page fault # : 5699
Working set : 15548 KB
Paged pool : 105 KB
Non-paged pool : 8 KB
Page file size : 13144 KB
So clearly DotNetAnywhere doesn’t work as fast in this benchmark (0.36 seconds v 54 seconds). However if we look at other benchmarks from the same site, it performs a lot better. It seems that DotNetAnywhere has a significant overhead when allocating objects (a class
), which is less obvious when using structs
.
Benchmark 1 (using
classes
)
Benchmark 2 (using
structs
)
Elapsed Time (secs)
3.1
2.0
GC Collections
96
67
Total GC time (msecs)
983.59
439.73
Finally, I really want to thank Chris Bacon, DotNetAnywhere is a great code base and gives a fantastic insight into what needs to happen for a .NET runtime to work.
Discuss this post on Hacker News and /r/programming
The post DotNetAnywhere: An Alternative .NET Runtime first appeared on my blog Performance is a Feature!
CodeProject
Thu, 19 Oct 2017, 12:00 am
Analysing C# code on GitHub with BigQuery
Just over a year ago Google made all the open source code on GitHub available for querying within BigQuery and as if that wasn’t enough you can run a terabyte of queries each month for free!
So in this post I am going to be looking at all the C# source code on GitHub and what we can find out from it. Handily a smaller, C# only, dataset has been made available (in BigQuery you are charged per byte read), called fh-bigquery:github_extracts.contents_net_cs and has
- 5,885,933 unique ‘.cs’ files
- 792,166,632 lines of code (LOC)
- 37.17 GB of data
Which is a pretty comprehensive set of C# source code!
The rest of this post will attempt to answer the following questions:
- Tabs or Spaces?
regions
: ‘should be banned’ or ‘okay in some cases’?
- ‘K&R’ or ‘Allman’, where do C# devs like to put their braces?
- Do C# developers like writing functional code?
Then moving onto some less controversial C# topics:
- Which
using
statements are most widely used?
- What NuGet packages are most often included in a .NET project
- How many lines of code (LOC) are in a typical C# file?
- What is the most widely thrown
Exception
?
- ‘async/await all the things’ or not?
- Do C# developers like using the
var
keyword? (Updated)
Before we end up looking at repositories, not just individual C# files:
- What is the most popular repository with C# code in it?
- Just how many files should you have in a repository?
- What are the most popular C#
class
names?
- ‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?
If you want to try the queries for yourself (or find my mistakes), all of them are available in this gist. There’s a good chance that my regular expressions miss out some edge-cases, after all Regular Expressions: Now You Have Two Problems:
Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Tabs or Spaces?
In the entire data-set there are 5,885,933 files, but here we only include ones that have more than 10 lines starting with a tab or a space
Tabs
Tabs %
Spaces
Spaces %
Total
799,055
17.15%
3,859,528
82.85%
4,658,583
Clearly, C# developers (on GitHub) prefer Spaces over Tabs, let the endless debates continue!! (I think some of this can be explained by the fact that Visual Studio uses ‘spaces’ by default)
If you want to see how C# compares to other programming languages, take a look at 400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?.
regions
: ‘should be banned’ or ‘okay in some cases’?
It turns out that there are an impressive 712,498 C# files (out of 5.8 million) that contain at least one #region
statement (query used), that’s just over 12%. (I’m hoping that a lot of those files have been auto-generated by a tool!)
‘K&R’ or ‘Allman’, where do C# devs like to put their braces?
C# developers overwhelmingly prefer putting an opening brace {
on it’s own line (query used)
separate line
same line
same line (initializer)
total (with brace)
total (all code)
81,306,320 (67%)
40,044,603 (33%)
3,631,947 (2.99%)
121,350,923 (15.32%)
792,166,632
(‘same line initializers’ include code like new { Name = "", .. }
, new [] { 1, 2, 3.. }
)
Do C# developers like writing functional code?
This is slightly unscientific, but I wanted to see how widely the Lambda Operator =>
is used in C# code (query). Yes, I know, if you want to write functional code on .NET you really should use F#, but C# has become more ‘functional’ over the years and I wanted to see how much code was taking advantage of that.
Here’s the raw percentiles:
Percentile
% of lines using lambdas
10
0.51
25
1.14
50
2.50
75
5.26
90
9.95
95
14.29
99
28.00
So we can say that:
- 50% of all the C# code on GitHub uses
=>
on 2.44% (or less) of their lines.
- 10% of all C# files have lambdas on almost 1 in 10 of their lines (9.95%)
- 5% use
=>
on 1 in 7 lines (14.29%)
- 1% of files have lambdas on over 1 in 3 lines (28%) of their lines of code, that’s pretty impressive!
Which using
statements are most widely used?
Now on to some a bit more substantial, what are the most widely used using
statements in C# code?
The top 10 looks like this (the full results are available):
using statement
count
using System.Collections.Generic;
1,780,646
using System;
1,477,019
using System.Linq;
1,319,830
using System.Text;
902,165
using System.Threading.Tasks;
628,195
using System.Runtime.InteropServices;
431,867
using System.IO;
407,848
using System.Runtime.CompilerServices;
338,686
using System.Collections;
289,867
using System.Reflection;
218,369
However, as was pointed out, the top 5 are included by default when you add a new file in Visual Studio and many people wouldn’t remove them. The same applies to ‘System.Runtime.InteropServices’ and ‘System.Runtime.CompilerServices’ which are include in ‘AssemblyInfo.cs` by default.
So if we adjust the list to take account of this, the top 10 looks like so:
using statement
count
using System.IO;
407,848
using System.Collections;
289,867
using System.Reflection;
218,369
using System.Diagnostics;
201,341
using System.Threading;
179,168
using System.ComponentModel;
160,681
using System.Web;
160,323
using System.Windows.Forms;
137,003
using System.Globalization;
132,113
using System.Drawing;
127,033
Finally, an interesting list is the top 10 using statements that aren’t System
, Microsoft
or Windows
namespaces:
using statement
count
using NUnit.Framework;
119,463
using UnityEngine;
117,673
using Xunit;
99,099
using Newtonsoft.Json;
81,675
using Newtonsoft.Json.Linq;
29,416
using Moq;
23,546
using UnityEngine.UI;
20,355
using UnityEditor;
19,937
using Amazon.Runtime;
18,941
using log4net;
17,297
What NuGet packages are most often included in a .NET project?
It turns out that there is also a separate dataset containing all the ‘packages.config’ files on GitHub, it’s called contents_net_packages_config and has 104,808 entries. By querying this we can see that Json.Net is the clear winner!!
package
count
Newtonsoft.Json
45,055
Microsoft.Web.Infrastructure
16,022
Microsoft.AspNet.Razor
15,109
Microsoft.AspNet.WebPages
14,495
Microsoft.AspNet.Mvc
14,236
EntityFramework
14,191
Microsoft.AspNet.WebApi.Client
13,480
Microsoft.AspNet.WebApi.Core
12,210
Microsoft.Net.Http
11,625
jQuery
10,646
Microsoft.Bcl.Build
10,641
Microsoft.Bcl
10,349
NUnit
10,341
Owin
9,681
Microsoft.Owin
9,202
Microsoft.AspNet.WebApi.WebHost
9,007
WebGrease
8,743
Microsoft.AspNet.Web.Optimization
8,721
Microsoft.AspNet.WebApi
8,179
How many lines of code (LOC) are in a typical C# file?
Are C# developers prone to creating huge files that go one for 1000’s of lines? Well some are but fortunately it’s the minority of us!!
Note the Y-axis is ‘lines of code’ and is logarithmic, the raw data is available.
Oh dear, Uncle Bob isn’t going to be happy, whilst 96% of the files have 509 LOC of less, the other 4% don’t!! From Clean Code:
And in case you’re wondering, here’s the Top 10 longest C# files!!
File
Lines
MarMot/Input/test.marmot.cs
92663
src/CodenameGenerator/WordRepos/LastNamesRepository.cs
88810
cs_inputtest/cs_02_7000.cs
63004
cs_inputtest/cs_02_6000.cs
54004
src/ML NET20/Utility/UserName.cs
52014
MWBS/Dictionary/DefaultWordDictionary.cs
48912
Sources/Accord.Math/Matrix/Matrix.Comparisons1.Generated.cs
48407
UrduProofReader/UrduLibs/Utils.cs
48255
cs_inputtest/cs_02_5000.cs
45004
css/style.cs
44366
What is the most widely thrown Exception
?
There’s a few interesting results in this query, for instance who knew that so many ApplicationExceptions
were thrown and NotSupportedException
being so high up the list is a bit worrying!!
Exception
count
throw new ArgumentNullException
699,526
throw new ArgumentException
361,616
throw new NotImplementedException
340,361
throw new InvalidOperationException
260,792
throw new ArgumentOutOfRangeException
160,640
throw new NotSupportedException
110,019
throw new HttpResponseException
74,498
throw new ValidationException
35,615
throw new ObjectDisposedException
31,129
throw new ApplicationException
30,849
throw new UnauthorizedException
21,133
throw new FormatException
19,510
throw new SerializationException
17,884
throw new IOException
15,779
throw new IndexOutOfRangeException
14,778
throw new NullReferenceException
12,372
throw new InvalidDataException
12,260
throw new ApiException
11,660
throw new InvalidCastException
10,510
‘async/await all the things’ or not?
The addition of the async
and await
keywords to the C# language makes writing asynchronous code much easier:
public async Task GetDotNetCountAsync()
{
// Suspends GetDotNetCountAsync() to allow the caller (the web server)
// to accept another request, rather than blocking on this one.
var html = await _httpClient.DownloadStringAsync("http://dotnetfoundation.org");
return Regex.Matches(html, ".NET").Count;
}
But how much is it used? Using the query below:
SELECT Count(*) count
FROM
[fh-bigquery:github_extracts.contents_net_cs]
WHERE
REGEXP_MATCH(content, r'\sasync\s|\sawait\s')
I found that there are 218,643 files (out of 5,885,933) that have at least one usage of async
or await
in them.
Do C# developers like using the var
keyword?
Less that they use async
and await
, there are 130,590 files that have at least one usage of the var
keyword
Update: thanks for jairbubbles for pointing out that my var
regex was wrong and supplying a fixed version!
More than they use async
and await
, there are 1,457,154 files that have at least one usage of the var
keyword
Just how many files should you have in a repository?
90% of the repositories (that have any C# files) have 95 files or less. 95% have 170 files or less and 99% have 535 files or less.
(again the Y-axis (# files) is logarithmic)
The top 10 largest repositories, by number of C# files are shown below:
Repository
# Files
https://github.com/xen2/mcs
23389
https://github.com/mater06/LEGOChimaOnlineReloaded
14241
https://github.com/Microsoft/referencesource
13051
https://github.com/dotnet/corefx
10652
https://github.com/apo-j/Projects_Working
10185
https://github.com/Microsoft/CodeContracts
9338
https://github.com/drazenzadravec/nequeo
8060
https://github.com/ClearCanvas/ClearCanvas
7946
https://github.com/mwilliamson-firefly/aws-sdk-net
7860
https://github.com/151706061/MacroMedicalSystem
7765
What is the most popular repository with C# code in it?
This time we are going to look at the most popular repositories (based on GitHub ‘stars’) that contain at least 50 C# files (query used):
repo
stars
files
https://github.com/grpc/grpc
11075
237
https://github.com/dotnet/coreclr
8576
6503
https://github.com/dotnet/roslyn
8422
6351
https://github.com/facebook/yoga
8046
73
https://github.com/bazelbuild/bazel
7123
132
https://github.com/dotnet/corefx
7115
10652
https://github.com/SeleniumHQ/selenium
7024
512
https://github.com/Microsoft/WinObjC
6184
81
https://github.com/qianlifeng/Wox
5674
207
https://github.com/Wox-launcher/Wox
5674
142
https://github.com/ShareX/ShareX
5336
766
https://github.com/Microsoft/Windows-universal-samples
5130
1501
https://github.com/NancyFx/Nancy
3701
957
https://github.com/chocolatey/choco
3432
248
https://github.com/JamesNK/Newtonsoft.Json
3340
650
Interesting that the top spot is a Google Repository! (the C# files in it are sample code for using the GRPC library from .NET)
What are the most popular C# class
names?
Assuming that I got the regex correct, the most popular C# class
names are the following:
Class name
Count
class C
182480
class Program
163462
class Test
50593
class Settings
40841
class Resources
39345
class A
34687
class App
28462
class B
24246
class Startup
18238
class Foo
15198
Yay for Foo
, just sneaking into the Top 10!!
‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?
Finally lets look at the different class
names used, as with the using
statement they are dominated by the default ones used in the Visual Studio templates:
File
Count
AssemblyInfo.cs
386822
Program.cs
105280
Resources.Designer.cs
40881
Settings.Designer.cs
35392
App.xaml.cs
21928
Global.asax.cs
16133
Startup.cs
14564
HomeController.cs
13574
RouteConfig.cs
11278
MainWindow.xaml.cs
11169
Discuss this post on Hacker News and /r/csharp
As always, if you’ve read this far your present is yet more blog posts to read, enjoy!!
How BigQuery Works (only put in at the end of the blog post)
BigQuery analysis of other Programming Languages
The post Analysing C# code on GitHub with BigQuery first appeared on my blog Performance is a Feature!
CodeProject
Thu, 12 Oct 2017, 12:00 am
A look at the internals of 'boxing' in the CLR
It’s a fundamental part of .NET and can often happen without you knowing, but how does it actually work? What is the .NET Runtime doing to make boxing possible?
Note: this post won’t be discussing how to detect boxing, how it can affect performance or how to remove it (speak to Ben Adams about that!). It will only be talking about how it works.
As an aside, if you like reading about CLR internals you may find these other posts interesting:
Boxing in the CLR Specification
Firstly it’s worth pointing out that boxing is mandated by the CLR specification ‘ECMA-335’, so the runtime has to provide it:
This means that there are a few key things that the CLR needs to take care of, which we will explore in the rest of this post.
Creating a ‘boxed’ Type
The first thing that the runtime needs to do is create the corresponding reference type (‘boxed type’) for any struct
that it loads. You can see this in action, right at the beginning of the ‘Method Table’ creation where it first checks if it’s dealing with a ‘Value Type’, then behaves accordingly. So the ‘boxed type’ for any struct
is created up front, when your .dll is imported, then it’s ready to be used by any ‘boxing’ that happens during program execution.
The comment in the linked code is pretty interesting, as it reveals some of the low-level details the runtime has to deal with:
// Check to see if the class is a valuetype; but we don't want to mark System.Enum
// as a ValueType. To accomplish this, the check takes advantage of the fact
// that System.ValueType and System.Enum are loaded one immediately after the
// other in that order, and so if the parent MethodTable is System.ValueType and
// the System.Enum MethodTable is unset, then we must be building System.Enum and
// so we don't mark it as a ValueType.
CPU-specific code-generation
But to see what happens during program execution, let’s start with a simple C# program. The code below creates a custom struct
or Value Type
, which is then ‘boxed’ and ‘unboxed’:
public struct MyStruct
{
public int Value;
}
var myStruct = new MyStruct();
// boxing
var boxed = (object)myStruct;
// unboxing
var unboxed = (MyStruct)boxed;
This gets turned into the following IL code, in which you can see the box
and unbox.any
IL instructions:
L_0000: ldloca.s myStruct
L_0002: initobj TestNamespace.MyStruct
L_0008: ldloc.0
L_0009: box TestNamespace.MyStruct
L_000e: stloc.1
L_000f: ldloc.1
L_0010: unbox.any TestNamespace.MyStruct
Runtime and JIT code
So what does the JIT do with these IL op codes? Well in the normal case it wires up and then inlines the optimised, hand-written, assembly code versions of the ‘JIT Helper Methods’ provided by the runtime. The links below take you to the relevant lines of code in the CoreCLR source:
Interesting enough, the only other ‘JIT Helper Methods’ that get this special treatment are object
, string
or array
allocations, which goes to show just how performance sensitive boxing is.
In comparison, there is only one helper method for ‘unboxing’, called JIT_Unbox(..), which falls back to JIT_Unbox_Helper(..) in the uncommon case and is wired up here (CORINFO_HELP_UNBOX
to JIT_Unbox
). The JIT will also inline the helper call in the common case, to save the cost of a method call, see Compiler::impImportBlockCode(..).
Note that the ‘unbox helper’ only fetches a reference/pointer to the ‘boxed’ data, it has to then be put onto the stack. As we saw above, when the C# compiler does unboxing it uses the ‘Unbox_Any’ op-code not just the ‘Unbox’ one, see Unboxing does not create a copy of the value for more information.
Unboxing Stub Creation
As well as ‘boxing’ and ‘unboxing’ a struct
, the runtime also needs to help out during the time that a type remains ‘boxed’. To see why, let’s extend MyStruct
and override
the ToString()
method, so that it displays the current Value
:
public struct MyStruct
{
public int Value;
public override string ToString()
{
return "Value = " + Value.ToString();
}
}
Now, if we look at the ‘Method Table’ the runtime creates for the boxed version of MyStruct
(remember, value types have no ‘Method Table’), we can see something strange going on. Note that there are 2 entries for MyStruct::ToString
, one of which I’ve labelled as an ‘Unboxing Stub’
Method table summary for 'MyStruct':
Number of static fields: 0
Number of instance fields: 1
Number of static obj ref fields: 0
Number of static boxed fields: 0
Number of declared fields: 1
Number of declared methods: 1
Number of declared non-abstract methods: 1
Vtable (with interface dupes) for 'MyStruct':
Total duplicate slots = 0
SD: MT::MethodIterator created for MyStruct (TestNamespace.MyStruct).
slot 0: MyStruct::ToString 0x000007FE41170C10 (slot = 0) (Unboxing Stub)
slot 1: System.ValueType::Equals 0x000007FEC1194078 (slot = 1)
slot 2: System.ValueType::GetHashCode 0x000007FEC1194080 (slot = 2)
slot 3: System.Object::Finalize 0x000007FEC14A30E0 (slot = 3)
slot 5: MyStruct::ToString 0x000007FE41170C18 (slot = 4)
Wed, 2 Aug 2017, 12:00 am
Memory Usage Inside the CLR
Have you ever wondered where and why the .NET Runtime (CLR) allocates memory? I don’t mean the ‘managed’ memory that your code allocates, e.g. via new MyClass(..)
and the Garbage Collector (GC) then cleans up. I mean the memory that the CLR itself allocates, all the internal data structures that it needs to make is possible for your code to run.
Note just to clarify, this post will not be telling you how you can analyse the memory usage of your code, for that I recommend using one of the excellent .NET Profilers available such as dotMemory by JetBrains or the ANTS Memory Profiler from Redgate (I’ve personally used both and they’re great)
The high-level view
Fortunately there’s a fantastic tool that makes it very easy for us to get an overview of memory usage within the CLR itself. It’s called VMMap and it’s part of the excellent Sysinternals Suite.
For the post I will just be using a simple HelloWorld
program, so that we can observe what the CLR does in the simplest possible scenario, obviously things may look a bit different in a more complex app.
Firstly, lets look at the data over time, in 1 second intervals. The HelloWorld
program just prints to the Console and then waits until you press
, so once the memory usage has reached it’s peak it remains there till the program exits. (Click for a larger version)
However, to get a more detailed view, we will now look at the snapshot from 2 seconds into the timeline, when the memory usage has stabilised.
Note: If you want to find out more about memory usage in general, but also specifically how measure it in .NET applications, I recommend reading this excellent series of posts by Sasha Goldshtein
Also, if like me you always get the different types of memory mixed-up, please read this Stackoverflow answer first What is private bytes, virtual bytes, working set?
‘Image’ Memory
Now we’ve seen the high-level view, lets take a close look at the individual chucks, the largest of which is labelled Image, which according to the VMMap help page (see here for all info on all memory types):
… represents an executable file such as a .exe or .dll and has been loaded into a process by the image loader. It does not include images mapped as data files, which would be included in the Mapped File memory type. Image mappings can include shareable memory like code. When data regions, like initialized data, is modified, additional private memory is created in the process.
At this point, it’s worth pointing out a few things:
- This memory is takes up a large amount of the total process memory because I’m using a simple
HelloWorld
program, in other types of programs it wouldn’t dominate the memory usage as much
- I was using a
DEBUG
version of the CoreCLR, so the CLR specific files System.Private.CoreLib.dll, coreclr.dll, clrjit.dll and CoreRun.exe may well be larger than if they were compiled in RELEASE
mode
- Some of this memory is potentially ‘shared’ with other processes, compare the numbers in the ‘Total WS’, ‘Private WS’, ‘Shareable WS’ and ‘Shared WS’ columns to see this in action.
‘Managed Heaps’ created by the Garbage Collector
The next largest usage of memory is the GC itself, it pre-allocates several heaps that it can then give out whenever your program allocates an object, for example via code such as new MyClass()
or new byte[]
.
The main thing to note about the image above is that you can clearly see the different heap, there is 256 MB allocated for Generations (Gen 0, 1, 2) and 128 MB for the ‘Large Object Heap’. In addition, note the difference between the amounts in the Size and the Committed columns. Only the Committed memory is actually being used, the total Size is what the GC pre-allocates or reserves up front from the address space.
If you’re interested, the rules for heap or more specifically segment sizes are helpfully explained in the Microsoft Docs, but simply put, it varies depending on the GC mode (Workstation v Server), whether the process is 32/64-bit and ‘Number of CPUs’.
Internal CLR ‘Heap’ memory
However the part that I’m going to look at for the rest of this post is the memory that is allocated by the CLR itself, that is unmanaged memory that is uses for all its internal data structures.
But if we just look at the VMMap UI view, it doesn’t really tell us that much!
However, using the excellent PerfView tool we can capture the full call-stack of any memory allocations, that is any calls to VirtualAlloc() or RtlAllocateHeap() (obviously these functions only apply when running the CoreCLR on Windows). If we do this, PerfView gives us the following data (yes, it’s not pretty, but it’s very powerful!!)
So lets explore this data in more detail.
Notable memory allocations
There are a few places where the CLR allocates significant chunks of memory up-front and then uses them through its lifetime, they are listed below:
- GC related allocations (see gc.cpp)
- Mark List - 1,052,672 Bytes (1,028 K) in
WKS::make_mark_list(..)
. using during the ‘mark’ phase of the GC, see Back To Basics: Mark and Sweep Garbage Collection
- Card Table - 397,312 Bytes (388 K) in
WKS::gc_heap::make_card_table(..)
, see Marking the ‘Card Table’
- Overall Heap Creation/Allocation - 204,800 Bytes (200 K) in
WKS::gc_heap::make_gc_heap(..)
- S.O.H Segment creation - 65,536 Bytes (64 K) in
WKS::gc_heap::allocate(..)
, triggered by the first object allocation
- L.O.H Segment creation - 65,536 Bytes (64 K) in
WKS::gc_heap::allocate_large_object(..)
, triggered by the first ‘large’ object allocation
- Handle Table - 20,480 Bytes (20 K) in HndCreateHandleTable(..)
- Stress Log - 4,194,304 Bytes (4,096 K) in StressLog::Initialize(..). Only if the ‘stress log’ is activated, see this comment for more info
- ‘Watson’ error reporting - 65,536 Bytes (64 K) in EEStartupHelper routine
- Virtual Call Stub Manager - 36,864 Bytes (36 K) in VirtualCallStubManager::InitStatic(), which in turn creates the DispatchCache. See ‘Virtual Stub Dispatch’ in the BOTR for more info
- Debugger Heap and Control-Block - 28,672 Bytes (28K) (only if debugging support is needed) in DebuggerHeap::Init(..) and DebuggerRCThread::Init(..), both called via InitializeDebugger(..)
Execution Engine Heaps
However another technique that it uses is to allocated ‘heaps’, often 64K at a time and then perform individual allocations within the heaps as needed. These heaps are split up into individual use-cases, the most common being for ‘frequently accessed’ data and it’s counter-part, data that is ‘rarely accessed’, see the explanation from this comment in loaderallocator.hpp for more. This is done to ensure that the CLR retains control over any memory allocations and can therefore prevent ‘fragmentation’.
These heaps are together known as ‘Loader Heaps’ as explained in Drill Into .NET Framework Internals to See How the CLR Creates Runtime Objects (wayback machine version):
LoaderHeaps
LoaderHeaps are meant for loading various runtime CLR artifacts and optimization artifacts that live for the lifetime of the domain. These heaps grow by predictable chunks to minimize fragmentation. LoaderHeaps are different from the garbage collector (GC) Heap (or multiple heaps in case of a symmetric multiprocessor or SMP) in that the GC Heap hosts object instances while LoaderHeaps hold together the type system. Frequently accessed artifacts like MethodTables, MethodDescs, FieldDescs, and Interface Maps get allocated on a HighFrequencyHeap, while less frequently accessed data structures, such as EEClass and ClassLoader and its lookup tables, get allocated on a LowFrequencyHeap. The StubHeap hosts stubs that facilitate code access security (CAS), COM wrapper calls, and P/Invoke.
One of the main places you see this high/low-frequency of access is in the heart of the Type system, where different data items are either classified as ‘hot’ (high-frequency) or ‘cold’ (low-frequency), from the ‘Key Data Structures’ section of the BOTR page on ‘Type Loader Design’:
EEClass
MethodTable data are split into “hot” and “cold” structures to improve working set and cache utilization. MethodTable itself is meant to only store “hot” data that are needed in program steady state. EEClass stores “cold” data that are typically only needed by type loading, JITing or reflection. Each MethodTable points to one EEClass.
Further to this, listed below are some specific examples of when each heap type is used:
All the general ‘Loader Heaps’ listed above are allocated in the LoaderAllocator::Init(..)
function (link to actual code), the executable
and stub
heap have the ‘executable’ flag set, all the rest don’t. The size of these heaps is configured in this code, they ‘reserve’ different amounts up front, but they all have a ‘commit’ size that is equivalent to one OS ‘page’.
In addition to the ‘general’ heaps, there are some others that are specifically used by the Virtual Stub Dispatch mechanism, they are known as the indcell_heap
, cache_entry_heap
, lookup_heap
, dispatch_heap
and resolve_heap
, they’re allocated in this code, using the specified commit/reserve sizes.
Finally, if you’re interested in the mechanics of how the heaps actually work take a look at LoaderHeap.cpp.
JIT Memory Usage
Last, but by no means least, there is one other component in the CLR that extensively allocates memory and that is the JIT. It does so in 2 main scenarios:
- ‘Transient’ or temporary memory needed when it’s doing the job of converting IL code into machine code
- ‘Permanent’ memory used when it needs to emit the ‘machine code’ for a method
‘Transient’ Memory
This is needed by the JIT when it is doing the job of converting IL code into machine code for the current CPU architecture. This memory is only needed whilst the JIT is running and can be re-used/discarded later, it is used to hold the internal JIT data structures (e.g. Compiler
, BasicBlock
, GenTreeStmt
, etc).
For example, take a look at the following code from Compiler::fgValueNumber():
...
// Allocate the value number store.
assert(fgVNPassesCompleted > 0 || vnStore == nullptr);
if (fgVNPassesCompleted == 0)
{
CompAllocator* allocator = new (this, CMK_ValueNumber) CompAllocator(this, CMK_ValueNumber);
vnStore = new (this, CMK_ValueNumber) ValueNumStore(this, allocator);
}
...
The line vnStore = new (this, CMK_ValueNumber) ...
ends up calling the specialised new
operator defined in compiler.hpp (code shown below), which as per the comment, uses a customer ‘Arena Allocator’ that is implemented in /src/jit/alloc.cpp
/*****************************************************************************
* operator new
*
* Note that compGetMem is an arena allocator that returns memory that is
* not zero-initialized and can contain data from a prior allocation lifetime.
* it also requires that 'sz' be aligned to a multiple of sizeof(int)
*/
inline void* __cdecl operator new(size_t sz, Compiler* context, CompMemKind cmk)
{
sz = AlignUp(sz, sizeof(int));
assert(sz != 0 && (sz & (sizeof(int) - 1)) == 0);
return context->compGetMem(sz, cmk);
}
This technique (of overriding the new
operator) is used in lots of places throughout the CLR, for instance there is a generic one implemented in the CLR Host.
‘Permanent’ Memory
The last type of memory that the JIT uses is ‘permanent’ memory to store the JITted machine code, this is done via calls to Compiler::compGetMem(..), starting from Compiler::compCompile(..) via the call-stack shown below. Note that as before this uses a customer ‘Arena Allocator’ that is implemented in /src/jit/alloc.cpp
+ clrjit!ClrAllocInProcessHeap
+ clrjit!ArenaAllocator::allocateHostMemory
+ clrjit!ArenaAllocator::allocateNewPage
+ clrjit!ArenaAllocator::allocateMemory
+ clrjit!Compiler::compGetMem
+ clrjit!emitter::emitGetMem
+ clrjit!emitter::emitAllocInstr
+ clrjit!emitter::emitNewInstrTiny
+ clrjit!emitter::emitIns_R_R
+ clrjit!emitter::emitInsBinary
+ clrjit!CodeGen::genCodeForStoreLclVar
+ clrjit!CodeGen::genCodeForTreeNode
+ clrjit!CodeGen::genCodeForBBlist
+ clrjit!CodeGen::genGenerateCode
+ clrjit!Compiler::compCompile
Real-world example
Finally, to prove that this investigation matches with more real-world scenarios, we can see similar memory usage breakdowns in this GitHub issue: [Question] Reduce memory consumption of CoreCLR
Yes, we have profiled several Xamarin GUI applications on Tizen Mobile.
Typical profile of CoreCLR’s memory on the GUI applications is the following:
- Mapped assembly images - 4.2 megabytes (50%)
- JIT-compiler’s memory - 1.7 megabytes (20%)
- Execution engine - about 1 megabyte (11%)
- Code heap - about 1 megabyte (11%)
- Type information - about 0.5 megabyte (6%)
- Objects heap - about 0.2 megabyte (2%)
Discuss this post on HackerNews
Further Reading
See the links below for additional information on ‘Loader Heaps’
The post Memory Usage Inside the CLR first appeared on my blog Performance is a Feature!
CodeProject
Mon, 10 Jul 2017, 12:00 am
How the .NET Runtime loads a Type
It is something we take for granted every time we run a .NET program, but it turns out that loading a Type or class
is a fairly complex process.
So how does the .NET Runtime (CLR) actually load a Type?
If you want the tl;dr it’s done carefully, cautiously and step-by-step
Ensuring Type Safety
One of the key requirements of a ‘Managed Runtime’ is providing Type Safety, but what does it actually mean? From the MSDN page on Type Safety and Security
Type-safe code accesses only the memory locations it is authorized to access. (For this discussion, type safety specifically refers to memory type safety and should not be confused with type safety in a broader respect.) For example, type-safe code cannot read values from another object’s private fields. It accesses types only in well-defined, allowable ways.
So in effect, the CLR has to ensure your Types/Classes are well-behaved and following the rules.
Compiler prevents you from creating an ‘abstract’ class
But lets look at a more concrete example, using the C# code below
public abstract class AbstractClass
{
public AbstractClass() { }
}
public class NormalClass : AbstractClass
{
public NormalClass() { }
}
public static void Main(string[] args)
{
var test = new AbstractClass();
}
The compiler quite rightly refuses to compile this and gives the following error, because abstract
classes can’t be created, you can only inherit from them.
error CS0144: Cannot create an instance of the abstract class or interface
'ConsoleApplication.AbstractClass'
So that’s all well and good, but the CLR can’t rely on all code being created via a well-behaved compiler, or in fact via a compiler at all. So it has to check for and prevent any attempt to create an abstract
class.
Writing IL code by hand
One way to circumvent the compiler is to write IL code by hand using the IL Assembler tool (ILAsm) which will do almost no checks on the validity of the IL you give it.
For instance the IL below is the equivalent of writing var test = new AbstractClass();
(if the C# compiler would let us):
.method public hidebysig static void Main(string[] args) cil managed
{
.entrypoint
.maxstack 1
.locals init (
[0] class ConsoleApplication.NormalClass class2)
// System.InvalidOperationException: Instances of abstract classes cannot be created.
newobj instance void ConsoleApplication.AbstractClass::.ctor()
stloc.0
ldloc.0
callvirt instance class [mscorlib]System.Type [mscorlib]System.Object::GetType()
callvirt instance string [mscorlib]System.Reflection.MemberInfo::get_Name()
call void [mscorlib]Internal.Console::WriteLine(string)
ret
}
Fortunately the CLR has got this covered and will throw an InvalidOperationException
when you execute the code. This is due to this check which is hit when the JIT compiles the newobj
IL instruction.
Creating Types at run-time
One other way that you can attempt to create an abstract
class is at run-time, using reflection (thanks to this blog post for giving me some tips on other ways of creating Types).
This is shown in the code below:
var abstractType = Type.GetType("ConsoleApplication.AbstractClass");
Console.WriteLine(abstractType.FullName);
// System.MissingMethodException: Cannot create an abstract class.
var abstractInstance = Activator.CreateInstance(abstractType);
The compiler is completely happy with this, it doesn’t do anything to prevent or warn you and nor should it. However when you run the code, it will throw an exception, strangely enough a MissingMethodException
this time, but it does the job!
The call stack is below:
One final way (unless I’ve missed some out?) is to use GetUninitializedObject(..)
in the FormatterServices class like so:
public static object CreateInstance(Type type)
{
var constructor = type.GetConstructor(new Type[0]);
if (constructor == null && !type.IsValueType)
{
throw new NotSupportedException(
"Type '" + type.FullName + "' doesn't have a parameterless constructor");
}
var emptyInstance = FormatterServices.GetUninitializedObject(type);
if (constructor == null)
return null;
return constructor.Invoke(emptyInstance, new object[0]) ?? emptyInstance;
}
var abstractType = Type.GetType("ConsoleApplication.AbstractClass");
Console.WriteLine(abstractType.FullName);
// System.MemberAccessException: Cannot create an abstract class.
var abstractInstance = CreateInstance(abstractType);
Again the run-time stops you from doing this, however this time it decides to throw a MemberAccessException
?
This happens via the following call stack:
Further Type-Safety Checks
These checks are just one example of what the runtime has to validate when creating types, there are many more things is has to deal with. For instance you can’t:
Loading Types ‘step-by-step’
So we’ve seen that the CLR has to do multiple checks when it’s loading types, but why does it have to load them ‘step-by-step’?
Well in a nutshell, it’s because of circular references and recursion, particularly when dealing with generics types. If we take the code below from section ‘2.1 Load Levels’ in Type Loader Design (BotR):
classA : C
{ }
classB : C
{ }
classC
{ }
These are valid types and class A
depends on class B
and vice versa. So we can’t load A
until we know that B
is valid, but we can’t load B
, until we’re sure that A
is valid, a classic deadlock!!
How does the run-time get round this, well from the same BotR page:
The loader initially creates the structure(s) representing the type and initializes them with data that can be obtained without loading other types. When this “no-dependencies” work is done, the structure(s) can be referred from other places, usually by sticking pointers to them into another structures. After that the loader progresses in incremental steps and fills the structure(s) with more and more information until it finally arrives at a fully loaded type. In the above example, the base types of A and B will be approximated by something that does not include the other type, and substituted by the real thing later.
(there is also some more info here)
So it loads types in stages, step-by-step, ensuring each dependant type has reached the same stage before continuing. These ‘Class Load’ stages are shown in the image below and explained in detail in this very helpful source-code comment (Yay for Open-Sourcing the CoreCLR!!)
The different levels are handled in the ClassLoader::DoIncrementalLoad(..) method, which contains the switch
statement that deals with them all in turn.
However this is part of a bigger process, which controls loading an entire file, also known as a Module
or Assembly
in .NET terminology. The entire process for that is handled in by another dispatch loop (switch statement), that works with the FileLoadLevel
enum (definition). So in reality the whole process for loading an Assembly
looks like this (the loading of one or more Types happens as sub-steps once the Module
had reached the FILE_LOADED
stage)
- FILE_LOAD_CREATE - DomainFile ctor()
- FILE_LOAD_BEGIN - Begin()
- FILE_LOAD_FIND_NATIVE_IMAGE - FindNativeImage()
- FILE_LOAD_VERIFY_NATIVE_IMAGE_DEPENDENCIES - VerifyNativeImageDependencies()
- FILE_LOAD_ALLOCATE - Allocate()
- FILE_LOAD_ADD_DEPENDENCIES - AddDependencies()
- FILE_LOAD_PRE_LOADLIBRARY - PreLoadLibrary()
- FILE_LOAD_LOADLIBRARY - LoadLibrary()
- FILE_LOAD_POST_LOADLIBRARY - PostLoadLibrary()
- FILE_LOAD_EAGER_FIXUPS - EagerFixups()
- FILE_LOAD_VTABLE_FIXUPS - VtableFixups()
- FILE_LOAD_DELIVER_EVENTS - DeliverSyncEvents()
- FILE_LOADED - FinishLoad()
- CLASS_LOAD_BEGIN
- CLASS_LOAD_UNRESTOREDTYPEKEY
- CLASS_LOAD_UNRESTORED
- CLASS_LOAD_APPROXPARENTS
- CLASS_LOAD_EXACTPARENTS
- CLASS_DEPENDENCIES_LOADED
- CLASS_LOADED
- FILE_LOAD_VERIFY_EXECUTION - VerifyExecution()
- FILE_ACTIVE - Activate()
We can see this in action if we build a Debug version of the CoreCLR and enable the relevant configuration knobs. For a simple ‘Hello World’ program we get the log output shown below, where LOADER:
messages correspond to FILE_LOAD_XXX
stages and PHASEDLOAD:
messages indicate which CLASS_LOAD_XXX
step we are on.
You can also see some of the other events that happen at the same time, these include creation of static
variables (STATICS:
), thread-statics (THREAD STATICS:
) and PreStubWorker
which indicates methods being prepared for the JITter.
-------------------------------------------------------------------------------------------------------
This is NOT the full output, it's only the parts that reference 'Program.exe' and it's modules/classses
-------------------------------------------------------------------------------------------------------
PEImage: Opened HMODULE C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe
StoreFile: Add cached entry (000007FE65174540) with PEFile 000000000040D6E0
Assembly C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe: bits=0x2
LOADER: 439e30:***Program* >>>Load initiated, LOADED/LOADED
LOADER: 0000000000439E30:***Program* loading at level BEGIN
LOADER: 0000000000439E30:***Program* loading at level FIND_NATIVE_IMAGE
LOADER: 0000000000439E30:***Program* loading at level VERIFY_NATIVE_IMAGE_DEPENDENCIES
LOADER: 0000000000439E30:***Program* loading at level ALLOCATE
STATICS: Allocating statics for module Program
Loaded pModule: "C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe".
Module Program: bits=0x2
STATICS: Allocating 72 bytes for precomputed statics in module C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe in LoaderAllocator 000000000043AA18
StoreFile (StoreAssembly): Add cached entry (000007FE65174F28) with PEFile 000000000040D6E0Completed Load Level ALLOCATE for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 0000000000439E30:***Program* loading at level ADD_DEPENDENCIES
Completed Load Level ADD_DEPENDENCIES for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 0000000000439E30:***Program* loading at level PRE_LOADLIBRARY
LOADER: 0000000000439E30:***Program* loading at level LOADLIBRARY
LOADER: 0000000000439E30:***Program* loading at level POST_LOADLIBRARY
LOADER: 0000000000439E30:***Program* loading at level EAGER_FIXUPS
LOADER: 0000000000439E30:***Program* loading at level VTABLE FIXUPS
LOADER: 0000000000439E30:***Program* loading at level DELIVER_EVENTS
DRCT::IsReady - wait(0x100)=258, GetLastError() = 42424
DRCT::IsReady - wait(0x100)=258, GetLastError() = 42424
D::LA: Load Assembly Asy:0x000000000040D8C0 AD:0x0000000000439E30 which:C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe
Completed Load Level DELIVER_EVENTS for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 0000000000439E30:***Program* loading at level LOADED
Completed Load Level LOADED for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 439e30:***Program* Load initiated, ACTIVE/ACTIVE
LOADER: 0000000000439E30:***Program* loading at level VERIFY_EXECUTION
LOADER: 0000000000439E30:***Program* loading at level ACTIVE
Completed Load Level ACTIVE for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 439e30:***Program*
Thu, 15 Jun 2017, 12:00 am
Lowering in the C# Compiler (and what happens when you misuse it)
Turns out that what I’d always thought of as “Compiler magic” or “Syntactic sugar” is actually known by the technical term ‘Lowering’ and the C# compiler (a.k.a Roslyn) uses it extensively.
But what is it? Well this quote from So You Want To Write Your Own Language? gives us some idea:
Lowering
One semantic technique that is obvious in hindsight (but took Andrei Alexandrescu to point out to me) is called “lowering.” It consists of, internally, rewriting more complex semantic constructs in terms of simpler ones. For example, while loops and foreach loops can be rewritten in terms of for loops. Then, the rest of the code only has to deal with for loops. This turned out to uncover a couple of latent bugs in how while loops were implemented in D, and so was a nice win. It’s also used to rewrite scope guard statements in terms of try-finally statements, etc. Every case where this can be found in the semantic processing will be win for the implementation.
– by Walter Bright (author of the D programming language)
But if you’re still not sure what it means, have a read of Eric Lippert’s post on the subject, Lowering in language design, which contains this quote:
A common technique along the way though is to have the compiler “lower” from high-level language features to low-level language features in the same language.
As an aside, if you like reading about the Roslyn compiler source you may like these other posts that I’ve written:
What does ‘Lowering’ look like?
The C# compiler has used lowering for a while, one of the oldest or most recognised examples is when this code:
using System.Collections.Generic;
public class C {
public IEnumerable M()
{
foreach (var value in new [] { 1, 2, 3, 4, 5 })
{
yield return value;
}
}
}
is turned into this
public class C
{
[CompilerGenerated]
private sealed class d__0 : IEnumerable, IEnumerable, IEnumerator, IDisposable, IEnumerator
{
private int 1__state;
private int 2__current;
private int l__initialThreadId;
public C 4__this;
private int[] s__1;
private int s__2;
private int 5__3;
int IEnumerator.Current
{
[DebuggerHidden]
get
{
return this.2__current;
}
}
object IEnumerator.Current
{
[DebuggerHidden]
get
{
return this.2__current;
}
}
[DebuggerHidden]
public d__0(int 1__state)
{
this.1__state = 1__state;
this.l__initialThreadId = Environment.CurrentManagedThreadId;
}
[DebuggerHidden]
void IDisposable.Dispose()
{
}
bool IEnumerator.MoveNext()
{
int num = this.1__state;
if (num != 0)
{
if (num != 1)
{
return false;
}
this.1__state = -1;
this.s__2++;
}
else
{
this.1__state = -1;
this.s__1 = new int[] { 1, 2, 3, 4, 5 };
this.s__2 = 0;
}
if (this.s__2 >= this.s__1.Length)
{
this.s__1 = null;
return false;
}
this.5__3 = this.s__1[this.s__2];
this.2__current = this.5__3;
this.1__state = 1;
return true;
}
[DebuggerHidden]
void IEnumerator.Reset()
{
throw new NotSupportedException();
}
[DebuggerHidden]
IEnumerator IEnumerable.GetEnumerator()
{
C.d__0 d__;
if (this.1__state == -2 && this.l__initialThreadId == Environment.CurrentManagedThreadId)
{
this.1__state = 0;
d__ = this;
}
else
{
d__ = new C.d__0(0);
d__.4__this = this.4__this;
}
return d__;
}
[DebuggerHidden]
IEnumerator IEnumerable.GetEnumerator()
{
return this.System.Collections.Generic.IEnumerable.GetEnumerator();
}
}
[IteratorStateMachine(typeof(C.d__0))]
public IEnumerable M()
{
C.d__0 expr_07 = new C.d__0(-2);
expr_07.4__this = this;
return expr_07;
}
}
Yikes, I’m glad we don’t have to write that code ourselves!! There’s an entire state-machine in there, built to allow our original code to be halted/resumed each time round the loop (at the ‘yield’ statement).
The C# compiler and ‘Lowering’
But it turns out that the Roslyn compiler does a lot more ‘lowering’ than you might think. If you take a look at the code under ‘/src/Compilers/CSharp/Portable/Lowering’ (VB.NET equivalent here), you see the following folders:
Which correspond to some C# language features you might be familar with, such as ‘lambdas’, i.e. x => x.Name > 5
, ‘iterators’ used by yield
(above) and the async
keyword.
However if we look at bit deeper, under the ‘LocalRewriter’ folder we can see lots more scenarios that we might never have considered ‘lowering’, such as:
So a big thank-you is due to all the past and present C# language developers and designers, they did all this work for us. Imagine that C# didn’t have all these high-level features, we’d be stuck writing them by hand.
It would be like writing Java :-)
What happens when you misuse it
But of course the real fun part is ‘misusing’ or outright ‘abusing’ the compiler. So I set up a little twitter competition just how much ‘lowering’ could we get the compiler to do for us (i.e the highest ratio of ‘input’ lines of code to ‘output’ lines).
It had the following rules (see this gist for more info):
- You can have as many lines as you want within method
M()
- No single line can be longer than 100 chars
- To get your score, divide the ‘# of expanded lines’ by the ‘# of original line(s)’
- Based on the default output formatting of https://sharplab.io, no re-formatting allowed!!
- But you can format the intput however you want, i.e. make use of the full 100 chars
- Must compile with no warnings on https://sharplab.io (allows C# 7 features)
- But doesn’t have to do anything sensible when run
- You cannot modify the code that is already there, i.e.
public class C {}
and public void M()
- Cannot just add
async
to public void M()
, that’s too easy!!
- You can add new
using ...
declarations, these do not count towards the line count
For instance with the following code (interactive version available on sharplab.io):
using System;
public class C {
public void M() {
Func test = () => "blah"?.ToString();
}
}
This counts as 1 line of original code (only code inside method M()
is counted)
This expands to 23 lines (again only lines of code inside the braces ({
, }
) of class C
are counted.
Giving a total score of 23 (23 / 1)
....
public class C
{
[CompilerGenerated]
[Serializable]
private sealed class c
{
public static readonly C.c 9;
public static Func 9__0_0;
static c()
{
// Note: this type is marked as 'beforefieldinit'.
C.c.9 = new C.c();
}
internal string b__0_0()
{
return "blah";
}
}
public void M()
{
if (C.c.9__0_0 == null)
{
C.c.9__0_0 = new Func(C.c.9.b__0_0);
}
}
}
Results
The first place entry was the following entry from Schabse Laks, which contains 9 lines-of-code inside the M()
method:
using System.Linq;
using Y = System.Collections.Generic.IEnumerable;
public class C {
public void M() {
((Y)null).Select(async x => await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await x.x()());
}
}
this expands to an impressive 7964 lines of code (yep you read that right!!) for a score of 885 (7964 / 9). The main trick he figured out was that adding more lines to the input increased the score, i.e is scales superlinearly. Although it you take things too far the compiler bails out with a pretty impressive error message:
error CS8078: An expression is too long or complex to compile
Here’s the Top 6 top results:
Submitter
Entry
Score
Schabse Laks
link
885 (7964 / 9)
Andrey Dyatlov
link
778 (778 / 1)
alrz
link
755 (755 / 1)
Andy Gocke *
link
633 (633 / 1)
Jared Parsons *
link
461 (461 / 1)
Jonathan Chambers
link
384 (384 / 1)
*
= member of the Roslyn compiler team (they’re not disqualified, but maybe they should have some kind of handicap applied to ‘even out’ the playing field?)
Honourable mentions
However there were some other entries that whilst they didn’t make it into the Top 6, are still worth a mention due to the ingenuity involved:
Discuss this post on HackerNews, /r/programming or /r/csharp (whichever takes your fancy!!)
The post Lowering in the C# Compiler (and what happens when you misuse it) first appeared on my blog Performance is a Feature!
CodeProject
Thu, 25 May 2017, 12:00 am
Adding a new Bytecode Instruction to the CLR
Now that the CoreCLR is open-source we can do fun things, for instance find out if it’s possible to add new IL (Intermediate Language) instruction to the runtime.
TL;DR it turns out that it’s easier than you might think!! Here are the steps you need to go through:
Update: turns out that I wasn’t the only person to have this idea, see Beachhead implements new opcode on CLR JIT for another implementation by Kouji Matsui.
Step 0
But first a bit of background information. Adding a new IL instruction to the CLR is a pretty rare event, that last time is was done for real was in .NET 2.0 when support for generics was added. This is part of the reason why .NET code had good backwards-compatibility, from Backward compatibility and the .NET Framework 4.5:
The .NET Framework 4.5 and its point releases (4.5.1, 4.5.2, 4.6, 4.6.1, 4.6.2, and 4.7) are backward-compatible with apps that were built with earlier versions of the .NET Framework. In other words, apps and components built with previous versions will work without modification on the .NET Framework 4.5.
Side note: The .NET framework did break backwards compatibility when moving from 1.0 to 2.0, precisely so that support for generics could be added deep into the runtime, i.e. with support in the IL. Java took a different decision, I guess because it had been around longer, breaking backwards-comparability was a bigger issue. See the excellent blog post Comparing Java and C# Generics for more info.
Step 1
For this exercise I plan to add a new IL instruction (op-code) to the CoreCLR runtime and because I’m a raving narcissist (not really, see below) I’m going to name it after myself. So let me introduce the matt
IL instruction, that you can use like so:
.method private hidebysig static int32 TestMattOpCodeMethod(int32 x, int32 y)
cil managed noinlining
{
.maxstack 2
ldarg.0
ldarg.1
matt // yay, my name as an IL op-code!!!!
ret
}
But because I’m actually a bit-British (i.e. I don’t like to ‘blow my own trumpet’), I’m going to make the matt
op-code almost completely pointless, it’s going to do exactly the same thing as calling Math.Max(x, y)
, i.e. just return the largest of the 2 numbers.
The other reason for naming it matt
is that I’d really like someone to make a version of the C# (Roslyn) compiler that allows you to write code like this:
Console.WriteLine("{0} m@ {1} = {2}", 1, 7, 1 m@ 7)); // prints '1 m@ 7 = 7'
I definitely want the m@
operator to be a thing (pronounced ‘matt’, not ‘m-at’), maybe the other ‘Matt Warren’ who works at Microsoft on the C# Language Design Team can help out!! Seriously though, if anyone reading this would like to write a similar blog post, showing how you’d add the m@
operator to the Roslyn compiler, please let me know I’d love to read it.
Update: Thanks to Marcin Juraszek (@mmjuraszek) you can now use the m@
in a C# program, see Adding Matt operator to Roslyn - Syntax, Lexer and Parser, Adding Matt operator to Roslyn - Binder and Adding Matt operator to Roslyn - Emitter for the full details.
Now we’ve defined the op-code, the first step is to ensure that the run-time and tooling can recognise it. In particular we need the IL Assembler (a.k.a ilasm
) to be able to take the IL code above (TestMattOpCodeMethod(..)
) and produce a .NET executable.
As the .NET runtime source code is nicely structured (+1 to the runtime devs), to make this possible we only need to makes changes in opcode.def:
--- a/src/inc/opcode.def
+++ b/src/inc/opcode.def
@@ -154,7 +154,7 @@ OPDEF(CEE_NEWOBJ, "newobj", VarPop, Pu
OPDEF(CEE_CASTCLASS, "castclass", PopRef, PushRef, InlineType, IObjModel, 1, 0xFF, 0x74, NEXT)
OPDEF(CEE_ISINST, "isinst", PopRef, PushI, InlineType, IObjModel, 1, 0xFF, 0x75, NEXT)
OPDEF(CEE_CONV_R_UN, "conv.r.un", Pop1, PushR8, InlineNone, IPrimitive, 1, 0xFF, 0x76, NEXT)
-OPDEF(CEE_UNUSED58, "unused", Pop0, Push0, InlineNone, IPrimitive, 1, 0xFF, 0x77, NEXT)
+OPDEF(CEE_MATT, "matt", Pop1+Pop1, Push1, InlineNone, IPrimitive, 1, 0xFF, 0x77, NEXT)
OPDEF(CEE_UNUSED1, "unused", Pop0, Push0, InlineNone, IPrimitive, 1, 0xFF, 0x78, NEXT)
OPDEF(CEE_UNBOX, "unbox", PopRef, PushI, InlineType, IPrimitive, 1, 0xFF, 0x79, NEXT)
OPDEF(CEE_THROW, "throw", PopRef, Push0, InlineNone, IObjModel, 1, 0xFF, 0x7A, THROW)
I just picked the first available unused
slot and added matt
in there. It’s defined as Pop1+Pop1
because it takes 2 values from the stack as input and Push1
because after is has executed, a single result is pushed back onto the stack.
Note: all the changes I made are available in one-place on GitHub if you’d rather look at them like that.
Once this change was done ilasm
will successfully assembly the test code file HelloWorld.il
that contains TestMattOpCodeMethod(..)
as shown above:
λ ilasm /EXE /OUTPUT=HelloWorld.exe -NOLOGO HelloWorld.il
Assembling 'HelloWorld.il' to EXE --> 'HelloWorld.exe'
Source file is ANSI
Assembled method HelloWorld::Main
Assembled method HelloWorld::TestMattOpCodeMethod
Creating PE file
Emitting classes:
Class 1: HelloWorld
Emitting fields and methods:
Global
Class 1 Methods: 2;
Resolving local member refs: 1 -> 1 defs, 0 refs, 0 unresolved
Emitting events and properties:
Global
Class 1
Resolving local member refs: 0 -> 0 defs, 0 refs, 0 unresolved
Writing PE file
Operation completed successfully
Step 2
However at this point the matt
op-code isn’t actually executed, at runtime the CoreCLR just throws an exception because it doesn’t know what to do with it. As a first (simpler) step, I just wanted to make the .NET Interpreter work, so I made the following changes to wire it up:
--- a/src/vm/interpreter.cpp
+++ b/src/vm/interpreter.cpp
@@ -2726,6 +2726,9 @@ void Interpreter::ExecuteMethod(ARG_SLOT* retVal, __out bool* pDoJmpCall, __out
case CEE_REM_UN:
BinaryIntOp();
break;
+ case CEE_MATT:
+ BinaryArithOp();
+ break;
case CEE_AND:
BinaryIntOp();
break;
--- a/src/vm/interpreter.hpp
+++ b/src/vm/interpreter.hpp
@@ -298,10 +298,14 @@ void Interpreter::BinaryArithOpWork(T val1, T val2)
{
res = val1 / val2;
}
- else
+ else if (op == BA_Rem)
{
res = RemFunc(val1, val2);
}
+ else if (op == BA_Matt)
+ {
+ res = MattFunc(val1, val2);
+ }
}
and then I added the methods that would actually implement the interpreted code:
--- a/src/vm/interpreter.cpp
+++ b/src/vm/interpreter.cpp
@@ -10801,6 +10804,26 @@ double Interpreter::RemFunc(double v1, double v2)
return fmod(v1, v2);
}
+INT32 Interpreter::MattFunc(INT32 v1, INT32 v2)
+{
+ return v1 > v2 ? v1 : v2;
+}
+
+INT64 Interpreter::MattFunc(INT64 v1, INT64 v2)
+{
+ return v1 > v2 ? v1 : v2;
+}
+
+float Interpreter::MattFunc(float v1, float v2)
+{
+ return v1 > v2 ? v1 : v2;
+}
+
+double Interpreter::MattFunc(double v1, double v2)
+{
+ return v1 > v2 ? v1 : v2;
+}
So fairly straight-forward and the bonus is that at this point the matt
operator is fully operational, you can actually write IL using it and it will run (interpreted only).
Step 3
However not everyone wants to re-compile the CoreCLR just to enable the Interpreter, so I want to also make it work for real via the Just-in-Time (JIT) compiler.
The full changes to make this work were spread across multiple files, but were mostly housekeeping so I won’t include them all here, check-out the full diff if you’re interested. But the significant parts are below:
--- a/src/jit/importer.cpp
+++ b/src/jit/importer.cpp
@@ -11112,6 +11112,10 @@ void Compiler::impImportBlockCode(BasicBlock* block)
oper = GT_UMOD;
goto MATH_MAYBE_CALL_NO_OVF;
+ case CEE_MATT:
+ oper = GT_MATT;
+ goto MATH_MAYBE_CALL_NO_OVF;
+
MATH_MAYBE_CALL_NO_OVF:
ovfl = false;
MATH_MAYBE_CALL_OVF:
--- a/src/vm/jithelpers.cpp
+++ b/src/vm/jithelpers.cpp
@@ -341,6 +341,14 @@ HCIMPL2(UINT32, JIT_UMod, UINT32 dividend, UINT32 divisor)
HCIMPLEND
/*********************************************************************/
+HCIMPL2(INT32, JIT_Matt, INT32 x, INT32 y)
+{
+ FCALL_CONTRACT;
+ return x > y ? x : y;
+}
+HCIMPLEND
+
+/*********************************************************************/
HCIMPL2_VV(INT64, JIT_LDiv, INT64 dividend, INT64 divisor)
{
FCALL_CONTRACT;
In summary, these changes mean that during the JIT’s ‘Morph phase’ the IL containing the matt
op code is converted from:
fgMorphTree BB01, stmt 1 (before)
[000004] ------------ ▌ return int
[000002] ------------ │ ┌──▌ lclVar int V01 arg1
[000003] ------------ └──▌ m@ int
[000001] ------------ └──▌ lclVar int V00 arg0
into this:
fgMorphTree BB01, stmt 1 (after)
[000004] --C--+------ ▌ return int
[000003] --C--+------ └──▌ call help int HELPER.CORINFO_HELP_MATT
[000001] -----+------ arg0 in rcx ├──▌ lclVar int V00 arg0
[000002] -----+------ arg1 in rdx └──▌ lclVar int V01 arg1
Note the call to HELPER.CORINFO_HELP_MATT
When this is finally compiled into assembly code it ends up looking like so:
// Assembly listing for method HelloWorld:TestMattOpCodeMethod(int,int):int
// Emitting BLENDED_CODE for X64 CPU with AVX
// optimized code
// rsp based frame
// partially interruptible
// Final local variable assignments
//
// V00 arg0 [V00,T00] ( 3, 3 ) int -> rcx
// V01 arg1 [V01,T01] ( 3, 3 ) int -> rdx
// V02 OutArgs [V02 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
//
// Lcl frame size = 40
G_M9261_IG01:
4883EC28 sub rsp, 40
G_M9261_IG02:
E8976FEB5E call CORINFO_HELP_MATT
90 nop
G_M9261_IG03:
4883C428 add rsp, 40
C3 ret
I’m not entirely sure why there is a nop
instruction in there? But it works, which is the main thing!!
Step 4
In the CLR you can also dynamically emit code at runtime using the methods that sit under the ‘System.Reflection.Emit’ namespace, so the last task is to add the OpCodes.Matt
field and have it emit the correct values for the matt
op-code.
--- a/src/mscorlib/src/System/Reflection/Emit/OpCodes.cs
+++ b/src/mscorlib/src/System/Reflection/Emit/OpCodes.cs
@@ -139,6 +139,7 @@ internal enum OpCodeValues
Castclass = 0x74,
Isinst = 0x75,
Conv_R_Un = 0x76,
+ Matt = 0x77,
Unbox = 0x79,
Throw = 0x7a,
Ldfld = 0x7b,
@@ -1450,6 +1451,16 @@ private OpCodes()
(0
Fri, 19 May 2017, 12:00 am
Arrays and the CLR - a Very Special Relationship
A while ago I wrote about the ‘special relationship’ that exists between Strings and the CLR, well it turns out that Arrays and the CLR have an even deeper one, the type of closeness where you hold hands on your first meeting
As an aside, if you like reading about CLR internals you may find these other posts interesting:
Fundamental to the Common Language Runtime (CLR)
Arrays are such a fundamental part of the CLR that they are included in the ECMA specification, to make it clear that the runtime has to implement them:
In addition, there are several IL (Intermediate Language) instructions that specifically deal with arrays:
newarr
- Create a new array with elements of type etype.
ldelem.ref
- Load the element at index onto the top of the stack as an O. The type of the O is the same as the element type of the array pushed on the CIL stack.
stelem
- Replace array element at index with the value on the stack (also
stelem.i
, stelem.i1
, stelem.i2
, stelem.r4
etc)
ldlen
- Push the length (of type native unsigned int) of array on the stack.
This makes sense because arrays are the building blocks of so many other data types, you want them to be available, well defined and efficient in a modern high-level language like C#. Without arrays you can’t have lists, dictionaries, queues, stacks, trees, etc, they’re all built on-top of arrays which provided low-level access to contiguous pieces of memory in a type-safe way.
Memory and Type Safety
This memory and type-safety is important because without it .NET couldn’t be described as a ‘managed runtime’ and you’d be left having to deal with the types of issues you get when you are writing code in a more low-level language.
More specifically, the CLR provides the following protections when you are using arrays (from the section on Memory and Type Safety in the BOTR ‘Intro to the CLR’ page):
While a GC is necessary to ensure memory safety, it is not sufficient. The GC will not prevent the program from indexing off the end of an array or accessing a field off the end of an object (possible if you compute the field’s address using a base and offset computation). However, if we do prevent these cases, then we can indeed make it impossible for a programmer to create memory-unsafe programs.
While the common intermediate language (CIL) does have operators that can fetch and set arbitrary memory (and thus violate memory safety), it also has the following memory-safe operators and the CLR strongly encourages their use in most programming:
- Field-fetch operators (LDFLD, STFLD, LDFLDA) that fetch (read), set and take the address of a field by name.
- Array-fetch operators (LDELEM, STELEM, LDELEMA) that fetch, set and take the address of an array element by index. All arrays include a tag specifying their length. This facilitates an automatic bounds check before each access.
Also, from the section on Verifiable Code - Enforcing Memory and Type Safety in the same BOTR page
In practice, the number of run-time checks needed is actually very small. They include the following operations:
- Casting a pointer to a base type to be a pointer to a derived type (the opposite direction can be checked statically)
- Array bounds checks (just as we saw for memory safety)
- Assigning an element in an array of pointers to a new (pointer) value. This particular check is only required because CLR arrays have liberal casting rules (more on that later…)
However you don’t get this protection for free, there’s a cost to pay:
Note that the need to do these checks places requirements on the runtime. In particular:
- All memory in the GC heap must be tagged with its type (so the casting operator can be implemented). This type information must be available at runtime, and it must be rich enough to determine if casts are valid (e.g., the runtime needs to know the inheritance hierarchy). In fact, the first field in every object on the GC heap points to a runtime data structure that represents its type.
- All arrays must also have their size (for bounds checking).
- Arrays must have complete type information about their element type.
Implementation Details
It turns out that large parts of the internal implementation of arrays is best described as magic, this Stack Overflow comment from Marc Gravell sums it up nicely
Arrays are basically voodoo. Because they pre-date generics, yet must allow on-the-fly type-creation (even in .NET 1.0), they are implemented using tricks, hacks, and sleight of hand.
Yep that’s right, arrays were parametrised (i.e. generic) before generics even existed. That means you could create arrays such as int[]
and string[]
, long before you were able to write List
or List
, which only became possible in .NET 2.0.
Special helper classes
All this magic or sleight of hand is made possible by 2 things:
- The CLR breaking all the usual type-safety rules
- A special array helper class called
SZArrayHelper
But first the why, why were all these tricks needed? From .NET Arrays, IList, Generic Algorithms, and what about STL?:
When we were designing our generic collections classes, one of the things that bothered me was how to write a generic algorithm that would work on both arrays and collections. To drive generic programming, of course we must make arrays and generic collections as seamless as possible. It felt that there should be a simple solution to this problem that meant you shouldn’t have to write the same code twice, once taking an IList and again taking a T[]. The solution that dawned on me was that arrays needed to implement our generic IList. We made arrays in V1 implement the non-generic IList, which was rather simple due to the lack of strong typing with IList and our base class for all arrays (System.Array). What we needed was to do the same thing in a strongly typed way for IList.
But it was only done for the common case, i.e. ‘single dimensional’ arrays:
There were some restrictions here though – we didn’t want to support multidimensional arrays since IList only provides single dimensional accesses. Also, arrays with non-zero lower bounds are rather strange, and probably wouldn’t mesh well with IList, where most people may iterate from 0 to the return from the Count property on that IList. So, instead of making System.Array implement IList, we made T[] implement IList. Here, T[] means a single dimensional array with 0 as its lower bound (often called an SZArray internally, but I think Brad wanted to promote the term ‘vector’ publically at one point in time), and the element type is T. So Int32[] implements IList, and String[] implements IList.
Also, this comment from the array source code sheds some further light on the reasons:
//----------------------------------------------------------------------------------
// Calls to (IList)(array).Meth are actually implemented by SZArrayHelper.Meth
// This workaround exists for two reasons:
//
// - For working set reasons, we don't want insert these methods in the array
// hierachy in the normal way.
// - For platform and devtime reasons, we still want to use the C# compiler to
// generate the method bodies.
//
// (Though it's questionable whether any devtime was saved.)
//
// ....
//----------------------------------------------------------------------------------
So it was done for convenience and efficiently, as they didn’t want every instance of System.Array
to carry around all the code for the IEnumerable
and IList
implementations.
This mapping takes places via a call to GetActualImplementationForArrayGenericIListOrIReadOnlyListMethod(..), which wins the prize for the best method name in the CoreCLR source!! It’s responsible for wiring up the corresponding method from the SZArrayHelper class, i.e. IList.Count
-> SZArrayHelper.Count
or if the method is part of the IEnumerator
interface, the SZGenericArrayEnumerator is used.
But this has the potential to cause security holes, as it breaks the normal C# type system guarantees, specifically regarding the this
pointer. To illustrate the problem, here’s the source code of the Count
property, note the call to JitHelpers.UnsafeCast
:
internal int get_Count()
{
//! Warning: "this" is an array, not an SZArrayHelper. See comments above
//! or you may introduce a security hole!
T[] _this = JitHelpers.UnsafeCast(this);
return _this.Length;
}
Yikes, it has to remap this
to be able to call Length
on the correct object!!
And just in case those comments aren’t enough, there is a very strongly worded comment at the top of the class that further spells out the risks!!
Generally all this magic is hidden from you, but occasionally it leaks out. For instance if you run the code below, SZArrayHelper
will show up in the StackTrace
and TargetSite
of properties of the NotSupportedException
:
try {
int[] someInts = { 1, 2, 3, 4 };
IList collection = someInts;
// Throws NotSupportedException 'Collection is read-only'
collection.Clear();
} catch (NotSupportedException nsEx) {
Console.WriteLine("{0} - {1}", nsEx.TargetSite.DeclaringType, nsEx.TargetSite);
Console.WriteLine(nsEx.StackTrace);
}
Removing Bounds Checks
The runtime also provides support for arrays in more conventional ways, the first of which is related to performance. Array bounds checks are all well and good when providing memory-safety, but they have a cost, so where possible the JIT removes any checks that it knows are redundant.
It does this by calculating the range of values that a for
loop access and compares those to the actual length of the array. If it determines that there is never an attempt to access an item outside the permissible bounds of the array, the run-time checks are then removed.
For more information, the links below take you to the areas of the JIT source code that deal with this:
And if you are really keen, take a look at this gist that I put together to explore the scenarios where bounds checks are ‘removed’ and ‘not removed’.
Allocating an array
Another task that the runtime helps with is allocating arrays, using hand-written assembly code so the methods are as optimised as possible, see:
Run-time treats arrays differently
Finally, because arrays are so intertwined with the CLR, there are lots of places in which they are dealt with as a special-case. For instance a search for ‘IsArray()’ in the CoreCLR source returns over 60 hits, including:
- The method table for an array is built differently
- When you call
ToString()
on an array, you get special formatting, i.e. ‘System.Int32[]’ or ‘MyClass[,]’
So yes, it’s fair to say that arrays and the CLR have a Very Special Relationship
Further Reading
As always, here are some more links for your enjoyment!!
Array source code references
The post Arrays and the CLR - a Very Special Relationship first appeared on my blog Performance is a Feature!
CodeProject
Mon, 8 May 2017, 12:00 am
The CLR Thread Pool 'Thread Injection' Algorithm
If you’re near London at the end of April, I’ll be speaking at ProgSCon 2017 on Microsoft and Open-Source – A ‘Brave New World’. ProgSCon is 1-day conference, with talks covering an eclectic range of topics, you’ll learn lots!!
As part of a never-ending quest to explore the CoreCLR source code I stumbled across the intriguing titled ‘HillClimbing.cpp’ source file. This post explains what it does and why.
What is ‘Hill Climbing’
It turns out that ‘Hill Climbing’ is a general technique, from the Wikipedia page on the Hill Climbing Algorithm:
In computer science, hill climbing is a mathematical optimization technique which belongs to the family of local search. It is an iterative algorithm that starts with an arbitrary solution to a problem, then attempts to find a better solution by incrementally changing a single element of the solution. If the change produces a better solution, an incremental change is made to the new solution, repeating until no further improvements can be found.
But in the context of the CoreCLR, ‘Hill Climbing’ (HC) is used to control the rate at which threads are added to the Thread Pool, from the MSDN page on ‘Parallel Tasks’:
Thread Injection
The .NET thread pool automatically manages the number of worker threads in the pool. It adds and removes threads according to built-in heuristics. The .NET thread pool has two main mechanisms for injecting threads: a starvation-avoidance mechanism that adds worker threads if it sees no progress being made on queued items and a hill-climbing heuristic that tries to maximize throughput while using as few threads as possible.
…
A goal of the hill-climbing heuristic is to improve the utilization of cores when threads are blocked by I/O or other wait conditions that stall the processor
….
The .NET thread pool has an opportunity to inject threads every time a work item completes or at 500 millisecond intervals, whichever is shorter. The thread pool uses this opportunity to try adding threads (or taking them away), guided by feedback from previous changes in the thread count. If adding threads seems to be helping throughput, the thread pool adds more; otherwise, it reduces the number of worker threads. This technique is called the hill-climbing heuristic.
For more specifics on what the algorithm is doing, you can read the research paper Optimizing Concurrency Levels in the .NET ThreadPool published by Microsoft, although it you want a brief outline of what it’s trying to achieve, this summary from the paper is helpful:
In addition the controller should have:
- short settling times so that cumulative throughput is maximized
- minimal oscillations since changing control settings incurs overheads that reduce throughput
- fast adaptation to changes in workloads and resource characteristics.
So reduce throughput, don’t add and then remove threads too fast, but still adapt quickly to changing work-loads, simple really!!
As an aside, after reading (and re-reading) the research paper I found it interesting that a considerable amount of it was dedicated to testing, as the following excerpt shows:
In fact the approach to testing was considered so important that they wrote an entire follow-up paper that discusses it, see Configuring Resource Managers Using Model Fuzzing.
Why is it needed?
Because, in short, just adding new threads doesn’t always increase throughput and ultimately having lots of threads has a cost. As this comment from Eric Eilebrecht, one of the authors of the research paper explains:
Throttling thread creation is not only about the cost of creating a thread; it’s mainly about the cost of having a large number of running threads on an ongoing basis. For example:
- More threads means more context-switching, which adds CPU overhead. With a large number of threads, this can have a significant impact.
- More threads means more active stacks, which impacts data locality. The more stacks a CPU is having to juggle in its various caches, the less effective those caches are.
The advantage of more threads than logical processors is, of course, that we can keep the CPU busy if some of the threads are blocked, and so get more work done. But we need to be careful not to “overreact” to blocking, and end up hurting performance by having too many threads.
Or in other words, from Concurrency - Throttling Concurrency in the CLR 4.0 ThreadPool
As opposed to what may be intuitive, concurrency control is about throttling and reducing the number of work items that can be run in parallel in order to improve the worker ThreadPool throughput (that is, controlling the degree of concurrency is preventing work from running).
So the algorithm was designed with all these criteria in mind and was then tested over a large range of scenarios, to ensure it actually worked! This is why it’s often said that you should just leave the .NET ThreadPool alone, not try and tinker with it. It’s been heavily tested to work across a multiple situations and it was designed to adapt over time, so it should have you covered! (although of course, there are times when it doesn’t work perfectly!!)
The Algorithm in Action
As the source in now available, we can actually play with the algorithm and try it out in a few scenarios to see what it does. It needs very few dependences and therefore all the relevant code is contained in the following files:
(For comparison, there’s an implementation of the same algorithm in the Mono source code)
I have a project up on my GitHub page that allows you to test the hill-climbing algorithm in a self-contained console app. If you’re interested you can see the changes/hacks I had to do to get it building, although in the end it was pretty simple! (Update Kudos to Christian Klutz who ported my self-contained app to C#, nice job!!)
The algorithm is controlled via the following HillClimbing_XXX
settings:
Setting
Default Value
Notes
HillClimbing_WavePeriod
4
HillClimbing_TargetSignalToNoiseRatio
300
HillClimbing_ErrorSmoothingFactor
1
HillClimbing_WaveMagnitudeMultiplier
100
HillClimbing_MaxWaveMagnitude
20
HillClimbing_WaveHistorySize
8
HillClimbing_Bias
15
The ‘cost’ of a thread. 0 means drive for increased throughput regardless of thread count; higher values bias more against higher thread counts
HillClimbing_MaxChangePerSecond
4
HillClimbing_MaxChangePerSample
20
HillClimbing_MaxSampleErrorPercent
15
HillClimbing_SampleIntervalLow
10
HillClimbing_SampleIntervalHigh
200
HillClimbing_GainExponent
200
The exponent to apply to the gain, times 100. 100 means to use linear gain, higher values will enhance large moves and damp small ones
Because I was using the code in a self-contained console app, I just hard-coded the default values into the source, but in the CLR it appears that you can modify these values at runtime.
Working with the Hill Climbing code
There are several things I discovered when implementing a simple test app that works with the algorithm:
- The calculation is triggered by calling the function
HillClimbingInstance.Update(currentThreadCount, sampleDuration, numCompletions, &threadAdjustmentInterval)
and the return value is the new ‘maximum thread count’ that the algorithm is proposing.
- It calculates the desired number of threads based on the ‘current throughput’, which is the ‘# of tasks completed’ (
numCompletions
) during the current time-period (sampleDuration
in seconds).
- It also takes the current thread count (
currentThreadCount
) into consideration.
- The core calculations (excluding error handling and house-keeping) are only just over 100 LOC, so it’s not too hard to follow.
- It works on the basis of ‘transitions’ (
HillClimbingStateTransition
), first Warmup
, then Stabilizing
and will only recommend a new value once it’s moved into the ClimbingMove
state.
- The real .NET Thread Pool only increases the thread-count by one thread every 500 milliseconds. It keeps doing this until the ‘# of threads’ has reached the amount that the hill-climbing algorithm suggests. See ThreadpoolMgr::ShouldAdjustMaxWorkersActive() and ThreadpoolMgr::AdjustMaxWorkersActive() for the code that handles this.
- If it hasn’t got enough samples to do a ‘statistically significant’ calculation this algorithm will indicate this via the
threadAdjustmentInterval
variable. This means that you should not call HillClimbingInstance.Update(..)
until another threadAdjustmentInterval
milliseconds have elapsed. (link to source code that calculates this)
- The current thread count is only decreased when threads complete their current task. At that point the current count is compared to the desired amount and if necessary a thread is ‘retired’
- The algorithm with only returns values that respect the limits specified by ThreadPool.SetMinThreads(..) and ThreadPool.SetMaxThreads(..) (link to the code that handles this)
- In addition, it will only recommend increasing the thread count if the CPU Utilization is below 95%
First lets look at the graphs that were published in the research paper from Microsoft (Optimizing Concurrency Levels in the .NET ThreadPool):
They clearly show the thread-pool adapting the number of threads (up and down) as the throughput changes, so it appears the algorithm is doing what it promises.
Now for a similar image using the self-contained test app I wrote. Now, my test app only pretends to add/remove threads based on the results for the Hill Climbing algorithm, so it’s only an approximation of the real behaviour, but it does provide a nice way to see it in action outside of the CLR.
In this simple scenario, the work-load that we are asking the thread-pool to do is just moving up and then down (click for full-size image):
Finally, we’ll look at what the algorithm does in a more noisy scenario, here the current ‘work load’ randomly jumps around, rather than smoothly changing:
So with a combination of a very detailed MSDN article, a easy-to-read research paper and most significantly having the source code available, we are able to get an understanding of what the .NET Thread Pool is doing ‘under-the-hood’!
References
- Concurrency - Throttling Concurrency in the CLR 4.0 ThreadPool (I recommend reading this article before reading the research papers)
- Optimizing Concurrency Levels in the .NET ThreadPool: A case study of controller design and implementation
- Configuring Resource Managers Using Model Fuzzing: A Case Study of the .NET Thread Pool
- MSDN page on ‘Parallel Tasks’ (see section on ‘Thread Injection’)
- Patent US20100083272 - Managing pools of dynamic resources
Further Reading
- Erika Parsons and Eric Eilebrecht - CLR 4 - Inside the Thread Pool - Channel 9
- New and Improved CLR 4 Thread Pool Engine (Work-stealing and Local Queues)
- .NET CLR Thread Pool Internals (compares the new Hill Climbing algorithm, to the previous algorithm used in the Legacy Thread Pool)
- CLR thread pool injection, stuttering problems
- Why the CLR 2.0 SP1’s threadpool default max thread count was increased to 250/CPU
- Use a more dependable policy for thread pool thread injection (CoreCLR GitHub Issue)
- Use a more dependable policy for thread pool thread injection (CoreFX GitHub Issue)
- ThreadPool Growth: Some Important Details
- .NET’s ThreadPool Class - Behind The Scenes (Based on SSCLI source, not CoreCLR)
- CLR Execution Context (in Russian, but Google Translate does a reasonable job)
- Thread Pool + Task Testing (by Ben Adams)
- The Injector: A new Executor for Java (an improved thread-injector for the Java Thread Pool)
Discuss this post on Hacker News and /r/programming
The post The CLR Thread Pool 'Thread Injection' Algorithm first appeared on my blog Performance is a Feature!
CodeProject
Thu, 13 Apr 2017, 12:00 am
The .NET IL Interpreter
Whilst writing a previous blog post I stumbled across the .NET Interpreter, tucked away in the source code. Although, it I’d made even the smallest amount of effort to look for it, I’d have easily found it via the GitHub ‘magic’ file search:
Usage Scenarios
Before we look at how to use it and what it does, it’s worth pointing out that the Interpreter is not really meant for production code. As far as I can tell, its main purpose is to allow you to get the CLR up and running on a new CPU architecture. Without the interpreter you wouldn’t be able to test any C# code until you had a fully functioning JIT that could emit machine code for you. For instance see ‘[ARM32/Linux] Initial bring up of FEATURE_INTERPRETER’ and ‘[aarch64] Enable the interpreter on linux as well.
Also it doesn’t have a few key features, most notable debugging support, that is you can’t debug through C# code that has been interpreted, although you can of course debug the interpreter itself. From ‘Tiered Compilation step 1’:
…. - the interpreter is not in good enough shape to run production code as-is. There are also some significant issues if you want debugging and profiling tools to work (which we do).
You can see an example of this in ‘Interpreter: volatile ldobj appears to have incorrect semantics?’ (thanks to alexrp for telling me about this issue). There is also a fair amount of TODO
comments in the code, although I haven’t verified what (if any) specific C# code breaks due to the missing functionality.
However, I think another really useful scenario for the Interpreter is to help you learn about the inner workings of the CLR. It’s only 8,000 lines long, but it’s all in one file and most significantly it’s written in C++. The code that the CLR/JIT uses when compiling for real is in multiple several files (the JIT on it’s own is over 200,000 L.O.C, spread across 100’s of files) and there are large amounts hand-written written in raw assembly.
In theory the Interpreter should work in the same way as the full runtime, albeit not as optimised. This means that it much simpler and those of us who aren’t CLR and/or assembly experts can have a chance of working out what’s going on!
Enabling the Interpreter
The Interpreter is disabled by default, so you have to build the CoreCLR from source to make it work (it used to be the fallback for ARM64 but that’s no longer the case), here’s the diff of the changes you need to make:
--- a/src/inc/switches.h
+++ b/src/inc/switches.h
@@ -233,5 +233,8 @@
#define FEATURE_STACK_SAMPLING
#endif // defined (ALLOW_SXS_JIT)
+// Let's test the .NET Interpreter!!
+#define FEATURE_INTERPRETER
+
#endif // !defined(CROSSGEN_COMPILE)
You also need to enable some environment variables, the ones that I used are in the table below. For the full list, take a look at Host Configuration Knobs and search for ‘Interpreter’.
Name
Description
Interpret
Selectively uses the interpreter to execute the specified methods
InterpreterDoLoopMethods
If set, don’t check for loops, start by interpreting
all methods
InterpreterPrintPostMortem
Prints summary information about the execution to the console
DumpInterpreterStubs
Prints all interpreter stubs that are created to the console
TraceInterpreterEntries
Logs entries to interpreted methods to the console
TraceInterpreterIL
Logs individual instructions of interpreted methods to the console
TraceInterpreterVerbose
Logs interpreter progress with detailed messages to the console
TraceInterpreterJITTransition
Logs when the interpreter determines a method should be JITted
To test out the Interpreter, I will be using the code below:
public static void Main(string[] args)
{
var max = 1000 * 1000;
if (args.Length > 0)
int.TryParse(args[0], out max);
var timer = Stopwatch.StartNew();
for (int i = 1; i
Thu, 30 Mar 2017, 12:00 am
A Hitchhikers Guide to the CoreCLR Source Code
photo by Alan O’Rourke
Just over 2 years ago Microsoft open-sourced the entire .NET framework, this posts attempts to provide a ‘Hitchhikers Guide’ to the source-code found in the CoreCLR GitHub repository.
To make it easier for you to get to the information you’re interested in, this post is split into several parts
It’s worth pointing out that .NET Developers have provided 2 excellent glossaries, the CoreCLR one and the CoreFX one, so if you come across any unfamiliar terms or abbreviations, check these first. Also there is extensive documentation available and if you are interested in the low-level details I really recommend checking out the ‘Book of the Runtime’ (BotR).
Overall Stats
If you take a look at the repository on GitHub, it shows the following stats for the entire repo
But most of the C# code is test code, so if we just look under /src
(i.e. ignore any code under /tests
) there are the following mix of Source file types, i.e. no ‘.txt’, ‘.dat’, etc:
- 2,012 .cpp
- 1,183 .h
- 956 .cs
- 113 .inl
- 98 .hpp
- 51 .S
- 43 .py
- 42 .asm
- 24 .idl
- 20 .c
So by far the majority of the code is written in C++, but there is still also a fair amount of C# code (all under ‘mscorlib’). Clearly there are low-level parts of the CLR that have to be written in C++ or Assembly code because they need to be ‘close to the metal’ or have high performance, but it’s interesting that there are large parts of the runtime written in managed code itself.
Note: All stats/lists in the post were calculated using commit 51a6b5c from the 9th March 2017.
Compared to ‘Rotor’
As a comparison here’s what the stats for ‘Rotor’ the Shared Source CLI looked like back in October 2002. Rotor was ‘Shared Source’, not truly ‘Open Source’, so it didn’t have the same community involvements as the CoreCLR.
Note: SSCLI aka ‘Rotor’ includes the fx or base class libraries (BCL), but the CoreCLR doesn’t as they are now hosted separately in the CoreFX GitHub repository
For reference, the equivalent stats for the CoreCLR source in March 2017 look like this:
- Packaged as 61.2 MB .zip archive
- Over 10.8 million lines of code (2.6 million of source code, under \src)
- 24,485 Files (7,466 source)
- 6,626 C# (956 source)
- 2,074 C and C++
- 3,701 IL
- 93 Assembler
- 43 Python
- 6 Perl
- Over 8.2 million lines of test code
- Build output expands to over 1.2 G with tests
- Product binaries 342 MB
- Test binaries 909 MB
Top 10 lists
These lists are mostly just for fun, but they do give some insights into the code-base and how it’s structured.
Top 10 Largest Files
You might have heard about the mammoth source file that is gc.cpp, which is so large that GitHub refuses to display it.
But it turns out it’s not the only large file in the source, there are also several files in the JIT that are around 20K LOC. However it seems that all the large files are C++ source code, so if you’re only interested in C# code, you don’t have to worry!!
File
# Lines of Code
Type
Location
gc.cpp
37,037
.cpp
\src\gc\
flowgraph.cpp
24,875
.cpp
\src\jit\
codegenlegacy.cpp
21,727
.cpp
\src\jit\
importer.cpp
18,680
.cpp
\src\jit\
morph.cpp
18,381
.cpp
\src\jit\
isolationpriv.h
18,263
.h
\src\inc\
cordebug.h
18,111
.h
\src\pal\prebuilt\inc\
gentree.cpp
17,177
.cpp
\src\jit\
debugger.cpp
16,975
.cpp
\src\debug\ee\
Top 10 Longest Methods
The large methods aren’t actually that hard to find, because they’re all have #pragma warning(disable:21000)
before them, to keep the compiler happy! There are ~40 large methods in total, here’s the ‘Top 10’
Method
# Lines of Code
MarshalInfo::MarshalInfo(Module* pModule,
1,507
void gc_heap::plan_phase (int condemned_gen_number)
1,505
void CordbProcess::DispatchRCEvent()
1,351
void DbgTransportSession::TransportWorker()
1,238
LPCSTR Exception::GetHRSymbolicName(HRESULT hr)
1,216
BOOL Disassemble(IMDInternalImport *pImport, BYTE *ILHeader,…
1,081
bool Debugger::HandleIPCEvent(DebuggerIPCEvent * pEvent)
1,050
void LazyMachState::unwindLazyState(LazyMachState* baseState…
901
VOID ParseNativeType(Module* pModule,
886
VOID StubLinkerCPU::EmitArrayOpStub(const ArrayOpScript* pAr…
839
Top 10 files with the Most Commits
Finally, lets look at which files have been changed the most since the initial commit on GitHub back in January 2015 (ignore ‘merge’ commits)
File
# Commits
src\jit\morph.cpp
237
src\jit\compiler.h
231
src\jit\importer.cpp
196
src\jit\codegenxarch.cpp
190
src\jit\flowgraph.cpp
171
src\jit\compiler.cpp
161
src\jit\gentree.cpp
157
src\jit\lower.cpp
147
src\jit\gentree.h
137
src\pal\inc\pal.h
136
High-level Overview
Next we’ll take a look at how the source code is structured and what are the main components.
They say “A picture is worth a thousand words”, so below is a treemap with the source code files grouped by colour into the top-level sections they fall under. You can hover over an individual box to get more detailed information and can click on the different radio buttons to toggle the sizing (LOC/Files/Commits)
Total L.O.C
# Files
# Commits
Notes and Observations
- The ‘# Commits’ only represent the commits made on GitHub, in the 2 1/2 years since the CoreCLR was open-sourced. So they are skewed to the recent work and don’t represent changes made over the entire history of the CLR. However it’s interesting to see which components have had more ‘churn’ in the last few years (i.e ‘jit’) and which have been left alone (e.g. ‘pal’)
- From the number of LOC/files it’s clear to see what the significant components are within the CoreCLR source, e.g ‘vm’, ‘jit’, ‘pal’ & ‘mscorlib’ (these are covered in detail in the next part of this post)
- In the ‘VM’ section it’s interesting to see how much code is generic ~650K LOC and how much is per-CPU architecture 25K LOC for ‘i386’, 16K for ‘amd64’, 14K for ‘arm’ and 7K for ‘arm64’. This suggests that the code is nicely organised so that the per-architecture work is minimised and cleanly separated out.
- It’s surprising (to me) that the ‘GC’ section is as small as it is, I always thought of the GC is a very complex component, but there is way more code in the ‘debugger’ and the ‘pal’.
- Likewise, I never really appreciated the complexity if the ‘JIT’, it’s the 2nd largest component, comprising over 370K LOC.
If you’re interested, this raw numbers for the code under ‘/src’ are available in this gist and for the code under ‘/tests/src’ in this gist.
Deep Dive into Individual Areas
As the source code is well organised, the top-level folders (under /src) correspond to the logical components within the CoreCLR. We’ll start off by looking at the most significant components, i.e. the ‘Debugger’, ‘Garbage Collector’ (GC), ‘Just-in-Time compiler’ (JIT), ‘mscorlib’ (all the C# code), ‘Platform Adaptation Layer’ (PAL) and the CLR ‘Virtual Machine’ (VM).
The ‘mscorlib’ folder contains all the C# code within the CoreCLR, so it’s the place that most C# developers would start looking if they wanted to contribute. For this reason it deserves it’s own treemap, so we can see how it’s structured:
Total L.O.C
# Files
# Commits
So by-far the bulk of the code is at the ‘top-level’, i.e. directly in the ‘System’ namespace, this contains the fundamental types that have to exist for the CLR to run, such as:
AppDomain
, WeakReference
, Type
,
Array
, Delegate
, Object
, String
Boolean
, Byte
, Char
, Int16
, Int32
, etc
Tuple
, Span
, ArraySegment
, Attribute
, DateTime
Where possible the CoreCLR is written in C#, because of the benefits that ‘managed code’ brings, so there is a significant amount of code within the ‘mscorlib’ section. Note that anything under here is not externally exposed, when you write C# code that runs against the CoreCLR, you actually access everything through the CoreFX, which then type-forwards to the CoreCLR where appropriate.
I don’t know the rules for what lives in CoreCLR v CoreFX, but based on what I’ve read on various GitHub issues, it seems that over time, more and more code is moving from CoreCLR -> CoreFX.
However the managed C# code is often deeply entwined with unmanaged C++, for instance several types are implemented across multiple files, e.g.
From what I understand this is done for performance reasons, any code that is perf sensitive will end up being implemented in C++ (or even Assembly), unless the JIT can suitable optimise the C# code.
Code shared with CoreRT
Recently there has been a significant amount of work done to moved more and more code over into the ‘shared partition’. This is the area of the CoreCLR source code that is shared with CoreRT (‘the .NET Core runtime optimized for AOT compilation’). Because certain classes are implemented in both runtimes, they’ve ensured that the work isn’t duplicated and any fixes are shared in both locations. You can see how this works by looking at the links below:
Other parts of mscorlib
All the other sections of mscorlib line up with namespaces
available in the .NET runtime and contain functionality that most C# devs will have used at one time or another. The largest ones in there are shown below (click to go directly to the source code):
- System.Reflection and System.Reflection.Emit
FieldInfo
, PropertyInfo
, MethodInfo
, AssemblyBuilder
, TypeBuilder
, MethodBuilder
, ILGenerator
- System.Globalization
CultureInfo
, CalendarInfo
, DateTimeParse
, JulianCalendar
, HebrewCalendar
- System.Threading and System.Threading.Tasks
Thread
, Timer
, Semaphore
, Mutex
, AsyncLocal
, Task
, Task
, CancellationToken
- System.Runtime.CompilerServices and System.Runtime.InteropServices
Unsafe
, [CallerFilePath]
, [CallerLineNumber]
, [CallerMemberName]
, GCHandle
, [LayoutKind]
, [MarshalAs(..)]
, [StructLayout(LayoutKind ..)]
- System.Diagnostics
Assert
, Debugger
, Stacktrace
- System.Text
StringBuilder
, ASCIIEncoding
, UTF8Encoding
, UnicodeEncoding
- System.Collections
- System.Collections.Generic
- System.IO
Stream
, MemoryStream
, File
, TestReader
, TestWriter
The VM, not surprisingly, is the largest component of the CoreCLR, with over 640K L.O.C spread across 576 files, and it contains the guts of the runtime. The bulk of the code is OS and CPU independent and written in C++, however there is also a significant amount of architecture-specific assembly code, see the section ‘CPU Architecture-specific code’ for more info.
The VM contains the main start-up routine of the entire runtime EEStartupHelper()
in ceemain.cpp, see ‘The 68 things the CLR does before executing a single line of your code’ for all the details. In addition it provides the following functionality:
- Type System
- Loading types/classes
- Threading
- Exception Handling and Stack Walking
- Fundamental Types
- Generics
- An entire Interpreter (yes .NET can run interpreted!!)
- Function calling mechanisms (see BotR for more info)
- Stubs (used for virtual dispatch and delegates amongst other things)
- Event Tracing
- Profiler
- P/Invoke
- Reflection
CPU Architecture-specific code
All the architecture-specific code is kept separately in several sub-folders, amd64, arm, arm64 and i386. For example here’s the various implementations of the WriteBarrier
function used by the GC:
Before we look at the actual source code, it’s worth looking at the different ‘flavours’ or the JIT that are available:
Fortunately one of the Microsoft developers has clarified which one should be used
Here’s my guidance on how non-MS contributors should think about contributing to the JIT: If you want to help advance the state of the production code-generators for .NET, then contribute to the new RyuJIT x86/ARM32 backend. This is our long term direction. If instead your interest is around getting the .NET Core runtime working on x86 or ARM32 platforms to do other things, by all means use and contribute bug fixes if necessary to the LEGACY_BACKEND paths in the RyuJIT code base today to unblock yourself. We do run testing on these paths today in our internal testing infrastructure and will do our best to avoid regressing it until we can replace it with something better. We just want to make sure that there will be no surprises or hard feelings for when the time comes to remove them from the code-base.
JIT Phases
The JIT has almost 90 source files, but fortunately they correspond to the different phases it goes through, so it’s not too hard to find your way around. Using the table from ‘Phases of RyuyJIT’, I added the right-hand column so you can jump to the relevant source file(s):
Phase
IR Transformations
File
Pre-import
Compiler->lvaTable
created and filled in for each user argument and variable. BasicBlock list initialized.
compiler.hpp
Importation
GenTree
nodes created and linked in to Statements, and Statements into BasicBlocks. Inlining candidates identified.
importer.cpp
Inlining
The IR for inlined methods is incorporated into the flowgraph.
inline.cpp and
inlinepolicy.cpp
Struct Promotion
New lvlVars are created for each field of a promoted struct.
morph.cpp
Mark Address-Exposed Locals
lvlVars with references occurring in an address-taken context are marked. This must be kept up-to-date.
compiler.hpp
Morph Blocks
Performs localized transformations, including mandatory normalization as well as simple optimizations.
morph.cpp
Eliminate Qmarks
All
GT_QMARK
nodes are eliminated, other than simple ones that do not require control flow.
compiler.cpp
Flowgraph Analysis
BasicBlock
predecessors are computed, and must be kept valid. Loops are identified, and normalized, cloned and/or unrolled.
flowgraph.cpp
Normalize IR for Optimization
lvlVar references counts are set, and must be kept valid. Evaluation order of
GenTree
nodes (
gtNext
/
gtPrev
) is determined, and must be kept valid.
compiler.cpp and
lclvars.cpp
SSA and Value Numbering Optimizations
Computes liveness (
bbLiveIn
and
bbLiveOut
on
BasicBlocks
), and dominators. Builds SSA for tracked lvlVars. Computes value numbers.
liveness.cpp
Loop Invariant Code Hoisting
Hoists expressions out of loops.
optimizer.cpp
Copy Propagation
Copy propagation based on value numbers.
copyprop.cpp
Common Subexpression Elimination (CSE)
Elimination of redundant subexressions based on value numbers.
optcse.cpp
Assertion Propagation
Utilizes value numbers to propagate and transform based on properties such as non-nullness.
assertionprop.cpp
Range analysis
Eliminate array index range checks based on value numbers and assertions
rangecheck.cpp
Rationalization
Flowgraph order changes from
FGOrderTree
to
FGOrderLinear
. All
GT_COMMA
,
GT_ASG
and
GT_ADDR
nodes are transformed.
rationalize.cpp
Lowering
Register requirements are fully specified (
gtLsraInfo
). All control flow is explicit.
lower.cpp,
lowerarm.cpp,
lowerarm64.cpp and
lowerxarch.cpp
Register allocation
Registers are assigned (
gtRegNum
and/or
gtRsvdRegs
),and the number of spill temps calculated.
regalloc.cpp and
register_arg_convention.cp
Code Generation
Determines frame layout. Generates code for each
BasicBlock
. Generates prolog & epilog code for the method. Emit EH, GC and Debug info.
codegenarm.cpp,
codegenarm64.cpp,
codegencommon.cpp,
codegenlegacy.cpp,
codegenlinear.cpp and
codegenxarch.cpp
The PAL provides an OS independent layer to give access to common low-level functionality such as:
As .NET was originally written to run on Windows, all the APIs look very similar to the Win32 APIs. However for non-Windows platforms they are actually implemented using the functionality available on that OS. For example this is what PAL code to read/write a file looks like:
int main(int argc, char *argv[])
{
WCHAR src[4] = {'f', 'o', 'o', '\0'};
WCHAR dest[4] = {'b', 'a', 'r', '\0'};
WCHAR dir[5] = {'/', 't', 'm', 'p', '\0'};
HANDLE h;
unsigned int b;
PAL_Initialize(argc, (const char**)argv);
SetCurrentDirectoryW(dir);
SetCurrentDirectoryW(dir);
h = CreateFileW(src, GENERIC_WRITE, FILE_SHARE_READ, NULL, CREATE_NEW, 0, NULL);
WriteFile(h, "Testing\n", 8, &b, FALSE);
CloseHandle(h);
CopyFileW(src, dest, FALSE);
DeleteFileW(src);
PAL_Terminate();
return 0;
}
The PAL does contain some per-CPU assembly code, but it’s only for very low-level functionality, for instance here’s the different implementations of the DebugBreak
function:
The GC is clearly a very complex piece of code, lying right at the heart of the CLR, so for more information about what it does I recommend reading the BotR entry on ‘Garbage Collection Design’ and if you’re interested I’ve also written several blog posts looking at its functionality.
However from a source code point-of-view the GC is pretty simple, it’s spread across just 19 .cpp files, but the bulk of the work is in gc.cpp (raw version) all ~37K L.O.C of it!!
If you want to get deeper into the GC code (warning, it’s pretty dense), a good way to start is to search for the occurrences of various ETW
events that are fired as the GC moves through the phases outlined in the BotR post above, these events are listed below:
FireEtwGCTriggered(..)
FireEtwGCAllocationTick_V1(..)
FireEtwGCFullNotify_V1(..)
FireEtwGCJoin_V2(..)
FireEtwGCMarkWithType(..)
FireEtwGCPerHeapHistory_V3(..)
FireEtwGCGlobalHeapHistory_V2(..)
FireEtwGCCreateSegment_V1(..)
FireEtwGCFreeSegment_V1(..)
FireEtwBGCAllocWaitBegin(..)
FireEtwBGCAllocWaitEnd(..)
FireEtwBGCDrainMark(..)
FireEtwBGCRevisit(..)
FireEtwBGCOverflow(..)
FireEtwPinPlugAtGCTime(..)
FireEtwGCCreateConcurrentThread_V1(..)
FireEtwGCTerminateConcurrentThread_V1(..)
But the GC doesn’t work in isolation, it also requires help from the Execute Engine (EE), this is done via the GCToEEInterface
which is implemented in gcenv.ee.cpp.
Local GC and GC Sample
Finally, there are 2 others ways you can get into the GC code and understand what it does.
Firstly there is a GC sample the lets you use the full GC independent of the rest of the runtime. It shows you how to ‘create type layout information in format that the GC expects’, ‘implement fast object allocator and write barrier’ and ‘allocate objects and work with GC handles’, all in under 250 LOC!!
Also worth mentioning is the ‘Local GC’ project, which is an ongoing effort to decouple the GC from the rest of the runtime, they even have a dashboard so you can track its progress. Currently the GC code is too intertwined with the runtime and vica-versa, so ‘Local GC’ is aiming to break that link by providing a set of clear interfaces, GCToOSInterface
and GCToEEInterface
. This will help with the CoreCLR cross-platform efforts, making the GC easier to port to new OSes.
The CLR is a ‘managed runtime’ and one of the significant components it provides is a advanced debugging experience, via Visual Studio or WinDBG. This debugging experience is very complex and I’m not going to go into it in detail here, however if you want to learn more I recommend you read ‘Data Access Component (DAC) Notes’.
But what does the source look like, how is it laid out? Well the a several main sub-components under the top-level /debug
folder:
- dacaccess - the provides the ‘Data Access Component’ (DAC) functionality as outlined in the BotR page linked to above. The DAC is an abstraction layer over the internal structures in the runtime, which the debugger uses to inspect objects/classes
- di - this contains the exposed APIs (or entry points) of the debugger, implemented by
CoreCLRCreateCordbObject(..)
in cordb.cpp
- ee - the section of debugger that works with the Execution Engine (EE) to do things like stack-walking
- inc - all the interfaces (.h) files that the debugger components implement
All the rest
As well as the main components, there are various other top-level folders in the source, the full list is below:
- binder
- The ‘binder’ is responsible for loading assemblies within a .NET program (except the mscorlib binder which is elsewhere). The ‘binder’ comprises low-level code that controls Assemblies, Application Contexts and the all-important Fusion Log for diagnosing why assemblies aren’t loading!
- classlibnative
- Code for native implementations of many of the core data types in the CoreCLR, e.g. Arrays, System.Object, String, decimal, float and double.
- Also includes all the native methods exposed in the ‘System.Environment’ namespace, e.g.
Environment.ProcessorCount
, Environment.TickCount
, Environment.GetCommandLineArgs()
, Environment.FailFast()
, etc
- coreclr
- corefx
- dlls
- gcdump and gcinfo
- Code that will write-out the
GCInfo
that is produced by the JIT to help the GC do it’s job. This GCInfo
includes information about the ‘liveness’ of variables within a section of code and whether the method is fully or partially interruptible, which enables the EE to suspend methods when the GC is working.
- ilasm
- IL (Intermediate Language) Assembler is a tool for converting IL code into a .NET executable, see the MSDN page for more info and usage examples.
- ildasm
- Tool for disassembling a .NET executable into the corresponding IL source code, again, see the MSDN page for info and usage examples.
- inc
- Header files that define the ‘interfaces’ between the sub-components that make up the CoreCLR. For example corjit.h covers all communication between the Execution Engine (EE) and the JIT, that is ‘EE -> JIT’ and corinfo.h is the interface going the other way, i.e. ‘JIT -> EE’
- ipcman
- Code that enables the ‘Inter-Process Communication’ (IPC) used in .NET (mostly legacy and probably not cross-platform)
- md
- The MetaData (md) code provides the ability to gather information about methods, classes, types and assemblies and is what makes Reflection possible.
- nativeresources
- A simple tool that is responsible for converting/extracting resources from a Windows Resource File.
- palrt
- The PAL (Platform Adaptation Layer) Run-Time, contains specific parts of the PAL layer.
- scripts
- Several Python scripts for auto-generating various files in the source (e.g. ETW events).
- strongname
- ToolBox
- Contains 2 stand-alone tools
- SOS (son-of-strike) the CLR debugging extension that enables reporting of .NET specific information when using WinDBG
- SuperPMI which enables testing of the JIT without requiring the full Execution Engine (EE)
- tools
- unwinder
- Provides the low-level functionality to make it possible for the debugger and exception handling components to walk or unwind the stack. This is done via 2 functions,
GetModuleBase(..)
and GetFunctionEntry(..)
which are implemented in CPU architecture-specific code, see amd64, arm, arm64 and i386
- utilcode
- Shared utility code that is used by the VM, Debugger and JIT
- zap
If you’ve read this far ‘So long and thanks for all the fish’ (YouTube)
Discuss this post on Hacker News and /r/programming
Thu, 23 Mar 2017, 12:00 am
The 68 things the CLR does before executing a single line of your code (*)
Because the CLR is a managed environment there are several components within the runtime that need to be initialised before any of your code can be executed. This post will take a look at the EE (Execution Engine) start-up routine and examine the initialisation process in detail.
(*) 68 is only a rough guide, it depends on which version of the runtime you are using, which features are enabled and a few other things
‘Hello World’
Imagine you have the simplest possible C# program, what has to happen before the CLR prints ‘Hello World’ out to the console?
using System;
namespace ConsoleApplication
{
public class Program
{
public static void Main(string[] args)
{
Console.WriteLine("Hello World!");
}
}
}
The code path into the EE (Execution Engine)
When a .NET executable runs, control gets into the EE via the following code path:
- _CorExeMain() (the external entry point)
- _CorExeMainInternal()
- EnsureEEStarted()
- EEStartup()
- EEStartupHelper()
(if you’re interested in what happens before this, i.e. how a CLR Host can start-up the runtime, see my previous post ‘How the dotnet CLI tooling runs your code’)
And so we end up in EEStartupHelper()
, which at a high-level does the following (from a comment in ceemain.cpp):
EEStartup is responsible for all the one time initialization of the runtime.
Some of the highlights of what it does include
- Creates the default and shared, appdomains.
- Loads mscorlib.dll and loads up the fundamental types (System.Object …)
The main phases in EE (Execution Engine) start-up routine
But let’s look at what it does in detail, the lists below contain all the individual function calls made from EEStartupHelper() (~500 L.O.C). To make them easier to understand, we’ll split them up into separate phases:
- Phase 1 - Set-up the infrastructure that needs to be in place before anything else can run
- Phase 2 - Initialise the core, low-level components
- Phase 3 - Start-up the low-level components, i.e. error handling, profiling API, debugging
- Phase 4 - Start the main components, i.e. Garbage Collector (GC), AppDomains, Security
- Phase 5 - Final setup and then notify other components that the EE has started
Note some items in the list below are only included if a particular feature is defined at build-time, these are indicated by the inclusion on an ifdef
statement. Also note that the links take you to the code for the function being called, not the line of code within EEStartupHelper()
.
Phase 1 - Set-up the infrastructure that needs to be in place before anything else can run
- Wire-up console handling - SetConsoleCtrlHandler(..) (
ifndef FEATURE_PAL
)
- Initialise the internal
SString
class (everything uses strings!) - SString::Startup()
- Make sure the configuration is set-up, so settings that control run-time options can be accessed - EEConfig::Set-up() and InitializeHostConfigFile() (
#if !defined(CROSSGEN_COMPILE)
)
- Initialize Numa and CPU group information - NumaNodeInfo::InitNumaNodeInfo() and CPUGroupInfo::EnsureInitialized() (
#ifndef CROSSGEN_COMPILE
)
- Initialize global configuration settings based on startup flags - InitializeStartupFlags()
- Set-up the Thread Manager that gives the runtime access to the OS threading functionality (
StartThread()
, Join()
, SetThreadPriority()
etc) - InitThreadManager()
- Initialize Event Tracing (ETW) and fire off the CLR startup events - InitializeEventTracing() and ETWFireEvent(EEStartupStart_V1) (
#ifdef FEATURE_EVENT_TRACE
)
- Set-up the GS Cookie (Buffer Security Check) to help prevent buffer overruns - InitGSCookie()
- Create the data-structures needed to hold the ‘frames’ used for stack-traces - Frame::Init()
- Ensure initialization of Apphacks environment variables - GetGlobalCompatibilityFlags() (
#ifndef FEATURE_CORECLR
)
- Create the diagnostic and performance logs used by the runtime - InitializeLogging() (
#ifdef LOGGING
) and PerfLog::PerfLogInitialize() (#ifdef ENABLE_PERF_LOG
)
Phase 2 - Initialise the core, low-level components
- Write to the log
===================EEStartup Starting===================
- Ensure that the Runtime Library functions (that interact with ntdll.dll) are enabled - EnsureRtlFunctions() (
#ifndef FEATURE_PAL
)
- Set-up the global store for events (mutexes, semaphores) used for synchronisation within the runtime - InitEventStore()
- Create the Assembly Binding logging mechanism a.k.a Fusion - InitializeFusion() (
#ifdef FEATURE_FUSION
)
- Then initialize the actual Assembly Binder infrastructure - CCoreCLRBinderHelper::Init() which in turn calls AssemblyBinder::Startup() (
#ifdef FEATURE_FUSION
is NOT defined)
- Set-up the heuristics used to control Monitors, Crsts, and SimpleRWLocks - InitializeSpinConstants()
- Initialize the InterProcess Communication with COM (IPC) - InitializeIPCManager() (
#ifdef FEATURE_IPCMAN
)
- Set-up and enable Performance Counters - PerfCounters::Init() (
#ifdef ENABLE_PERF_COUNTERS
)
- Set-up the CLR interpreter - Interpreter::Initialize() (
#ifdef FEATURE_INTERPRETER
), turns out that the CLR has a mode where your code is interpreted instead of compiled!
- Initialise the stubs that are used by the CLR for calling methods and triggering the JIT - StubManager::InitializeStubManagers(), also Stub::Init() and StubLinkerCPU::Init()
- Set up the core handle map, used to load assemblies into memory - PEImage::Startup()
- Startup the access checks options, used for granting/denying security demands on method calls - AccessCheckOptions::Startup()
- Startup the mscorlib binder (used for loading “known” types from mscorlib.dll) - MscorlibBinder::Startup()
- Initialize remoting, which allows out-of-process communication - CRemotingServices::Initialize() (
#ifdef FEATURE_REMOTING
)
- Set-up the data structures used by the GC for weak, strong and no-pin references - Ref_Initialize()
- Set-up the contexts used to proxy method calls across App Domains - Context::Initialize()
- Wire-up events that allow the EE to synchronise shut-down -
g_pEEShutDownEvent->CreateManualEvent(FALSE)
- Initialise the process-wide data structures used for reader-writer lock implementation - CRWLock::ProcessInit() (
#ifdef FEATURE_RWLOCK
)
- Initialize the debugger manager - CCLRDebugManager::ProcessInit() (
#ifdef FEATURE_INCLUDE_ALL_INTERFACES
)
- Initialize the CLR Security Attribute Manager - CCLRSecurityAttributeManager::ProcessInit() (
#ifdef FEATURE_IPCMAN
)
- Set-up the manager for Virtual call stubs - VirtualCallStubManager::InitStatic()
- Initialise the lock that that GC uses when controlling memory pressure - GCInterface::m_MemoryPressureLock.Init(CrstGCMemoryPressure)
- Initialize Assembly Usage Logger - InitAssemblyUsageLogManager() (
#ifndef FEATURE_CORECLR
)
Phase 3 - Start-up the low-level components, i.e. error handling, profiling API, debugging
- Set-up the App Domains used by the CLR - SystemDomain::Attach() (also creates the DefaultDomain and the SharedDomain by calling SystemDomain::CreateDefaultDomain() and SharedDomain::Attach())
- Start up the ECall interface, a private native calling interface used within the CLR - ECall::Init()
- Set-up the caches for the stubs used by
delegates
- COMDelegate::Init()
- Set-up all the global/static variables used by the EE itself - ExecutionManager::Init()
- Initialise Watson, for windows error reporting - InitializeWatson(fFlags) (
#ifndef FEATURE_PAL
)
- Initialize the debugging services, this must be done before any EE thread objects are created, and before any classes or modules are loaded - InitializeDebugger() (
#ifdef DEBUGGING_SUPPORTED
)
- Activate the Managed Debugging Assistants that the CLR provides - ManagedDebuggingAssistants::EEStartupActivation() (
ifdef MDA_SUPPORTED
)
- Initialise the Profiling API - ProfilingAPIUtility::InitializeProfiling() (
#ifdef PROFILING_SUPPORTED
)
- Initialise the exception handling mechanism -