Analysing .NET start-up time with Flamegraphs

Recently I gave a talk at the NYAN Conference called ‘From ‘dotnet run’ to ‘hello world’:

In the talk I demonstrate how you can use PerfView to analyse where the .NET Runtime is spending it’s time during start-up:

From 'dotnet run' to 'hello world' from Matt Warren

This post is a step-by-step guide to that demo.

Code Sample

For this exercise I delibrately only look at what the .NET Runtime is doing during program start-up, so I ensure the minimum amount of user code is runing, hence the following ‘Hello World’:

using System;

namespace HelloWorld
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
            Console.WriteLine("Press  to exit");
            Console.ReadLine();
        }
    }
}

The Console.ReadLine() call is added because I want to ensure the process doesn’t exit whilst PerfView is still collecting data.

Data Collection

PerfView is a very powerful program, but not the most user-friendly of tools, so I’ve put togerther a step-by-step guide:

  1. Download and run a recent version of ‘PerfView.exe’
  2. Click ‘Run a command’ or (Alt-R’) and “collect data while the command is running”
  3. Ensure that you’ve entered values for:
    1. Command
    2. Current Dir
  4. Tick ‘Cpu Samples’ if it isn’t already selected
  5. Set ‘Max Collect Sec’ to 15 seconds (because our ‘HelloWorld’ app never exits, we need to ensure PerfView stops collecting data at some point)
  6. Ensure that ‘.NET Symbol Collection’ is selected
  7. Hit ‘Run Command

If you then inspect the log you can see that it’s collecting data, obtaining symbols and then finally writing everything out to a .zip file. Once the process is complete you should see the newly created file in the left-hand pane of the main UI, in this case it’s called ‘PerfViewData.etl.zip’

Data Processing

Once you have your ‘.etl.zip’ file, double-click on it and you will see a tree-view with all the available data. Now, select ‘CPU Stacks’ and you’ll be presented with a view like this:

Notice there’s alot of ‘?’ characters in the list, this means that PerfView is not able to work out the method names as it hasn’t resolved the necessary symbols for the Runtime dlls. Lets fix that:

  1. Open ‘CPU Stacks
  2. In the list, select the ‘HelloWorld’ process (PerfView collects data machine-wide)
  3. In the ‘GroupPats’ drop-down, select ‘[no grouping]’
  4. Optional, change the ‘Symbol Path’ from the default to something else
  5. In the ‘By name’ tab, hit ‘Ctrl+A’ to select all the rows
  6. Right-click and select ‘Lookup Symbols’ (or just hit ‘Alt+S’)

Now the ‘CPU Stacks’ view should look something like this:

Finally, we can get the data we want:

  1. Select the ‘Flame Graph’ tab
  2. Change ‘GroupPats’ to one of the following for a better flame graph:
    1. [group module entries] {%}!=>module $1
    2. [group class entries] {%!*}.%(=>class $1;{%!*}::=>class $1
  3. Change ‘Fold%’ to a higher number, maybe 3%, to get rid of any thin bars (any higher and you start to loose information)

Now, at this point I actually recommend exporting the PerfView data into a format that can be loaded into https://speedscope.app/ as it gives you a much better experience. To do this click File -> Save View As and then in the ‘Save as type’ box select Speed Scope Format. Once that’s done you can ‘browse’ that file at speedscope.app, or if you want you can just take a look at one I’ve already created.

Note: If you’ve never encountered ‘flamegraphs’ before, I really recommend reading this excellent explanation by Julia Evans:

perf & flamegraphs pic.twitter.com/duzWs2hoLT

— 🔎Julia Evans🔍 (@b0rk) December 26, 2017

Anaylsis of .NET Runtime Startup

Finally, we can answer our original question:

Where does the .NET Runtime spend time during start-up?

Here’s the data from the flamegraph summarised as text, with links the corresponding functions in the ‘.NET Core Runtime’ source code:

  1. Entire Application - 100% - 233.28ms
  2. Everything except helloworld!wmain - 21%
  3. helloworld!wmain - 79% - 184.57ms
    1. hostpolicy!create_hostpolicy_context - 30% - 70.92ms here
    2. hostpolicy!create_coreclr - 22% - 50.51ms here
      1. coreclr!CorHost2::Start - 9% - 20.98ms here
      2. coreclr!CorHost2::CreateAppDomain - 10% - 23.52ms here
    3. hostpolicy!runapp - 20% - 46.20ms here, ends up calling into Assembly::ExecuteMainMethod here
      1. coreclr!RunMain - 9.9% - 23.12ms here
      2. coreclr!RunStartupHooks - 8.1% - 19.00ms here
    4. hostfxr!resolve_frameworks_for_app - 3.4% - 7.89ms here

So, the main places that the runtime spends time are:

  1. 30% of total time is spent Launching the runtime, controlled via the ‘host policy’, which mostly takes place in hostpolicy!create_hostpolicy_context (30% of total time)
  2. 22% of time is spend on Initialisation of the runtime itself and the initial (and only) AppDomain it creates, this can be see in CorHost2::Start (native) and CorHost2::CreateAppDomain (managed). For more info on this see The 68 things the CLR does before executing a single line of your code
  3. 20% was used JITting and executing the Main method in our ‘Hello World’ code sample, this started in Assembly::ExecuteMainMethod above.

To confirm the last point, we can return to PerfView and take a look at the ‘JIT Stats Summary’ it produces. From the main menu, under ‘Advanced Group’ -> ‘JIT Stats’ we see that 23.1 ms or 9.1% of the total CPU time was spent JITing:


Tue, 3 Mar 2020, 12:00 am


Under the hood of "Default Interface Methods"

Background

‘Default Interface Methods’ (DIM) sometimes referred to as ‘Default Implementations in Interfaces’, appeared in C# 8. In case you’ve never heard of the feature, here’s some links to get you started:

Also, there are quite a few other blogs posts discussing this feature, but as you can see opinion is split on whether it’s useful or not:

But this post isn’t about what they are, how you can use them or if they’re useful or not. Instead we will be exploring how ‘Default Interface Methods’ work under-the-hood, looking at what the .NET Core Runtime has to do to make them work and how the feature was developed.

Table of Contents

Development Timeline and PRs

First of all, there are a few places you can go to get a ‘high-level’ understanding of what was done:

Initial work, Prototype and Timeline

Interesting PR’s done after the prototype (newest -> oldest)

Once the prototype was merged in, there was additional feature work done to ensure that DIM’s worked across different scenarios:

Bug fixes done since the Prototype (newest -> oldest)

In addition, there were various bugs fixes done to ensure that existing parts of the CLR played nicely with DIMs:

Possible future work

Finally, there’s no guarantee if or when this will be done, but here are the remaining issues associated with the project:

Default Interface Methods ‘in action’

Now that we’ve seen what was done, let’s look at what that all means, starting with this code that simply demonstrates ‘Default Interface Methods’ in action:

interface INormal {
    void Normal();
}

interface IDefaultMethod {
    void Default() => WriteLine("IDefaultMethod.Default");
}

class CNormal : INormal {
    public void Normal() => WriteLine("CNormal.Normal");
}

class CDefault : IDefaultMethod {
    // Nothing to do here!
}

class CDefaultOwnImpl : IDefaultMethod {
    void IDefaultMethod.Default() => WriteLine("CDefaultOwnImpl.IDefaultMethod.Default");
}

// Test out the Normal/DefaultMethod Interfaces
INormal iNormal = new CNormal();
iNormal.Normal(); // prints "CNormal.Normal"

IDefaultMethod iDefault = new CDefault();
iDefault.Default(); // prints "IDefaultMethod.Default"

IDefaultMethod iDefaultOwnImpl = new CDefaultOwnImpl();
iDefaultOwnImpl.Default(); // prints "CDefaultOwnImpl.IDefaultMethod.Default"

The first way we can understand how they are implemented is by using Type.GetInterfaceMap(Type) (which actually had to be fixed to work with DIMs), this can be done with code like this:

private static void ShowInterfaceMapping(Type @implemetation, Type @interface) {
    InterfaceMapping map = @implemetation.GetInterfaceMap(@interface);
    Console.WriteLine($"{map.TargetType}: GetInterfaceMap({map.InterfaceType})");
    for (int counter = 0; counter  TestApp.CNormal::Normal (different)
       MethodHandle 0x7FF993916A80 --> MethodHandle 0x7FF993916B10
       FunctionPtr  0x7FF99385FC50 --> FunctionPtr  0x7FF993861880

TestApp.CDefault: GetInterfaceMap(TestApp.IDefaultMethod)
   TestApp.IDefaultMethod::Default --> TestApp.IDefaultMethod::Default (same)
       MethodHandle 0x7FF993916BD8 --> MethodHandle 0x7FF993916BD8
       FunctionPtr  0x7FF99385FC78 --> FunctionPtr  0x7FF99385FC78

TestApp.CDefaultOwnImpl: GetInterfaceMap(TestApp.IDefaultMethod)
   TestApp.IDefaultMethod::Default --> TestApp.CDefaultOwnImpl::TestApp.IDefaultMethod.Default (different)
       MethodHandle 0x7FF993916BD8 --> MethodHandle 0x7FF993916D10
       FunctionPtr  0x7FF99385FC78 --> FunctionPtr  0x7FF9938663A0

So here we can see that in the case of IDefaultMethod interface on the CDefault class the interface and method implementations are the same. As you can see, in the other scenarios the interface method maps to a different method implementation.

But lets look at bit lower, making use of WinDBG and the SOS extension to get a peek into the internal ‘data structures’ that the runtime uses.

First, lets take a look at the MethodTable (dumpmt) for the INormal interface:

> dumpmt -md 00007ff8bcc31dd8
EEClass:         00007FF8BCC2C420
Module:          00007FF8BCC0F788
Name:            TestApp.INormal
mdToken:         0000000002000002
File:            C:\DefaultInterfaceMethods\TestApp\bin\Debug\netcoreapp3.0\TestApp.dll
BaseSize:        0x0
ComponentSize:   0x0
Slots in VTable: 1
Number of IFaces in IFaceMap: 0
--------------------------------------
MethodDesc Table
           Entry       MethodDesc    JIT Name
00007FF8BCB70580 00007FF8BCC31DC8   NONE TestApp.INormal.Normal()

So we can see that the interface has an entry for the Normal() method, as expected, but lets look in more detail at the MethodDesc (dumpmd):

> dumpmd 00007FF8BCC31DC8                                    
Method Name:          TestApp.INormal.Normal()               
Class:                00007ff8bcc2c420                       
MethodTable:          00007ff8bcc31dd8                       
mdToken:              0000000006000001                       
Module:               00007ff8bcc0f788                       
IsJitted:             no                                     
Current CodeAddr:     ffffffffffffffff                       
Version History:                                             
  ILCodeVersion:      0000000000000000                       
  ReJIT ID:           0                                      
  IL Addr:            0000000000000000                       
     CodeAddr:           0000000000000000  (MinOptJitted)    
     NativeCodeVersion:  0000000000000000 

So whilst the method exists in the interface definition, it’s clear that the method has not been jitted (IsJitted: no) and in fact it never will, as it can never be executed.

Now lets compare that output with the one for the IDefaultMethod interface, again the MethodTable (dumpmt) and the MethodDesc (dumpmd):

> dumpmt -md 00007ff8bcc31e68
EEClass:         00007FF8BCC2C498
Module:          00007FF8BCC0F788
Name:            TestApp.IDefaultMethod
mdToken:         0000000002000003
File:            C:\DefaultInterfaceMethods\TestApp\bin\Debug\netcoreapp3.0\TestApp.dll
BaseSize:        0x0
ComponentSize:   0x0
Slots in VTable: 1
Number of IFaces in IFaceMap: 0
--------------------------------------
MethodDesc Table
           Entry       MethodDesc    JIT Name
00007FF8BCB70590 00007FF8BCC31E58    JIT TestApp.IDefaultMethod.Default()

> dumpmd 00007FF8BCC31E58
Method Name:          TestApp.IDefaultMethod.Default()
Class:                00007ff8bcc2c498
MethodTable:          00007ff8bcc31e68
mdToken:              0000000006000002
Module:               00007ff8bcc0f788
IsJitted:             yes
Current CodeAddr:     00007ff8bcb765c0
Version History:
  ILCodeVersion:      0000000000000000
  ReJIT ID:           0
  IL Addr:            0000000000000000
     CodeAddr:           00007ff8bcb765c0  (MinOptJitted)
     NativeCodeVersion:  0000000000000000

Here we see something very different, the MethodDesc entry in the MethodTable actually has jitted, executable code associated with it.

Enabling Methods on an Interface

So we’ve seen that ‘default interface methods’ are wired up by the runtime, but how does that happen?

Firstly, it’s very illuminating to look at the initial prototype of the feature in CoreCLR PR #10505, because we can understand at the lowest level what the feature is actually enabling, from /src/vm/classcompat.cpp:

Here we see why DIM didn’t require any changes to the .NET ‘Intermediate Language’ (IL) op-codes, instead they are enabled by relaxing a previous restriction. Before this change, you weren’t able to add ‘virtual, non-abstract’ or ‘non-virtual’ methods to an interface:

  • “Virtual Non-Abstract Interface Method.” (BFA_VIRTUAL_NONAB_INT_METHOD)
  • “Nonvirtual Instance Interface Method.” (BFA_NONVIRT_INST_INT_METHOD)

This ties in with the proposed changes to the ECMA-335 specification, from the ‘Default interface methods’ design doc:

The major changes are:

  • Interfaces are now allowed to have instance methods (both virtual and non-virtual). Previously we only allowed abstract virtual methods.
    • Interfaces obviously still can’t have instance fields.
  • Interface methods are allowed to MethodImpl other interface methods the interface requires (but we require the MethodImpls to be final to keep things simple) - i.e. an interface is allowed to provide (or override) an implementation of another interface’s method

However, just allowing ‘virtual, non-abstract’ or ‘non-virtual’ methods to exist on an interface is only the start, the runtime then needs to allow code to call those methods and that is far harder!

Resolving the Method Dispatch

In .NET, since version 2.0, all interface methods calls have taken place via a mechanism known as Virtual Stub Dispatch:

Virtual stub dispatching (VSD) is the technique of using stubs for virtual method invocations instead of the traditional virtual method table. In the past, interface dispatch required that interfaces had process-unique identifiers, and that every loaded interface was added to a global interface virtual table map. This requirement meant that all interfaces and all classes that implemented interfaces had to be restored at runtime in NGEN scenarios, causing significant startup working set increases. The motivation for stub dispatching was to eliminate much of the related working set, as well as distribute the remaining work throughout the lifetime of the process.

Although it is possible for VSD to dispatch both virtual instance and interface method calls, it is currently used only for interface dispatch.

For more information I recommend reading the section on C#’s slotmaps in the excellent article on ‘Interface Dispatch’ by Lukas Atkinson.

So, to make DIM work, the runtime has to wire up any ‘default methods’, so that they integrate with the ‘virtual stub dispatch’ mechanism. We can see this in action by looking at the call stack from the hand-crafted assembly stub (ResolveWorkerAsmStub) all the way down to FindDefaultInterfaceImplementation(..) which finds the correct method, given an interface (pInterfaceMD) and the default method to call (pInterfaceMT):

- coreclr.dll!MethodTable::FindDefaultInterfaceImplementation(MethodDesc *pInterfaceMD, MethodTable *pInterfaceMT, MethodDesc **ppDefaultMethod, int allowVariance, int throwOnConflict) Line 6985	C++
- coreclr.dll!MethodTable::FindDispatchImpl(unsigned int typeID, unsigned int slotNumber, DispatchSlot *pImplSlot, int throwOnConflict) Line 6851	C++
- coreclr.dll!MethodTable::FindDispatchSlot(unsigned int typeID, unsigned int slotNumber, int throwOnConflict) Line 7251	C++
- coreclr.dll!VirtualCallStubManager::Resolver(MethodTable *pMT, DispatchToken token, OBJECTREF *protectedObj, unsigned __int64 *ppTarget, int throwOnConflict) Line 2208	C++
- coreclr.dll!VirtualCallStubManager::ResolveWorker(StubCallSite *pCallSite, OBJECTREF *protectedObj, DispatchToken token, VirtualCallStubManager::StubKind stubKind) Line 1874	C++
- coreclr.dll!VSD_ResolveWorker(TransitionBlock *pTransitionBlock, unsigned __int64 siteAddrForRegisterIndirect, unsigned __int64 token, unsigned __int64 flags) Line 1683	C++
- coreclr.dll!ResolveWorkerAsmStub() Line 42	Unknown

If you want to explore the call-stack in more detail, you can follow the links below:

  • ResolveWorkerAsmStub here
  • VSD_ResolveWorker(..) here
  • VirtualCallStubManager::ResolveWorker(..) here
  • VirtualCallStubManager::Resolver(..)here
  • MethodTable::FindDispatchSlot(..) here [MethodTable::FindDispatchImpl(..) here or here
  • Finally ending up in MethodTable::FindDefaultInterfaceImplementation(..) here

Analysis of FindDefaultInterfaceImplementation(..)

So the code in FindDefaultInterfaceImplementation(..) is at the heart of the feature, but what does it need to do and how does it do it? This list from Finalize override lookup algorithm #12753 gives us some idea of the complexity:

  • properly detect diamond shape positive case (where I4 overrides both I2/I3 which both overrides I1) by keep tracking of a current list of best candidates. I went for the simplest algorithm and didn’t build any complex graph / DFS since the majority case the list of interfaces would be small, and interface dispatch cache would ensure majority of cases we don’t need to redo the (slow) dispatch. If needed we can revisit this to make it a proper topological sort.
  • VerifyVirtualMethodsImplemented now properly validates default interface scenarios - it is happy if there is at least one implementation and early returns. It doesn’t worry about conflicting overrides, for performance reasons.
  • NotSupportedException thrown in conflicting override scenario now has a proper error message
  • properly supports GVM when detecting method impl overrides
  • Revisited code that adds method impl for interfaces. added proper methodimpl validation and ensure methodimpl are virtual and final (and throw exception if it is not final)
  • Added test scenario with method that has multiple method impl. found and fixed a bug where the slot array is not big enough when building method impls for interfaces.

In addition, the ‘two-pass’ algorithm was implemented in Implement two pass algorithm for variant interface dispatch #21355, which contains an interesting discussion of the edge-cases that need to be handled.

So onto the code, this is the high-level view of the algorithm:

Diamond Inheritance Problem

Finally, the ‘diamond inheritance problem’ was mentioned in a few of the PRs/Issues related to the feature, but what is it?

A good place to starts is one of the test cases, diamondshape.cs. However there’s a more concise example in the C#8 Language Proposal:

interface IA
{
    void M();
}
interface IB : IA
{
    override void M() { WriteLine("IB"); }
}
class Base : IA
{
    void IA.M() { WriteLine("Base"); }
}
class Derived : Base, IB // allowed?
{
    static void Main()
    {
        Ia a = new Derived();
        a.M();           // what does it do?
    }
}

So the issue is which of the matching interface methods should be used, in this case IB.M() or Base.IA.M()? The resolution, as outlined in the C#8 language proposal was to use the most specific override:

Closed Issue: Confirm the draft spec, above, for most specific override as it applies to mixed classes and interfaces (a class takes priority over an interface). See https://github.com/dotnet/csharplang/blob/master/meetings/2017/LDM-2017-04-19.md#diamonds-with-classes.

Which ties in with the ‘more-specific’ and ‘less-specific’ steps we saw in the outline of FindDefaultInterfaceImplementation above.

Summary

So there you have it, an entire feature delivered end-to-end, yay for .NET (Core) being open source! Thanks to the runtime engineers for making their Issues and PRs easy to follow and for adding such great comments to their code! Also kudos to the language designers for making their proposals and meeting notes available for all to see (e.g. LDM-2017-04-19).

Whether you think they are useful or not, it’s hard to argue that ‘Default Interface Methods’ aren’t well designed and well implemented.

But what makes it even more unique feature is that it required the compiler and runtime teams working together to make it possible!


Wed, 19 Feb 2020, 12:00 am


Research based on the .NET Runtime

Over the last few years, I’ve come across more and more research papers based, in some way, on the ‘Common Language Runtime’ (CLR).

So armed with Google Scholar and ably assisted by Semantic Scholar, I put together the list below.

Note: I put the papers into the following categories to make them easier to navigate (papers in each category are sorted by date, newest -> oldest):

  • Using the .NET Runtime as a case-study
    • to prove its correctness, study how it works or analyse its behaviour
  • Research carried out by Microsoft Research, the research subsidiary of Microsoft.
    • It was formed in 1991, with the intent to advance state-of-the-art computing and solve difficult world problems through technological innovation in collaboration with academic, government, and industry researchers” (according to Wikipedia)
  • Papers based on the Mono Runtime
    • a ‘Cross-Platform, open-source .NET framework
  • Using ‘Rotor’, real name ‘Shared Source CLI (SSCLI)’
    • from WikipediaMicrosoft provides the Shared Source CLI as a reference CLI implementation suitable for educational use

Any papers I’ve missed? If so, please let me know in the comments or on Twitter

.NET Runtime as a Case-Study

Pitfalls of C# Generics and Their Solution Using Concepts (Belyakova & Mikhalkovich, 2015)

Abstract

In comparison with Haskell type classes and C ++ concepts, such object-oriented languages as C# and Java provide much limited mechanisms of generic programming based on F-bounded polymorphism. Main pitfalls of C# generics are considered in this paper. Extending C# language with concepts which can be simultaneously used with interfaces is proposed to solve the problems of generics; a design and translation of concepts are outlined.

Efficient Compilation of .NET Programs for Embedded Systems (Sallenaveab & Ducournaub, 2011)

Abstract

Compiling under the closed-world assumption (CWA) has been shown to be an appropriate way for implementing object-oriented languages such as Java on low-end embedded systems. In this paper, we explore the implications of using whole program optimizations such as Rapid Type Analysis (RTA) and coloring on programs targeting the .NET infrastructure. We extended RTA so that it takes into account .NET specific features such as (i) array covariance, a language feature also supported in Java, (ii) generics, whose specifications in .Net impacts type analysis and (iii) delegates, which encapsulate methods within objects. We also use an intraprocedural control flow analysis in addition to RTA . We eval-uated the optimizations that we implemented on programs written in C#. Preliminary results show a noticeable reduction of the code size, class hierarchy and polymorphism of the programs we optimize. Array covariance is safe in almost all cases, and some delegate calls can be implemented as direct calls.

Type safety of C# and .Net CLR (Fruja, 2007)

Abstract

Type safety plays a crucial role in the security enforcement of any typed programming language. This thesis presents a formal proof of C#’s type safety. For this purpose, we develop an abstract framework for C#, comprising formal specifications of the language’s grammar, of the statically correct programs, and of the static and operational semantics. Using this framework, we prove that C# is type-safe, by showing that the execution of statically correct C# programs does not lead to type errors.

Modeling the .NET CLR Exception Handling Mechanism for a Mathematical Analysis (Fruja & Börger, 2006)

Abstract

This work is part of a larger project which aims at establishing some important properties of C# and CLR by mathematical proofs. Examples are the correctness of the bytecode verifier of CLR, the type safety (along the lines of the first author’s correctness proof for the definite assignment rules) of C#, the correctness of a general compilation scheme.

Analysis of the .NET CLR Exception Handling Mechanism (Fruja & Börger, 2005)

Abstract

We provide a complete mathematical model for the exception handling mechanism of the Common Language Runtime (CLR), the virtual machine underlying the interpretation of .NET programs. The goal is to use this rigorous model in the corresponding part of the still-to-be-developed soundness proof for the CLR bytecode verifier.

A Modular Design for the Common Language Runtime (CLR) Architecture (Fruja, 2005)

Abstract

This paper provides a modular high-level design of the Common Language Runtime (CLR) architecture. Our design is given in terms of Abstract State Machines (ASMs) and takes the form of an interpreter. We describe the CLR as a hierarchy of eight submachines, which correspond to eight submodules into which the Common Intermediate Language (CIL) instruction set can be decomposed.

Cross-language Program Slicing in the .NET Framework (Pócza, Biczó & Porkoláb, 2005)

Abstract

Dynamic program slicing methods are very attractive for debugging because many statements can be ignored in the process of localizing a bug. Although language interoperability is a key concept in modern development platforms, current slicing techniques are still restricted to a single language. In this paper a cross-language dynamic program slicing technique is introduced for the .NET environment. The method is utilizing the CLR Debugging Services API, hence it can be applied to large multi-language applications.

Design and Implementation of a high-level multi-language . NET Debugger (Strein, 2005)

Abstract

The Microsoft .NET Common Language Runtime (CLR) provides a low-level debugging application programmers interface (API), which can be used to implement traditional source code debuggers but can also be useful to implement other dynamic program introspection tools. This paper describes our experience in using this API for the implementation of a high-level debugger. The API is difficult to use from a technical point of view because it is implemented as a set of Component Object Model (COM) interfaces instead of a managed .NET API. Nevertheless, it is possible to implement a debugger in managed C# code using COM-interop. We describe our experience in taking this approach. We define a high-level debugging API and implement it in the C# language using COM-interop to access the low-level debugging API. Furthermore, we describe the integration of this high-level API in the multi-language development environment X-develop to enable source code debugging of .NET languages. This paper can be useful for anybody who wants to take the same approach to implement debuggers or other tools for dynamic program introspection.

A High-Level Modular Definition of the Semantics of C# (Börger, Fruja, Gervasi & Stärk, 2004)

Abstract

We propose a structured mathematical definition of the semantics of programs to provide a platform-independent interpreter view of the language for the programmer, which can also be used for a precise analysis of the ECMA standard of the language and as a reference model for teaching. The definition takes care to reflect directly and faithfully—as much as possible without becoming inconsistent or incomplete—the descriptions in the standard to become comparable with the corresponding models for Java in Stärk et al. (Java and Java Virtual Machine—Definition, Verification, Validation, Springer, Berlin, 2001) and to provide for implementors the possibility to check their basic design decisions against an accurate high-level model. The model sheds light on some of the dark corners of and on some critical differences between the ECMA standard and the implementations of the language.

An ASM Specification of C# Threads and the .NET Memory Model (Stärk and Börger, 2004)

Abstract

We present a high-level ASM model of C# threads and the .NET memory model. We focus on purely managed, fully portable threading features of C#. The sequential model interleaves the computation steps of the currently running threads and is suitable for uniprocessors. The parallel model addresses problems of true concurrency on multiprocessor systems. The models provide a sound basis for the development of multi-threaded applications in C#. The thread and memory models complete the abstract operational semantics of C# in.

Common Language Runtime : a new virtual machine (Ferreira, 2004)

Abstract

Virtual Machines provide a runtime execution platform combining bytecode portability with a performance close to native code. An overview of current approaches precedes an insight into Microsoft CLR (Common Language Runtime), comparing it to Sun JVM (Java Virtual Machine) and to a native execution environment (IA 32). A reference is also made to CLR in a Unix platform and to techniques on how CLR improves code execution.

JVM versus CLR: a comparative study (Singer, 2003)

Abstract

We present empirical evidence to demonstrate that there is little or no difference between the Java Virtual Machine and the .NET Common Language Runtime, as regards the compilation and execution of object-oriented programs. Then we give details of a case study that proves the superiority of the Common Language Runtime as a target for imperative programming language compilers (in particular GCC).

Runtime Code Generation with JVM And CLR (Sestoft, 2002)

Abstract

Modern bytecode execution environments with optimizing just-in-time compilers, such as Sun’s Hotspot Java Virtual Machine, IBM’s Java Virtual Machine, and Microsoft’s Common Language Runtime, provide an infrastructure for generating fast code at runtime. Such runtime code generation can be used for efficient implementation of parametrized algorithms. More generally, with runtime code generation one can introduce an additional binding-time without performance loss. This permits improved performance and improved static correctness guarantees.

Microsoft Research

Project Snowflake: Non-blocking safe manual memory management in .NET (Parkinson, Vaswani, Costa, Deligiannis, Blankstein, McDermott, Balkind & Vytiniotis, 2017)

Abstract

Garbage collection greatly improves programmer productivity and ensures memory safety. Manual memory management on the other hand often delivers better performance but is typically unsafe and can lead to system crashes or security vulnerabilities. We propose integrating safe manual memory management with garbage collection in the .NET runtime to get the best of both worlds. In our design, programmers can choose between allocating objects in the garbage collected heap or the manual heap. All existing applications run unmodified, and without any performance degradation, using the garbage collected heap. Our programming model for manual memory management is flexible: although objects in the manual heap can have a single owning pointer, we allow deallocation at any program point and concurrent sharing of these objects amongst all the threads in the program. Experimental results from our .NET CoreCLR implementation on real-world applications show substantial performance gains especially in multithreaded scenarios: up to 3x savings in peak working sets and 2x improvements in runtime.

Simple, Fast and Safe Manual Memory Management (Kedia, Costa, Vytiniotis, Parkinson, Vaswani & Blankstein, 2017)

Abstract

Safe programming languages are readily available, but many applications continue to be written in unsafe languages, because the latter are more efficient. As a consequence, many applications continue to have exploitable memory safety bugs. Since garbage collection is a major source of inefficiency in the implementation of safe languages, replacing it with safe manual memory management would be an important step towards solving this problem.

Previous approaches to safe manual memory management use programming models based on regions, unique pointers, borrowing of references, and ownership types. We propose a much simpler programming model that does not require any of these concepts. Starting from the design of an imperative type safe language (like Java or C#), we just add a delete operator to free memory explicitly and an exception which is thrown if the program dereferences a pointer to freed memory. We propose an efficient implementation of this programming model that guarantees type safety. Experimental results from our implementation based on the C# native compiler show that this design achieves up to 3x reduction in peak working set and run time.

Uniqueness and Reference Immutability for Safe Parallelism (Gordon, Parkinson, Parsons, Bromfield & Duffy, 2012)

Abstract

A key challenge for concurrent programming is that side-effects (memory operations) in one thread can affect the behavior of another thread. In this paper, we present a type system to restrict the updates to memory to prevent these unintended side-effects. We provide a novel combination of immutable and unique (isolated) types that ensures safe parallelism (race freedom and deterministic execution). The type system includes support for polymorphism over type qualifiers, and can easily create cycles of immutable objects. Key to the system’s flexibility is the ability to recover immutable or externally unique references after violating uniqueness without any explicit alias tracking. Our type system models a prototype extension to C# that is in active use by a Microsoft team. We describe their experiences building large systems with this extension. We prove the soundness of the type system by an embedding into a program logic.

A study of concurrent real-time garbage collectors (Pizlo, Petrank & Steensgaard, 2008)

Abstract

Concurrent garbage collection is highly attractive for real-time systems, because offloading the collection effort from the executing threads allows faster response, allowing for extremely short deadlines at the microseconds level. Concurrent collectors also offer much better scalability over incremental collectors. The main problem with concurrent real-time collectors is their complexity. The first concurrent real-time garbage collector that can support fine synchronization, STOPLESS, has recently been presented by Pizlo et al. In this paper, we propose two additional (and different) algorithms for concurrent real-time garbage collection: CLOVER and CHICKEN. Both collectors obtain reduced complexity over the first collector STOPLESS, but need to trade a benefit for it. We study the algorithmic strengths and weaknesses of CLOVER and CHICKEN and compare them to STOPLESS. Finally, we have implemented all three collectors on the Bartok compiler and runtime for C# and we present measurements to compare their efficiency and responsiveness.

Optimizing concurrency levels in the. net threadpool: A case study of controller design and implementation (Hellerstein, Morrison & Eilebrecht, 2008)

Abstract

This paper presents a case study of developing a hill climb-ing concurrency controller (HC 3) for the .NET ThreadPool. The intent of the case study is to provide insight into soft-ware considerations for controller design, testing, and imple-mentation. The case study is structured as a series of issues encountered and approaches taken to their resolution. Ex-amples of issues and approaches include: (a) addressing the need to combine a hill climbing control law with rule-based techniques by the use of hybrid control; (b) increasing the ef-ficiency and reducing the variability of the test environment by using resource emulation; and (c) effectively assessing design choices by using test scenarios for which the optimal concurrency level can be computed analytically and hence desired test results are known a priori. We believe that these issues and approaches have broad application to controllers for resource management of software systems.

Stopless: a real-time garbage collector for multiprocessors. (Pizlo, Frampton, Petrank & Steensgaard, 2007)

Abstract

We present STOPLESS: a concurrent real-time garbage collector suitable for modern multiprocessors running parallel multithreaded applications. Creating a garbage-collected environment that sup- ports real-time on modern platforms is notoriously hard, especially if real-time implies lock-freedom. Known real-time collectors ei- ther restrict the real-time guarantees to uniprocessors only, rely on special hardware, or just give up supporting atomic operations (which are crucial for lock-free software). STOPLESS is the first collector that provides real-time responsiveness while preserving lock-freedom, supporting atomic operations, controlling fragmen- tation by compaction, and supporting modern parallel platforms. STOPLESS is adequate for modern languages such as C# or Java. It was implemented on top of the Bartok compiler and runtime for C# and measurements demonstrate high responsiveness (a factor of a 100 better than previously published systems), virtually no pause times, good mutator utilization, and acceptable overheads.

Securing the .NET Programming Model (Kennedy, 2006)

Abstract

The security of the .NET programming model is studied from the standpoint of fully abstract compilation of C#. A number of failures of full abstraction are identified, and fixes described. The most serious problems have recently been fixed for version 2.0 of the .NET Common Language Runtime.

Combining Generics, Pre-compilation and Sharing Between Software-Based Processes (Syme & Kennedy, 2004)

Abstract

We describe problems that have arisen when combining the proposed design for generics for the Microsoft .NET Common Language Runtime (CLR) with two resource-related features supported by the Microsoft CLR implementation: application domains and pre-compilation. Application domains are “software based processes” and the interaction between application domains and generics stems from the fact that code and descriptors are generated on a pergeneric-instantiation basis, and thus instantiations consume resources which are preferably both shareable and recoverable. Pre-compilation runs at install-time to reduce startup overheads. This interacts with application domain unloading: compilation units may contain shareable generated instantiations. The paper describes these interactions and the different approaches that can be used to avoid or ameliorate the problems.

Formalization of Generics for the .NET Common Language Runtime (Yu, Kennedy & Syme, 2004)

Abstract

We present a formalization of the implementation of generics in the .NET Common Language Runtime (CLR), focusing on two novel aspects of the implementation: mixed specialization and sharing, and efficient support for run-time types. Some crucial constructs used in the implementation are dictionaries and run-time type representations. We formalize these aspects type-theoretically in a way that corresponds in spirit to the implementation techniques used in practice. Both the techniques and the formalization also help us understand the range of possible implementation techniques for other languages, e.g., ML, especially when additional source language constructs such as run-time types are supported. A useful by-product of this study is a type system for a subset of the polymorphic IL proposed for the .NET CLR.

Runtime Verification of .NET Contracts (Barnett & Schulte, 2003)

Abstract

We propose a method for implementing behavioral interface specifications on the .NET platform. Our interface specifications are expressed as executable model programs. Model programs can be run either as stand-alone simulations or used as contracts to check the conformance of an implementation class to its specification. We focus on the latter, which we call runtime verification.In our framework, model programs are expressed in the new specification language AsmL. We describe how AsmL can be used to describe contracts independently from any implementation language, how AsmL allows properties of component interaction to be specified using mandatory calls, and how AsmL is used to check the behavior of a component written in any of the .NET languages, such as VB, C#, or C++.

Design and Implementation of Generics for the .NET Common Language Runtime (Kennedy & Syme, 2001)

Abstract

The Microsoft .NET Common Language Runtime provides a shared type system, intermediate language and dynamic execution environment for the implementation and inter-operation of multiple source languages. In this paper we extend it with direct support for parametric polymorphism (also known as generics), describing the design through examples written in an extended version of the C# programming language, and explaining aspects of implementation by reference to a prototype extension to the runtime. Our design is very expressive, supporting parameterized types, polymorphic static, instance and virtual methods, “F-bounded” type parameters, instantiation at pointer and value types, polymorphic recursion, and exact run-time types. The implementation takes advantage of the dynamic nature of the runtime, performing justin-time type specialization, representation-based code sharing and novel techniques for efficient creation and use of run-time types. Early performance results are encouraging and suggest that programmers will not need to pay an overhead for using generics, achieving performance almost matching hand-specialized code.

Typing a Multi-Language Intermediate Code (Gordon & Syme, 2001)

Abstract

The Microsoft .NET Framework is a new computing architecture designed to support a variety of distributed applications and web-based services. .NET software components are typically distributed in an object-oriented intermediate language, Microsoft IL, executed by the Microsoft Common Language Runtime. To allow convenient multi-language working, IL supports a wide variety of high-level language constructs, including class-based objects, inheritance, garbage collection, and a security mechanism based on type safe execution. This paper precisely describes the type system for a substantial fragment of IL that includes several novel features: certain objects may be allocated either on the heap or on the stack; those on the stack may be boxed onto the heap, and those on the heap may be unboxed onto the stack; methods may receive arguments and return results via typed pointers, which can reference both the stack and the heap, including the interiors of objects on the heap. We present a formal semantics for the fragment. Our typing rules determine well-typed IL instruction sequences that can be assembled and executed. Of particular interest are rules to ensure no pointer into the stack outlives its target. Our main theorem asserts type safety, that well-typed programs in our IL fragment do not lead to untrapped execution errors. Our main theorem does not directly apply to the product. Still, the formal system of this paper is an abstraction of informal and executable specifications we wrote for the full product during its development. Our informal specification became the basis of the product team’s working specification of type-checking. The process of writing this specification, deploying the executable specification as a test oracle, and applying theorem proving techniques, helped us identify several security critical bugs during development.

Mono Runtime

Static and Dynamic Analysis of Android Malware and Goodware Written with Unity Framework (Shim, Lim, Cho, Han & Park, 2018)

Abstract

Unity is the most popular cross-platform development framework to develop games for multiple platforms such as Android, iOS, and Windows Mobile. While Unity developers can easily develop mobile apps for multiple platforms, adversaries can also easily build malicious apps based on the “write once, run anywhere” (WORA) feature. Even thoughmalicious apps were discovered among Android apps written with Unity framework (Unity apps), little research has been done on analysing the malicious apps. We propose static and dynamic reverse engineering techniques for malicious Unity apps. We first inspect the executable file format of a Unity app and present an effective static analysis technique of the Unity app. Then, we also propose a systematic technique to analyse dynamically the Unity app. Using the proposed techniques, the malware analyst can statically and dynamically analyse Java code, native code in C or C ++, and the Mono runtime layer where the C# code is running.

Reducing startup time of a deterministic virtualizing runtime environment (Däumler & Werner, 2013)

Abstract

Virtualized runtime environments like Java Virtual Machine (JVM) or Microsoft .NET’s Common Language Runtime (CLR) introduce additional challenges to real-time software development. Since applications for such environments are usually deployed in platform independent intermediate code, one issue is the timing of code transformation from intermediate code into native code. We have developed a solution for this problem, so that code transformation is suitable for real-time systems. It combines pre-compilation of intermediate code with the elimination of indirect references in native code. The gain of determinism comes with an increased application startup time. In this paper we present an optimization that utilizes an Ahead-of-Time compiler to reduce the startup time while keeping the real-time suitable timing behaviour. In an experiment we compare our approach with existing ones and demonstrate its benefits for certain application cases.

Detecting Clones Across Microsoft .NET Programming Languages (Al-Omari, Keivanloo, Roy & Rilling, 2012)

Abstract

The Microsoft .NET framework and its language family focus on multi-language development to support interoperability across several programming languages. The framework allows for the development of similar applications in different languages through the reuse of core libraries. As a result of such a multi-language development, the identification and trace ability of similar code fragments (clones) becomes a key challenge. In this paper, we present a clone detection approach for the .NET language family. The approach is based on the Common Intermediate Language, which is generated by the .NET compiler for the different languages within the .NET framework. In order to achieve an acceptable recall while maintaining the precision of our detection approach, we define a set of filtering processes to reduce noise in the raw data. We show that these filters are essential for Intermediate Language-based clone detection, without significantly affecting the precision of the detection approach. Finally, we study the quantitative and qualitative performance aspects of our clone detection approach. We evaluate the number of reported candidate clone-pairs, as well as the precision and recall (using manual validation) for several open source cross-language systems, to show the effectiveness of our proposed approach.

Language-independent sandboxing of just-in-time compilation and self-modifying code (Ansel & Marchenko, 2012)

Abstract

When dealing with dynamic, untrusted content, such as on the Web, software behavior must be sandboxed, typically through use of a language like JavaScript. However, even for such specially-designed languages, it is difficult to ensure the safety of highly-optimized, dynamic language runtimes which, for efficiency, rely on advanced techniques such as Just-In-Time (JIT) compilation, large libraries of native-code support routines, and intricate mechanisms for multi-threading and garbage collection. Each new runtime provides a new potential attack surface and this security risk raises a barrier to the adoption of new languages for creating untrusted content. Removing this limitation, this paper introduces general mechanisms for safely and efficiently sandboxing software, such as dynamic language runtimes, that make use of advanced, low-level techniques like runtime code modification. Our language-independent sandboxing builds on Software-based Fault Isolation (SFI), a traditionally static technique. We provide a more flexible form of SFI by adding new constraints and mechanisms that allow safety to be guaranteed despite runtime code modifications. We have added our extensions to both the x86-32 and x86-64 variants of a production-quality, SFI-based sandboxing platform; on those two architectures SFI mechanisms face different challenges. We have also ported two representative language platforms to our extended sandbox: the Mono common language runtime and the V8 JavaScript engine. In detailed evaluations, we find that sandboxing slowdown varies between different benchmarks, languages, and hardware platforms. Overheads are generally moderate and they are close to zero for some important benchmark/platform combinations.

VMKit: a Substrate for Managed Runtime Environments (Geoffray, Thomas, Lawall, Muller & Folliot, 2010)

Abstract

Managed Runtime Environments (MREs), such as the JVM and the CLI, form an attractive environment for program execution, by providing portability and safety, via the use of a bytecode language and automatic memory management, as well as good performance, via just-in-time (JIT) compilation. Nevertheless, developing a fully featured MRE, including e.g. a garbage collector and JIT compiler, is a herculean task. As a result, new languages cannot easily take advantage of the benefits of MREs, and it is difficult to experiment with extensions of existing MRE based languages. This paper describes and evaluates VMKit, a first attempt to build a common substrate that eases the development of high-level MREs. We have successfully used VMKit to build two MREs: a Java Virtual Machine and a Common Language Runtime. We provide an extensive study of the lessons learned in developing this infrastructure, and assess the ease of implementing new MREs or MRE extensions and the resulting performance. In particular, it took one of the authors only one month to develop a Common Language Runtime using VMKit. VMKit furthermore has performance comparableto the well established open source MREs Cacao, Apache Harmony and Mono, and is 1.2 to 3 times slower than JikesRVM on most of the Dacapo benchmarks.

MMC: the Mono Model Checker (Ruys & Aan de Brugh, 2007)

Abstract

The Mono Model Checker (mmc) is a software model checker for cil bytecode programs. mmc has been developed on the Mono platform. mmc is able to detect deadlocks and assertion violations in cil programs. The design of mmc is inspired by the Java PathFinder (jpf), a model checker for Java programs. The performance of mmc is comparable to jpf. This paper introduces mmc and presents its main architectural characteristics.

Numeric performance in C, C# and Java (Sestoft, 2007)

Abstract

We compare the numeric performance of C, C# and Java on three small cases.

Mono versus .Net: A Comparative Study of Performance for Distributed Processing. (Blajian, Eggen, Eggen & Pitts, 2006)

Abstract

Microsoft has released .NET, a platform dependent standard for the C#,programming language. Sponsored by Ximian/Novell, Mono, the open source development platform based on the .NET framework, has been developed to be a platform independent version of the C#,programming environment. While .NET is platform dependent, Mono allows developers to build Linux and crossplatform applications. Mono’s .NET implementation is based on the ECMA standards for C#. This paper examines both of these programming environments with the goal of evaluating the performance characteristics of each. Testing is done with various algorithms. We also assess the trade-offs associated with using a cross-platform versus a platform.

Automated detection of performance regressions: the mono experience (Kalibera, Bulej & Tuma, 2005)

Abstract

Engineering a large software project involves tracking the impact of development and maintenance changes on the software performance. An approach for tracking the impact is regression benchmarking, which involves automated benchmarking and evaluation of performance at regular intervals. Regression benchmarking must tackle the nondeterminism inherent to contemporary computer systems and execution environments and the impact of the nondeterminism on the results. On the example of a fully automated regression benchmarking environment for the mono open-source project, we show how the problems associated with nondeterminism can be tackled using statistical methods.

Shared Source Common Language Infrastructure (SSCLI) - a.k.a ‘Rotor

Efficient virtual machine support of runtime structural reflection (Ortina, Redondoa & Perez-Schofield, 2009)

Abstract

Increasing trends towards adaptive, distributed, generative and pervasive software have made object-oriented dynamically typed languages become increasingly popular. These languages offer dynamic software evolution by means of reflection, facilitating the development of dynamic systems. Unfortunately, this dynamism commonly imposes a runtime performance penalty. In this paper, we describe how to extend a production JIT-compiler virtual machine to support runtime object-oriented structural reflection offered by many dynamic languages. Our approach improves runtime performance of dynamic languages running on statically typed virtual machines. At the same time, existing statically typed languages are still supported by the virtual machine.

We have extended the .Net platform with runtime structural reflection adding prototype-based object-oriented semantics to the statically typed class-based model of .Net, supporting both kinds of programming languages. The assessment of runtime performance and memory consumption has revealed that a direct support of structural reflection in a production JIT-based virtual machine designed for statically typed languages provides a significant performance improvement for dynamically typed languages.

Extending the SSCLI to Support Dynamic Inheritance (Redondo, Ortin & Perez-Schofield, 2008)

Abstract

This paper presents a step forward on a research trend focused on increasing runtime adaptability of commercial JIT-based virtual machines, describing how to include dynamic inheritance into this kind of platforms. A considerable amount of research aimed at improving runtime performance of virtual machines has converted them into the ideal support for developing different types of software products. Current virtual machines do not only provide benefits such as application interoperability, distribution and code portability, but they also offer a competitive runtime performance.

Since JIT compilation has played a very important role in improving runtime performance of virtual machines, we first extended a production JIT-based virtual machine to support efficient language-neutral structural reflective primitives of dynamically typed programming languages. This article presents the next step in our research work: supporting language-neutral dynamic inheritance for both statically and dynamically typed programming languages. Executing both kinds of programming languages over the same platform provides a direct interoperation between them.

Sampling profiler for Rotor as part of optimizing compilation system (Chilingarova & Safonov, 2006)

Abstract

This paper describes a low-overhead self-tuning sampling-based runtime profiler integrated into SSCLI virtual machine. Our profiler estimates how “hot” a method is and builds a call context graph based on managed stack samples analysis. The frequency of sampling is tuned dynamically at runtime, based on the information of how often the same activation record appears on top of the stack. The call graph is presented as a novel Call Context Map (CC-Map) structure that combines compact representation and accurate information about the context. It enables fast extraction of data helpful in making compilation decisions, as well as fast placing data into the map. Sampling mechanism is integrated with intrinsic Rotor mechanisms of thread preemption and stack walk. A separate system thread is responsible for organizing data in the CC-Map. This thread gathers and stores samples quickly queued by managed threads, thus decreasing the time they must hold up their user-scheduled job

To JIT or not to JIT: The effect of code-pitching on the performance of .NET framework (Anthony, Leung & Srisa-an, 2005)

Abstract

The.NET Compact Framework is designed to be a highperformance virtual machine for mobile and embedded devices that operate on Windows CE (version 4.1 and later). It achieves fast execution time by compiling methods dynamically instead of using interpretation. Once compiled, these methods are stored in a portion of the heap called code-cache and can be reused quickly to satisfy future method calls. While code-cache provides a high-level of reusability, it can also use a large amount of memory. As a result, the Compact Framework provides a “code pitching ” mechanism that can be used to discard the previously compiled methods as needed. In this paper, we study the effect of code pitching on the overall performance and memory utilization of.NET applications. We conduct our experiments using Microsoft’s Shared-Source Common Language Infrastructure (SSCLI). We profile the access behavior of the compiled methods. We also experiment with various code-cache configurations to perform pitching. We find that programs can operate efficiently with a small code-cache without incurring substantial recompilation and execution overheads.

Adding structural reflection to the SSCLI (Ortin, Redondo, Vinuesa & Lovelle, 2005)

Abstract

Although dynamic languages are becoming widely used due to the flexibility needs of specific software prod- ucts, their major drawback is their runtime performance. Compiling the source program to an abstract machine’s intermediate language is the current technique used to obtain the best performance results. This intermediate code is then executed by a virtual machine developed as an interpreter. Although JIT adaptive optimizing com- pilation is currently used to speed up Java and .net intermediate code execution, this practice has not been em- ployed successfully in the implementation of dynamically adaptive platforms yet. We present an approach to improve the runtime performance of a specific set of structural reflective primitives, extensively used in adaptive software development. Looking for a better performance, as well as interaction with other languages, we have employed the Microsoft Shared Source CLI platform, making use of its JIT compiler. The SSCLI computational model has been enhanced with semantics of the prototype-based object-oriented com- putational model. This model is much more suitable for reflective environments. The initial assessment of per- formance results reveals that augmenting the semantics of the SSCLI model, together with JIT generation of native code, produces better runtime performance than the existing implementations.

Static Analysis for Identifying and Allocating Clusters of Immortal Objects (Ravindar & Srikant, 2005)

Abstract

Long living objects lengthen the trace time which is a critical phase of the garbage collection process. However, it is possible to recognize object clusters i.e. groups of long living objects having approximately the same lifetime and treat them separately to reduce the load on the garbage collector and hence improve overall performance. Segregating objects this way leaves the heap for objects with shorter lifetimes and now a typical collection can nd more garbage than before. In this paper, we describe a compile time analysis strategy to identify object clusters in programs. The result of the compile time analysis is the set of allocation sites that contribute towards allocating objects belonging to such clusters. All such allocation sites are replaced by a new allocation method that allocates objects into the cluster area rather than the heap. This study was carried out for a concurrent collector which we developed for Rotor, Microsoft’s Shared Source Implementation of .NET. We analyze the performance of the program with combina- tions of the cluster and stack allocation optimizations. Our results show that the clustering optimization reduces the number of collections by 66.5% on average, even eliminating the need for collection in some programs. As a result, the total pause time reduces by 62.8% on average. Using both stack allocation and the cluster optimizations brings down the number of collections by 91.5% thereby improving the total pause time by 79.33%.

An Optimizing Just-InTime Compiler for Rotor (Trindade & Silva, 2005)

Abstract

The Shared Source CLI (SSCLI), also known as Rotor, is an implementation of the CLI released by Microsoft in source code. Rotor includes a single pass just-in-time compiler that generates non-optimized code for Intel IA-32 and IBM PowerPC processors. We extend Rotor with an optimizing justin-time compiler for IA-32. This compiler has three passes: control flow graph generation, data dependence graph generation and final code generation. Dominance relations in the control flow graph are used to detect natural loops. A number of optimizations are performed during the generation of the data dependence graph. During native code generation, the rich address modes of IA32 are used for instruction folding, reducing code size and usage of register names. Despite the overhead of three passes and optimizations, this compiler is only 1.4 to 1.9 times slower than the original SSCLI compiler and generates code that runs 6.4 to 10 times faster.

Software Interactions into the SSCLI platform (Charfi & Emsellem, 2004)

Abstract

By using an Interaction Specification Language (ISL), interactions between components can be expressed in a language independent way. At class level, interaction pattern specified in ISLrepresent model s of future interactions when applied on some component instances. The Interaction Server is in charge of managing the life cycle of interactions (interaction pattern registration and instantiation, destruction of interactions, merging). It acts as a central repository that keeps the global coherency of the adaptations realized on the component instances.The Interaction service allows creati ng interactions between heterogeneous components. Noah is an implementation of this Interaction Service. It can be thought as a dynamic aspect repository with a weaver that uses an aspect composition mechanism that insures commutable and associative adaptations. In this paper, we propose the implementation of the Interaction Service in the SSCLI. In contrast to other implementations such as Java where interaction management represents an additional layer, SSCLI enables us to integrate Interaction Management as in intrinsic part of the CLI runtime.

Experience Integrating a New Compiler and a New Garbage Collector Into Rotor (Anderson, Eng, Glew, Lewis, Menon & Stichnoth, 2004)

Abstract

Microsoft’s Rotor is a shared-source CLI implementation intended for use as a research platform. It is particularly attractive for research because of its complete implementation and extensive libraries, and because its modular design allows dierent implementations of certain components such as just-in-time compilers (JITs). Our group has independently developed our own high-performance JIT and garbage collector (GC) and wanted to take advantage of Rotor to experiment with these components in a CLI environment. In this paper, we describe our experience integrating these components into Rotor and evaluate the flexibility of Rotor’s design toward this goal. We found it easier to integrate our JIT than our GC because Rotor has a well-defined interface for the former but not the latter. However, our JIT integration still required significant changes to both Rotor and our JIT. For example, we modified Rotor to support multiple JITs. We also added support for a second JIT manager in Rotor, and implemented a new code manager compatible with our JIT. We had to change our JIT compiler to support Rotor’s calling conventions, helper functions, and exception model. Our GC integration was complicated by the many places in Rotor where components make assumptions about how its garbage collector is implemented, as well as Rotor’s lack of a well-defined GC interface. We also had to reconcile the dierent assumptions made by Rotor and our garbage collector about the layout of objects, virtual-method tables, and thread structures.


Fri, 25 Oct 2019, 12:00 am


"Stubs" in the .NET Runtime

As the saying goes:

“All problems in computer science can be solved by another level of indirection”

- David Wheeler

and it certainly seems like the ‘.NET Runtime’ Engineers took this advice to heart!

‘Stubs’, as they’re known in the runtime (sometimes ‘Thunks’), provide a level of indirection throughout the source code, there’s almost 500 mentions of them!

This post will explore what they are, how they work and why they’re needed.

Table of Contents

What are stubs?

In the context of the .NET Runtime, ‘stubs’ look something like this:

   Call-site                                         Callee
+--------------+           +---------+           +-------------+
|              |           |         |           |             |
|              +---------->+  Stub   + - - - - ->+             |
|              |           |         |           |             |
+--------------+           +---------+           +-------------+

So they sit between a method ‘call-site’ (i.e. code such as var result = Foo(..);) and the ‘callee’ (where the method itself is implemented, the native/assembly code) and I like to think of them as doing tidy-up or fix-up work. Note that moving from the ‘stub’ to the ‘callee’ isn’t another full method call (hence the dotted line), it’s often just a single jmp or call assembly instruction, so the 2nd transition doesn’t involve all the same work that was initially done at the call-site (pushing/popping arguments into registers, increasing the stack space, etc).

The stubs themselves can be as simple as just a few assembly instructions or something more complicated, we’ll look at individual examples later on in this post.

Now, to be clear, not all method calls require a stub, if you’re doing a regular call to an static or instance method that just goes directly from the ‘call-site’ to the ‘callee’. But once you involve virtual methods, delegates or generics things get a bit more complicated.

Why are stubs needed?

There are several reasons that stubs need to be created by the runtime:

  • Required Functionality
    • For instance Delegates and Arrays must be provided but the runtime, their method bodies are not generated by the C#/F#/VB.NET compiler and neither do they exist in the Base-Class Libraries. This requirement is outlined in the ECMA 355 Spec, for instance ‘Partition I’ in section ‘8.9.1 Array types’ says:

      Exact array types are created automatically by the VES when they are required. Hence, the operations on an array type are defined by the CTS. These generally are: allocating the array based on size and lower-bound information, indexing the array to read and write a value, computing the address of an element of the array (a managed pointer), and querying for the rank, bounds, and the total number of values stored in the array.

      Likewise for delegates, which are covered in ‘I.8.9.3 Delegates’:

      While, for the most part, delegates appear to be simply another kind of user-defined class, they are tightly controlled. The implementations of the methods are provided by the VES, not user code. The only additional members that can be defined on delegate types are static or instance methods.

  • Performance
  • Consistent method calls
    • A final factor is that having ‘stubs’ makes the work of the JIT compiler easier. As we will see in the rest of the post, stubs deal with a variety of different types of method calls. This means the the JIT can generate more straightforward code for any given ‘call site’, because it (mostly) doesn’t care whats happening in the ‘callee’. If stubs didn’t exist, for a given method call the JIT would have to generate different code depending on whether generics where involved or not, if it was a virtual or non-virtual call, if it was going via a delegate, etc. Stubs abstact a lot of this behaviour away from the JIT, allowing it to deal with a more simple ‘Application Binary Interface’ (ABI).

CLR ‘Application Binary Interface’ (ABI)

Therefore, another way to think about ‘stubs’ is that they are part of what makes the CLR-specific ‘Application Binary Interface’ (ABI) work.

All code needs to work with the ABI or ‘calling convention’ of the CPU/OS that it’s running on, for instance by following the x86 calling convention, x64 calling convention or System V ABI. This applies across runtimes, for more on this see:

As an aside, if you want more information about ‘calling conventions’ here’s some links that I found useful:

However, on-top of what the CLR has to support due to the CPU/OS conventions, it also has it’s own extended ABI for .NET-specific use cases, including:

  • “this” pointer:

    The managed “this” pointer is treated like a new kind of argument not covered by the native ABI, so we chose to always pass it as the first argument in (AMD64) RCX or (ARM, ARM64) R0. AMD64-only: Up to .NET Framework 4.5, the managed “this” pointer was treated just like the native “this” pointer (meaning it was the second argument when the call used a return buffer and was passed in RDX instead of RCX). Starting with .NET Framework 4.5, it is always the first argument.

  • Generics or more specifically to handle ‘Shared generics’:

    In cases where the code address does not uniquely identify a generic instantiation of a method, then a ‘generic instantiation parameter’ is required. Often the “this” pointer can serve dual-purpose as the instantiation parameter. When the “this” pointer is not the generic parameter, the generic parameter is passed as an additional argument..

  • Hidden Parameters, covering ‘Stub dispatch’, ‘Fast Pinvoke’, ‘Calli Pinvoke’ and ‘Normal PInvoke’. For instance, here’s why ‘PInvoke’ has a hidden parameter:

    Normal PInvoke - The VM shares IL stubs based on signatures, but wants the right method to show up in call stack and exceptions, so the MethodDesc for the exact PInvoke is passed in the (x86) EAX / (AMD64) R10 / (ARM, ARM64) R12 (in the JIT: REG_SECRET_STUB_PARAM). Then in the IL stub, when the JIT gets CORJIT_FLG_PUBLISH_SECRET_PARAM, it must move the register into a compiler temp.

Not all of these scenarios need a stub, for instance the ‘this’ pointer is handled directly by the JIT, but many do as we’ll see in the rest of the post.

Stub Management

So we’ve seen why stubs are needed and what type of functionality they can provide. But before we look at all the specific examples that exist in the CoreCLR source, I just wanted to take some time to understand the common or shared concerns that apply to all stubs.

Stubs in the CLR are snippets of assembly code, but they have to be stored in memory and have their life-time managed. Also, they have to play nice with the debugger, from What Every CLR Developer Must Know Before Writing Code:

2.8 Is your code compatible with managed debugging?

  • ..
  • If you add a new stub (or way to call managed code), make sure that you can source-level step-in (F11) it under the debugger. The debugger is not psychic. A source-level step-in needs to be able to go from the source-line before a call to the source-line after the call, or managed code developers will be very confused. If you make that call transition be a giant 500 line stub, you must cooperate with the debugger for it to know how to step-through it. (This is what StubManagers are all about. See src\vm\stubmgr.h). Try doing a step-in through your new codepath under the debugger.

So every type of stub has a StubManager which deals with the allocation, storage and lookup of the stubs. The lookup is significant, as it provides the mapping from an arbitrary memory address to the type of stub (if any) that created the code. As an example, here’s what the CheckIsStub_Internal(..) method here and DoTraceStub(..) method here look like for the DelegateInvokeStubManager:

BOOL DelegateInvokeStubManager::CheckIsStub_Internal(PCODE stubStartAddress)
{
    LIMITED_METHOD_DAC_CONTRACT;

    bool fIsStub = false;

#ifndef DACCESS_COMPILE
#ifndef _TARGET_X86_
    fIsStub = fIsStub || (stubStartAddress == GetEEFuncEntryPoint(SinglecastDelegateInvokeStub));
#endif
#endif // !DACCESS_COMPILE

    fIsStub = fIsStub || GetRangeList()->IsInRange(stubStartAddress);

    return fIsStub;
}

BOOL DelegateInvokeStubManager::DoTraceStub(PCODE stubStartAddress, TraceDestination *trace)
{
    LIMITED_METHOD_CONTRACT;

    LOG((LF_CORDB, LL_EVERYTHING, "DelegateInvokeStubManager::DoTraceStub called\n"));

    _ASSERTE(CheckIsStub_Internal(stubStartAddress));

    // If it's a MC delegate, then we want to set a BP & do a context-ful
    // manager push, so that we can figure out if this call will be to a
    // single multicast delegate or a multi multicast delegate
    trace->InitForManagerPush(stubStartAddress, this);

    LOG_TRACE_DESTINATION(trace, stubStartAddress, "DelegateInvokeStubManager::DoTraceStub");

    return TRUE;
}

The code to initialise the various stub managers is here in SystemDomain::Attach() and by working through the list we can get a sense of what each category of stub does (plus the informative comments in the code help!)

  • PrecodeStubManager implemented here
    • Stub manager functions & globals
  • DelegateInvokeStubManager implemented here
    • Since we don’t generate delegate invoke stubs at runtime on IA64, we can’t use the StubLinkStubManager for these stubs. Instead, we create an additional DelegateInvokeStubManager instead.
  • JumpStubStubManager implemented here
    • Stub manager for jump stubs created by ExecutionManager::jumpStub() These are currently used only on the 64-bit targets IA64 and AMD64
  • RangeSectionStubManager implemented here
    • Stub manager for code sections. It forwards the query to the more appropriate stub manager, or handles the query itself.
  • ILStubManager implemented here
    • This is the stub manager for IL stubs
  • InteropDispatchStubManager implemented here
    • This is used to recognize GenericComPlusCallStub, VarargPInvokeStub, and GenericPInvokeCalliHelper.
  • StubLinkStubManager implemented here
  • ThunkHeapStubManager implemented here
    • Note, the only reason we have this stub manager is so that we can recgonize UMEntryThunks for IsTransitionStub. ..
  • TailCallStubManager implemented here
    • This is the stub manager to help the managed debugger step into a tail call. It helps the debugger trace through JIT_TailCall().’ (from stubmgr.h)
  • ThePreStubManager implemented here (in prestub.cpp)
    • The following code manages the PreStub. All method stubs initially use the prestub.
  • VirtualCallStubManager implemented here (in virtualcallstub.cpp)

Finally, we can also see the ‘StubManagers’ in action if we use the eeheap SOS command to inspect the ‘heap dump’ of a .NET Process, as it helps report the size of the different ‘stub heaps’:

> !eeheap -loader

Loader Heap:
--------------------------------------
System Domain: 704fd058
LowFrequencyHeap: Size: 0x0(0)bytes.
HighFrequencyHeap: 002e2000(8000:1000) Size: 0x1000(4096)bytes.
StubHeap: 002ea000(2000:1000) Size: 0x1000(4096)bytes.
Virtual Call Stub Heap:
- IndcellHeap: Size: 0x0(0)bytes.
- LookupHeap: Size: 0x0(0)bytes.
- ResolveHeap: Size: 0x0(0)bytes.
- DispatchHeap: Size: 0x0(0)bytes.
- CacheEntryHeap: Size: 0x0(0)bytes.
Total size: 0x2000(8192)bytes
--------------------------------------

(output taken from .NET Generics and Code Bloat (or its lack thereof))

You can see that in this case the entire ‘stub heap’ is taking up 4096 bytes and in addition there are more in-depth statistics covering the heaps used by virtual call dispatch.

Types of stubs

The different stubs used by the runtime fall into 3 main categories:

Most stubs are wired up in MethodDesc::DoPrestub(..), in this section of code or this section for COM Interop. The stubs generated include the following (definitions taken from BOTR - ‘Kinds of MethodDescs’, also see enum MethodClassification here):

  • Instantiating in (FEATURE_SHARE_GENERIC_CODE, on by default) in MakeInstantiatingStubWorker(..) here
    • Used for less common IL methods that have generic instantiation or that do not have preallocated slot in method table.
  • P/Invoke (a.k.a NDirect) in GetStubForInteropMethod(..) here
    • P/Invoke methods. These are methods marked with DllImport attribute.
  • FCall methods in ECall::GetFCallImpl(..) here
    • Internal methods implemented in unmanaged code. These are methods marked with MethodImplAttribute(MethodImplOptions.InternalCall) attribute, delegate constructors and tlbimp constructors.
  • Array methods in GenerateArrayOpStub(..) here
    • Array methods whose implementation is provided by the runtime (Get, Set, Address)
  • EEImpl in PCODE COMDelegate::GetInvokeMethodStub(EEImplMethodDesc* pMD) here
    • Delegate methods, implementation provided by the runtime
  • COM Interop (FEATURE_COMINTEROP, on by default) in GetStubForInteropMethod(..) here
    • COM interface methods. Since the non-generic interfaces can be used for COM interop by default, this kind is usually used for all interface methods.
  • Unboxing in Stub * MakeUnboxingStubWorker(MethodDesc *pMD) here

Right, now lets look at the individual stub in more detail.

Precode

First up, we’ll take a look at ‘precode’ stubs, because they are used by all other types of stubs, as explained in the BotR page on Method Descriptors:

The precode is a small fragment of code used to implement temporary entry points and an efficient wrapper for stubs. Precode is a niche code-generator for these two cases, generating the most efficient code possible. In an ideal world, all native code dynamically generated by the runtime would be produced by the JIT. That’s not feasible in this case, given the specific requirements of these two scenarios. The basic precode on x86 may look like this:

mov eax,pMethodDesc // Load MethodDesc into scratch register
jmp target          // Jump to a target

Efficient Stub wrappers: The implementation of certain methods (e.g. P/Invoke, delegate invocation, multi dimensional array setters and getters) is provided by the runtime, typically as hand-written assembly stubs. Precode provides a space-efficient wrapper over stubs, to multiplex them for multiple callers.

The worker code of the stub is wrapped by a precode fragment that can be mapped to the MethodDesc and that jumps to the worker code of the stub. The worker code of the stub can be shared between multiple methods this way. It is an important optimization used to implement P/Invoke marshalling stubs.

By providing a ‘pointer’ to the MethodDesc class, the precode allows any subsequent stub to have access to a lot of information about a method call and it’s containing Type via the MethodTable (‘hot’) and EEClass (‘cold’) data structures. The MethodDesc data-structure is one of the most fundamental types in the runtime, hence why it has it’s own BotR page.

Each ‘precode’ is created in MethodDesc::GetOrCreatePrecode() here and there are several different types as we can see in this enum from /vm/precode.h:

enum PrecodeType {
    PRECODE_INVALID         = InvalidPrecode::Type,
    PRECODE_STUB            = StubPrecode::Type,
#ifdef HAS_NDIRECT_IMPORT_PRECODE
    PRECODE_NDIRECT_IMPORT  = NDirectImportPrecode::Type,
#endif // HAS_NDIRECT_IMPORT_PRECODE
#ifdef HAS_FIXUP_PRECODE
    PRECODE_FIXUP           = FixupPrecode::Type,
#endif // HAS_FIXUP_PRECODE
#ifdef HAS_THISPTR_RETBUF_PRECODE
    PRECODE_THISPTR_RETBUF  = ThisPtrRetBufPrecode::Type,
#endif // HAS_THISPTR_RETBUF_PRECODE
};

As always, the BotR page describes the different types in great detail, but in summary:

  • StubPrecode - .. is the basic precode type. It loads MethodDesc into a scratch register and then jumps. It must be implemented for precodes to work. It is used as fallback when no other specialized precode type is available.
  • FixupPrecode - .. is used when the final target does not require MethodDesc in scratch register. The FixupPrecode saves a few cycles by avoiding loading MethodDesc into the scratch register. The most common usage of FixupPrecode is for method fixups in NGen images.
  • ThisPtrRetBufPrecode - .. is used to switch a return buffer and the this pointer for open instance delegates returning valuetypes. It is used to convert the calling convention of MyValueType Bar(Foo x) to the calling convention of MyValueType Foo::Bar().
  • NDirectImportPrecode (a.k.a P/Invoke) - .. is used for lazy binding of unmanaged P/Invoke targets. This precode is for convenience and to reduce amount of platform specific plumbing.

Finally, to give you an idea of some real-world scenarios for ‘precode’ stubs, take a look at this comment from the DoesSlotCallPrestub(..) method (AMD64):

// AMD64 has the following possible sequences for prestub logic:
// 1. slot -> temporary entrypoint -> prestub
// 2. slot -> precode -> prestub
// 3. slot -> precode -> jumprel64 (jump stub) -> prestub
// 4. slot -> precode -> jumprel64 (NGEN case) -> prestub

‘Just-in-time’ (JIT) and ‘Tiered’ Compilation

However, another piece of functionality that ‘precodes’ provide is related to ‘just-in-time’ (JIT) compilation, again from the BotR page:

Temporary entry points: Methods must provide entry points before they are jitted so that jitted code has an address to call them. These temporary entry points are provided by precode. They are a specific form of stub wrappers.

This technique is a lazy approach to jitting, which provides a performance optimization in both space and time. Otherwise, the transitive closure of a method would need to be jitted before it was executed. This would be a waste, since only the dependencies of taken code branches (e.g. if statement) require jitting.

Each temporary entry point is much smaller than a typical method body. They need to be small since there are a lot of them, even at the cost of performance. The temporary entry points are executed just once before the actual code for the method is generated.

So these ‘temporary entry points’ provide something concrete that can be referenced before a method has been JITted. They then trigger the JIT-compilation which does the job of generating the native code for a method. The entire process looks like this (dotted lines represent a pointer indirection, solid lines are a ‘control transfer’ e.g. a jmp/call assembly instruction):

Before JITing

Here we see the ‘temporary entry point’ pointing to the ‘fixup precode’, which ultimately calls into the PrestubWorker() function here.

After JIting

Once the method has been JITted, we can see that the PrestubWorker is now out of the picture and instead we have the native code for the function. In addition, there is now a ‘stable entry point’ that can be used by any other code that wants to execute the function. Also, we can see that the ‘fixup precode’ has been ‘backpatched’ to also point at the ‘native code’. For an idea of how this ‘back-patching’ works, see the StubPrecode ::SetTargetInterlocked(..) method here (ARM64).

After JIting - Tiered Compilation

However, there is also another ‘after’ scenario, now that .NET Core has ‘Tiered Compilation’. Here we see that the ‘stable entry point’ still goes via the ‘fixup precode’, it doesn’t directly call into the ‘native code’. This is because ‘tiered compilation’ counts how many times a method is called and once it decides the method is ‘hot’, it re-compiles a more optimised version that will give better performance. This ‘call counting’ takes place in this code in MethodDesc::DoPrestub(..) which calls into CodeVersionManager::PublishNonJumpStampVersionableCodeIfNecessary(..) here and then if shouldCountCalls is true, it ends up calling CallCounter::OnMethodCodeVersionCalledSubsequently(..) here.

What’s been interesting to watch during the development of ‘tiered compilation’ is that (not surprisingly) there has been a significant amount of work to ensure that the extra level of indirection doesn’t make the entire process slower, for instance see Patch vtable slots and similar when tiering is enabled #21292.

Like all the other stubs, ‘precodes’ have different versions for different CPU architectures. As a reference, the list below contains links to all of them:

  1. Precodes (a.k.a ‘Precode Fixup Thunk’):
  2. ThePreStub:
  3. PreStubWorker(..) in /vm/prestub.cpp
  4. MethodDesc::DoPrestub(..) here
  5. MethodDesc::DoBackpatch(..) here

Finally, for even more information on the JITing process, see:

Stubs-as-IL

‘Stubs as IL’ actually describes several types of individual stubs, but what they all have in common is they’re generated from ‘Intermediate Language’ (IL) which is then compiled by the JIT, in exactly the same way it handles the code we write (after it’s first been compiled from C#/F#/VB.NET into IL by another compiler).

This makes sense, it’s far easier to write the IL once and then have the JIT worry about compiling it for different CPU architectures, rather than having to write raw assembly each time (for x86/x64/arm/etc). However all stubs were hand-written assembly in .NET Framework 1.0:

What you have described is how it actually works. The only difference is that the shuffle thunk is hand-emitted in assembly and not generated by the JIT for historic reasons. All stubs (including all interop stubs) were hand-emitted like this in .NET Framework 1.0. Starting with .NET Framework 2.0, we have been converting the stubs to be generated by the JIT (the runtime generates IL for the stub, and then the JIT compiles the IL as regular method). The shuffle thunk is one of the few remaining ones not converted yet. Also, we have the IL path on some platforms but not others - FEATURE_STUBS_AS_IL is related to it.

In the CoreCLR source code, ‘stubs as IL’ are controlled by the feature flag FEATURE_STUBS_AS_IL, with the following additional flags for each specific type:

  • StubsAsIL
  • ArrayStubAsIL
  • MulticastStubAsIL

On Windows only some features are implemented with IL stubs, see this code, e.g. ‘ArrayStubAsIL’ is disabled on ‘x86’, but enabled elsewhere.


   true
   true
   true
    ...

On Unix they are all done in IL, regardless of CPU Arch, as this code shows:


   ...
   true
   true
   true

Finally, here’s the complete list of stubs that can be implemented in IL from /vm/ilstubresolver.h:

 enum ILStubType
 {
     Unassigned = 0,
     CLRToNativeInteropStub,
     CLRToCOMInteropStub,
     CLRToWinRTInteropStub,
     NativeToCLRInteropStub,
     COMToCLRInteropStub,
     WinRTToCLRInteropStub,
#ifdef FEATURE_ARRAYSTUB_AS_IL 
     ArrayOpStub,
#endif
#ifdef FEATURE_MULTICASTSTUB_AS_IL
     MulticastDelegateStub,
#endif
#ifdef FEATURE_STUBS_AS_IL
     SecureDelegateStub,
     UnboxingILStub,
     InstantiatingStub,
#endif
 };

But the usage of IL stubs has grown over time and it seems that they are the preferred mechanism where possible as they’re easier to write and debug. See [x86/Linux] Enable FEATURE_ARRAYSTUB_AS_IL, Switch multicast delegate stub on Windows x64 to use stubs-as-il and Fix GenerateShuffleArray to support cyclic shuffles #26169 (comment) for more information.

P/Invoke, Reverse P/Invoke and ‘calli’

All these stubs have one thing in common, they allow a transition between ‘managed’ and ‘un-managed’ (or native) code. To make this safe and to preserve the guarantees that the .NET runtime provides, stubs are used every time the transition is made.

This entire process is outlined in great detail in the BotR page CLR ABI - PInvokes, from the ‘Per-call-site PInvoke work’ section:

  1. For direct calls, the JITed code sets InlinedCallFrame->m_pDatum to the MethodDesc of the call target.
    • For JIT64, indirect calls within IL stubs sets it to the secret parameter (this seems redundant, but it might have changed since the per-frame initialization?).
    • For JIT32 (ARM) indirect calls, it sets this member to the size of the pushed arguments, according to the comments. The implementation however always passed 0.
  2. For JIT64/AMD64 only: Next for non-IL stubs, the InlinedCallFrame is ‘pushed’ by setting Thread->m_pFrame to point to the InlinedCallFrame (recall that the per-frame initialization already set InlinedCallFrame->m_pNext to point to the previous top). For IL stubs this step is accomplished in the per-frame initialization.
  3. The Frame is made active by setting InlinedCallFrame->m_pCallerReturnAddress.
  4. The code then toggles the GC mode by setting Thread->m_fPreemptiveGCDisabled = 0.
  5. Starting now, no GC pointers may be live in registers. RyuJit LSRA meets this requirement by adding special refPositon RefTypeKillGCRefs before unmanaged calls and special helpers.
  6. Then comes the actual call/PInvoke.
  7. The GC mode is set back by setting Thread->m_fPreemptiveGCDisabled = 1.
  8. Then we check to see if g_TrapReturningThreads is set (non-zero). If it is, we call CORINFO_HELP_STOP_FOR_GC.
    • For ARM, this helper call preserves the return register(s): R0, R1, S0, and D0.
    • For AMD64, the generated code must manually preserve the return value of the PInvoke by moving it to a non-volatile register or a stack location.
  9. Starting now, GC pointers may once again be live in registers.
  10. Clear the InlinedCallFrame->m_pCallerReturnAddress back to 0.
  11. For JIT64/AMD64 only: For non-IL stubs ‘pop’ the Frame chain by resetting Thread->m_pFrame back to InlinedCallFrame.m_pNext.

Saving/restoring all the non-volatile registers helps by preventing any registers that are unused in the current frame from accidentally having a live GC pointer value from a parent frame. The argument and return registers are ‘safe’ because they cannot be GC refs. Any refs should have been pinned elsewhere and instead passed as native pointers.

For IL stubs, the Frame chain isn’t popped at the call site, so instead it must be popped right before the epilog and right before any jmp calls. It looks like we do not support tail calls from PInvoke IL stubs?

As you can see, quite a bit of the work is to keep the Garbage Collector (GC) happy. This makes sense because once execution moves into un-managed/native code the .NET runtime has no control over what’s happening, so it needs to ensure that the GC doesn’t clean up or move around objects that are being used in the native code. It achives this by constraining what the GC can do (on the current thread) from the time execution moves into un-managed code and keeps that in place until it returns back to the mamanged side.

On top of that, there needs to be support for allowing ‘stack walking’ or ‘unwinding, to allowing debugging and produce meaningful stack traces. This is done by setting up frames that are put in place when control transitions from managed -> un-managed, before being removed (‘popped’) when transitioning back. Here’s a list of the different scenarios that are covered, from /vm/frames.h:

This is the list of Interop stubs & transition helpers with information
regarding what (if any) Frame they used and where they were set up:

P/Invoke:
 JIT inlined: The code to call the method is inlined into the caller by the JIT.
    InlinedCallFrame is erected by the JITted code.
 Requires marshaling: The stub does not erect any frames explicitly but contains
    an unmanaged CALLI which turns it into the JIT inlined case.

Delegate over a native function pointer:
 The same as P/Invoke but the raw JIT inlined case is not present (the call always
 goes through an IL stub).

Calli:
 The same as P/Invoke.
 PInvokeCalliFrame is erected in stub generated by GenerateGetStubForPInvokeCalli
 before calling to GetILStubForCalli which generates the IL stub. This happens only
 the first time a call via the corresponding VASigCookie is made.

ClrToCom:
 Late-bound or eventing: The stub is generated by GenerateGenericComplusWorker
    (x86) or exists statically as GenericComPlusCallStub[RetBuffArg] (64-bit),
    and it erects a ComPlusMethodFrame frame.
 Early-bound: The stub does not erect any frames explicitly but contains an
    unmanaged CALLI which turns it into the JIT inlined case.

ComToClr:
 Normal stub:
 Interpreted: The stub is generated by ComCall::CreateGenericComCallStub
    (in ComToClrCall.cpp) and it erects a ComMethodFrame frame.
 Prestub:
  The prestub is ComCallPreStub (in ComCallableWrapper.cpp) and it erects a ComPrestubMethodFrame frame.

Reverse P/Invoke (used for C++ exports & fixups as well as delegates
obtained from function pointers):
 Normal stub:
  x86: The stub is generated by UMEntryThunk::CompileUMThunkWorker
    (in DllImportCallback.cpp) and it is frameless. It calls directly
    the managed target or to IL stub if marshaling is required.
  non-x86: The stub exists statically as UMThunkStub and calls to IL stub.
 Prestub:
  The prestub is generated by GenerateUMThunkPrestub (x86) or exists statically
  as TheUMEntryPrestub (64-bit), and it erects an UMThkCallFrame frame.

Reverse P/Invoke AppDomain selector stub:
 The asm helper is IJWNOADThunkJumpTarget (in asmhelpers.asm) and it is frameless.

The P/Invoke IL stubs are wired up in the MethodDesc::DoPrestub(..) method (note that P/Invoke is also known as ‘NDirect’), in addition they are also created here when being used for ‘COM Interop’. That code then calls into GetStubForInteropMethod(..) in /vm/dllimport.cpp, before branching off to handle each case:

  • P/Invoke calls into NDirect::GetStubForILStub(..) here
  • Reverse P/Invoke calls into another overload of NDirect::GetStubForILStub(..) here
  • COM Interop goes to ComPlusCall::GetStubForILStub(..) here in /vm/clrtocomcall.cpp
  • EE implemented methods end up in COMDelegate::GetStubForILStub(..) here (for more info on EEImpl methods see ‘Kinds of MethodDescs’)

There are also hand-written assembly stubs for the differents scenarios, such as JIT_PInvokeBegin, JIT_PInvokeEnd and VarargPInvokeStub, these can be seen in the files below:

As an example, calli method calls (see OpCodes.Calli) end up in GenericPInvokeCalliHelper, which has a nice bit of ASCII art in the i386 version:

// stack layout at this point:
//
// |         ...          |
// |   stack arguments    | ESP + 16
// +----------------------+
// |     VASigCookie*     | ESP + 12
// +----------------------+
// |    return address    | ESP + 8
// +----------------------+
// | CALLI target address | ESP + 4
// +----------------------+
// |   stub entry point   | ESP + 0
// ------------------------

However, all these stubs can have an adverse impact on start-up time, see Large numbers of Pinvoke stubs created on startup for example. This impact has been mitigated by compiling the stubs ‘Ahead-of-Time’ (AOT) and storing them in the ‘Ready-to-Run’ images (replacement format for NGEN (Native Image Generator)). From R2R ilstubs:

IL stub generation for interop takes measurable time at startup, and it is possible to generate some of them in an ahead of time

This change introduces ahead of time R2R compilation of IL stubs

Related work was done in Enable R2R compilation/inlining of PInvoke stubs where no marshalling is required and PInvoke stubs for Unix platforms (‘Enables inlining of PInvoke stubs for Unix platforms’).

Finally, for even more information on the issues involved, see:

Marshalling

However, dealing with the ‘managed’ to ‘un-managed’ transition is only one part of the story. The other is that there are also stubs created to deal with the ‘marshalling’ of arguments between the 2 sides. This process of ‘Interop Marshalling’ is explained nicely in the Microsoft docs:

Interop marshaling governs how data is passed in method arguments and return values between managed and unmanaged memory during calls. Interop marshaling is a run-time activity performed by the common language runtime’s marshaling service.

Most data types have common representations in both managed and unmanaged memory. The interop marshaler handles these types for you. Other types can be ambiguous or not represented at all in managed memory.

Like many stubs in the CLR, the marshalling stubs have evolved over time. As we can read in the excellent post Improvements to Interop Marshaling in V4: IL Stubs Everywhere:

History The 1.0 and 1.1 versions of the CLR had several different techniques for creating and executing these stubs that were each designed for marshaling different types of signatures. These techniques ranged from directly generated x86 assembly instructions for simple signatures to generating specialized ML (an internal marshaling language) and running them through an internal interpreter for the most complicated signatures. This system worked well enough – although not without difficulties – in 1.0 and 1.1 but presented us with a serious maintenance problem when 2.0, and its support for multiple processor architectures, came around.

That’s right, there was an internal interpreter built into early version of the .NET CLR that had the job of running the ‘marshalling language’ (ML) code!

However, it then goes on to explain why this process wasn’t sustainable:

We realized early in the process of adding 64 bit support to 2.0 that this approach was not sustainable across multiple architectures. Had we continued with the same strategy we would have had to create parallel marshaling infrastructures for each new architecture we supported (remember in 2.0 we introduced support for both x64 and IA64) which would, in addition to the initial cost, at least triple the cost of every new marshaling feature or bug fix. We needed one marshaling stub technology that would work on multiple processor architectures and could be efficiently executed on each one: enter IL stubs.

The solution was to implement all stubs using ‘Intermediate Language’ (IL) that is CPU-agnostic. Then the JIT-compiler is used to convert the IL into machine code for each CPU architecture, which makes sense because it’s exactly what the JIT is good at. Also worth noting is that this work still continues today, for instance see Implement struct marshalling via IL Stubs instead of via FieldMarshalers #26340.

Finally, there is a really nice investigation into the whole process in PInvoke: beyond the magic (also Compile time marshalling). What’s also nice is that you can use PerfView to see the stubs that the runtime generates.

Generics

It is reasonably well known that generics in .NET use ‘code sharing’ to save space. That is, given a generic method such as public void Insert(..), one method body of ‘native code’ will be created and shared by the instantiated types of Insert(..) and Insert(..) (assumning that Foo and Bar are references types), but different versions will be created for Insert(..) and Insert(..) (as int/double are value types). This is possible, for the reasons outlined by Jon Skeet in a StackOverflow question:

.. consider what the CLR needs to know about a type. It includes:

  • The size of a value of that type (i.e. if you have a variable of some type, how much space will that memory need?)
  • How to treat the value in terms of garbage collection: is it a reference to an object, or a value which may in turn contain other references?

For all reference types, the answers to these questions are the same. The size is just the size of a pointer, and the value is always just a reference (so if the variable is considered a root, the GC needs to recursively descend into it).

For value types, the answers can vary significantly.

But, this poses a problem. What about if the ‘shared’ method needs to do something specific for each type, like call typeof(T)?

This whole issue is explained in these 2 great posts, which I really recommend you take the time to read:

I’m not going to repeat what they cover here, except to say that (not surprisingly) ‘stubs’ are used to solve this issue, in conjunction with a ‘hidden’ parameter. These stubs are known as ‘instantiating’ stubs and we can find out more about them in this comment:

Instantiating Stubs - Return TRUE if this is this a special stub used to implement an instantiated generic method or per-instantiation static method. The action of an instantiating stub is - pass on a MethodTable or InstantiatedMethodDesc extra argument to shared code

The different scenarios are handled in MakeInstantiatingStubWorker(..) in /vm/prestub.cpp, you can see the check for HasMethodInstantiation and the fall-back to a ‘per-instantiation static method’:

    // It's an instantiated generic method
    // Fetch the shared code associated with this instantiation
    pSharedMD = pMD->GetWrappedMethodDesc();
    _ASSERTE(pSharedMD != NULL && pSharedMD != pMD);

    if (pMD->HasMethodInstantiation())
    {
        extraArg = pMD;
    }
    else
    {
        // It's a per-instantiation static method
        extraArg = pMD->GetMethodTable();
    }
    Stub *pstub = NULL;

#ifdef FEATURE_STUBS_AS_IL
    pstub = CreateInstantiatingILStub(pSharedMD, extraArg);
#else
    CPUSTUBLINKER sl;
    _ASSERTE(pSharedMD != NULL && pSharedMD != pMD);
    sl.EmitInstantiatingMethodStub(pSharedMD, extraArg);

    pstub = sl.Link(pMD->GetLoaderAllocator()->GetStubHeap());
#endif

As a reminder, FEATURE_STUBS_AS_IL is defined for all Unix versions of the CoreCLR, but on Windows it’s only used with ARM64.

  • When FEATURE_STUBS_AS_IL is defined, the code calls into CreateInstantiatingILStub(..) here. To get an overview of what it’s doing, we can take a look at the steps called-out in the code comments:
    • // 1. Build the new signature here
    • // 2. Emit the method body here
    • // 2.2 Push the rest of the arguments for x86 here
    • // 2.3 Push the hidden context param here
    • // 2.4 Push the rest of the arguments for not x86 here
    • // 2.5 Push the target address here
    • // 2.6 Do the calli here
  • When FEATURE_STUBS_AS_IL is note defined, per CPU/OS versions of EmitInstantiatingMethodStub(..) are used, they exist for:

In the last case, (EmitInstantiatingMethodStub(..) on ARM), the stub shares code with the instantiating version of the unboxing stub, so the heavy-lifting is done in StubLinkerCPU::ThumbEmitCallWithGenericInstantiationParameter(..) here. This method is over 400 lines for fairly complex code, althrough there is also a nice piece of ASCII art (for info on why this ‘complex’ case is needed see this comment):

// Complex case where we need to emit a new stack frame and copy the arguments.

// Calculate the size of the new stack frame:
//
//            +------------+
//      SP -> |            | srcofs & ShuffleEntry::OFSMASK));
        }
        else if (pEntry->dstofs & ShuffleEntry::REGMASK)
        {
            // source must be on the stack
            _ASSERTE(!(pEntry->srcofs & ShuffleEntry::REGMASK));

            EmitLoadStoreRegImm(eLOAD, IntReg(pEntry->dstofs & ShuffleEntry::OFSMASK), RegSp, pEntry->srcofs * sizeof(void*));
        }
        else
        {
            // source must be on the stack
            _ASSERTE(!(pEntry->srcofs & ShuffleEntry::REGMASK));

            // dest must be on the stack
            _ASSERTE(!(pEntry->dstofs & ShuffleEntry::REGMASK));

            EmitLoadStoreRegImm(eLOAD, IntReg(9), RegSp, pEntry->srcofs * sizeof(void*));
            EmitLoadStoreRegImm(eSTORE, IntReg(9), RegSp, pEntry->dstofs * sizeof(void*));
        }
    }

    // Tailcall to target
    // br x16
    EmitJumpRegister(IntReg(16));
}

Unboxing

I’ve written about this type of ‘stub’ before in A look at the internals of ‘boxing’ in the CLR, but in summary the unboxing stub needs to handle steps 2) and 3) from the diagram below:

1. MyStruct:         [0x05 0x00 0x00 0x00]

                     |   Object Header   |   MethodTable  |   MyStruct    |
2. MyStruct (Boxed): [0x40 0x5b 0x6f 0x6f 0xfe 0x7 0x0 0x0 0x5 0x0 0x0 0x0]
                                          ^
                    object 'this' pointer | 

                     |   Object Header   |   MethodTable  |   MyStruct    |
3. MyStruct (Boxed): [0x40 0x5b 0x6f 0x6f 0xfe 0x7 0x0 0x0 0x5 0x0 0x0 0x0]
                                                           ^
                                   adjusted 'this' pointer | 

Key to the diagram

  1. Original struct, on the stack
  2. The struct being boxed into an object that lives on the heap
  3. Adjustment made to this pointer so MyStruct::ToString() will work

These stubs make is possible for ‘value types’ (structs) to override methods from System.Object, such as ToString() and GetHashCode(). The fix-up is needed because structs don’t have an ‘object header’, but when they’re boxed into an Object they do. So the stub has the job of moving or adjusting the ‘this’ pointer so that the code in the ToString() method can work the same, regardless of whether it’s operating on a regular ‘struct’ or one that’s been boxed into an ‘object.

The unboxing stubs are created in MethodDesc::DoPrestub(..) here, which in turn calls into MakeUnboxingStubWorker(..) here

  • when FEATURE_STUBS_AS_IL is disabled it then calls EmitUnboxMethodStub(..) to create the stub, there are per-CPU versions:
  • when FEATURE_STUBS_AS_IL is enabled is instead calls into CreateUnboxingILStubForSharedGenericValueTypeMethods(..) here

For more information on some of the internal details of unboxing stubs and how they interact with ‘generic instantiations’ see this informative comment and one in the code for MethodDesc::FindOrCreateAssociatedMethodDesc(..) here.

Arrays

As discussed at the beginning, the method bodies for arrays is provided by the runtime, that is the array access methods, ‘get’ and ‘set’, that allow var a = myArray[5] and myArray[7] = 5 to work. Not surprisingly, these are done as stubs to allow them to be as small and efficient as possible.

Here is the flow for wiring up ‘array stubs’. It all starts up in MethodDesc::DoPrestub(..) here:

  • If FEATURE_ARRAYSTUB_AS_IL is defined (see ‘Stubs-as-IL’), it happens in GenerateArrayOpStub(ArrayMethodDesc* pMD) here
    • Then ArrayOpLinker::EmitStub() here, which is responsible for generating 3 types of stubs { ILSTUB_ARRAYOP_GET, ILSTUB_ARRAYOP_SET, ILSTUB_ARRAYOP_ADDRESS }.
    • Before calling ILStubCache::CreateAndLinkNewILStubMethodDesc(..) here
    • Finally ending up in JitILStub(..) here
  • When FEATURE_ARRAYSTUB_AS_IL isn’t defined, happens in another version of GenerateArrayOpStub(ArrayMethodDesc* pMD) lower down
    • Then void GenerateArrayOpScript(..) here
    • Followed by a call to StubCacheBase::Canonicalize(..) here, that ends up in ArrayStubCache::CompileStub(..) here.
    • Eventually, we end up in StubLinkerCPU::EmitArrayOpStub(..) here, which does the heavy lifting (despite being under ‘\src\vm\i386' seems to support x86 and AMD64?)

I’m not going to include the code for the ‘stub-as-IL’ (ArrayOpLinker::EmitStub()) or the assembly code (StubLinkerCPU::EmitArrayOpStub(..)) versions of the array stubs because they’re both 100’s of lines long, dealing with type and bounds checking, computing address, multi-dimensional arrays and mode. But to give an idea of the complexities, take a look at this comment from StubLinkerCPU::EmitArrayOpStub(..) here:

// Register usage
//
//                                          x86                 AMD64
// Inputs:
//  managed array                           THIS_kREG (ecx)     THIS_kREG (rcx)
//  index 0                                 edx                 rdx
//  index 1/value                                        r8
//  index 2/value                                        r9
//  expected element type for LOADADDR      eax                 rax                 rdx
// Working registers:
//  total (accumulates unscaled offset)     edi                 r10
//  factor (accumulates the slice factor)   esi                 r11

Finally, these stubs are still being improved, for example see Use unsigned index extension in muldi-dimensional array stubs.

Tail Calls

The .NET runtime provides a nice optimisation when doing ‘tail calls’, that (amoung other things) will prevent StackoverflowExceptions in recursive scenarios. For more on why these tail call optimisations are useful and how they work, take a look at:

In summary, a tail call optimisation allows the same stack frame to be re-used if in the caller, there is no work done after the function call to the callee (see Tail call JIT conditions (2007) for a more precise definition).

And why is this beneficial? From Tail Call Improvements in .NET Framework 4:

The primary reason for a tail call as an optimization is to improve data locality, memory usage, and cache usage. By doing a tail call the callee will use the same stack space as the caller. This reduces memory pressure. It marginally improves the cache because the same memory is reused for subsequent callers and thus can stay in the cache, rather than evicting some older cache line to make room for a new cache line.

To make this clear, the code below can benefit from the optimisation, because both functions return straight after calling each the other:

public static long Ping(int cnt, long val)
{
    if (cnt-- == 0)
        return val;

    return Pong(cnt, val + cnt);
}

public static long Pong(int cnt, long val)
{
    if (cnt-- == 0)
        return val;

    return Ping(cnt, val + cnt);
}

However, if the code was changed to the version below, the optimisation would no longer work because PingNotOptimised(..) does some extra work between calling Pong(..) and when it returns:

public static long PingNotOptimised(int cnt, long val)
{
    if (cnt-- == 0)
        return val;

    var result = Pong(cnt, val + cnt);
    result += 1; // prevents the Tail-call optimization
    return result;
}

public static long Pong(int cnt, long val)
{
    if (cnt-- == 0)
        return val;

    return PingNotOptimised(cnt, val + cnt);
}

You can see the difference in the code emitted by the JIT compiler for the different scenarios in SharpLab.

But where do the ‘tail call optimisation stubs’ come into play? Helpfully there is a tail call related design doc that explains, from ‘current way of handling tail-calls’:

Fast tail calls These are tail calls that are handled directly by the jitter and no runtime cooperation is needed. They are limited to cases where:

  • Return value and call target arguments are all either primitive types, reference types, or valuetypes with a single primitive type or reference type fields
  • The aligned size of call target arguments is less or equal to aligned size of caller arguments

So, the stubs aren’t always needed, sometimes the work can be done by the JIT, if there scenario is simple enough.

However for the more complex cases, a ‘helper’ stub is needed:

Tail calls using a helper Tail calls in cases where we cannot perform the call in a simple way are implemented using a tail call helper. Here is a rough description of how it works:

  • For each tail call target, the jitter asks runtime to generate an assembler argument copying routine. This routine reads vararg list of arguments and places the arguments in their proper slots in the CONTEXT or on the stack. Together with the argument copying routine, the runtime also builds a list of offsets of references and byrefs for return value of reference type or structs returned in a hidden return buffer and for structs passed by ref. The gc layout data block is stored at the end of the argument copying thunk.
  • At the time of the tail call, the caller generates a vararg list of all arguments of the tail called function and then calls JIT_TailCall runtime function. It passes it the copying routine address, the target address and the vararg list of the arguments.
  • The JIT_TailCall then performs the following: …

To see the rest of the steps that JIT_TailCall takes you can read the design doc or if you’re really keen you can look at the code in /vm/jithelpers.cpp. Also, there’s a useful explanation of what it needs to handle in the JIT code, see here and here.

However, we’re just going to focus on the stubs, refered to as an ‘assembler argument copying routine’. Firstly, we can see that they have their own stub manager, TailCallStubManager, which is implemented here and allows the stubs to play nicely with the debugger. Also interesting to look at is the TailCallFrame here that is used to ensure that the ‘stack walker’ can work well with tail calls.

Now, onto the stubs themselves, the ‘copying routines’ are provided by the runtime via a call to CEEInfo::getTailCallCopyArgsThunk(..) in /vm/jitinterface.cpp. This in turn calls the CPU specific versions of CPUSTUBLINKER::CreateTailCallCopyArgsThunk(..):

These routines have the complex and hairy job of dealing with the CPU registers and calling conventions. They achieve this by dynamicially emitting assembly instructions, to create a function that looks like the following pseudo code (X86 version):

    // size_t CopyArguments(va_list args,         (RCX)
    //                      CONTEXT *pCtx,        (RDX)
    //                      DWORD64 *pvStack,     (R8)
    //                      size_t cbStack)       (R9)
    // {
    //     if (pCtx != NULL) {
    //         foreach (arg in args) {
    //             copy into pCtx or pvStack
    //         }
    //     }
    //     return ;
    // }

In addition there is one other type of stub that is used. Known as the TailCallHelperStub, they also come in per-CPU versions:

Going forward, there are several limitations of to this approach of using per-CPU stubs, as the design doc explains:

  • It is expensive to port to new platforms
    • Parsing the vararg list is not possible to do in a portable way on Unix. Unlike on Windows, the list is not stored a linear sequence of the parameter data bytes in memory. va_list on Unix is an opaque data type, some of the parameters can be in registers and some in the memory.
    • Generating the copying asm routine needs to be done for each target architecture / platform differently. And it is also very complex, error prone and impossible to do on platforms where code generation at runtime is not allowed.
  • It is slower than it has to be
    • The parameters are copied possibly twice - once from the vararg list to the stack and then one more time if there was not enough space in the caller’s stack frame.
    • RtlRestoreContext restores all registers from the CONTEXT structure, not just a subset of them that is really necessary for the functionality, so it results in another unnecessary memory accesses.
  • Stack walking over the stack frames of the tail calls requires runtime assistance.

Fortunately, it then goes into great depth discussing how a new approach could be implemented and how it would solve these issues. Even better, work has already started and we can follow along in Implement portable tailcall helpers #26418 (currently sitting at ‘31 of 55’ tasks completed, with over 50 files modified, it’s not a small job!).

Finally, for other PRs related to tail calls, see:

Virtual Stub Dispatch (VSD)

I’ve saved the best for last, ‘Virtual Stub Dispatch’ or VSD is such an in-depth topic, that it an entire BotR page devoted to it!! From the introduction:

Virtual stub dispatching (VSD) is the technique of using stubs for virtual method invocations instead of the traditional virtual method table. In the past, interface dispatch required that interfaces had process-unique identifiers, and that every loaded interface was added to a global interface virtual table map. This requirement meant that all interfaces and all classes that implemented interfaces had to be restored at runtime in NGEN scenarios, causing significant startup working set increases. The motivation for stub dispatching was to eliminate much of the related working set, as well as distribute the remaining work throughout the lifetime of the process.

It then goes on to say:

Although it is possible for VSD to dispatch both virtual instance and interface method calls, it is currently used only for interface dispatch.

So despite having the work ‘virtual’ in the title, it’s not actually used for C# methods with the virtual modifier on them. However, if you look at the IL for interface methods you can see why they are also known as ‘virtual’.

Virtual Stub Dispatch is so complex, it actually has several different stub types, from /vm/virtualcallstub.h:

enum StubKind { 
  SK_UNKNOWN, 
  SK_LOOKUP,      // Lookup Stubs are SLOW stubs that simply call into the runtime to do all work.
  SK_DISPATCH,    // Dispatch Stubs have a fast check for one type otherwise jumps to runtime.  Works for monomorphic sites
  SK_RESOLVE,     // Resolve Stubs do a hash lookup before fallling back to the runtime.  Works for polymorphic sites.
  SK_VTABLECALL,  // Stub that jumps to a target method using vtable-based indirections. Works for non-interface calls.
  SK_BREAKPOINT 
};

So there are the following types (these are links to the AMD64 versions, x86 versions are in /vm/i386/virtualcallstubcpu.hpp):

  • Lookup Stubs:
    • // Virtual and interface call sites are initially setup to point at LookupStubs. This is because the runtime type of the pointer is not yet known, so the target cannot be resolved.
  • Dispatch Stubs:
    • // Monomorphic and mostly monomorphic call sites eventually point to DispatchStubs. A dispatch stub has an expected type (expectedMT), target address (target) and fail address (failure). If the calling frame does in fact have the type be of the expected type, then control is transfered to the target address, the method implementation. If not, then control is transfered to the fail address, a fail stub (see below) where a polymorphic lookup is done to find the correct address to go to.
    • There’s also specific versions, DispatchStubShort and DispatchStubLong, see this comment for why they are both needed.
  • Resolve Stubs:
    • // Polymorphic call sites and monomorphic calls that fail end up in a ResolverStub. There is only one resolver stub built for any given token, even though there may be many call sites that use that token and many distinct types that are used in the calling call frames. A resolver stub actually has two entry points, one for polymorphic call sites and one for dispatch stubs that fail on their expectedMT test. There is a third part of the resolver stub that enters the ee when a decision should be made about changing the callsite.
  • V-Table or Virtual Call Stubs
    • //These are jump stubs that perform a vtable-base virtual call. These stubs assume that an object is placed in the first argument register (this pointer). From there, the stub extracts the MethodTable pointer, followed by the vtable pointer, and finally jumps to the target method at a given slot in the vtable.

The below diagram shows the general control flow between these stubs

(Image from ‘Design of Virtual Stub Dispatch’)

Finally, if you want even more in-depth information see this comment.

However, these stubs come at a cost, which makes virtual method calls more expensive than direct ones. This is why de-virtualization is so important, i.e. the process of the .NET JIT detecting when a virtual call can instead be replaced by a direct one. There has been some work done in .NET Core to improve this, see Simple devirtualization #9230 which covers sealed classes/methods and when the object type is known exactly. However there is still more to be done, as shown in JIT: devirtualization next steps #9908, where ‘5 of 23’ tasks have been completed.

Other Types of Stubs

This post is already way too long, so I don’t intend to offer any analysis of the following stubs. Instead I’ve just included some links to more information so you can read up on any that interest you!

‘Jump’ stubs

‘Function Pointer’ stubs

  • ‘Function Pointer’ Stubs, see /vm/fptrstubs.cpp and /vm/fptrstubs.h
  • // FuncPtrStubs contains stubs that is used by GetMultiCallableAddrOfCode() if the function has not been jitted. Using a stub decouples ldftn from the prestub, so prestub does not need to be backpatched. This stub is also used in other places which need a function pointer

‘Thread Hijacking’ stubs

From the BotR page on ‘Threading’:

  • If fully interruptable, it is safe to perform a GC at any point, since the thread is, by definition, at a safe point. It is reasonable to leave the thread suspended at this point (because it’s safe) but various historical OS bugs prevent this from working, because the CONTEXT retrieved earlier may be corrupt). Instead, the thread’s instruction pointer is overwritten, redirecting it to a stub that will capture a more complete CONTEXT, leave cooperative mode, wait for the GC to complete, reenter cooperative mode, and restore the thread to its previous state.
  • If partially-interruptable, the thread is, by definition, not at a safe point. However, the caller will be at a safe point (method transition). Using that knowledge, the CLR “hijacks” the top-most stack frame’s return address (physically overwrite that location on the stack) with a stub similar to the one used for fully-interruptable code. When the method returns, it will no longer return to its actual caller, but rather to the stub (the method may also perform a GC poll, inserted by the JIT, before that point, which will cause it to leave cooperative mode and undo the hijack).

Done with the OnHijackTripThread method in /vm/amd64/AsmHelpers.asm, which calls into OnHijackWorker(..) in /vm/threadsuspend.cpp.

‘NGEN Fixup’ stubs

From CLR Inside Out - The Performance Benefits of NGen (2006):

Throughput of NGen-compiled code is lower than that of JIT-compiled code primarily for one reason: cross-assembly references. In JIT-compiled code, cross-assembly references can be implemented as direct calls or jumps since the exact addresses of these references are known at run time. For statically compiled code, however, cross-assembly references need to go through a jump slot that gets populated with the correct address at run time by executing a method pre-stub. The method pre-stub ensures, among other things, that the native images for assemblies referenced by that method are loaded into memory before the method is executed. The pre-stub only needs to be executed the first time the method is called; it is short-circuited out for subsequent calls. However, every time the method is called, cross-assembly references do need to go through a level of indirection. This is principally what accounted for the 5-10 percent drop in throughput for NGen-compiled code when compared to JIT-compiled code.

Also see the ‘NGEN’ section of the ‘jump stub’ design doc.

Stubs in the Mono Runtime

Mono refers to ‘Stubs’ as ‘Trampolines’ and they’re widely used in the source code.

The Mono docs have an excellent page all about ‘Trampolines’, that lists the following types:

  1. JIT Trampolines
  2. Virtual Call Trampolines
  3. Jump Trampolines
  4. Class Init Trampolines
  5. Generic Class Init Trampoline
  6. RGCTX Lazy Fetch Trampolines
  7. AOT Trampolines
  8. Delegate Trampolines
  9. Monitor Enter/Exit Trampolines

Also the docs page on Generic Sharing has some good, in-depth information.

Conclusion

So it turns out that ‘stubs’ are way more prevelant in the .NET Core Runtime that I imagined when I first started on this post. They are an interesting technique and they contain a fair amount of complexity. In addition, I only covered each stub in isolation, in reality many of them have to play nicely together, for instance imagine a delegate calling a virtual method that has generic type parameters and you can see that things start to get complex! (that scenario might contain 3 seperate stubs, although they are also shared where possible). If you were then to add array methods, P/Invoke marshalling and un-boxing to the mix, things get even more hairy and even more complex!

If anyone has read this far and wants a fun challenge, try and figure out what’s the most stubs you can force a single method call to go via! If you do, let me know in the comments or via twitter

Finally, by knowing where and when stubs are involved in our method calls, we can start to understand the overhead of each scenario. For instance, it explains why delegate method calls are a bit slower than calling a method directly and why ‘de-virtualization’ is so important. Having the JIT be able to perform extra analysis to determine that a virtual call can be converted into a direct one skips an entire level of indirection, for more on this see:


Thu, 26 Sep 2019, 12:00 am


ASCII Art in .NET Code

Who doesn’t like a nice bit of ‘ASCII Art’? I know I certainly do!

To see what Matt’s CLR was all about you can watch the recording of my talk ‘From ‘dotnet run’ to ‘Hello World!’’ (from about ~24:30 in)

So armed with a trusty regex /\*(.*?)\*/|//(.*?)\r?\n|"((\\[^\n]|[^"\n])*)"|@("[^"]*")+, I set out to find all the interesting ASCII Art used in source code comments in the following .NET related repositories:

  • dotnet/CoreCLR - “the runtime for .NET Core. It includes the garbage collector, JIT compiler, primitive data types and low-level classes.
  • Mono - “open source ECMA CLI, C# and .NET implementation.
  • dotnet/CoreFX - “the foundational class libraries for .NET Core. It includes types for collections, file systems, console, JSON, XML, async and many others.
  • dotnet/Roslyn - “provides C# and Visual Basic languages with rich code analysis APIs
  • aspnet/AspNetCore - “a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.

Note: Yes, I shamelessly ‘borrowed’ this idea from John Regehr, I was motivated to write this because his excellent post ‘Explaining Code using ASCII Art’ didn’t have any .NET related code in it!

If you’ve come across any interesting examples I’ve missed out, please let me know!

Table of Contents

To make the examples easier to browse, I’ve split them up into categories:

Dave Cutler

There’s no art in this one, but it deserves it’s own category as it quotes the amazing Dave Cutler who led the development of Windows NT. Therefore there’s no better person to ask a deep, technical question about how Thread Suspension works on Windows, from coreclr/src/vm/threadsuspend.cpp

// Message from David Cutler
/*
    After SuspendThread returns, can the suspended thread continue to execute code in user mode?

    [David Cutler] The suspended thread cannot execute any more user code, but it might be currently "running"
    on a logical processor whose other logical processor is currently actually executing another thread.
    In this case the target thread will not suspend until the hardware switches back to executing instructions
    on its logical processor. In this case even the memory barrier would not necessarily work - a better solution
    would be to use interlocked operations on the variable itself.

    After SuspendThread returns, does the store buffer of the CPU for the suspended thread still need to drain?

    Historically, we've assumed that the answer to both questions is No.  But on one 4/8 hyper-threaded machine
    running Win2K3 SP1 build 1421, we've seen two stress failures where SuspendThread returns while writes seem to still be in flight.

    Usually after we suspend a thread, we then call GetThreadContext.  This seems to guarantee consistency.
    But there are places we would like to avoid GetThreadContext, if it's safe and legal.

    [David Cutler] Get context delivers a APC to the target thread and waits on an event that will be set
    when the target thread has delivered its context.

    Chris.
*/

For more info on Dave Cutler, see this excellent interview ‘Internets of Interest #6: Dave Cutler on Dave Cutler’ or ‘The engineer’s engineer: Computer industry luminaries salute Dave Cutler’s five-decade-long quest for quality’

Syntax Trees

The inner workings of the .NET ‘Just-in-Time’ (JIT) Compiler have always been a bit of a mystery to me. But, having informative comments like this one from coreclr/src/jit/lsra.cpp go some way to showing what it’s doing

// For example, for this tree (numbers are execution order, lower is earlier and higher is later):
//
//                                   +---------+----------+
//                                   |       GT_ADD (3)   |
//                                   +---------+----------+
//                                             |
//                                           /   \
//                                         /       \
//                                       /           \
//                   +-------------------+           +----------------------+
//                   |         x (1)     | "tree"    |         y (2)        |
//                   +-------------------+           +----------------------+
//
// generate this tree:
//
//                                   +---------+----------+
//                                   |       GT_ADD (4)   |
//                                   +---------+----------+
//                                             |
//                                           /   \
//                                         /       \
//                                       /           \
//                   +-------------------+           +----------------------+
//                   |  GT_RELOAD (3)    |           |         y (2)        |
//                   +-------------------+           +----------------------+
//                             |
//                   +-------------------+
//                   |         x (1)     | "tree"
//                   +-------------------+

There’s also a more in-depth example in coreclr/src/jit/morph.cpp

Also from roslyn/src/Compilers/VisualBasic/Portable/Semantics/TypeInference/RequiredConversion.vb

 '// These restrictions form a partial order composed of three chains: from less strict to more strict, we have:
'//    [reverse chain] [None] < AnyReverse < ReverseReference < Identity
'//    [middle  chain] None < [Any,AnyReverse] < AnyConversionAndReverse < Identity
'//    [forward chain] [None] < Any < ArrayElement < Reference < Identity
'//
'//            =           KEY:
'//         /  |  \           =     Identity
'//        /   |   \         +r     Reference
'//      -r    |    +r       -r     ReverseReference
'//       |  +-any  |       +-any   AnyConversionAndReverse
'//       |   /|\   +arr     +arr   ArrayElement
'//       |  / | \  |        +any   Any
'//      -any  |  +any       -any   AnyReverse
'//         \  |  /           none  None
'//          \ | /
'//           none
'//

Timelines

This example from coreclr/src/vm/comwaithandle.cpp was unique! I didn’t find another example of ASCII Art used to illustrate time-lines, it’s a really novel approach.

// In case the CLR is paused inbetween a wait, this method calculates how much 
// the wait has to be adjusted to account for the CLR Freeze. Essentially all
// pause duration has to be considered as "time that never existed".
//
// Two cases exists, consider that 10 sec wait is issued 
// Case 1: All pauses happened before the wait completes. Hence just the 
// pause time needs to be added back at the end of wait
// 0           3                   8       10
// |-----------|###################|------>
//                 5-sec pause    
//             ....................>
//                                            Additional 5 sec wait
//                                        |=========================> 
//
// Case 2: Pauses ended after the wait completes. 
// 3 second of wait was left as the pause started at 7 so need to add that back
// 0                           7           10
// |---------------------------|###########>
//                                 5-sec pause   12
//                             ...................>
//                                            Additional 3 sec wait
//                                                |==================> 
//
// Both cases can be expressed in the same calculation
// pauseTime:   sum of all pauses that were triggered after the timer was started
// expDuration: expected duration of the wait (without any pauses) 10 in the example
// actDuration: time when the wait finished. Since the CLR is frozen during pause it's
//              max of timeout or pause-end. In case-1 it's 10, in case-2 it's 12

Logic Tables

A sweet-spot for ASCII Art seems to be tables, there are so many examples. Starting with coreclr/src/vm/methodtablebuilder.cpp (bonus points for combining comments and code together!)

//               |        Base type
// Subtype       |        mdPrivateScope  mdPrivate   mdFamANDAssem   mdAssem     mdFamily    mdFamORAssem    mdPublic
// --------------+-------------------------------------------------------------------------------------------------------
/*mdPrivateScope | */ { { e_SM,           e_NO,       e_NO,           e_NO,       e_NO,       e_NO,           e_NO    },
/*mdPrivate      | */   { e_SM,           e_YES,      e_NO,           e_NO,       e_NO,       e_NO,           e_NO    },
/*mdFamANDAssem  | */   { e_SM,           e_YES,      e_SA,           e_NO,       e_NO,       e_NO,           e_NO    },
/*mdAssem        | */   { e_SM,           e_YES,      e_SA,           e_SA,       e_NO,       e_NO,           e_NO    },
/*mdFamily       | */   { e_SM,           e_YES,      e_YES,          e_NO,       e_YES,      e_NSA,          e_NO    },
/*mdFamORAssem   | */   { e_SM,           e_YES,      e_YES,          e_SA,       e_YES,      e_YES,          e_NO    },
/*mdPublic       | */   { e_SM,           e_YES,      e_YES,          e_YES,      e_YES,      e_YES,          e_YES   } };

Also coreclr/src/jit/importer.cpp which shows how the JIT deals with boxing/un-boxing

/*
    ----------------------------------------------------------------------
    | \ helper  |                         |                              |
    |   \       |                         |                              |
    |     \     | CORINFO_HELP_UNBOX      | CORINFO_HELP_UNBOX_NULLABLE  |
    |       \   | (which returns a BYREF) | (which returns a STRUCT)     |
    | opcode  \ |                         |                              |
    |---------------------------------------------------------------------
    | UNBOX     | push the BYREF          | spill the STRUCT to a local, |
    |           |                         | push the BYREF to this local |
    |---------------------------------------------------------------------
    | UNBOX_ANY | push a GT_OBJ of        | push the STRUCT              |
    |           | the BYREF               | For Linux when the           |
    |           |                         |  struct is returned in two   |
    |           |                         |  registers create a temp     |
    |           |                         |  which address is passed to  |
    |           |                         |  the unbox_nullable helper.  |
    |---------------------------------------------------------------------
*/

Finally, there’s some other nice examples showing the rules for operator overloading in the C# (Roslyn) Compiler and which .NET data-types can be converted via the System.ToXXX() functions.

Class Hierarchies

Of course, most IDE’s come with tools that will generate class-hierarchies for you, but it’s much nicer to see them in ASCII, from coreclr/src/vm/object.h

 * COM+ Internal Object Model
 *
 *
 * Object              - This is the common base part to all COM+ objects
 *  |                        it contains the MethodTable pointer and the
 *  |                        sync block index, which is at a negative offset
 *  |
 *  +-- code:StringObject       - String objects are specialized objects for string
 *  |                        storage/retrieval for higher performance
 *  |
 *  +-- BaseObjectWithCachedData - Object Plus one object field for caching.
 *  |       |
 *  |       +-  ReflectClassBaseObject    - The base object for the RuntimeType class
 *  |       +-  ReflectMethodObject       - The base object for the RuntimeMethodInfo class
 *  |       +-  ReflectFieldObject        - The base object for the RtFieldInfo class
 *  |
 *  +-- code:ArrayBase          - Base portion of all arrays
 *  |       |
 *  |       +-  I1Array    - Base type arrays
 *  |       |   I2Array
 *  |       |   ...
 *  |       |
 *  |       +-  PtrArray   - Array of OBJECTREFs, different than base arrays because of pObjectClass
 *  |              
 *  +-- code:AssemblyBaseObject - The base object for the class Assembly

There’s also an even larger one that I stumbled across when writing “Stack Walking” in the .NET Runtime.

Component Diagrams

When you have several different components in a code-base it’s always nice to see how they fit together. From coreclr/src/vm/codeman.h we can see how the top-level parts of the .NET JIT work together

                                               ExecutionManager
                                                       |
                           +-----------+---------------+---------------+-----------+--- ...
                           |           |                               |           |
                        CodeType       |                            CodeType       |
                           |           |                               |           |
                           v           v                               v           v
+---------------+      +--------+ +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           | Size of previous chunk (if P = 1)                             |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |P|
         | Size of this chunk                                         1| +-+
   mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         |                                                               |
         +-                                                             -+
         |                                                               |
         +-                                                             -+
         |                                                               :
         +-      size - sizeof(size_t) available payload bytes          -+
         :                                                               |
 chunk-> +-                                                             -+
         |                                                               |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |1|
       | Size of next chunk (may or may not be in use)               | +-+
 mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    And if it's free, it looks like this:

   chunk-> +-                                                             -+
           | User payload (must be in use, or we would have merged!)       |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |P|
         | Size of this chunk                                         0| +-+
   mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Next pointer                                                  |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Prev pointer                                                  |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         |                                                               :
         +-      size - sizeof(struct chunk) unused bytes               -+
         :                                                               |
 chunk-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
         | Size of this chunk                                            |
         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |0|
       | Size of next chunk (must be in use, or we would have merged)| +-+
 mem-> +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                                                               :
       +- User payload                                                -+
       :                                                               |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                                                     |0|
                                                                     +-+

Also, from corefx/src/Common/src/CoreLib/System/MemoryExtensions.cs we can see how overlapping memory regions are detected:

//  Visually, the two sequences are located somewhere in the 32-bit
//  address space as follows:
//
//      [----------------------------------------------)                            normal address space
//      0                                             2³²
//                            [------------------)                                  first sequence
//                            xRef            xRef + xLength
//              [--------------------------)     .                                  second sequence
//              yRef          .         yRef + yLength
//              :             .            .     .
//              :             .            .     .
//                            .            .     .
//                            .            .     .
//                            .            .     .
//                            [----------------------------------------------)      relative address space
//                            0            .     .                          2³²
//                            [------------------)             :                    first sequence
//                            x1           .     x2            :
//                            -------------)                   [-------------       second sequence
//                                         y2                  y1

State Machines

This comment from mono/benchmark/zipmark.cs gives a great over-view of the implementation of RFC 1951 - DEFLATE Compressed Data Format Specification

/*
 * The Deflater can do the following state transitions:
    *
    * (1) -> INIT_STATE   ----> INIT_FINISHING_STATE ---.
    *        /  | (2)      (5)                         |
    *       /   v          (5)                         |
    *   (3)| SETDICT_STATE ---> SETDICT_FINISHING_STATE |(3)
    *       \   | (3)                 |        ,-------'
    *        |  |                     | (3)   /
    *        v  v          (5)        v      v
    * (1) -> BUSY_STATE   ----> FINISHING_STATE
    *                                | (6)
    *                                v
    *                           FINISHED_STATE
    *    \_____________________________________/
    *          | (7)
    *          v
    *        CLOSED_STATE
    *
    * (1) If we should produce a header we start in INIT_STATE, otherwise
    *     we start in BUSY_STATE.
    * (2) A dictionary may be set only when we are in INIT_STATE, then
    *     we change the state as indicated.
    * (3) Whether a dictionary is set or not, on the first call of deflate
    *     we change to BUSY_STATE.
    * (4) -- intentionally left blank -- :)
    * (5) FINISHING_STATE is entered, when flush() is called to indicate that
    *     there is no more INPUT.  There are also states indicating, that
    *     the header wasn't written yet.
    * (6) FINISHED_STATE is entered, when everything has been flushed to the
    *     internal pending output buffer.
    * (7) At any time (7)
    *
    */

This might be pushing the definition of ‘state machine’ a bit far, but I wanted to include it because it shows just how complex ‘exception handling’ can be, from coreclr/src/jit/jiteh.cpp

// fgNormalizeEH: Enforce the following invariants:
//
//   1. No block is both the first block of a handler and the first block of a try. In IL (and on entry
//      to this function), this can happen if the "try" is more nested than the handler.
//
//      For example, consider:
//
//               try1 ----------------- BB01
//               |                      BB02
//               |--------------------- BB03
//               handler1
//               |----- try2 ---------- BB04
//               |      |               BB05
//               |      handler2 ------ BB06
//               |      |               BB07
//               |      --------------- BB08
//               |--------------------- BB09
//
//      Thus, the start of handler1 and the start of try2 are the same block. We will transform this to:
//
//               try1 ----------------- BB01
//               |                      BB02
//               |--------------------- BB03
//               handler1 ------------- BB10 // empty block
//               |      try2 ---------- BB04
//               |      |               BB05
//               |      handler2 ------ BB06
//               |      |               BB07
//               |      --------------- BB08
//               |--------------------- BB09
//

RFC’s and Specs

Next up, how the Kestrel web-server handles RFC 7540 - Hypertext Transfer Protocol Version 2 (HTTP/2).

Firstly, from aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2Frame.cs

/* https://tools.ietf.org/html/rfc7540#section-4.1
    +-----------------------------------------------+
    |                 Length (24)                   |
    +---------------+---------------+---------------+
    |   Type (8)    |   Flags (8)   |
    +-+-------------+---------------+-------------------------------+
    |R|                 Stream Identifier (31)                      |
    +=+=============================================================+
    |                   Frame Payload (0...)                      ...
    +---------------------------------------------------------------+
*/

and then in aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2Frame.Headers.cs

/* https://tools.ietf.org/html/rfc7540#section-6.2
    +---------------+
    |Pad Length? (8)|
    +-+-------------+-----------------------------------------------+
    |E|                 Stream Dependency? (31)                     |
    +-+-------------+-----------------------------------------------+
    |  Weight? (8)  |
    +-+-------------+-----------------------------------------------+
    |                   Header Block Fragment (*)                 ...
    +---------------------------------------------------------------+
    |                           Padding (*)                       ...
    +---------------------------------------------------------------+
*/

There are other notable examples in aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2FrameReader.cs and aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/Http2FrameWriter.cs.

Also RFC 3986 - Uniform Resource Identifier (URI) is discussed in corefx/src/Common/src/System/Net/IPv4AddressHelper.Common.cs

Finally, RFC 7541 - HPACK: Header Compression for HTTP/2, is covered in aspnet/AspNetCore/src/Servers/Kestrel/Core/src/Internal/Http2/HPack/HPackDecoder.cs

// http://httpwg.org/specs/rfc7541.html#rfc.section.6.1
//   0   1   2   3   4   5   6   7
// +---+---+---+---+---+---+---+---+
// | 1 |        Index (7+)         |
// +---+---------------------------+
private const byte IndexedHeaderFieldMask = 0x80;
private const byte IndexedHeaderFieldRepresentation = 0x80;

// http://httpwg.org/specs/rfc7541.html#rfc.section.6.2.1
//   0   1   2   3   4   5   6   7
// +---+---+---+---+---+---+---+---+
// | 0 | 1 |      Index (6+)       |
// +---+---+-----------------------+
private const byte LiteralHeaderFieldWithIncrementalIndexingMask = 0xc0;
private const byte LiteralHeaderFieldWithIncrementalIndexingRepresentation = 0x40;

// http://httpwg.org/specs/rfc7541.html#rfc.section.6.2.2
//   0   1   2   3   4   5   6   7
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 0 | 0 |  Index (4+)   |
// +---+---+-----------------------+
private const byte LiteralHeaderFieldWithoutIndexingMask = 0xf0;
private const byte LiteralHeaderFieldWithoutIndexingRepresentation = 0x00;

// http://httpwg.org/specs/rfc7541.html#rfc.section.6.2.3
//   0   1   2   3   4   5   6   7
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 0 | 1 |  Index (4+)   |
// +---+---+-----------------------+
private const byte LiteralHeaderFieldNeverIndexedMask = 0xf0;
private const byte LiteralHeaderFieldNeverIndexedRepresentation = 0x10;

// http://httpwg.org/specs/rfc7541.html#rfc.section.6.3
//   0   1   2   3   4   5   6   7
// +---+---+---+---+---+---+---+---+
// | 0 | 0 | 1 |   Max size (5+)   |
// +---+---------------------------+
private const byte DynamicTableSizeUpdateMask = 0xe0;
private const byte DynamicTableSizeUpdateRepresentation = 0x20;

// http://httpwg.org/specs/rfc7541.html#rfc.section.5.2
//   0   1   2   3   4   5   6   7
// +---+---+---+---+---+---+---+---+
// | H |    String Length (7+)     |
// +---+---------------------------+
private const byte HuffmanMask = 0x80;

Dates & Times

It is pretty widely accepted that dates and times are hard and that’s reflected in the amount of comments explaining different scenarios. For example from corefx/src/Common/src/CoreLib/System/TimeZoneInfo.cs

// startTime and endTime represent the period from either the start of DST to the end and
// ***does not include*** the potentially overlapped times
//
//         -=-=-=-=-=- Pacific Standard Time -=-=-=-=-=-=-
//    April 2, 2006                            October 29, 2006
// 2AM            3AM                        1AM              2AM
// |      +1 hr     |                        |       -1 hr      |
// |  |                        |  |
//                  [========== DST ========>)
//
//        -=-=-=-=-=- Some Weird Time Zone -=-=-=-=-=-=-
//    April 2, 2006                          October 29, 2006
// 1AM              2AM                    2AM              3AM
// |      -1 hr       |                      |       +1 hr      |
// |  |                      |    |
//                    [======== DST ========>)
//

Also, from corefx/src/Common/src/CoreLib/System/TimeZoneInfo.Unix.cs we see some details on how ‘leap-years’ are handled:

// should be n Julian day format which we don't support. 
// 
// This specifies the Julian day, with n between 0 and 365. February 29 is counted in leap years.
//
// n would be a relative number from the begining of the year. which should handle if the 
// the year is a leap year or not.
// 
// In leap year, n would be counted as:
// 
// 0                30 31              59 60              90      335            365
// |-------Jan--------|-------Feb--------|-------Mar--------|....|-------Dec--------|
//
// while in non leap year we'll have 
// 
// 0                30 31              58 59              89      334            364
// |-------Jan--------|-------Feb--------|-------Mar--------|....|-------Dec--------|
//
// 
// For example if n is specified as 60, this means in leap year the rule will start at Mar 1,
// while in non leap year the rule will start at Mar 2.
// 
// If we need to support n format, we'll have to have a floating adjustment rule support this case.

Finally, this comment from corefx/src/System.Runtime/tests/System/TimeZoneInfoTests.cs discusses invalid and ambiguous times that are covered in tests:

//    March 26, 2006                            October 29, 2006
// 2AM            3AM                        2AM              3AM
// |      +1 hr     |                        |       -1 hr      |
// |  |                        |  |
//                  *========== DST ========>*

//
// * 00:59:59 Sunday March 26, 2006 in Universal converts to
//   01:59:59 Sunday March 26, 2006 in Europe/Amsterdam (NO DST)
//
// * 01:00:00 Sunday March 26, 2006 in Universal converts to
//   03:00:00 Sunday March 26, 2006 in Europe/Amsterdam (DST)
//

Stack Layouts

To finish off, I wanted to look at ‘stack layouts’ because they seem to be a favourite of the .NET/Mono Runtime Engineers, there’s sooo many examples!

First-up, x68 from coreclr/src/jit/lclvars.cpp (you can also see the x64, ARM and ARM64 versions).

 *  The frame is laid out as follows for x86:
 *
 *              ESP frames                
 *
 *      |                       |         
 *      |-----------------------|         
 *      |       incoming        |         
 *      |       arguments       |         
 *      |-----------------------| stack_usage-4
 *   	spilled regs
 *   ------------------- sp + 
 *   	MonoLMF structure	optional
 *   ------------------- sp + cfg->arch.lmf_offset
 *   	saved registers		s0-s8
 *   ------------------- sp + cfg->arch.iregs_offset
 *   	locals
 *   ------------------- sp + cfg->param_area
 *   	param area		outgoing
 *   ------------------- sp + MIPS_STACK_PARAM_OFFSET
 *   	a0-a3			outgoing
 *   ------------------- sp
 *   	red zone
 */

Finally, there’s another example covering [DLLImport] callbacks and one more involving funclet frames in ARM64, I told you there were lots!!

The Rest

If you aren’t sick of ‘ASCII Art’ by now, here’s a few more examples for you to look at!!


Thu, 25 Apr 2019, 12:00 am


Is C# a low-level language?

I’m a massive fan of everything Fabien Sanglard does, I love his blog and I’ve read both his books cover-to-cover (for more info on his books, check out the recent Hansleminutes podcast).

Recently he wrote an excellent post where he deciphered a postcard sized raytracer, un-packing the obfuscated code and providing a fantastic explanation of the maths involved. I really recommend you take the time to read it!

But it got me thinking, would it be possible to port that C++ code to C#?

Partly because in my day job I’ve been having to write a fair amount of C++ recently and I’ve realised I’m a bit rusty, so I thought this might help!

But more significantly, I wanted to get a better insight into the question is C# a low-level language?

A slightly different, but related question is how suitable is C# for ‘systems programming’? For more on that I really recommend Joe Duffy’s excellent post from 2013.

Line-by-line port

I started by simply porting the un-obfuscated C++ code line-by-line to C#. Turns out that this was pretty straight forward, I guess the story about C# being C++++ is true after all!!

Let’s look at an example, the main data structure in the code is a ‘vector’, here’s the code side-by-side, C++ on the left and C# on the right:

So there’s a few syntax differences, but because .NET lets you define your own ‘Value Types’ I was able to get the same functionality. This is significant because treating the ‘vector’ as a struct means we can get better ‘data locality’ and the .NET Garbage Collector (GC) doesn’t need to be involved as the data will go onto the stack (probably, yes I know it’s an implementation detail).

For more info on structs or ‘value types’ in .NET see:

In particular that last post form Eric Lippert contains this helpful quote that makes it clear what ‘value types’ really are:

Surely the most relevant fact about value types is not the implementation detail of how they are allocated, but rather the by-design semantic meaning of “value type”, namely that they are always copied “by value”. If the relevant thing was their allocation details then we’d have called them “heap types” and “stack types”. But that’s not relevant most of the time. Most of the time the relevant thing is their copying and identity semantics.

Now lets look at how some other methods look side-by-side (again C++ on the left, C# on the right), first up RayTracing(..):

Next QueryDatabase(..):

(see Fabien’s post for an explanation of what these 2 functions are doing)

But the point is that again, C# lets us very easily write C++ code! In this case what helps us out the most is the ref keyword which lets us pass a value by reference. We’ve been able to use ref in method calls for quite a while, but recently there’s been a effort to allow ref in more places:

Now sometimes using ref can provide a performance boost because it means that the struct doesn’t need to be copied, see the benchmarks in Adam Sitniks post and Performance traps of ref locals and ref returns in C# for more information.

However what’s most important for this scenario is that it allows us to have the same behaviour in our C# port as the original C++ code. Although I want to point out that ‘Managed References’ as they’re known aren’t exactly the same as ‘pointers’, most notably you can’t do arithmetic on them, for more on this see:

Performance

So, it’s all well and good being able to port the code, but ultimately the performance also matters. Especially in something like a ‘ray tracer’ that can take minutes to run! The C++ code contains a variable called sampleCount that controls the final quality of the image, with sampleCount = 2 it looks like this:

Which clearly isn’t that realistic!

However once you get to sampleCount = 2048 things look a lot better:

But, running with sampleCount = 2048 means the rendering takes a long time, so all the following results were run with it set to 2, which means the test runs completed in ~1 minute. Changing sampleCount only affects the number of iterations of the outermost loop of the code, see this gist for an explanation.

Results after a ‘naive’ line-by-line port

To be able to give a meaningful side-by-side comparison of the C++ and C# versions I used the time-windows tool that’s a port of the Unix time command. My initial results looked this this:

  C++ (VS 2017) .NET Framework (4.7.2) .NET Core (2.2) Elapsed time (secs) 47.40 80.14 78.02 Kernel time 0.14 (0.3%) 0.72 (0.9%) 0.63 (0.8%) User time 43.86 (92.5%) 73.06 (91.2%) 70.66 (90.6%) page fault # 1,143 4,818 5,945 Working set (KB) 4,232 13,624 17,052 Paged pool (KB) 95 172 154 Non-paged pool 7 14 16 Page file size (KB) 1,460 10,936 11,024

So initially we see that the C# code is quite a bit slower than the C++ version, but it does get better (see below).

However lets first look at what the .NET JIT is doing for us even with this ‘naive’ line-by-line port. Firstly, it’s doing a nice job of in-lining the smaller ‘helper methods’, we can see this by looking at the output of the brilliant Inlining Analyzer tool (green overlay = inlined):

However, it doesn’t inline all methods, for example QueryDatabase(..) is skipped because of it’s complexity:

Another feature that the .NET Just-In-Time (JIT) compiler provides is converting specific methods calls into corresponding CPU instructions. We can see this in action with the sqrt wrapper function, here’s the original C# code (note the call to Math.Sqrt):

// intnv square root
public static Vec operator !(Vec q) {
    return q * (1.0f / (float)Math.Sqrt(q % q));
}

And here’s the assembly code that the .NET JIT generates, there’s no call to Math.Sqrt and it makes use of the vsqrtsd CPU instruction:

; Assembly listing for method Program:sqrtf(float):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   float  ->  mm0        
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M8216_IG01:
       vzeroupper 

G_M8216_IG02:
       vcvtss2sd xmm0, xmm0
       vsqrtsd  xmm0, xmm0
       vcvtsd2ss xmm0, xmm0

G_M8216_IG03:
       ret      

; Total bytes of code 16, prolog size 3 for method Program:sqrtf(float):float
; ============================================================

(to get this output you need to following these instructions, use the ‘Disasmo’ VS2019 Add-in or take a look at SharpLab.io)

These replacements are also known as ‘intrinsics’ and we can see the JIT generating them in the code below. This snippet just shows the mapping for AMD64, the JIT also targets X86, ARM and ARM64, the full method is here

bool Compiler::IsTargetIntrinsic(CorInfoIntrinsics intrinsicId)
{
#if defined(_TARGET_AMD64_) || (defined(_TARGET_X86_) && !defined(LEGACY_BACKEND))
    switch (intrinsicId)
    {
        // AMD64/x86 has SSE2 instructions to directly compute sqrt/abs and SSE4.1
        // instructions to directly compute round/ceiling/floor.
        //
        // TODO: Because the x86 backend only targets SSE for floating-point code,
        //       it does not treat Sine, Cosine, or Round as intrinsics (JIT32
        //       implemented those intrinsics as x87 instructions). If this poses
        //       a CQ problem, it may be necessary to change the implementation of
        //       the helper calls to decrease call overhead or switch back to the
        //       x87 instructions. This is tracked by #7097.
        case CORINFO_INTRINSIC_Sqrt:
        case CORINFO_INTRINSIC_Abs:
            return true;

        case CORINFO_INTRINSIC_Round:
        case CORINFO_INTRINSIC_Ceiling:
        case CORINFO_INTRINSIC_Floor:
            return compSupports(InstructionSet_SSE41);

        default:
            return false;
    }
    ...
}

As you can see, some methods are implemented like this, e.g. Sqrt and Abs, but for others the CLR instead uses the C++ runtime functions for instance powf.

This entire process is explained very nicely in How is Math.Pow() implemented in .NET Framework?, but we can also see it in action in the CoreCLR source:

Results after simple performance improvements

However, I wanted to see if my ‘naive’ line-by-line port could be improved, after some profiling I made two main changes:

  • Remove in-line array initialisation
  • Switch from Math.XXX(..) functions to the MathF.XXX() counterparts.

These changes are explained in more depth below

Remove in-line array initialisation

For more information about why this is necessary see this excellent Stack Overflow answer from Andrey Akinshin complete with benchmarks and assembly code! It comes to the following conclusion:

Conclusion

  • Does .NET caches hardcoded local arrays? Kind of: the Roslyn compiler put it in the metadata.
  • Do we have any overhead in this case? Unfortunately, yes: JIT will copy the array content from the metadata for each invocation; it will work longer than the case with a static array. Runtime also allocates objects and produce memory traffic.
  • Should we care about it? It depends. If it’s a hot method and you want to achieve a good level of performance, you should use a static array. If it’s a cold method which doesn’t affect the application performance, you probably should write “good” source code and put the array in the method scope.

You can see the change I made in this diff.

Using MathF functions instead of Math

Secondly and most significantly I got a big perf improvement by making the following changes:

#if NETSTANDARD2_1 || NETCOREAPP2_0 || NETCOREAPP2_1 || NETCOREAPP2_2 || NETCOREAPP3_0
    // intnv square root
    public static Vec operator !(Vec q) {
      return q * (1.0f / MathF.Sqrt(q % q));
    }
#else
    public static Vec operator !(Vec q) {
      return q * (1.0f / (float)Math.Sqrt(q % q));
    }
#endif

As of ‘.NET Standard 2.1’ there are now specific float implementations of the common maths functions, located in the System.MathF class. For more information on this API and it’s implementation see:

After these changes, the C# code is ~10% slower than the C++ version:

  C++ (VS C++ 2017) .NET Framework (4.7.2) .NET Core (2.2) TC OFF .NET Core (2.2) TC ON Elapsed time (secs) 41.38 58.89 46.04 44.33 Kernel time 0.05 (0.1%) 0.06 (0.1%) 0.14 (0.3%) 0.13 (0.3%) User time 41.19 (99.5%) 58.34 (99.1%) 44.72 (97.1%) 44.03 (99.3%) page fault # 1,119 4,749 5,776 5,661 Working set (KB) 4,136 13,440 16,788 16,652 Paged pool (KB) 89 172 150 150 Non-paged pool 7 13 16 16 Page file size (KB) 1,428 10,904 10,960 11,044

TC = Tiered Compilation (I believe that it’ll be on by default in .NET Core 3.0)

For completeness, here’s the results across several runs:

Run C++ (VS C++ 2017) .NET Framework (4.7.2) .NET Core (2.2) TC OFF .NET Core (2.2) TC ON TestRun-01 41.38 58.89 46.04 44.33 TestRun-02 41.19 57.65 46.23 45.96 TestRun-03 42.17 62.64 46.22 48.73

Note: the difference between .NET Core and .NET Framework is due to the lack of the MathF API in .NET Framework v4.7.2, for more info see Support .Net Framework (4.8?) for netstandard 2.1.

Further performance improvements

However I’m sure that others can do better!

If you’re interested in trying to close the gap the C# code is available. For comparison, you can see the assembly produced by the C++ compiler courtesy of the brilliant Compiler Explorer.

Finally, if it helps, here’s the output from the Visual Studio Profiler showing the ‘hot path’ (after the perf improvement described above):

Is C# a low-level language?

Or more specifically:

What language features of C#/F#/VB.NET or BCL/Runtime functionality enable ‘low-level’* programming?

* yes, I know ‘low-level’ is a subjective term 😊

Note: Any C# developer is going to have a different idea of what ‘low-level’ means, these features would be taken for granted by C++ or Rust programmers.

Here’s the list that I came up with:

  • ref returns and ref locals
    • “tl;dr Pass and return by reference to avoid large struct copying. It’s type and memory safe. It can be even faster than unsafe!
  • Unsafe code in .NET
    • “The core C# language, as defined in the preceding chapters, differs notably from C and C++ in its omission of pointers as a data type. Instead, C# provides references and the ability to create objects that are managed by a garbage collector. This design, coupled with other features, makes C# a much safer language than C or C++.”
  • Managed pointers in .NET
    • “There is, however, another pointer type in CLR – a managed pointer. It could be defined as a more general type of reference, which may point to other locations than just the beginning of an object.”
  • C# 7 Series, Part 10: Span and universal memory management
    • System.Span is a stack-only type (ref struct) that wraps all memory access patterns, it is the type for universal contiguous memory access. You can think the implementation of the Span contains a dummy reference and a length, accepting all 3 memory access types."
  • Interoperability (C# Programming Guide)
    • “The .NET Framework enables interoperability with unmanaged code through platform invoke services, the System.Runtime.InteropServices namespace, C++ interoperability, and COM interoperability (COM interop).”

However, I know my limitations and so I asked on twitter and got a lot more replies to add to the list:

  • Ben Adams “Platform intrinsics (CPU instruction access)”
  • Marc Gravell “SIMD via Vector (which mixes well with Span) is *fairly* low; .NET Core should (soon?) offer direct CPU intrinsics for more explicit usage targeting particular CPU ops"
  • Marc Gravell “powerful JIT: things like range elision on arrays/spans, and the JIT using per-struct-T rules to remove huge chunks of code that it knows can’t be reached for that T, or on your particular CPU (BitConverter.IsLittleEndian, Vector.IsHardwareAccelerated, etc)”
  • Kevin Jones “I would give a special shout-out to the MemoryMarshal and Unsafe classes, and probably a few other things in the System.Runtime.CompilerServices namespace.”
  • Theodoros Chatzigiannakis “You could also include __makeref and the rest.”
  • damageboy “Being able to dynamically generate code that fits the expected input exactly, given that the latter will only be known at runtime, and might change periodically?”
  • Robert Haken “dynamic IL emission”
  • Victor Baybekov “Stackalloc was not mentioned. Also ability to write raw IL (not dynamic, so save on a delegate call), e.g. to use cached ldftn and call them via calli. VS2017 has a proj template that makes this trivial via extern methods + MethodImplOptions.ForwardRef + ilasm.exe rewrite.”
  • Victor Baybekov “Also MethodImplOptions.AggressiveInlining “does enable ‘low-level’ programming” in a sense that it allows to write high-level code with many small methods and still control JIT behavior to get optimized result. Otherwise uncomposable 100s LOCs methods with copy-paste…”
  • Ben Adams “Using the same calling conventions (ABI) as the underlying platform and p/invokes for interop might be more of a thing though?”
  • Victor Baybekov “Also since you mentioned #fsharp - it does have inline keyword that does the job at IL level before JIT, so it was deemed important at the language level. C# lacks this (so far) for lambdas which are always virtual calls and workarounds are often weird (constrained generics).”
  • Alexandre Mutel “new SIMD intrinsics, Unsafe Utility class/IL post processing (e.g custom, Fody…etc.). For C#8.0, upcoming function pointers…”
  • Alexandre Mutel “related to IL, F# has support for direct IL within the language for example”
  • OmariO “BinaryPrimitives. Low-level but safe.” (https://docs.microsoft.com/en-us/dotnet/api/system.buffers.binary.binaryprimitives?view=netcore-3.0)
  • Kouji (Kozy) Matsui “How about native inline assembler? It’s difficult for how relation both toolchains and runtime, but can replace current P/Invoke solution and do inlining if we have it.”
  • Frank A. Krueger “Ldobj, stobj, initobj, initblk, cpyblk.”
  • Konrad Kokosa “Maybe Thread Local Storage? Fixed Size Buffers? unmanaged constraint and blittable types should be probably mentioned:)”
  • Sebastiano Mandalà “Just my two cents as everything has been said: what about something as simple as struct layout and how padding and memory alignment and order of the fields may affect the cache line performance? It’s something I have to investigate myself too”
  • Nino Floris “Constants embedding via readonlyspan, stackalloc, finalizers, WeakReference, open delegates, MethodImplOptions, MemoryBarriers, TypedReference, varargs, SIMD, Unsafe.AsRef can coerce struct types if layout matches exactly (used for a.o. TaskAwaiter and its version)"

So in summary, I would say that C# certainly lets you write code that looks a lot like C++ and in conjunction with the Runtime and Base-Class Libraries it gives you a lot of low-level functionality

Discuss this post on Hacker News, /r/programming, /r/dotnet or /r/csharp

Further Reading

The Unity ‘Burst’ Compiler:


Fri, 1 Mar 2019, 12:00 am


"Stack Walking" in the .NET Runtime

What is ‘stack walking’, well as always the ‘Book of the Runtime’ (BotR) helps us, from the relevant page:

The CLR makes heavy use of a technique known as stack walking (or stack crawling). This involves iterating the sequence of call frames for a particular thread, from the most recent (the thread’s current function) back down to the base of the stack.

The runtime uses stack walks for a number of purposes:

  • The runtime walks the stacks of all threads during garbage collection, looking for managed roots (local variables holding object references in the frames of managed methods that need to be reported to the GC to keep the objects alive and possibly track their movement if the GC decides to compact the heap).
  • On some platforms the stack walker is used during the processing of exceptions (looking for handlers in the first pass and unwinding the stack in the second).
  • The debugger uses the functionality when generating managed stack traces.
  • Various miscellaneous methods, usually those close to some public managed API, perform a stack walk to pick up information about their caller (such as the method, class or assembly of that caller).

The rest of this post will explore what ‘Stack Walking’ is, how it works and why so many parts of the runtime need to be involved.

Table of Contents

Where does the CLR use ‘Stack Walking’?

Before we dig into the ‘internals’, let’s take a look at where the runtime utilises ‘stack walking’, below is the full list (as of .NET Core CLR ‘Release 2.2’). All these examples end up calling into the Thread::StackWalkFrames(..) method here and provide a callback that is triggered whenever the API encounters a new section of the stack (see How to use it below for more info).

Common Scenarios

Debugging/Diagnostics

  • Debugger
  • Managed APIs (e.g System.Diagnostics.StackTrace)
    • Managed code calls via an InternalCall (C#) here into DebugStackTrace::GetStackFramesInternal(..) (C++) here
    • Before ending up in DebugStackTrace::GetStackFramesHelper(..) here -> callback
  • DAC (via by SOS) - Scan for GC ‘Roots’
  • Profiling API
    • ProfToEEInterfaceImpl::ProfilerStackWalkFramesWrapper(..) here -> callback
  • Event Pipe (Diagnostics)
  • CLR prints a Stack Trace (to the console/log, DEBUG builds only)

Obscure Scenarios

  • Reflection
  • Application (App) Domains (See ‘Stack Crawl Marks’ below)
    • SystemDomain::GetCallersMethod(..) here (also GetCallersType(..) and GetCallersModule(..)) (callback)
    • SystemDomain::GetCallersModule(..) here (callback)
  • ‘Code Pitching’
  • Extensible Class Factory (System.Runtime.InteropServices.ExtensibleClassFactory)
  • Stack Sampler (unused?)

Stack Crawl Marks

One of the above scenarios deserves a closer look, but firstly why are ‘stack crawl marks’ used, from coreclr/issues/#21629 (comment):

Unfortunately, there is a ton of legacy APIs that were added during netstandard2.0 push whose behavior depend on the caller. The caller is basically passed in as an implicit argument to the API. Most of these StackCrawlMarks are there to support these APIs…

So we can see that multiple functions within the CLR itself need to have knowledge of their caller. To understand this some more, let’s look an example, the GetType(string typeName) method. Here’s the flow from the externally-visible method all the way down to where the work is done, note how a StackCrawlMark instance is passed through:

  • Type::GetType(string typeName) implementation (Creates StackCrawlMark.LookForMyCaller)
  • RuntimeType::GetType(.., ref StackCrawlMark stackMark) implementation
  • RuntimeType::GetTypeByName(.., ref StackCrawlMark stackMark, ..) implementation
  • extern void GetTypeByName(.., ref StackCrawlMark stackMark, ..) definition (call into native code, i.e. [DllImport(JitHelpers.QCall, ..)])
  • RuntimeTypeHandle::GetTypeByName(.., QCall::StackCrawlMarkHandle pStackMark, ..) implementation
  • TypeHandle TypeName::GetTypeManaged(.., StackCrawlMark* pStackMark, ..) implementation
  • TypeHandle TypeName::GetTypeWorker(.. , StackCrawlMark* pStackMark, ..) implementation
  • SystemDomain::GetCallersAssembly(StackCrawlMark *stackMark,..) implementation
  • SystemDomain::GetCallersModule(StackCrawlMark* stackMark, ..) implementation
  • SystemDomain::CallersMethodCallbackWithStackMark(..) callback implementation

In addition the JIT (via the VM) has to ensure that all relevant methods are available in the call-stack, i.e. they can’t be removed:

However, the StackCrawlMark feature is currently being cleaned up, so it may look different in the future:

Exception Handling

The place that most .NET Developers will run into ‘stack traces’ is when dealing with exceptions. I originally intended to also describe ‘exception handling’ here, but then I opened up /src/vm/exceptionhandling.cpp and saw that it contained over 7,000 lines of code!! So I decided that it can wait for a future post 😁.

However, if you want to learn more about the ‘internals’ I really recommend Chris Brumme’s post The Exception Model (2003) which is the definitive guide on the topic (also see his Channel9 Videos) and as always, the ‘BotR’ chapter ‘What Every (Runtime) Dev needs to Know About Exceptions in the Runtime’ is well worth a read.

Also, I recommend talking a look at the slides from the ‘Internals of Exceptions’ talk’ and the related post .NET Inside Out Part 2 — Handling and rethrowing exceptions in C# both by Adam Furmanek.

The ‘Stack Walking’ API

Now that we’ve seen where it’s used, let’s look at the ‘stack walking’ API itself. Firstly, how is it used?

How to use it

It’s worth pointing out that the only way you can access it from C#/F#/VB.NET code is via the StackTrace class, only the runtime itself can call into Thread::StackWalkFrames(..) directly. The simplest usage in the runtime is EventPipe::WalkManagedStackForThread(..) (see here), which is shown below. As you can see it’s as simple as specifying the relevant flags, in this case ALLOW_ASYNC_STACK_WALK | FUNCTIONSONLY | HANDLESKIPPEDFRAMES | ALLOW_INVALID_OBJECTS and then providing the callback, which in the EventPipe class is the StackWalkCallback method (here)

bool EventPipe::WalkManagedStackForThread(Thread *pThread, StackContents &stackContents)
{
    CONTRACTL
    {
        NOTHROW;
        GC_NOTRIGGER;
        MODE_ANY;
        PRECONDITION(pThread != NULL);
    }
    CONTRACTL_END;

    // Calling into StackWalkFrames in preemptive mode violates the host contract,
    // but this contract is not used on CoreCLR.
    CONTRACT_VIOLATION( HostViolation );

    stackContents.Reset();

    StackWalkAction swaRet = pThread->StackWalkFrames(
        (PSTACKWALKFRAMESCALLBACK) &StackWalkCallback,
        &stackContents,
        ALLOW_ASYNC_STACK_WALK | FUNCTIONSONLY | HANDLESKIPPEDFRAMES | ALLOW_INVALID_OBJECTS);

    return ((swaRet == SWA_DONE) || (swaRet == SWA_CONTINUE));
}

The StackWalkFrame(..) function then does the heavy-lifting of actually walking the stack, before triggering the callback shown below. In this case it just records the ‘Instruction Pointer’ (IP/CP) and the ‘managed function’, which is an instance of the MethodDesc obtained via the pCf->GetFunction() call:

StackWalkAction EventPipe::StackWalkCallback(CrawlFrame *pCf, StackContents *pData)
{
    CONTRACTL
    {
        NOTHROW;
        GC_NOTRIGGER;
        MODE_ANY;
        PRECONDITION(pCf != NULL);
        PRECONDITION(pData != NULL);
    }
    CONTRACTL_END;

    // Get the IP.
    UINT_PTR controlPC = (UINT_PTR)pCf->GetRegisterSet()->ControlPC;
    if (controlPC == 0)
    {
        if (pData->GetLength() == 0)
        {
            // This happens for pinvoke stubs on the top of the stack.
            return SWA_CONTINUE;
        }
    }

    _ASSERTE(controlPC != 0);

    // Add the IP to the captured stack.
    pData->Append(controlPC, pCf->GetFunction());

    // Continue the stack walk.
    return SWA_CONTINUE;
}

How it works

Now onto the most interesting part, how to the runtime actually walks the stack. Well, first let’s understand what the stack looks like, from the ‘BotR’ page:

The main thing to note is that a .NET ‘stack’ can contain 3 types of methods:

  1. Managed - this represents code that started off as C#/F#/VB.NET, was turned into IL and then finally compiled to native code by the ‘JIT Compiler’.
  2. Unmanaged - completely native code that exists outside of the runtime, i.e. a OS function the runtime calls into or a user call via P/Invoke. The runtime only cares about transitions into or out of regular unmanaged code, is doesn’t care about the stack frame within it.
  3. Runtime Managed - still native code, but this is slightly different because the runtime case more about this code. For example there are quite a few parts of the Base-Class libraries that make use of InternalCall methods, for more on this see the ‘Helper Method’ Frames section later on.

So the ‘stack walk’ has to deal with these different scenarios as it proceeds. Now let’s look at the ‘code flow’ starting with the entry-point method StackWalkFrames(..):

  • Thread::StackWalkFrames(..) here
    • the entry-point function, the type of ‘stack walk’ can be controlled via these flags
  • Thread::StackWalkFramesEx(..) here
    • worker-function that sets up the StackFrameIterator, via a call to StackFrameIterator::Init(..) here
  • StackFrameIterator::Next() here, then hands off to the primary worker method StackFrameIterator::NextRaw() here that does 5 things:
    1. CheckForSkippedFrames(..) here, deals with frames that may have been allocated inside a managed stack frame (e.g. an inlined p/invoke call).
    2. UnwindStackFrame(..) here, in-turn calls:
      • x64 - Thread::VirtualUnwindCallFrame(..) here, then calls VirtualUnwindNonLeafCallFrame(..) here or VirtualUnwindLeafCallFrame(..) here. All of of these functions make use of the Windows API function RtlLookupFunctionEntry(..) to do the actual unwinding.
      • x86 - ::UnwindStackFrame(..) here, in turn calls UnwindEpilog(..) here and UnwindEspFrame(..) here. Unlike x64, under x86 all the ‘stack-unwinding’ is done manually, within the CLR code.
    3. PostProcessingForManagedFrames(..) here, determines if the stack-walk is actually within a managed method rather than a native frame.
    4. ProcessIp(..) here has the job of looking up the current managed method (if any) based on the current instruction pointer (IP). It does this by calling into EECodeInfo::Init(..) here and then ends up in one of:
      • EEJitManager::JitCodeToMethodInfo(..) here, that uses a very cool looking data structure refereed to as a ‘nibble map’
      • NativeImageJitManager::JitCodeToMethodInfo(..) here
      • ReadyToRunJitManager::JitCodeToMethodInfo(..) here
    5. ProcessCurrentFrame(..) here, does some final house-keeping and tidy-up.
  • CrawlFrame::GotoNextFrame() here
    • in-turn calls pFrame->Next() here to walk through the ‘linked list’ of frames which drive the ‘stack walk’ (more on these ‘frames’ later)
  • StackFrameIterator::Filter() here

When it gets a valid frame it triggers the callback in Thread::MakeStackwalkerCallback(..) here and passes in a pointer to the current CrawlFrame class defined here, this exposes methods such as IsFrameless(), GetFunction() and GetThisPointer(). The CrawlFrame actually represents 2 scenarios, based on the current IP:

  • Native code, represented by a Frame class defined here, which we’ll discuss more in a moment.
  • Managed code, well technically ‘managed code’ that was JITted to ‘native code’, so more accurately a managed stack frame. In this situation the MethodDesc class defined here is provided, you can read more about this key CLR data-structure in the corresponding BotR chapter.

See it ‘in Action’

Fortunately we’re able to turn on some nice diagnostics in a debug build of the CLR (COMPLUS_LogEnable, COMPLUS_LogToFile & COMPLUS_LogFacility). With that in place, given C# code like this:

internal class Program {
    private static void Main() {
        MethodA();
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private void MethodA() {
        MethodB();
    }
    
    [MethodImpl(MethodImplOptions.NoInlining)]
    private void MethodB() {
        MethodC();
    }
    
    [MethodImpl(MethodImplOptions.NoInlining)]
    private void MethodC() {
        var stackTrace = new StackTrace(fNeedFileInfo: true);
        Console.WriteLine(stackTrace.ToString());
    }
}

We get the output shown below, in which you can see the ‘stack walking’ process. It starts in InitializeSourceInfo and CaptureStackTrace which are methods internal to the StackTrace class (see here), before moving up the stack MethodC -> MethodB -> MethodA and then finally stopping in the Main function. Along the way its does a ‘FILTER’ and ‘CONSIDER’ step before actually unwinding (‘finished unwind for …’):

TID 4740: STACKWALK    starting with partial context
TID 4740: STACKWALK: [000] FILTER  : EXPLICIT : PC= 00000000`00000000  SP= 00000000`00000000  Frame= 00000002`9977cc48  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: [001] CONSIDER: EXPLICIT : PC= 00000000`00000000  SP= 00000000`00000000  Frame= 00000002`9977cc48  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: [001] FILTER  : EXPLICIT : PC= 00000000`00000000  SP= 00000000`00000000  Frame= 00000002`9977cc48  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: [002] CONSIDER: EXPLICIT : PC= 00000000`00000000  SP= 00000000`00000000  Frame= 00000002`9977cdd8  vtbl= 00007ffd`74995220 
TID 4740: STACKWALK    LazyMachState::unwindLazyState(ip:00007FFD7439C45C,sp:000000029977C338)
TID 4740: STACKWALK: [002] CALLBACK: EXPLICIT : PC= 00000000`00000000  SP= 00000000`00000000  Frame= 00000002`9977cdd8  vtbl= 00007ffd`74995220 
TID 4740: STACKWALK    HelperMethodFrame::UpdateRegDisplay cached ip:00007FFD72FE9258, sp:000000029977D300
TID 4740: STACKWALK: [003] CONSIDER: FRAMELESS: PC= 00007ffd`72fe9258  SP= 00000002`9977d300  method=InitializeSourceInfo 
TID 4740: STACKWALK: [003] CALLBACK: FRAMELESS: PC= 00007ffd`72fe9258  SP= 00000002`9977d300  method=InitializeSourceInfo 
TID 4740: STACKWALK: [004] about to unwind for 'InitializeSourceInfo', SP: 00000002`9977d300 , IP: 00007ffd`72fe9258 
TID 4740: STACKWALK: [004] finished unwind for 'InitializeSourceInfo', SP: 00000002`9977d480 , IP: 00007ffd`72eeb671 
TID 4740: STACKWALK: [004] CONSIDER: FRAMELESS: PC= 00007ffd`72eeb671  SP= 00000002`9977d480  method=CaptureStackTrace 
TID 4740: STACKWALK: [004] CALLBACK: FRAMELESS: PC= 00007ffd`72eeb671  SP= 00000002`9977d480  method=CaptureStackTrace 
TID 4740: STACKWALK: [005] about to unwind for 'CaptureStackTrace', SP: 00000002`9977d480 , IP: 00007ffd`72eeb671 
TID 4740: STACKWALK: [005] finished unwind for 'CaptureStackTrace', SP: 00000002`9977d5b0 , IP: 00007ffd`72eeadd0 
TID 4740: STACKWALK: [005] CONSIDER: FRAMELESS: PC= 00007ffd`72eeadd0  SP= 00000002`9977d5b0  method=.ctor 
TID 4740: STACKWALK: [005] CALLBACK: FRAMELESS: PC= 00007ffd`72eeadd0  SP= 00000002`9977d5b0  method=.ctor 
TID 4740: STACKWALK: [006] about to unwind for '.ctor', SP: 00000002`9977d5b0 , IP: 00007ffd`72eeadd0 
TID 4740: STACKWALK: [006] finished unwind for '.ctor', SP: 00000002`9977d5f0 , IP: 00007ffd`14c620d3 
TID 4740: STACKWALK: [006] CONSIDER: FRAMELESS: PC= 00007ffd`14c620d3  SP= 00000002`9977d5f0  method=MethodC 
TID 4740: STACKWALK: [006] CALLBACK: FRAMELESS: PC= 00007ffd`14c620d3  SP= 00000002`9977d5f0  method=MethodC 
TID 4740: STACKWALK: [007] about to unwind for 'MethodC', SP: 00000002`9977d5f0 , IP: 00007ffd`14c620d3 
TID 4740: STACKWALK: [007] finished unwind for 'MethodC', SP: 00000002`9977d630 , IP: 00007ffd`14c62066 
TID 4740: STACKWALK: [007] CONSIDER: FRAMELESS: PC= 00007ffd`14c62066  SP= 00000002`9977d630  method=MethodB 
TID 4740: STACKWALK: [007] CALLBACK: FRAMELESS: PC= 00007ffd`14c62066  SP= 00000002`9977d630  method=MethodB 
TID 4740: STACKWALK: [008] about to unwind for 'MethodB', SP: 00000002`9977d630 , IP: 00007ffd`14c62066 
TID 4740: STACKWALK: [008] finished unwind for 'MethodB', SP: 00000002`9977d660 , IP: 00007ffd`14c62016 
TID 4740: STACKWALK: [008] CONSIDER: FRAMELESS: PC= 00007ffd`14c62016  SP= 00000002`9977d660  method=MethodA 
TID 4740: STACKWALK: [008] CALLBACK: FRAMELESS: PC= 00007ffd`14c62016  SP= 00000002`9977d660  method=MethodA 
TID 4740: STACKWALK: [009] about to unwind for 'MethodA', SP: 00000002`9977d660 , IP: 00007ffd`14c62016 
TID 4740: STACKWALK: [009] finished unwind for 'MethodA', SP: 00000002`9977d690 , IP: 00007ffd`14c61f65 
TID 4740: STACKWALK: [009] CONSIDER: FRAMELESS: PC= 00007ffd`14c61f65  SP= 00000002`9977d690  method=Main 
TID 4740: STACKWALK: [009] CALLBACK: FRAMELESS: PC= 00007ffd`14c61f65  SP= 00000002`9977d690  method=Main 
TID 4740: STACKWALK: [00a] about to unwind for 'Main', SP: 00000002`9977d690 , IP: 00007ffd`14c61f65 
TID 4740: STACKWALK: [00a] finished unwind for 'Main', SP: 00000002`9977d6d0 , IP: 00007ffd`742f9073 
TID 4740: STACKWALK: [00a] FILTER  : NATIVE   : PC= 00007ffd`742f9073  SP= 00000002`9977d6d0 
TID 4740: STACKWALK: [00b] CONSIDER: EXPLICIT : PC= 00007ffd`742f9073  SP= 00000002`9977d6d0  Frame= 00000002`9977de58  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: [00b] FILTER  : EXPLICIT : PC= 00007ffd`742f9073  SP= 00000002`9977d6d0  Frame= 00000002`9977de58  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: [00c] CONSIDER: EXPLICIT : PC= 00007ffd`742f9073  SP= 00000002`9977d6d0  Frame= 00000002`9977e7e0  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: [00c] FILTER  : EXPLICIT : PC= 00007ffd`742f9073  SP= 00000002`9977d6d0  Frame= 00000002`9977e7e0  vtbl= 00007ffd`74a105b0 
TID 4740: STACKWALK: SWA_DONE: reached the end of the stack

To find out more, you can search for these diagnostic message in \vm\stackwalk.cpp, e.g. in Thread::DebugLogStackWalkInfo(..) here

Unwinding ‘Native’ Code

As explained in this excellent article:

There are fundamentally two main ways to implement exception propagation in an ABI (Application Binary Interface):

  • “dynamic registration”, with frame pointers in each activation record, organized as a linked list. This makes stack unwinding fast at the expense of having to set up the frame pointer in each function that calls other functions. This is also simpler to implement.

  • “table-driven”, where the compiler and assembler create data structures alongside the program code to indicate which addresses of code correspond to which sizes of activation records. This is called “Call Frame Information” (CFI) data in e.g. the GNU tool chain. When an exception is generated, the data in this table is loaded to determine how to unwind. This makes exception propagation slower but the general case faster.

It turns out that .NET uses the ‘table-driven’ approach, for the reason explained in the ‘BotR’:

The exact definition of a frame varies from platform to platform and on many platforms there isn’t a hard definition of a frame format that all functions adhere to (x86 is an example of this). Instead the compiler is often free to optimize the exact format of frames. On such systems it is not possible to guarantee that a stackwalk will return 100% correct or complete results (for debugging purposes, debug symbols such as pdbs are used to fill in the gaps so that debuggers can generate more accurate stack traces).

This is not a problem for the CLR, however, since we do not require a fully generalized stack walk. Instead we are only interested in those frames that are managed (i.e. represent a managed method) or, to some extent, frames coming from unmanaged code used to implement part of the runtime itself. In particular there is no guarantee about fidelity of 3rd party unmanaged frames other than to note where such frames transition into or out of the runtime itself (i.e. one of the frame types we do care about).

Frames

To enable ‘unwinding’ of native code or more strictly the transitions ‘into’ and ‘out of’ native code, the CLR uses a mechanism of Frames, which are defined in the source code here. These frames are arranged into a hierachy and there is one type of Frame for each scenario, for more info on these individual Frames take a look at the excellent source-code comments here.

  • Frame (abstract/base class)
    • GCFrame
    • FaultingExceptionFrame
    • HijackFrame
    • ResumableFrame
      • RedirectedThreadFrame
    • InlinedCallFrame
    • HelperMethodFrame
      • HelperMethodFrame_1OBJ
      • HelperMethodFrame_2OBJ
      • HelperMethodFrame_3OBJ
      • HelperMethodFrame_PROTECTOBJ
    • TransitionFrame
      • StubHelperFrame
      • SecureDelegateFrame
        • MulticastFrame
      • FramedMethodFrame
        • ComPlusMethodFrame
        • PInvokeCalliFrame
        • PrestubMethodFrame
        • StubDispatchFrame
        • ExternalMethodFrame
        • TPMethodFrame
    • UnmanagedToManagedFrame
      • ComMethodFrame
        • ComPrestubMethodFrame
      • UMThkCallFrame
    • ContextTransitionFrame
    • TailCallFrame
    • ProtectByRefsFrame
    • ProtectValueClassFrame
    • DebuggerClassInitMarkFrame
    • DebuggerSecurityCodeMarkFrame
    • DebuggerExitFrame
    • DebuggerU2MCatchHandlerFrame
    • FuncEvalFrame
    • ExceptionFilterFrame

‘Helper Method’ Frames

But to make sense of this, let’s look at one type of Frame, known as HelperMethodFrame (above). This is used when .NET code in the runtime calls into C++ code to do the heavy-lifting, often for performance reasons. One example is if you call Environment.GetCommandLineArgs() you end up in this code (C#), but note that it ends up calling an extern method marked with InternalCall:

[MethodImplAttribute(MethodImplOptions.InternalCall)]
private static extern string[] GetCommandLineArgsNative();

This means that the rest of the method is implemented in the runtime in C++, you can see how the method call is wired up, before ending up SystemNative::GetCommandLineArgs here, which is shown below:

FCIMPL0(Object*, SystemNative::GetCommandLineArgs)
{
    FCALL_CONTRACT;

    PTRARRAYREF strArray = NULL;

    HELPER_METHOD_FRAME_BEGIN_RET_1(strArray); // GetAppDomain());
    }
    delete [] argv;

    HELPER_METHOD_FRAME_END(); // CountOfUnwindCodes]), sizeof(ULONG));
    *pPersonalityRoutine = ExecutionManager::GetCLRPersonalityRoutineValue();
#elif defined(_TARGET_ARM64_)
    *(LONG *)pUnwindInfo |= (1    End offset   : 0x00004e (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x07
  CountOfUnwindCodes: 4
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x07 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 11 * 8 + 8 = 96 = 0x60
    CodeOffset: 0x03 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
    CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)
    CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbp (5)
Unwind Info:
  >> Start offset   : 0x00004e (not in unwind data)
  >>   End offset   : 0x0000e2 (not in unwind data)
  Version           : 1
  Flags             : 0x00
  SizeOfProlog      : 0x07
  CountOfUnwindCodes: 4
  FrameRegister     : none (0)
  FrameOffset       : N/A (no FrameRegister) (Value=0)
  UnwindCodes       :
    CodeOffset: 0x07 UnwindOp: UWOP_ALLOC_SMALL (2)     OpInfo: 5 * 8 + 8 = 48 = 0x30
    CodeOffset: 0x03 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rsi (6)
    CodeOffset: 0x02 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rdi (7)
    CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0)     OpInfo: rbp (5)

This ‘unwind info’ is then looked up during a ‘stack walk’ as explained in the How it works section above.

So next time you encounter a ‘stack trace’ remember that a lot of work went into making it possible!!

Further Reading

‘Stack Walking’ or ‘Stack Unwinding’ is a very large topic, so if you want to know more, here are some links to get you started:

Stack Unwinding (general)

Stack Unwinding (other runtimes)

In addition, it’s interesting to look at how other runtimes handles this process:


Mon, 21 Jan 2019, 12:00 am


Exploring the .NET Core Runtime (in which I set myself a challenge)

It seems like this time of year anyone with a blog is doing some sort of ‘advent calendar’, i.e. 24 posts leading up to Christmas. For instance there’s a F# one which inspired a C# one (C# copying from F#, that never happens 😉)

However, that’s a bit of a problem for me, I struggled to write 24 posts in my most productive year, let alone a single month! Also, I mostly blog about ‘.NET Internals’, a subject which doesn’t necessarily lend itself to the more ‘light-hearted’ posts you get in these ‘advent calendar’ blogs.

Until now!

Recently I’ve been giving a talk titled from ‘dotnet run’ to ‘hello world’, which attempts to explain everything that the .NET Runtime does from the point you launch your application till “Hello World” is printed on the screen:

From 'dotnet run' to 'hello world' from Matt Warren

But as I was researching and presenting this talk, it made me think about the .NET Runtime as a whole, what does it contain and most importantly what can you do with it?

Note: this is mostly for informational purposes, for the recommended way of achieving the same thing, take a look at this excellent Deep-dive into .NET Core primitives by Nate McMaster.

In this post I will explore what you can do using only the code in the dotnet/coreclr repository and along the way we’ll find out more about how the runtime interacts with the wider .NET Ecosystem.

To makes things clearer, there are 3 challenges that will need to be solved before a simple “Hello World” application can be run. That’s because in the dotnet/coreclr repository there is:

  1. No compiler, that lives in dotnet/Roslyn
  2. No Framework Class Library (FCL) a.k.a. ‘dotnet/CoreFX
  3. No dotnet run as it’s implemented in the dotnet/CLI repository

Building the CoreCLR

But before we even work through these ‘challenges’, we need to build the CoreCLR itself. Helpfully there is really nice guide available in ‘Building the Repository’:

The build depends on Git, CMake, Python and of course a C++ compiler. Once these prerequisites are installed the build is simply a matter of invoking the ‘build’ script (build.cmd or build.sh) at the base of the repository.

The details of installing the components differ depending on the operating system. See the following pages based on your OS. There is no cross-building across OS (only for ARM, which is built on X64). You have to be on the particular platform to build that platform.

If you follow these steps successfully, you’ll end up with the following files (at least on Windows, other OSes may produce something slightly different):

No Compiler

First up, how do we get around the fact that we don’t have a compiler? After all we need some way of turing our simple “Hello World” code into a .exe?

namespace Hello_World
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }
}

Fortunately we do have access to the ILASM tool (IL Assembler), which can turn Common Intermediate Language (CIL) into an .exe file. But how do we get the correct IL code? Well, one way is to write it from scratch, maybe after reading Inside NET IL Assembler and Expert .NET 2.0 IL Assembler by Serge Lidin (yes, amazingly, 2 books have been written about IL!)

Another, much easier way, is to use the amazing SharpLab.io site to do it for us! If you paste the C# code from above into it, you’ll get the following IL code:

.class private auto ansi ''
{
} // end of class 

.class private auto ansi beforefieldinit Hello_World.Program
    extends [mscorlib]System.Object
{
    // Methods
    .method private hidebysig static 
        void Main (
            string[] args
        ) cil managed 
    {
        // Method begins at RVA 0x2050
        // Code size 11 (0xb)
        .maxstack 8

        IL_0000: ldstr "Hello World!"
        IL_0005: call void [mscorlib]System.Console::WriteLine(string)
        IL_000a: ret
    } // end of method Program::Main

    .method public hidebysig specialname rtspecialname 
        instance void .ctor () cil managed 
    {
        // Method begins at RVA 0x205c
        // Code size 7 (0x7)
        .maxstack 8

        IL_0000: ldarg.0
        IL_0001: call instance void [mscorlib]System.Object::.ctor()
        IL_0006: ret
    } // end of method Program::.ctor

} // end of class Hello_World.Program

Then, if we save this to a file called ‘HelloWorld.il’ and run the cmd ilasm HelloWorld.il /out=HelloWorld.exe, we get the following output:

Microsoft (R) .NET Framework IL Assembler.  Version 4.5.30319.0
Copyright (c) Microsoft Corporation.  All rights reserved.
Assembling 'HelloWorld.il'  to EXE --> 'HelloWorld.exe'
Source file is ANSI

HelloWorld.il(38) : warning : Reference to undeclared extern assembly 'mscorlib'. Attempting autodetect
Assembled method Hello_World.Program::Main
Assembled method Hello_World.Program::.ctor
Creating PE file

Emitting classes:
Class 1:        Hello_World.Program

Emitting fields and methods:
Global
Class 1 Methods: 2;

Emitting events and properties:
Global
Class 1
Writing PE file
Operation completed successfully

Nice, so part 1 is done, we now have our HelloWorld.exe file!

No Base Class Library

Well, not exactly, one problem is that System.Console lives in dotnet/corefx, in there you can see the different files that make up the implementation, such as Console.cs, ConsolePal.Unix.cs, ConsolePal.Windows.cs, etc.

Fortunately, the nice CoreCLR developers included a simple Console implementation in System.Private.CoreLib.dll, the managed part of the CoreCLR, which was previously known as ‘mscorlib’ (before it was renamed). This internal version of Console is pretty small and basic, but it provides enough for what we need.

To use this ‘workaround’ we need to edit our HelloWorld.il to look like this (note the change from mscorlib to System.Private.CoreLib)

.class public auto ansi beforefieldinit C
       extends [System.Private.CoreLib]System.Object
{
    .method public hidebysig static void M () cil managed 
    {
        .entrypoint
        // Code size 11 (0xb)
        .maxstack 8

        IL_0000: ldstr "Hello World!"
        IL_0005: call void [System.Private.CoreLib]Internal.Console::WriteLine(string)
        IL_000a: ret
    } // end of method C::M
    ...
}

Note: You can achieve the same thing with C# code instead of raw IL, by invoking the C# compiler with the following cmd-line:

csc -optimize+ -nostdlib -reference:System.Private.Corelib.dll -out:HelloWorld.exe HelloWorld.cs

So we’ve completed part 2, we are able to at least print “Hello World” to the screen without using the CoreFX repository!

Now this is a nice little trick, but I wouldn’t ever recommend writing real code like this. Compiling against System.Private.CoreLib isn’t the right way of doing things. What the compiler normally does is compile against the publicly exposed surface area that lives in dotnet/corefx, but then at run-time a process called ‘Type-Forwarding’ is used to make that ‘reference’ implementation in CoreFX map to the ‘real’ implementation in the CoreCLR. For more on this entire process see The Rough History of Referenced Assemblies.

However, only a small amount of managed code (i.e. C#) actually exists in the CoreCLR, to show this, the directory tree for /dotnet/coreclr/src/System.Private.CoreLib is available here and the tree with all ~1280 .cs files included is here.

As a concrete example, if you look in CoreFX, you’ll see that the System.Reflection implementation is pretty empty! That’s because it’s a ‘partial facade’ that is eventually ‘type-forwarded’ to System.Private.CoreLib.

If you’re interested, the entire API that is exposed in CoreFX (but actually lives in CoreCLR) is contained in System.Runtime.cs. But back to our example, here is the code that describes all the GetMethod(..) functions in the ‘System.Reflection’ API.

To learn more about ‘type forwarding’, I recommend watching ‘.NET Standard - Under the Hood’ (slides) by Immo Landwerth and there is also some more in-depth information in ‘Evolution of design time assemblies’.

But why is this code split useful, from the CoreFX README:

Runtime-specific library code (mscorlib) lives in the CoreCLR repo. It needs to be built and versioned in tandem with the runtime. The rest of CoreFX is agnostic of runtime-implementation and can be run on any compatible .NET runtime (e.g. CoreRT).

And from the other point-of-view, in the CoreCLR README:

By itself, the Microsoft.NETCore.Runtime.CoreCLR package is actually not enough to do much. One reason for this is that the CoreCLR package tries to minimize the amount of the class library that it implements. Only types that have a strong dependency on the internal workings of the runtime are included (e.g, System.Object, System.String, System.Threading.Thread, System.Threading.Tasks.Task and most foundational interfaces).

Instead most of the class library is implemented as independent NuGet packages that simply use the .NET Core runtime as a dependency. Many of the most familiar classes (System.Collections, System.IO, System.Xml and so on), live in packages defined in the dotnet/corefx repository.

One huge benefit of this approach is that Mono can share large amounts of the CoreFX code, as shown in this tweet:

How Mono reuses .NET Core sources for BCL (doesn't include runtime, tools, etc) according to my calculations 🙂 pic.twitter.com/8JCDxqwnNi

— Egor Bogatov (@EgorBo) March 27, 2018

No Launcher

So far we’ve ‘compiled’ our code (well technically ‘assembled’ it) and we’ve been able to access a simple version of System.Console, but how do we actually run our .exe? Remember we can’t use the dotnet run command because that lives in the dotnet/CLI repository (and that would be breaking the rules of this slightly contrived challenge!!).

Again, fortunately those clever runtime engineers have thought of this exact scenario and they built the very helpful corerun application. You can read more about in Using corerun To Run .NET Core Application, but the td;dr is that it will only look for dependencies in the same folder as your .exe.

So, to complete the challenge, we can now run CoreRun HelloWorld.exe:

# CoreRun HelloWorld.exe
Hello World!

Yay, the least impressive demo you’ll see this year!!

For more information on how you can ‘host’ the CLR in your application I recommend this excellent tutorial Write a custom .NET Core host to control the .NET runtime from your native code. In addition, the docs page on ‘Runtime Hosts’ gives a nice overview of the different hosts that are available:

The .NET Framework ships with a number of different runtime hosts, including the hosts listed in the following table.

Runtime Host Description ASP.NET Loads the runtime into the process that is to handle the Web request. ASP.NET also creates an application domain for each Web application that will run on a Web server. Microsoft Internet Explorer Creates application domains in which to run managed controls. The .NET Framework supports the download and execution of browser-based controls. The runtime interfaces with the extensibility mechanism of Microsoft Internet Explorer through a mime filter to create application domains in which to run the managed controls. By default, one application domain is created for each Web site. Shell executables Invokes runtime hosting code to transfer control to the runtime each time an executable is launched from the shell.

Thu, 13 Dec 2018, 12:00 am


Open Source .NET – 4 years later

A little over 4 years ago Microsoft announced that they were open sourcing large parts of the .NET framework and as this slide from New Features in .NET Core and ASP.NET Core 2.1 shows, the community has been contributing in a significant way:

Side-note: This post forms part of an on-going series, if you want to see how things have changed over time you can check out the previous ones:

Runtime Changes

Before I look at the numbers, I just want to take a moment to look at the significant runtime changes that have taken place over the last 4 years. Partly because I really like looking at the ‘Internals’ of CoreCLR, but also because the runtime is the one repository that makes all the others possible, they rely on it!

To give some context, here’s the slides from a presentation I did called ‘From ‘dotnet run’ to ‘hello world’. If you flick through them you’ll see what components make up the CoreCLR code-base and what they do to make your application run.

From 'dotnet run' to 'hello world' from Matt Warren

So, after a bit of digging through the 19,059 commits, 5,790 issues and the 8 projects, here’s the list of significant changes in the .NET Core Runtime (CoreCLR) over the last few years (if I’ve missed any out, please let me know!!):

So there’s been quite a few large, fundamental changes to the runtime since it’s been open-sourced.

Repository activity over time

But onto the data, first we are going to look at an overview of the level of activity in each repo, by analysing the total number of ‘Issues’ (created) or ‘Pull Requests’ (closed) per month. (Sparklines FTW!!). If you are interested in how I got the data, see the previous post because the process is the same.

Note: Numbers in black are from the most recent month, with the red dot showing the lowest and the green dot the highest previous value. You can toggle between Issues and Pull Requests by clicking on the buttons, hover over individual sparklines to get a tooltip showing the per/month values and click on the project name to take you to the GitHub page for that repository.

Issues Pull Requests

This data gives a good indication of how healthy different repos are, are they growing over time, or staying the same. You can also see the different levels of activity each repo has and how they compare to other ones.

Whilst it’s clear that Visual Studio Code is way ahead of all the other repos (in ‘# of Issues’), it’s interesting to see that some of the .NET-only ones are still pretty large, notably CoreFX (base-class libraries), Roslyn (compiler) and CoreCLR (runtime).

Overall Participation - Community v. Microsoft

Next will will look at the total participation from the last 4 years, i.e. November 2014 to November 2018. All Pull Requests and Issues are treated equally, so a large PR counts the same as one that fixes a speling mistake. Whilst this isn’t ideal it’s the simplest way to get an idea of the Microsoft/Community split. In addition, Community does include people paid by other companies to work on .NET Projects, for instance Samsung Engineers.

Note: You can hover over the bars to get the actual numbers, rather than percentages.

Issues: Microsoft Community Pull Requests: Microsoft Community

Participation over time - Community v. Microsoft

Finally we can see the ‘per-month’ data from the last 4 years, i.e. November 2014 to November 2018.

Note: You can inspect different repos by selecting them from the pull-down list, but be aware that the y-axis on the graphs are re-scaled, so the maximum value will change each time.

Issues: Microsoft Community Pull Requests: Microsoft Community

Summary

It’s clear that the community continues to be invested in the .NET-related, Open Source repositories, contributing significantly and for a sustained period of time. I think this is good for all .NET developers, whether you contribute to OSS or not, having .NET be a thriving, Open Source product has many benefits!


Tue, 4 Dec 2018, 12:00 am


A History of .NET Runtimes

Recently I was fortunate enough to chat with Chris Bacon who wrote DotNetAnywhere (an alternative .NET Runtime) and I quipped with him:

.. you’re probably one of only a select group(*) of people who’ve written a .NET runtime, that’s pretty cool!

* if you exclude people who were paid to work on one, i.e. Microsoft/Mono/Xamarin engineers, it’s a very select group.

But it got me thinking, how many .NET Runtimes are there? I put together my own list, then enlisted a crack team of highly-paid researchers, a.k.a my twitter followers:

#LazyWeb, fun Friday quiz, how many different .NET Runtimes are there? (that implement ECMA-335 https://t.co/76stuYZLrw)
- .NET Framework
- .NET Core
- Mono
- Unity
- .NET Compact Framework
- DotNetAnywhere
- Silverlight
What have I missed out?

— Matt Warren (@matthewwarren) September 14, 2018

For the purposes of this post I’m classifying a ‘.NET Runtime’ as anything that implements the ECMA-335 Standard for .NET (more info here). I don’t know if there’s a more precise definition or even some way of officially veryifying conformance, but in practise it means that the runtimes can take a .NET exe/dll produced by any C#/F#/VB.NET compiler and run it.

Once I had the list, I made copious use of wikipedia (see the list of ‘References’) and came up with the following timeline:

Timeline maker

(If the interactive timeline isn’t working for you, take a look at this version)

If I’ve missed out any runtimes, please let me know!

To make the timeline a bit easier to understand, I put each runtime into one of the following categories:

  1. Microsoft .NET Frameworks
  2. Other Microsoft Runtimes
  3. Mono/Xamarin Runtimes
  4. 'Ahead-of-Time' (AOT) Runtimes
  5. Community Projects
  6. Research Projects

The rest of the post will look at the different runtimes in more detail. Why they were created, What they can do and How they compare to each other.

Microsoft .NET Frameworks

The original ‘.NET Framework’ was started by Microsoft in the late 1990’s and has been going strong ever since. Recently they’ve changed course somewhat with the announcement of .NET Core, which is ‘open-source’ and ‘cross-platform’. In addition, by creating the .NET Standard they’ve provided a way for different runtimes to remain compatible:

.NET Standard is for sharing code. .NET Standard is a set of APIs that all .NET implementations must provide to conform to the standard. This unifies the .NET implementations and prevents future fragmentation.

As an aside, if you want more information on the ‘History of .NET’, I really recommend Anders Hejlsberg - What brought about the birth of the CLR? and this presentation by Richard Campbell who really knows how to tell a story!

(Also available as a podcast if you’d prefer and he’s working on a book covering the same subject. If you want to learn more about the history of the entire ‘.NET Ecosystem’ not just the Runtimes, check out ‘Legends of .NET’)

Other Microsoft Runtimes

But outside of the main general purpose ‘.NET Framework’, Microsoft have also released other runtimes, designed for specific scenarios.

.NET Compact Framework

The Compact (.NET CF) and Micro (.NET MF) Frameworks were both attempts to provide cut-down runtimes that would run on more constrained devices, for instance .NET CF:

… is designed to run on resource constrained mobile/embedded devices such as personal digital assistants (PDAs), mobile phones factory controllers, set-top boxes, etc. The .NET Compact Framework uses some of the same class libraries as the full .NET Framework and also a few libraries designed specifically for mobile devices such as .NET Compact Framework controls. However, the libraries are not exact copies of the .NET Framework; they are scaled down to use less space.

.NET Micro Framework

The .NET MF is even more constrained:

… for resource-constrained devices with at least 256 KB of flash and 64 KB of random-access memory (RAM). It includes a small version of the .NET Common Language Runtime (CLR) and supports development in C#, Visual Basic .NET, and debugging (in an emulator or on hardware) using Microsoft Visual Studio. NETMF features a subset of the .NET base class libraries (about 70 classes with about 420 methods),.. NETMF also features added libraries specific to embedded applications. It is free and open-source software released under Apache License 2.0.

If you want to try it out, Scott Hanselman did a nice write-up The .NET Micro Framework - Hardware for Software People.

Silverlight

Although now only in support mode (or ‘dead’/‘sunsetted’ depending on your POV), it’s interesting to go back to the original announcement and see what Silverlight was trying to do:

Silverlight is a cross platform, cross browser .NET plug-in that enables designers and developers to build rich media experiences and RIAs for browsers. The preview builds we released this week currently support Firefox, Safari and IE browsers on both the Mac and Windows.

Back in 2007, Silverlight 1.0 had the following features (it even worked on Linux!):

  • Built-in codec support for playing VC-1 and WMV video, and MP3 and WMA audio within a browser…
  • Silverlight supports the ability to progressively download and play media content from any web-server…
  • Silverlight also optionally supports built-in media streaming…
  • Silverlight enables you to create rich UI and animations, and blend vector graphics with HTML to create compelling content experiences…
  • Silverlight makes it easy to build rich video player interactive experiences…

Mono/Xamarin Runtimes

Mono came about when Miguel de Icaza and others explored the possibility of making .NET work on Linux (from Mono early history):

Who came first is not an important question to me, because Mono to me is a means to an end: a technology to help Linux succeed on the desktop.

The same post also talks about how it started:

On the Mono side, the events were approximately like this:

As soon as the .NET documents came out in December 2000, I got really interested in the technology, and started where everyone starts: at the byte code interpreter, but I faced a problem: there was no specification for the metadata though.

The last modification to the early VM sources was done on January 22 2001, around that time I started posting to the .NET mailing lists asking for the missing information on the metadata file format.

About this time Sam Ruby was pushing at the ECMA committee to get the binary file format published, something that was not part of the original agenda. I do not know how things developed, but by April 2001 ECMA had published the file format.

Over time, Mono (now Xamarin) has branched out into wider areas. It runs on Android and iOS/Mac and was acquired by Microsoft in Feb 2016. In addition Unity & Mono/Xamarim have long worked together, to provide C# support in Unity and Unity is now a member of the .NET Foundation.

'Ahead-of-Time' (AOT) Runtimes

I wanted to include AOT runtimes as a seperate category, because traditionally .NET has been ‘Just-in-Time’ Compiled, but over time more and more ‘Ahead-of-Time’ compilation options have been available.

As far as I can tell, Mono was the first, with an ‘AOT’ mode since Aug 2006, but recently, Microsoft have released .NET Native and are they’re working on CoreRT - A .NET Runtime for AOT.

Community Projects

However, not all ‘.NET Runtimes’ were developed by Microsoft, or companies that they later acquired. There are some ‘Community’ owned ones:

  • The oldest is DotGNU Portable.NET, which started at the same time as Mono, with the goal ‘to build a suite of Free Software tools to compile and execute applications for the Common Language Infrastructure (CLI)..’.
  • Secondly, there is DotNetAnywhere, the work of just one person, Chris Bacon. DotNetAnywhere has the claim to fame that it provided the initial runtime for the Blazor project. However it’s also an excellent resource if you want to look at what makes up a ‘.NET Compatible-Runtime’ and don’t have the time to wade through the millions of lines-of-code that make up the CoreCLR!
  • Next comes CosmosOS (GitHub project), which is not just a .NET Runtime, but a ‘Managed Operating System’. If you want to see how it achieves this I recommend reading through the excellent FAQ or taking a quick look under the hood. Another similar effort is SharpOS.
  • Finally, I recently stumbled across CrossNet, which takes a different approach, it ‘parses .NET assemblies and generates unmanaged C++ code that can be compiled on any standard C++ compiler.’ Take a look at the overview docs and example of generated code to learn more.

Research Projects

Finally, onto the more esoteric .NET Runtimes. These are the Research Projects run by Microsoft, with the aim of seeing just how far can you extend a ‘managed runtime’, what can they be used for. Some of this research work has made it’s way back into commercial/shipping .NET Runtimes, for instance Span came from Midori.

Shared Source Common Language Infrastructure (SSCLI) (a.k.a ‘Rotor):

is Microsoft’s shared source implementation of the CLI, the core of .NET. Although the SSCLI is not suitable for commercial use due to its license, it does make it possible for programmers to examine the implementation details of many .NET libraries and to create modified CLI versions. Microsoft provides the Shared Source CLI as a reference CLI implementation suitable for educational use.

An interesting side-effect of releasing Rotor is that they were also able to release the ‘Gyro’ Project, which gives an idea of how Generics were added to the .NET Runtime.

Midori:

Midori was the code name for a managed code operating system being developed by Microsoft with joint effort of Microsoft Research. It had been reported to be a possible commercial implementation of the Singularity operating system, a research project started in 2003 to build a highly dependable operating system in which the kernel, device drivers, and applications are all written in managed code. It was designed for concurrency, and could run a program spread across multiple nodes at once. It also featured a security model that sandboxes applications for increased security. Microsoft had mapped out several possible migration paths from Windows to Midori. The operating system was discontinued some time in 2015, though many of its concepts were rolled into other Microsoft projects.

Midori is the project that appears to have led to the most ideas making their way back into the ‘.NET Framework’, you can read more about this in Joe Duffy’s excellent series Blogging about Midori

  1. A Tale of Three Safeties
  2. Objects as Secure Capabilities
  3. Asynchronous Everything
  4. Safe Native Code
  5. The Error Model
  6. Performance Culture
  7. 15 Years of Concurrency

Singularity (operating system) (also Singularity RDK)

Singularity is an experimental operating system (OS) which was built by Microsoft Research between 2003 and 2010. It was designed as a high dependability OS in which the kernel, device drivers, and application software were all written in managed code. Internal security uses type safety instead of hardware memory protection.

Last, but not least, there is Redhawk:

Codename for experimental minimal managed code runtime that evolved into CoreRT.

References

Below are the Wikipedia articles I referenced when creating the timeline:


Tue, 2 Oct 2018, 12:00 am


Fuzzing the .NET JIT Compiler

I recently came across the excellent ‘Fuzzlyn’ project, created as part of the ‘Language-Based Security’ course at Aarhus University. As per the project description Fuzzlyn is a:

… fuzzer which utilizes Roslyn to generate random C# programs

And what is a ‘fuzzer’, from the Wikipedia page for ‘fuzzing:

Fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a computer program.

Or in other words, a fuzzer is a program that tries to create source code that finds bugs in a compiler.

Massive kudos to the developers behind Fuzzlyn, Jakob Botsch Nielsen (who helped answer my questions when writing this post), Chris Schmidt and Jonas Larsen, it’s an impressive project!! (to be clear, I have no link with the project and can’t take any of the credit for it)

Compilation in .NET

But before we dive into ‘Fuzzlyn’ and what it does, we’re going to take a quick look at ‘compilation’ in the .NET Framework. When you write C#/VB.NET/F# code (delete as appropriate) and compile it, the compiler converts it into Intermediate Language (IL) code. The IL is then stored in a .exe or .dll, which the Common Language Runtime (CLR) reads and executes when your program is actually run. However it’s the job of the Just-in-Time (JIT) Compiler to convert the IL code into machine code.

Why is this relevant? Because Fuzzlyn works by comparing the output of a Debug and a Release version of a program and if they are different, there’s a bug! But it turns out that very few optimisations are actually done by the ‘Roslyn’ compiler, compared to what the JIT does, from Eric Lippert’s excellent post What does the optimize switch do? (2009)

The /optimize flag does not change a huge amount of our emitting and generation logic. We try to always generate straightforward, verifiable code and then rely upon the jitter to do the heavy lifting of optimizations when it generates the real machine code. But we will do some simple optimizations with that flag set. For example, with the flag set:

He then goes on to list the 15 things that the C# Compiler will optimise, before finishing with this:

That’s pretty much it. These are very straightforward optimizations; there’s no inlining of IL, no loop unrolling, no interprocedural analysis whatsoever. We let the jitter team worry about optimizing the heck out of the code when it is actually spit into machine code; that’s the place where you can get real wins.

So in .NET, very few of the techniques that an ‘Optimising Compiler’ uses are done at compile-time. They are almost all done at run-time by the JIT Compiler (leaving aside AOT scenarios for the time being).

For reference, most of the differences in IL are there to make the code easier to debug, for instance given this C# code:

public void M() {
    foreach (var item in new [] { 1, 2, 3, 4 }) {
      Console.WriteLine(item);
    }
}

The differences in IL are shown below (‘Release’ on the left, ‘Debug’ on the right). As you can see there are a few extra nop instructions to allow the debugger to ‘step-through’ more locations in the code, plus an extra local variable, which makes it easier/possible to see the value when debugging.

(click for larger image or you can view the ‘Release’ version and the ‘Debug’ version on the excellent SharpLab)

For more information on the differences in Release/Debug code-gen see the ‘Release (optimized)’ section in this doc on CodeGen Differences. Also, because Roslyn is open-source we can see how this is handled in the code:

This all means that the ‘Fuzzlyn’ project has actually been finding bugs in the .NET JIT, not in the Roslyn Compiler

(well, except this one Finally block belonging to unexecuted try runs anyway, which was fixed here)

How it works

At the simplest level, Fuzzlyn works by compiling and running a piece of randomly generated code in ‘Debug’ and ‘Release’ versions and comparing the output. If the 2 versions produce different results, then it’s a bug, specifically a bug in the optimisations that the JIT compiler has attempted.

The .NET JIT, known as ‘RyuJIT’, has several modes. It can produce fully optimised code that has the highest-performance, or in can produce more ‘debug’ friendly code that has no optimisations, but is much simpler. You can find out more about the different ‘optimisations’ that RyuJIT performs in this excellent tutorial, in this design doc or you can search through the code for usages of the ‘compDbgCode’ flag.

From a high-level Fuzzlyn goes through the following steps:

  1. Randomly generate a C# program
  2. Check if the code produces an error (Debug v. Release)
  3. Reduce the code to it’s simplest form

If you want to see this in action, I ran Fuzzlyn until it produced a randomly generated program with a bug. You can see the original source (6,802 LOC) and the reduced version (28 LOC). What’s interesting is that you can clearly see the buggy line-of-code in the original code, before it’s turned into a simplified version:

// Generated by Fuzzlyn v1.1 on 2018-08-22 15:19:26
// Seed: 14928117313359926641
// Reduced from 256.3 KiB to 0.4 KiB in 00:01:58
// Debug: Prints 0 line(s)
// Release: Prints 1 line(s)
public class Program
{
    static short s_18;
    static byte s_33 = 1;
    static int[] s_40 = new int[]{0};
    static short s_74 = 1;
    public static void Main()
    {
        s_18 = -1;
        // This comparision is the bug, in Debug it's False, in Release it's True
        // However, '(ushort)(s_18 | 2L)' is 65,535 in Debug *and* Release
        if (((ushort)(s_18 | 2L)

Tue, 28 Aug 2018, 12:00 am


Monitoring and Observability in the .NET Runtime

.NET is a managed runtime, which means that it provides high-level features that ‘manage’ your program for you, from Introduction to the Common Language Runtime (CLR) (written in 2007):

The runtime has many features, so it is useful to categorize them as follows:

  1. Fundamental features – Features that have broad impact on the design of other features. These include:
    1. Garbage Collection
    2. Memory Safety and Type Safety
    3. High level support for programming languages.
  2. Secondary features – Features enabled by the fundamental features that may not be required by many useful programs:
    1. Program isolation with AppDomains
    2. Program Security and sandboxing
  3. Other Features – Features that all runtime environments need but that do not leverage the fundamental features of the CLR. Instead, they are the result of the desire to create a complete programming environment. Among them are:
    1. Versioning
    2. Debugging/Profiling
    3. Interoperation

You can see that ‘Debugging/Profiling’, whilst not a Fundamental or Secondary feature, still makes it into the list because of a ‘desire to create a complete programming environment’.

The rest of this post will look at what Monitoring, Observability and Introspection features the Core CLR provides, why they’re useful and how it provides them.

To make it easier to navigate, the post is split up into 3 main sections (with some ‘extra-reading material’ at the end):

Diagnostics

Firstly we are going to look at the diagnostic information that the CLR provides, which has traditionally been supplied via ‘Event Tracing for Windows’ (ETW).

There is quite a wide range of events that the CLR provides related to:

  • Garbage Collection (GC)
  • Just-in-Time (JIT) Compilation
  • Module and AppDomains
  • Threading and Lock Contention
  • and much more

For example this is where the AppDomain Load event is fired, this is the Exception Thrown event and here is the GC Allocation Tick event.

Perf View

If you want to see the ETW Events coming from your .NET program I recommend using the excellent PerfView tool and starting with these PerfView Tutorials or this excellent talk PerfView: The Ultimate .NET Performance Tool. PerfView is widely regarded because it provides invaluable information, for instance Microsoft Engineers regularly use it for performance investigations.

Common Infrastructure

However, in case it wasn’t clear from the name, ETW events are only available on Windows, which doesn’t really fit into the new ‘cross-platform’ world of .NET Core. You can use PerfView for Performance Tracing on Linux (via LTTng), but that is only the cmd-line collection tool, known as ‘PerfCollect’, the analysis and rich UI (which includes flamegraphs) is currently Windows only.

But if you do want to analyse .NET Performance Linux, there are some other approaches:

The 2nd link above discusses the new ‘EventPipe’ infrastructure that is being worked on in .NET Core (along with EventSources & EventListeners, can you spot a theme!), you can see its aims in Cross-Platform Performance Monitoring Design. At a high-level it will provide a single place for the CLR to push ‘events’ related to diagnostics and performance. These ‘events’ will then be routed to one or more loggers which may include ETW, LTTng, and BPF for example, with the exact logger being determined by which OS/Platform the CLR is running on. There is also more background information in .NET Cross-Plat Performance and Eventing Design that explains the pros/cons of the different logging technologies.

All the work being done on ‘Event Pipes’ is being tracked in the ‘Performance Monitoring’ project and the associated ‘EventPipe’ Issues.

Future Plans

Finally, there are also future plans for a Performance Profiling Controller which has the following goal:

The controller is responsible for control of the profiling infrastructure and exposure of performance data produced by .NET performance diagnostics components in a simple and cross-platform way.

The idea is for it to expose the following functionality via a HTTP server, by pulling all the relevant data from ‘Event Pipes’:

REST APIs

  • Pri 1: Simple Profiling: Profile the runtime for X amount of time and return the trace.
  • Pri 1: Advanced Profiling: Start tracing (along with configuration)
  • Pri 1: Advanced Profiling: Stop tracing (the response to calling this will be the trace itself)
  • Pri 2: Get the statistics associated with all EventCounters or a specified EventCounter.

Browsable HTML Pages

  • Pri 1: Textual representation of all managed code stacks in the process.
    • Provides an snapshot overview of what’s currently running for use as a simple diagnostic report.
  • Pri 2: Display the current state (potentially with history) of EventCounters.
    • Provides an overview of the existing counters and their values.
    • OPEN ISSUE: I don’t believe the necessary public APIs are present to enumerate EventCounters.

I’m excited to see where the ‘Performance Profiling Controller’ (PPC?) goes, I think it’ll be really valuable for .NET to have this built-in to the CLR, it’s something that other runtimes have.

Profiling

Another powerful feature the CLR provides is the Profiling API, which is (mostly) used by 3rd party tools to hook into the runtime at a very low-level. You can find our more about the API in this overview, but at a high-level, it allows your to wire up callbacks that are triggered when:

  • GC-related events happen
  • Exceptions are thrown
  • Assemblies are loaded/unloaded
  • much, much more

Image from the BOTR page Profiling API – Overview

In addition is has other very power features. Firstly you can setup hooks that are called every time a .NET method is executed whether in the runtime or from users code. These callbacks are known as ‘Enter/Leave’ hooks and there is a nice sample that shows how to use them, however to make them work you need to understand ‘calling conventions’ across different OSes and CPU architectures, which isn’t always easy. Also, as a warning, the Profiling API is a COM component that can only be accessed via C/C++ code, you can’t use it from C#/F#/VB.NET!

Secondly, the Profiler is able to re-write the IL code of any .NET method before it is JITted, via the SetILFunctionBody() API. This API is hugely powerful and forms the basis of many .NET APM Tools, you can learn more about how to use it in my previous post How to mock sealed classes and static methods and the accompanying code.

ICorProfiler API

It turns out that the run-time has to perform all sorts of crazy tricks to make the Profiling API work, just look at what went into this PR Allow rejit on attach (for more info on ‘ReJIT’ see ReJIT: A How-To Guide).

The overall definition for all the Profiling API interfaces and callbacks is found in \vm\inc\corprof.idl (see Interface description language). But it’s divided into 2 logical parts, one is the Profiler -> ‘Execution Engine’ (EE) interface, known asICorProfilerInfo:

// Declaration of class that implements the ICorProfilerInfo* interfaces, which allow the
// Profiler to communicate with the EE.  This allows the Profiler DLL to get
// access to private EE data structures and other things that should never be exported
// outside of the EE.

Which is implemented in the following files:

The other main part is the EE -> Profiler callbacks, which are grouped together under the ICorProfilerCallback interface:

// This module implements wrappers around calling the profiler's 
// ICorProfilerCallaback* interfaces. When code in the EE needs to call the
// profiler, it goes through EEToProfInterfaceImpl to do so.

These callbacks are implemented across the following files:

Finally, it’s worth pointing out that the Profiler APIs might not work across all OSes and CPU-archs that .NET Core runs on, e.g. ELT call stub issues on Linux, see Status of CoreCLR Profiler APIs for more info.

Profiling v. Debugging

As a quick aside, ‘Profiling’ and ‘Debugging’ do have some overlap, so it’s helpful to understand what the different APIs provide in the context of the .NET Runtime, from CLR Debugging vs. CLR Profiling

Debugging

Debugging means different things to different people, for instance I asked on Twitter “what are the ways that you’ve debugged a .NET program” and got a wide range of different responses, although both sets of responses contain a really good list of tools and techniques, so they’re worth checking out, thanks #LazyWeb!

But perhaps this quote best sums up what Debugging really is 😊

Debugging is like being the detective in a crime movie where you are also the murderer.

— Filipe Fortes (@fortes) November 10, 2013

The CLR provides a very extensive range of features related to Debugging, but why does it need to provide these services, the excellent post Why is managed debugging different than native-debugging? provides 3 reasons:

  1. Native debugging can be abstracted at the hardware level but managed debugging needs to be abstracted at the IL level
  2. Managed debugging needs a lot of information not available until runtime
  3. A managed debugger needs to coordinate with the Garbage Collector (GC)

So to give a decent experience, the CLR has to provide the higher-level debugging API known as ICorDebug, which is shown in the image below of a ‘common debugging scenario’ from the BOTR:

In addition, there is a nice description of how the different parts interact in How do Managed Breakpoints work?:

Here’s an overview of the pipeline of components:
1) End-user
2) Debugger (such as Visual Studio or MDbg).
3) CLR Debugging Services (which we call "The Right Side"). This is the implementation of ICorDebug (in mscordbi.dll).
---- process boundary between Debugger and Debuggee ----
4) CLR. This is mscorwks.dll. This contains the in-process portion of the debugging services (which we call "The Left Side") which communicates directly with the RS in stage #3.
5) Debuggee's code (such as end users C# program)

ICorDebug API

But how is all this implemented and what are the different components, from CLR Debugging, a brief introduction:

All of .Net debugging support is implemented on top of a dll we call “The Dac”. This file (usually named mscordacwks.dll) is the building block for both our public debugging API (ICorDebug) as well as the two private debugging APIs: The SOS-Dac API and IXCLR.

In a perfect world, everyone would use ICorDebug, our public debugging API. However a vast majority of features needed by tool developers such as yourself is lacking from ICorDebug. This is a problem that we are fixing where we can, but these improvements go into CLR v.next, not older versions of CLR. In fact, the ICorDebug API only added support for crash dump debugging in CLR v4. Anyone debugging CLR v2 crash dumps cannot use ICorDebug at all!

(for an additional write-up, see SOS & ICorDebug)

The ICorDebug API is actually split up into multiple interfaces, there are over 70 of them!! I won’t list them all here, but I will show the categories they fall into, for more info see Partition of ICorDebug where this list came from, as it goes into much more detail.

  • Top-level: ICorDebug + ICorDebug2 are the top-level interfaces which effectively serve as a collection of ICorDebugProcess objects.
  • Callbacks: Managed debug events are dispatched via methods on a callback object implemented by the debugger
  • Process: This set of interfaces represents running code and includes the APIs related to eventing.
  • Code / Type Inspection: Could mostly operate on a static PE image, although there are a few convenience methods for live data.
  • Execution Control: Execution is the ability to “inspect” a thread’s execution. Practically, this means things like placing breakpoints (F9) and doing stepping (F11 step-in, F10 step-over, S+F11 step-out). ICorDebug’s Execution control only operates within managed code.
  • Threads + Callstacks: Callstacks are the backbone of the debugger’s inspection functionality. The following interfaces are related to taking a callstack. ICorDebug only exposes debugging managed code, and thus the stacks traces are managed-only.
  • Object Inspection: Object inspection is the part of the API that lets you see the values of the variables throughout the debuggee. For each interface, I list the “MVP” method that I think must succinctly conveys the purpose of that interface.

One other note, as with the Profiling APIs the level of support for the Debugging API varies across OS’s and CPU architectures. For instance, as of Aug 2018 there’s “no solution for Linux ARM of managed debugging and diagnostic”. For more info on ‘Linux’ support in general, see this great post Debugging .NET Core on Linux with LLDB and check-out the Diagnostics repository from Microsoft that has the goal of making it easier to debug .NET programs on Linux.

Finally, if you want to see what the ICorDebug APIs look like in C#, take a look at the wrappers included in CLRMD library, include all the available callbacks (CLRMD will be covered in more depth, later on in this post).

SOS and the DAC

The ‘Data Access Component’ (DAC) is discussed in detail in the BOTR page, but in essence it provides ‘out-of-process’ access to the CLR data structures, so that their internal details can be read from another process. This allows a debugger (via ICorDebug) or the ‘Son of Strike’ (SOS) extension to reach into a running instance of the CLR or a memory dump and find things like:

  • all the running threads
  • what objects are on the managed heap
  • full information about a method, including the machine code
  • the current ‘stack trace’

Quick aside, if you want an explanation of all the strange names and a bit of a ‘.NET History Lesson’ see this Stack Overflow answer.

The full list of SOS Commands is quite impressive and using it along-side WinDBG allows you a very low-level insight into what’s going on in your program and the CLR. To see how it’s implemented, lets take a look at the !HeapStat command that gives you a summary of the size of different Heaps that the .NET GC is using:

(image from SOS: Upcoming release has a few new commands – HeapStat)

Here’s the code flow, showing how SOS and the DAC work together:

  • SOS The full !HeapStat command (link)
  • SOS The code in the !HeapStat command that deals with the ‘Workstation GC’ (link)
  • SOS GCHeapUsageStats(..) function that does the heavy-lifting (link)
  • Shared The DacpGcHeapDetails data structure that contains pointers to the main data in the GC heap, such as segments, card tables and individual generations (link)
  • DAC GetGCHeapStaticData function that fills-out the DacpGcHeapDetails struct (link)
  • Shared the DacpHeapSegmentData data structure that contains details for an individual ‘segment’ with the GC Heap (link)
  • DAC GetHeapSegmentData(..) that fills-out the DacpHeapSegmentData struct (link)

3rd Party ‘Debuggers’

Because Microsoft published the debugging API it allowed 3rd parties to make use of the use of the ICorDebug interfaces, here’s a list of some that I’ve come across:

Memory Dumps

The final area we are going to look at is ‘memory dumps’, which can be captured from a live system and analysed off-line. The .NET runtime has always had good support for creating ‘memory dumps’ on Windows and now that .NET Core is ‘cross-platform’, the are also tools available do the same on other OSes.

One of the issues with ‘memory dumps’ is that it can be tricky to get hold of the correct, matching versions of the SOS and DAC files. Fortunately Microsoft have just released the dotnet symbol CLI tool that:

can download all the files needed for debugging (symbols, modules, SOS and DAC for the coreclr module given) for any given core dump, minidump or any supported platform’s file formats like ELF, MachO, Windows DLLs, PDBs and portable PDBs.

Finally, if you spend any length of time analysing ‘memory dumps’ you really should take a look at the excellent CLR MD library that Microsoft released a few years ago. I’ve previously written about what you can do with it, but in a nutshell, it allows you to interact with memory dumps via an intuitive C# API, with classes that provide access to the ClrHeap, GC Roots, CLR Threads, Stack Frames and much more. In fact, aside from the time needed to implemented the work, CLR MD could implement most (if not all) of the SOS commands.

But how does it work, from the announcement post:

The ClrMD managed library is a wrapper around CLR internal-only debugging APIs. Although those internal-only APIs are very useful for diagnostics, we do not support them as a public, documented release because they are incredibly difficult to use and tightly coupled with other implementation details of the CLR. ClrMD addresses this problem by providing an easy-to-use managed wrapper around these low-level debugging APIs.

By making these APIs available, in an officially supported library, Microsoft have enabled developers to build a wide range of tools on top of CLRMD, which is a great result!

So in summary, the .NET Runtime provides a wide-range of diagnostic, debugging and profiling features that allow a deep-insight into what’s going on inside the CLR.

Discuss this post on HackerNews, /r/programming or /r/csharp

Further Reading

Where appropriate I’ve included additional links that covers the topics discussed in this post.

General

ETW Events and PerfView:

Profiling API:

Debugging:

Memory Dumps:


Tue, 21 Aug 2018, 12:00 am


Presentations and Talks covering '.NET Internals'

I’m constantly surprised at just how popular resources related to ‘.NET Internals’ are, for instance take this tweet and the thread that followed:

If you like learning about '.NET Internals' here's a few talks/presentations I've watched that you might also like. First 'Writing High Performance Code in .NET' by Bart de Smet https://t.co/L5S9BsBlWe

— Matt Warren (@matthewwarren) July 9, 2018

All I’d done was put together a list of Presentations/Talks (based on the criteria below) and people really seemed to appreciate it!!

Criteria

To keep things focussed, the talks or presentations:

  • Must explain some aspect of the ‘internals’ of the .NET Runtime (CLR)
    • i.e. something ‘under-the-hood’, the more ‘low-level’ the better!
    • e.g. how the GC works, what the JIT does, how assemblies are structured, how to inspect what’s going on, etc
  • Be entertaining and worth watching!
    • i.e. worth someone giving up 40-50 mins of their time for
    • this is hard when you’re talking about low-level details, not all speakers manage it!
  • Needs to be a talk that I’ve watched myself and actually learnt something from
    • i.e. I don’t just hope it’s good based on the speaker/topic
  • Doesn’t have to be unique, fine if it overlaps with another talk
    • it often helps having two people cover the same idea, from different perspectives

If you want more general lists of talks and presentations see Awesome talks and Awesome .NET Performance

List of Talks

Here’s the complete list of talks, including a few bonus ones that weren’t in the tweet:

  1. PerfView: The Ultimate .NET Performance Tool by Sasha Goldshtein
  2. Writing High Performance Code in .NET by Bart De Smet
  3. State of the .NET Performance by Adam Sitnik
  4. Let’s talk about microbenchmarking by Andrey Akinshin
  5. Safe Systems Programming in C# and .NET (summary) by Joe Duffy
  6. FlingOS - Using C# for an OS by Ed Nutting
  7. Maoni Stephens on .NET GC by Maoni Stephens
  8. What’s new for performance in .NET Core 2.0 by Ben Adams
  9. Open Source Hacking the CoreCLR by Geoff Norton
  10. .NET Core & Cross Platform by Matt Ellis
  11. .NET Core on Unix by Jan Vorlicek
  12. Multithreading Deep Dive by Gael Fraiteur
  13. Everything you need to know about .NET memory by Ben Emmett

I also added these 2 categories:

If I’ve missed any out, please let me know in the comments (or on twitter)

PerfView: The Ultimate .NET Performance Tool by Sasha Goldshtein (slides)

In fact, just watch all the talks/presentations that Sasha has done, they’re great!! For example Modern Garbage Collection in Theory and Practice and Making .NET Applications Faster

This talk is a great ‘how-to’ guide for PerfView, what it can do and how to use it (JIT stats, memory allocations, CPU profiling). For more on PerfView see this interview with it’s creator, Vance Morrison: Performance and PerfView.

Writing High Performance Code in .NET by Bart De Smet (he also has a some Pluralsight Courses on the same subject)

Features CLRMD, WinDBG, ETW Events and PerfView, plus some great ‘real world’ performance issues

State of the .NET Performance by Adam Sitnik (slides)

How to write high-perf code that plays nicely with the .NET GC, covering Span, Memory & ValueTask

Let’s talk about microbenchmarking by Andrey Akinshin (slides)

Primarily a look at how to benchmark .NET code, but along the way it demonstrates some of the internal behaviour of the JIT compiler (Andrey is the creator of BenchmarkDotNet)

Safe Systems Programming in C# and .NET (summary) by Joe Duffy (slides and blog)

Joe Duffy (worked on the Midori project) shows why C# is a good ‘System Programming’ language, including what low-level features it provides

FlingOS - Using C# for an OS by Ed Nutting (slides)

Shows what you need to do if you want to write and entire OS in C# (!!) The FlingOS project is worth checking out, it’s a great learning resource.

Maoni Stephens on .NET GC by Maoni Stephens who is the main (only?) .NET GC developer. In addition CLR 4.5 Server Background GC and .NET 4.5 in Practice: Bing are also worth a watch.

An in-depth Q&A on how the .NET GC works, why is does what it does and how to use it efficiently

What’s new for performance in .NET Core 2.0 by Ben Adams (slides)

Whilst it mostly focuses on performance, there is some great internal details on how the JIT generates code for ‘de-virtualisation’, ‘exception handling’ and ‘bounds checking’

Open Source Hacking the CoreCLR by Geoff Norton

Making .NET Core (the CoreCLR) work on OSX was mostly a ‘community contribution’, this talks is a ‘walk-through’ of what it took to make it happen

.NET Core & Cross Platform by Matt Ellis, one of the .NET Runtime Engineers (this one on how made .NET Core ‘Open Source’ is also worth a watch)

Discussion of the early work done to make CoreCLR ‘cross-platform’, including the build setup, ‘Platform Abstraction Layer’ (PAL) and OS differences that had to be accounted for

.NET Core on Unix by Jan Vorlicek a .NET Runtime Engineer (slides)

This talk discusses which parts of the CLR had to be changed to run on Unix, including exception handling, calling conventions, runtime suspension and the PAL

Multithreading Deep Dive by Gael Fraiteur (creator of PostSharp)

Takes a really in-depth look at the CLR memory-model and threading primitives

Everything you need to know about .NET memory by Ben Emmett (slides)

Explains how the .NET GC works using Lego! A very innovative and effective approach!!

Channel 9

The Channel 9 videos recorded by Microsoft deserve their own category, because there’s so much deep, technical information in them. This list is just a selection, including some of my favourites, there are many, many more available!!

Ones to watch

I can’t recommend these yet, because I haven’t watched them myself! (I can’t break my own rules!!).

But they all look really interesting and I will watch them as soon as I get a chance, so I thought they were worth including:

If this post causes you to go off and watch hours and hours of videos, ignoring friends, family and work for the next few weeks, Don’t Blame Me


Thu, 12 Jul 2018, 12:00 am


.NET JIT and CLR - Joined at the Hip

I’ve been digging into .NET Internals for a while now, but never really looked closely at how the ‘Just-in-Time’ (JIT) compiler works. In my mind, the interaction between the .NET Runtime and the JIT has always looked like this:

Nice and straight-forward, the CLR asks the JIT to compile some ‘Intermediate Language’ (IL) code into machine code and the JIT hands back the bytes when it’s done.

However, it turns out the interaction is much more complicated, in reality it looks more like this:

The JIT and the CLR’s ‘Execution Engine’ (EE) or ‘Virtual Machine’ (VM) work closely with one another, they really are ‘joined at the hip’.

The rest of this post will explore the interaction between the 2 components, how they work together and why they need to.

The JIT Compiler

As a quick aside, this post will not be talking about the internals of the JIT compiler itself, if you want to find out more about how that works I recommend reading the fantastic overview in the BOTR and this excellent tutorial, where this very helpful diagram comes from:

After all that, if you still want more, you can take a look at the ‘JIT’ section in the ‘Hitchhikers-Guide-to-the-CoreCLR-Source-Code’.

Components within the CLR

Before we go any further it’s helpful to discuss how the ‘Common Language Runtime’ (CLR) is actually composed. It’s actually made up of several different components including the VM/EE, JIT, GC and others. The treemap below shows the different areas of the source code, grouped by colour into the top-level sections they fall under. You can clearly see that the VM and JIT dominate as well as ‘mscorlib’ which is the only component written in C#.

You can hover over an individual box to get more detailed information and can click on the different radio buttons to toggle the sizing (LOC/Files/Commits)

Total L.O.C # Files # Commits

Note: This treemap is from my previous post ‘Hitchhikers-Guide-to-the-CoreCLR-Source-Code’ which was written over a year ago, so the exact numbers will have changed in the meantime.

You can also see these ‘components’ or ‘areas’ reflected in the classification scheme used for the CoreCLR GitHub issues (one difference is that area-CodeGen is used instead of JIT).

The CLR and the JIT Compiler

Onto the main subject, just how do the CLR and the JIT compiler work together to transform a method from IL to machine code? As always, the ‘Book of the Runtime’ is a good place to start, from the ‘Execution Environment and External Interface’ section of the RyuJIT Overview:

RyuJIT provides the just in time compilation service for the .NET runtime. The runtime itself is variously called the EE (execution engine), the VM (virtual machine) or simply the CLR (common language runtime). Depending upon the configuration, the EE and JIT may reside in the same or different executable files. RyuJIT implements the JIT side of the JIT/EE interfaces:

  • ICorJitCompiler – this is the interface that the JIT compiler implements. This interface is defined in src/inc/corjit.h and its implementation is in src/jit/ee_il_dll.cpp. The following are the key methods on this interface:
    • compileMethod is the main entry point for the JIT. The EE passes it a ICorJitInfo object, and the “info” containing the IL, the method header, and various other useful tidbits. It returns a pointer to the code, its size, and additional GC, EH and (optionally) debug info.
    • getVersionIdentifier is the mechanism by which the JIT/EE interface is versioned. There is a single GUID (manually generated) which the JIT and EE must agree on.
    • getMaxIntrinsicSIMDVectorLength communicates to the EE the largest SIMD vector length that the JIT can support.
  • ICorJitInfo – this is the interface that the EE implements. It has many methods defined on it that allow the JIT to look up metadata tokens, traverse type signatures, compute field and vtable offsets, find method entry points, construct string literals, etc. This bulk of this interface is inherited from ICorDynamicInfo which is defined in src/inc/corinfo.h. The implementation is defined in src/vm/jitinterface.cpp.

So there are 2 main interfaces, ICorJitCompiler which is implemented by the JIT compiler and allows the EE to control how a method is compiled. Second there is ICorJitInfo which the EE implements to allow the JIT to request information it needs during compilation.

Let’s now look at these interfaces in more detail.

EE ➜ JIT ICorJitCompiler

Firstly, we’ll examine ICorJitCompiler, the interface exposed by the JIT. It’s actually pretty straight-forward and only contains 7 methods:

  • CorJitResult __stdcall compileMethod (..)
  • void clearCache()
  • BOOL isCacheCleanupRequired()
  • void ProcessShutdownWork(ICorStaticInfo* info)
  • void getVersionIdentifier(..)
  • unsigned getMaxIntrinsicSIMDVectorLength(..)
  • void setRealJit(..)

Of these, the most interesting one is compileMethod(..), which has the following signature:

    virtual CorJitResult __stdcall compileMethod (
            ICorJitInfo                 *comp,               /* IN */
            struct CORINFO_METHOD_INFO  *info,               /* IN */
            unsigned /* code:CorJitFlag */   flags,          /* IN */
            BYTE                        **nativeEntry,       /* OUT */
            ULONG                       *nativeSizeOfCode    /* OUT */
            ) = 0;

The EE provides the JIT with information about the method it wants compiled (CORINFO_METHOD_INFO) as well as flags (CorJitFlag) which control the:

  • Level of optimisation
  • Whether the code is compiled in Debug or Release mode
  • If the code needs to be ‘Profilable’ or support ‘Edit-and-Continue’
  • Alignment of loops, i.e. should they be aligned on byte-boundaries
  • If SSE3/SSE4 should be used
  • and many other scenarios

The final parameter is a reference to the ICorJitInfo interface, which is covered in the next section.

JIT ➜ EE ICorJitHost and ICorJitInfo

The APIs that the EE has to implement to work with the JIT are not simple, there are almost 180 functions or callbacks!!

Interface Method Count ICorJitHost 5 ICorJitInfo 19 ICorDynamicInfo 36 ICorStaticInfo 118 Total 178

Note: The links take you to the function ‘definitions’ for a given interface. Alternatively all the methods are listed together in this gist.

ICorJitHost makes available ‘functionality that would normally be provided by the operating system’, predominantly the ability to allocate the ‘pages’ of memory that the JIT uses during compilation.

ICorJitInfo (class ICorJitInfo : public ICorDynamicInfo) contains more specific memory allocation routines, including ones for the ‘GC Info’ data, a ‘method/funclet’s unwind information’, ‘.rdata and .pdata for a method’ and the ‘exception handler blocks’.

ICorDynamicInfo (class ICorDynamicInfo : public ICorStaticInfo) provides data that can change from ‘invocation to invocation’, i.e. the JIT cannot cache the results of these method calls. It includes functions that provide:

  • Thread Local Storage (TLS) index
  • Function Entry Point (address)
  • EE ‘helper functions’
  • Address of a Field
  • Constructor for a delegate
  • and much more

Finally, ICorStaticInfo, which is further sub-divided up into more specific interfaces:

Interface Method Count ICorMethodInfo 28 ICorModuleInfo 9 ICorClassInfo 49 ICorFieldInfo 7 ICorDebugInfo 4 ICorArgInfo 4 ICorErrorInfo 7 Diagnostic methods 6 General methods 2 Misc methods 2 Total 118

Because the interface is nicely composed we can easily see what it provides. The bulk of the functions are concerned with information about a module, class, method or field. For instance the JIT can query the class size, GC layout and obtain the address of a field within a class. It can also learn about a method’s signature, find it’s parent class and get ‘exception handling’ information (the full list of methods are available in this gist).

These interfaces and the methods they contain give a nice insight into what information the JIT requests from the runtime and therefore what knowledge it requires when compiling a single method.

Now, let’s look at the end-to-end flow of a couple of these methods and see where they are implemented in the CoreCLR source code.

EE ➜ JIT getFunctionEntryPoint(..)

First we’ll look at a method where the EE provides information to the JIT:

JIT ➜ EE reportInliningDecision()

Next we’ll look at a scenario where the data flows from the JIT back to the EE:

SuperPMI tool

Finally, I just want to cover the ‘SuperPMI’ tool that showed up in the previous 2 scenarios. What is this tool and what does it do? From the CoreCLR glossary:

SuperPMI - JIT component test framework (super fast JIT testing - it mocks/replays EE in EE-JIT interface)

So in a nutshell it allows JIT development and testing to be de-coupled from the EE, which is useful because we’ve just seen that the 2 components are tightly integrated.

But how does it work? From the README:

SuperPMI works in two phases: collection and playback. In the collection phase, the system is configured to collect SuperPMI data. Then, run any set of .NET managed programs. When these managed programs invoke the JIT compiler, SuperPMI gathers and captures all information passed between the JIT and its .NET host. In the playback phase, SuperPMI loads the JIT directly, and causes it to compile all the functions that it previously compiled, but using the collected data to provide answers to various questions that the JIT needs to ask. The .NET execution engine (EE) is not invoked at all.

This explains why there is a SuperPMI implementation for every method that is part of the JIT EE interface. SuperPMI needs to ‘record’ or ‘collect’ each interaction with the EE and store the information so that it can be ‘played back’ at a later time, when the EE isn’t present.

Discuss this post on Hacker News or /r/dotnet

Further Reading

As always, if you’ve read this far, here’s some further information that you might find useful:


Thu, 5 Jul 2018, 12:00 am


Tools for Exploring .NET Internals

Whether you want to look at what your code is doing ‘under-the-hood’ or you’re trying to see what the ‘internals’ of the CLR look like, there is a whole range of tools that can help you out.

To give ‘credit where credit is due’, this post is based on a tweet, so thanks to everyone who contributed to the list and if I’ve missed out any tools, please let me know in the comments below.

While you’re here, I’ve also written other posts that look at the ‘internals’ of the .NET Runtime:

Honourable Mentions

Firstly I’ll start by mentioning that Visual Studio has a great debugger and so does VSCode. Also there are lots of very good (commercial) .NET Profilers and Application Monitoring Tools available that you should also take a look at. For example I’ve recently been playing around with Codetrack and I’m very impressed by what it can do!

However, the rest of the post is going to look at some more single-use tools that give a even deeper insight into what is going on. As a added bonus they’re all ‘open-source’, so you can take a look at the code and see how they work!!

PerfView by Vance Morrison

PerfView is simply an excellent tool and is the one that I’ve used most over the years. It uses ‘Event Tracing for Windows’ (ETW) Events to provide a deep insight into what the CLR is doing, as well as allowing you to profile Memory and CPU usage. It does have a fairly steep learning curve, but there are some nice tutorials to help you along the way and it’s absolutely worth the time and effort.

Also, if you need more proof of how useful it is, Microsoft Engineers themselves use it and many of the recent performance improvements in MSBuild were carried out after using PerfView to find the bottlenecks.

PerfView is built on-top of the Microsoft.Diagnostics.Tracing.TraceEvent library which you can use in your own tools. In addition, since it’s been open-sourced the community has contributed and it has gained some really nice features, including flame-graphs:

(Click for larger version)

SharpLab by Andrey Shchekin

SharpLab started out as a tool for inspecting the IL code emitted by the Roslyn compiler, but has now grown into much more:

SharpLab is a .NET code playground that shows intermediate steps and results of code compilation. Some language features are thin wrappers on top of other features – e.g. using() becomes try/catch. SharpLab allows you to see the code as compiler sees it, and get a better understanding of .NET languages.

If supports C#, Visual Basic and F#, but most impressive are the ‘Decompilation/Disassembly’ features:

There are currently four targets for decompilation/disassembly:

  1. C#
  2. Visual Basic
  3. IL
  4. JIT Asm (Native Asm Code)

That’s right, it will output the assembly code that the .NET JIT generates from your C#:

Object Layout Inspector by Sergey Teplyakov

This tool gives you an insight into the memory layout of your .NET objects, i.e. it will show you how the JITter has decided to arrange the fields within your class or struct. This can be useful when writing high-performance code and it’s helpful to have a tool that does it for us because doing it manually is tricky:

There is no official documentation about fields layout because the CLR authors reserved the right to change it in the future. But knowledge about the layout can be helpful if you’re curious or if you’re working on a performance critical application.

How can we inspect the layout? We can look at a raw memory in Visual Studio or use !dumpobj command in SOS Debugging Extension. These approaches are tedious and boring, so we’ll try to write a tool that will print an object layout at runtime.

From the example in the GitHub repo, if you use TypeLayout.Print() with code like this:

public struct NotAlignedStruct
{
    public byte m_byte1;
    public int m_int;

    public byte m_byte2;
    public short m_short;
}

You’ll get the following output, showing exactly how the CLR will layout the struct in memory, based on it’s padding and optimization rules.

Size: 12. Paddings: 4 (%33 of empty space)
|================================|
|     0: Byte m_byte1 (1 byte)   |
|--------------------------------|
|   1-3: padding (3 bytes)       |
|--------------------------------|
|   4-7: Int32 m_int (4 bytes)   |
|--------------------------------|
|     8: Byte m_byte2 (1 byte)   |
|--------------------------------|
|     9: padding (1 byte)        |
|--------------------------------|
| 10-11: Int16 m_short (2 bytes) |
|================================|

The Ultimate .NET Experiment (TUNE) by Konrad Kokosa

TUNE is a really intriguing tool, as it says on the GitHub page, it’s purpose is to help you

… learn .NET internals and performance tuning by experiments with C# code.

You can find out more information about what it does in this blog post, but at a high-level it works like this:

  • write a sample, valid C# script which contains at least one class with public method taking a single string parameter. It will be executed by hitting Run button. This script can contain as many additional methods and classes as you wish. Just remember that first public method from the first public class will be executed (with single parameter taken from the input box below the script). …
  • after clicking Run button, the script will be compiled and executed. Additionally, it will be decompiled both to IL (Intermediate Language) and assembly code in the corresponding tabs.
  • all the time Tune is running (including time during script execution) a graph with GC data is being drawn. It shows information about generation sizes and GC occurrences (illustrated as vertical lines with the number below indicating which generation has been triggered).

And looks like this:

(Click for larger version)

Tools based on CLR Memory Diagnostics (ClrMD)

Finally, we’re going to look at a particular category of tools. Since .NET came out you’ve always been able to use WinDBG and the SOS Debugging Extension to get deep into the .NET runtime. However it’s not always the easiest tool to get started with and as this tweet says, it’s not always the most productive way to do things:

Besides how complex it is, the idea is to build better abstractions. Raw debugging at the low level is just usually too unproductive. That to me is the promise of ClrMD, that it lets us build specific extensions to extract quickly the right info

— Tomas Restrepo (@tomasrestrepo) March 14, 2018

Fortunately Microsoft made the ClrMD library available (a.k.a Microsoft.Diagnostics.Runtime), so now anyone can write a tool that analyses memory dumps of .NET programs. You can find out even more info in the official blog post and I also recommend taking a look at ClrMD.Extensions that “.. provide integration with LINPad and to make ClrMD even more easy to use”.

I wanted to pull together a list of all the existing tools, so I enlisted twitter to help. Note to self: careful what you tweet, the WinDBG Product Manager might read your tweets and get a bit upset!!

Well this just hurts my feelings :(

— Andy Luhrs (@aluhrs13) March 14, 2018

Most of these tools are based on ClrMD because it’s the easiest way to do things, however you can use the underlying COM interfaces directly if you want. Also, it’s worth pointing out that any tool based on ClrMD is not cross-platform, because ClrMD itself is Windows-only. For cross-platform options see Analyzing a .NET Core Core Dump on Linux

Finally, in the interest of balance, there have been lots of recent improvements to WinDBG and because it’s extensible there have been various efforts to add functionality to it:

Having said all that, onto the list:

  • SuperDump (GitHub)
    • A service for automated crash-dump analysis (presentation)
  • msos (GitHub)
    • Command-line environment a-la WinDbg for executing SOS commands without having SOS available.
  • MemoScope.Net (GitHub)
    • A tool to analyze .Net process memory Can dump an application’s memory in a file and read it later.
    • The dump file contains all data (objects) and threads (state, stack, call stack). MemoScope.Net will analyze the data and help you to find memory leaks and deadlocks
  • dnSpy (GitHub)
    • .NET debugger and assembly editor
    • You can use it to edit and debug assemblies even if you don’t have any source code available!!
  • MemAnalyzer (GitHub)
    • A command line memory analysis tool for managed code.
    • Can show which objects use most space on the managed heap just like !DumpHeap from Windbg without the need to install and attach a debugger.
  • DumpMiner (GitHub)
    • UI tool for playing with ClrMD, with more features coming soon
  • Trace CLI (GitHub)
    • A production debugging and tracing tool
  • Shed (GitHub)
    • Shed is an application that allow to inspect the .NET runtime of a program in order to extract useful information. It can be used to inspect malicious applications in order to have a first general overview of which information are stored once that the malware is executed. Shed is able to:
      • Extract all objects stored in the managed heap
      • Print strings stored in memory
      • Save the snapshot of the heap in a JSON format for post-processing
      • Dump all modules that are loaded in memory

You can also find many other tools that make use of ClrMD, it was a very good move by Microsoft to make it available.

Other Tools

A few other tools that are also worth mentioning:

  • DebugDiag
    • The DebugDiag tool is designed to assist in troubleshooting issues such as hangs, slow performance, memory leaks or memory fragmentation, and crashes in any user-mode process (now with ‘CLRMD Integration’)
  • SOSEX (might not be developed any more)
    • … a debugging extension for managed code that begins to alleviate some of my frustrations with SOS
  • VMMap from Sysinternals

Discuss this post on Hacker News or /r/programming


Fri, 15 Jun 2018, 12:00 am


CoreRT - A .NET Runtime for AOT

Firstly, what exactly is CoreRT? From its GitHub repo:

.. a .NET Core runtime optimized for AOT (ahead of time compilation) scenarios, with the accompanying .NET native compiler toolchain

The rest of this post will look at what that actually means.

Contents

  1. Existing .NET ‘AOT’ Implementations
  2. High-Level Overview
  3. The Compiler
  4. The Runtime
  5. ‘Hello World’ Program
  6. Limitations
  7. Further Reading

Existing .NET ‘AOT’ Implementations

However, before we look at what CoreRT is, it’s worth pointing out there are existing .NET ‘Ahead-of-Time’ (AOT) implementations that have been around for a while:

Mono

.NET Native (Windows 10/UWP apps only, a.k.a ‘Project N’)

So if there were existing implementations, why was CoreRT created? The official announcement gives us some idea:

If we want to shortcut this two-step compilation process and deliver a 100% native application on Windows, Mac, and Linux, we need an alternative to the CLR. The project that is aiming to deliver that solution with an ahead-of-time compilation process is called CoreRT.

The main difference is that CoreRT is designed to support .NET Core scenarios, i.e. .NET Standard, cross-platform, etc.

Also worth pointing out is that whilst .NET Native is a separate product, they are related and in fact “.NET Native shares many CoreRT parts”.

High-Level Overview

Because all the code is open source, we can very easily identify the main components and understand where the complexity is. Firstly lets look at where the most ‘lines of code’ are:

We clearly see that the majority of the code is written in C#, with only the Native component written in C++. The largest single component is System.Private.CoreLib which is all C# code, although there are other sub-components that contribute to it (‘System.Private.XXX’), such as System.Private.Interop (36,547 LOC), System.Private.TypeLoader (30,777) and System.Private.Reflection.Core (24,964). Other significant components are the ‘Intermediate Language (IL) Compiler’ and the Common code that is used re-used by everything else.

All these components are discussed in more detail below.

The Compiler

So whilst CoreRT is a run-time, it also needs a compiler to put everything together, from Intro to .NET Native and CoreRT:

.NET Native is a native toolchain that compiles CIL byte code to machine code (e.g. X64 instructions). By default, .NET Native (for .NET Core, as opposed to UWP) uses RyuJIT as an ahead-of-time (AOT) compiler, the same one that CoreCLR uses as a just-in-time (JIT) compiler. It can also be used with other compilers, such as LLILC, UTC for UWP apps and IL to CPP (an IL to textual C++ compiler we have built as a reference prototype).

But what does this actually look like in practice, as they say ‘a picture paints a thousand words’:

(Click for larger version)

To give more detail, the main compilation phases (started from \ILCompiler\src\Program.cs) are the following:

  1. Calculate the reachable modules/types/classes, i.e. the ‘compilation roots’ using the ILScanner.cs
  2. Allow for reflection, via an optional rd.xml file and generate the necessary metadata using ILCompiler.MetadataWriter
  3. Compile the IL using the specific back-end (generic/shared code is in Compilation.cs)
  4. Finally, write out the compiled methods using ObjectWriter which in turn uses LLVM under-the-hood

But it’s not just your code that ends up in the final .exe, along the way the CoreRT compiler also generates several ‘helper methods’ to cover the following scenarios:

Fortunately the compiler doesn’t blindly include all the code it finds, it is intelligent enough to only include code that’s actually used:

We don’t use ILLinker, but everything gets naturally treeshaken by the compiler itself (we start with compiling Main/NativeCallable exports and continue compiling other methods and generating necessary data structures as we go). If there’s a type or method that is not used, the compiler doesn’t even look at it.

The Runtime

All the user/helper code then sits on-top of the CoreRT runtime, from Intro to .NET Native and CoreRT:

CoreRT is the .NET Core runtime that is optimized for AOT scenarios, which .NET Native targets. This is a refactored and layered runtime. The base is a small native execution engine that provides services such as garbage collection(GC). This is the same GC used in CoreCLR. Many other parts of the traditional .NET runtime, such as the type system, are implemented in C#. We’ve always wanted to implement runtime functionality in C#. We now have the infrastructure to do that. In addition, library implementations that were built deep into CoreCLR, have also been cleanly refactored and implemented as C# libraries.

This last point is interesting, why is it advantageous to implement ‘runtime functionality in C#’? Well it turns out that it’s hard to do in an un-managed language because there’s some very subtle and hard-to-track-down ways that you can get it wrong:

Reliability and performance. The C/C++ code has to manually managed. It means that one has to be very careful to report all GC references to the GC. The manually managed code is both very hard to get right and it has performance overhead.

— Jan Kotas (@JanKotas7) April 24, 2018

These are known as ‘GC Holes’ and the BOTR provides more detail on them. The author of that tweet is significant, Jan Kotas has worked on the .NET runtime for a long time, if he thinks something is hard, it really is!!

Runtime Components

As previously mentioned it’s a layered runtime, i.e made up of several, distinct components, as explained in this comment:

At the core of CoreRT, there’s a runtime that provides basic services for the code to run (think: garbage collection, exception handling, stack walking). This runtime is pretty small and mostly depends on C/C++ runtime (even the C++ runtime dependency is not a hard requirement as Jan pointed out - #3564). This code mostly lives in src/Native/Runtime, src/Native/gc, and src/Runtime.Base. It’s structured so that the places that do require interacting with the underlying platform (allocating native memory, threading, etc.) go through a platform abstraction layer (PAL). We have a PAL for Windows, Linux, and macOS, but others can be added.

And you can see the PAL Components in the following locations:

C# Code shared with CoreCLR

One interesting aspect of the CoreRT runtime is that wherever possible it shares code with the CoreCLR runtime, this is part of a larger effort to ensure that wherever possible code is shared across multiple repositories:

This directory contains the shared sources for System.Private.CoreLib. These are shared between dotnet/corert, dotnet/coreclr and dotnet/corefx. The sources are synchronized with a mirroring tool that watches for new commits on either side and creates new pull requests (as @dotnet-bot) in the other repository.

Recently there has been a significant amount of work done to moved more and more code over into the ‘shared partition’ to ensure work isn’t duplicated and any fixes are shared across both locations. You can see how this works by looking at the links below:

What this means is that about 2/3 of the C# code in System.Private.CoreLib is shared with CoreCLR and only 1/3 is unique to CoreRT:

Group C# LOC (Files) shared 170,106 (759) src 96,733 (351) Total 266,839 (1,110)

Native Code

Finally, whilst it is advantageous to write as much code as possible in C#, there are certain components that have to be written in C++, these include the GC (the majority of which is one file, gc.cpp which is almost 37,000 LOC!!), the JIT Interface, ObjWriter (based on LLVM) and most significantly the Core Runtime that contains code for activities like:

  • Threading
  • Stack Frame handling
  • Debugging/Profiling
  • Interfacing to the OS
  • CPU specific helpers for:
    • Exception handling
    • GC Write Barriers
    • Stubs/Thunks
    • Optimised object allocation

‘Hello World’ Program

One of the first things people asked about CoreRT is “what is the size of a ‘Hello World’ app” and the answer is ~3.93 MB (if you compile in Release mode), but there is work being done to reduce this. At a ‘high-level’, the .exe that is produced looks like this:

Note the different colours correspond to the original format of a component, obviously the output is a single, native, executable file.

This file comes with a full .NET specific ‘base runtime’ or ‘class libraries’ (‘System.Private.XXX’) so you get a lot of functionality, it is not the absolute bare-minimum app. Fortunately there is a way to see what a ‘bare-minimum’ runtime would look like by compiling against the Test.CoreLib project included in the CoreRT source. By using this you end up with an .exe that looks like this:

But it’s so minimal that OOTB you can’t even write ‘Hello World’ to the console as there is no System.Console type! After a bit of hacking I was able to build a version that did have a working Console output (if you’re interested, this diff is available here). To make it work I had to include the following components:

So Test.CoreLib really is a minimal runtime!! But the difference in size is dramatic, it shrinks down to 0.49 MB compared to 3.93 MB for the fully-featured runtime!

Type Standard (bytes) Test.CoreLib (bytes) Difference .data 163,840 36,864 -126,976 .managed 1,540,096 65,536 -1,474,560 .pdata 147,456 20,480 -126,976 .rdata 1,712,128 81,920 -1,630,208 .reloc 98,304 4,096 -94,208 .text 360,448 299,008 -61,440 rdata 98,304 4,096 -94,208         Total (bytes) 4,120,576 512,000 -3,608,576 Total (MB) 3.93 0.49 -3.44

These data sizes were obtained by using the Microsoft DUMPBIN tool and the /DISASM cmd line switch (zip file of the full ouput), which produces the following summary (note: size values are in HEX):

  Summary

       28000 .data
      178000 .managed
       24000 .pdata
      1A2000 .rdata
       18000 .reloc
       58000 .text
       18000 rdata

Also contained in the output is the assembly code for a simple Hello World method:

HelloWorld_HelloWorld_Program__Main:
  0000000140004C50: 48 8D 0D 19 94 37  lea         rcx,[__Str_Hello_World__E63BA1FD6D43904697343A373ECFB93457121E4B2C51AF97278C431E8EC85545]
                    00
  0000000140004C57: 48 8D 05 DA C5 00  lea         rax,[System_Console_System_Console__WriteLine_12]
                    00
  0000000140004C5E: 48 FF E0           jmp         rax
  0000000140004C61: 90                 nop
  0000000140004C62: 90                 nop
  0000000140004C63: 90                 nop

and if we dig further we can see the code for System.Console.WriteLine(..):

System_Console_System_Console__WriteLine_12:
  0000000140011238: 56                 push        rsi
  0000000140011239: 48 83 EC 20        sub         rsp,20h
  000000014001123D: 48 8B F1           mov         rsi,rcx
  0000000140011240: E8 33 AD FF FF     call        System_Console_System_Console__get_Out
  0000000140011245: 48 8B C8           mov         rcx,rax
  0000000140011248: 48 8B D6           mov         rdx,rsi
  000000014001124B: 48 8B 00           mov         rax,qword ptr [rax]
  000000014001124E: 48 8B 40 68        mov         rax,qword ptr [rax+68h]
  0000000140011252: 48 83 C4 20        add         rsp,20h
  0000000140011256: 5E                 pop         rsi
  0000000140011257: 48 FF E0           jmp         rax
  000000014001125A: 90                 nop
  000000014001125B: 90                 nop

Limitations

Missing Functionality

There have been some people who’ve successfully run complex apps using CoreRT, but, as it stands CoreRT is still an alpha product. At least according to the NuGet package ‘1.0.0-alpha-26529-02’ that the official samples instruct you to use and I’ve not seen any information about when a full 1.0 Release will be available.

So there is some functionality that is not yet implemented, e.g. F# Support, GC.GetMemoryInfo or canGetCookieForPInvokeCalliSig (a calli to a p/invoke). For more information on this I recommend this entertaining presentation on Building Native Executables from .NET with CoreRT by Mark Rendle. In the 2nd half he chronicles all the issues that he ran into when he was trying to run an ASP.NET app under CoreRT (some of which may well be fixed now).

Reflection

But more fundamentally, because of the nature of AOT compilation, there are 2 main stumbling blocks that you may also run into Reflection and Runtime Code-Generation.

Firstly, if you want to use reflection in your code you need to tell the CoreRT compiler about the types you expect to reflect over, because by-default it only includes the types it knows about. You can do with by using a file called rd.xml as shown here. Unfortunately this will always require manual intervention for the reasons explained in this issue. More information is available in this comment ‘…some details about CoreRT’s restriction on MakeGenericType and MakeGenericMethod’.

To make reflection work the compiler adds the required metadata to the final .exe using this process:

This would reuse the same scheme we already have for the RyuJIT codegen path:

  • The compiler generates a blob of bytes that describes the metadata (namespaces, types, their members, their custom attributes, method parameters, etc.). The data is generated as a byte array in the ComputeMetadata method.
  • The metadata gets embedded as a data blob into the executable image. This is achieved by adding the blob to a “ready to run header”. Ready to run header is a well known data structure that can be located by the code in the framework at runtime.
  • The ready to run header along with the blobs it refers to is emitted into the final executable.
  • At runtime, pointer to the byte array is located using the RhFindBlob API, and a parser is constructed over the array, to be used by the reflection stack.

Runtime Code-Generation

In .NET you often use reflection once (because it can be slow) followed by ‘dynamic’ or ‘runtime’ code-generation with Reflection.Emit(..). This technique is widely using in .NET libraries for Serialisation/Deserialisation, Dependency Injection, Object Mapping and ORM.

The issue is that ‘runtime’ code generation is problematic in an ‘AOT’ scenario:

ASP.NET dependency injection introduced dependency on Reflection.Emit in aspnet/DependencyInjection#630 unfortunately. It makes it incompatible with CoreRT.

We can make it functional in CoreRT AOT environment by introducing IL interpretter (#5011), but it would still perform poorly. The dependency injection framework is using Reflection.Emit on performance critical paths.

It would be really up to ASP.NET to provide AOT-friendly flavor that generates all code at build time instead of runtime to make this work well. It would likely help the startup without CoreRT as well.

I’m sure this will be solved one way or the other (see #5011), but at the moment it’s still ‘work-in-progress’.

Discuss this post on HackerNews and /r/dotnet

Further Reading

If you’ve got this far, here’s some other links that you might be interested in:


Thu, 7 Jun 2018, 12:00 am


Taking a look at the ECMA-335 Standard for .NET

It turns out that the .NET Runtime has a technical standard (or specification), known by its full name ECMA-335 - Common Language Infrastructure (CLI) (not to be confused with ECMA-334 which is the ‘C# Language Specification’). The latest update is the 6th edition from June 2012.

The specification or standard was written before .NET Core existed, so only applies to the .NET Framework, I’d be interested to know if there are any plans for an updated version?

The rest of this post will take a look at the standard, exploring the contents and investigating what we can learn from it (hint: lots of low-level details and information about .NET internals)

Why is it useful?

Having a standard means that different implementations, such as Mono and DotNetAnywhere can exist, from Common Language Runtime (CLR):

Compilers and tools are able to produce output that the common language runtime can consume because the type system, the format of metadata, and the runtime environment (the virtual execution system) are all defined by a public standard, the ECMA Common Language Infrastructure specification. For more information, see ECMA C# and Common Language Infrastructure Specifications.

and from the CoreCLR documentation on .NET Standards:

There was a very early realization by the founders of .NET that they were creating a new programming technology that had broad applicability across operating systems and CPU types and that advanced the state of the art of late 1990s (when the .NET project started at Microsoft) programming language implementation techniques. This led to considering and then pursuing standardization as an important pillar of establishing .NET in the industry.

The key addition to the state of the art was support for multiple programming languages with a single language runtime, hence the name Common Language Runtime. There were many other smaller additions, such as value types, a simple exception model and attributes. Generics and language integrated query were later added to that list.

Looking back, standardization was quite effective, leading to .NET having a strong presence on iOS and Android, with the Unity and Xamarin offerings, both of which use the Mono runtime. The same may end up being true for .NET on Linux.

The various .NET standards have been made meaningful by the collaboration of multiple companies and industry experts that have served on the working groups that have defined the standards. In addition (and most importantly), the .NET standards have been implemented by multiple commercial (ex: Unity IL2CPP, .NET Native) and open source (ex: Mono) implementors. The presence of multiple implementations proves the point of standardization.

As the last quote points out, the standard is not produced solely by Microsoft:

There is also a nice Wikipedia page that has some additional information.

What is in it?

At a high-level overview, the specification is divided into the following ‘partitions’ :

  • I: Concepts and Architecture
    • A great introduction to the CLR itself, explaining many of the key concepts and components, as well as the rationale behind them
  • II: Metadata Definition and Semantics
    • An explanation of the format of .NET dll/exe files, the different sections within them and how they’re laid out in-memory
  • III: CIL Instruction Set
    • A complete list of all the Intermediate Language (IL) instructions that the CLR understands, along with a detailed description of what they do and how to use them
  • IV: Profiles and Libraries
    • Describes the various different ‘Base Class libraries’ that make-up the runtime and how they are grouped into ‘Profiles’
  • V: Binary Formats (Debug Interchange Format)
    • An overview of ‘Portable CILDB files’, which give a way for additional debugging information to be provided
  • VI: Annexes
    • Annex A - Introduction
    • Annex B - Sample programs
    • Annex C - CIL assembler implementation
    • Annex D - Class library design guidelines
    • Annex E - Portability considerations
    • Annex F - Imprecise faults
    • Annex G - Parallel library

But, working your way through the entire specification is a mammoth task, generally I find it useful to just search for a particular word or phrase and locate the parts I need that way. However if you do want to read through one section, I recommend ‘Partition I: Concepts and Architecture’, at just over 100 pages it is much easier to fully digest! This section is a very comprehensive overview of the key concepts and components contained within the CLR and well worth a read.

Also, I’m convinced that the authors of the spec wanted to help out any future readers, so to break things up they included lots of very helpful diagrams:

For more examples see:

On top of all that, they also dropped in some Comic Sans 😀, just to make it clear when the text is only ‘informative’:

How has it changed?

The spec has been through 6th editions and it’s interesting to look at the changes over time:

Edition Release Date CLR Version Significant Changes 1st December 2001 1.0 (February 2002) N/A 2nd December 2002 1.1 (April 2003)   3rd June 2005 2.0 (January 2006) See below (link) 4th June 2006   None, revision of 3rd edition (link) 5th December 2010 4.0 (April 2010) See below (link) 6th June 2012   None, revision of 5th edition (link)

However, only 2 editions contained significant updates, they are explained in more detail below:

  • Support for generic types and methods (see ‘How generics were added to .NET’)
  • New IL instructions - ldelem, stelem and unbox.any
  • Added the constrained., no. and readonly. IL instruction prefixes
  • Brand new ‘namespaces’ (with corresponding types) - System.Collections.Generics, System.Threading.Parallel
  • New types added, including Action, Nullable and ThreadStaticAttribute
  • Type-forwarding added
  • Semantics of ‘variance’ redefined, became a core feature
  • Multiple types added or updated, including System.Action, System.MulticastDelegate and System.WeakReference
  • System.Math and System.Double modified to better conform to IEEE

Microsoft Specific Implementation

Another interesting aspect to look at is the Microsoft specific implementation details and notes. The following links are to pdf documents that are modified versions of the 4th edition:

They all contain multiple occurrences of text like this ‘Implementation Specific (Microsoft)’:

More Information

Finally, if you want to find out more there’s a book available (affiliate link):


Fri, 6 Apr 2018, 12:00 am


Exploring the internals of the .NET Runtime

I recently appeared on Herding Code and Stackify ‘Developer Things’ podcasts and in both cases, the first question asked was ‘how do you figure out the internals of the .NET runtime’?

This post is an attempt to articulate that process, in the hope that it might be useful to others.

Here are my suggested steps:

  1. Decide what you want to investigate
  2. See if someone else has already figured it out (optional)
  3. Read the ‘Book of the Runtime’
  4. Build from the source
  5. Debugging
  6. Verify against .NET Framework (optional)

Note: As with all these types of lists, just because it worked for me doesn’t mean that it will for everyone. So, ‘your milage may vary’.

Step One - Decide what you want to investigate

For me, this means working out what question I’m trying to answer, for example here are some previous posts I’ve written:

(it just goes to show, you don’t always need fancy titles!)

I put this as ‘Step 1’ because digging into .NET internals isn’t quick or easy work, some of my posts take weeks to research, so I need to have a motivation to keep me going, something to focus on. In addition, the CLR isn’t a small run-time, there’s a lot in there, so just blindly trying to find your way around it isn’t easy! That’s why having a specific focus helps, looking at one feature or section at a time is more manageable.

The very first post where I followed this approach was Strings and the CLR - a Special Relationship. I’d previously spent some time looking at the CoreCLR source and I knew a bit about how Strings in the CLR worked, but not all the details. During the research of that post I then found more and more areas of the CLR that I didn’t understand and the rest of my blog grew from there (delegates, arrays, fixed keyword, type loader, etc).

Aside: I think this is generally applicable, if you want to start blogging, but you don’t think you have enough ideas to sustain it, I’d recommend that you start somewhere and other ideas will follow.

Another tip is to look at HackerNews or /r/programming for posts about the ‘internals’ of other runtimes, e.g. Java, Ruby, Python, Go etc, then write the equivalent post about the CLR. One of my most popular posts A Hitchhikers Guide to the CoreCLR Source Code was clearly influenced by equivalent articles!

Finally, for more help with learning, ‘figuring things out’ and explaining them to others, I recommend that you read anything by Julia Evans. Start with Blogging principles I use and So you want to be a wizard (also available as a zine), then work your way through all the other posts related to blogging or writing.

I’ve been hugely influenced, for the better, by Julia’s approach to blogging.

Step Two - See if someone else has already figured it out (optional)

I put this in as ‘optional’, because it depends on your motivation. If you are trying to understand .NET internals for your own education, then feel-free to write about whatever you want. If you are trying to do it to also help others, I’d recommend that you first see what’s already been written about the subject. If, once you’ve done that you still think there is something new or different that you can add, then go ahead, but I try not to just re-hash what is already out there.

To see what’s already been written, you can start with Resources for Learning about .NET Internals or peruse the ‘Internals’ tag on this blog. Another really great resource is all the answers by Hans Passant on StackOverflow, he is prolific and amazingly knowledgeable, here’s some examples to get you started:

Step Three - Read the ‘Book of the Runtime’

You won’t get far in investigating .NET internals without coming across the ‘Book of the Runtime’ (BOTR) which is an invaluable resource, even Scott Hanselman agrees!

It was written by the .NET engineering team, for the .NET engineering team, as per this HackerNews comment:

Having worked for 7 years on the .NET runtime team, I can attest that the BOTR is the official reference. It was created as documentation for the engineering team, by the engineering team. And it was (supposed to be) kept up to date any time a new feature was added or changed.

However, just a word of warning, this means that it’s an in-depth, non-trivial document and hard to understand when you are first learning about a particular topic. Several of my blog posts have consisted of the following steps:

  1. Read the BOTR chapter on ‘Topic X’
  2. Understand about 5% of what I read
  3. Go away and learn more (read the source code, read other resources, etc)
  4. GOTO ‘Step 1’, understanding more this time!

Related to this, the source code itself is often as helpful as the BOTR due to the extensive comments, for example this one describing the rules for prestubs really helped me out. The downside of the source code comments is that they are bit harder to find, whereas the BOTR is all in one place.

Step Four - Build from the source

However, at some point, just reading about the internals of the CLR isn’t enough, you actually need to ‘get your hands’ dirty and see it in action. Now that the Core CLR is open source it’s very easy to build it yourself and then once you’ve done that, there are even more docs to help you out if you are building on different OSes, want to debug, test CoreCLR in conjunction with CoreFX, etc.

But why is building from source useful?

Because it lets you build a Debug/Diagnostic version of the runtime that gives you lots of additional information that isn’t available in the Release/Retails builds. For instance you can view JIT Dumps using COMPlus_JitDump=..., however this is just one of many COMPlus_XXX settings you can use, there are 100’s available.

However, even more useful is the ability to turn on diagnostic logging for a particular area of the CLR. For instance, lets imagine that we want to find out more about AppDomains and how they work under-the-hood, we can use the following logging configuration settings:

SET COMPLUS_LogEnable=1
SET COMPLUS_LogToFile=1
SET COMPLUS_LogFacility=02000000
SET COMPLUS_LogLevel=A

Where LogFacility is set to LF_APPDOMAIN, there are many other values you can provide as a HEX bit-mask the full list is available in the source code. If you set these variables and then run an app, you will get a log output like this one. Once you have this log you can very easily search around in the code to find where the messages came from, for instance here are all the places that LF_APPDOMAIN is logged. This is a great technique to find your way into a section of the CLR that you aren’t familiar with, I’ve used it many times to great effect.

Step Five - Debugging

For me, biggest boon of Microsoft open sourcing .NET is that you can discover so much more about the internals without having to resort to ‘old school’ debugging using WinDBG. But there still comes a time when it’s useful to step through the code line-by-line to see what’s going on. The added advantage of having the source code is that you can build a copy locally and then debug through that using Visual Studio which is slightly easier than WinDBG.

I always leave debugging to last, as it can be time-consuming and I only find it helpful when I already know where to set a breakpoint, i.e. I already know which part of the code I want to step through. I once tried to blindly step through the source of the CLR whilst it was starting up and it was very hard to see what was going on, as I’ve said before the CLR is a complex runtime, there are many things happening, so stepping through lots of code, line-by-line can get tricky.

Step Six - Verify against .NET Framework

I put this final step in because the .NET CLR source available on GitHub is the ‘.NET Core’ version of the runtime, which isn’t the same as the full/desktop .NET Framework that’s been around for years. So you may need to verify the behavior matches, if you want to understand the internals ‘as they were’, not just ‘as they will be’ going forward. For instance .NET Core has removed the ability to create App Domains as a way to provide isolation but interestingly enough the internal class lives on!

To verify the behaviour, your main option is to debug the CLR using WinDBG. Beyond that, you can resort to looking at the ‘Rotor’ source code (roughly the same as .NET Framework 2.0), or petition Microsoft the release the .NET Framework Source Code (probably not going to happen)!

However, low-level internals don’t change all that often, so more often than not the way things behave in the CoreCLR is the same as they’ve always worked.

Resources

Finally, for your viewing pleasure, here are a few talks related to ‘.NET Internals’:

Discuss this post on /r/programming or /r/dotnet


Fri, 23 Mar 2018, 12:00 am


How generics were added to .NET

Discuss this post on HackerNews and /r/programming

Before we dive into the technical details, let’s start with a quick history lesson, courtesy of Don Syme who worked on adding generics to .NET and then went on to design and implement F#, which is a pretty impressive set of achievements!!

Background and History

Update: Don Syme, pointed out another research paper related to .NET generics, Combining Generics, Precompilation and Sharing Between Software Based Processes (pdf)

To give you an idea of how these events fit into the bigger picture, here are the dates of .NET Framework Releases, up-to 2.0 which was the first version to have generics:

Version number CLR version Release date 1.0 1.0 2002-02-13 1.1 1.1 2003-04-24 2.0 2.0 2005-11-07

Aside from the historical perspective, what I find most fascinating is just how much the addition of generics in .NET was due to the work done by Microsoft Research, from .NET/C# Generics History:

It was only through the total dedication of Microsoft Research, Cambridge during 1998-2004, to doing a complete, high quality implementation in both the CLR (including NGEN, debugging, JIT, AppDomains, concurrent loading and many other aspects), and the C# compiler, that the project proceeded.

He then goes on to say:

What would the cost of inaction have been? What would the cost of failure have been? No generics in C# 2.0? No LINQ in C# 3.0? No TPL in C# 4.0? No Async in C# 5.0? No F#? Ultimately, an erasure model of generics would have been adopted, as for Java, since the CLR team would never have pursued a in-the-VM generics design without external help.

Wow, C# and .NET would look very different without all these features!!

The ‘Gyro’ Project - Generics for Rotor

Unfortunately there doesn’t exist a publicly accessible version of the .NET 1.0 and 2.0 source code, so we can’t go back and look at the changes that were made (if I’m wrong, please let me know as I’d love to read it).

However, we do have the next best thing, the ‘Gyro’ project in which the equivalent changes were made to the ‘Shared Source Common Language Implementation’ (SSCLI) code base (a.k.a ‘Rotor’). As an aside, if you want to learn more about the Rotor code base I really recommend the excellent book by Ted Neward, which you can download from his blog.

Gyro 1.0 was released in 2003 which implies that is was created after the work has been done in the real .NET Framework source code, I assume that Microsoft Research wanted to publish the ‘Rotor’ implementation so it could be studied more widely. Gyro is also referenced in one Don Syme’s posts, from Some History: 2001 “GC#” research project draft, from the MSR Cambridge team:

With Dave Berry’s help we later published a version of the corresponding code as the “Gyro” variant of the “Rotor” CLI implementation.

The rest of this post will look at how generics were implemented in the Rotor source code.

Note: There are some significant differences between the Rotor source code and the real .NET framework. Most notably the JIT and GC are completely different implementations (due to licensing issues, listen to DotNetRocks show 360 - Ted Neward and Joel Pobar on Rotor 2.0 for more info). However, the Rotor source does give us an accurate idea about how other core parts of the CLR are implemented, such as the Type-System, Debugger, AppDomains and the VM itself. It’s interesting to compare the Rotor source with the current CoreCLR source and see how much of the source code layout and class names have remained the same.

Implementation

To make things easier for anyone who wants to follow-along, I created a GitHub repo that contains the Rotor code for .NET 1.0 and then checked in the Gyro source code on top, which means that you can see all the changes in one place:

The first thing you notice in the Gyro source is that all the files contain this particular piece of legalese:

 ;    By using this software in any fashion, you are agreeing to be bound by the
 ;    terms of this license.
 ;   
+;    This file contains modifications of the base SSCLI software to support generic
+;    type definitions and generic methods. These modifications are for research
+;    purposes. They do not commit Microsoft to the future support of these or
+;    any similar changes to the SSCLI or the .NET product. -- 31st October, 2002.
+;   
 ;    You must not remove this notice, or any other, from this software.

It’s funny that they needed to add the line ‘They do not commit Microsoft to the future support of these or any similar changes to the SSCLI or the .NET product’, even though they were just a few months away from doing just that!!

Components (Directories) with the most changes

To see where the work was done, lets start with a high-level view, showing the directories with a significant amount of changes (> 1% of the total changes):

$ git diff --dirstat=lines,1 464bf98 2714cca
   0.1% bcl/
  14.4% csharp/csharp/sccomp/
   9.1% debug/di/
  11.9% debug/ee/
   2.1% debug/inc/
   1.9% debug/shell/
   2.5% fjit/
  21.1% ilasm/
   1.5% ildasm/
   1.2% inc/
   1.4% md/compiler/
  29.9% vm/

Note: fjit is the “Fast JIT” compiler, i.e the version released with Rotor, which was significantly different to one available in the full .NET framework.

The full output from git diff --dirstat=lines,0 is available here and the output from git diff --stat is here.

0.1% bcl/ is included only to show that very little C# code changes were needed, these were mostly plumbing code to expose the underlying C++ methods and changes to the various ToString() methods to include generic type information, e.g. ‘Class[int,double]’. However there are 2 more significant ones:

  • bcl/system/reflection/emit/opcodes.cs (diff)
  • bcl/system/reflection/emit/signaturehelper.cs (diff)
    • Add the ability to parse method metadata that contains generic related information, such as methods with generic parameters.

Files with the most changes

Next, we’ll take a look at the specific classes/files that had the most changes as this gives us a really good idea about where the complexity was

Added Deleted Total Changes File (click to go directly to the diff) 1794 323 1471 debug/di/module.cpp 1418 337 1081 vm/class.cpp 1335 308 1027 vm/jitinterface.cpp 1616 888 728 debug/ee/debugger.cpp 741 46 695 csharp/csharp/sccomp/symmgr.cpp 693 0 693 vm/genmeth.cpp 999 362 637 csharp/csharp/sccomp/clsdrec.cpp 926 321 605 csharp/csharp/sccomp/fncbind.cpp 559 0 559 vm/typeparse.cpp 605 156 449 vm/siginfo.cpp 417 29 388 vm/method.hpp 642 255 387 fjit/fjit.cpp 379 0 379 vm/jitinterfacegen.cpp 3045 2672 373 ilasm/parseasm.cpp 465 94 371 vm/class.h 515 163 352 debug/inc/cordb.h 339 0 339 vm/generics.cpp 733 418 315 csharp/csharp/sccomp/parser.cpp 471 169 302 debug/shell/dshell.cpp 382 88 294 csharp/csharp/sccomp/import.cpp

Components of the Runtime

Now we’ll look at individual components in more detail so we can get an idea of how different parts of the runtime had to change to accommodate generics.

Type System changes

Not surprisingly the bulk of the changes are in the Virtual Machine (VM) component of the CLR and related to the ‘Type System’. Obviously adding ‘parameterised types’ to a type system that didn’t already have them requires wide-ranging and significant changes, which are shown in the list below:

  • vm/class.cpp (diff )
    • Allow the type system to distinguish between open and closed generic types and provide APIs to allow working them, such as IsGenericVariable() and GetGenericTypeDefinition()
  • vm/genmeth.cpp (diff)
    • Contains the bulk of the functionality to make ‘generic methods’ possible, i.e. MyMethod(T item, U filter), including to work done to enable ‘shared instantiation’ of generic methods
  • vm/typeparse.cpp (diff)
    • Changes needed to allow generic types to be looked-up by name, i.e. ‘MyClass[System.Int32]
  • vm/siginfo.cpp (diff)
    • Adds the ability to work with ‘generic-related’ method signatures
  • vm/method.hpp (diff) and vm/method.cpp (diff)
    • Provides the runtime with generic related methods such as IsGenericMethodDefinition(), GetNumGenericMethodArgs() and GetNumGenericClassArgs()
  • vm/generics.cpp (diff)
    • All the completely new ‘generics’ specific code is in here, mostly related to ‘shared instantiation’ which is explained below

Bytecode or ‘Intermediate Language’ (IL) changes

The main place that the implementation of generics in the CLR differs from the JVM is that they are ‘fully reified’ instead of using ‘type erasure’, this was possible because the CLR designers were willing to break backwards compatibility, whereas the JVM had been around longer so I assume that this was a much less appealing option. For more discussion on this issue see Erasure vs reification and Reified Generics for Java. Update: this HackerNews discussion is also worth a read.

The specific changes made to the .NET Intermediate Language (IL) op-codes can be seen in the inc/opcode.def (diff), in essence the following 3 instructions were added

In addition the IL Assembler tool (ILASM) needed significant changes as well as it’s counter part `IL Disassembler (ILDASM) so it could handle the additional instructions.

There is also a whole section titled ‘Support for Polymorphism in IL’ that explains these changes in greater detail in Design and Implementation of Generics for the .NET Common Language Runtime

Shared Instantiations

From Design and Implementation of Generics for the .NET Common Language Runtime

Two instantiations are compatible if for any parameterized class its compilation at these instantiations gives rise to identical code and other execution structures (e.g. field layout and GC tables), apart from the dictionaries described below in Section 4.4. In particular, all reference types are compatible with each other, because the loader and JIT compiler make no distinction for the purposes of field layout or code generation. On the implementation for the Intel x86, at least, primitive types are mutually incompatible, even if they have the same size (floats and ints have different parameter passing conventions). That leaves user-defined struct types, which are compatible if their layout is the same with respect to garbage collection i.e. they share the same pattern of traced pointers

From a comment with more info:

// For an generic type instance return the representative within the class of
// all type handles that share code.  For example, 
//     --> ,
//     --> ,
//     --> ,
//     --> ,
//     --> 
//
// If the code for the type handle is not shared then return 
// the type handle itself.

In addition, this comment explains the work that needs to take place to allow shared instantiations when working with generic methods.

Update: If you want more info on the ‘code-sharing’ that takes places, I recommend reading these 4 posts:

Compiler and JIT Changes

If seems like almost every part of the compiler had to change to accommodate generics, which is not surprising given that they touch so many parts of the code we write, Types, Classes and Methods. Some of the biggest changes were:

  • csharp/csharp/sccomp/clsdrec.cpp - +999 -363 - (diff)
  • csharp/csharp/sccomp/emitter.cpp - +347 -127 - (diff)
  • csharp/csharp/sccomp/fncbind.cpp - +926 -321 - (diff)
  • csharp/csharp/sccomp/import.cpp - +382 - 88 - (diff)
  • csharp/csharp/sccomp/parser.cpp - +733 -418 - (diff)
  • csharp/csharp/sccomp/symmgr.cpp - +741 -46 - (diff)

In the ‘just-in-time’ (JIT) compiler extra work was needed because it’s responsible for implementing the additional ‘IL Instructions’. The bulk of these changes took place in fjit.cpp (diff) and fjitdef.h (diff).

Finally, a large amount of work was done in vm/jitinterface.cpp (diff) to enable the JIT to access the extra information it needed to emit code for generic methods.

Debugger Changes

Last, but by no means least, a significant amount of work was done to ensure that the debugger could understand and inspect generics types. It goes to show just how much inside information a debugger needs to have of the type system in an managed language.

  • debug/ee/debugger.cpp (diff)
  • debug/ee/debugger.h (diff)
  • debug/di/module.cpp (diff)
  • debug/di/rsthread.cpp (diff)
  • debug/shell/dshell.cpp (diff)

Further Reading

If you want even more information about generics in .NET, there are also some very useful design docs available (included in the Gyro source code download):

Also Pre-compilation for .NET Generics by Andrew Kennedy & Don Syme (pdf) is an interesting read


Fri, 2 Mar 2018, 12:00 am


Resources for Learning about .NET Internals

It all started with a tweet, which seemed to resonate with people:

If you like reading my posts on .NET internals, you'll like all these other blogs. So I've put them together in a thread for you!!

— Matt Warren (@matthewwarren) January 12, 2018

The aim was to list blogs that specifically cover .NET internals at a low-level or to put it another way, blogs that answer the question how does feature ‘X’ work, under-the-hood. The list includes either typical posts for that blog, or just some of my favourites!

Note: for a wider list of .NET and performance related blogs see Awesome .NET Performance by Adam Sitnik

I wouldn’t recommend reading through the entire list, at least not in one go, your brain will probably melt. Picks some posts/topics that interest you and start with those.

Finally, bear in mind that some of the posts are over 10 years old, so there’s a chance that things have changed since then (however, in my experience, the low-levels parts of the CLR are more stable). If you want to double-check the latest behaviour, you’re best option is to read the source!

Community or Non-Microsoft Blogs

These blogs are all written by non-Microsoft employees (AFAICT), or if they do work for Microsoft, they don’t work directly on the CLR. If I’ve missed any interesting blogs out, please let me know!

Special mention goes to Sasha Goldshtein, he’s been blogging about this longer than anyone!!

Update: I missed out a few blogs and learnt about some new ones:

Honourable mention goes to .NET Type Internals - From a Microsoft CLR Perspective on CodeProject, it’s a great article!!

Book of the Runtime (BotR)

The BotR deserves it’s own section (thanks to svick to reminding me about it).

If you haven’t heard of the BotR before, there’s a nice FAQ that explains what it is:

The Book of the Runtime is a set of documents that describe components in the CLR and BCL. They are intended to focus more on architecture and invariants and not an annotated description of the codebase.

It was originally created within Microsoft in ~2007, including this document. Developers were responsible to document their feature areas. This helped new devs joining the team and also helped share the product architecture across the team.

To find your way around it, I recommend starting with the table of contents and then diving in.

Note: It’s written for developers working on the CLR, so it’s not an introductory document. I’d recommend reading some of the other blog posts first, then referring to the BotR once you have the basic knowledge. For instance many of my blog posts started with me reading a chapter from the BotR, not fully understanding it, going away and learning some more, writing up what I found and then pointing people to the relevant BotR page for more information.

Microsoft Engineers

The blogs below are written by the actual engineers who worked on, designed or managed various parts of the CLR, so they give a deep insight (again, if I’ve missed any blogs out, please let me know):

Books

Finally, if you prefer reading off-line there are some decent books that discuss .NET Internals (Note: all links are Amazon Affiliate links):

All the books listed above I own copies of and I’ve read cover-to-cover, they’re fantastic resources.

I’ve also been recently recommend the 2 books below, they look good and certainly the authors know their stuff, but I haven’t read them yet:

*New Release*

Discuss this post on HackerNews and /r/programming


Mon, 22 Jan 2018, 12:00 am


A look back at 2017

I’ve now been blogging consistently for over 2 years (~2 times per/month) and I decided it was time for my first ‘retrospective’ post.

Warning this post contains a large amount of humble brags, if you’ve come here to read about .NET internals you’d better check back in a few weeks, when normal service will be resumed!

Overall Stats

Firstly, lets looks at my Google Analytics stats for 2017, showing Page Views and Sessions:

Which clearly shows that I took a bit of a break during the summer! But I still managed over 800K page views, mostly because I was fortunate enough to end up on the front page of HackerNews a few times!

As a comparison, here’s what ‘2017 v 2016’ looks like:

This is cool because it shows a nice trend, more people read my blog posts in 2017 than in 2016 (but I have no idea if it will continue in 2018?!)

Most Read Posts

Next, here are my top 10 most read posts. Surprising enough my most read post was literally just a list with 68 entries in it!!

Post Page Views The 68 things the CLR does before executing a single line of your code 101,382 A Hitchhikers Guide to the CoreCLR Source Code 61,169 A DoS Attack against the C# Compiler 50,884 Analysing C# code on GitHub with BigQuery 40,165 Adding a new Bytecode Instruction to the CLR 39,101 Open Source .NET – 3 years later 36,316 How do .NET delegates work? 36,047 Lowering in the C# Compiler (and what happens when you misuse it) 34,375 How the .NET Runtime loads a Type 32,813 DotNetAnywhere: An Alternative .NET Runtime 26,140

Traffic Sources

I was going to do a write-up on where/how I get my blog traffic, but instead I’d encourage you to read 6 Years of Thoughts on Programming by Henrik Warne as his experience exactly matches mine. But in summary, getting onto the front-page of HackerNews drives a lot of traffic to your site/blog.

Finally, a big thanks to everyone who has read, commented on or shared my blogs posts, it means a lot!!


Sun, 31 Dec 2017, 12:00 am


Open Source .NET – 3 years later

A little over 3 years ago Microsoft announced that they were open sourcing large parts of the .NET framework and as Scott Hanselman said in his Connect 2016 keynote, the community has been contributing in a significant way:

This post forms part of an on-going series, if you want to see how things have changed over time you can check out the previous ones:

In addition, I’ve recently done a talk covering this subject, the slides are below:

Microsoft & open source a 'brave new world' - CORESTART 2.0 from Matt Warren

Historical Perspective

Now that we are 3 years down the line, it’s interesting to go back and see what the aims were when it all started. If you want to know more about this, I recommend watching the 2 Channel 9 videos below, made by the Microsoft Engineers involved in the process:

It hasn’t always been plain sailing, it’s fair to say that there have been a few bumps along the way (I guess that’s what happens if you get to see “how the sausage gets made”), but I think that we’ve ended up in a good place.

During the past 3 years there have been a few notable events that I think are worth mentioning:

Repository activity over time

But onto the data, first we are going to look at an overview of the level of activity in each repo, by looking at the total number of ‘Issues’ (created) or ‘Pull Requests’ (closed) per month. (yay sparklines FTW!!). If you are interested in how I got the data, see the previous post because the process is the same.

Note: Numbers in black are from the most recent month, with the red dot showing the lowest and the green dot the highest previous value. You can toggle between Issues and Pull Requests by clicking on the buttons, hover over individual sparklines to get a tooltip showing the per/month values and click on the project name to take you to the GitHub page for that repository.

Issues Pull Requests

This data gives a good indication of how healthy different repos are, are they growing over time, or staying the same. You can also see the different levels of activity each repo has and how they compare to other ones.

Whilst it’s clear that Visual Studio Code is way ahead of all the other repos in terms of ‘Issues’, it’s interesting to see that the .NET-only ones have the most ‘Pull-Requests’, notably CoreFX (Base Class Libraries), Roslyn (Compiler) and CoreCLR (Runtime).

Overall Participation - Community v. Microsoft

Next will will look at the total participation from the last 3 years, i.e. November 2014 to November 2017. All Pull Requests are Issues are treated equally, so a large PR counts the same as one that fixes a spelling mistake. Whilst this isn’t ideal it’s the simplest way to get an idea of the Microsoft/Community split.

Note: You can hover over the bars to get the actual numbers, rather than percentages.

Issues: Microsoft Community Pull Requests: Microsoft Community

Participation over time - Community v. Microsoft

Finally we can see the ‘per-month’ data from the last 3 years, i.e. November 2014 to November 2017.

Note: You can inspect different repos by selecting them from the pull-down list, but be aware that the y-axis on the graphs are re-scaled, so the maximum value will change each time.

Issues: Microsoft Community Pull Requests: Microsoft Community

Summary

It’s clear that the community continues to be invested in the .NET-related, Open Source repositories, contributing significantly and for a sustained period of time. I think this is good for all .NET developers, whether you contribute to OSS or not, having .NET be a thriving, Open Source product has many benefits!

Discuss this post on Hacker News and /r/programming


Tue, 19 Dec 2017, 12:00 am


A look at the internals of 'Tiered JIT Compilation' in .NET Core

The .NET runtime (CLR) has predominantly used a just-in-time (JIT) compiler to convert your executable into machine code (leaving aside ahead-of-time (AOT) scenarios for the time being), as the official Microsoft docs say:

At execution time, a just-in-time (JIT) compiler translates the MSIL into native code. During this compilation, code must pass a verification process that examines the MSIL and metadata to find out whether the code can be determined to be type safe.

But how does that process actually work?

The same docs give us a bit more info:

JIT compilation takes into account the possibility that some code might never be called during execution. Instead of using time and memory to convert all the MSIL in a PE file to native code, it converts the MSIL as needed during execution and stores the resulting native code in memory so that it is accessible for subsequent calls in the context of that process. The loader creates and attaches a stub to each method in a type when the type is loaded and initialized. When a method is called for the first time, the stub passes control to the JIT compiler, which converts the MSIL for that method into native code and modifies the stub to point directly to the generated native code. Therefore, subsequent calls to the JIT-compiled method go directly to the native code.

Simple really!! However if you want to know more, the rest of this post will explore this process in detail.

In addition, we will look at a new feature that is making its way into the Core CLR, called ‘Tiered Compilation’. This is a big change for the CLR, up till now .NET methods have only been JIT compiled once, on their first usage. Tiered compilation is looking to change that, allowing methods to be re-compiled into a more optimised version much like the Java Hotspot compiler.

How it works

But before we look at future plans, how does the current CLR allow the JIT to transform a method from IL to native code? Well, they say ‘a pictures speaks a thousand words’

Before the method is JITed

After the method has been JITed

The main things to note are:

  • The CLR has put in a ‘precode’ and ‘stub’ to divert the initial method call to the PreStubWorker() method (which ultimately calls the JIT). These are hand-written assembly code fragments consisting of only a few instructions.
  • Once the method had been JITed into ‘native code’, a stable entry point it created. For the rest of the life-time of the method the CLR guarantees that this won’t change, so the rest of the run-time can depend on it remaining stable.
  • The ‘temporary entry point’ doesn’t go away, it’s still available because there may be other methods that are expecting to call it. However the associated ‘precode fixup’ has been re-written or ‘back patched’ to point to the newly created ‘native code’ instead of PreStubWorker().
  • The CLR doesn’t change the address of the call instruction in the method that called the method being JITted, it only changes the address inside the ‘precode’. But because all method calls in the CLR go via a precode, the 2nd time the newly JITed method is called, the call will end up at the ‘native code’.

For reference, the ‘stable entry point’ is the same memory location as the IntPtr that is returned when you call the RuntimeMethodHandle.GetFunctionPointer() method.

If you want to see this process in action for yourself, you can either re-compile the CoreCLR source and add the relevant debug information as I did or just use WinDbg and follow the steps in this excellent blog post (for more on the same topic see ‘Advanced Call Processing in the CLR’ and Vance Morrison’s excellent write-up ‘Digging into interface calls in the .NET Framework: Stub-based dispatch’).

Finally, the different parts of the Core CLR source code that are involved are listed below:

Note: this post isn’t going to look at how the JIT itself works, if you are interested in that take a look as this excellent overview written by one of the main developers.

JIT and Execution Engine (EE) Interaction

The make all this work the JIT and the EE have to work together, to get an idea of what is involved, take a look at this comment describing the rules that determine which type of precode the JIT can use. All this info is stored in the EE as it’s the only place that has the full knowledge of what a method does, so the JIT has to ask which mode to work in.

In addition, the JIT has to ask the EE what the address of a functions entry point is, this is done via the following methods:

Precode and Stubs

There are different types or ‘precode’ available, ‘FIXUP’, ‘REMOTING’ or ‘STUB’, you can see the rules for which one is used in MethodDesc::GetPrecodeType(). In addition, because they are such a low-level mechanism, they are implemented differently across CPU architectures, from a comment in the code:

There two implementation options for temporary entrypoints:

(1) Compact entrypoints. They provide as dense entrypoints as possible, but can’t be patched to point to the final code. The call to unjitted method is indirect call via slot.

(2) Precodes. The precode will be patched to point to the final code eventually, thus the temporary entrypoint can be embedded in the code. The call to unjitted method is direct call to direct jump.

We use (1) for x86 and (2) for 64-bit to get the best performance on each platform. For ARM (1) is used.

There’s also a whole lot more information about ‘precode’ available in the BOTR.

Finally, it turns out that you can’t go very far into the internals of the CLR without coming across ‘stubs’ (or ‘trampolines’, ‘thunks’, etc), for instance they’re used in

Tiered Compilation

Before we go any further I want to point out that Tiered Compilation is very much work-in-progress. As an indication, to get it working you currently have to set an environment variable called COMPLUS_EXPERIMENTAL_TieredCompilation. It appears that the current work is focussed on the infrastructure to make it possible (i.e. CLR changes), then I assume that there has to be a fair amount of testing and performance analysis before it’s enabled by default.

If you want to learn about the goals of the feature and how it fits into the wider process of ‘code versioning’, I recommend reading the excellent design docs, including the future roadmap possibilities.

To give an indications of what has been involved so far, there has been work going on in the:

If you want to follow along you can take a look at the related issues/PRs, here are the main ones to get you started:

There is also some nice background information available in Introduce a tiered JIT and if you want to understand how it will eventually makes use of changes in the JIT (‘MinOpts’), take a look at Low Tier Back-Off and JIT: enable aggressive inline policy for Tier1.

History - ReJIT

As an quick historical aside, you have previously been able to get the CLR to re-JIT a method for you, but it only worked with the Profiling APIs, which meant you had to write some C/C++ COM code to make it happen! In addition ReJIT only allowed the method to be re-compiled at the same level, so it wouldn’t ever produce more optimised code. It was mostly meant to help monitoring or profiling tools.

How it works

Finally, how does it work, again lets look at some diagrams. Firstly, as a recap, lets take a look at how things ends up once a method had been JITed, with tiered compilation turned off (the same diagram as above):

Now, as a comparison, here’s what the same stage looks like with tiered compilation enabled:

The main difference is that tiered compilation has forced the method call to go through another level of indirection, the ‘pre stub’. This is to make it possible to count the number of times the method is called, then once it has hit the threshold (currently 30), the ‘pre stub’ is re-written to point to the ‘optimised native code’ instead:

Note that the original ‘native code’ is still available, so if needed the changes can be reverted and the method call can go back to the unoptimised version.

Using a counter

We can see a bit more details about the counter in this comments from prestub.cpp:

    /***************************   CALL COUNTER    ***********************/
    // If we are counting calls for tiered compilation, leave the prestub
    // in place so that we can continue intercepting method invocations.
    // When the TieredCompilationManager has received enough call notifications
    // for this method only then do we back-patch it.
    BOOL fCanBackpatchPrestub = TRUE;
#ifdef FEATURE_TIERED_COMPILATION
    BOOL fEligibleForTieredCompilation = IsEligibleForTieredCompilation();
    if (fEligibleForTieredCompilation)
    {
        CallCounter * pCallCounter = GetCallCounter();
        fCanBackpatchPrestub = pCallCounter->OnMethodCalled(this);
    }
#endif

In essence the ‘stub’ calls back into the TieredCompilationManager until the ‘tiered compilation’ is triggered, once that happens the ‘stub’ is ‘back-patched’ to stop it being called any more.

Why not ‘Interpreted’?

If you’re wondering why tiered compilation doesn’t have an interpreted mode, you’re not alone, I asked the same question (for more info see my previous post on the .NET Interpreter)

And the answer I got was:

There’s already an Interpreter available, or is it not considered suitable for production code?

Its a fine question, but you guessed correctly - the interpreter is not in good enough shape to run production code as-is. There are also some significant issues if you want debugging and profiling tools to work (which we do). Given enough time and effort it is all solvable, it just isn’t the easiest place to start.

How different is the overhead between non-optimised and optimised JITting?

On my machine non-optimized jitting used about ~65% of the time that optimized jitting took for similar IL input sizes, but of course I expect results will vary by workload and hardware. Getting this first step checked in should make it easier to collect better measurements.

But that’s from a few months ago, maybe Mono’s New .NET Interpreter will change things, who knows?

Why not LLVM?

Finally, why aren’t they using a LLVM to compile the code, from Introduce a tiered JIT (comment)

There were (and likely still are) significant differences in the LLVM support needed for the CLR versus what is needed for Java, both in GC and in EH, and in the restrictions one must place on the optimizer. To cite just one example: the CLRs GC currently cannot tolerate managed pointers that point off the end of objects. Java handles this via a base/derived paired reporting mechanism. We’d either need to plumb support for this kind of paired reporting into the CLR or restrict LLVM’s optimizer passes to never create these kinds of pointers. On top of that, the LLILC jit was slow and we weren’t sure ultimately what kind of code quality it might produce.

So, figuring out how LLILC might fit into a potential multi-tier approach that did not yet exist seemed (and still seems) premature. The idea for now is to get tiering into the framework and use RyuJit for the second-tier jit. As we learn more, we may discover there is indeed room for higher tier jits, or, at least, understand better what else we need to do before such things make sense.

There is more background info in Introduce a tiered JIT

Summary

One of my favourite side-effects of Microsoft making .NET Open Source and developing out in the open is that we can follow along with work-in-progress features. It’s great being able to download the latest code, try them out and see how they work under-the-hood, yay for OSS!!

Discuss this post on Hacker News


Fri, 15 Dec 2017, 12:00 am


Exploring the BBC micro:bit Software Stack

If you grew up in the UK and went to school during the 1980’s or 1990’s there’s a good chance that this picture brings back fond memories:

(image courtesy of Classic Acorn)

I’d imagine that for a large amount of computer programmers (currently in their 30’s) the BBC Micro was their first experience of programming. If this applies to you and you want a trip down memory lane, have a read of Remembering: The BBC Micro and The BBC Micro in my education.

Programming the classic Turtle was done in Logo, with code like this:

FORWARD 100
LEFT 90
FORWARD 100
LEFT 90
FORWARD 100
LEFT 90
FORWARD 100
LEFT 90

Of course, once you knew what you were doing, you would re-write it like so:

REPEAT 4 [FORWARD 100 LEFT 90]

BBC micro:bit

The original Micro was launched as an education tool, as part of the BBC’s Computer Literacy Project and by most accounts was a big success. As a follow-up, in March 2016 the micro:bit was launched as part of the BBC’s ‘Make it Digital’ initiative and 1 million devices were given out to schools and libraries in the UK to ‘help develop a new generation of digital pioneers’ (i.e. get them into programming!)

Aside: I love the difference in branding across 30 years, ‘BBC Micro’ became ‘BBC micro:bit’ (you must include the colon) and ‘Computer Literacy Project’ changed to the ‘Make it Digital Initiative’.

A few weeks ago I walked into my local library, picked up a nice starter kit and then spent a fun few hours watching my son play around with it (I’m worried about how quickly he picked up the basics of programming, I think I might be out of a job in a few years time!!)

However once he’d gone to bed it was all mine! The result of my ‘playing around’ is this post, in it I will be exploring the software stack that makes up the micro:bit, what’s in it, what it does and how it all fits together.

If you want to learn about how to program the micro:bit, its hardware or anything else, take a look at this excellent list of resources.

Slightly off-topic, but if you enjoy reading source code you might like these other posts:

BBC micro:bit Software Stack

If we take a high-level view at the stack, it divides up into 3 discrete software components that all sit on top of the hardware itself:

If you would like to build this stack for yourself take a look at the Building with Yotta guide. I also found this post describing The First Video Game on the BBC micro:bit [probably] very helpful.

Runtimes

There are several high-level runtimes available, these are useful because they let you write code in a language other than C/C++ or even create programs by dragging blocks around on a screen. The main ones that I’ve come across are below (see ‘Programming’ for a full list):

They both work in a similar way, the users code (Python or TypeScript) is bundled up along with the C/C++ code of the runtime itself and then the entire binary (hex) file is deployed to the micro:bit. When the device starts up, the runtime then looks for the users code at a known location in memory and starts interpreting it.

Update It turns out that I was wrong about the Microsoft PXT, it actually compiles your TypeScript program to native code, very cool! Interestingly, they did it that way because:

Compared to a typical dynamic JavaScript engine, PXT compiles code statically, giving rise to significant time and space performance improvements:

  • user programs are compiled directly to machine code, and are never in any byte-code form that needs to be interpreted; this results in much faster execution than a typical JS interpreter
  • there is no RAM overhead for user-code - all code sits in flash; in a dynamic VM there are usually some data-structures representing code
  • due to lack of boxing for small integers and static class layout the memory consumption for objects is around half the one you get in a dynamic VM (not counting the user-code structures mentioned above)
  • while there is some runtime support code in PXT, it’s typically around 100KB smaller than a dynamic VM, bringing down flash consumption and leaving more space for user code

The execution time, RAM and flash consumption of PXT code is as a rule of thumb 2x of compiled C code, making it competitive to write drivers and other user-space libraries.

Memory Layout

Just before we go onto the other parts of the software stack I want to take a deeper look at the memory layout. This is important because memory is so constrained on the micro:bit, there is only 16KB of RAM. To put that into perspective, we’ll use the calculation from this StackOverflow question How many bytes of memory is a tweet?

Twitter uses UTF-8 encoded messages. UTF-8 code points can be up to six four octets long, making the maximum message size 140 x 4 = 560 8-bit bytes.

If we re-calculate for the newer, longer tweets 280 x 4 = 1,120 bytes. So we could only fit 10 tweets into the available RAM on the micro:bit (it turns out that only ~11K out of the total 16K is available for general use). Which is why it’s worth using a custom version of atoi() to save 350 bytes of RAM!

The memory layout is specified by the linker at compile-time using NRF51822.ld, there is a sample output available if you want to take a look. Because it’s done at compile-time you run into build errors such as “region RAM overflowed with stack” if you configure it incorrectly.

The table below shows the memory layout from the ‘no SD’ version of a ‘Hello World’ app, i.e. with the maximum amount of RAM available as the Bluetooth (BLE) Soft-Device (SD) support has been removed. By comparison with BLE enabled, you instantly have 8K less RAM available, so things start to get tight!

Name Start Address End Address Size Percentage .data 0x20000000 0x20000098 152 bytes 0.93% .bss 0x20000098 0x20000338 672 bytes 4.10% Heap (mbed) 0x20000338 0x20000b38 2,048 bytes 12.50% Empty 0x20000b38 0x20003800 11,464 bytes 69.97% Stack 0x20003800 0x20004000 2,048 bytes 12.50%

For more info on the column names see the Wikipedia pages for .data and .bss as well as text, data and bss: Code and Data Size Explained

As a comparison there is a nice image of the micro:bit RAM Layout in this article. It shows what things look like when running MicroPython and you can clearly see the main Python heap in the centre taking up all the remaining space.

microbit-dal

Sitting in the stack below the high-level runtime is the device abstraction layer (DAL), created at Lancaster University in the UK, it’s made up of 4 main components:

  • core
    • High-level components, such as Device, Font, HeapAllocator, Listener and Fiber, often implemented on-top of 1 or more driver classes
  • types
    • Helper types such as ManagedString, Image, Event and PacketBuffer
  • drivers
    • For control of a specific hardware component, such as Accelerometer, Button, Compass, Display, Flash, IO, Serial and Pin
  • bluetooth
  • asm
    • Just 4 functions are implemented in assembly, they are swap_context, save_context, save_register_context and restore_register_context. As the names suggest, they handle the ‘context switching’ necessary to make the MicroBit Fiber scheduler work

The image below shows the distribution of ‘Lines of Code’ (LOC), as you can see the majority of the code is in the drivers and bluetooth components.

In addition to providing nice helper classes for working with the underlying devices, the DAL provides the Fiber abstraction to allows asynchronous functions to work. This is useful because you can asynchronously display text on the LED display and your code won’t block whilst it’s scrolling across the screen. In addition the Fiber class is used to handle the interrupts that signal when the buttons on the micro:bit are pushed. This comment from the code clearly lays out what the Fiber scheduler does:

This lightweight, non-preemptive scheduler provides a simple threading mechanism for two main purposes:

1) To provide a clean abstraction for application languages to use when building async behaviour (callbacks). 2) To provide ISR decoupling for EventModel events generated in an ISR context.

Finally the high-level classes MicroBit.cpp and MicroBit.h are housed in the microbit repository. These classes define the API of the MicroBit runtime and setup the default configuration, as shown in the Constructor of MicroBit.cpp:

/**
  * Constructor.
  *
  * Create a representation of a MicroBit device, which includes member variables
  * that represent various device drivers used to control aspects of the micro:bit.
  */
MicroBit::MicroBit() :
    serial(USBTX, USBRX),
    resetButton(MICROBIT_PIN_BUTTON_RESET),
    storage(),
    i2c(I2C_SDA0, I2C_SCL0),
    messageBus(),
    display(),
    buttonA(MICROBIT_PIN_BUTTON_A, MICROBIT_ID_BUTTON_A),
    buttonB(MICROBIT_PIN_BUTTON_B, MICROBIT_ID_BUTTON_B),
    buttonAB(MICROBIT_ID_BUTTON_A,MICROBIT_ID_BUTTON_B, MICROBIT_ID_BUTTON_AB),
    accelerometer(i2c),
    compass(i2c, accelerometer, storage),
    compassCalibrator(compass, accelerometer, display),
    thermometer(storage),
    io(MICROBIT_ID_IO_P0,MICROBIT_ID_IO_P1,MICROBIT_ID_IO_P2,
       MICROBIT_ID_IO_P3,MICROBIT_ID_IO_P4,MICROBIT_ID_IO_P5,
       MICROBIT_ID_IO_P6,MICROBIT_ID_IO_P7,MICROBIT_ID_IO_P8,
       MICROBIT_ID_IO_P9,MICROBIT_ID_IO_P10,MICROBIT_ID_IO_P11,
       MICROBIT_ID_IO_P12,MICROBIT_ID_IO_P13,MICROBIT_ID_IO_P14,
       MICROBIT_ID_IO_P15,MICROBIT_ID_IO_P16,MICROBIT_ID_IO_P19,
       MICROBIT_ID_IO_P20),
    bleManager(storage),
    radio(),
    ble(NULL)
{
...
}

mbed-classic

The software at the bottom of the stack is making use of the ARM mbed OS which is:

.. an open-source embedded operating system designed for the “things” in the Internet of Things (IoT). mbed OS includes the features you need to develop a connected product using an ARM Cortex-M microcontroller.

mbed OS provides a platform that includes:

  • Security foundations.
  • Cloud management services.
  • Drivers for sensors, I/O devices and connectivity.

mbed OS is modular, configurable software that you can customize it to your device and to reduce memory requirements by excluding unused software.

We can see this from the layout of it’s source, it’s based around common components, which can be combined with a hal (Hardware Abstraction Layers) and a target specific to the hardware you are running on.

More specifically the micro:bit uses the yotta target bbc-microbit-classic-gcc, but it can also use others targets as needed.

For reference, here are the files from the common section of mbed that are used by the micro:bit-dal:

And here are the hardware specific files, targeting the NORDIC - MCU NRF51822:

End-to-end (or top-to-bottom)

Finally, lets look a few examples of how the different components within the stack are used in specific scenarios

Writing to the Display

Storing files on the Flash memory

  • microbit-dal
  • mbed-classic
    • Allows low-level control of the hardware, such as writing to the flash itself either directly or via the SoftDevice (SD) layer

In addition, this comment from MicroBitStorage.h gives a nice overview of how the file system is implemented on-top of the raw flash storage:

* The first 8 bytes are reserved for the KeyValueStore struct which gives core
* information such as the number of KeyValuePairs in the store, and whether the
* store has been initialised.
*
* After the KeyValueStore struct, KeyValuePairs are arranged contiguously until
* the end of the block used as persistent storage.
*
* |-------8-------|--------48-------|-----|---------48--------|
* | KeyValueStore | KeyValuePair[0] | ... | KeyValuePair[N-1] |
* |---------------|-----------------|-----|-------------------|

Summary

All-in-all the micro:bit is a very nice piece of kit and hopefully will achieve its goal ‘to help develop a new generation of digital pioneers’. However, it also has a really nice software stack, one that is easy to understand and find your way around.

Further Reading

I’ve got nothing to add that isn’t already included in this excellent, comprehensive list of resources, thanks Carlos for putting it together!!

Discuss this post on Hacker News or /r/microbit


Tue, 28 Nov 2017, 12:00 am


Microsoft & Open Source a 'Brave New World' - CORESTART 2.0

Recently I was fortunate enough to be invited to the CORESTART 2.0 conference to give a talk on Microsoft & Open Source a ‘Brave New World’. It was a great conference, well organised by Tomáš Herceg and the teams from .NET College and Riganti and I had a great time.

I encourage you to attend next years ‘Update’ conference if you can and as bonus you’ll get to see the sights of Prague! Including the Head of Franz Kafka as well as the amazing buildings, castles and bridges that all the guide-books will tell you about!

I’ve not been ‘invited’ to speak at a conference before, so I wasn’t sure what to expect, but there was a great audience and they seemed happy to learn about the Open Source projects that Microsoft are running and what is being done to encourage us (the ‘Community’) to contribute.

The slides for my talk are embedded below and you can also ‘watch’ the entire recording (audio and slides only, no video).

Microsoft & open source a 'brave new world' - CORESTART 2.0 from Matt Warren

Talk Outline

But if you don’t fancy sitting through the whole thing, you can read the summary below and jump straight to the relevant parts

Before

[jump to slide] [direct video link]

During

[jump to slide] [direct video link]

After

[jump to slide] [direct video link]

What Now?

[jump to slide] [direct video link]

Domino Chain Reaction

Finally, if you’re wondering what the section on ‘Domino Chain Reaction’ is all about, you’ll have to listen to that part of the talk, but the video itself is embedded below:

(Based on actual research, see The Curious Mathematics of Domino Chain Reactions)


Tue, 14 Nov 2017, 12:00 am


A DoS Attack against the C# Compiler

Generics in C# are certainly very useful and I find it amazing that we almost didn’t get them:

What would the cost of inaction have been? What would the cost of failure have been? No generics in C# 2.0? No LINQ in C# 3.0? No TPL in C# 4.0? No Async in C# 5.0? No F#? Ultimately, an erasure model of generics would have been adopted, as for Java, since the CLR team would never have pursued a in-the-VM generics design without external help.

So a big thanks is due to Don Syme and the rest of the team at Microsoft Research in Cambridge!

But as well as being useful, I also find some usages of generics mind-bending, for instance I’m still not sure what this code actually means or how to explain it in words:

class Blah where T : Blah

As always, reading an Eric Lippert post helps a lot, but even he recommends against using this specific ‘circular’ pattern.

Recently I spoke at the CORESTART 2.0 conference in Prague, giving a talk on ‘Microsoft and Open-Source – A ‘Brave New World’. Whilst I was there I met the very knowledgeable Jiri Cincura, who blogs at tabs ↹ over ␣ ␣ ␣ spaces. He was giving a great talk on ‘C# 7.1 and 7.2 features’, but also shared with me an excellent code snippet that he called ‘Crazy Class’:

class Class
{
    class Inner : Class
    {
        Inner.Inner.Inner.Inner.Inner.Inner.Inner.Inner.Inner inner;
    }
}

He said:

this is the class that takes crazy amount of time to compile. You can add more Inner.Inner.Inner... to make it even longer (and also generic parameters).

After a big of digging around I found that someone else had noticed this, see the StackOverflow question Why does field declaration with duplicated nested type in generic class results in huge source code increase? Helpfully the ‘accepted answer’ explains what is going on:

When you combine these two, the way you have done, something interesting happens. The type Outer.Inner is not the same type as Outer.Inner.Inner. Outer.Inner is a subclass of Outer while Outer.Inner.Inner is a subclass of Outer, which we established before as being different from Outer.Inner. So Outer.Inner.Inner and Outer.Inner are referring to different types.

When generating IL, the compiler always uses fully qualified names for types. You have cleverly found a way to refer to types with names whose lengths that grow at exponential rates. That is why as you increase the generic arity of Outer or add additional levels .Y to the field field in Inner the output IL size and compile time grow so quickly.

Clear? Good!!

You probably have to be Jon Skeet, Eric Lippert or a member of the C# Language Design Team (yay, ‘Matt Warren’) to really understand what’s going on here, but that doesn’t stop the rest of us having fun with the code!!

I can’t think of any reason why you’d actually want to write code like this, so please don’t!! (or at least if you do, don’t blame me!!)

For a simple idea of what’s actually happening, lets take this code (with only 2 ‘Levels’):

class Class
{
    class Inner : Class
    {
        Inner.Inner inner;
    }
}

The ‘decompiled’ version actually looks like this:

internal class Class
{
    private class Inner : Class
    {
        private Class.Inner inner;
    }
}

Wow, no wonder things go wrong quickly!!

Exponential Growth

Firstly let’s check the claim of exponential growth, if you don’t remember your Big O notation you can also think of this as O(very, very bad)!!

To test this out, I’m going to compile the code above, but vary the ‘level’ each time by adding a new .Inner, so ‘Level 5’ looks like this:

Inner.Inner.Inner.Inner.Inner inner;

‘Level 6’ like this, and so on

Inner.Inner.Inner.Inner.Inner.Inner inner;

We then get the following results:

Level Compile Time (secs) Working set (KB) Binary Size (Bytes) 5 1.15 54,288 135,680 6 1.22 59,500 788,992 7 2.00 70,728 4,707,840 8 6.43 121,852 28,222,464 9 33.23 405,472 169,310,208 10 202.10 2,141,272 CRASH

If we look at these results in graphical form, it’s very obvious what’s going on

(the dotted lines are a ‘best fit’ trend-line and they are exponential)

If I compile the code with dotnet build (version 2.0.0), things go really wrong at ‘Level 10’ and the compiler throws an error (full stack trace):

System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values.

Which looks similar to Internal compiler error when creating Portable PDB files #3866.

However your mileage may vary, when I ran the code in Visual Studio 2015 it threw an OutOfMemoryException instead and then promptly restarted itself!! I assume this is because VS is a 32-bit application and it runs out of memory before it can go really wrong!

Mono Compiler

As a comparison, here are the results from the Mono compiler, thanks to Egor Bogatov for putting them together.

Level Compile Time (secs) Memory Usage (Bytes) 5 0.480 134,144 6 0.502 786,944 7 0.745 4,706,304 8 2.053 28,220,928 9 10.134 169,308,672 10 57.307 1,015,835,136

At ‘Level 10’ it produced a 968.78 Mb binary!!

Profiling the Compiler

Finally, I want to look at just where the compiler is spending all it’s time. From the results above we saw that it was taking over 3 minutes to compile a simple program, with a peak memory usage of 2.14 GB, so what was it actually doing??

Well clearly there’s lots of Types involved and the Compiler seems happy for you to write this code, so I guess it needs to figure it all out. Once it’s done that, it then needs to write all this Type metadata out to a .dll or .exe, which can be 100’s of MB in size.

At a high-level the profiling summary produce by VS looks like this (click for full-size image):

However if we take a bit of a close look, we can see the ‘hot-path’ is inside the SerializeTypeReference(..) method in Compilers/Core/Portable/PEWriter/MetadataWriter.cs

Summary

I’m a bit torn about this, it is clearly an ‘abuse’ of generics!!

In some ways I think that it shouldn’t be fixed, it seems better that the compiler encourages you to not write code like this, rather than making is possible!!

So if it takes 3 mins to compile your code, allocates 2GB of memory and then crashes, take that as a warning!!

Discuss this post on Hacker News, /r/programming and /r/csharp

The post A DoS Attack against the C# Compiler first appeared on my blog Performance is a Feature!


Wed, 8 Nov 2017, 12:00 am


DotNetAnywhere: An Alternative .NET Runtime

Recently I was listening to the excellent DotNetRocks podcast and they had Steven Sanderson (of Knockout.js fame) talking about ‘WebAssembly and Blazor’.

In case you haven’t heard about it, Blazor is an attempt to bring .NET to the browser, using the magic of WebAssembly. If you want more info, Scott Hanselmen has done a nice write-up of the various .NET/WebAssembly projects.

However, as much as the mention of WebAssembly was pretty cool, what interested me even more how Blazor was using DotNetAnywhere as the underlying .NET runtime. This post will look at what DotNetAnywhere is, what you can do with it and how it compares to the full .NET framework.

DotNetAnywhere

Firstly it’s worth pointing out that DotNetAnywhere (DNA) is designed to be a fully compliant .NET runtime, which means that it can run .NET dlls/exes that have been compiled to run against the full framework. On top of that (at least in theory) it supports all the following .NET runtime features, which is a pretty impressive list!

  • Generics
  • Garbage collection and finalization
  • Weak references
  • Full exception handling - try/catch/finally
  • PInvoke
  • Interfaces
  • Delegates
  • Events
  • Nullable types
  • Single-dimensional arrays
  • Multi-threading

In addition there is some partial support for Reflection

  • Very limited read-only reflection
    • typeof(), .GetType(), Type.Name, Type.Namespace, Type.IsEnum(), .ToString() only

Finally, there are a few features that are currently unsupported:

  • Attributes
  • Most reflection
  • Multi-dimensional arrays
  • Unsafe code

There are various bugs or missing functionality that might prevent your code running under DotNetAnywhere, however several of these have been fixed since Blazor came along, so it’s worth checking against the Blazor version of DotNetAnywhere.

At this point in time the original DotNetAnywhere repo is no longer active (the last sustained activity was in Jan 2012), so it seems that any future development or bugs fixes will likely happen in the Blazor repo. If you have ever fixed something in DotNetAnywhere, consider sending a P.R there, to help the effort.

Update: In addition there are other forks with various bug fixes and enhancements:

Source Code Layout

What I find most impressive about the DotNetAnywhere runtime is that it was developed by one person and is less that 40,000 lines of code!! For a comparison the .NET framework Garbage Collector is almost 37,000 lines on it’s own (more info available in my previous post A Hitchhikers Guide to the CoreCLR Source Code).

This makes DotNetAnywhere an ideal learning resource!

Firstly, lets take a look at the Top-10 largest source files, to see where the complexity is:

Native Code - 17,710 lines in total

LOC File 3,164 JIT_Execute.c 1,778 JIT.c 1,109 PInvoke_CaseCode.h 630 Heap.c 618 MetaData.c 563 MetaDataTables.h 517 Type.c 491 MetaData_Fill.c 467 MetaData_Search.c 452 JIT_OpCodes.h

Managed Code - 28,783 lines in total

LOC File 2393 corlib/System.Globalization/CalendricalCalculations.cs 2314 corlib/System/NumberFormatter.cs 1582 System.Drawing/System.Drawing/Pens.cs 1443 System.Drawing/System.Drawing/Brushes.cs 1405 System.Core/System.Linq/Enumerable.cs 745 corlib/System/DateTime.cs 693 corlib/System.IO/Path.cs 632 corlib/System.Collections.Generic/Dictionary.cs 598 corlib/System/String.cs 467 corlib/System.Text/StringBuilder.cs

Main areas of functionality

Next, lets look at the key components in DotNetAnywhere as this gives us a really good idea about what you need to implement a .NET compatible runtime. Along the way, we will also see how they differ from the implementation found in Microsoft’s .NET Framework.

Reading .NET dlls

The first thing DotNetAnywhere has to do is read/understand/parse the .NET Metadata and Code that’s contained in a .dll/.exe. This all takes place in MetaData.c, primarily within the LoadSingleTable(..) function. By adding some debugging code, I was able to get a summary of all the different types of Metadata that are read in from a typical .NET dll, it’s quite an interesting list:

MetaData contains     1 Assemblies (MD_TABLE_ASSEMBLY)
MetaData contains     1 Assembly References (MD_TABLE_ASSEMBLYREF)
MetaData contains     0 Module References (MD_TABLE_MODULEREF)

MetaData contains    40 Type References (MD_TABLE_TYPEREF)
MetaData contains    13 Type Definitions (MD_TABLE_TYPEDEF)
MetaData contains    14 Type Specifications (MD_TABLE_TYPESPEC)
MetaData contains     5 Nested Classes (MD_TABLE_NESTEDCLASS)

MetaData contains    11 Field Definitions (MD_TABLE_FIELDDEF)
MetaData contains     0 Field RVA's (MD_TABLE_FIELDRVA)
MetaData contains     2 Propeties (MD_TABLE_PROPERTY)
MetaData contains    59 Member References (MD_TABLE_MEMBERREF)
MetaData contains     2 Constants (MD_TABLE_CONSTANT)

MetaData contains    35 Method Definitions (MD_TABLE_METHODDEF)
MetaData contains     5 Method Specifications (MD_TABLE_METHODSPEC)
MetaData contains     4 Method Semantics (MD_TABLE_PROPERTY)
MetaData contains     0 Method Implementations (MD_TABLE_METHODIMPL)
MetaData contains    22 Parameters (MD_TABLE_PARAM)

MetaData contains     2 Interface Implementations (MD_TABLE_INTERFACEIMPL)
MetaData contains     0 Implementation Maps? (MD_TABLE_IMPLMAP)

MetaData contains     2 Generic Parameters (MD_TABLE_GENERICPARAM)
MetaData contains     1 Generic Parameter Constraints (MD_TABLE_GENERICPARAMCONSTRAINT)

MetaData contains    22 Custom Attributes (MD_TABLE_CUSTOMATTRIBUTE)
MetaData contains     0 Security Info Items? (MD_TABLE_DECLSECURITY)

For more information on the Metadata see Introduction to CLR metadata, Anatomy of a .NET Assembly – PE Headers and the ECMA specification itself.

Executing .NET IL

Another large piece of functionality within DotNetAnywhere is the ‘Just-in-Time’ Compiler (JIT), i.e. the code that is responsible for executing the IL, this takes place initially in JIT_Execute.c and then JIT.c. The main ‘execution loop’ is in the JITit(..) function which contains an impressive 1,374 lines of code and over 200 case statements within a single switch!!

Taking a higher level view, the overall process that it goes through looks like this:

Where the .NET IL Op-Codes (CIL_XXX) are defined in CIL_OpCodes.h and the DotNetAnywhere JIT Op-Codes (JIT_XXX) are defined in JIT_OpCodes.h

Interesting enough, the JIT is the only place in DotNetAnywhere that uses assembly code and even then it’s only for win32. It is used to allow a ‘jump’ or a goto to labels in the C source code, so as IL instructions are executed it never actually leaves the JITit(..) function, control is just moved around without having to make a full method call.

#ifdef __GNUC__

#define GET_LABEL(var, label) var = &&label

#define GO_NEXT() goto **(void**)(pCurOp++)

#else
#ifdef WIN32

#define GET_LABEL(var, label) \
	{ __asm mov edi, label \
	__asm mov var, edi }

#define GO_NEXT() \
	{ __asm mov edi, pCurOp \
	__asm add edi, 4 \
	__asm mov pCurOp, edi \
	__asm jmp DWORD PTR [edi - 4] }

#endif

Differences with the .NET Framework

In the full .NET framework all IL code is turned into machine code by the Just-in-Time Compiler (JIT) before being executed by the CPU.

However as we’ve already seen, DotNetAnywhere ‘interprets’ the IL, instruction-by-instruction and even through it’s done in a file called JIT.c no machine code is emitted, so the naming seems strange!?

Maybe it’s just a difference of perspective, but it’s not clear to me at what point you move from ‘interpreting’ code to ‘JITting’ it, even after reading the following links I’m not sure!! (can someone enlighten me?)

Garbage Collector

All the code for the DotNetAnywhere Garbage Collector (GC) is contained in Heap.c and is a very readable 600 lines of code. To give you an overview of what it does, here is the list of functions that it exposes:

void Heap_Init();
void Heap_SetRoots(tHeapRoots *pHeapRoots, void *pRoots, U32 sizeInBytes);
void Heap_UnmarkFinalizer(HEAP_PTR heapPtr);
void Heap_GarbageCollect();
U32 Heap_NumCollections();
U32 Heap_GetTotalMemory();

HEAP_PTR Heap_Alloc(tMD_TypeDef *pTypeDef, U32 size);
HEAP_PTR Heap_AllocType(tMD_TypeDef *pTypeDef);
void Heap_MakeUndeletable(HEAP_PTR heapEntry);
void Heap_MakeDeletable(HEAP_PTR heapEntry);

tMD_TypeDef* Heap_GetType(HEAP_PTR heapEntry);

HEAP_PTR Heap_Box(tMD_TypeDef *pType, PTR pMem);
HEAP_PTR Heap_Clone(HEAP_PTR obj);

U32 Heap_SyncTryEnter(HEAP_PTR obj);
U32 Heap_SyncExit(HEAP_PTR obj);

HEAP_PTR Heap_SetWeakRefTarget(HEAP_PTR target, HEAP_PTR weakRef);
HEAP_PTR* Heap_GetWeakRefAddress(HEAP_PTR target);
void Heap_RemovedWeakRefTarget(HEAP_PTR target);

Differences with the .NET Framework

However, like the JIT/Interpreter, the GC has some fundamental differences when compared to the .NET Framework

Conservative Garbage Collection

Firstly DotNetAnywhere implements what is knows as a Conservative GC. In simple terms this means that is does not know (for sure) which areas of memory are actually references/pointers to objects and which are just a random number (that looks like a memory address). In the Microsoft .NET Framework the JIT calculates this information and stores it in the GCInfo structure so the GC can make use of it. But DotNetAnywhere doesn’t do this.

Instead, during the Mark phase the GC gets all the available ‘roots’, but it will consider all memory addresses within an object as ‘potential’ references (hence it is ‘conservative’). It then has to lookup each possible reference, to see if it really points to an ‘object reference’. It does this by keeping track of all memory/heap references in a balanced binary search tree (ordered by memory address), which looks something like this:

However, this means that all objects references have to be stored in the binary tree when they are allocated, which adds some overhead to allocation. In addition extra memory is needed, 20 bytes per heap entry. We can see this by looking at the tHeapEntry data structure (all pointers are 4 bytes, U8 = 1 byte and padding is ignored), tHeapEntry *pLink[2] is the extra data that is needed just to enable the binary tree lookup.

struct tHeapEntry_ {
    // Left/right links in the heap binary tree
    tHeapEntry *pLink[2];
    // The 'level' of this node. Leaf nodes have lowest level
    U8 level;
    // Used to mark that this node is still in use.
    // If this is set to 0xff, then this heap entry is undeletable.
    U8 marked;
    // Set to 1 if the Finalizer needs to be run.
    // Set to 2 if this has been added to the Finalizer queue
    // Set to 0 when the Finalizer has been run (or there is no Finalizer in the first place)
    // Only set on types that have a Finalizer
    U8 needToFinalize;
    
    // unused
    U8 padding;

    // The type in this heap entry
    tMD_TypeDef *pTypeDef;

    // Used for locking sync, and tracking WeakReference that point to this object
    tSync *pSync;

    // The user memory
    U8 memory[0];
};

But why does DotNetAnywhere work like this? Fortunately Chris Bacon the author of DotNetAnywhere explains

Mind you, the whole heap code really needs a rewrite to reduce per-object memory overhead, and to remove the need for the binary tree of allocations. Not really thinking of a generational GC, that would probably add to much code. This was something I vaguely intended to do, but never got around to. The current heap code was just the simplest thing to get GC working quickly. The very initial implementation did no GC at all. It was beautifully fast, but ran out of memory rather too quickly.

For more info on ‘Conservative’ and ‘Precise’ GCs see:

GC only does ‘Mark-Sweep’, it doesn’t Compact

Another area in which the GC behaviour differs is that it doesn’t do any Compaction of memory after it’s cleaned up, as Steve Sanderson found out when working on Blazor

.. During server-side execution we don’t actually need to pin anything, because there’s no interop outside .NET. During client-side execution, everything is (in effect) pinned regardless, because DNA’s GC only does mark-sweep - it doesn’t have any compaction phase.

In addition, when an object is allocated DotNetAnywhere just makes a call to malloc(), see the code that does this is in the Heap_Alloc(..) function. So there is no concept of ‘Generations’ or ‘Segments’ that you have in the .NET Framework GC, i.e. no ‘Gen 0’, ‘Gen 1’, or ‘Large Object Heap’.

Threading Model

Finally, lets take a look at the threading model, which is fundamentally different from the one found in the .NET Framework.

Differences with the .NET Framework

Whilst DotNetAnywhere will happily create new threads and execute them for you, it’s only providing the illusion of true multi-threading. In reality it only runs on one thread, but context switches between the different threads that your program creates:

You can see this in action in the code below, (from the Thread_Execute() function), note the call to JIT_Execute(..) with numInst set to 100:

for (;;) {
    U32 minSleepTime = 0xffffffff;
    I32 threadExitValue;

    status = JIT_Execute(pThread, 100);
    switch (status) {
        ....
    }
}

An interesting side-effect is that the threading code in the DotNetAnywhere corlib implementation is really simple. For instance the internal implementation of the Interlocked.CompareExchange() function looks like the following, note the lack of synchronisation that you would normally expect:

tAsyncCall* System_Threading_Interlocked_CompareExchange_Int32(
            PTR pThis_, PTR pParams, PTR pReturnValue) {
    U32 *pLoc = INTERNALCALL_PARAM(0, U32*);
    U32 value = INTERNALCALL_PARAM(4, U32);
    U32 comparand = INTERNALCALL_PARAM(8, U32);

    *(U32*)pReturnValue = *pLoc;
    if (*pLoc == comparand) {
        *pLoc = value;
    }

    return NULL;
}

Benchmarks

As a simple test, I ran some benchmarks from The Computer Language Benchmarks Game - binary-trees, using the simplest C# version

Note: DotNetAnywhere was designed to run on low-memory devices, so it was not meant to have the same performance as the full .NET Framework. Please bear that in mind when looking at the results!!

.NET Framework, 4.6.1 - 0.36 seconds

Invoked=TestApp.exe 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

Exit code      : 0
Elapsed time   : 0.36
Kernel time    : 0.06 (17.2%)
User time      : 0.16 (43.1%)
page fault #   : 6604
Working set    : 25720 KB
Paged pool     : 187 KB
Non-paged pool : 24 KB
Page file size : 31160 KB

DotNetAnywhere - 54.39 seconds

Invoked=dna TestApp.exe 15
stretch tree of depth 16         check: 131071
32768    trees of depth 4        check: 1015808
8192     trees of depth 6        check: 1040384
2048     trees of depth 8        check: 1046528
512      trees of depth 10       check: 1048064
128      trees of depth 12       check: 1048448
32       trees of depth 14       check: 1048544
long lived tree of depth 15      check: 65535

Total execution time = 54288.33 ms
Total GC time = 36857.03 ms
Exit code      : 0
Elapsed time   : 54.39
Kernel time    : 0.02 (0.0%)
User time      : 54.15 (99.6%)
page fault #   : 5699
Working set    : 15548 KB
Paged pool     : 105 KB
Non-paged pool : 8 KB
Page file size : 13144 KB

So clearly DotNetAnywhere doesn’t work as fast in this benchmark (0.36 seconds v 54 seconds). However if we look at other benchmarks from the same site, it performs a lot better. It seems that DotNetAnywhere has a significant overhead when allocating objects (a class), which is less obvious when using structs.

  Benchmark 1 (using classes) Benchmark 2 (using structs) Elapsed Time (secs) 3.1 2.0 GC Collections 96 67 Total GC time (msecs) 983.59 439.73

Finally, I really want to thank Chris Bacon, DotNetAnywhere is a great code base and gives a fantastic insight into what needs to happen for a .NET runtime to work.

Discuss this post on Hacker News and /r/programming

The post DotNetAnywhere: An Alternative .NET Runtime first appeared on my blog Performance is a Feature!


Thu, 19 Oct 2017, 12:00 am


Analysing C# code on GitHub with BigQuery

Just over a year ago Google made all the open source code on GitHub available for querying within BigQuery and as if that wasn’t enough you can run a terabyte of queries each month for free!

So in this post I am going to be looking at all the C# source code on GitHub and what we can find out from it. Handily a smaller, C# only, dataset has been made available (in BigQuery you are charged per byte read), called fh-bigquery:github_extracts.contents_net_cs and has

  • 5,885,933 unique ‘.cs’ files
  • 792,166,632 lines of code (LOC)
  • 37.17 GB of data

Which is a pretty comprehensive set of C# source code!

The rest of this post will attempt to answer the following questions:

  1. Tabs or Spaces?
  2. regions: ‘should be banned’ or ‘okay in some cases’?
  3. ‘K&R’ or ‘Allman’, where do C# devs like to put their braces?
  4. Do C# developers like writing functional code?

Then moving onto some less controversial C# topics:

  1. Which using statements are most widely used?
  2. What NuGet packages are most often included in a .NET project
  3. How many lines of code (LOC) are in a typical C# file?
  4. What is the most widely thrown Exception?
  5. ‘async/await all the things’ or not?
  6. Do C# developers like using the var keyword? (Updated)

Before we end up looking at repositories, not just individual C# files:

  1. What is the most popular repository with C# code in it?
  2. Just how many files should you have in a repository?
  3. What are the most popular C# class names?
  4. ‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?

If you want to try the queries for yourself (or find my mistakes), all of them are available in this gist. There’s a good chance that my regular expressions miss out some edge-cases, after all Regular Expressions: Now You Have Two Problems:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Tabs or Spaces?

In the entire data-set there are 5,885,933 files, but here we only include ones that have more than 10 lines starting with a tab or a space

Tabs Tabs % Spaces Spaces % Total 799,055 17.15% 3,859,528 82.85% 4,658,583

Clearly, C# developers (on GitHub) prefer Spaces over Tabs, let the endless debates continue!! (I think some of this can be explained by the fact that Visual Studio uses ‘spaces’ by default)

If you want to see how C# compares to other programming languages, take a look at 400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?.

regions: ‘should be banned’ or ‘okay in some cases’?

It turns out that there are an impressive 712,498 C# files (out of 5.8 million) that contain at least one #region statement (query used), that’s just over 12%. (I’m hoping that a lot of those files have been auto-generated by a tool!)

‘K&R’ or ‘Allman’, where do C# devs like to put their braces?

C# developers overwhelmingly prefer putting an opening brace { on it’s own line (query used)

separate line same line same line (initializer)   total (with brace) total (all code) 81,306,320 (67%) 40,044,603 (33%) 3,631,947 (2.99%)   121,350,923 (15.32%) 792,166,632

(‘same line initializers’ include code like new { Name = "", .. }, new [] { 1, 2, 3.. })

Do C# developers like writing functional code?

This is slightly unscientific, but I wanted to see how widely the Lambda Operator => is used in C# code (query). Yes, I know, if you want to write functional code on .NET you really should use F#, but C# has become more ‘functional’ over the years and I wanted to see how much code was taking advantage of that.

Here’s the raw percentiles:

Percentile % of lines using lambdas 10 0.51 25 1.14 50 2.50 75 5.26 90 9.95 95 14.29 99 28.00

So we can say that:

  • 50% of all the C# code on GitHub uses => on 2.44% (or less) of their lines.
  • 10% of all C# files have lambdas on almost 1 in 10 of their lines (9.95%)
  • 5% use => on 1 in 7 lines (14.29%)
  • 1% of files have lambdas on over 1 in 3 lines (28%) of their lines of code, that’s pretty impressive!

Which using statements are most widely used?

Now on to some a bit more substantial, what are the most widely used using statements in C# code?

The top 10 looks like this (the full results are available):

using statement count using System.Collections.Generic; 1,780,646 using System; 1,477,019 using System.Linq; 1,319,830 using System.Text; 902,165 using System.Threading.Tasks; 628,195 using System.Runtime.InteropServices; 431,867 using System.IO; 407,848 using System.Runtime.CompilerServices; 338,686 using System.Collections; 289,867 using System.Reflection; 218,369

However, as was pointed out, the top 5 are included by default when you add a new file in Visual Studio and many people wouldn’t remove them. The same applies to ‘System.Runtime.InteropServices’ and ‘System.Runtime.CompilerServices’ which are include in ‘AssemblyInfo.cs` by default.

So if we adjust the list to take account of this, the top 10 looks like so:

using statement count using System.IO; 407,848 using System.Collections; 289,867 using System.Reflection; 218,369 using System.Diagnostics; 201,341 using System.Threading; 179,168 using System.ComponentModel; 160,681 using System.Web; 160,323 using System.Windows.Forms; 137,003 using System.Globalization; 132,113 using System.Drawing; 127,033

Finally, an interesting list is the top 10 using statements that aren’t System, Microsoft or Windows namespaces:

using statement count using NUnit.Framework; 119,463 using UnityEngine; 117,673 using Xunit; 99,099 using Newtonsoft.Json; 81,675 using Newtonsoft.Json.Linq; 29,416 using Moq; 23,546 using UnityEngine.UI; 20,355 using UnityEditor; 19,937 using Amazon.Runtime; 18,941 using log4net; 17,297

What NuGet packages are most often included in a .NET project?

It turns out that there is also a separate dataset containing all the ‘packages.config’ files on GitHub, it’s called contents_net_packages_config and has 104,808 entries. By querying this we can see that Json.Net is the clear winner!!

package count Newtonsoft.Json 45,055 Microsoft.Web.Infrastructure 16,022 Microsoft.AspNet.Razor 15,109 Microsoft.AspNet.WebPages 14,495 Microsoft.AspNet.Mvc 14,236 EntityFramework 14,191 Microsoft.AspNet.WebApi.Client 13,480 Microsoft.AspNet.WebApi.Core 12,210 Microsoft.Net.Http 11,625 jQuery 10,646 Microsoft.Bcl.Build 10,641 Microsoft.Bcl 10,349 NUnit 10,341 Owin 9,681 Microsoft.Owin 9,202 Microsoft.AspNet.WebApi.WebHost 9,007 WebGrease 8,743 Microsoft.AspNet.Web.Optimization 8,721 Microsoft.AspNet.WebApi 8,179

How many lines of code (LOC) are in a typical C# file?

Are C# developers prone to creating huge files that go one for 1000’s of lines? Well some are but fortunately it’s the minority of us!!

Note the Y-axis is ‘lines of code’ and is logarithmic, the raw data is available.

Oh dear, Uncle Bob isn’t going to be happy, whilst 96% of the files have 509 LOC of less, the other 4% don’t!! From Clean Code:

And in case you’re wondering, here’s the Top 10 longest C# files!!

File Lines MarMot/Input/test.marmot.cs 92663 src/CodenameGenerator/WordRepos/LastNamesRepository.cs 88810 cs_inputtest/cs_02_7000.cs 63004 cs_inputtest/cs_02_6000.cs 54004 src/ML NET20/Utility/UserName.cs 52014 MWBS/Dictionary/DefaultWordDictionary.cs 48912 Sources/Accord.Math/Matrix/Matrix.Comparisons1.Generated.cs 48407 UrduProofReader/UrduLibs/Utils.cs 48255 cs_inputtest/cs_02_5000.cs 45004 css/style.cs 44366

What is the most widely thrown Exception?

There’s a few interesting results in this query, for instance who knew that so many ApplicationExceptions were thrown and NotSupportedException being so high up the list is a bit worrying!!

Exception count throw new ArgumentNullException 699,526 throw new ArgumentException 361,616 throw new NotImplementedException 340,361 throw new InvalidOperationException 260,792 throw new ArgumentOutOfRangeException 160,640 throw new NotSupportedException 110,019 throw new HttpResponseException 74,498 throw new ValidationException 35,615 throw new ObjectDisposedException 31,129 throw new ApplicationException 30,849 throw new UnauthorizedException 21,133 throw new FormatException 19,510 throw new SerializationException 17,884 throw new IOException 15,779 throw new IndexOutOfRangeException 14,778 throw new NullReferenceException 12,372 throw new InvalidDataException 12,260 throw new ApiException 11,660 throw new InvalidCastException 10,510

‘async/await all the things’ or not?

The addition of the async and await keywords to the C# language makes writing asynchronous code much easier:

public async Task GetDotNetCountAsync()
{
    // Suspends GetDotNetCountAsync() to allow the caller (the web server)
    // to accept another request, rather than blocking on this one.
    var html = await _httpClient.DownloadStringAsync("http://dotnetfoundation.org");

    return Regex.Matches(html, ".NET").Count;
}

But how much is it used? Using the query below:

SELECT Count(*) count
FROM
  [fh-bigquery:github_extracts.contents_net_cs]
WHERE
  REGEXP_MATCH(content, r'\sasync\s|\sawait\s')

I found that there are 218,643 files (out of 5,885,933) that have at least one usage of async or await in them.

Do C# developers like using the var keyword?

Less that they use async and await, there are 130,590 files that have at least one usage of the var keyword

Update: thanks for jairbubbles for pointing out that my var regex was wrong and supplying a fixed version!

More than they use async and await, there are 1,457,154 files that have at least one usage of the var keyword

Just how many files should you have in a repository?

90% of the repositories (that have any C# files) have 95 files or less. 95% have 170 files or less and 99% have 535 files or less.

(again the Y-axis (# files) is logarithmic)

The top 10 largest repositories, by number of C# files are shown below:

Repository # Files https://github.com/xen2/mcs 23389 https://github.com/mater06/LEGOChimaOnlineReloaded 14241 https://github.com/Microsoft/referencesource 13051 https://github.com/dotnet/corefx 10652 https://github.com/apo-j/Projects_Working 10185 https://github.com/Microsoft/CodeContracts 9338 https://github.com/drazenzadravec/nequeo 8060 https://github.com/ClearCanvas/ClearCanvas 7946 https://github.com/mwilliamson-firefly/aws-sdk-net 7860 https://github.com/151706061/MacroMedicalSystem 7765

This time we are going to look at the most popular repositories (based on GitHub ‘stars’) that contain at least 50 C# files (query used):

repo stars files https://github.com/grpc/grpc 11075 237 https://github.com/dotnet/coreclr 8576 6503 https://github.com/dotnet/roslyn 8422 6351 https://github.com/facebook/yoga 8046 73 https://github.com/bazelbuild/bazel 7123 132 https://github.com/dotnet/corefx 7115 10652 https://github.com/SeleniumHQ/selenium 7024 512 https://github.com/Microsoft/WinObjC 6184 81 https://github.com/qianlifeng/Wox 5674 207 https://github.com/Wox-launcher/Wox 5674 142 https://github.com/ShareX/ShareX 5336 766 https://github.com/Microsoft/Windows-universal-samples 5130 1501 https://github.com/NancyFx/Nancy 3701 957 https://github.com/chocolatey/choco 3432 248 https://github.com/JamesNK/Newtonsoft.Json 3340 650

Interesting that the top spot is a Google Repository! (the C# files in it are sample code for using the GRPC library from .NET)

Assuming that I got the regex correct, the most popular C# class names are the following:

Class name Count class C 182480 class Program 163462 class Test 50593 class Settings 40841 class Resources 39345 class A 34687 class App 28462 class B 24246 class Startup 18238 class Foo 15198

Yay for Foo, just sneaking into the Top 10!!

‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?

Finally lets look at the different class names used, as with the using statement they are dominated by the default ones used in the Visual Studio templates:

File Count AssemblyInfo.cs 386822 Program.cs 105280 Resources.Designer.cs 40881 Settings.Designer.cs 35392 App.xaml.cs 21928 Global.asax.cs 16133 Startup.cs 14564 HomeController.cs 13574 RouteConfig.cs 11278 MainWindow.xaml.cs 11169

Discuss this post on Hacker News and /r/csharp

More Information

As always, if you’ve read this far your present is yet more blog posts to read, enjoy!!

How BigQuery Works (only put in at the end of the blog post)

BigQuery analysis of other Programming Languages

The post Analysing C# code on GitHub with BigQuery first appeared on my blog Performance is a Feature!


Thu, 12 Oct 2017, 12:00 am


A look at the internals of 'boxing' in the CLR

It’s a fundamental part of .NET and can often happen without you knowing, but how does it actually work? What is the .NET Runtime doing to make boxing possible?

Note: this post won’t be discussing how to detect boxing, how it can affect performance or how to remove it (speak to Ben Adams about that!). It will only be talking about how it works.

As an aside, if you like reading about CLR internals you may find these other posts interesting:

Boxing in the CLR Specification

Firstly it’s worth pointing out that boxing is mandated by the CLR specification ‘ECMA-335’, so the runtime has to provide it:

This means that there are a few key things that the CLR needs to take care of, which we will explore in the rest of this post.

Creating a ‘boxed’ Type

The first thing that the runtime needs to do is create the corresponding reference type (‘boxed type’) for any struct that it loads. You can see this in action, right at the beginning of the ‘Method Table’ creation where it first checks if it’s dealing with a ‘Value Type’, then behaves accordingly. So the ‘boxed type’ for any struct is created up front, when your .dll is imported, then it’s ready to be used by any ‘boxing’ that happens during program execution.

The comment in the linked code is pretty interesting, as it reveals some of the low-level details the runtime has to deal with:

// Check to see if the class is a valuetype; but we don't want to mark System.Enum
// as a ValueType. To accomplish this, the check takes advantage of the fact
// that System.ValueType and System.Enum are loaded one immediately after the
// other in that order, and so if the parent MethodTable is System.ValueType and
// the System.Enum MethodTable is unset, then we must be building System.Enum and
// so we don't mark it as a ValueType.

CPU-specific code-generation

But to see what happens during program execution, let’s start with a simple C# program. The code below creates a custom struct or Value Type, which is then ‘boxed’ and ‘unboxed’:

public struct MyStruct
{
    public int Value;
}

var myStruct = new MyStruct();

// boxing
var boxed = (object)myStruct;

// unboxing
var unboxed = (MyStruct)boxed;

This gets turned into the following IL code, in which you can see the box and unbox.any IL instructions:

L_0000: ldloca.s myStruct
L_0002: initobj TestNamespace.MyStruct
L_0008: ldloc.0 
L_0009: box TestNamespace.MyStruct
L_000e: stloc.1 
L_000f: ldloc.1 
L_0010: unbox.any TestNamespace.MyStruct

Runtime and JIT code

So what does the JIT do with these IL op codes? Well in the normal case it wires up and then inlines the optimised, hand-written, assembly code versions of the ‘JIT Helper Methods’ provided by the runtime. The links below take you to the relevant lines of code in the CoreCLR source:

Interesting enough, the only other ‘JIT Helper Methods’ that get this special treatment are object, string or array allocations, which goes to show just how performance sensitive boxing is.

In comparison, there is only one helper method for ‘unboxing’, called JIT_Unbox(..), which falls back to JIT_Unbox_Helper(..) in the uncommon case and is wired up here (CORINFO_HELP_UNBOX to JIT_Unbox). The JIT will also inline the helper call in the common case, to save the cost of a method call, see Compiler::impImportBlockCode(..).

Note that the ‘unbox helper’ only fetches a reference/pointer to the ‘boxed’ data, it has to then be put onto the stack. As we saw above, when the C# compiler does unboxing it uses the ‘Unbox_Any’ op-code not just the ‘Unbox’ one, see Unboxing does not create a copy of the value for more information.

Unboxing Stub Creation

As well as ‘boxing’ and ‘unboxing’ a struct, the runtime also needs to help out during the time that a type remains ‘boxed’. To see why, let’s extend MyStruct and override the ToString() method, so that it displays the current Value:

public struct MyStruct
{
    public int Value;
	
    public override string ToString()
    {
        return "Value = " + Value.ToString();
    }
}

Now, if we look at the ‘Method Table’ the runtime creates for the boxed version of MyStruct (remember, value types have no ‘Method Table’), we can see something strange going on. Note that there are 2 entries for MyStruct::ToString, one of which I’ve labelled as an ‘Unboxing Stub’

 Method table summary for 'MyStruct':
 Number of static fields: 0
 Number of instance fields: 1
 Number of static obj ref fields: 0
 Number of static boxed fields: 0
 Number of declared fields: 1
 Number of declared methods: 1
 Number of declared non-abstract methods: 1
 Vtable (with interface dupes) for 'MyStruct':
   Total duplicate slots = 0

 SD: MT::MethodIterator created for MyStruct (TestNamespace.MyStruct).
   slot  0: MyStruct::ToString  0x000007FE41170C10 (slot =  0) (Unboxing Stub)
   slot  1: System.ValueType::Equals  0x000007FEC1194078 (slot =  1) 
   slot  2: System.ValueType::GetHashCode  0x000007FEC1194080 (slot =  2) 
   slot  3: System.Object::Finalize  0x000007FEC14A30E0 (slot =  3) 
   slot  5: MyStruct::ToString  0x000007FE41170C18 (slot =  4)

Wed, 2 Aug 2017, 12:00 am


Memory Usage Inside the CLR

Have you ever wondered where and why the .NET Runtime (CLR) allocates memory? I don’t mean the ‘managed’ memory that your code allocates, e.g. via new MyClass(..) and the Garbage Collector (GC) then cleans up. I mean the memory that the CLR itself allocates, all the internal data structures that it needs to make is possible for your code to run.

Note just to clarify, this post will not be telling you how you can analyse the memory usage of your code, for that I recommend using one of the excellent .NET Profilers available such as dotMemory by JetBrains or the ANTS Memory Profiler from Redgate (I’ve personally used both and they’re great)

The high-level view

Fortunately there’s a fantastic tool that makes it very easy for us to get an overview of memory usage within the CLR itself. It’s called VMMap and it’s part of the excellent Sysinternals Suite.

For the post I will just be using a simple HelloWorld program, so that we can observe what the CLR does in the simplest possible scenario, obviously things may look a bit different in a more complex app.

Firstly, lets look at the data over time, in 1 second intervals. The HelloWorld program just prints to the Console and then waits until you press , so once the memory usage has reached it’s peak it remains there till the program exits. (Click for a larger version)

However, to get a more detailed view, we will now look at the snapshot from 2 seconds into the timeline, when the memory usage has stabilised.

Note: If you want to find out more about memory usage in general, but also specifically how measure it in .NET applications, I recommend reading this excellent series of posts by Sasha Goldshtein

Also, if like me you always get the different types of memory mixed-up, please read this Stackoverflow answer first What is private bytes, virtual bytes, working set?

‘Image’ Memory

Now we’ve seen the high-level view, lets take a close look at the individual chucks, the largest of which is labelled Image, which according to the VMMap help page (see here for all info on all memory types):

… represents an executable file such as a .exe or .dll and has been loaded into a process by the image loader. It does not include images mapped as data files, which would be included in the Mapped File memory type. Image mappings can include shareable memory like code. When data regions, like initialized data, is modified, additional private memory is created in the process.

At this point, it’s worth pointing out a few things:

  1. This memory is takes up a large amount of the total process memory because I’m using a simple HelloWorld program, in other types of programs it wouldn’t dominate the memory usage as much
  2. I was using a DEBUG version of the CoreCLR, so the CLR specific files System.Private.CoreLib.dll, coreclr.dll, clrjit.dll and CoreRun.exe may well be larger than if they were compiled in RELEASE mode
  3. Some of this memory is potentially ‘shared’ with other processes, compare the numbers in the ‘Total WS’, ‘Private WS’, ‘Shareable WS’ and ‘Shared WS’ columns to see this in action.

‘Managed Heaps’ created by the Garbage Collector

The next largest usage of memory is the GC itself, it pre-allocates several heaps that it can then give out whenever your program allocates an object, for example via code such as new MyClass() or new byte[].

The main thing to note about the image above is that you can clearly see the different heap, there is 256 MB allocated for Generations (Gen 0, 1, 2) and 128 MB for the ‘Large Object Heap’. In addition, note the difference between the amounts in the Size and the Committed columns. Only the Committed memory is actually being used, the total Size is what the GC pre-allocates or reserves up front from the address space.

If you’re interested, the rules for heap or more specifically segment sizes are helpfully explained in the Microsoft Docs, but simply put, it varies depending on the GC mode (Workstation v Server), whether the process is 32/64-bit and ‘Number of CPUs’.

Internal CLR ‘Heap’ memory

However the part that I’m going to look at for the rest of this post is the memory that is allocated by the CLR itself, that is unmanaged memory that is uses for all its internal data structures.

But if we just look at the VMMap UI view, it doesn’t really tell us that much!

However, using the excellent PerfView tool we can capture the full call-stack of any memory allocations, that is any calls to VirtualAlloc() or RtlAllocateHeap() (obviously these functions only apply when running the CoreCLR on Windows). If we do this, PerfView gives us the following data (yes, it’s not pretty, but it’s very powerful!!)

So lets explore this data in more detail.

Notable memory allocations

There are a few places where the CLR allocates significant chunks of memory up-front and then uses them through its lifetime, they are listed below:

Execution Engine Heaps

However another technique that it uses is to allocated ‘heaps’, often 64K at a time and then perform individual allocations within the heaps as needed. These heaps are split up into individual use-cases, the most common being for ‘frequently accessed’ data and it’s counter-part, data that is ‘rarely accessed’, see the explanation from this comment in loaderallocator.hpp for more. This is done to ensure that the CLR retains control over any memory allocations and can therefore prevent ‘fragmentation’.

These heaps are together known as ‘Loader Heaps’ as explained in Drill Into .NET Framework Internals to See How the CLR Creates Runtime Objects (wayback machine version):

LoaderHeaps LoaderHeaps are meant for loading various runtime CLR artifacts and optimization artifacts that live for the lifetime of the domain. These heaps grow by predictable chunks to minimize fragmentation. LoaderHeaps are different from the garbage collector (GC) Heap (or multiple heaps in case of a symmetric multiprocessor or SMP) in that the GC Heap hosts object instances while LoaderHeaps hold together the type system. Frequently accessed artifacts like MethodTables, MethodDescs, FieldDescs, and Interface Maps get allocated on a HighFrequencyHeap, while less frequently accessed data structures, such as EEClass and ClassLoader and its lookup tables, get allocated on a LowFrequencyHeap. The StubHeap hosts stubs that facilitate code access security (CAS), COM wrapper calls, and P/Invoke.

One of the main places you see this high/low-frequency of access is in the heart of the Type system, where different data items are either classified as ‘hot’ (high-frequency) or ‘cold’ (low-frequency), from the ‘Key Data Structures’ section of the BOTR page on ‘Type Loader Design’:

EEClass

MethodTable data are split into “hot” and “cold” structures to improve working set and cache utilization. MethodTable itself is meant to only store “hot” data that are needed in program steady state. EEClass stores “cold” data that are typically only needed by type loading, JITing or reflection. Each MethodTable points to one EEClass.

Further to this, listed below are some specific examples of when each heap type is used:

All the general ‘Loader Heaps’ listed above are allocated in the LoaderAllocator::Init(..) function (link to actual code), the executable and stub heap have the ‘executable’ flag set, all the rest don’t. The size of these heaps is configured in this code, they ‘reserve’ different amounts up front, but they all have a ‘commit’ size that is equivalent to one OS ‘page’.

In addition to the ‘general’ heaps, there are some others that are specifically used by the Virtual Stub Dispatch mechanism, they are known as the indcell_heap, cache_entry_heap, lookup_heap, dispatch_heap and resolve_heap, they’re allocated in this code, using the specified commit/reserve sizes.

Finally, if you’re interested in the mechanics of how the heaps actually work take a look at LoaderHeap.cpp.

JIT Memory Usage

Last, but by no means least, there is one other component in the CLR that extensively allocates memory and that is the JIT. It does so in 2 main scenarios:

  1. ‘Transient’ or temporary memory needed when it’s doing the job of converting IL code into machine code
  2. ‘Permanent’ memory used when it needs to emit the ‘machine code’ for a method

‘Transient’ Memory

This is needed by the JIT when it is doing the job of converting IL code into machine code for the current CPU architecture. This memory is only needed whilst the JIT is running and can be re-used/discarded later, it is used to hold the internal JIT data structures (e.g. Compiler, BasicBlock, GenTreeStmt, etc).

For example, take a look at the following code from Compiler::fgValueNumber():

...
 // Allocate the value number store.
assert(fgVNPassesCompleted > 0 || vnStore == nullptr);
if (fgVNPassesCompleted == 0)
{
    CompAllocator* allocator = new (this, CMK_ValueNumber) CompAllocator(this, CMK_ValueNumber);
    vnStore                  = new (this, CMK_ValueNumber) ValueNumStore(this, allocator);
}
...

The line vnStore = new (this, CMK_ValueNumber) ... ends up calling the specialised new operator defined in compiler.hpp (code shown below), which as per the comment, uses a customer ‘Arena Allocator’ that is implemented in /src/jit/alloc.cpp

/*****************************************************************************
 *  operator new
 *
 *  Note that compGetMem is an arena allocator that returns memory that is
 *  not zero-initialized and can contain data from a prior allocation lifetime.
 *  it also requires that 'sz' be aligned to a multiple of sizeof(int)
 */

inline void* __cdecl operator new(size_t sz, Compiler* context, CompMemKind cmk)
{
    sz = AlignUp(sz, sizeof(int));
    assert(sz != 0 && (sz & (sizeof(int) - 1)) == 0);
    return context->compGetMem(sz, cmk);
}

This technique (of overriding the new operator) is used in lots of places throughout the CLR, for instance there is a generic one implemented in the CLR Host.

‘Permanent’ Memory

The last type of memory that the JIT uses is ‘permanent’ memory to store the JITted machine code, this is done via calls to Compiler::compGetMem(..), starting from Compiler::compCompile(..) via the call-stack shown below. Note that as before this uses a customer ‘Arena Allocator’ that is implemented in /src/jit/alloc.cpp

+ clrjit!ClrAllocInProcessHeap
 + clrjit!ArenaAllocator::allocateHostMemory
  + clrjit!ArenaAllocator::allocateNewPage
   + clrjit!ArenaAllocator::allocateMemory
    + clrjit!Compiler::compGetMem
     + clrjit!emitter::emitGetMem
      + clrjit!emitter::emitAllocInstr
       + clrjit!emitter::emitNewInstrTiny
        + clrjit!emitter::emitIns_R_R
         + clrjit!emitter::emitInsBinary
          + clrjit!CodeGen::genCodeForStoreLclVar
           + clrjit!CodeGen::genCodeForTreeNode
            + clrjit!CodeGen::genCodeForBBlist
             + clrjit!CodeGen::genGenerateCode
              + clrjit!Compiler::compCompile

Real-world example

Finally, to prove that this investigation matches with more real-world scenarios, we can see similar memory usage breakdowns in this GitHub issue: [Question] Reduce memory consumption of CoreCLR

Yes, we have profiled several Xamarin GUI applications on Tizen Mobile.

Typical profile of CoreCLR’s memory on the GUI applications is the following:

  1. Mapped assembly images - 4.2 megabytes (50%)
  2. JIT-compiler’s memory - 1.7 megabytes (20%)
  3. Execution engine - about 1 megabyte (11%)
  4. Code heap - about 1 megabyte (11%)
  5. Type information - about 0.5 megabyte (6%)
  6. Objects heap - about 0.2 megabyte (2%)

Discuss this post on HackerNews

Further Reading

See the links below for additional information on ‘Loader Heaps’

The post Memory Usage Inside the CLR first appeared on my blog Performance is a Feature!


Mon, 10 Jul 2017, 12:00 am


How the .NET Runtime loads a Type

It is something we take for granted every time we run a .NET program, but it turns out that loading a Type or class is a fairly complex process.

So how does the .NET Runtime (CLR) actually load a Type?

If you want the tl;dr it’s done carefully, cautiously and step-by-step

Ensuring Type Safety

One of the key requirements of a ‘Managed Runtime’ is providing Type Safety, but what does it actually mean? From the MSDN page on Type Safety and Security

Type-safe code accesses only the memory locations it is authorized to access. (For this discussion, type safety specifically refers to memory type safety and should not be confused with type safety in a broader respect.) For example, type-safe code cannot read values from another object’s private fields. It accesses types only in well-defined, allowable ways.

So in effect, the CLR has to ensure your Types/Classes are well-behaved and following the rules.

Compiler prevents you from creating an ‘abstract’ class

But lets look at a more concrete example, using the C# code below

public abstract class AbstractClass
{
    public AbstractClass() {  }
}

public class NormalClass : AbstractClass 
{  
    public NormalClass() {  }
}

public static void Main(string[] args)
{
    var test = new AbstractClass();
}

The compiler quite rightly refuses to compile this and gives the following error, because abstract classes can’t be created, you can only inherit from them.

error CS0144: Cannot create an instance of the abstract class or interface 
        'ConsoleApplication.AbstractClass'

So that’s all well and good, but the CLR can’t rely on all code being created via a well-behaved compiler, or in fact via a compiler at all. So it has to check for and prevent any attempt to create an abstract class.

Writing IL code by hand

One way to circumvent the compiler is to write IL code by hand using the IL Assembler tool (ILAsm) which will do almost no checks on the validity of the IL you give it.

For instance the IL below is the equivalent of writing var test = new AbstractClass(); (if the C# compiler would let us):

.method public hidebysig static void Main(string[] args) cil managed
{
    .entrypoint
    .maxstack 1
    .locals init (
        [0] class ConsoleApplication.NormalClass class2)
	
    // System.InvalidOperationException: Instances of abstract classes cannot be created.
    newobj instance void ConsoleApplication.AbstractClass::.ctor()
		
    stloc.0 
    ldloc.0 
    callvirt instance class [mscorlib]System.Type [mscorlib]System.Object::GetType()
    callvirt instance string [mscorlib]System.Reflection.MemberInfo::get_Name()
    call void [mscorlib]Internal.Console::WriteLine(string)
    ret 
}

Fortunately the CLR has got this covered and will throw an InvalidOperationException when you execute the code. This is due to this check which is hit when the JIT compiles the newobj IL instruction.

Creating Types at run-time

One other way that you can attempt to create an abstract class is at run-time, using reflection (thanks to this blog post for giving me some tips on other ways of creating Types).

This is shown in the code below:

var abstractType = Type.GetType("ConsoleApplication.AbstractClass");
Console.WriteLine(abstractType.FullName);

// System.MissingMethodException: Cannot create an abstract class.
var abstractInstance = Activator.CreateInstance(abstractType);

The compiler is completely happy with this, it doesn’t do anything to prevent or warn you and nor should it. However when you run the code, it will throw an exception, strangely enough a MissingMethodException this time, but it does the job!

The call stack is below:

One final way (unless I’ve missed some out?) is to use GetUninitializedObject(..) in the FormatterServices class like so:

public static object CreateInstance(Type type)
{
    var constructor = type.GetConstructor(new Type[0]);
    if (constructor == null && !type.IsValueType)
    {
        throw new NotSupportedException(
            "Type '" + type.FullName + "' doesn't have a parameterless constructor");
    }

    var emptyInstance = FormatterServices.GetUninitializedObject(type);
	
    if (constructor == null)
        return null;
		
    return constructor.Invoke(emptyInstance, new object[0]) ?? emptyInstance;
}

var abstractType = Type.GetType("ConsoleApplication.AbstractClass");
Console.WriteLine(abstractType.FullName);

// System.MemberAccessException: Cannot create an abstract class.
var abstractInstance = CreateInstance(abstractType);

Again the run-time stops you from doing this, however this time it decides to throw a MemberAccessException?

This happens via the following call stack:

Further Type-Safety Checks

These checks are just one example of what the runtime has to validate when creating types, there are many more things is has to deal with. For instance you can’t:

Loading Types ‘step-by-step’

So we’ve seen that the CLR has to do multiple checks when it’s loading types, but why does it have to load them ‘step-by-step’?

Well in a nutshell, it’s because of circular references and recursion, particularly when dealing with generics types. If we take the code below from section ‘2.1 Load Levels’ in Type Loader Design (BotR):

classA : C
{ }

classB : C
{ }

classC
{ }

These are valid types and class A depends on class B and vice versa. So we can’t load A until we know that B is valid, but we can’t load B, until we’re sure that A is valid, a classic deadlock!!

How does the run-time get round this, well from the same BotR page:

The loader initially creates the structure(s) representing the type and initializes them with data that can be obtained without loading other types. When this “no-dependencies” work is done, the structure(s) can be referred from other places, usually by sticking pointers to them into another structures. After that the loader progresses in incremental steps and fills the structure(s) with more and more information until it finally arrives at a fully loaded type. In the above example, the base types of A and B will be approximated by something that does not include the other type, and substituted by the real thing later.

(there is also some more info here)

So it loads types in stages, step-by-step, ensuring each dependant type has reached the same stage before continuing. These ‘Class Load’ stages are shown in the image below and explained in detail in this very helpful source-code comment (Yay for Open-Sourcing the CoreCLR!!)

The different levels are handled in the ClassLoader::DoIncrementalLoad(..) method, which contains the switch statement that deals with them all in turn.

However this is part of a bigger process, which controls loading an entire file, also known as a Module or Assembly in .NET terminology. The entire process for that is handled in by another dispatch loop (switch statement), that works with the FileLoadLevel enum (definition). So in reality the whole process for loading an Assembly looks like this (the loading of one or more Types happens as sub-steps once the Module had reached the FILE_LOADED stage)

  1. FILE_LOAD_CREATE - DomainFile ctor()
  2. FILE_LOAD_BEGIN - Begin()
  3. FILE_LOAD_FIND_NATIVE_IMAGE - FindNativeImage()
  4. FILE_LOAD_VERIFY_NATIVE_IMAGE_DEPENDENCIES - VerifyNativeImageDependencies()
  5. FILE_LOAD_ALLOCATE - Allocate()
  6. FILE_LOAD_ADD_DEPENDENCIES - AddDependencies()
  7. FILE_LOAD_PRE_LOADLIBRARY - PreLoadLibrary()
  8. FILE_LOAD_LOADLIBRARY - LoadLibrary()
  9. FILE_LOAD_POST_LOADLIBRARY - PostLoadLibrary()
  10. FILE_LOAD_EAGER_FIXUPS - EagerFixups()
  11. FILE_LOAD_VTABLE_FIXUPS - VtableFixups()
  12. FILE_LOAD_DELIVER_EVENTS - DeliverSyncEvents()
  13. FILE_LOADED - FinishLoad()
    1. CLASS_LOAD_BEGIN
    2. CLASS_LOAD_UNRESTOREDTYPEKEY
    3. CLASS_LOAD_UNRESTORED
    4. CLASS_LOAD_APPROXPARENTS
    5. CLASS_LOAD_EXACTPARENTS
    6. CLASS_DEPENDENCIES_LOADED
    7. CLASS_LOADED
  14. FILE_LOAD_VERIFY_EXECUTION - VerifyExecution()
  15. FILE_ACTIVE - Activate()

We can see this in action if we build a Debug version of the CoreCLR and enable the relevant configuration knobs. For a simple ‘Hello World’ program we get the log output shown below, where LOADER: messages correspond to FILE_LOAD_XXX stages and PHASEDLOAD: messages indicate which CLASS_LOAD_XXX step we are on.

You can also see some of the other events that happen at the same time, these include creation of static variables (STATICS:), thread-statics (THREAD STATICS:) and PreStubWorker which indicates methods being prepared for the JITter.

-------------------------------------------------------------------------------------------------------
This is NOT the full output, it's only the parts that reference 'Program.exe' and it's modules/classses
-------------------------------------------------------------------------------------------------------

PEImage: Opened HMODULE C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe
StoreFile: Add cached entry (000007FE65174540) with PEFile 000000000040D6E0
Assembly C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe: bits=0x2
LOADER: 439e30:***Program*	>>>Load initiated, LOADED/LOADED
LOADER: 0000000000439E30:***Program*	   loading at level BEGIN
LOADER: 0000000000439E30:***Program*	   loading at level FIND_NATIVE_IMAGE
LOADER: 0000000000439E30:***Program*	   loading at level VERIFY_NATIVE_IMAGE_DEPENDENCIES
LOADER: 0000000000439E30:***Program*	   loading at level ALLOCATE
STATICS: Allocating statics for module Program
Loaded pModule: "C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe".
Module Program: bits=0x2
STATICS: Allocating 72 bytes for precomputed statics in module C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe in LoaderAllocator 000000000043AA18
StoreFile (StoreAssembly): Add cached entry (000007FE65174F28) with PEFile 000000000040D6E0Completed Load Level ALLOCATE for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 0000000000439E30:***Program*	   loading at level ADD_DEPENDENCIES
Completed Load Level ADD_DEPENDENCIES for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 0000000000439E30:***Program*	   loading at level PRE_LOADLIBRARY
LOADER: 0000000000439E30:***Program*	   loading at level LOADLIBRARY
LOADER: 0000000000439E30:***Program*	   loading at level POST_LOADLIBRARY
LOADER: 0000000000439E30:***Program*	   loading at level EAGER_FIXUPS
LOADER: 0000000000439E30:***Program*	   loading at level VTABLE FIXUPS
LOADER: 0000000000439E30:***Program*	   loading at level DELIVER_EVENTS
DRCT::IsReady - wait(0x100)=258, GetLastError() = 42424
DRCT::IsReady - wait(0x100)=258, GetLastError() = 42424
D::LA: Load Assembly Asy:0x000000000040D8C0 AD:0x0000000000439E30 which:C:\coreclr\bin\Product\Windows_NT.x64.Debug\Program.exe
Completed Load Level DELIVER_EVENTS for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 0000000000439E30:***Program*	   loading at level LOADED
Completed Load Level LOADED for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 439e30:***Program*	Load initiated, ACTIVE/ACTIVE
LOADER: 0000000000439E30:***Program*	   loading at level VERIFY_EXECUTION
LOADER: 0000000000439E30:***Program*	   loading at level ACTIVE
Completed Load Level ACTIVE for DomainFile 000000000040D8C0 in AD 1 - success = 1
LOADER: 439e30:***Program*

Thu, 15 Jun 2017, 12:00 am


Lowering in the C# Compiler (and what happens when you misuse it)

Turns out that what I’d always thought of as “Compiler magic” or “Syntactic sugar” is actually known by the technical term ‘Lowering’ and the C# compiler (a.k.a Roslyn) uses it extensively.

But what is it? Well this quote from So You Want To Write Your Own Language? gives us some idea:

Lowering One semantic technique that is obvious in hindsight (but took Andrei Alexandrescu to point out to me) is called “lowering.” It consists of, internally, rewriting more complex semantic constructs in terms of simpler ones. For example, while loops and foreach loops can be rewritten in terms of for loops. Then, the rest of the code only has to deal with for loops. This turned out to uncover a couple of latent bugs in how while loops were implemented in D, and so was a nice win. It’s also used to rewrite scope guard statements in terms of try-finally statements, etc. Every case where this can be found in the semantic processing will be win for the implementation.

– by Walter Bright (author of the D programming language)

But if you’re still not sure what it means, have a read of Eric Lippert’s post on the subject, Lowering in language design, which contains this quote:

A common technique along the way though is to have the compiler “lower” from high-level language features to low-level language features in the same language.

As an aside, if you like reading about the Roslyn compiler source you may like these other posts that I’ve written:

What does ‘Lowering’ look like?

The C# compiler has used lowering for a while, one of the oldest or most recognised examples is when this code:

using System.Collections.Generic;
public class C {
    public IEnumerable M() 
    {
        foreach (var value in new [] { 1, 2, 3, 4, 5 })
        {
            yield return value;
        }
    }
}

is turned into this

public class C
{
    [CompilerGenerated]
    private sealed class d__0 : IEnumerable, IEnumerable, IEnumerator, IDisposable, IEnumerator
    {
        private int 1__state;
        private int 2__current;
        private int l__initialThreadId;
        public C 4__this;
        private int[] s__1;
        private int s__2;
        private int 5__3;
        int IEnumerator.Current
        {
            [DebuggerHidden]
            get
            {
                return this.2__current;
            }
        }
        object IEnumerator.Current
        {
            [DebuggerHidden]
            get
            {
                return this.2__current;
            }
        }
        [DebuggerHidden]
        public d__0(int 1__state)
        {
            this.1__state = 1__state;
            this.l__initialThreadId = Environment.CurrentManagedThreadId;
        }
        [DebuggerHidden]
        void IDisposable.Dispose()
        {
        }
        bool IEnumerator.MoveNext()
        {
            int num = this.1__state;
            if (num != 0)
            {
                if (num != 1)
                {
                    return false;
                }
                this.1__state = -1;
                this.s__2++;
            }
            else
            {
                this.1__state = -1;
                this.s__1 = new int[] { 1, 2, 3, 4, 5 };
                this.s__2 = 0;
            }
            if (this.s__2 >= this.s__1.Length)
            {
                this.s__1 = null;
                return false;
            }
            this.5__3 = this.s__1[this.s__2];
            this.2__current = this.5__3;
            this.1__state = 1;
            return true;
        }
        [DebuggerHidden]
        void IEnumerator.Reset()
        {
            throw new NotSupportedException();
        }
        [DebuggerHidden]
        IEnumerator IEnumerable.GetEnumerator()
        {
            C.d__0 d__;
            if (this.1__state == -2 && this.l__initialThreadId == Environment.CurrentManagedThreadId)
            {
                this.1__state = 0;
                d__ = this;
            }
            else
            {
                d__ = new C.d__0(0);
                d__.4__this = this.4__this;
            }
            return d__;
        }
        [DebuggerHidden]
        IEnumerator IEnumerable.GetEnumerator()
        {
            return this.System.Collections.Generic.IEnumerable.GetEnumerator();
        }
    }
    [IteratorStateMachine(typeof(C.d__0))]
    public IEnumerable M()
    {
        C.d__0 expr_07 = new C.d__0(-2);
        expr_07.4__this = this;
        return expr_07;
    }
}

Yikes, I’m glad we don’t have to write that code ourselves!! There’s an entire state-machine in there, built to allow our original code to be halted/resumed each time round the loop (at the ‘yield’ statement).

The C# compiler and ‘Lowering’

But it turns out that the Roslyn compiler does a lot more ‘lowering’ than you might think. If you take a look at the code under ‘/src/Compilers/CSharp/Portable/Lowering’ (VB.NET equivalent here), you see the following folders:

Which correspond to some C# language features you might be familar with, such as ‘lambdas’, i.e. x => x.Name > 5, ‘iterators’ used by yield (above) and the async keyword.

However if we look at bit deeper, under the ‘LocalRewriter’ folder we can see lots more scenarios that we might never have considered ‘lowering’, such as:

So a big thank-you is due to all the past and present C# language developers and designers, they did all this work for us. Imagine that C# didn’t have all these high-level features, we’d be stuck writing them by hand.

It would be like writing Java :-)

What happens when you misuse it

But of course the real fun part is ‘misusing’ or outright ‘abusing’ the compiler. So I set up a little twitter competition just how much ‘lowering’ could we get the compiler to do for us (i.e the highest ratio of ‘input’ lines of code to ‘output’ lines).

It had the following rules (see this gist for more info):

  1. You can have as many lines as you want within method M()
  2. No single line can be longer than 100 chars
  3. To get your score, divide the ‘# of expanded lines’ by the ‘# of original line(s)’
    1. Based on the default output formatting of https://sharplab.io, no re-formatting allowed!!
    2. But you can format the intput however you want, i.e. make use of the full 100 chars
  4. Must compile with no warnings on https://sharplab.io (allows C# 7 features)
    1. But doesn’t have to do anything sensible when run
  5. You cannot modify the code that is already there, i.e. public class C {} and public void M()
    1. Cannot just add async to public void M(), that’s too easy!!
  6. You can add new using ... declarations, these do not count towards the line count

For instance with the following code (interactive version available on sharplab.io):

using System;
public class C {
    public void M() {
        Func test = () => "blah"?.ToString();
    }
}

This counts as 1 line of original code (only code inside method M() is counted)

This expands to 23 lines (again only lines of code inside the braces ({, }) of class C are counted.

Giving a total score of 23 (23 / 1)

....
public class C
{
    [CompilerGenerated]
    [Serializable]
    private sealed class c
    {
        public static readonly C.c 9;
        public static Func 9__0_0;
        static c()
        {
            // Note: this type is marked as 'beforefieldinit'.
            C.c.9 = new C.c();
        }
        internal string b__0_0()
        {
            return "blah";
        }
    }
    public void M()
    {
        if (C.c.9__0_0 == null)
        {
            C.c.9__0_0 = new Func(C.c.9.b__0_0);
        }
    }
}

Results

The first place entry was the following entry from Schabse Laks, which contains 9 lines-of-code inside the M() method:

using System.Linq;
using Y = System.Collections.Generic.IEnumerable;

public class C {
    public void M() {
((Y)null).Select(async x => await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await await
await await await await await await await await await await await await await await await x.x()());
    }
}

this expands to an impressive 7964 lines of code (yep you read that right!!) for a score of 885 (7964 / 9). The main trick he figured out was that adding more lines to the input increased the score, i.e is scales superlinearly. Although it you take things too far the compiler bails out with a pretty impressive error message:

error CS8078: An expression is too long or complex to compile

Here’s the Top 6 top results:

Submitter Entry Score Schabse Laks link 885 (7964 / 9) Andrey Dyatlov link 778 (778 / 1) alrz link 755 (755 / 1) Andy Gocke * link 633 (633 / 1) Jared Parsons * link 461 (461 / 1) Jonathan Chambers link 384 (384 / 1)

* = member of the Roslyn compiler team (they’re not disqualified, but maybe they should have some kind of handicap applied to ‘even out’ the playing field?)

Honourable mentions

However there were some other entries that whilst they didn’t make it into the Top 6, are still worth a mention due to the ingenuity involved:

Discuss this post on HackerNews, /r/programming or /r/csharp (whichever takes your fancy!!)

The post Lowering in the C# Compiler (and what happens when you misuse it) first appeared on my blog Performance is a Feature!


Thu, 25 May 2017, 12:00 am


Adding a new Bytecode Instruction to the CLR

Now that the CoreCLR is open-source we can do fun things, for instance find out if it’s possible to add new IL (Intermediate Language) instruction to the runtime.

TL;DR it turns out that it’s easier than you might think!! Here are the steps you need to go through:

Update: turns out that I wasn’t the only person to have this idea, see Beachhead implements new opcode on CLR JIT for another implementation by Kouji Matsui.

Step 0

But first a bit of background information. Adding a new IL instruction to the CLR is a pretty rare event, that last time is was done for real was in .NET 2.0 when support for generics was added. This is part of the reason why .NET code had good backwards-compatibility, from Backward compatibility and the .NET Framework 4.5:

The .NET Framework 4.5 and its point releases (4.5.1, 4.5.2, 4.6, 4.6.1, 4.6.2, and 4.7) are backward-compatible with apps that were built with earlier versions of the .NET Framework. In other words, apps and components built with previous versions will work without modification on the .NET Framework 4.5.

Side note: The .NET framework did break backwards compatibility when moving from 1.0 to 2.0, precisely so that support for generics could be added deep into the runtime, i.e. with support in the IL. Java took a different decision, I guess because it had been around longer, breaking backwards-comparability was a bigger issue. See the excellent blog post Comparing Java and C# Generics for more info.

Step 1

For this exercise I plan to add a new IL instruction (op-code) to the CoreCLR runtime and because I’m a raving narcissist (not really, see below) I’m going to name it after myself. So let me introduce the matt IL instruction, that you can use like so:

.method private hidebysig static int32 TestMattOpCodeMethod(int32 x, int32 y) 
        cil managed noinlining
{
    .maxstack 2
    ldarg.0
    ldarg.1
    matt  // yay, my name as an IL op-code!!!!
    ret
}

But because I’m actually a bit-British (i.e. I don’t like to ‘blow my own trumpet’), I’m going to make the matt op-code almost completely pointless, it’s going to do exactly the same thing as calling Math.Max(x, y), i.e. just return the largest of the 2 numbers.

The other reason for naming it matt is that I’d really like someone to make a version of the C# (Roslyn) compiler that allows you to write code like this:

Console.WriteLine("{0} m@ {1} = {2}", 1, 7, 1 m@ 7)); // prints '1 m@ 7 = 7'

I definitely want the m@ operator to be a thing (pronounced ‘matt’, not ‘m-at’), maybe the other ‘Matt Warren’ who works at Microsoft on the C# Language Design Team can help out!! Seriously though, if anyone reading this would like to write a similar blog post, showing how you’d add the m@ operator to the Roslyn compiler, please let me know I’d love to read it.

Update: Thanks to Marcin Juraszek (@mmjuraszek) you can now use the m@ in a C# program, see Adding Matt operator to Roslyn - Syntax, Lexer and Parser, Adding Matt operator to Roslyn - Binder and Adding Matt operator to Roslyn - Emitter for the full details.

Now we’ve defined the op-code, the first step is to ensure that the run-time and tooling can recognise it. In particular we need the IL Assembler (a.k.a ilasm) to be able to take the IL code above (TestMattOpCodeMethod(..)) and produce a .NET executable.

As the .NET runtime source code is nicely structured (+1 to the runtime devs), to make this possible we only need to makes changes in opcode.def:

--- a/src/inc/opcode.def
+++ b/src/inc/opcode.def
@@ -154,7 +154,7 @@ OPDEF(CEE_NEWOBJ,                     "newobj",           VarPop,             Pu
 OPDEF(CEE_CASTCLASS,                  "castclass",        PopRef,             PushRef,     InlineType,         IObjModel,   1,  0xFF,    0x74,    NEXT)
 OPDEF(CEE_ISINST,                     "isinst",           PopRef,             PushI,       InlineType,         IObjModel,   1,  0xFF,    0x75,    NEXT)
 OPDEF(CEE_CONV_R_UN,                  "conv.r.un",        Pop1,               PushR8,      InlineNone,         IPrimitive,  1,  0xFF,    0x76,    NEXT)
-OPDEF(CEE_UNUSED58,                   "unused",           Pop0,               Push0,       InlineNone,         IPrimitive,  1,  0xFF,    0x77,    NEXT)
+OPDEF(CEE_MATT,                       "matt",             Pop1+Pop1,          Push1,       InlineNone,         IPrimitive,  1,  0xFF,    0x77,    NEXT)
 OPDEF(CEE_UNUSED1,                    "unused",           Pop0,               Push0,       InlineNone,         IPrimitive,  1,  0xFF,    0x78,    NEXT)
 OPDEF(CEE_UNBOX,                      "unbox",            PopRef,             PushI,       InlineType,         IPrimitive,  1,  0xFF,    0x79,    NEXT)
 OPDEF(CEE_THROW,                      "throw",            PopRef,             Push0,       InlineNone,         IObjModel,   1,  0xFF,    0x7A,    THROW)

I just picked the first available unused slot and added matt in there. It’s defined as Pop1+Pop1 because it takes 2 values from the stack as input and Push1 because after is has executed, a single result is pushed back onto the stack.

Note: all the changes I made are available in one-place on GitHub if you’d rather look at them like that.

Once this change was done ilasm will successfully assembly the test code file HelloWorld.il that contains TestMattOpCodeMethod(..) as shown above:

λ ilasm /EXE /OUTPUT=HelloWorld.exe -NOLOGO HelloWorld.il

Assembling 'HelloWorld.il'  to EXE --> 'HelloWorld.exe'
Source file is ANSI

Assembled method HelloWorld::Main
Assembled method HelloWorld::TestMattOpCodeMethod

Creating PE file

Emitting classes:
Class 1:        HelloWorld

Emitting fields and methods:
Global
Class 1 Methods: 2;
Resolving local member refs: 1 -> 1 defs, 0 refs, 0 unresolved

Emitting events and properties:
Global
Class 1
Resolving local member refs: 0 -> 0 defs, 0 refs, 0 unresolved
Writing PE file
Operation completed successfully

Step 2

However at this point the matt op-code isn’t actually executed, at runtime the CoreCLR just throws an exception because it doesn’t know what to do with it. As a first (simpler) step, I just wanted to make the .NET Interpreter work, so I made the following changes to wire it up:

--- a/src/vm/interpreter.cpp
+++ b/src/vm/interpreter.cpp
@@ -2726,6 +2726,9 @@ void Interpreter::ExecuteMethod(ARG_SLOT* retVal, __out bool* pDoJmpCall, __out
         case CEE_REM_UN:
             BinaryIntOp();
             break;
+        case CEE_MATT:
+            BinaryArithOp();
+            break;
         case CEE_AND:
             BinaryIntOp();
             break;

--- a/src/vm/interpreter.hpp
+++ b/src/vm/interpreter.hpp
@@ -298,10 +298,14 @@ void Interpreter::BinaryArithOpWork(T val1, T val2)
         {
             res = val1 / val2;
         }
-        else 
+        else if (op == BA_Rem)
         {
             res = RemFunc(val1, val2);
         }
+        else if (op == BA_Matt)
+        {
+            res = MattFunc(val1, val2);
+        }
     }

and then I added the methods that would actually implement the interpreted code:

--- a/src/vm/interpreter.cpp
+++ b/src/vm/interpreter.cpp
@@ -10801,6 +10804,26 @@ double Interpreter::RemFunc(double v1, double v2)
     return fmod(v1, v2);
 }
 
+INT32 Interpreter::MattFunc(INT32 v1, INT32 v2)
+{
+	return v1 > v2 ? v1 : v2;
+}
+
+INT64 Interpreter::MattFunc(INT64 v1, INT64 v2)
+{
+	return v1 > v2 ? v1 : v2;
+}
+
+float Interpreter::MattFunc(float v1, float v2)
+{
+	return v1 > v2 ? v1 : v2;
+}
+
+double Interpreter::MattFunc(double v1, double v2)
+{
+	return v1 > v2 ? v1 : v2;
+}

So fairly straight-forward and the bonus is that at this point the matt operator is fully operational, you can actually write IL using it and it will run (interpreted only).

Step 3

However not everyone wants to re-compile the CoreCLR just to enable the Interpreter, so I want to also make it work for real via the Just-in-Time (JIT) compiler.

The full changes to make this work were spread across multiple files, but were mostly housekeeping so I won’t include them all here, check-out the full diff if you’re interested. But the significant parts are below:

--- a/src/jit/importer.cpp
+++ b/src/jit/importer.cpp
@@ -11112,6 +11112,10 @@ void Compiler::impImportBlockCode(BasicBlock* block)
                 oper = GT_UMOD;
                 goto MATH_MAYBE_CALL_NO_OVF;
 
+            case CEE_MATT:
+                oper = GT_MATT;
+                goto MATH_MAYBE_CALL_NO_OVF;
+
             MATH_MAYBE_CALL_NO_OVF:
                 ovfl = false;
             MATH_MAYBE_CALL_OVF:

--- a/src/vm/jithelpers.cpp
+++ b/src/vm/jithelpers.cpp
@@ -341,6 +341,14 @@ HCIMPL2(UINT32, JIT_UMod, UINT32 dividend, UINT32 divisor)
 HCIMPLEND
 
 /*********************************************************************/
+HCIMPL2(INT32, JIT_Matt, INT32 x, INT32 y)
+{
+    FCALL_CONTRACT;
+    return x > y ? x : y;
+}
+HCIMPLEND
+
+/*********************************************************************/
 HCIMPL2_VV(INT64, JIT_LDiv, INT64 dividend, INT64 divisor)
 {
     FCALL_CONTRACT;

In summary, these changes mean that during the JIT’s ‘Morph phase’ the IL containing the matt op code is converted from:

fgMorphTree BB01, stmt 1 (before)
       [000004] ------------             ▌  return    int   
       [000002] ------------             │  ┌──▌  lclVar    int    V01 arg1        
       [000003] ------------             └──▌  m@        int   
       [000001] ------------                └──▌  lclVar    int    V00 arg0               

into this:

fgMorphTree BB01, stmt 1 (after)
       [000004] --C--+------             ▌  return    int   
       [000003] --C--+------             └──▌  call help int    HELPER.CORINFO_HELP_MATT
       [000001] -----+------ arg0 in rcx    ├──▌  lclVar    int    V00 arg0         
       [000002] -----+------ arg1 in rdx    └──▌  lclVar    int    V01 arg1                 

Note the call to HELPER.CORINFO_HELP_MATT

When this is finally compiled into assembly code it ends up looking like so:

// Assembly listing for method HelloWorld:TestMattOpCodeMethod(int,int):int             
// Emitting BLENDED_CODE for X64 CPU with AVX                                           
// optimized code                                                                       
// rsp based frame                                                                      
// partially interruptible                                                              
// Final local variable assignments                                                     
//                                                                                      
//  V00 arg0         [V00,T00] (  3,  3   )     int  ->  rcx                            
//  V01 arg1         [V01,T01] (  3,  3   )     int  ->  rdx                            
//  V02 OutArgs      [V02    ] (  1,  1   )  lclBlk (32) [rsp+0x00]                     
//                                                                                      
// Lcl frame size = 40                                    
                                                                                       
G_M9261_IG01:                                                                          
       4883EC28             sub      rsp, 40                                           
                                                                                       
G_M9261_IG02:                                                                          
       E8976FEB5E           call     CORINFO_HELP_MATT                                 
       90                   nop                                                        
                                                                                       
G_M9261_IG03:                                                                          
       4883C428             add      rsp, 40                                           
       C3                   ret                                                        

I’m not entirely sure why there is a nop instruction in there? But it works, which is the main thing!!

Step 4

In the CLR you can also dynamically emit code at runtime using the methods that sit under the ‘System.Reflection.Emit’ namespace, so the last task is to add the OpCodes.Matt field and have it emit the correct values for the matt op-code.

--- a/src/mscorlib/src/System/Reflection/Emit/OpCodes.cs
+++ b/src/mscorlib/src/System/Reflection/Emit/OpCodes.cs
@@ -139,6 +139,7 @@ internal enum OpCodeValues
         Castclass = 0x74,
         Isinst = 0x75,
         Conv_R_Un = 0x76,
+        Matt = 0x77,
         Unbox = 0x79,
         Throw = 0x7a,
         Ldfld = 0x7b,
@@ -1450,6 +1451,16 @@ private OpCodes()
             (0

Fri, 19 May 2017, 12:00 am


Arrays and the CLR - a Very Special Relationship

A while ago I wrote about the ‘special relationship’ that exists between Strings and the CLR, well it turns out that Arrays and the CLR have an even deeper one, the type of closeness where you hold hands on your first meeting

As an aside, if you like reading about CLR internals you may find these other posts interesting:

Fundamental to the Common Language Runtime (CLR)

Arrays are such a fundamental part of the CLR that they are included in the ECMA specification, to make it clear that the runtime has to implement them:

In addition, there are several IL (Intermediate Language) instructions that specifically deal with arrays:

  • newarr
    • Create a new array with elements of type etype.
  • ldelem.ref
    • Load the element at index onto the top of the stack as an O. The type of the O is the same as the element type of the array pushed on the CIL stack.
  • stelem
    • Replace array element at index with the value on the stack (also stelem.i, stelem.i1, stelem.i2, stelem.r4 etc)
  • ldlen
    • Push the length (of type native unsigned int) of array on the stack.

This makes sense because arrays are the building blocks of so many other data types, you want them to be available, well defined and efficient in a modern high-level language like C#. Without arrays you can’t have lists, dictionaries, queues, stacks, trees, etc, they’re all built on-top of arrays which provided low-level access to contiguous pieces of memory in a type-safe way.

Memory and Type Safety

This memory and type-safety is important because without it .NET couldn’t be described as a ‘managed runtime’ and you’d be left having to deal with the types of issues you get when you are writing code in a more low-level language.

More specifically, the CLR provides the following protections when you are using arrays (from the section on Memory and Type Safety in the BOTR ‘Intro to the CLR’ page):

While a GC is necessary to ensure memory safety, it is not sufficient. The GC will not prevent the program from indexing off the end of an array or accessing a field off the end of an object (possible if you compute the field’s address using a base and offset computation). However, if we do prevent these cases, then we can indeed make it impossible for a programmer to create memory-unsafe programs.

While the common intermediate language (CIL) does have operators that can fetch and set arbitrary memory (and thus violate memory safety), it also has the following memory-safe operators and the CLR strongly encourages their use in most programming:

  1. Field-fetch operators (LDFLD, STFLD, LDFLDA) that fetch (read), set and take the address of a field by name.
  2. Array-fetch operators (LDELEM, STELEM, LDELEMA) that fetch, set and take the address of an array element by index. All arrays include a tag specifying their length. This facilitates an automatic bounds check before each access.

Also, from the section on Verifiable Code - Enforcing Memory and Type Safety in the same BOTR page

In practice, the number of run-time checks needed is actually very small. They include the following operations:

  1. Casting a pointer to a base type to be a pointer to a derived type (the opposite direction can be checked statically)
  2. Array bounds checks (just as we saw for memory safety)
  3. Assigning an element in an array of pointers to a new (pointer) value. This particular check is only required because CLR arrays have liberal casting rules (more on that later…)

However you don’t get this protection for free, there’s a cost to pay:

Note that the need to do these checks places requirements on the runtime. In particular:

  1. All memory in the GC heap must be tagged with its type (so the casting operator can be implemented). This type information must be available at runtime, and it must be rich enough to determine if casts are valid (e.g., the runtime needs to know the inheritance hierarchy). In fact, the first field in every object on the GC heap points to a runtime data structure that represents its type.
  2. All arrays must also have their size (for bounds checking).
  3. Arrays must have complete type information about their element type.

Implementation Details

It turns out that large parts of the internal implementation of arrays is best described as magic, this Stack Overflow comment from Marc Gravell sums it up nicely

Arrays are basically voodoo. Because they pre-date generics, yet must allow on-the-fly type-creation (even in .NET 1.0), they are implemented using tricks, hacks, and sleight of hand.

Yep that’s right, arrays were parametrised (i.e. generic) before generics even existed. That means you could create arrays such as int[] and string[], long before you were able to write List or List, which only became possible in .NET 2.0.

Special helper classes

All this magic or sleight of hand is made possible by 2 things:

  • The CLR breaking all the usual type-safety rules
  • A special array helper class called SZArrayHelper

But first the why, why were all these tricks needed? From .NET Arrays, IList, Generic Algorithms, and what about STL?:

When we were designing our generic collections classes, one of the things that bothered me was how to write a generic algorithm that would work on both arrays and collections. To drive generic programming, of course we must make arrays and generic collections as seamless as possible. It felt that there should be a simple solution to this problem that meant you shouldn’t have to write the same code twice, once taking an IList and again taking a T[]. The solution that dawned on me was that arrays needed to implement our generic IList. We made arrays in V1 implement the non-generic IList, which was rather simple due to the lack of strong typing with IList and our base class for all arrays (System.Array). What we needed was to do the same thing in a strongly typed way for IList.

But it was only done for the common case, i.e. ‘single dimensional’ arrays:

There were some restrictions here though – we didn’t want to support multidimensional arrays since IList only provides single dimensional accesses. Also, arrays with non-zero lower bounds are rather strange, and probably wouldn’t mesh well with IList, where most people may iterate from 0 to the return from the Count property on that IList. So, instead of making System.Array implement IList, we made T[] implement IList. Here, T[] means a single dimensional array with 0 as its lower bound (often called an SZArray internally, but I think Brad wanted to promote the term ‘vector’ publically at one point in time), and the element type is T. So Int32[] implements IList, and String[] implements IList.

Also, this comment from the array source code sheds some further light on the reasons:

//----------------------------------------------------------------------------------
// Calls to (IList)(array).Meth are actually implemented by SZArrayHelper.Meth
// This workaround exists for two reasons:
//
//    - For working set reasons, we don't want insert these methods in the array 
//      hierachy in the normal way.
//    - For platform and devtime reasons, we still want to use the C# compiler to 
//      generate the method bodies.
//
// (Though it's questionable whether any devtime was saved.)
//
// ....
//----------------------------------------------------------------------------------

So it was done for convenience and efficiently, as they didn’t want every instance of System.Array to carry around all the code for the IEnumerable and IList implementations.

This mapping takes places via a call to GetActualImplementationForArrayGenericIListOrIReadOnlyListMethod(..), which wins the prize for the best method name in the CoreCLR source!! It’s responsible for wiring up the corresponding method from the SZArrayHelper class, i.e. IList.Count -> SZArrayHelper.Count or if the method is part of the IEnumerator interface, the SZGenericArrayEnumerator is used.

But this has the potential to cause security holes, as it breaks the normal C# type system guarantees, specifically regarding the this pointer. To illustrate the problem, here’s the source code of the Count property, note the call to JitHelpers.UnsafeCast:

internal int get_Count()
{
    //! Warning: "this" is an array, not an SZArrayHelper. See comments above
    //! or you may introduce a security hole!
    T[] _this = JitHelpers.UnsafeCast(this);
    return _this.Length;
}

Yikes, it has to remap this to be able to call Length on the correct object!!

And just in case those comments aren’t enough, there is a very strongly worded comment at the top of the class that further spells out the risks!!

Generally all this magic is hidden from you, but occasionally it leaks out. For instance if you run the code below, SZArrayHelper will show up in the StackTrace and TargetSite of properties of the NotSupportedException:

try {
    int[] someInts = { 1, 2, 3, 4 };
    IList collection = someInts;
    // Throws NotSupportedException 'Collection is read-only'
    collection.Clear(); 		
} catch (NotSupportedException nsEx) {				
    Console.WriteLine("{0} - {1}", nsEx.TargetSite.DeclaringType, nsEx.TargetSite);
    Console.WriteLine(nsEx.StackTrace);
}

Removing Bounds Checks

The runtime also provides support for arrays in more conventional ways, the first of which is related to performance. Array bounds checks are all well and good when providing memory-safety, but they have a cost, so where possible the JIT removes any checks that it knows are redundant.

It does this by calculating the range of values that a for loop access and compares those to the actual length of the array. If it determines that there is never an attempt to access an item outside the permissible bounds of the array, the run-time checks are then removed.

For more information, the links below take you to the areas of the JIT source code that deal with this:

And if you are really keen, take a look at this gist that I put together to explore the scenarios where bounds checks are ‘removed’ and ‘not removed’.

Allocating an array

Another task that the runtime helps with is allocating arrays, using hand-written assembly code so the methods are as optimised as possible, see:

Run-time treats arrays differently

Finally, because arrays are so intertwined with the CLR, there are lots of places in which they are dealt with as a special-case. For instance a search for ‘IsArray()’ in the CoreCLR source returns over 60 hits, including:

So yes, it’s fair to say that arrays and the CLR have a Very Special Relationship

Further Reading

As always, here are some more links for your enjoyment!!

Array source code references

The post Arrays and the CLR - a Very Special Relationship first appeared on my blog Performance is a Feature!


Mon, 8 May 2017, 12:00 am


The CLR Thread Pool 'Thread Injection' Algorithm

If you’re near London at the end of April, I’ll be speaking at ProgSCon 2017 on Microsoft and Open-Source – A ‘Brave New World’. ProgSCon is 1-day conference, with talks covering an eclectic range of topics, you’ll learn lots!!

As part of a never-ending quest to explore the CoreCLR source code I stumbled across the intriguing titled ‘HillClimbing.cpp’ source file. This post explains what it does and why.

What is ‘Hill Climbing’

It turns out that ‘Hill Climbing’ is a general technique, from the Wikipedia page on the Hill Climbing Algorithm:

In computer science, hill climbing is a mathematical optimization technique which belongs to the family of local search. It is an iterative algorithm that starts with an arbitrary solution to a problem, then attempts to find a better solution by incrementally changing a single element of the solution. If the change produces a better solution, an incremental change is made to the new solution, repeating until no further improvements can be found.

But in the context of the CoreCLR, ‘Hill Climbing’ (HC) is used to control the rate at which threads are added to the Thread Pool, from the MSDN page on ‘Parallel Tasks’:

Thread Injection

The .NET thread pool automatically manages the number of worker threads in the pool. It adds and removes threads according to built-in heuristics. The .NET thread pool has two main mechanisms for injecting threads: a starvation-avoidance mechanism that adds worker threads if it sees no progress being made on queued items and a hill-climbing heuristic that tries to maximize throughput while using as few threads as possible. … A goal of the hill-climbing heuristic is to improve the utilization of cores when threads are blocked by I/O or other wait conditions that stall the processor …. The .NET thread pool has an opportunity to inject threads every time a work item completes or at 500 millisecond intervals, whichever is shorter. The thread pool uses this opportunity to try adding threads (or taking them away), guided by feedback from previous changes in the thread count. If adding threads seems to be helping throughput, the thread pool adds more; otherwise, it reduces the number of worker threads. This technique is called the hill-climbing heuristic.

For more specifics on what the algorithm is doing, you can read the research paper Optimizing Concurrency Levels in the .NET ThreadPool published by Microsoft, although it you want a brief outline of what it’s trying to achieve, this summary from the paper is helpful:

In addition the controller should have:

  1. short settling times so that cumulative throughput is maximized
  2. minimal oscillations since changing control settings incurs overheads that reduce throughput
  3. fast adaptation to changes in workloads and resource characteristics.

So reduce throughput, don’t add and then remove threads too fast, but still adapt quickly to changing work-loads, simple really!!

As an aside, after reading (and re-reading) the research paper I found it interesting that a considerable amount of it was dedicated to testing, as the following excerpt shows:

In fact the approach to testing was considered so important that they wrote an entire follow-up paper that discusses it, see Configuring Resource Managers Using Model Fuzzing.

Why is it needed?

Because, in short, just adding new threads doesn’t always increase throughput and ultimately having lots of threads has a cost. As this comment from Eric Eilebrecht, one of the authors of the research paper explains:

Throttling thread creation is not only about the cost of creating a thread; it’s mainly about the cost of having a large number of running threads on an ongoing basis. For example:

  • More threads means more context-switching, which adds CPU overhead. With a large number of threads, this can have a significant impact.
  • More threads means more active stacks, which impacts data locality. The more stacks a CPU is having to juggle in its various caches, the less effective those caches are.

The advantage of more threads than logical processors is, of course, that we can keep the CPU busy if some of the threads are blocked, and so get more work done. But we need to be careful not to “overreact” to blocking, and end up hurting performance by having too many threads.

Or in other words, from Concurrency - Throttling Concurrency in the CLR 4.0 ThreadPool

As opposed to what may be intuitive, concurrency control is about throttling and reducing the number of work items that can be run in parallel in order to improve the worker ThreadPool throughput (that is, controlling the degree of concurrency is preventing work from running).

So the algorithm was designed with all these criteria in mind and was then tested over a large range of scenarios, to ensure it actually worked! This is why it’s often said that you should just leave the .NET ThreadPool alone, not try and tinker with it. It’s been heavily tested to work across a multiple situations and it was designed to adapt over time, so it should have you covered! (although of course, there are times when it doesn’t work perfectly!!)

The Algorithm in Action

As the source in now available, we can actually play with the algorithm and try it out in a few scenarios to see what it does. It needs very few dependences and therefore all the relevant code is contained in the following files:

(For comparison, there’s an implementation of the same algorithm in the Mono source code)

I have a project up on my GitHub page that allows you to test the hill-climbing algorithm in a self-contained console app. If you’re interested you can see the changes/hacks I had to do to get it building, although in the end it was pretty simple! (Update Kudos to Christian Klutz who ported my self-contained app to C#, nice job!!)

The algorithm is controlled via the following HillClimbing_XXX settings:

Setting Default Value Notes HillClimbing_WavePeriod 4   HillClimbing_TargetSignalToNoiseRatio 300   HillClimbing_ErrorSmoothingFactor 1   HillClimbing_WaveMagnitudeMultiplier 100   HillClimbing_MaxWaveMagnitude 20   HillClimbing_WaveHistorySize 8   HillClimbing_Bias 15 The ‘cost’ of a thread. 0 means drive for increased throughput regardless of thread count; higher values bias more against higher thread counts HillClimbing_MaxChangePerSecond 4   HillClimbing_MaxChangePerSample 20   HillClimbing_MaxSampleErrorPercent 15   HillClimbing_SampleIntervalLow 10   HillClimbing_SampleIntervalHigh 200   HillClimbing_GainExponent 200 The exponent to apply to the gain, times 100. 100 means to use linear gain, higher values will enhance large moves and damp small ones

Because I was using the code in a self-contained console app, I just hard-coded the default values into the source, but in the CLR it appears that you can modify these values at runtime.

Working with the Hill Climbing code

There are several things I discovered when implementing a simple test app that works with the algorithm:

  1. The calculation is triggered by calling the function HillClimbingInstance.Update(currentThreadCount, sampleDuration, numCompletions, &threadAdjustmentInterval) and the return value is the new ‘maximum thread count’ that the algorithm is proposing.
  2. It calculates the desired number of threads based on the ‘current throughput’, which is the ‘# of tasks completed’ (numCompletions) during the current time-period (sampleDuration in seconds).
  3. It also takes the current thread count (currentThreadCount) into consideration.
  4. The core calculations (excluding error handling and house-keeping) are only just over 100 LOC, so it’s not too hard to follow.
  5. It works on the basis of ‘transitions’ (HillClimbingStateTransition), first Warmup, then Stabilizing and will only recommend a new value once it’s moved into the ClimbingMove state.
  6. The real .NET Thread Pool only increases the thread-count by one thread every 500 milliseconds. It keeps doing this until the ‘# of threads’ has reached the amount that the hill-climbing algorithm suggests. See ThreadpoolMgr::ShouldAdjustMaxWorkersActive() and ThreadpoolMgr::AdjustMaxWorkersActive() for the code that handles this.
  7. If it hasn’t got enough samples to do a ‘statistically significant’ calculation this algorithm will indicate this via the threadAdjustmentInterval variable. This means that you should not call HillClimbingInstance.Update(..) until another threadAdjustmentInterval milliseconds have elapsed. (link to source code that calculates this)
  8. The current thread count is only decreased when threads complete their current task. At that point the current count is compared to the desired amount and if necessary a thread is ‘retired’
  9. The algorithm with only returns values that respect the limits specified by ThreadPool.SetMinThreads(..) and ThreadPool.SetMaxThreads(..) (link to the code that handles this)
  10. In addition, it will only recommend increasing the thread count if the CPU Utilization is below 95%

First lets look at the graphs that were published in the research paper from Microsoft (Optimizing Concurrency Levels in the .NET ThreadPool):

They clearly show the thread-pool adapting the number of threads (up and down) as the throughput changes, so it appears the algorithm is doing what it promises.

Now for a similar image using the self-contained test app I wrote. Now, my test app only pretends to add/remove threads based on the results for the Hill Climbing algorithm, so it’s only an approximation of the real behaviour, but it does provide a nice way to see it in action outside of the CLR.

In this simple scenario, the work-load that we are asking the thread-pool to do is just moving up and then down (click for full-size image):

Finally, we’ll look at what the algorithm does in a more noisy scenario, here the current ‘work load’ randomly jumps around, rather than smoothly changing:

So with a combination of a very detailed MSDN article, a easy-to-read research paper and most significantly having the source code available, we are able to get an understanding of what the .NET Thread Pool is doing ‘under-the-hood’!

References

  1. Concurrency - Throttling Concurrency in the CLR 4.0 ThreadPool (I recommend reading this article before reading the research papers)
  2. Optimizing Concurrency Levels in the .NET ThreadPool: A case study of controller design and implementation
  3. Configuring Resource Managers Using Model Fuzzing: A Case Study of the .NET Thread Pool
  4. MSDN page on ‘Parallel Tasks’ (see section on ‘Thread Injection’)
  5. Patent US20100083272 - Managing pools of dynamic resources

Further Reading

  1. Erika Parsons and Eric Eilebrecht - CLR 4 - Inside the Thread Pool - Channel 9
  2. New and Improved CLR 4 Thread Pool Engine (Work-stealing and Local Queues)
  3. .NET CLR Thread Pool Internals (compares the new Hill Climbing algorithm, to the previous algorithm used in the Legacy Thread Pool)
  4. CLR thread pool injection, stuttering problems
  5. Why the CLR 2.0 SP1’s threadpool default max thread count was increased to 250/CPU
  6. Use a more dependable policy for thread pool thread injection (CoreCLR GitHub Issue)
  7. Use a more dependable policy for thread pool thread injection (CoreFX GitHub Issue)
  8. ThreadPool Growth: Some Important Details
  9. .NET’s ThreadPool Class - Behind The Scenes (Based on SSCLI source, not CoreCLR)
  10. CLR Execution Context (in Russian, but Google Translate does a reasonable job)
  11. Thread Pool + Task Testing (by Ben Adams)
  12. The Injector: A new Executor for Java (an improved thread-injector for the Java Thread Pool)

Discuss this post on Hacker News and /r/programming

The post The CLR Thread Pool 'Thread Injection' Algorithm first appeared on my blog Performance is a Feature!


Thu, 13 Apr 2017, 12:00 am


The .NET IL Interpreter

Whilst writing a previous blog post I stumbled across the .NET Interpreter, tucked away in the source code. Although, it I’d made even the smallest amount of effort to look for it, I’d have easily found it via the GitHub ‘magic’ file search:

Usage Scenarios

Before we look at how to use it and what it does, it’s worth pointing out that the Interpreter is not really meant for production code. As far as I can tell, its main purpose is to allow you to get the CLR up and running on a new CPU architecture. Without the interpreter you wouldn’t be able to test any C# code until you had a fully functioning JIT that could emit machine code for you. For instance see ‘[ARM32/Linux] Initial bring up of FEATURE_INTERPRETER’ and ‘[aarch64] Enable the interpreter on linux as well.

Also it doesn’t have a few key features, most notable debugging support, that is you can’t debug through C# code that has been interpreted, although you can of course debug the interpreter itself. From ‘Tiered Compilation step 1’:

…. - the interpreter is not in good enough shape to run production code as-is. There are also some significant issues if you want debugging and profiling tools to work (which we do).

You can see an example of this in ‘Interpreter: volatile ldobj appears to have incorrect semantics?’ (thanks to alexrp for telling me about this issue). There is also a fair amount of TODO comments in the code, although I haven’t verified what (if any) specific C# code breaks due to the missing functionality.

However, I think another really useful scenario for the Interpreter is to help you learn about the inner workings of the CLR. It’s only 8,000 lines long, but it’s all in one file and most significantly it’s written in C++. The code that the CLR/JIT uses when compiling for real is in multiple several files (the JIT on it’s own is over 200,000 L.O.C, spread across 100’s of files) and there are large amounts hand-written written in raw assembly.

In theory the Interpreter should work in the same way as the full runtime, albeit not as optimised. This means that it much simpler and those of us who aren’t CLR and/or assembly experts can have a chance of working out what’s going on!

Enabling the Interpreter

The Interpreter is disabled by default, so you have to build the CoreCLR from source to make it work (it used to be the fallback for ARM64 but that’s no longer the case), here’s the diff of the changes you need to make:

--- a/src/inc/switches.h
+++ b/src/inc/switches.h
@@ -233,5 +233,8 @@
 #define FEATURE_STACK_SAMPLING
 #endif // defined (ALLOW_SXS_JIT)

+// Let's test the .NET Interpreter!!
+#define FEATURE_INTERPRETER
+
 #endif // !defined(CROSSGEN_COMPILE)

You also need to enable some environment variables, the ones that I used are in the table below. For the full list, take a look at Host Configuration Knobs and search for ‘Interpreter’.

Name Description Interpret Selectively uses the interpreter to execute the specified methods InterpreterDoLoopMethods If set, don’t check for loops, start by interpreting all methods InterpreterPrintPostMortem Prints summary information about the execution to the console DumpInterpreterStubs Prints all interpreter stubs that are created to the console TraceInterpreterEntries Logs entries to interpreted methods to the console TraceInterpreterIL Logs individual instructions of interpreted methods to the console TraceInterpreterVerbose Logs interpreter progress with detailed messages to the console TraceInterpreterJITTransition Logs when the interpreter determines a method should be JITted

To test out the Interpreter, I will be using the code below:

public static void Main(string[] args)
{
    var max = 1000 * 1000;
    if (args.Length > 0)
        int.TryParse(args[0], out max);
    var timer = Stopwatch.StartNew();
    for (int i = 1; i

Thu, 30 Mar 2017, 12:00 am


A Hitchhikers Guide to the CoreCLR Source Code

photo by Alan O’Rourke

Just over 2 years ago Microsoft open-sourced the entire .NET framework, this posts attempts to provide a ‘Hitchhikers Guide’ to the source-code found in the CoreCLR GitHub repository.

To make it easier for you to get to the information you’re interested in, this post is split into several parts

It’s worth pointing out that .NET Developers have provided 2 excellent glossaries, the CoreCLR one and the CoreFX one, so if you come across any unfamiliar terms or abbreviations, check these first. Also there is extensive documentation available and if you are interested in the low-level details I really recommend checking out the ‘Book of the Runtime’ (BotR).

Overall Stats

If you take a look at the repository on GitHub, it shows the following stats for the entire repo

But most of the C# code is test code, so if we just look under /src (i.e. ignore any code under /tests) there are the following mix of Source file types, i.e. no ‘.txt’, ‘.dat’, etc:

  - 2,012 .cpp
  - 1,183 .h
  - 956 .cs
  - 113 .inl
  - 98 .hpp
  - 51 .S
  - 43 .py
  - 42 .asm
  - 24 .idl
  - 20 .c

So by far the majority of the code is written in C++, but there is still also a fair amount of C# code (all under ‘mscorlib’). Clearly there are low-level parts of the CLR that have to be written in C++ or Assembly code because they need to be ‘close to the metal’ or have high performance, but it’s interesting that there are large parts of the runtime written in managed code itself.

Note: All stats/lists in the post were calculated using commit 51a6b5c from the 9th March 2017.

Compared to ‘Rotor’

As a comparison here’s what the stats for ‘Rotor’ the Shared Source CLI looked like back in October 2002. Rotor was ‘Shared Source’, not truly ‘Open Source’, so it didn’t have the same community involvements as the CoreCLR.

Note: SSCLI aka ‘Rotor’ includes the fx or base class libraries (BCL), but the CoreCLR doesn’t as they are now hosted separately in the CoreFX GitHub repository

For reference, the equivalent stats for the CoreCLR source in March 2017 look like this:

  • Packaged as 61.2 MB .zip archive
    • Over 10.8 million lines of code (2.6 million of source code, under \src)
    • 24,485 Files (7,466 source)
      • 6,626 C# (956 source)
      • 2,074 C and C++
      • 3,701 IL
      • 93 Assembler
      • 43 Python
      • 6 Perl
  • Over 8.2 million lines of test code
  • Build output expands to over 1.2 G with tests
    • Product binaries 342 MB
    • Test binaries 909 MB

Top 10 lists

These lists are mostly just for fun, but they do give some insights into the code-base and how it’s structured.

Top 10 Largest Files

You might have heard about the mammoth source file that is gc.cpp, which is so large that GitHub refuses to display it.

But it turns out it’s not the only large file in the source, there are also several files in the JIT that are around 20K LOC. However it seems that all the large files are C++ source code, so if you’re only interested in C# code, you don’t have to worry!!

File # Lines of Code Type Location gc.cpp 37,037 .cpp \src\gc\ flowgraph.cpp 24,875 .cpp \src\jit\ codegenlegacy.cpp 21,727 .cpp \src\jit\ importer.cpp 18,680 .cpp \src\jit\ morph.cpp 18,381 .cpp \src\jit\ isolationpriv.h 18,263 .h \src\inc\ cordebug.h 18,111 .h \src\pal\prebuilt\inc\ gentree.cpp 17,177 .cpp \src\jit\ debugger.cpp 16,975 .cpp \src\debug\ee\

Top 10 Longest Methods

The large methods aren’t actually that hard to find, because they’re all have #pragma warning(disable:21000) before them, to keep the compiler happy! There are ~40 large methods in total, here’s the ‘Top 10’

Method # Lines of Code MarshalInfo::MarshalInfo(Module* pModule, 1,507 void gc_heap::plan_phase (int condemned_gen_number) 1,505 void CordbProcess::DispatchRCEvent() 1,351 void DbgTransportSession::TransportWorker() 1,238 LPCSTR Exception::GetHRSymbolicName(HRESULT hr) 1,216 BOOL Disassemble(IMDInternalImport *pImport, BYTE *ILHeader,… 1,081 bool Debugger::HandleIPCEvent(DebuggerIPCEvent * pEvent) 1,050 void LazyMachState::unwindLazyState(LazyMachState* baseState… 901 VOID ParseNativeType(Module* pModule, 886 VOID StubLinkerCPU::EmitArrayOpStub(const ArrayOpScript* pAr… 839

Top 10 files with the Most Commits

Finally, lets look at which files have been changed the most since the initial commit on GitHub back in January 2015 (ignore ‘merge’ commits)

File # Commits src\jit\morph.cpp 237 src\jit\compiler.h 231 src\jit\importer.cpp 196 src\jit\codegenxarch.cpp 190 src\jit\flowgraph.cpp 171 src\jit\compiler.cpp 161 src\jit\gentree.cpp 157 src\jit\lower.cpp 147 src\jit\gentree.h 137 src\pal\inc\pal.h 136

High-level Overview

Next we’ll take a look at how the source code is structured and what are the main components.

They say “A picture is worth a thousand words”, so below is a treemap with the source code files grouped by colour into the top-level sections they fall under. You can hover over an individual box to get more detailed information and can click on the different radio buttons to toggle the sizing (LOC/Files/Commits)

Total L.O.C # Files # Commits

Notes and Observations

  • The ‘# Commits’ only represent the commits made on GitHub, in the 2 1/2 years since the CoreCLR was open-sourced. So they are skewed to the recent work and don’t represent changes made over the entire history of the CLR. However it’s interesting to see which components have had more ‘churn’ in the last few years (i.e ‘jit’) and which have been left alone (e.g. ‘pal’)
  • From the number of LOC/files it’s clear to see what the significant components are within the CoreCLR source, e.g ‘vm’, ‘jit’, ‘pal’ & ‘mscorlib’ (these are covered in detail in the next part of this post)
  • In the ‘VM’ section it’s interesting to see how much code is generic ~650K LOC and how much is per-CPU architecture 25K LOC for ‘i386’, 16K for ‘amd64’, 14K for ‘arm’ and 7K for ‘arm64’. This suggests that the code is nicely organised so that the per-architecture work is minimised and cleanly separated out.
  • It’s surprising (to me) that the ‘GC’ section is as small as it is, I always thought of the GC is a very complex component, but there is way more code in the ‘debugger’ and the ‘pal’.
  • Likewise, I never really appreciated the complexity if the ‘JIT’, it’s the 2nd largest component, comprising over 370K LOC.

If you’re interested, this raw numbers for the code under ‘/src’ are available in this gist and for the code under ‘/tests/src’ in this gist.

Deep Dive into Individual Areas

As the source code is well organised, the top-level folders (under /src) correspond to the logical components within the CoreCLR. We’ll start off by looking at the most significant components, i.e. the ‘Debugger’, ‘Garbage Collector’ (GC), ‘Just-in-Time compiler’ (JIT), ‘mscorlib’ (all the C# code), ‘Platform Adaptation Layer’ (PAL) and the CLR ‘Virtual Machine’ (VM).

mscorlib

The ‘mscorlib’ folder contains all the C# code within the CoreCLR, so it’s the place that most C# developers would start looking if they wanted to contribute. For this reason it deserves it’s own treemap, so we can see how it’s structured:

Total L.O.C # Files # Commits

So by-far the bulk of the code is at the ‘top-level’, i.e. directly in the ‘System’ namespace, this contains the fundamental types that have to exist for the CLR to run, such as:

  • AppDomain, WeakReference, Type,
  • Array, Delegate, Object, String
  • Boolean, Byte, Char, Int16, Int32, etc
  • Tuple, Span, ArraySegment, Attribute, DateTime

Where possible the CoreCLR is written in C#, because of the benefits that ‘managed code’ brings, so there is a significant amount of code within the ‘mscorlib’ section. Note that anything under here is not externally exposed, when you write C# code that runs against the CoreCLR, you actually access everything through the CoreFX, which then type-forwards to the CoreCLR where appropriate.

I don’t know the rules for what lives in CoreCLR v CoreFX, but based on what I’ve read on various GitHub issues, it seems that over time, more and more code is moving from CoreCLR -> CoreFX.

However the managed C# code is often deeply entwined with unmanaged C++, for instance several types are implemented across multiple files, e.g.

From what I understand this is done for performance reasons, any code that is perf sensitive will end up being implemented in C++ (or even Assembly), unless the JIT can suitable optimise the C# code.

Code shared with CoreRT

Recently there has been a significant amount of work done to moved more and more code over into the ‘shared partition’. This is the area of the CoreCLR source code that is shared with CoreRT (‘the .NET Core runtime optimized for AOT compilation’). Because certain classes are implemented in both runtimes, they’ve ensured that the work isn’t duplicated and any fixes are shared in both locations. You can see how this works by looking at the links below:

Other parts of mscorlib

All the other sections of mscorlib line up with namespaces available in the .NET runtime and contain functionality that most C# devs will have used at one time or another. The largest ones in there are shown below (click to go directly to the source code):

vm (Virtual Machine)

The VM, not surprisingly, is the largest component of the CoreCLR, with over 640K L.O.C spread across 576 files, and it contains the guts of the runtime. The bulk of the code is OS and CPU independent and written in C++, however there is also a significant amount of architecture-specific assembly code, see the section ‘CPU Architecture-specific code’ for more info.

The VM contains the main start-up routine of the entire runtime EEStartupHelper() in ceemain.cpp, see ‘The 68 things the CLR does before executing a single line of your code’ for all the details. In addition it provides the following functionality:

CPU Architecture-specific code

All the architecture-specific code is kept separately in several sub-folders, amd64, arm, arm64 and i386. For example here’s the various implementations of the WriteBarrier function used by the GC:

jit (Just-in-Time compiler)

Before we look at the actual source code, it’s worth looking at the different ‘flavours’ or the JIT that are available:

Fortunately one of the Microsoft developers has clarified which one should be used

Here’s my guidance on how non-MS contributors should think about contributing to the JIT: If you want to help advance the state of the production code-generators for .NET, then contribute to the new RyuJIT x86/ARM32 backend. This is our long term direction. If instead your interest is around getting the .NET Core runtime working on x86 or ARM32 platforms to do other things, by all means use and contribute bug fixes if necessary to the LEGACY_BACKEND paths in the RyuJIT code base today to unblock yourself. We do run testing on these paths today in our internal testing infrastructure and will do our best to avoid regressing it until we can replace it with something better. We just want to make sure that there will be no surprises or hard feelings for when the time comes to remove them from the code-base.

JIT Phases

The JIT has almost 90 source files, but fortunately they correspond to the different phases it goes through, so it’s not too hard to find your way around. Using the table from ‘Phases of RyuyJIT’, I added the right-hand column so you can jump to the relevant source file(s):

Phase IR Transformations File Pre-import Compiler->lvaTable created and filled in for each user argument and variable. BasicBlock list initialized. compiler.hpp Importation GenTree nodes created and linked in to Statements, and Statements into BasicBlocks. Inlining candidates identified. importer.cpp Inlining The IR for inlined methods is incorporated into the flowgraph. inline.cpp and inlinepolicy.cpp Struct Promotion New lvlVars are created for each field of a promoted struct. morph.cpp Mark Address-Exposed Locals lvlVars with references occurring in an address-taken context are marked. This must be kept up-to-date. compiler.hpp Morph Blocks Performs localized transformations, including mandatory normalization as well as simple optimizations. morph.cpp Eliminate Qmarks All GT_QMARK nodes are eliminated, other than simple ones that do not require control flow. compiler.cpp Flowgraph Analysis BasicBlock predecessors are computed, and must be kept valid. Loops are identified, and normalized, cloned and/or unrolled. flowgraph.cpp Normalize IR for Optimization lvlVar references counts are set, and must be kept valid. Evaluation order of GenTree nodes (gtNext/gtPrev) is determined, and must be kept valid. compiler.cpp and lclvars.cpp SSA and Value Numbering Optimizations Computes liveness (bbLiveIn and bbLiveOut on BasicBlocks), and dominators. Builds SSA for tracked lvlVars. Computes value numbers. liveness.cpp Loop Invariant Code Hoisting Hoists expressions out of loops. optimizer.cpp Copy Propagation Copy propagation based on value numbers. copyprop.cpp Common Subexpression Elimination (CSE) Elimination of redundant subexressions based on value numbers. optcse.cpp Assertion Propagation Utilizes value numbers to propagate and transform based on properties such as non-nullness. assertionprop.cpp Range analysis Eliminate array index range checks based on value numbers and assertions rangecheck.cpp Rationalization Flowgraph order changes from FGOrderTree to FGOrderLinear. All GT_COMMA, GT_ASG and GT_ADDR nodes are transformed. rationalize.cpp Lowering Register requirements are fully specified (gtLsraInfo). All control flow is explicit. lower.cpp, lowerarm.cpp, lowerarm64.cpp and lowerxarch.cpp Register allocation Registers are assigned (gtRegNum and/or gtRsvdRegs),and the number of spill temps calculated. regalloc.cpp and register_arg_convention.cp Code Generation Determines frame layout. Generates code for each BasicBlock. Generates prolog & epilog code for the method. Emit EH, GC and Debug info. codegenarm.cpp, codegenarm64.cpp, codegencommon.cpp, codegenlegacy.cpp, codegenlinear.cpp and codegenxarch.cpp

pal (Platform Adaptation Layer)

The PAL provides an OS independent layer to give access to common low-level functionality such as:

As .NET was originally written to run on Windows, all the APIs look very similar to the Win32 APIs. However for non-Windows platforms they are actually implemented using the functionality available on that OS. For example this is what PAL code to read/write a file looks like:

int main(int argc, char *argv[])
{
  WCHAR  src[4] = {'f', 'o', 'o', '\0'};
  WCHAR dest[4] = {'b', 'a', 'r', '\0'};
  WCHAR  dir[5] = {'/', 't', 'm', 'p', '\0'};
  HANDLE h;
  unsigned int b;

  PAL_Initialize(argc, (const char**)argv);
  SetCurrentDirectoryW(dir);
  SetCurrentDirectoryW(dir);
  h =  CreateFileW(src, GENERIC_WRITE, FILE_SHARE_READ, NULL, CREATE_NEW, 0, NULL);
  WriteFile(h, "Testing\n", 8, &b, FALSE);
  CloseHandle(h);
  CopyFileW(src, dest, FALSE);
  DeleteFileW(src);
  PAL_Terminate();
  return 0;
}

The PAL does contain some per-CPU assembly code, but it’s only for very low-level functionality, for instance here’s the different implementations of the DebugBreak function:

gc (Garbage Collector)

The GC is clearly a very complex piece of code, lying right at the heart of the CLR, so for more information about what it does I recommend reading the BotR entry on ‘Garbage Collection Design’ and if you’re interested I’ve also written several blog posts looking at its functionality.

However from a source code point-of-view the GC is pretty simple, it’s spread across just 19 .cpp files, but the bulk of the work is in gc.cpp (raw version) all ~37K L.O.C of it!!

If you want to get deeper into the GC code (warning, it’s pretty dense), a good way to start is to search for the occurrences of various ETW events that are fired as the GC moves through the phases outlined in the BotR post above, these events are listed below:

  • FireEtwGCTriggered(..)
  • FireEtwGCAllocationTick_V1(..)
  • FireEtwGCFullNotify_V1(..)
  • FireEtwGCJoin_V2(..)
  • FireEtwGCMarkWithType(..)
  • FireEtwGCPerHeapHistory_V3(..)
  • FireEtwGCGlobalHeapHistory_V2(..)
  • FireEtwGCCreateSegment_V1(..)
  • FireEtwGCFreeSegment_V1(..)
  • FireEtwBGCAllocWaitBegin(..)
  • FireEtwBGCAllocWaitEnd(..)
  • FireEtwBGCDrainMark(..)
  • FireEtwBGCRevisit(..)
  • FireEtwBGCOverflow(..)
  • FireEtwPinPlugAtGCTime(..)
  • FireEtwGCCreateConcurrentThread_V1(..)
  • FireEtwGCTerminateConcurrentThread_V1(..)

But the GC doesn’t work in isolation, it also requires help from the Execute Engine (EE), this is done via the GCToEEInterface which is implemented in gcenv.ee.cpp.

Local GC and GC Sample

Finally, there are 2 others ways you can get into the GC code and understand what it does.

Firstly there is a GC sample the lets you use the full GC independent of the rest of the runtime. It shows you how to ‘create type layout information in format that the GC expects’, ‘implement fast object allocator and write barrier’ and ‘allocate objects and work with GC handles’, all in under 250 LOC!!

Also worth mentioning is the ‘Local GC’ project, which is an ongoing effort to decouple the GC from the rest of the runtime, they even have a dashboard so you can track its progress. Currently the GC code is too intertwined with the runtime and vica-versa, so ‘Local GC’ is aiming to break that link by providing a set of clear interfaces, GCToOSInterface and GCToEEInterface. This will help with the CoreCLR cross-platform efforts, making the GC easier to port to new OSes.

debug

The CLR is a ‘managed runtime’ and one of the significant components it provides is a advanced debugging experience, via Visual Studio or WinDBG. This debugging experience is very complex and I’m not going to go into it in detail here, however if you want to learn more I recommend you read ‘Data Access Component (DAC) Notes’.

But what does the source look like, how is it laid out? Well the a several main sub-components under the top-level /debug folder:

  • dacaccess - the provides the ‘Data Access Component’ (DAC) functionality as outlined in the BotR page linked to above. The DAC is an abstraction layer over the internal structures in the runtime, which the debugger uses to inspect objects/classes
  • di - this contains the exposed APIs (or entry points) of the debugger, implemented by CoreCLRCreateCordbObject(..) in cordb.cpp
  • ee - the section of debugger that works with the Execution Engine (EE) to do things like stack-walking
  • inc - all the interfaces (.h) files that the debugger components implement

All the rest

As well as the main components, there are various other top-level folders in the source, the full list is below:

  • binder
    • The ‘binder’ is responsible for loading assemblies within a .NET program (except the mscorlib binder which is elsewhere). The ‘binder’ comprises low-level code that controls Assemblies, Application Contexts and the all-important Fusion Log for diagnosing why assemblies aren’t loading!
  • classlibnative
    • Code for native implementations of many of the core data types in the CoreCLR, e.g. Arrays, System.Object, String, decimal, float and double.
    • Also includes all the native methods exposed in the ‘System.Environment’ namespace, e.g. Environment.ProcessorCount, Environment.TickCount, Environment.GetCommandLineArgs(), Environment.FailFast(), etc
  • coreclr
  • corefx
  • dlls
  • gcdump and gcinfo
    • Code that will write-out the GCInfo that is produced by the JIT to help the GC do it’s job. This GCInfo includes information about the ‘liveness’ of variables within a section of code and whether the method is fully or partially interruptible, which enables the EE to suspend methods when the GC is working.
  • ilasm
    • IL (Intermediate Language) Assembler is a tool for converting IL code into a .NET executable, see the MSDN page for more info and usage examples.
  • ildasm
    • Tool for disassembling a .NET executable into the corresponding IL source code, again, see the MSDN page for info and usage examples.
  • inc
    • Header files that define the ‘interfaces’ between the sub-components that make up the CoreCLR. For example corjit.h covers all communication between the Execution Engine (EE) and the JIT, that is ‘EE -> JIT’ and corinfo.h is the interface going the other way, i.e. ‘JIT -> EE’
  • ipcman
    • Code that enables the ‘Inter-Process Communication’ (IPC) used in .NET (mostly legacy and probably not cross-platform)
  • md
    • The MetaData (md) code provides the ability to gather information about methods, classes, types and assemblies and is what makes Reflection possible.
  • nativeresources
    • A simple tool that is responsible for converting/extracting resources from a Windows Resource File.
  • palrt
    • The PAL (Platform Adaptation Layer) Run-Time, contains specific parts of the PAL layer.
  • scripts
    • Several Python scripts for auto-generating various files in the source (e.g. ETW events).
  • strongname
  • ToolBox
    • Contains 2 stand-alone tools
      • SOS (son-of-strike) the CLR debugging extension that enables reporting of .NET specific information when using WinDBG
      • SuperPMI which enables testing of the JIT without requiring the full Execution Engine (EE)
  • tools
  • unwinder
    • Provides the low-level functionality to make it possible for the debugger and exception handling components to walk or unwind the stack. This is done via 2 functions, GetModuleBase(..) and GetFunctionEntry(..) which are implemented in CPU architecture-specific code, see amd64, arm, arm64 and i386
  • utilcode
    • Shared utility code that is used by the VM, Debugger and JIT
  • zap

If you’ve read this far ‘So long and thanks for all the fish’ (YouTube)

Discuss this post on Hacker News and /r/programming


Thu, 23 Mar 2017, 12:00 am


The 68 things the CLR does before executing a single line of your code (*)

Because the CLR is a managed environment there are several components within the runtime that need to be initialised before any of your code can be executed. This post will take a look at the EE (Execution Engine) start-up routine and examine the initialisation process in detail.

(*) 68 is only a rough guide, it depends on which version of the runtime you are using, which features are enabled and a few other things

‘Hello World’

Imagine you have the simplest possible C# program, what has to happen before the CLR prints ‘Hello World’ out to the console?

using System;

namespace ConsoleApplication
{
    public class Program
    {
        public static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }
}

The code path into the EE (Execution Engine)

When a .NET executable runs, control gets into the EE via the following code path:

  1. _CorExeMain() (the external entry point)
  2. _CorExeMainInternal()
  3. EnsureEEStarted()
  4. EEStartup()
  5. EEStartupHelper()

(if you’re interested in what happens before this, i.e. how a CLR Host can start-up the runtime, see my previous post ‘How the dotnet CLI tooling runs your code’)

And so we end up in EEStartupHelper(), which at a high-level does the following (from a comment in ceemain.cpp):

EEStartup is responsible for all the one time initialization of the runtime.
Some of the highlights of what it does include

  • Creates the default and shared, appdomains.
  • Loads mscorlib.dll and loads up the fundamental types (System.Object …)

The main phases in EE (Execution Engine) start-up routine

But let’s look at what it does in detail, the lists below contain all the individual function calls made from EEStartupHelper() (~500 L.O.C). To make them easier to understand, we’ll split them up into separate phases:

  • Phase 1 - Set-up the infrastructure that needs to be in place before anything else can run
  • Phase 2 - Initialise the core, low-level components
  • Phase 3 - Start-up the low-level components, i.e. error handling, profiling API, debugging
  • Phase 4 - Start the main components, i.e. Garbage Collector (GC), AppDomains, Security
  • Phase 5 - Final setup and then notify other components that the EE has started

Note some items in the list below are only included if a particular feature is defined at build-time, these are indicated by the inclusion on an ifdef statement. Also note that the links take you to the code for the function being called, not the line of code within EEStartupHelper().

Phase 1 - Set-up the infrastructure that needs to be in place before anything else can run

  1. Wire-up console handling - SetConsoleCtrlHandler(..) (ifndef FEATURE_PAL)
  2. Initialise the internal SString class (everything uses strings!) - SString::Startup()
  3. Make sure the configuration is set-up, so settings that control run-time options can be accessed - EEConfig::Set-up() and InitializeHostConfigFile() (#if !defined(CROSSGEN_COMPILE))
  4. Initialize Numa and CPU group information - NumaNodeInfo::InitNumaNodeInfo() and CPUGroupInfo::EnsureInitialized() (#ifndef CROSSGEN_COMPILE)
  5. Initialize global configuration settings based on startup flags - InitializeStartupFlags()
  6. Set-up the Thread Manager that gives the runtime access to the OS threading functionality (StartThread(), Join(), SetThreadPriority() etc) - InitThreadManager()
  7. Initialize Event Tracing (ETW) and fire off the CLR startup events - InitializeEventTracing() and ETWFireEvent(EEStartupStart_V1) (#ifdef FEATURE_EVENT_TRACE)
  8. Set-up the GS Cookie (Buffer Security Check) to help prevent buffer overruns - InitGSCookie()
  9. Create the data-structures needed to hold the ‘frames’ used for stack-traces - Frame::Init()
  10. Ensure initialization of Apphacks environment variables - GetGlobalCompatibilityFlags() (#ifndef FEATURE_CORECLR)
  11. Create the diagnostic and performance logs used by the runtime - InitializeLogging() (#ifdef LOGGING) and PerfLog::PerfLogInitialize() (#ifdef ENABLE_PERF_LOG)

Phase 2 - Initialise the core, low-level components

  1. Write to the log ===================EEStartup Starting===================
  2. Ensure that the Runtime Library functions (that interact with ntdll.dll) are enabled - EnsureRtlFunctions() (#ifndef FEATURE_PAL)
  3. Set-up the global store for events (mutexes, semaphores) used for synchronisation within the runtime - InitEventStore()
  4. Create the Assembly Binding logging mechanism a.k.a Fusion - InitializeFusion() (#ifdef FEATURE_FUSION)
  5. Then initialize the actual Assembly Binder infrastructure - CCoreCLRBinderHelper::Init() which in turn calls AssemblyBinder::Startup() (#ifdef FEATURE_FUSION is NOT defined)
  6. Set-up the heuristics used to control Monitors, Crsts, and SimpleRWLocks - InitializeSpinConstants()
  7. Initialize the InterProcess Communication with COM (IPC) - InitializeIPCManager() (#ifdef FEATURE_IPCMAN)
  8. Set-up and enable Performance Counters - PerfCounters::Init() (#ifdef ENABLE_PERF_COUNTERS)
  9. Set-up the CLR interpreter - Interpreter::Initialize() (#ifdef FEATURE_INTERPRETER), turns out that the CLR has a mode where your code is interpreted instead of compiled!
  10. Initialise the stubs that are used by the CLR for calling methods and triggering the JIT - StubManager::InitializeStubManagers(), also Stub::Init() and StubLinkerCPU::Init()
  11. Set up the core handle map, used to load assemblies into memory - PEImage::Startup()
  12. Startup the access checks options, used for granting/denying security demands on method calls - AccessCheckOptions::Startup()
  13. Startup the mscorlib binder (used for loading “known” types from mscorlib.dll) - MscorlibBinder::Startup()
  14. Initialize remoting, which allows out-of-process communication - CRemotingServices::Initialize() (#ifdef FEATURE_REMOTING)
  15. Set-up the data structures used by the GC for weak, strong and no-pin references - Ref_Initialize()
  16. Set-up the contexts used to proxy method calls across App Domains - Context::Initialize()
  17. Wire-up events that allow the EE to synchronise shut-down - g_pEEShutDownEvent->CreateManualEvent(FALSE)
  18. Initialise the process-wide data structures used for reader-writer lock implementation - CRWLock::ProcessInit() (#ifdef FEATURE_RWLOCK)
  19. Initialize the debugger manager - CCLRDebugManager::ProcessInit() (#ifdef FEATURE_INCLUDE_ALL_INTERFACES)
  20. Initialize the CLR Security Attribute Manager - CCLRSecurityAttributeManager::ProcessInit() (#ifdef FEATURE_IPCMAN)
  21. Set-up the manager for Virtual call stubs - VirtualCallStubManager::InitStatic()
  22. Initialise the lock that that GC uses when controlling memory pressure - GCInterface::m_MemoryPressureLock.Init(CrstGCMemoryPressure)
  23. Initialize Assembly Usage Logger - InitAssemblyUsageLogManager() (#ifndef FEATURE_CORECLR)

Phase 3 - Start-up the low-level components, i.e. error handling, profiling API, debugging

  1. Set-up the App Domains used by the CLR - SystemDomain::Attach() (also creates the DefaultDomain and the SharedDomain by calling SystemDomain::CreateDefaultDomain() and SharedDomain::Attach())
  2. Start up the ECall interface, a private native calling interface used within the CLR - ECall::Init()
  3. Set-up the caches for the stubs used by delegates - COMDelegate::Init()
  4. Set-up all the global/static variables used by the EE itself - ExecutionManager::Init()
  5. Initialise Watson, for windows error reporting - InitializeWatson(fFlags) (#ifndef FEATURE_PAL)
  6. Initialize the debugging services, this must be done before any EE thread objects are created, and before any classes or modules are loaded - InitializeDebugger() (#ifdef DEBUGGING_SUPPORTED)
  7. Activate the Managed Debugging Assistants that the CLR provides - ManagedDebuggingAssistants::EEStartupActivation() (ifdef MDA_SUPPORTED)
  8. Initialise the Profiling API - ProfilingAPIUtility::InitializeProfiling() (#ifdef PROFILING_SUPPORTED)
  9. Initialise the exception handling mechanism -