This work was carried out as part of a grant received from the Gordon and Betty Moore Foundation’s Data Driven Discovery Initiative.
Julia uses type inference to determine the types of program variables and generate fast, optimized code. But how does it really work? I recently redesigned the implementation of Julia’s type inference algorithm, and decided to blog what I’ve learned. If you’re already comfortable with these algorithms, I might not say anything new. But since only 26 people have committed any changes to Julia’s inference.jl
file, I suspect that this is an unfamiliar topic for most people.
However, I’m not going to focus on explaining the typical algorithms, but instead focus on some of the implementation challenges.
A high level view of type inference when approached as a data-flow problem (as Julia does) is that it involves running an “interpreter” on the program, but only looking at types instead of values. Unlike normal program execution, this process can be proven to always converge and terminate if certain criteria are met. A typical explanation of the algorithm might end with a statement such as “now iterate to convergence”. But this simple phrase masks a significant design complexity, which can impact the speed and precision of the algorithm. This is the component of the Julia algorithm that needed a significant redesign to address a particular recursion bug and make inference on self-recursive functions faster.
Many algorithms for type inference exist. For statically typed languages, the algorithm of choice is usually a flow-insensitive unification-based algorithm like Hindley-Milner. Julia’s usage of data-flow analysis, however, is better suited for dynamic typing. This approach allows, for example, variables to have different types at different points in the program, adding extra flexibility and convenience. Unlike static typing systems, type inference is used as an optimization, rather than to guarantee correctness at compile time. To succeed, it does not need to prove a unique tightest bound on types, so it is allowed to widen (over-approximate) types heuristically. As we will see, these heuristics have an interesting design space.
I have illustrated Julia’s algorithm below for a simple for loop. In the animation below, pc
is the “program counter” indicating the line currently being processed and pc´
is highlighted to indicate the lines that remain to be processed next:
In Julia’s data-flow algorithm, inference execution starts with the statements in a function and initializes several states:
Union{}
(aka Bottom
) or the type of the corresponding argument. Union{}
(does not return). Then the algorithm acts like a “virtual machine”, iterating the following process until there are no lines waiting to be examined:
if
statement or other control operation (while
, goto
, etc.), the previous operation is performed for all of the possible next statements, instead of just the next one. For example, given an if
-statement, inference will take both the if-true
and if-false
paths through the function (and merge the results at the end). As I shall discuss next, once convergence has been reached, inference is done. The return type and variable types are marked and cached for code generation and future reuse (for example, if the function is inlined). The function can also then potentially be better optimized now that all of the types are known.
Now we come to the real challenge of type inference: reaching convergence.
The data-flow algorithm can be proven to converge if two conditions are met: monotonicity and finiteness.
This first requirement means that merging two types must result in a less specific type, and that knowing less about type for an argument to a function call does not lead to knowing more about its return type. This condition is fairly easy to meet, since all we need to do is describe the behavior of language primitives in a well-behaved way.
The second requirement means that there must be limit on number of types that can be constructed. This condition is more interesting, since many useful type lattices, including Julia’s, do not obey it^{1}. Instead, for the purposes of reaching convergence, the lattice is “forced” to finite height using a technique called “widening”. In widening, heuristics are applied to make types grow much faster than they would in a naive implementation of the algorithm, leading to faster convergence. This ensures that Julia’s type-inference algorithm is guaranteed to reach convergence eventually.
That still leaves the practical challenge of determining how to reach convergence. In Julia, detecting convergence is handled in several parts: Julia has local convergence at the statement level within a function, convergence of static_typeof
variables (comprehensions), and global convergence of all functions in the entire call graph.
(A diagram representing a simple directed call-graph showing all of the function called by orange
)
Initially, Julia only had the first test. This made inference quite simple to implement as a recursive descent over the depth-first search on the call graph. At each call, the algorithm pauses inference on the current function to recur into the callee. Eventually a leaf function is reached (one which does not make any calls to an uninferred function) and then the whole stack can unravel, filling in the return type information as it goes, and continuing with iterating convergence on the current function.
(A diagram representing a call-graph with two disjoint recursion cycles)
There is a slight wrinkle with this plan however if recursion is encountered. If there is a recursion, there is no leaf function. Instead, when entering a new function, the call stack is first checked to see if this function had already been encountered. Then the algorithm needed to take several actions. First, it marked the function as being recursive (toprec
) and marked every intervening caller as depending on a recursive function (rec
). Then as the recursion unwound, those functions would record that their return type was an incomplete, under-approximation of the real return type and discard their incomplete inference work (marked true
in the method’s tfunc
cache). The topmost function in the recursion would then iterate this process until its return type reached a fixed-point.
(A diagram representing a call-graph with several interleaved recursion cycles)
Over time, this test was made more complicated to handle cases where recursion in the call graph can form multiple loops. Until v0.5, the algorithm computed a cycle-id for each recursion detected, to attempt to separate the dependency graph into separate recursion loops. This was not entirely correct because the graph is not always separable, which led to bug #9770 and then PR #9926 to compute it more correctly (I never attempted to prove it was fully correct, just that it was more correct on that issue). Unfortunately, that PR implementation also discarded and repeated significantly more incomplete work while handling recursion. The result was that inference became significantly slower on most tasks (such as inferring a single key handler for the repl), so the PR was never merged. To overcome these limitations, a new approach is required to deal with this case.
I mentioned previously that cycles cause issues for the existing type inference algorithm. The root of the problem is that using recursion to solve for the return types makes it inherently difficult to pause and resume type inference to incorporate new information. This impairs computation of the simultaneous convergence of all of the functions involved a cycle and instead led to a solution where only the top-most function in the call-graph is converged at a time. The resulting rework is very time-expensive in cases with heavy recursion, such as running inference on the inference code itself.
The solution is allow pausing and restarting inference on any part of a function at any time. This could still be written recursively, but once the state has been encapsulated in this way, it is easier to represent the algorithm as pure iteration. Indeed, this has obvious symmetries to inference within a function, where the working state is represented via an integer set of lines to revisit. This should not be particularly surprising, since there exists an isomorphism between statements and functions allowing one to be rewritten in terms of the other (e.g. replacing recursion with iteration or inlining one function into another).
The remaining construct that affects convergence is the static_typeof
expression. This construct appears in the lowered^{2} code for a comprehension is an unusual complication in Julia’s type inference algorithm. It has required special care to make it look like the result of comprehensions are predictable, even though the current design is not^{3}. That is intended to be fixed soon, so I’ve relegated an overview of this construct to the footnotes^{4}.
The corrected convergence algorithm in PR #15300 works iteratively over all functions in the call graph, building up an implicit graph of the call edges as it comes across them, until all functions have collectively reached a global convergence. When asked for the return type of a function edge, inference first checks a cache of previously computed return type information, to avoid frequent expensive re-computation. (This is implemented by searching the tfunc
field of a Method for existing work for the particular call signature. This could be an in-progress InferenceState
object, an inferred LambdaInfo
code object, or just a return type). If the type signature is not found, the new function edge is added to the work queue to be inferred. Then, in any of those cases, the current best-guess is returned. If inference has not finished with the function, an explicit edge is also added to the inference state for these two functions (caller and callee), which tracks the dependency of the return type of the callee on that line number in the caller.
There are several partially-overlapping sets (or states) that a function can be in. I’ll list them all here for your reference, but I won’t fully explain what my terms mean until later:
Any number of functions could be added to the work queue before starting global inference; the process doesn’t necessarily need to be rooted in one caller. Once the loop is called to start inference on everything, one function is selected^{5} from the work queue for the algorithm to make progress on. It is important that the algorithm is always making progress to ensure that it will eventually terminate (more on this later). The algorithm then iterates over the instructions in the current function until it runs out of lines that can make progress (the algorithm could pause work earlier, but restarting frequently is less efficient and makes it harder to compute the “making progress” metric). Each time the algorithm reaches a call to another function that is not yet inferred, it adds an edge to the inference state for this function.
Once all active lines are processed, the algorithm determines if it is completely finished by counting the number of unresolved edges. If the number is zero, all child functions have also finished inference and the current return type result is fully converged. In that case, the function can be removed from the inference queue and moved from the active
set to the finished
set (the result is also cached in the tfunc
map). Then all parent functions are re-added to the work queue (by following the list of back-edges) so they can check for their own convergence. This is then repeated until the entire queue is empty (*the queue can also become empty due to a recursion cycle – I’ll get to that shortly). Alternatively, the item is moved to the stuck
set until the functions that it called have been inferred and can provide their actual return type.
If all active lines have not been processed, the algorithm picks another function off of the work queue and repeats the process. Because each call-edge between two functions is explicitly tracked, each time the return value of a function changes the algorithm can easily re-mark the appropriate lines in all of its callers for revisiting by the type inference algorithm, by following the back-edges (thereby moving the caller from stuck
to stuck´
).
Now comes the tricky part: dealing with cycles.
I mentioned earlier that the work queue may also become empty due to recursion. When this happens, all of the unresolved edges of the function that was just moved – from work queue
to stuck
– contain cycles. There are several ways to deal with this situation, depending on how the implementation chooses to represent, detect, and resolve the cycle. The following is a description of the implementation proposed for Julia (editors note: the proposed PR #15300 has been accepted and merged for v0.5-dev). In terms of efficiency, this design optimizes for the case where most of the functions in the graph converge quickly (versus the possibility of needing to iterate the cycle many times before reaching a fixed point). Since most Julia code is written to have predictable “type-stable” leaf return types (and inference widens type information very quickly), this is expected to work out well in practice.
When the work queue becomes empty, the algorithm marks the current function as having reached a fixed-point cycle
. It then adds all of its non-fixed-point
edges to the work queue and marks them as having reached a fixed-point
. To guarantee convergence, the following invariant is maintained on this set at all times:
Any function marked as fixed-point
must either:
a. Be in the work queue
set of functions (or processing
)
b. Have all of its edges in the workqueue and marked as fixed-point
This makes it remarkably easy to track this property and maintain these invariants. If the return type of the function changes, it and all of its immediate back-edges are marked as not being at a fixed point (moving them to stuck´
). The back-edges of each of those back-edges are then recursively also unmarked (moving them to stuck
). If a function marked as fixed-point
is retired from processing
, it adds all of its edges to fixed-point
.
If the work queue is still empty at this point, then the edges of all functions in the fixed-point
set are also in the fixed-point
set, and the entire cycle has reached convergence (yay!), and can be moved en masse to the finished set.
Finally, if the work queue is still empty, then the active
set contained multiple independent call graphs (and all of the remaining ones contain cycles). A random one gets moved to processing
to kickstart continued cycle resolution.
Once the active
set is empty, inference is done. It is now the compiler’s job to decide what to do with all this information to maximize speed.
But wait! There’s one more complication: in addition to cycles on a function with the same signature, because we specialize each signature on the inferred type signature there can be a cycle for recursively calling a method where the type signature changes at each call.
(A diagram showing how the context-sensativity of the specialization type signature can create repeating patterns instead of simple recursion)
In this case, various heuristics are required to ensure this can still terminate. The current heuristics place a limit on the complexity of the type signature^{1} in the recursion to force it to to be finite. But a full discussion of these heuristics will have to wait for a future blog post.
After writing this post, testing of the algorithm above showed that it was unacceptably slower and more memory-hungry than the existing code. So after all of that work to remove all recursion and make it a simple work queue, I made one final tweak: I made it recursive. This made its behavior considerably more complex to analyze, but it pays for its complexity in the speed benefit of being able to infer the common case (non-cyclic functions) without creating a deferred edge. The fall-back, however, is still the corrected looping algorithm given previously. This change however demonstrates one advantage of the states chosen above for the algorithm: it is highly decoupled, so that only the act of making and finalizing an edge needs to be synchronized. The rest of the work can be executed in any order – in the future, it could even be run multi-threaded.
My thanks to @carnaval for correcting my convergence algorithm, encouraging me to simplify my state transitions, and generally for teaching me about this algorithm.
Also, his paper on static analysis capabilities for Julia helped me better understand some of the future features that may be desirable for Julia’s inference algorithm, and how those would affect the convergence criteria.
[M02] Markus Mohnen. A Graph-Free Approach to Data-flow Analysis. In R. Horspool, editor, Compiler Construction, volume 2304 of Lecture Notes in Computer Science, pages 185–213. Springer Berlin / Heidelberg, 2002.
[CC77] Patrick Cousot, Radhia Cousot – Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints (SIGPLAN 1977)
There are a finite number of user types defined, but the type parameters can be arbitrarily nested in depth and there are tuples and unions, which (in addition to depth), can have arbitrary length. This permits the following definitions that together characterize the size of a type:
depth(type) = 1 + maximum(depth, type.parameters)
length(type) = length(type.parameters)
Limiting the depth of a Type can be achieved by replacing some type parameters with a TypeVar of its supertype:
Type{Type{Type{Type{Int}}}} -> Type{Type{T <: Type}}
Limiting the length of a Union can be achieved by replacing several elements with their common supertype:
Union{Int, Float32, Float64} -> Union{Int, FloatingPoint}
Limiting the length of a Tuple can be achieved by replacing the tail types with a Vararg:
Tuple{Int, Int, Int} -> Tuple{Int, Vararg{Int}}
The “lowered” form of code is a simplified representation of the body of the function which is more efficient to analyze. ↩
The element type of the Array returned from a comprehension is defined as the result that type-inference was able to compute. This makes the correctness of the expected return type for a comprehension dependent on the behavior and quality of type-inference on the containing function. This is not desirable, since the behavior of the language spec is supposed to be independent of the result of the type-inference optimization. ↩
The static_typeof
expressions are an unusual operation that allows using the expected type of a variable prior to that variable’s actual definition. This odd behavior has added another dimension to the complexity of reaching convergence.
This construct has been used to implement comprehensions. For example, the code:
c = [f(x) for x in iter]
is handled by inference by first representing it as:
c = Array{$(Expr(:static_typeof, :fx),1}(0)
for x in iter
fx = f(x)
$(Expr(:type_goto, :fx))
push!(c, fx)
end
Typically, inference (in Julia) is done as a flow-sensitive analysis. However, static_typeof
explicitly violates that condition. Indeed, the placeholder variable fx
may not ever be defined (the type of that variable is thus the empty set: Union{}
).
Note: inference actually uses a more simplified representation than this. I’ve taken the liberty here to illustrate the approximate result of just this one transform in quasi-Julia syntax to preserve clarity. The fully lowered^{2} form that code_lowered(func, argtypes)
returns performs a similar transform while simultaneously simplifying the expression form to permit easier analysis.
The constraint that makes this computationally feasible is that no assignment to fx
may occur after any usage of c
in any execution path (or equivalently: there is no execution path in which a usage of c
occurs before the assignment to fx
). This structurally prohibits the type of fx
from depending recursively on the type of c
.
One issue remains, which is that initially the type of fx
is Union{}
. Normally, this would signal that an error was thrown on that line and stop inference from proceeding to the next statement. For a static_typeof
statement, the first iteration must instead proceed as if that statement will be computable in the future, then resume back at that point after the type information is known. However, this can cause the initial type inference result for c
to end up being Union{Array{Union{}}, Array{T}}
, even though the actual type of c
would be Array{T}
. This type over-approximation can be seen, for example, in the type inference result of a nested comprehension. In order to narrow this type, inference restarts while preserving only the static_typeof
variable inputs. This step is then iterated until the types of these variables remain unchanged across one of these restart iterations. ↩
For the correctness of the algorithm, it doesn’t matter which function is selected. Conceptually, it could be imagined that a function is selected by calling rand
. In practice, the algorithm tries to arrange the queue such that there is likely progress that can be made on the next item. But the invariants required by this implementation of the algorithm even allow that the function could be selected from stuck
or stuck´
instead. ↩