Note: the code for the TypeScript (to help you follow along) is available here: Encoding Algebraic Data Types in TypeScript code
Background
Recently I've been reviewing TypeScript code at work. A common pattern I've observed in pull requests and some open source projects is a type that looks like the following:
type FooTag = 'Bar' | 'Baz' | 'Qux';
type Foo<A> = {
_tag : FooTag,
/* Required for the 'Bar' tagged values of Foo */
bar? : string,
/* Required for the 'Baz' tagged values of Foo */
baz? : number,
/* Required for the 'Qux' tagged values of Foo */
qux? : boolean,
/* Optional for the 'Qux' tagged values of Foo */
quux? : A
}
Notice that Foo
is a parameterized type (also called
generic type) which means it places a value of type A
which is
determined by the caller somewhere in its structure.
In a business application we might place a Currency
enumeration (or
another domain type) that might look like the following:
type Currency =
'USD'
| 'JPY'
| 'EUR'
| 'CHF'
You might notice that (except for the _tag
attribute value) all the
other attributes of type Foo
are optionally required to be
present. The above type declaration would allow us to construct the
following values that typecheck against this type:
// According to the documentation, this is an INVALID Bar kind of Foo
const bar : Foo<Currency> = { _tag : 'Bar', baz : 123 };
// This is an INVALID Baz kind of Foo
const baz : Foo<Currency> = { _tag : 'Baz' }
// This is an INVALID Qux without an optional quxx attribute
const qux : Foo<Currency> = { _tag : 'Qux', qux : true, bar : 'WAT?' }
// According to the documentation this is an INVALID Qux kind of Foo
const qux2 : Foo<Currency> = { qux : false, quux : 'JPY' }
Here only the last invalid construction (according to the
documentation) is raised as a type error by TypeScript. All the other
invalid forms of the data get past tsc
with no noise being made.
Remembering Algebraic Data Types (ADTs)
For the remainder of this blog post, I'll assume you know the basics of algebraic data types before continuing with this post, but if you need a refresher, I wrote about algebraic data types while ago.
Inconsistencies of the single record approach
The invalid data constructions above (assuming the documentation is
correct) did not get caught by the tsc
typechecker even though the
data was badly constructed. This adds overhead for the developer using
these types as inputs to their functions or outputs from their functions.
I often see extra logic in utility or combinator functions that accommodate bad constructions scattered over all the impacted functions like this:
// I did not want to introduce fp-ts to readers who are new to it
// as I wanted the focus to be on encoding algebraic data types in
// TypeScrpt, but ordinarily I would return fp-ts's Option<Foo<A>> or
// Either<AddError, Foo<A>> depending on whether the calling code
// might need to differentiate between error cases.
const addBars = <A extends unknown>(bar0 : Foo<A>, bar1 : Foo<A>) : Foo<A> | null => {
const tags : string = bar0._tag + "/" + bar1._tag;
switch (tags) {
case "Bar/Bar":
// In an ill-constructed Bar value the bar attribute might not
// be set because the bar attribute was optional.
return (bar0.bar && bar1.bar) ? { _tag : 'Bar', bar : bar0.bar + bar1.bar } : null;
default: return null
}
};
Using this in ts-node
we would see:
>>> // hits path that returns a new Bar value appending the bar attribute
>>> addBars({ _tag: 'Bar', bar: 'hello'}, { _tag: 'Bar', bar: ' world'})
{ "_tag": "Bar", "bar": "hello world" }
>>> // hits the ternary else null path
>>> addBars({ _tag: 'Bar', baz: 1234}, { _tag: 'Bar', bar: 'wow' })
null
>>> // hits the default path that returns null
>>> addBars({ _tag: 'Baz', baz: 1234}, { _tag: 'Bar', bar: 'wow' })
null
The clutter in addBars
adds up across many functions and is not
isolated to one area of the code or we end up writing partial
functions that ignore bad constructions.
We could offer clients of our library "smart constructors" which is a fancy way to say "functions that create well-formed data constructions". Something like the following:
const mkBar = (s : string) : Foo<Currency> => ({ _tag : 'Bar', bar : s });
const mkBaz = (n : number) : Foo<Currency> => ({ _tag : 'Baz', baz : n });
const mkQux = (qux : boolean, quux? : Currency) : Foo<Currency> => ({ '_tag' : 'Qux', qux, quux });
We can construct valid data from these smart constructors like so:
const foo0 = mkBar('bla bla');
const foo1 = mkBaz(12345);
const foo2 = mkQux(true);
const foo3 = mkQux(false, 'USD');
All these values get inferred as type Foo
as expected (due to the
type annotation for the return type of the smart constructors).
This certainly helps keep clients of our library construct valid forms
of Foo
but the type declaration can still be used by rogue callers
or those that don't bother reading the documentation (myself
included). We could argue about whether it is worth our time nannying
these developers but for the sake of this blog post, I will assume we
want to do the best we can (within reason) to eliminate bad
constructions of type Foo
so that we don't need to add extra logic
to our functions that consume these values to check they are
well-formed according to how we intended (read the inline
documentation in the type declaration).
Native Algebraic Data Types
Stepping back this is how we might define data types in a language that natively supports algebraic data types satisfying the documentation's intent.
Here is a datatype declaration in PureScript:
data Foo a
= Bar { bar :: String }
| Baz { baz :: Int }
| Qux { qux :: Boolean, quux :: Maybe a }
What it says is that we have one type named Foo
and there are only three
ways we can construct a valid value of Foo
according to this type. Namely:
foo0 = Bar { bar : "wowser" }
foo1 = Baz { baz : 3000 }
foo2 = Qux { qux : true, quux : Nothing }
Similarly, the application type Currency
used in the TypeScript
example looks like this in PureScript:
data Currency = USD | CHF | EUR | JPY
With the declaration of Foo a
in PureScript, we are not able to
construct the equivalent to the invalid forms that escaped the
TypeScript typechecker unscathed above.
Here is us attempting the equivalent of { _tag : 'Bar', baz : 123
}
. It yields the following type error:
>>> invalid0 = Bar { baz : 123 }
Could not match type
()
with type
( baz :: Int
| t0
)
while trying to match type ()
with type ( baz :: Int
| t0
)
while checking that expression { baz: 123
}
has type { bar :: String
}
in value declaration invalid0
where t0 is an unknown type
Now attempting the PureScript equivalent of { _tag : 'Baz' }
does not
give us a type error but will infer the type as { baz :: Int } -> Foo
Currency
.
Annotating with the type of Foo Currency
yields the following error from the
PureScript typechecker:
>>> invalid1 = Baz :: Foo Currency
Could not match type
Function
{ baz :: Int
}
with type
Foo
while trying to match type { baz :: Int
}
-> Foo t0
with type Foo Currency
while checking that expression Baz
has type Foo Currency
in value declaration invalid1
where t0 is an unknown type
Attempting to do the equivalent of { _tag : 'Qux', qux : true, bar :
'WAT?' }
in PureScript with the sum of products algebraic data type
that describes our intent (based on documentation) yields:
>>> invalid2 = Qux { qux : true, bar : "WAT?" }
Could not match type
()
with type
( bar :: String
| t0
)
while trying to match type ()
with type ( bar :: String
| t0
)
while checking that expression { qux: true
, bar: "WAT?"
}
has type { quux :: Maybe String
, qux :: Boolean
}
in value declaration invalid2
where t0 is an unknown type
We can already see that describing our Foo a
parameterized type in a
sum of products in PureScript keeps our troublesome callers out of
much more trouble than our record-with-many-optional-attributes
encoding in TypeScript at the top.
TypeScript encodings of non-native algebraic data types
So how can we encode this kind of algebraic data type in TypeScript which doesn't natively support them (yet)?
We will explore some of the design space and note the progressive solution's trade-offs as we go along, attempting to improve on the faithfulness of the encoding to PureScript's native algebraic data type's characteristics.
Our first attempt might be the following:
type Bar2 = { _tag : 'Bar', bar : string }
type Baz2 = { _tag : 'Baz', baz : number }
type Qux2<A> = { _tag : 'Qux', qux : boolean, quux? : A }
type Foo2<A> = Bar2 | Baz2 | Qux2<A>
We can then determine which data construction was used via a switch statement like so:
const switcher = (foo : Foo2<Currency>) : string | Currency | undefined => {
switch (foo._tag) {
case 'Bar': return 'fizz';
case 'Baz': return 'buzz';
case 'Qux': return foo.quux;
default: return undefined;
}
};
We can no longer mix and match attributes that don't correspond to the
data constructor "tag" like we could with the "tagged" single
record. However, we can pass a null
value for one of the constructor
arguments instead of giving it the required Currency
value
expected. We also probably still want to provide smart
constructors. Also note that we now have four types: Bar
, Baz
,
Qux
, and Foo
not just Foo
. This point may not be that important
for all library designers.
However, let's see I we can prevent the nullability of the constructor attributes:
type Bar3 = { _tag : 'Bar', bar : string }
type Baz3 = { _tag : 'Baz', baz : number }
type Qux3<A> = { _tag : 'Qux', qux : boolean, quux? : NonNullable<A> }
type Foo3<A> = Bar3 | Baz3 | Qux3<A>
Note that:
- primitive types like
string
,number
,boolean
cannot accept anull
value. - we only have to specifically annotate the parameterized type for the
optional
quuz
attribute asNonNullable
to get the semantics we want since it can be anything.
Now if we wanted to eliminate the possibility of constructing a value
of type Bar4
instead of unifying all the values under Foo4
then we
could update the type declaration like so:
interface Bar4 { _tag : 'Bar', bar : string }
interface Baz4 { _tag : 'Baz', baz : number }
interface Qux4<A> { _tag : 'Qux', qux : boolean, quux? : NonNullable<A> }
type Foo4<A> = Bar4 | Baz4 | Qux4<A>
const mkBar4 = <A extends unknown>(s : string) : Foo4<A> => ({ _tag : 'Bar', bar : s });
const mkBaz4 = <A extends unknown>(n : number) : Foo4<A> => ({ _tag : 'Baz', baz : n });
const mkQux4 = <A extends unknown>(qux : boolean, quux? : NonNullable<A>) : Foo4<A> => ({ '_tag' : 'Qux', qux, quux });
Do not fear, we can still create a Qux4
construction of Foo4
with
no quux
attribute, but if one is given, it cannot be null
. This
is our intent based on the original documentation.
This ensures we don't create values of types Bar4
, Baz4
, or Qux4
instead of Foo4
now giving us the same basic guarantees as
PureScript's sum of products algebraic data type definition for data
construction without pattern matching or exhaustivity checking.
>>> mkBar4('bla bla bla')
{ "_tag": "Bar", "bar": "bla bla bla" }
>>> mkBaz4(3000);
{ "_tag": "Baz", "baz": 3000 }
>>> mkQux4(false, 'Coronavirus')
{ "_tag": "Qux", "qux": false, "quux": "Coronavirus" }
>>> mkQux4<MyType>(false)
{ "_tag": "Qux", "qux": false }
The _tag
values in each case correspond to the data constructor of
Foo
so we can switch on it when we need to but are the types
inferred what we expect (all values infer as Foo
)?
First let's see what happens when we tell the typechecker they should
be Bar4
, Baz4
, and Qux4
respectively.
>>> const v0 : Bar4<Currency> = mkBar4('bla bla bla');
Type 'Foo4' is not assignable to type 'Bar4'.
Property 'bar' is missing in type 'Baz4' but required in type 'Bar4'.(2322)
>>> const v1 : Baz4<Currency> = mkBaz4(3000);
Type 'Foo4' is not assignable to type 'Baz4'.
Property 'baz' is missing in type 'Bar4' but required in type 'Baz4'.(2322)
>>> const v2 : Qux4<Currency> = mkQux4<string>(false, 'Coronavirus');
Type 'Foo4' is not assignable to type 'Qux4'.
Property 'qux' is missing in type 'Bar4' but required in type 'Qux4'.(2322)
Removing explicit type annotations we see TypeScript infers type
Foo2<A>
which is ideally what we want for a faithful encoding of a
sum type, which is "closed" (some familiar with JVM
languages may say sealed) over a specified number of data
constructions at time of declaration (it is not extensible later).
In languages that support algebraic data types natively, like PureScript and Haskell, the typechecker can do exhaustivity checking on our pattern matching clauses. This one kind of check early in development can eliminate a large category of common runtime defects very early in the development process converting them to compile-time checks yielding with fast developer feedback. This means the developer retains their full context of the bugfix or feature being worked on. This is hugely valuable on several fronts:
- developers can fix these partial function problems closer to the type they were created
- reduces developer frustration
- less likely to ship partial logic to environments higher than local development, saving everyone else's time too.
Next, we will attempt to trick TypeScript into providing exhaustivity checking for our switch-case statements that we are using as a poor substitute for pattern matching.
Can we get tsc to do exhaustivity checking on our _tag switch statements?
In the earlier switch
statement, the default
branch returned a null
which always irritated me because we had to update our return type to
include null
. Now let's remove the return null
statement and tighten
the return type to string
:
const matcher = (foo : Foo4<Currency>) : string => {
switch (foo._tag) {
case 'Bar': return 'sand';
case 'Baz': return 'zap';
// Uncomment the following line to get tsc to succeed typechecking
//case 'Qux': return 'xoom';
}
};
When we comment one or more of the possible _tag
cases out of our
switch statement the TypeScript compiler gives us the error:
Function lacks ending return statement and return type does not include 'undefined'.
Uncommenting the missing switch statement gives us typechecked code.
We now have some form of exhaustivity checking from this encoding.
Alternative encodings of algebraic data types in TypeScript
This is just one way to encode algebraic data types in TypeScript with the characteristics we care most about and I hope this post gives you a launchpad to explore the design space further.
The following are other approaches or explorations that accomplish the same or similar goals in TypeScript without native algebraic data types:
Parting Thoughts
I hope I've shown that algebraic data types offer developers a way to build up descriptions of data from the small up to the very big using the same basic construct.
Some languages like PureScript (plus Haskell, Fsharp, OCaml) offer a direct or native encoding of algebraic data types (ADTs) that lets the developer focus on modeling their domain for their application's needs.
I also demonstrated that even in languages that do not have a direct or native encoding of algebraic data types we can still describe data in a compositional way with some extra boilerplate needed to gain the ideal characteristics of ADTs, specifically in TypeScript.
My 2012 article about Algebraic Data Types showed Scala using an indirect encoding with sealed traits and case classes that was very common in Scala2 codebases. Since then Scala3 provides more direct encodings of ADTs though still not as native as ML-style languages support.
By building up a rich domain model from smaller ADTs we keep things compositional which makes it easier for developers to break down solutions to big problems into smaller parts that can be composed together without magic or fear and the ability to test and validate the smaller parts of the solution independently. This helps us keep our sanity the longer we maintain our codebases.
Have fun and exploit algebraic data types in your language (or consider using a language that more directly encodes this idea so you can worry less about the boilerplate and more about modeling your application domain).
If you enjoyed this content, please consider sharing this link with a friend, following my GitHub, Twitter/X or LinkedIn accounts, or subscribing to my RSS feed.