Finding back a ReDoS vulnerability in Zod
On the 28th of September 2023, a critical security vulnerability affecting all versions of Zod was reported under the identifier CVE-2023-4316. Zod is known as "TypeScript-first schema validation with static type inference".
This post proposes an approach to anticipate and prevent such vulnerabilities. by leveraging fast-check and its ecosystem.
The vulnerability
This particular vulnerability has been classified as a Regular Expression Denial of Service (ReDOS).It falls under a specific category of issues where regular expressions suffer from uncontrolled execution time. If you'd like a more in-depth explanation of ReDOS, you can refer to this blog post. For detailed information on this specific vulnerability, check out the report on Snyk.
In the case of Zod, the ReDOS vulnerability affected the email address validator and has been addressed through colinhacks/zod#2824.
The concept
fast-check has been designed for bug detection. It not only generates random inputs for the algorithm under test but also strategically exploits known weaknesses in JavaScript code. In the context of ReDOS, our objective isn't just to identify crashes or unfulfilled invariants. Instead, we aim to pinpoint inputs driving our algorithm into uncontrollable or prolonged executions.
In simpler terms, we want fast-check to alert us whenever it encounters inputs putting our code out-of-control.
The basic setup
Our initial setup consists into repharsing our concept in terms of property understandable by fast-check. The skeleton for such test is as follows:
import fc from 'fast-check';
const timeLimitMs = 1_000; // TODO: specify a limit based on the algorithm
const arbitraries = []; // TODO: our arbitraries or generators for random data
fc.assert(
fc.property(...arbitraries, (...inputs) => {
const startTime = performance.now();
try {
algorithm(...inputs);
} catch (err) {}
const endTime = performance.now();
const delayMs = endTime - startTime;
if (delayMs > timeLimitMs) {
throw new Error(`The computation of algorithm(...inputs) took ${delayMs}ms`);
}
}),
);
Applying this to the Zod case, we would write:
import { z } from 'zod'; // eg.: 3.22.2 is vulnerable
import fc from 'fast-check';
const timeLimitMs = 1_000;
const validator = z.string().email();
fc.assert(
fc.property(fc.emailAddress(), (rawString) => {
const startTime = performance.now();
try {
validator.parse(rawString);
} catch (err) {}
const endTime = performance.now();
const delayMs = endTime - startTime;
if (delayMs > timeLimitMs) {
throw new Error(`The computation of validator.parse took ${delayMs}ms`);
}
}),
);
While this approach effectively identifies the vulnerability, there are several gotchas to consider, which we'll address in the following section.
The advanced setup
Input size
For many ReDOS or DOS-related issues, large inputs often play a crucial role. By default, fast-check limits itself to relatively small inputs to prevent tests from running excessively long. However, in our investigations of ReDOS, we require the generation of large entries.
To enable this, add the following line:
fc.configureGlobal({ baseSize: 'xlarge' });
This should be done before instantiating any arbitraries.
Size can also be specified on an arbitrary-by-arbitrary basis. You can for instance use fc.emailAddress({ size: "xlarge" })
instead of fc.emailAddress()
if you only want to override the default for size on enmail addresses.
Number of entries
On stable, well-maintained projects like Zod, vulnerabilities can be challenging to pinpoint, as they may have already been addressed. Thus, limiting the number of property executions to the default 100 in fast-check may not be efficient.
We recommend increasing the number of runs by providing an extra argument to fc.assert
:
fc.assert(property, { numRuns: 1_000_000 });
Remember, this increase may be adjusted when incorporating this test into your continuous integration pipeline.
Shrinker
While shrinking is generally useful, if our goal is to identify inputs causing long execution times, immediate shrinking may not be necessary. In this case, we can pass an extra option to fc.assert
to bypass the shrinking logic:
fc.assert(property, { endOnFailure: true });
// with the number of entries and shrinker together:
// fc.assert(property, { numRuns: 1_000_000, endOnFailure: true });
Omitting the shrinker doesn't mean you won't be able to perform it later. Once you encounter the error and wish to shrink it, you can add the seed
and path
and remove the endOnFailure
option from fc.assert
.
Time
In such tests, there's nothing inherently preventing the code from running for hours or potentially indefinitely. By itself, fast-check can't intervene with a synchronously running piece of code, except waiting for it to conclude.
The package @fast-check/worker has been designed to address that problem. Instead of executing the synchronous code in the main thread, it spawns a worker to handle the property on the side. This ensures that any synchronous code can be stopped.
To implement this, we need to replace our fc.property
, fc.asyncProperty
and fc.assert
with the helpers provided by the worker package. Our template would be updated as follows:
import fc from 'fast-check';
import { assert, propertyFor } from '@fast-check/worker';
const property = propertyFor(new URL(import.meta.url));
const timeLimitMs = 1_000; // TODO: specify a limit based on the algorithm
const arbitraries = []; // TODO: our arbitraries or generators for random data
await assert(
property(...arbitraries, (rawString) => {
const startTime = performance.now();
try {
validator.parse(rawString);
} catch (err) {}
const endTime = performance.now();
const delayMs = endTime - startTime;
if (delayMs > timeLimitMs) {
throw new Error(`The computation of validator.parse took ${delayMs}ms`);
}
}),
);
However, as it stands, the code will not stop anything and will continue to wait indefinitely if the predicate does not conclude on its own. To address this, we need to introduce two extra options to assert
: interruptAfterTimeLimit
and markInterruptAsFailure
. These ensure that if one run exceeds interruptAfterTimeLimit
milliseconds, it will be terminated and marked as failed. Nevertheless, we will retain our original timeLimitMs
and its corresponding performance.now()
. Indeed, the worker runner needs to spin up a new worker periodically, and this process is accounted for in the time limit used by interruptAfterTimeLimit
.
The interruptAfterTimeLimit
guarantees that the time taken to spawn a new worker, transmit data to it, and execute the predicate will never exceed the specified limit.
Invalid items
In certain cases, DOS scenarios might be more likely for cases involving faulty inputs. While we initially proposed generating only valid email addresses for Zod, if we want to ensure that the code handles any possible user input without any issue, we may want to generate more than just valid email addresses.
For instance, we can replace fc.emailAddress()
with:
fc.oneof(fc.emailAddress(), fc.fullUnicodeString());
The final snippet
Now that we've addressed the basics and tackled its potential pitfalls, let's assemble everything to rediscover the issue reported in Zod:
import { z } from 'zod'; // eg.: 3.22.2 is vulnerable
import fc from 'fast-check';
import { isMainThread } from 'node:worker_threads';
import { assert, propertyFor } from '@fast-check/worker';
const property = propertyFor(new URL(import.meta.url));
const timeLimitMs = 1_000;
const validator = z.string().email();
await assert(
property(fc.emailAddress(), (...inputs) => {
const startTime = performance.now();
try {
algorithm(...inputs);
} catch (err) {}
const endTime = performance.now();
const delayMs = endTime - startTime;
if (delayMs > timeLimitMs) {
throw new Error(`The computation of algorithm(...inputs) took ${delayMs}ms`);
}
}),
{
// we want to stop immediately on failure to report issues asap, drop it to have shrinking
endOnFailure: true,
// we want to kill the predicate if it takes more than {interruptAfterTimeLimit}ms
interruptAfterTimeLimit: 60_000,
// and mark the run as failed
markInterruptAsFailure: true,
// fuzzing implies possibly running for longer than usual tests (when we want to look for the issues, not in CI)
numRuns: 1_000_000,
},
);
Upon running this locally, we promptly identify an issue:
Error: Property failed after 1233 tests
{ seed: 2051841007, path: "1232:5:1:2:12:18:24:30:36", endOnFailure: true }
Counterexample: ["aaaaaakeyconstructorcall1nl=constructorcalle.&g//{%fvf|&q+!v7@npd.z3n5vfs0ivqopytanq2ye37swpycij2a0.v6usxu6qfov9sb9rmown92tk6omw7ujl4-pa274fnbgnx0l9xdn18rq.nmsvklo9r3a-frz-2.gxqagvl7h2c5.imvj9wk-tw1rv8a.i.q3
hpcqgdugnhc8ydfjvvcfci4k1adqgnssmkecpqmiabqux08cfrh3su5zkf.binumohcqsyzjjetfbuntgknunsjeklecfoirjngvpzi"]
Shrunk 8 time(s)
Got error: The computation took 1667.1613000035286ms
This technique not only enables us to identify ReDOS, but can also be extended to uncover algorithmic issues unrelated to regular expressions. It essentially provides a framework for identifying accidental algorithmic complexities or infinite loops.