June 10, 2026 · The Katabench team

Green checkmarks lie: why we grade C# on performance

Submit a working solution on almost any practice platform and you get the same reward: a green checkmark. It doesn't matter if your solution allocates a gigabyte, hammers the database with a thousand queries, or runs a hundred times slower than it should. Correct is correct.

Except in production, correct isn't correct.

In production, the O(n²) loop that sailed through the test cases melts when the input is a million rows. The innocent-looking LINQ query fans out into an N+1 stampede. The "working" service is a dependency tangle nobody dares to touch. None of these are correctness bugs, and all of them are the difference between a junior and a senior engineer.

Practice what actually gets graded

Our thesis is simple: you get good at what your feedback loop measures. If the loop only measures correctness, that's all you'll train. So we measure what production measures:

Wall-clock time against a budget, per test. Every puzzle has hidden tests with large inputs, hundreds of thousands of elements. A naive solution doesn't fail a code review here; it fails the test suite, with a timeout and a bar showing exactly how far over budget you were.
Allocated bytes. Speed isn't the whole story in .NET. Allocation pressure is the silent tax. Every test reports how much your solution allocated on the managed heap.
The actual SQL. Database puzzles run your LINQ against a real database and show you every query your code generates. The N+1 isn't an abstract warning. It's forty queries staring at you where one should be.
Structure and architecture rules. Refactoring katas are graded on Roslyn-measured structural metrics, and architecture katas on design rules: dependency direction, layering, abstraction boundaries, checked automatically on every submission.
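
The first two measurements are easy to approximate on your own machine. Here is a minimal sketch, assuming a hypothetical `BudgetProbe` helper and `Measurement` type that we made up for this post; it is not our grader, just the shape of the idea. It uses only `Stopwatch` and `GC.GetAllocatedBytesForCurrentThread()` from the base class library.

```csharp
using System;
using System.Diagnostics;

// Hypothetical result type for illustration only.
public readonly record struct Measurement(TimeSpan Elapsed, long AllocatedBytes, bool WithinBudget);

// Hypothetical helper, not the platform's grader: it wraps a single call
// and reports elapsed wall-clock time and managed allocations.
public static class BudgetProbe
{
    public static Measurement Measure(Action work, TimeSpan budget)
    {
        // Bytes allocated so far on the managed heap by the current thread.
        long bytesBefore = GC.GetAllocatedBytesForCurrentThread();
        var stopwatch = Stopwatch.StartNew();

        work();

        stopwatch.Stop();
        long allocated = GC.GetAllocatedBytesForCurrentThread() - bytesBefore;

        return new Measurement(stopwatch.Elapsed, allocated, stopwatch.Elapsed <= budget);
    }
}
```

Calling `BudgetProbe.Measure(() => solution.TwoSum(nums, target), TimeSpan.FromMilliseconds(200))` gives you a rough local version of the two numbers. Two caveats: the first call includes JIT time, so warm up before trusting the figure, and the allocation counter only sees the calling thread.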

What your suite says

✓ returns_expected_result
✓ handles_empty_input
✓ handles_duplicates
✓ sample_cases_3_of_3
✓ All green: 4 / 4

What production measures

✗ wall-clock: 840 ms / budget 200 ms
✗ allocations: 1.2 GB
✗ 40 queries where 1 belongs
✗ query plan: seq scan
✗ Over budget (same code)

Both columns are true of the same submission. The suite checked whether it works; production measures what it costs.
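
The "40 queries where 1 belongs" row is the N+1 shape, and it is easy to reproduce without a real database. The sketch below uses a hypothetical in-memory `FakeDb` that counts each method call as one round trip. The class names and the orders/items schema are assumptions for illustration; this is not EF Core and not how our database puzzles are wired.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a database: every public method call
// counts as one query.
public sealed class FakeDb
{
    private readonly Dictionary<int, string[]> _itemsByOrder;

    public int QueryCount { get; private set; }

    public FakeDb(int orderCount)
    {
        // One item per order keeps the example small.
        _itemsByOrder = Enumerable.Range(1, orderCount)
            .ToDictionary(id => id, id => new[] { $"item-{id}" });
    }

    public List<int> GetOrderIds()
    {
        QueryCount++;
        return _itemsByOrder.Keys.ToList();
    }

    public string[] GetItemsForOrder(int orderId)
    {
        QueryCount++;
        return _itemsByOrder[orderId];
    }

    public Dictionary<int, string[]> GetOrdersWithItems()
    {
        QueryCount++;
        return new Dictionary<int, string[]>(_itemsByOrder);
    }
}

public static class Reports
{
    // N+1 shape: one query for the ids, then one more per order.
    public static int CountItemsNaive(FakeDb db)
    {
        var total = 0;
        foreach (var id in db.GetOrderIds())
            total += db.GetItemsForOrder(id).Length;
        return total;
    }

    // Single round trip: orders and their items fetched together.
    public static int CountItemsJoined(FakeDb db) =>
        db.GetOrdersWithItems().Values.Sum(items => items.Length);
}
```

With 39 orders, `CountItemsNaive` leaves `QueryCount` at 40 and `CountItemsJoined` leaves it at 1, and both return the same total. A correctness-only test cannot tell them apart.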

What that feels like

Take the classic warm-up, Two Sum. The obvious solution is two nested loops:

public int[] TwoSum(int[] nums, int target)
{
    // Brute force: try every pair (i, j) with i < j, which is O(n²) time.
    for (var i = 0; i < nums.Length; i++)
        for (var j = i + 1; j < nums.Length; j++)
            if (nums[i] + nums[j] == target)
                return [i, j];
    return []; // no pair sums to target
}

Correct? Completely. Submit it here and the sample cases pass, then a hidden test named "Large input (n = 200,000), must be O(n)" times out at its budget. The grade isn't a vague "try to do better." It's a hard fail with a number attached.
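
The arithmetic behind that timeout is worth doing once. In the worst case the nested loops visit every pair with i < j, which is n(n − 1)/2 pairs. A minimal sketch, using a hypothetical `PairCount` helper of our own invention:

```csharp
using System;

public static class PairCount
{
    // Number of (i, j) pairs with i < j that the nested loops visit
    // when no early return happens. Cast to long to avoid int overflow.
    public static long WorstCasePairs(int n) => (long)n * (n - 1) / 2;
}
```

`PairCount.WorstCasePairs(200_000)` is 19,999,900,000: roughly twenty billion additions and comparisons, against 200,000 loop iterations for a single pass over the array.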

One dictionary later:

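public int[] TwoSum(int[] nums, int target)
{
    // Maps each value seen so far to its index.
    var seen = new Dictionary<int, int>();
    for (var i = 0; i < nums.Length; i++)
    {
        // If the complement was seen earlier, that index and i form the pair.
        if (seen.TryGetValue(target - nums[i], out var j))
            return [j, i];
        seen[nums[i]] = i;
    }
    return []; // no pair sums to target
}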

Same inputs, 4 milliseconds, 2% of the budget. That contrast (felt, not read about) is how Big-O stops being an interview fact and becomes an instinct.

Measured where you can't fake it

One more thing matters: all of this is measured server-side, right next to the executing code, inside an isolated sandbox. The browser never grades anything, so the numbers are clean and comparable. Your 4 ms and the leaderboard's 4 ms are the same 4 ms.

(If "you run my code on your servers" raises an eyebrow, good. It should. We wrote up exactly how the pipeline and the sandbox work, because we'd want to know too.)

The platform is live, and the Free plan includes the full algorithm catalog, performance-graded, with time and memory metrics on every test. Come find out how fast your C# really is.
