How grading works

The pipeline

Every submission takes the same journey: a safety screen (hostile code never executes), server-side compilation (errors come back with full diagnostics), execution in a sealed, disposable sandbox, and grading. The full story (including what the sandbox does and doesn't allow) is on the How it works page.

Result statuses

Status	Meaning
Passed	Correct output, within the time budget.
Failed	Wrong output. Visible tests show a character-level diff of expected vs. actual output.
Timeout	The run exceeded the per-test time budget, whether or not the output was correct. This usually means the algorithm's complexity is too high, not that the logic is wrong.
Error	Your code threw an exception. You'll see the exception type and message.
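
One way to read the table is as a decision order. The sketch below is illustrative only: the `RunOutcome` fields, the `classify` function, and the choice to check for an exception before the time budget are assumptions, not the grader's actual code. The `first_difference` helper is a simplified stand-in for the character-level diff, reporting only where the two outputs first diverge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunOutcome:
    """Raw outcome of one test run. All field names are hypothetical."""
    expected: str
    actual: Optional[str]            # None if the code produced no output
    elapsed_ms: float
    budget_ms: float
    exception: Optional[str] = None  # e.g. "InvalidOperationException: ..."


def classify(run: RunOutcome) -> str:
    """Map a raw outcome to one of the four statuses in the table."""
    # Assumption: a thrown exception is reported before anything else.
    if run.exception is not None:
        return "Error"
    # Timeout applies whether or not the output was correct.
    if run.elapsed_ms > run.budget_ms:
        return "Timeout"
    if run.actual != run.expected:
        return "Failed"
    return "Passed"


def first_difference(expected: str, actual: str) -> Optional[int]:
    """Index of the first differing character, or None if the strings match."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # One string is a prefix of the other: they diverge where it ends.
        return min(len(expected), len(actual))
    return None
```

Note that a correct answer that arrives late still classifies as Timeout, which matches the table's "whether or not the output was correct".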

Time budgets

Each test has its own wall-clock budget, measured on the server, immediately around your code. The results panel shows a bar of how much of the budget you used. A pass at 95% of the budget is a warning sign, even though it still counts as a pass. Hidden performance tests use inputs large enough that complexity decides the outcome: an efficient solution passes comfortably, and a naive one cannot.
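
To make "measured immediately around your code" concrete, here is a minimal sketch of timing a single call and turning the result into a fraction of the budget. The function names and the 90% warning threshold are assumptions chosen for illustration; the real grader's harness and thresholds are not documented here.

```python
import time
from typing import Callable, Tuple


def run_with_budget(solution: Callable[[], object],
                    budget_ms: float) -> Tuple[object, float, float]:
    """Time only the call to `solution`, not any setup or teardown."""
    start = time.perf_counter()
    result = solution()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Fraction of the budget consumed; above 1.0 means over budget.
    return result, elapsed_ms, elapsed_ms / budget_ms


def budget_note(used_fraction: float, warn_at: float = 0.9) -> str:
    """Turn a budget fraction into a short note (warn_at is hypothetical)."""
    if used_fraction > 1.0:
        return "over budget"
    if used_fraction >= warn_at:
        return "passed, but close to the limit"
    return "comfortable"
```

Under these assumptions, `budget_note(0.95)` returns the "close to the limit" note: the run passes, but a slightly larger input or a slower run would not.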

Memory

Tests also report bytes allocated on the managed heap during your run, and some puzzles set an allocation budget. Watch this number even where it isn't enforced. Allocation pressure is the silent performance tax in .NET.
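
The effect is easy to see in any managed runtime. The sketch below is an analogy, not the grader's measurement: it uses Python's `tracemalloc`, which reports peak traced memory rather than total bytes allocated on the .NET managed heap, and the helper names are made up for the example. It compares two ways of computing the same sum, one that materializes an intermediate list and one that streams.

```python
import tracemalloc
from typing import Callable, Tuple


def peak_allocated_bytes(work: Callable[[], object]) -> Tuple[object, int]:
    """Run `work` and return its result plus the peak traced memory in bytes."""
    tracemalloc.start()
    try:
        result = work()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def sum_squares_materialized(n: int) -> int:
    # Builds a full list of n integers before summing it.
    squares = [i * i for i in range(n)]
    return sum(squares)


def sum_squares_streaming(n: int) -> int:
    # Produces one value at a time, so no intermediate list is allocated.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    for fn in (sum_squares_materialized, sum_squares_streaming):
        total, peak = peak_allocated_bytes(lambda: fn(100_000))
        print(f"{fn.__name__}: result={total}, peak bytes={peak}")
```

Both functions return the same answer, but the materialized version holds far more memory at its peak. In .NET the equivalent habit is avoiding throwaway collections and intermediate strings on hot paths.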

Beyond time and memory

Each track adds its own gate on top of the behavioral tests:

Database puzzles can grade the execution plan the engine chose: a full-table scan where an index should be used fails, even when the rows come back right.
Refactoring puzzles enforce structural metrics measured from your source: method length, cyclomatic complexity, nesting depth, and duplicate blocks, each against an explicit limit.
Architecture katas verify design rules: dependency direction, layer isolation, required abstractions, sealed types.
Secure-coding puzzles run a separate adversarial suite of real attack payloads alongside the functional tests. Both suites must pass.
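
As an illustration of the execution-plan gate on database puzzles, the sketch below asks an engine for its plan before and after an index exists. It uses SQLite as a stand-in because it ships with Python; the engine behind the database puzzles is not named on this page, and the table, index, and helper names are invented for the example.

```python
import sqlite3
from typing import List, Sequence


def query_plan(conn: sqlite3.Connection, sql: str,
               params: Sequence = ()) -> List[str]:
    """Return the human-readable steps of SQLite's plan for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The plan description is the last column of each row.
    return [row[-1] for row in rows]


def uses_full_scan(plan: List[str]) -> bool:
    """SQLite labels a full-table scan with a step that starts with SCAN."""
    return any(step.startswith("SCAN") for step in plan)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

lookup = "SELECT id, name FROM users WHERE email = ?"
target = ("user42@example.com",)

# Without an index, the only option is to scan every row.
plan_before = query_plan(conn, lookup, target)

# With an index on the filtered column, the engine can seek directly.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, lookup, target)
```

Both versions of the query return the same rows. Only the plan differs, and the plan is what such a gate inspects.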

Metric gates report the measured value next to the limit, so a failure tells you exactly how far off you are, and a pass tells you how close you came.
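
A gate report of this kind can be sketched in a few lines. The `GateResult` type, its report format, and the assumption that limits are inclusive maximums are all illustrative; the real results panel may present the numbers differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    """One metric gate: a measured value checked against a limit."""
    metric: str
    measured: int
    limit: int  # assumption: the limit is an inclusive maximum

    @property
    def passed(self) -> bool:
        return self.measured <= self.limit

    @property
    def margin(self) -> int:
        """Positive when under the limit, negative when over it."""
        return self.limit - self.measured

    def report(self) -> str:
        if not self.passed:
            detail = f"FAIL, {-self.margin} over the limit"
        elif self.margin == 0:
            detail = "pass, exactly at the limit"
        else:
            detail = f"pass, {self.margin} under the limit"
        return f"{self.metric}: {self.measured} / {self.limit} ({detail})"


if __name__ == "__main__":
    print(GateResult("cyclomatic complexity", 12, 10).report())
    print(GateResult("method length", 28, 30).report())
```

Showing the margin in both directions is the point: a failure of 2 over the limit suggests a small extraction, while a pass with 2 to spare suggests the method is still close to trouble.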

Hidden tests

Hidden tests never reveal their input, expected output, or your actual output. They show only the test name, the pass/fail result, and the timing. This keeps the suite meaningful, because you can't hardcode your way past it. The test names are written to be honest hints ("Large input (n = 200,000), must be O(n)").
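
To see why a hint like "must be O(n)" matters at n = 200,000, compare two correct solutions to a made-up task (detecting whether a list contains a duplicate). The task and function names are illustrative and do not come from any specific puzzle.

```python
from typing import Hashable, Sequence


def has_duplicate_quadratic(values: Sequence[Hashable]) -> bool:
    """Compare every pair: O(n^2) comparisons in the worst case."""
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False


def has_duplicate_linear(values: Sequence[Hashable]) -> bool:
    """Remember what has been seen: O(n) expected time with a hash set."""
    seen = set()
    for value in values:
        if value in seen:
            return True
        seen.add(value)
    return False
```

Both functions give the same answers on small inputs, so visible tests cannot tell them apart. On 200,000 distinct values, the pairwise version performs n(n-1)/2, or roughly 2 × 10^10, comparisons, while the set-based version performs 200,000 lookups. That gap is what a hidden large-input test is built to expose.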

Fairness

Timings come from the grading servers (not your browser, not your network), so results are consistent and comparable. Everyone on the leaderboard is measured by the same clock.