A Legit Analysis

Willem Vandercat of ROPtimus Prime has posted a fantastic analysis of our 2013 finals game, based on the data we released: A BS Analysis Based on Legit Data. This blog post is the requested response, so read their post first. It's really good!

1. Review

A: As far as we can tell, these flags became completely inaccessible to the teams and were thus effectively destroyed, invalidating the zero-sum assertion.

The plan was to make a list of all the teams that redeemed flags this round, and split up the remainders evenly, holding on to the leftovers for next round:

scoring_teams = Team.
  joins(:redemptions).
  where(redemptions: { round: ending_round }).
  distinct.to_a

Unfortunately, I wasn’t storing which round redemptions were made in:

scorebot=# select round_id, count(*) from redemptions group by round_id;
 round_id | count
----------+-------
     NULL | 18880
(1 row)

This meant that for the purposes of remainder reallocation, nobody ever scored, and we never reallocated those flags.
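For next year, the fix is simply to record the round when each redemption is created, so the query above actually matches something. A rough sketch of what that looks like; the association names here (team, token) are assumed rather than taken from the real schema:

# Hypothetical sketch: persist the round on every redemption so that
# where(redemptions: { round: ending_round }) finds rows next time.
Redemption.create!(
  team: redeeming_team,   # assumed association
  token: token,           # assumed association
  round: current_round    # this was being left NULL all game in 2013
)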

Admitting you have a problem is the first step towards recovery (or maybe causing the problem is), so this won’t happen again.

B: However, it appears that flags were never initially paired to specific services (see discussion of flags table in the database section below) making it difficult to prove that this in fact was the case throughout the contest.

This was a miscommunication on my part. We debated and discussed this internally, and while it did make it into the rules document, I got mixed up over what we decided and it never made it into the source code.

E: The database provides no transaction records for flags that were redistributed as a result of failed availability checks.

This is something I want to fix for next year. It made many things more difficult for us, and makes a lot of analysis impossible. Ideally, there’ll be a separate table just for flag movements, that can join to either redemptions or availabilities.
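To make the intent concrete, here's roughly the shape of the table I'm imagining. This is a sketch only; every name in it is hypothetical:

# Hypothetical migration for a dedicated flag-movements table that records
# every transfer, whether it came from a redemption or an availability failure.
class CreateFlagMovements < ActiveRecord::Migration
  def change
    create_table :flag_movements do |t|
      t.references :flag
      t.references :from_team
      t.references :to_team
      t.references :source, polymorphic: true  # joins to a redemption or an availability
      t.timestamps
    end
  end
end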

E:  A failed availability check resulted in the loss of 19 flags (presumably from that service's pool of flags, but again we can't tell) which were redistributed among the teams whose same service was deemed to be up in the same scoring period.

When failing an availability check, you lose 19 or all your flags, whichever is less. You can still receive flags lost by other teams.
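In code terms, the deduction is something like this (a minimal sketch; team and flags are assumed names, and only the number 19 comes from the rules):

# Lose 19 flags or everything you have left, whichever is smaller.
flags_lost = [19, team.flags.count].min
# Those flags are then redistributed among teams whose service passed the check.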

2. Description of the database

The final CTF score [Fig 2] appears to be derived directly from [the flags] table.

Yes:

SELECT t.name, COUNT(f.id) AS score
FROM teams AS t
LEFT JOIN flags AS f
  ON f.team_id = t.id
WHERE t.certname != 'legitbs'
GROUP BY t.name
ORDER BY
  score DESC,
  t.name ASC

ASSUMPTION: Lacking any other guidance, we assume for the purposes of this analysis that any availability status other than zero represents a failed availability check.

Correct. Status checks were separate programs (mostly Python and Perl scripts) executed by the main scoring script. A zero exit status indicates success, non-zero indicates failure. For most of the game, we logged stdout and stderr too.
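As a rough sketch of what the scoring script did with those programs (the script path and arguments here are made up):

require 'open3'

# Run one availability check and interpret its exit status.
stdout, stderr, status = Open3.capture3('./checks/service1_check.py', team_ip)
service_up = status.success?   # exit status 0 means the check passed
# stdout and stderr were also captured and logged for most of the game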

(the round_id field in the redemptions table is unused)

Whoops!

It is important to understand that the captures table contains the only record of flags transfers within the game and that this record is further limited to flags transfers that resulted from token redemptions. There is no data in the database that allows us to audit the transfer of flags that resulted from failed availability checks.

Yup :(

3. Interesting? data extracts from the database

This section is wonderful!

ASSUMPTION: a redemption with no associated capture record indicates that the team whose token was redeemed had no flags remaining to redistribute.

Yeah. Can’t get water from a stone.

SUBMITTING YOUR OWN TOKENS: According to the published scoring algorithm, this practice should not have increased a team's score, it merely should have stopped their score from decreasing as quickly as it might otherwise have.

This agrees with my interpretation, although the more important part is that it also stops your opponents’ scores from increasing as quickly as they otherwise would.

SOME THINGS WE CAN SAY ABOUT AVAILABILITY:  From rounds 1-99, LegitBS was also checked for availability and they failed 16 of these checks [Fig 8]. We have no idea whether there is any significance to this or not.

We maintained our own instance to run availability checks against, unreachable by teams, as a way to make sure our scoring scripts were working correctly. If we failed an availability check, it was because of bugginess on our end, and teams would be forgiven a failed SLA check that round.
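In pseudocode, the forgiveness rule amounts to something like this (a sketch with hypothetical helpers, not the actual scorebot code):

services.each do |service|
  # If our own reference instance failed this round's check, treat it as a
  # checker bug and forgive every team's SLA result for that service.
  next if legitbs_failed_check?(service, round)   # hypothetical helper
  apply_sla_penalties(service, round)             # hypothetical helper
end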

Our availability checks only show up for the first day because we did some reëngineering of how they work Friday night. Running 84 or more checks sequentially is very time-consuming, so starting Saturday morning, we ran twenty at a time. In the refactoring to get this feature in, we kept the result of our own check in memory but never persisted it to the database.
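The batching itself is simple enough; a hedged sketch of the Saturday-morning version (only the batch size of twenty comes from the post, everything else is assumed):

checks.each_slice(20) do |batch|
  # Run twenty checks concurrently instead of one after another.
  batch.map { |check| Thread.new { check.run } }.each(&:join)
  # The 2013 bug: the results of our own self-checks stayed in memory here
  # and were never written back to the database.
end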

4. Replaying the game attempting to validate the published results

This is good, thanks for doing it!

However, seven teams finished with zero flags and others may have hovered around zero at times. Any team that possessed fewer than 19 flags at the time of an availability failure lost fewer than 19 flags, perhaps even none, so the formula can't be used in all cases and we must rely on a simulation of the game to compute the correct deductions.

Correct. In our internal discussions, we considered any team with fewer than about 50 flags to be "circling the drain." They might fluctuate up and down based on the incidental (not planned or sorted, but not really random) order in which services were scored and redemptions were processed. We're still looking at what to do with this in 2014.

One caveat though: we adjusted flags after the game due to a network error on our part:

The misconfigured network caused teams to be incorrectly throttled in their connections to the REST API that redeemed tokens for flag captures. This meant that some teams weren't able to redeem captured tokens due to the busy and hostile network environment. Since this was discovered on Sunday morning, after a long night of discovering new vulnerabilities, it was especially painful.

We have reprocessed those expired tokens based on logs and scorebot data, since they disproportionately and unfairly affected some teams more than others. They are included in the final results.

5. Conclusions

Thanks for analyzing the data, and for pointing out some of the shortcomings that make it difficult to audit or that make the game not match its documentation.

If you want to run your own analyses, check out our 2013 releases at https://blog.legitbs.net/p/2013-releases.html.