Wednesday, September 10, 2014

Making decisions with limited data

It is challenging but possible to make decisions with limited data. For example, take the rollout saga of public key pinning.

The first implementation of public key pinning included enforcing pinning on In retrospect, this was a bad decision because it broke the Addons Panel and generated pinning warnings 86% of the time. As it turns out, the pinset was missing some Verisign certificates used by, and the pinning enforcement on included subdomains. Having more data lets us avoid bad decisions.

To enable safer rollouts, we implemented a test mode for pinning. In test mode, pinning violations are counted but not enforced. With sufficient telemetry, it is possible to measure how badly sites would break without actually breaking the site.

Due to privacy restrictions in telemetry, we do not collect per-organization pinning violations except for Mozilla sites that are operationally critical to Firefox. This means that it is not possible to distinguish pinning violations for Google domains from Twitter domains, for example. I do not believe that collecting the aggregated number of pinning violations for sites on the Alexa top 10 list constitutes a privacy violation, but I look forward to the day when technologies such as RAPPOR make it easier to collect actionable data in a privacy-preserving way.

Fortunately for us, Chrome has already implemented pinning on many high-traffic sites. This is fantastic news, because it means we can import Chrome’s pin list in test mode with relatively high assurance that the pin list won’t break Firefox, since it is already in production in Chrome.

Given sufficient test mode telemetry, we can decide whether to enforce pins instead of just counting violations. If the pinning violation rate is sufficiently low, it is probably safe to promote the pinned domain from test mode to production mode.

Because the current implementation of pinning in Firefox relies on built-in static pinsets and we are unable to count violations per-pinset, it is important to track changes to the pinset file in the dashboard. Fortunately HighStock supports event markers which somewhat alleviates this problem, and David Keeler also contributed some tooltip code to roughly associate dates with Mercurial revisions. Armed with the timeseries of pinning violation rates, event markers for dates that we promoted organizations to production mode (or high-traffic organizations like Dropbox were added in test mode due to a new import from Chromium) we can see whether pinning is working or not.

Telemetry is useful for forensics, but in our case, it is not useful for catching problems as they occur. This limitation is due to several difficulties, which I hope will be overcome by more generalized, comprehensive SSL error-reporting and HPKP:
  • Because pinsets are static and built-in, there is sometimes a 24-hour lag between making a change to a pinset and reaching the next Nightly build.
  • Telemetry information is only sent back once per day, so we are looking at a 2-day delay between making a change and receiving any data back at all.
  • Telemetry dashboards (as accessible from telemetry.js and need about a day to aggregate, which adds another day.
  • Update uptake rates are slow. The median time to update Nightly is around 3 days, getting to 80% takes 10 days or longer.
Due to these latency issues, pinning violation rates take at least a week to stabilize. Thankfully, telemetry is on by default in all pre-release channels as of Firefox 31, which gives us a lot more confidence that the pinning violation rates are representative.

Despite all the caveats and limitations, using these simple tools we were able to successfully roll out pinning pretty much all sites that we’ve attempted (including AMO, our unlucky canary) as of Firefox 34 and look forward to expanding coverage.

Thanks for reading, and don’t forget to update your Nightly if you love Mozilla! :)