Rings of Power - Multiple Signoff in Balrog
I want to tell you about an important new Balrog feature that we're working on. But I also want to tell you about how we planned it, because I think that part is even more interesting that the project itself.
The Project
Balrog is Mozilla’s update server. It is responsible for deciding which updates to deliver for a given update request. Because updates deliver arbitrary code to users this means that a bad data in update server could result in orphaning users, or be used as an attack vector to infect users with malware. It is crucial that we make it more difficult for a single user’s account to make changes that affect large populations of users. Not only does this provide some footgun protection, but it safeguards our users from attacks if an account is compromise or an employee goes rogue.
While the current version of Balrog has a notion of permissions, most people effectively have carte-blanche access to one or more products. This means that an under-caffeinated Release Engineer could ship the wrong thing, or a single compromised account can begin an attack. Requiring multiple different accounts to sign off on any sensitive changes will protect us against both of these scenarios.
Multiple sign offs may also be used to enhance Balrog’s ability to support workflows that are more reflective of reality. For example, the Release Management team are the final gatekeepers for most products (ie: we can’t ship without their sign off), but they are usually not the people in the best place to propose changes to Rules. A multiple sign off system that supports different types of roles would allow some people to propose changes and others to sign off on them.
The Planning Process
Earlier this year I blogged about Streamlining the throttled rollout of Firefox releases, which was the largest Balrog projects to-date at the time. While we did some up-front planning for it, it took significantly longer to implement than I'd originally hoped. This isn't uncommon for software projects, but I was still very disappointed with the slow pace. One of the biggest reasons for this was discovering new edge cases or implementation difficulties after we were deep into coding. Often this would result in needing to rework code that was thought to be finished already, or require new non-trivial enhancements to be made. For Multiple Signoff, I wanted to do better. Instead of a few hours of brainstorming, we've taken a more formal approach with it, and I'd like to share both the process, and the plan we've come up with.
Setting Requirements
I really enjoy writing code. I find it intellectually challenging and fun. This quality is usually very useful, but I think it can be a hinderance when in the early stages of large projects, as I tend to jump straight to thinking about implementation before even knowing the full set of requirements. Recognizing this, I made a concious effort to purge implementation-related thoughts until we had a full set of requirements for Multiple Signoff reviewed by all stakeholders. Forcing myself not to go down the (more fun) path of thinking about code made me spend more time thinking about what we want and why we want it. All of this, particularly the early involvement of stakeholders, uncovered many hidden requirements and things that not everyone agreed. I believe that identifying them at such an early stage made them much easier to resolve, largely because there was no sunk-cost to consider.
Planning the Implementation
Once our full set of requirements were written, I was amazed at how obvious much of the implementation was. New objects and pieces of data stood out like neon signs, and I simply plucked them out of the requirements section. Most of the interactions between them came very naturally as well. I wrote some use cases that acted almost as unit tests for the implementation proposal, and identified a lot of edge cases and bugs in the first pass of the implementation proposal. In retrospect, I probably should've written the use cases at the same time as the requirments. Between all of that and another round of review from stakeholders, I have significantly more confidence that the proposed implementation will look like the actual implementation than I have with any other projects of similar size.
Bugs and Dependencies
Just like the implementation flowed easily from the requirements, the bugs and dependencies between them were easy to find by rereading the implementation proposal. In the end, I identified 18 distinct pieces of work, and filed them all as separate bugs. Because the dependencies were easy to identify, I was able to convince Bugzilla to give me a decent graph that helps identify the critical path, and which bugs are ready to be worked on.
But will it even help?
Overall, we probably spend a couple of people-weeks of active time on the requirements and implementation proposal. This isn't an overwhelming amount of time upfront, but it's enough that it's important to know if it's worthwhile next time. This is a question that can only be answered in retrospect. If the work goes faster and the implementation has less churn, I think it's safe to say that it was time well spent. Those are both things that are relatively easy to measure, so I hope to be able to measure this fairly objectively in the end.
The Plan
If you're interested in reading the full set of requirements, implementation plan, and use cases, I've published them here. A HUGE thanks goes out to Ritu, Nick, Varun, Hal, Johan, Justin, Julien, Mark, and everyone else who contributed to it.