Buildbot <-> Taskcluster Bridge

Why?

  • Allows for a graceful transition from Buildbot to Taskcluster
  • Lets us reimplement Schedulers as Task Graphs
  • Lets us port individual Builds to Taskcluster Tasks if/when it makes sense to
  • Build Promotion requires intermixed Task Graphs, and is one of the driving forces behind the Bridge

High Level Architecture

  • The Bridge acts as both a Taskcluster Worker and a Provisioner, but delegates everything to Buildbot
  • Three services: Taskcluster Listener, Buildbot Listener, Reflector
  • Small database specifically for the Bridge that associates BuildRequests with Task IDs and Run IDs
  • All multihomed
  • Dependent on Pulse, SchedulerDB, Self-serve
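The Bridge database described above can be pictured as a single mapping table. The following is a minimal sketch only: the table name, column names, and helper functions are hypothetical, and SQLite is used purely so the example runs standalone; the Bridge's real schema and database engine may differ.

```python
import sqlite3

# Hypothetical schema: names are illustrative assumptions,
# not the Bridge's actual database layout.
SCHEMA = """
CREATE TABLE tasks (
    buildrequest_id INTEGER PRIMARY KEY,
    task_id         TEXT    NOT NULL,
    run_id          INTEGER NOT NULL
)
"""


def open_db(path=":memory:"):
    """Open a database and create the mapping table."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_task(conn, buildrequest_id, task_id, run_id):
    """Associate a BuildRequest with a Task ID and Run ID."""
    conn.execute(
        "INSERT INTO tasks (buildrequest_id, task_id, run_id) VALUES (?, ?, ?)",
        (buildrequest_id, task_id, run_id),
    )
    conn.commit()


def lookup_by_buildrequest(conn, buildrequest_id):
    """Return (task_id, run_id) for a BuildRequest, or None if untracked."""
    return conn.execute(
        "SELECT task_id, run_id FROM tasks WHERE buildrequest_id = ?",
        (buildrequest_id,),
    ).fetchone()
```

Keying on the BuildRequest lets the Buildbot-facing services find the Task to claim or resolve; a lookup by Task ID would serve the Taskcluster-facing side in the same way.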

Taskcluster Listener

  • Reacts to events on Taskcluster Pulse exchanges
  • Creates BuildRequests in response to Tasks becoming pending
  • Cancels Builds and BuildRequests when Tasks are cancelled

Buildbot Listener

  • Reacts to events on Buildbot Pulse exchanges
  • Claims Tasks when Builds start
  • Attaches Buildbot Properties to Tasks as artifacts
  • Resolves Tasks when Builds complete

  Build Result   Taskcluster Resolution
  SUCCESS        Completed
  WARNINGS       Failed
  FAILURE        Failed
  EXCEPTION      Exception (reason: malformed-payload)
  RETRY          Exception (reason: malformed-payload)
  CANCELLED      Exception (reason: canceled)
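The result mapping above can be written down as a lookup table. This is a sketch rather than the Bridge's code: the names `RESOLUTIONS` and `resolution_for` are hypothetical, and the lowercase resolution strings are an assumed spelling of the states listed in the table.

```python
# Buildbot result name -> (Taskcluster resolution, exception reason or None).
# The entries mirror the table above; the identifiers are illustrative.
RESOLUTIONS = {
    "SUCCESS": ("completed", None),
    "WARNINGS": ("failed", None),
    "FAILURE": ("failed", None),
    "EXCEPTION": ("exception", "malformed-payload"),
    "RETRY": ("exception", "malformed-payload"),
    "CANCELLED": ("exception", "canceled"),
}


def resolution_for(build_result):
    """Return the (resolution, reason) pair for a Buildbot result name."""
    try:
        return RESOLUTIONS[build_result]
    except KeyError:
        raise ValueError("unknown Buildbot result: %r" % (build_result,))
```

Raising on an unknown result, instead of defaulting to a resolution, keeps a new Buildbot result code from being silently reported as a success or failure.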

Reflector

  • Runs on a timer
  • Periodically reclaims Tasks
  • Cancels Tasks when a BuildRequest is cancelled
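One pass of the Reflector might look roughly like the sketch below. The `TrackedTask` record and the `reclaim` and `cancel` callables are hypothetical stand-ins for the Bridge database rows and the Taskcluster Queue calls; the timer that would invoke this function repeatedly is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class TrackedTask:
    """Hypothetical view of one Bridge database row plus Buildbot state."""
    task_id: str
    run_id: int
    buildrequest_cancelled: bool
    build_running: bool


def reflect_once(tracked, reclaim, cancel):
    """Run a single Reflector pass over the tracked Tasks.

    reclaim(task_id, run_id) and cancel(task_id) are injected so the
    sketch does not assume any particular Taskcluster client library.
    """
    for task in tracked:
        if task.buildrequest_cancelled:
            # The BuildRequest was cancelled on the Buildbot side,
            # so mirror that onto the Task.
            cancel(task.task_id)
        elif task.build_running:
            # Keep the claim alive while the Build is still running.
            reclaim(task.task_id, task.run_id)
```

Tasks whose Build has not started and whose BuildRequest is still live are skipped, since there is no claim to renew yet.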

Some scenarios

Simple, successful build

  • Task is created
  • TCListener receives task-pending event, creates BuildRequest
  • Buildbot creates a Build
  • BuildbotListener receives build started event, claims the Task
  • Reflector reclaims the Task while the Build is running
  • Build completes successfully
  • BuildbotListener receives log uploaded event, reports success to Taskcluster

Build fails initially, succeeds upon retry

  • Task is created
  • TCListener receives task-pending event, creates BuildRequest
  • Buildbot creates a Build
  • BuildbotListener receives build started event, claims the Task
  • Build fails, marked as RETRY
  • BuildbotListener receives log uploaded event, reports exception to Taskcluster and calls rerunTask
  • Buildbot has already started a new Build
  • TCListener receives task-pending event, updates runId, does _not_ create a new BuildRequest
  • Build completes successfully
  • BuildbotListener receives log uploaded event, reports success to Taskcluster
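The retry scenario hinges on how a task-pending event is handled for a Task the Bridge already knows about. A minimal sketch, assuming an in-memory dict in place of the Bridge database and a hypothetical `create_buildrequest` callable:

```python
def handle_task_pending(known, task_id, run_id, create_buildrequest):
    """Handle a task-pending event.

    known maps task_id -> {"buildrequest_id": ..., "run_id": ...} and
    stands in for the Bridge database (an assumption for this sketch).
    Returns "created" or "updated" so callers can see which path ran.
    """
    entry = known.get(task_id)
    if entry is None:
        # First time this Task is seen: create a BuildRequest for it.
        buildrequest_id = create_buildrequest(task_id)
        known[task_id] = {"buildrequest_id": buildrequest_id, "run_id": run_id}
        return "created"
    # Rerun of a known Task: Buildbot has already started a new Build,
    # so only the runId is updated and no BuildRequest is created.
    entry["run_id"] = run_id
    return "updated"
```

For example, a first event for a Task with run 0 creates the BuildRequest, and a later event for the same Task with run 1 only bumps the stored runId.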

Task exceeds deadline before Build starts

  • Task is created
  • TCListener receives task-pending event, creates BuildRequest
  • Nothing happens
  • Task goes past deadline, Taskcluster resolves it as an exception (reason: deadline-exceeded)
  • TCListener receives task-exception event, cancels BuildRequest through Self-serve
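The last step of this scenario can be sketched as a small handler. As before, the `known` dict stands in for the Bridge database and `cancel_buildrequest` is a hypothetical wrapper around the Self-serve cancellation request; neither is taken from the Bridge's code.

```python
def handle_task_exception(known, task_id, cancel_buildrequest):
    """Handle a task-exception event for a Task that never started building.

    Returns True if a BuildRequest was found and a cancellation requested,
    False if the Task is not tracked by the Bridge.
    """
    entry = known.get(task_id)
    if entry is None:
        # Not one of ours (or already cleaned up): nothing to cancel.
        return False
    cancel_buildrequest(entry["buildrequest_id"])
    return True
```

Ignoring untracked Tasks matters because the exchange carries events for Tasks that were never routed through the Bridge.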

Roll-out plan

  1. Deploy in production, restricted to the Alder branch (done)
  2. Replace some Alder Schedulers with Task Graphs (in progress)
  3. Furious testing/verification on Alder
  4. ???

TODO

  • Speed up some queries (the Buildbot Listener sometimes falls behind)
  • Don't reclaim Tasks so often
  • Support Task creation for Builds/BuildRequests started in Buildbot ("reverse bridge")
  • Handle Pulse message acking better