Two financial services companies, Vanguard and Morgan Stanley, are taking a similar approach to their DevOps transition, based on a shared responsibility model for developer teams, supported by deployment and application security specialists.
Asset manager Vanguard and global bank Morgan Stanley are trying to carefully balance their software development and operational functions as part of a large-scale move to the cloud. Vanguard went through what it called an iterative transformation, moving from managing 2,000 of its own servers in 2015 to running primarily on Amazon Web Services (AWS). As a result, its 7,000 developers have moved from updating monolithic applications on a quarterly cycle to working on a collection of microservices developed and maintained by separate teams. They are now supported by a centralized platform team that provides standardized CI/CD pipelines and infrastructure for their code repository, with Site Reliability Engineering (SRE) oversight both centralized and integrated into these teams.
Morgan Stanley began its agile and DevOps transformation in 2018 and is more closely aligned with Azure. The initiative began with a three-year training effort to establish innovative DevOps and SRE skills within the bank’s 15,000-strong technical team. The program is built around what Gus Paul, executive director of application infrastructure at Morgan Stanley, identified as three main aims: “Accelerating software development and delivery; increasing the predictability, frequency and quality of changes; and changing the way technology is used,” he said in a presentation at the DevOps Enterprise Summit.
A successful change
Right now, Morgan Stanley has “agile teams with product owners, engineers with development and operations expertise, and they can target the infrastructure on premises or in the cloud,” said Trevor Brosnan, head of DevOps architecture and enterprise technology at Morgan Stanley. “My philosophy is that everyone has specializations; we all have a superpower in technology.”
Changing established build-and-run behavior will always be a challenge for organizations as large, complex, and conservative as Vanguard and Morgan Stanley. This hasn’t stopped them from finding a way to let developers move faster while maintaining the level of control expected of companies that manage billions, if not trillions, of dollars. These are businesses and cultures that do not tolerate risk or downtime.
Flexibility and risk management
Christina Yakomin is a Site Reliability Engineer at Vanguard, where she is part of a team that supports business-aligned developer teams. Her team defines and implements certain deployment controls by operating what they call “shared service platforms,” such as standardized CI/CD pipelines and cloud infrastructure platforms. This allows the risk-averse financial services company to ensure that certain controls are applied at the deployment stage, while reducing repetitive tasks across different development teams, “so that each team doesn’t have to reinvent the wheel,” she explains.
Christina Yakomin was clearly influenced by streaming giant Spotify’s playbook and the cloud-native concept of providing golden paths to developers. “We’ve found that because of the complexity of the controls needed to develop apps in this industry, we’re working hard to pave the common path with gold, while making sure it’s open to deviation,” she says. However, because of the strict level of control required, Yakomin said most developers tend to stay on the set path. If teams do deviate to another technology or technique, they immediately become responsible for it.
Morgan Stanley is modernizing its approach
Despite the similar structure, Morgan Stanley uses a different risk management approach when deploying into production. Previously, a developer had to switch between three separate Jira instances, file a change ticket, and go through 81 steps to get approval for one line of code. Now the bank has started adopting modern infrastructure-as-code and CI/CD practices to streamline this process across its various developer teams, with a core team tasked with overseeing the effort and encouraging others to follow.
In addition, the bank has developed an automated risk calculator, which assesses each change and assigns it a risk score. Changes below a particular threshold can be deployed using an automated pipeline; those above this threshold are subject to a more manual approval process.
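The threshold-based routing described above can be illustrated with a minimal sketch. This is not Morgan Stanley's calculator: the article does not describe its inputs, weights, or threshold, so every field, weight, and the `RISK_THRESHOLD` value below are hypothetical, chosen only to show the shape of the idea.

```python
from dataclasses import dataclass

# Hypothetical threshold; the article does not state the real value or scale.
RISK_THRESHOLD = 50


@dataclass
class Change:
    """A proposed production change. All fields are illustrative assumptions."""
    lines_changed: int
    touches_critical_system: bool
    has_automated_tests: bool


def risk_score(change: Change) -> int:
    """Return an illustrative 0-100 risk score (the weights are made up)."""
    # Larger changes score higher, capped at 40 points.
    score = min(change.lines_changed // 10, 40)
    if change.touches_critical_system:
        score += 40
    if not change.has_automated_tests:
        score += 20
    return score


def route(change: Change, threshold: int = RISK_THRESHOLD) -> str:
    """Route changes scoring below the threshold to the automated pipeline.

    Treating a score exactly at the threshold as manual is an assumption.
    """
    if risk_score(change) < threshold:
        return "automated_pipeline"
    return "manual_approval"
```

In this sketch, a one-line change to a non-critical, well-tested service scores 0 and goes straight through the pipeline, while a large untested change to a critical system scores 100 and is held for manual approval. The point of the design is that the approval path is decided by a repeatable calculation rather than by the same lengthy process for every change.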
The SRE security blanket
The implementation of the SRE security framework, both at the central operations level and within developer teams, has given Vanguard and Morgan Stanley the confidence to strike the right balance between developer speed and operational stability. However, this function opens up the possibility of separating concerns and again creating a disconnect between development and operations. “It’s a nuanced problem to solve,” Christina Yakomin said. “The introduction of SRE makes people feel that we are once again excluding ops from this role.”
Similarly, at Morgan Stanley, the establishment of SRE principles is “sometimes misunderstood as a rebranding of the ops team,” Trevor Brosnan said. Rather than separating development and operations, Christina Yakomin wants to encourage Vanguard developers and operations specialists to share responsibility for security and ensure that teams sharing platforms take full responsibility.
Centralization benefits engineers
Robbie Daitzman, head of intermediary technology platforms at Vanguard, said they overcame this problem by “creating a rallying cry to center certain platforms.” Centralization benefits engineers “by balancing the cognitive load and implementing a shared responsibility model,” he added.
Similarly, at Morgan Stanley, Trevor Brosnan considers that “SRE is about both development and operations, as well as the entire development cycle.” For example, the repetitive work that SRE sets out to eliminate is usually felt most by operations specialists, but developers are well positioned to automate these tedious tasks. Reliability, which is a major concern of SRE, is also the responsibility of developers, who are responsible for designing their applications “to be robust from the start,” Brosnan said.
Build resilient and monitored systems
Vanguard’s central SRE team is also responsible for ensuring that its various systems are resilient and monitored. Christina Yakomin and Robbie Daitzman both previously worked on Vanguard’s chaos engineering team. Chaos testing is already very important for validating the stability of enterprise systems. Vanguard has also moved toward providing visibility into its core systems through Amazon CloudWatch alerts, Honeycomb’s cloud-native monitoring, and the open-source OpenTelemetry standard for collecting metrics, logs, and traces.
“Observing SRE is a great thing for engineers to help understand if we’re in good shape or negatively affecting customers,” Robbie Daitzman said. In addition to these shared tracking metrics, Vanguard has developed its own set of dashboards, which each developer team can modify to suit its needs. However, this has not stopped teams from asking for the latest and greatest observability platform to layer on top of this infrastructure. “Every team wants different things and if we had all of that it would be very expensive,” Yakomin said.
Finding the right balance
Despite all these developments, Yakomin says that her team at Vanguard is still trying to find the right balance between efficiency and flexibility for its developers. Her plan is to make sure everyone gets the training they need to move to the shared responsibility model while still being able to work on delivery, backed by accurate, blameless post-incident reviews. Finally, she wants developer teams to be able to experiment safely and deviate from the golden path more easily when it is deemed useful.
For Trevor Brosnan, of Morgan Stanley, “we’re not really done”. He promises to continue to “focus on maintaining the momentum of the teams, to help make it a permanent part of the culture.”