How Atlassian moved Jira and Confluence users to Amazon Web Services, and what it learned along the way

A sample Jira screen. (Atlassian Image)

If your business is built around servicing software developers who know exactly what state-of-the-art tools should be capable of doing, at a certain point it’s time to bite the bullet and modernize your infrastructure.

Atlassian just completed a two-year-long migration to Amazon Web Services after hitting scaling issues with its old hosted approach, created and developed before the public cloud was a viable option for larger companies. Users of Atlassian’s Jira bug-management tool and Confluence, its collaboration software product, used to have their applications run on their own dedicated virtual machine on a server in Atlassian’s data centers, but around 2014 that system started to break down, said Mike Tria, head of infrastructure for the Sydney, Australia-based company, in a recent interview.

About 70 percent of Atlassian’s customers were running its software on Atlassian-hosted infrastructure (the rest ran it on their own servers), and as those numbers grew, Atlassian’s infrastructure began to strain under the weight of thousands of servers and tens of thousands of virtual machines, Tria said. Atlassian’s original hosted product was set up as a single-tenant service, which meant that each customer got a dedicated server for their instance of the software.

Mike Tria, head of infrastructure, Atlassian (Atlassian Photo)

That was standard practice back in 2010 when Atlassian first set up this system, but growing pains and the benefits of multitenant architectures have steadily changed the thinking around how to provision applications across big distributed systems. Public clouds are multitenant, which means that different customers can share the same servers in the name of efficiency.

So around the time Atlassian decided it needed to embrace the benefits of the public cloud in 2013 and 2014 (“we had to replace disks all the time,” Tria said) it also decided to rewrite Jira and Confluence in cloud-native fashion to take advantage of multitenancy and microservices, rather than simply “lifting and shifting” that code into AWS.

This required the company to develop several tools along the way to make sure customer data would not mix on a multitenant cloud, which is the basic fear of any CIO thinking about a move to cloud computing. Atlassian hopes to release some of those tools as open-source projects in the coming months.
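Atlassian has not described how its isolation tools work, so the following is only a generic sketch of the idea of keeping customer data separate on shared infrastructure. It assumes a single shared table with a tenant column and uses an in-memory SQLite database as a stand-in; the class name `TenantScopedStore`, the table layout and the tenant names are all hypothetical and are not taken from Atlassian's code.

```python
import sqlite3


class TenantScopedStore:
    """Hypothetical wrapper: every read and write is bound to one tenant ID."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str) -> None:
        self._conn = conn
        self._tenant_id = tenant_id

    def add_issue(self, title: str) -> None:
        # The tenant ID comes from the store, never from the caller.
        self._conn.execute(
            "INSERT INTO issues (tenant_id, title) VALUES (?, ?)",
            (self._tenant_id, title),
        )

    def list_issues(self) -> list[str]:
        # Every query is filtered by tenant, so rows from other
        # customers in the same table are never returned.
        rows = self._conn.execute(
            "SELECT title FROM issues WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        )
        return [title for (title,) in rows]


def make_shared_db() -> sqlite3.Connection:
    # One database shared by all tenants (in-memory stand-in for illustration).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE issues ("
        "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, title TEXT NOT NULL)"
    )
    return conn


shared = make_shared_db()
acme = TenantScopedStore(shared, "acme")
globex = TenantScopedStore(shared, "globex")
acme.add_issue("Login page returns 500")
globex.add_issue("Export to CSV is slow")
print(acme.list_issues())
print(globex.list_issues())
```

Both customers write to the same table, yet each store sees only its own rows. Production systems typically add further safeguards on top of this kind of scoping, such as database-level access policies.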

“(The migration) is definitely the largest engineering project that we’ve ever done,” Tria said.

Atlassian evaluated other cloud providers, including Microsoft Azure and Google Cloud Platform, but when it was first planning the project in 2013, it felt AWS offered the most proven platform. The company was also drawn by the breadth of services offered by AWS; of the nearly 100 separate services AWS offers, Atlassian is using all but three, Tria said.

The company completed the migration of its cloud customers from its own infrastructure to AWS in December, and in most cases the end user of that software had no idea, he said. That’s not to say, however, that everything went smoothly.

Under the old system, customers who wanted to search for something specific across all of their bug filing and tracking systems had to re-index all that data with every query, which took a lot of time. So Atlassian decided that it would move from a search-engine style interface for those queries to the Postgres database, which would be much faster.

However, there was a catch: Postgres queries returned different answers than the old system, which threw the team into a frenzy trying to figure out how to replicate the old results under the new system. “We probably had 30 or 40 developers banging away on keyboards just trying to get it done,” Tria said.

As it turned out, however, the Postgres queries actually produced better results than the old system. Still, it took quite a bit of time to realize that, and because the team did not want “to replace their reindexing pain with other pain, it took longer than we had thought,” he said.

With the move to a multitenant architecture, Atlassian was also forced to discard years’ worth of tricks and tactics for squeezing performance out of a single-tenant architecture, he said. Luckily, some of the other products in the Atlassian family, such as Trello and Bitbucket, were built for the cloud era, and their teams were able to share some of their knowledge with the Jira and Confluence teams, Tria said.

And just last week, the company got a rude lesson in the benefits of redundancy, an issue it thought it had tackled with this move, thanks to what Tria called a “black-swan event” that he said took out all the availability zones in the U.S. East region run by AWS. Atlassian thought it had planned for such an event by using multiple availability zones for its networking connections to AWS, but it was one of the more prominent companies affected by the weather-related outage, which also took out a fair amount of Capital One’s services as well as Amazon’s own Alexa service.

However, that incident is still an advertisement for the public cloud, Tria said, because it would have taken Atlassian far longer to recover from such an incident running a single-tenant infrastructure managed by its own people.

via GeekWire