09 September 2016
A typical cloud operation undergoes thousands of changes each month. You want to manage these day-to-day tasks in your company's infrastructure without having a human being operate, touch or be the bottleneck for any functions. You don't want any human error that can bring down a system, and you can achieve this with zero touch automation (ZTA).

What is ZTA? It's the ability to automate end-to-end processes associated with operating a private, public or hybrid cloud infrastructure at scale. These processes range from simple, like rebooting a server or restarting an application node, to complex, like elastically growing or shrinking capacity or performing disaster recovery. We have yet to see someone accurately perform the same task a thousand times. Couple that with the amount of multitasking in a typical cloud infrastructure operations team, and you have a recipe for disaster -- read: customer outage!

With ZTA, a business can achieve increased productivity, reduced costs, improved predictability and increased agility that all result in increased service health and availability -- in other words, improved customer satisfaction. Automation lets operators focus on developing creative solutions rather than shepherding mind-numbing processes for the umpteenth time. The staff is challenged also, as Level 1 engineers start doing work that Level 2 engineers would do and that phenomenon keeps repeating itself up the chain.

Implementing ZTA is a long process though. We followed these steps while building the end-to-end ZTA for common operations that power ServiceNow's enterprise-scale, highly-available cloud:

Step 1: Get buy-in from key stakeholders
For any successful ZTA effort, you need senior management and the engineers and developers on the frontline to buy into the new process. Socializing the goals, benefits and approach of ZTA in the operational fabric will have a disproportionate impact on ZTA's long-term success in your organization.

Step 2: Identify and prioritize processes to automate
For the cloud infrastructure to run, there's usually a library of standard operating procedures (SOPs) that outline how to execute a variety of tasks -- you'll decide what to automate from what's in this library. As you go through the decision-making process, also analyze the cost versus ROI and the impact on the operations team and the customer. Consider the number of times each SOP is executed, the critical nature of the SOP and the tolerance for an SOP to go wrong.

This means you have to know how often each task is performed and how complex it is. An SOP that's performed 100 times a month and takes two hours each time, for example, may take only a few seconds or minutes to execute once it's automated. If you have a critical SOP with 25 steps, a failure to follow the steps may cause considerable customer dissatisfaction or loss of productivity. Because of its complexity and criticality, this task would still qualify for ZTA regardless of how often it's performed.
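One way to make this prioritization concrete is to score each SOP by the operator hours it consumes and by its criticality. The sketch below is only an illustration: the Sop fields, the 40-hour and 20-step thresholds and the ranking rule are assumptions, not figures from our process.

```python
from dataclasses import dataclass


@dataclass
class Sop:
    """Hypothetical record describing one standard operating procedure."""
    name: str
    runs_per_month: int
    hours_per_run: float
    steps: int
    critical: bool


def monthly_hours(sop: Sop) -> float:
    """Operator hours spent on this SOP each month when it is run by hand."""
    return sop.runs_per_month * sop.hours_per_run


def qualifies_for_zta(sop: Sop, hours_threshold: float = 40.0,
                      step_threshold: int = 20) -> bool:
    """Assumed rule: automate if the SOP is expensive, or critical and complex."""
    expensive = monthly_hours(sop) >= hours_threshold
    risky = sop.critical and sop.steps >= step_threshold
    return expensive or risky


def prioritize(sops: list[Sop]) -> list[Sop]:
    """Rank qualifying SOPs: critical ones first, then by hours consumed."""
    candidates = [s for s in sops if qualifies_for_zta(s)]
    return sorted(candidates,
                  key=lambda s: (s.critical, monthly_hours(s)),
                  reverse=True)
```

With these assumed thresholds, a non-critical restart SOP run 100 times a month at two hours each qualifies on cost alone, while a critical 25-step SOP qualifies even if it runs once a month.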

Step 3: Identify users who execute these processes
Process users know what isn't outlined in the SOP. They also know the failure points or edge cases. This information will help you, the automation author, make any design or implementation decisions.

ZTA solutions have to account for the long-tail problem too. During my user interviews, I learned that one customer had organized their VPN gateways in a highly nontraditional manner. The VPN gateway blocked traffic, which would have resulted in customers not being able to connect to the database. While our automation had accounted for the mainstream scenarios and many edge cases, without this one user interview, we would have had an outage.

Step 4: Identify the environment for your automation 
Designing a quality ZTA requires knowing what's in your environment. That is, a proper inventory of all infrastructure components (e.g., list of servers, power distribution units, network gear), as well as the exact location, role and owner of each device, and who can access or modify it.

The ZTA design needs to account for how the operational environment works and include these nuances in the automation lifecycle. "Change freeze" periods instituted during high demand, like holiday seasons for retailers, for example, and any other type of downtime or reduced functionality could impact a customer's bottom line. This information is collected during interviews with the process users who are executing the SOP and codified as part of the automation.
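A change-freeze rule is one of the easier nuances to codify. The following is a minimal sketch, assuming freeze windows are kept as date ranges per customer; the customer name, the dates and the FREEZE_WINDOWS structure are made up for illustration.

```python
from datetime import date

# Hypothetical freeze calendar: customer -> list of (start, end) dates, inclusive.
FREEZE_WINDOWS = {
    "retail-co": [(date(2016, 11, 20), date(2017, 1, 2))],
}


def in_change_freeze(customer: str, day: date,
                     windows: dict = FREEZE_WINDOWS) -> bool:
    """Return True if the given day falls inside any freeze window for the customer."""
    return any(start <= day <= end for start, end in windows.get(customer, []))
```

An automation could call in_change_freeze before touching a customer's systems and defer the change if it returns True.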

Step 5: Identify the automation components
The configuration management database (CMDB) stores all your infrastructure elements and the relationships between those elements. You also want dynamic, reasonably near-real-time information about each infrastructure element that your automation can use. You want to know the location of everything today, not three weeks ago. With a highly dependable CMDB, the DevOps team can work with the network operations center (NOC) to automate specific SOPs and keep the infrastructure information current.
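To illustrate what "current" could mean in practice, here is a small sketch that flags inventory records that have not been refreshed recently. The record layout (a dict with name and last_seen) and the one-hour limit are assumptions; a real CMDB will expose its own schema and API.

```python
from datetime import datetime, timedelta


def stale_items(records: list[dict], now: datetime,
                max_age: timedelta = timedelta(hours=1)) -> list[str]:
    """Return the names of records whose last_seen timestamp is older than max_age."""
    return [r["name"] for r in records if now - r["last_seen"] > max_age]
```

An automation could refuse to act on any element that shows up in this list until its record has been refreshed.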

Step 6: Automate, Automate, Automate
You want the system up and running with little-to-no downtime, and you need a DevOps team with software development and infrastructure expertise to do this. 

The automation needs to be designed to account for failure, network latency, unreliable networks and a dynamic topology. There are thresholds within the system that alert people when to perform tasks, such as when a server disk is close to capacity. These thresholds can also trigger automated tasks.
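As a rough sketch of a threshold that can either alert a person or trigger an automated task, consider the disk example. The 80 and 90 percent levels and the action names are assumptions chosen for illustration.

```python
import shutil


def disk_used_percent(path: str = "/") -> float:
    """Read current disk usage for a path as a percentage of total capacity."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def evaluate_disk(used_pct: float, warn_at: float = 80.0,
                  act_at: float = 90.0) -> str:
    """Map a usage reading to an action name (hypothetical action names)."""
    if used_pct >= act_at:
        return "run_cleanup"       # trigger the automated task
    if used_pct >= warn_at:
        return "alert_operator"    # notify a person
    return "ok"
```

Keeping the decision in a pure function like evaluate_disk makes the thresholds easy to test without needing a nearly full disk.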

Every automation requires a pre-flight check, just like what a pilot does prior to take off. More importantly, post-flight checks can ensure that the automation did what it was supposed to do -- if it didn't, then roll back, investigate, fix the issue and begin again.
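The pre-flight, post-flight and rollback flow can be sketched as a small wrapper. This is an illustrative outline, not our production code: the function names and the returned status strings are assumptions, and investigating and fixing the issue remain human steps outside the sketch.

```python
from typing import Callable


def run_automation(preflight: Callable[[], bool],
                   action: Callable[[], None],
                   postflight: Callable[[], bool],
                   rollback: Callable[[], None]) -> str:
    """Run an action only if pre-flight passes; roll back if post-flight fails."""
    if not preflight():
        return "skipped"           # system not in a good state, back off
    try:
        action()
        ok = postflight()          # did the automation do what it should?
    except Exception:
        ok = False                 # treat any error as a failed run
    if ok:
        return "succeeded"
    rollback()
    return "rolled_back"
```

A "skipped" or "rolled_back" result is the cue for someone to investigate before the automation runs again.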

Step 7: Rigorously test your automation
Murphy's law applies especially well in the cloud world. What can you do about it? Test for scale-related stress scenarios, performance, edge cases and race conditions. There are many ways to test a system, like A-B or split testing within a subset of customers or introducing random (synthetic) failures to validate how well the automation system works under failure scenarios -- the more thorough your testing is, the better your system will run.
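Introducing synthetic failures can be as simple as wrapping a step so that it fails at a chosen rate. The sketch below assumes a hypothetical retry helper as the automation under test; the names InjectedFault, with_faults and retry are invented for this example.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Raised in place of a real failure during testing."""


def with_faults(step: Callable, failure_rate: float,
                rng: random.Random) -> Callable:
    """Wrap a step so it raises InjectedFault with the given probability."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("synthetic failure")
        return step(*args, **kwargs)
    return wrapped


def retry(step: Callable, attempts: int = 3):
    """Hypothetical automation under test: retry a step a fixed number of times."""
    last_error = None
    for _ in range(attempts):
        try:
            return step()
        except InjectedFault as err:
            last_error = err
    raise last_error
```

Passing a seeded random.Random keeps a failing test run reproducible, which matters when you are chasing a race condition or an edge case.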

Step 8: Perform User Acceptance Testing
Once the ZTA has been implemented and tested, schedule a user acceptance session with those who provided the initial requirements for the automation. A good user acceptance testing session can help you update the automations quickly and with limited overhead. Establish a continuous feedback loop to track all changes in a cloud infrastructure so you know when to update the automation.

Step 9: Release to production
Following the right change management procedures is important: write an SOP, and articulate what a successful deployment looks like and what the rollback procedures are, for example. You also want only those with the right authorization and role to be able to execute automations.

Use a staggered deployment model where the automation is released into one data center (or pod or subset of customers). You need to observe and evaluate the automation based on what you expect success to look like. Once the automation functions for an appropriate amount of soak time without encountering issues, gradually release it to other data centers ... remember to be quick, but don't hurry.
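A staggered release can be outlined as a loop that deploys to one target, waits out the soak period, and stops at the first unhealthy result. In this sketch the deploy and healthy_after_soak callables are placeholders for whatever your platform provides; they are assumptions, not a real API.

```python
from typing import Callable, Optional


def staggered_rollout(targets: list[str],
                      deploy: Callable[[str], None],
                      healthy_after_soak: Callable[[str], bool]
                      ) -> tuple[list[str], Optional[str]]:
    """Release to one target at a time; halt at the first target that fails its soak.

    Returns the targets released successfully and the target that halted
    the rollout (None if every target passed).
    """
    released = []
    for target in targets:
        deploy(target)
        if not healthy_after_soak(target):
            return released, target    # stop here; later targets stay untouched
        released.append(target)
    return released, None
```

Ordering the targets from lowest to highest risk means the first data center, pod or customer subset absorbs any surprises.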

Step 10: Evaluate the results of your automation and iterate
Your job isn't finished once the automation is in production. A good DevOps person will continue to monitor the automation for failures. They will also remain in constant touch with the operations team so that the assumptions used for the automation continue to hold true.

Each iteration increases reliability since you'll address failures and performance issues and add features to the cloud infrastructure. In one automation, we closely monitored the data integrity, but when we executed the automation in production, we couldn't validate the integrity of the data during one particular outage and had to revert to the backup database. We were able to discover this behavior, and then enhance the audits, only because we paid attention to how the automation functioned in the production environment. We don't fire and forget, but continue to iterate.

ZTA doesn't mean there's no human touch. You may need approval from a change approval board (CAB) or supervisor prior to executing certain automated operations.

Predicting what can go wrong
To anticipate when something is going to break, you need to capture the desired state of the system. You want to know your best-case scenario, and you can do this once you know what's in your system. You can also include thresholds and triggers that address when you need to take action to bring your system back to the desired state.

Let's say a database that's supposed to be read_only was accidentally changed to read_write. In one type of high-availability configuration, a primary database server copies every transaction onto a standby database server that's set to a read_only state in case the primary goes bad. If this standby becomes read_write, the configuration is wrong, which could result in split-brain scenarios where the primary and standby databases become inconsistent, potentially causing data loss. You want to make sure both servers have the correct state and that the data matches. Comparing the actual state against the desired state lets you proactively discover such issues and fix them.
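A desired-state comparison for this example can be very small. The sketch below assumes the desired and actual modes are available as plain dictionaries keyed by host name; the host names are invented, and how you collect the actual modes depends on your database and tooling.

```python
# Hypothetical desired state for a primary/standby pair.
DESIRED = {"db-primary": "read_write", "db-standby": "read_only"}


def find_drift(desired: dict[str, str],
               actual: dict[str, str]) -> dict[str, tuple]:
    """Return {host: (desired_mode, actual_mode)} for every host that has drifted."""
    return {
        host: (want, actual.get(host))
        for host, want in desired.items()
        if actual.get(host) != want
    }
```

An empty result means the pair matches the desired state; anything else is a trigger to alert or to run a corrective automation.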

One of the first steps is to run your pre-flight checks against your actual systems to ensure they're in good shape and to back off if you see a gap. This prevents your automation from trying to modify a system that may be in a bad state, which would likely cause a failure.

So, why do companies believe that ZTA is hard to do?
Getting management buy-in and strict adherence to the process is difficult. Then there's the perception that this problem is very hard to solve. As Matt Damon said in "The Martian," "You solve one problem ... and you solve the next one ... and then the next. And if you solve enough problems, you get to come home." Or in our case, you get to do ZTA for most of your processes. Once the stakeholders are on board, the rest is a matter of setting up the proper processes and audits, as well as using the right tools/platform to inventory the system and keep it current.

What's the real benefit of ZTA?
Meeting your business's goals is easier when you have better management and control of what's happening in your infrastructure. There are intangible benefits as well, like knowing who's actually using your infrastructure and letting operators focus their energy on creative solutions. When a business actively manages its infrastructure, it gets happier customers, who have a direct impact on the bottom line.
 