Managing numerous customers with unique characteristics and tens of thousands of systems at scale can be challenging. Here, I want to pull back the curtain on some of the automation and tools that 2nd Watch develops to solve these problems. Below I will outline our approach to this problem and its three main components: Collect, Model, and React.
Collect: The first problem facing us is an overwhelming flood of data. We have CloudWatch metrics, CloudTrail events, custom monitoring information, service requests, incidents, tags, users, accounts, subscriptions, alerts, etc. The data is all structured differently, tells us different stories, and is collected at an unrelenting pace. We need to identify all the sources, collect the data, and store it in a central place so we can begin to consume it and make correlations between various events.
Most of the data described above can be gathered directly from the AWS and Azure APIs, while the rest may need to be ingested by an agent or by custom scripts. We also need to make sure we have a consistent core set of data being brought in for each of our customers, while also expanding that to include specialized data that only certain customers may have. All the data is gathered and sent to our Splunk indexers. We build an index for every customer to ensure that data stays segregated and secure.
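To make the per-customer segregation concrete, here is a minimal sketch of how collected events might be tagged for routing into a dedicated index before being forwarded. The index naming scheme, sourcetype values, and event shape are illustrative assumptions, not our actual implementation.

```python
# Hypothetical routing step: wrap a raw event with metadata so it lands
# in the customer's dedicated Splunk index. Names here are assumptions.

def tag_for_index(customer: str, source: str, event: dict) -> dict:
    """Attach routing metadata so each customer's data stays segregated."""
    return {
        "index": f"customer_{customer.lower()}",  # one index per customer
        "sourcetype": source,                     # e.g. cloudtrail, cloudwatch
        "event": event,
    }

payload = tag_for_index("Acme", "cloudtrail", {"eventName": "RunInstances"})
print(payload["index"])  # customer_acme
```

The key design point is that the routing decision happens at collection time, so no downstream query can accidentally mix customers' data.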
Model: Next we need to present the data in a useful way. The modeling of the data can vary depending on who is using it or how it will be consumed. A dashboard with a quick look at several important metrics can help an engineer see the big picture. Seeing this data daily or throughout the day makes anomalies very apparent. This is especially helpful because gathering and organizing the data at scale is time consuming, and without a dashboard it might reasonably only happen during periodic audits.
Modeling the data in Splunk allows for a low-overhead view with up-to-date data so the engineer can do more important things. A great example of this is provisioned resources by region. If the engineer looks at the data on a regular basis, they will quickly notice when the number of provisioned resources has drastically changed. A 20% increase in the number of EC2 resources could mean several things: perhaps the customer is doing a large deployment, or maybe Justin accidentally put his AWS access key and secret key on GitHub (again).
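The anomaly check itself is simple. Here is a hedged sketch of flagging a region whose provisioned EC2 count deviates more than 20% from a recent baseline; the counts below are made-up sample data standing in for what the collected metrics would provide.

```python
# Illustrative anomaly check: flag regions whose current resource count
# deviates from the baseline by more than a threshold (20% by default).

def is_anomalous(baseline: float, current: int, threshold: float = 0.20) -> bool:
    """Return True if current deviates from baseline by more than threshold."""
    if baseline == 0:
        return current > 0
    return abs(current - baseline) / baseline > threshold

# (baseline, today) pairs -- sample numbers, not real customer data
counts = {"us-east-1": (100, 125), "us-west-2": (40, 42)}
for region, (baseline, today) in counts.items():
    if is_anomalous(baseline, today):
        print(f"{region}: {baseline} -> {today}, investigate")
```

In practice the baseline would be a rolling average drawn from the indexed data, but the comparison logic is the same.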
We provide our customers with regular reports and reviews of their cloud environments, drawing on the data collected and modeled in this tool. Historical data trended over a month, quarter, and year can help you ask questions or tell a story. It can help you forecast your business, or the number of engineers needed to support it. We recently used the historical trending data to show progress of a large project that included waste removal and a resource tagging overhaul for a customer. Not only were we able to show progress throughout the project, but we used that same view to ensure that waste did not creep back up and that the new tagging standards were being applied going forward.
React: Finally, it’s time to act on the data we collected and modeled. Using Splunk alerts, we can apply conditional logic to the data and act on the patterns we find. From Splunk we can call our ticketing system’s API and create a new incident for an engineer to investigate concerning trends, or notify the customer of a potential security risk. We can also call our own APIs that trigger remediation workflows. A few common scenarios are encrypting unencrypted S3 buckets, deleting old snapshots, restarting failed backup jobs, and requesting cloud provider limit increases.
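As an example of one of those remediation workflows, here is a minimal sketch of the selection logic for deleting old snapshots. In practice the snapshot list would come from the cloud provider's API and the deletion would be another API call; the data and retention window here are illustrative.

```python
# Hypothetical snapshot-cleanup selection: find snapshots older than the
# retention window. Sample data stands in for the provider's API response.
from datetime import datetime, timedelta, timezone

def stale_snapshots(snapshots, retention_days=30, now=None):
    """Return IDs of snapshots whose start time is past the retention window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [s["id"] for s in snapshots if s["start_time"] < cutoff]

now = datetime(2019, 1, 1, tzinfo=timezone.utc)
snaps = [
    {"id": "snap-old", "start_time": datetime(2018, 11, 1, tzinfo=timezone.utc)},
    {"id": "snap-new", "start_time": datetime(2018, 12, 20, tzinfo=timezone.utc)},
]
print(stale_snapshots(snaps, retention_days=30, now=now))  # ['snap-old']
```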
Because we have several independent data sources providing information, we can also correlate events and have more advanced conditional logic. If we see that a server is failing status checks, we can also look to see if it recently changed instance families or if it has all the appropriate drivers. This data can be included in the incident and available for the engineer to review without having to check it themselves.
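The correlation step described above can be sketched as a small enrichment function: a failing status check is combined with related change events so the engineer sees everything in one ticket. Both data sources are stubbed out here with sample records; the field names are assumptions for illustration.

```python
# Illustrative event correlation: enrich a failed-status-check incident
# with recent instance changes from an independent data source.

def build_incident(instance_id, status_checks, change_events):
    """Combine a failed status check with related change events, or return
    None if the instance is healthy."""
    failed = [c for c in status_checks
              if c["instance"] == instance_id and not c["ok"]]
    if not failed:
        return None
    related = [e for e in change_events
               if e["instance"] == instance_id
               and e["type"] == "instance_family_change"]
    return {
        "instance": instance_id,
        "summary": "Instance failing status checks",
        "context": related,  # e.g. a recent family change worth reviewing
    }

checks = [{"instance": "i-123", "ok": False}]
changes = [{"instance": "i-123", "type": "instance_family_change",
            "from": "t2", "to": "t3"}]
incident = build_incident("i-123", checks, changes)
print(incident["context"][0]["to"])  # t3
```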
The entire premise of this idea and the solution it outlines is about efficiency and using data and automation to make quicker and smarter decisions. Operating and maintaining systems at scale brings forth numerous challenges and if you are unable to efficiently accommodate the vast amount of information coming at you, you will spend a lot of energy just trying to keep your head above water.
For help getting started in automating your systems, contact us.
-Kenneth Weinreich, Managed Cloud Operations
For most students, one of the most stressful experiences of their educational career is exam day. Exams are a semi-public declaration of your ability to learn, absorb, and regurgitate the curriculum, and while the rewards for passing are rather mundane, the ramifications of failure are tremendous. My anecdotal educational experience indicates that exam success is primarily due to preparation, with a fair bit of luck thrown in. If you were like me in school, your exam preparation plan consisted mostly of cramming, with a heavy reliance on luck that the hours spent jamming material into your brain would cover at least 70% of the exam contents.
After I left my education career behind me and started down a new path in business technology, I was rather dismayed to find out that the anxiety of testing and exams continued, but in the form of audits! So much for the “we will never use this stuff in real life” refrain that we students expressed in Calculus 3 class – exams and tests continue even when you’re all grown up. Oddly enough, the recipe for audit success was remarkably similar: a heavy dose of preparation with a fair bit of luck thrown in. Additionally, it seemed that many businesses also adhered to my cram-for-the-exam pattern. Despite full knowledge and disclosure of the due dates and subject material, audit preparation largely consisted of ignoring it until the last minute, followed by a flurry of activity, stress, anxiety, and panic, with a fair bit of hoping and wishing-upon-a-star that the auditors won’t dig too deeply. There must be a better way to be prepared and execute (hint: there is)!
There are some key differences between school exams and business audits:
- Audits are open-book: the subject matter details and success criteria are well-defined and well-known to everyone
- Audits have subject matter and success criteria that remain largely unchanged from one audit to the next
Given these differences, it would seem logical that preparation for audits should be easy. We know exactly what the audit will cover, we know when it will happen, and we know what is required to pass. If only it were that easy. Why, then, do we still cram for the exam and wait until the last minute? I think it comes down to these things:
- Audits are important, just like everything else
- The scope of the material seems too large
- Our business memory is short
Let’s look at that last one first. Audits tend to be infrequent, often with months or years going by before they come around again. Like exam cramming, it seems that our main goal is to get over the finish line. Once we are over that finish line, we tend to forget all about what we learned and did, and our focus turns to other things. Additionally, the last-minute cram seems to be the only way to deal with the task at hand, given the first two points above. Just get it done, and hope.
What if our annual audits were more frequent, like once a week? The method of cramming is not sustainable or realistic. How could we possibly achieve this?
Iteration is, by definition, a repetitive process that intends to produce a series of outcomes. Both simple and complex problems can often be attacked and solved by iteration:
- Painting a dark-colored room in a lighter color
- Digging a hole with a shovel
- Building a suspension bridge
- Attempting to crack an encrypted string
- Achieving a defined compliance level in complex IT systems
Note that last one: achieving audit compliance within your IT ecosystem can be an iterative process, and it doesn’t have to be compressed into the 5 days before the audit is due.
The iteration (repetitive process) is simple:
- Identify and recognize what is out of compliance
- Notify and remediate to meet the requirement
- Analyze and report on the results, then repeat
The scope and execution of the iteration is where things tend to break down. The key to successful iterations starts with defining and setting realistic goals. When in doubt, keep the goals small! The idea here is being able to achieve the goal repeatedly and quickly, with the ability to refine the process to improve the results.
We need to clearly define what we are trying to achieve. Start big-picture and then drill down into something much smaller and achievable. This will accomplish two things: 1) build some confidence that we can do this, and 2) using what we do here, we can “drill up” and tackle a similar problem using the same pattern. Here is a basic example of starting big-picture and drilling down to an achievable goal: from “meet the audit’s access-control requirements,” drill down to “monitor network user authentication,” and further down to “monitor and act on failed user logons.”
Identify and Recognize
Given that we are going to monitor failed user logons, we need a way to do this. There are manual ways to achieve this, but, given that we will be doing this over and over, it’s obvious that this needs to be automated. Here is where tooling comes into play. Spend some time identifying tools that can help with log aggregation and management, and then find a way to automate the monitoring of failed network user authentication logs.
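As a toy illustration of what the automated piece might look like, here is a minimal sketch that parses auth-log lines for failed logons and surfaces accounts over a threshold. The log format, regex, and threshold are illustrative assumptions; a real deployment would use the log-aggregation tooling mentioned above.

```python
# Hypothetical failed-logon monitor: count failures per user from sshd-style
# log lines and flag accounts at or above a threshold.
from collections import Counter
import re

FAILED = re.compile(r"Failed password for (?:invalid user )?(\S+)")

def flag_users(log_lines, threshold=3):
    """Return users whose failed-logon count meets or exceeds the threshold."""
    counts = Counter()
    for line in log_lines:
        m = FAILED.search(line)
        if m:
            counts[m.group(1)] += 1
    return {user: n for user, n in counts.items() if n >= threshold}

sample = [
    "sshd[101]: Failed password for root from 10.0.0.5 port 22 ssh2",
    "sshd[102]: Failed password for root from 10.0.0.5 port 22 ssh2",
    "sshd[103]: Failed password for root from 10.0.0.5 port 22 ssh2",
    "sshd[104]: Failed password for invalid user guest from 10.0.0.9 port 22 ssh2",
]
print(flag_users(sample))  # {'root': 3}
```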
Notify and Remediate
Now that we have an automated way to aggregate and manage failed network user authentication logs, we need to look at our (small and manageable) defined goal and perform the necessary notifications and remediations to meet the requirement. Again, this will need to be repeated over and over, so spend some time identifying automated tools that can help with this process.
Analyze and Report
Now that we are meeting the notification and remediation requirements in a repeatable and automated fashion, we need to analyze and report on the effectiveness of our remedy and, based on the analysis, make necessary improvements to the process, and then repeat!
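The three steps above can be sketched as a simple loop, with each analysis feeding the next identification pass. The step functions here are toy placeholders for whatever tooling you adopt; only the control flow is the point.

```python
# Sketch of the compliance iteration: identify -> remediate -> analyze, repeat.

def run_compliance_iteration(identify, remediate, analyze, cycles=3):
    """Run the loop a fixed number of cycles, keeping an analysis history
    as the evidence trail for the next audit."""
    findings = None
    history = []
    for _ in range(cycles):
        findings = identify(findings)      # what is out of compliance now?
        remediate(findings)                # notify and fix it
        history.append(analyze(findings))  # measure effectiveness, keep evidence
    return history

# Toy placeholders: each cycle halves the number of open findings.
open_findings = [8]
report = run_compliance_iteration(
    identify=lambda prev: open_findings[0],
    remediate=lambda n: open_findings.__setitem__(0, n // 2),
    analyze=lambda n: n,
)
print(report)  # [8, 4, 2]
```

The history list is what turns cramming into continuous compliance: it is the proof, kept up to date automatically, that the requirement is being met.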
Now that we have one iterative and automated process in place that meets and remedies an audit requirement, there is one less thing that needs to be addressed and handled when the audit comes around. We know that this one requirement is satisfied, and we have the process, analysis, and reports to prove it. No more cramming for this particular compliance requirement, we are now handling it continuously.
Now, what about the other 1,000 audit requirements? As the saying goes, “How do you eat an elephant (or a Buick)? One bite at a time.” You need the courage to start, and from there every bite gets you one step closer to the goal.
Keys to achieving Continuous Compliance include:
- You must start somewhere. Pick something!
- Start big-picture, then drill down to something small and achievable.
- Automation is a must!
For help getting started on the road to continuous compliance, contact us.
-Jonathan Eropkin, Cloud Consultant
Many people are looking to take advantage of containers to isolate their workloads on a single system. Unlike traditional hypervisor-based virtualization, where each virtual machine runs its own full operating system and packages, containers allow you to segment off multiple applications, each with its own set of processes, on the same instance.
Let’s walk through some grievances that many of us have faced at one time or another in our IT organizations:
Say, for example, your development team is setting up a web application. They want to set up a traditional three-tier system with app, database, and web servers. They notice there is a lot of support in the open source community for their app when it is run on Ubuntu Trusty (Ubuntu 14.04 LTS) and later. They’ve developed the app in their local sandbox with an Ubuntu image they downloaded; however, their company is a RedHat shop.
Now, depending on the type of environment you’re in, chances are you’ll have to wait for the admins to provision an environment for you. This often entails (but is not limited to) spinning up an instance, reviewing the most stable version of the OS, creating a new hardened AMI, adding it to Packer, figuring out which configs to manage, and refactoring provisioning scripts to utilize aptitude and Ubuntu’s directory structure (e.g., Debian has over 50K packages to choose from and manage). In addition to that, the most stable version of Ubuntu is missing some newer packages that you’ve tested in your sandbox that need to be pulled from source or another repository. At this point, the developers are writing configuration runbooks to support the app while the admin gets up to speed with the OS (not significant but time-consuming nonetheless).
You can see my point here. A significant amount of overhead has been introduced, and it’s stagnating development. And think about the poor sysadmins. They have other environments that they need to secure, networking spaces to manage, operations to improve, and existing production stacks they have to monitor and support while getting bogged down supporting this app that is still in the very early stages of development. This could mean that mission-critical apps are potentially losing visibility and application modernization is stagnating. Nobody wins in this scenario.
Now let us revisit the same scenario with containers:
I was able to run my Jenkins build server and an NGINX web proxy on a hardened CentOS 7 AMI, provided by the Systems Engineers, with Docker installed. From there I executed a docker pull command pointed at our local repository and deployed two Docker images with Debian as the underlying OS.
$ docker pull my.docker-repo.com:4443/jenkins
$ docker pull my.docker-repo.com:4443/nginx
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7478020aef37 my.docker-repo.com:4443/jenkins/jenkins:lts "/sbin/tini -- /us…" 16 minutes ago Up 16 minutes 8080/tcp, 0.0.0.0:80->80/tcp, 50000/tcp jenkins
d68e3b96071e my.docker-repo.com:4443/nginx/nginx:lts "nginx -g 'daemon of…" 16 minutes ago Up 16 minutes 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp nginx
$ sudo systemctl status jenkins-docker
jenkins-docker.service - Jenkins
Loaded: loaded (/etc/systemd/system/jenkins-docker.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2018-11-08 17:38:06 UTC; 18min ago
Process: 2006 ExecStop=/usr/local/bin/jenkins-docker stop (code=exited, status=0/SUCCESS)
The processes above were executed on the actual instance. Note how I’m able to cat the OS release file from within the container:
$ sudo docker exec d68e3b96071e cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
I was able to do so because Docker containers do not have their own kernel, but rather share the kernel of the underlying host via Linux system calls (e.g., setuid, stat, umount) like any other application. These system calls (or syscalls for short) are standard across kernels, and Docker supports kernel version 3.10 and higher. In the event older syscalls are deprecated and replaced with new ones, you can update the kernel of the underlying host, which can be done independently of an OS upgrade. As far as the containers go, the binaries and apt package management tools are the same as if you had installed Ubuntu on an EC2 instance (or VM).
Q: But I’m running a Windows environment. Those OSs don’t have a Linux kernel.
Yes, developers may want to remove the cost overhead associated with Windows licenses by exploring running their apps on a Linux OS. Others may simply want to modernize their .NET applications by testing out the latest versions in containers. Docker allows you to run Linux containers on Windows 10 and Server 2016. As Docker was initially written to execute on Linux distributions, in order to take advantage of multitenant hosting you will have to run Hyper-V containers, which provision a thin VM on top of your hosts. You can then manage your mixed environment of Windows and Linux hosts via the --isolation option. More information can be found in the Microsoft and Docker documentation.
IT teams need to be able to help drive the business forward. Newer technologies and security patches are procured on a daily basis. Developers need to be able to freely work on modernizing their code and applications. Concurrently, Operations needs to be able to support and enhance the pipelines and platforms that get the code out faster and securely. Leveraging Docker containers in conjunction with these pipelines further helps to ensure these are both occurring in parallel without the unnecessary overhead. This allows teams to work independently in the early stages of the development cycle and yet more collaboratively to get the releases out the door.
For help getting started leveraging your environment to take advantage of containerization, contact us.
-Sabine Blair, Systems Engineer & Cloud Consultant
Big Data and Machine Learning Services Lead the Way
If you’ve been reading this blog, or otherwise following the enterprise tech market, you know that the worldwide cloud services market is strong. According to Gartner, the market is projected to grow by 17% in 2019, to over $206 billion.
Within that market, enterprise IT departments are embracing cloud infrastructure and related services like never before. They’re attracted to tools and technologies that enable innovation, cost savings, faster time-to-market for new digital products and services, flexibility, and productivity. They want to be able to scale their infrastructure up and down as the situation warrants, and they’re enamored with the idea of “digital transformation.”
In its short history, cloud infrastructure has never been more exciting. At 2nd Watch, we are fortunate to have a front-row seat to the show, with more than 400 enterprise workloads under management and over 200,000 instances in our managed public cloud. With 2018 now in our rearview mirror, we thought this a good time for a quick peek back at the most popular Amazon Web Services (AWS) products of the past year. We aggregated and anonymized our AWS customer data from 2018, and here’s what we found:
The top five AWS products of 2018 were: Amazon Virtual Private Cloud (used by 100% of 2nd Watch customers); AWS Data Transfer (100%); Amazon Simple Storage Service (100%); Amazon DynamoDB (100%) and Amazon Elastic Compute Cloud (100%). Frankly, the top five list isn’t surprising. It is, however, indicative of legacy workloads and architectures being run by the enterprise.
Meanwhile, the fastest-growing AWS products of 2018 were: Amazon Athena (68% CAGR, as measured by dollars spent on this service with 2nd Watch in 2018 v. 2017); Amazon Elastic Container Service for Kubernetes (53%); Amazon MQ (37%); AWS OpsWorks (23%); Amazon EC2 Container Service (21%); Amazon SageMaker (21%); AWS Certificate Manager (20%); and AWS Glue (16%).
The growth in data services like Athena and Glue, correlated with SageMaker, is interesting. Typically the hype isn’t supported by the data, but clearly customers are moving forward with big data and machine learning strategies. These three services were also the fastest-growing services in Q4 2018.
Looking ahead, I expect EKS to be huge this year, along with SageMaker and serverless. Based on job postings and demand in the market, Kubernetes is the most requested skill set in the enterprise. For a look at the other AWS products and services that rounded out our list for 2018, download our infographic.
– Chris Garvey, EVP Product
While AWS re:Invent 2018 is still fresh in our minds, let’s take a look at some of the most significant and exciting AWS announcements made. Here are our top five takeaways from AWS re:Invent 2018.
Number 5: AWS DeepRacer
To be honest, when I first saw DeepRacer I wasn’t paying full attention to the keynote. After previous years’ announcements of Amazon Snowball and Snowmobile, I thought this might be the next version of how AWS is going to be moving data around. Instead we have an awesome little car that will give people exposure to programming and machine learning in a fun and interesting way. I know people at 2nd Watch are hoping to form a team so that we can compete at the AWS races. Anything that can get people to learn more about machine learning is a good thing as so many problems could be solved elegantly with machine learning solutions.
Number 4: Amazon Managed Blockchain and Amazon Quantum Ledger Database
Amazon has finally plunged directly into the Blockchain world that seems to get so much media attention these days. Built upon the Amazon Quantum Ledger Database (QLDB), Amazon Managed Blockchain will give you the ability to integrate with Ethereum and Hyperledger Fabric. QLDB will allow you to store information in a way that transactions can never be lost or modified. For instance, rather than storing security access in a log file or a database, you can store transactions in QLDB. This will make it easy to guarantee the integrity of the security access for audit purposes.
Number 3: RDS on VMware
Having worked with many companies that are concerned about moving into the cloud, RDS on VMware could be a great first step on their journey to the cloud. Rather than taking the full plunge into the cloud, companies will be able to utilize RDS instances in their existing VMware environments. Since databases are such a critical piece of infrastructure, much of the initial testing can be done on-premises. You can set up RDS on VMware in your dev environment alongside your current dev databases and begin testing without ever needing to touch things in AWS. Then, once you’re ready to move the rest of your infrastructure to the cloud, you’ll have one less critical change to make.
Number 2: AWS Outposts
EC2 instances in your datacenter – and not just EC2 instances, but pretty much anything that uses EC2 under the hood (RDS, EMR, Sagemaker, etc.) – will be able to run out of your datacenter. The details are a little scant, but it sounds as though AWS is going to send you rack mount servers with some amount of storage built into them. You’ll rack them, power them, plug them into your network and be good to go. From a network perspective, it sounds like these instances will be able to show up as a VPC but also be able to connect directly into your private network. For users that aren’t ready to migrate to the cloud for whatever reason, Outposts could be the perfect way to start extending into AWS.
Number 1: AWS Transit Gateway
AWS Transit Gateway is a game changer for companies with many VPCs, VPNs, and eventually Direct Connect connections. At 2nd Watch we help companies design their cloud infrastructure as simply and elegantly as possible. When it comes to interconnecting VPCs, the old ways were always very painful and manually intensive. With Transit Gateway, you’ll have one place to go to manage all of your VPC interconnectivity. The Transit Gateway will act as a hub and ensure that your data can be routed safely and securely. This will make managing all of your AWS interconnectivity much easier!
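A quick back-of-the-envelope calculation shows why the hub-and-spoke model matters: a full mesh of VPC peering connections grows as n(n-1)/2, while a Transit Gateway needs only one attachment per VPC. The sketch below just runs that arithmetic.

```python
# Connection-count comparison: full-mesh VPC peering vs. Transit Gateway.

def full_mesh_connections(n_vpcs: int) -> int:
    """Peering connections needed to fully mesh n VPCs: n*(n-1)/2."""
    return n_vpcs * (n_vpcs - 1) // 2

def transit_gateway_attachments(n_vpcs: int) -> int:
    """With a hub, each VPC needs exactly one attachment."""
    return n_vpcs

for n in (5, 20, 50):
    print(f"{n} VPCs: {full_mesh_connections(n)} peerings "
          f"vs {transit_gateway_attachments(n)} attachments")
```

At 50 VPCs that is 1,225 peering connections to create and route versus 50 attachments, which is the difference between "manually intensive" and manageable.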
-Philip Seibel, Managing Cloud Consultant