The annual Amazon Web Services (AWS) re:Invent conference is just around the corner (the show kicks off November 27 in Las Vegas). Rest assured, there will be lots of AWS-related products, partners, and customer news. Not to mention, more than a few parties. Here’s what to expect at AWS re:Invent 2017—and a few more topics we hope to hear about.
1.) Focus on IOT, Machine Learning, and Big Data
IOT, Machine Learning, and Big Data are top of mind with much of the industry—insert your own Mugatu “so hot right now” meme here – and we expect all three to be front and center at this year’s re:Invent conference. These Amazon Web Services are ripe for adoption, as most IT shops lack the capabilities to deploy these types of services on their own. We expect to see advancements in AWS IOT usability and features. We’ve already seen some early enhancements to AWS Greengrass, most notably support for additional programming languages, and would expect additional progress to be displayed at re:Invent. Other products that we expect to see advancement made are with AWS Athena and AWS Glue.
In the Machine Learning space, we were certainly excited about the recent partnership between Amazon Web Services and Microsoft around Gluon, and expect a number of follow-up announcements geared toward making it easier to adopt ML into one’s applications. As for Big Data, we imagine Amazon Web Service to continue sniping at open source tools that can be used to develop compelling services. We also would be eager to see more use of AWS Lambda for in-flight ETL work, and perhaps a long-running Lambda option for batch jobs.
2.) Enterprise Security
To say that data security has been a hot topic these past several months, would be a gross understatement. From ransomware to the Experian breach to the unsecured storage of private keys, data security has certainly been in the news. In our September Enterprise Security Survey, 73% of respondents who are IT professionals don’t fully understand the public cloud shared responsibility model.
Last month, we announced our collaboration with Palo Alto Networks to help enterprises realize the business and technical benefits of securely moving to the public cloud. The 2nd Watch Enterprise Cloud Security Service blends 2nd Watch’s Amazon Web Services expertise and architectural guidance with Palo Alto Networks’ industry-leading VM series of security products. To learn more about security and compliance, join our re:Invent breakout session—Continuous Compliance on AWS at Scale— by registering for ID number SID313 from the AWS re:Invent Session Catalogue. The combination delivers a proven enterprise cloud security offering that is designed to protect customer organizations from cyberattacks, in hybrid or cloud architectures. 2nd Watch is recognized as the first public cloud-native managed security provider to join the Palo Alto Networks, NextWave Channel Partner Program. We are truly excited about this new service and collaboration, and hope you will visit our booth (#1104) or Palo Alto Networks (#2409) to learn more.
As for Amazon Web Services, we fully expect to see a raft of announcements. Consistent with our expectations around ML and Big Data, we expect to hear about enhanced ML-based anomaly detection, logging and log analytics, and the like. We also expect to see advancements to AWS Shield and AWS Organizations, which were both announced at last year’s show. Similarly, we wouldn’t be surprised by announced functionality to their web app firewall, AWS WAF. A few things we know customers would like are easier, less labor-intensive management and even greater integration into SecDevOps workflows. Additionally, customers are looking for better integration with third-party and in-house security technologies – especially application scanning and SIEM solutions – for a more cohesive security monitoring, analysis, and compliance workflow.
The dynamic nature of the cloud creates specific challenges for security. Better security and visibility for ephemeral resources such as containers, and especially for AWS Lambda, are a particular challenge, and we would be extremely surprised not to see some announcements in this area.
Lastly, General Data Protection Regulations (GDPR) will be kicking in soon, and it is critical that companies get on top of this. We expect Amazon Web Service to make several announcements about improved, secure storage and access, especially with respect to data sovereignty. More broadly, we expect that Amazon Web Service will announce improved tools and services around compliance and governance, particularly with respect to mapping deployed or planned infrastructure against the control matrices of various regulatory schemes.
We don’t need to tell you that AWS’ re:Play Party is always an amazing, veritable visual, and auditory playground. Last year, we played classic Street Fighter II while listening to Martin Garrix bring the house down (Coin might have gotten ROFLSTOMPED playing Ken, but it was worth it!). Amazon Web Services always pulls out all the stops, and we expect this year to be the best yet.
2nd Watch will be hosting its annual party for customers at the Rockhouse at the Palazzo. There will be great food, an open bar, an awesome DJ, and of course, a mechanical bull. If you’re not yet on the guest list, request your invitation TODAY! We’d love to connect with you, and it’s a party you will not want to miss.
Bonus: A wish list of things 2nd Watch would like to see released at AWS re:Invent 2017
Blockchain – Considering the growing popularity of blockchain technologies, we wouldn’t be surprised if Amazon Web Service launched a Blockchain as a Service (BaaS) offering, or at least signaled their intent to do so, especially since Azure already has a BaaS offering.
Multi-region Database Option – This is something that would be wildly popular but is incredibly hard to accomplish. Having an active-active database strategy across regions is critical for production workloads that operate nationwide and require high uptime. Azure already offers it with their Cosmos DB (think of it as a synchronized, multi-region DynamoDB), and we doubt Amazon Web Service will let that challenge stand much longer. It is highly likely that Amazon Web Service has this pattern operating internally, and customer demand is how Amazon Web Service services are born.
Nifi – The industry interest in Nifi data-flow orchestration, often analogized to the way parcel services move and track packages, has been accelerating for many reasons, including its applicability to IoT and for its powerful capabilities around provenance. We would love to see AWS DataPipeline re-released as Nifi, but with all the usual Amazon Web Services provider integrations built in.
If even half our expectations for this year’s re:Invent are met, you can easily see why the 2nd Watch team is truly excited about what Amazon Web Services has in store for everyone. We are just as excited about what we have to offer to our customers, and so we hope to see you there!
Schedule a meeting with one of our AWS Professional Certified Architects, DevOps or Engineers and don’t forget to come visit us in booth #1104 in the Expo Hall! See you at re:Invent 2017!
— Coin Graham, Senior Cloud Consultant and John Lawler, Senior Product Manager, 2nd Watch
When we talk about high performance computing we are typically trying to solve some type of problem. These problems will generally fall into one of four types:
- Compute Intensive – A single problem requiring a large amount of computation.
- Memory Intensive – A single problem requiring a large amount of memory.
- Data Intensive – A single problem operating on a large data set.
- High Throughput – Many unrelated problems that are be computed in bulk.
In this post, I will provide a detailed introduction to High Performance Computing (HPC) that can help organizations solve the common issues listed above.
Compute Intensive Workloads
First, let us take a look at compute intensive problems. The goal is to distribute the work for a single problem across multiple CPUs to reduce the execution time as much as possible. In order for us to do this, we need to execute steps of the problem in parallel. Each process—or thread—takes a portion of the work and performs the computations concurrently. The CPUs typically need to exchange information rapidly, requiring specialization communication hardware. Examples of these types of problems are those that can be found when analyzing data that is relative to tasks like financial modeling and risk exposure in both traditional business and healthcare use cases. This is probably the largest portion of HPC problem sets and is the traditional domain of HPC.
When attempting to solve compute intensive problems, we may think that adding more CPUs will reduce our execution time. This is not always true. Most parallel code bases have what we call a “scaling limit”. This is in no small part due to the system overhead of managing more copies, but also to more basic constraints.
CAUTION: NERD ALERT
This is summed up brilliantly in Amdahl’s law.
In computer architecture, Amdahl’s law is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967.
Amdahl’s law is often used in parallel computing to predict the theoretical speedup when using multiple processors. For example, if a program needs 20 hours using a single processor core, and a particular part of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours (p = 0.95) of execution time can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical one hour. Hence, the theoretical speedup is limited to at most 20 times (1/(1 − p) = 20). For this reason, parallel computing with many processors is useful only for very parallelizable programs.
Amdahl’s law can be formulated the following way:
- Slatency is the theoretical speedup of the execution of the whole task;
- s is the speedup of the part of the task that benefits from improved system resources;
- p is the proportion of execution time that the part benefiting from improved resources originally occupied.
Chart Example: If 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20 times.
Bottom line: As you create more sections of your problem that are able to run concurrently, you can split the work between more processors and thus, achieve more benefits. However, due to complexity and overhead, eventually using more CPUs becomes detrimental instead of actually helping.
There are libraries that help with parallelization, like OpenMP or Open MPI, but before moving to these libraries, we should strive to optimize performance on a single CPU, then make p as large as possible.
Memory Intensive Workloads
Memory intensive workloads require large pools of memory rather than multiple CPUs. In my opinion, these are some of the hardest problems to solve and typically require great care when building machines for your system. Coding and porting is easier because memory will appear seamless, allowing for a single system image. Optimization becomes harder, however, as we get further away from the original creation date of your machines because of component uniformity. Traditionally, in the data center, you don’t replace every single server every three years. If we want more resources in our cluster, and we want performance to be uniform, non-uniform memory produces actual latency. We also have to think about the interconnect between the CPU and the memory.
Nowadays, many of these concerns have been eliminated by commodity servers. We can ask for thousands of the same instance type with the same specs and hardware, and companies like Amazon Web Services are happy to let us use them.
Data Intensive Workloads
This is probably the most common workload we find today, and probably the type with the most buzz. These are known as “Big Data” workloads. Data Intensive workloads are the type of workloads suitable for software packages like Hadoop or MapReduce. We distribute the data for a single problem across multiple CPUs to reduce the overall execution time. The same work may be done on each data segment, though not always the case. This is essentially the inverse of a memory intensive workload in that rapid movement of data to and from disk is more important than the interconnect. The type of problems being solved in these workloads tend to be Life Science (genomics) in the academic field and have a wide reach in commercial applications, particularly around user data and interactions.
High Throughput Workloads
Batch processing jobs (jobs with almost trivial operations to perform in parallel as well as jobs with little to no inter-CPU communication) are considered High Throughput workloads. In high throughput workloads, we create an emphasis on throughput over a period rather than performance on any single problem. We distribute multiple problems independently across multiple CPU’s to reduce overall execution time. These workloads should:
- Break up naturally into independent pieces
- Have little or no inter-cpu communcation
- Be performed in separate processes or threads on a separate CPU (concurrently)
Workloads that are compute intensive jobs can likely be broken into high throughput jobs, however, high throughput jobs do not necessarily mean they are CPU intensive.
HPC On Amazon Web Services
Amazon Web Services (AWS) provides on-demand scalability and elasticity for a wide variety of computational and data-intensive workloads, including workloads that represent many of the world’s most challenging computing problems: engineering simulations, financial risk analyses, molecular dynamics, weather prediction, and many more.
– AWS: An Introduction to High Performance Computing on AWS
Amazon literally has everything you could possibly want in an HPC platform. For every type of workload listed here, AWS has one or more instance classes to match and numerous sizes in each class, allowing you to get very granular in the provisioning of your clusters.
Speaking of provisioning, there is even a tool called CfnCluster which creates clusters for HPC use. CfnCluster is a tool used to build and manage High Performance Computing (HPC) clusters on AWS. Once created, you can log into your cluster via the master node where you will have access to standard HPC tools such as schedulers, shared storage, and an MPI environment.
For data intensive workloads, there a number of options to help get your data closer to your compute resources.
EBS is even a viable option for creating large scale parallel file systems to meet high-volume, high-performance, and throughput requirements of workloads.
HPC Workloads & 2nd Watch
2nd Watch can help you solve complex science, engineering, and business problems using applications that require high bandwidth, enhanced networking, and very high compute capabilities.
Increase the speed of research by running high performance computing in the cloud and reduce costs by paying for only the resources that you use, without large capital investments. With 2nd Watch, you have access to a full-bisection, high bandwidth network for tightly-coupled, IO-intensive workloads, which enables you to scale out across thousands of cores for throughput-oriented applications. Contact us today to learn more.
2nd Watch Customer Success
Celgene is an American biotechnology company that manufactures drug therapies for cancer and inflammatory disorders. Read more about their cloud journey and how they went from doing research jobs that previously took weeks or months, to just hours. Read the case study.
We have also helped a global finance & insurance firm prove their liquidity time and time again in the aftermath of the 2008 recession. By leveraging the batch computing solution that we provided for them, they are now able to scale out their computations across 120,000 cores while validating their liquidity with no CAPEX investment. Read the case study.
– Lars Cromley, Director of Engineering, Automation, 2nd Watch
The exponential growth of big data is pushing companies to process massive amounts of information as quickly as possible, which is often times not realistic, practical or down right just not achievable on standard CPI’s. In a nutshell, High Performance Computing (HPC) allows you to scale performance to process and report on the data quicker and can be the solution to many of your big data problems.
However, this still relies on your cluster capabilities. By using AWS for your HPC needs, you no longer have to worry about designing and adjusting your job to meet the capabilities of your cluster. Instead, you can quickly design and change your cluster to meet the needs of your jobs. There are several tools and services available to help you do this, like the AWS Marketplace, AWS API’s, or AWS CloudFormation Templates.
Today, I’d like to focus on one aspect of running an HPC cluster in AWS that people tend to forget about – placement groups.
Placement groups are a logical grouping of instances in a single availability zone. This allows you to take full advantage of a low-latency 10 GB network, which in turn will allow you to be able to transfer up to 4TB of data per hour between nodes. However, because of the low-latency 10 GB network, the placement groups cannot span to multiple availability zones. This may scare some people away from using them, but it shouldn’t. You can create multiple placement groups in different availability zones as a work-around, and with enhanced networking you can also still connect between the different HPC’s.
One of the grea benefits of AWS HPC is that you can run your High Performance Computing clusters with no up-front costs and scale out to hundreds of thousands of cores within minutes to meet your computing needs. Learn more about Big Data and HPC solutions on AWS or Contact Us to get started with a workload workshop.
-Shawn Bliesner, Cloud Architect
Business intelligence (BI) is an umbrella term that refers to a variety of software applications used to analyze an organization’s raw data. BI as a discipline is made up of several related activities including data mining, online analytical processing, querying and reporting. Analytics is the discovery and communication of meaningful patterns in data. This blog will look at a few areas of BI that will include data mining and reporting, as well as talk about using analytics to find the answers you need to make better business decisions.
Data Mining is an analytic process designed to explore data. Companies of all sizes continuously collect data, often times in very large amounts, in order to solve complex business problems. Data collection can range in purpose from finding out the types of soda your customers like to drink to tracking genome patterns. To process these large amounts of data quickly takes a lot of processing power, and therefore, a system such as Amazon Elastic MapReduce (EMR) is often needed to accomplish this. AWS EMR can handle most use cases from log analysis to bioinformatics, which are key when collecting data, but AWS EMR can only report on data that is collected, so make sure the collected data is accurate and complete.
Reporting accurate and complete data is essential for good BI. Tools like Splunk’s Hunk and Hive work very well with AWS EMR for modeling, reporting, and analyzing data. Hive is business intelligence software used for reporting meaningful patterns in the data, while Hunk helps interactively review logs with real-time alerts. Using the correct tools is the difference between data no one can use and data that provides meaningful BI.
Why do we collect all this data? To find answers of course! Finding answers in your data, from marketing data to application debugging, is why we collect the data in the first place. AWS EMR is great for processing all that data with the right tools reporting on that data. But more than knowing just what happened, we need to find out how it happened. Interactive queries on the data are required to drill down and find the root causes or customer trends. Tools like Impala and Tableau work great with AWS EMR for these needs.
Business Intelligence and Analytics boils down to collecting accurate and complete data. That includes having a system that can process that data, having the ability to report on that data in a meaningful way, and using that data to find answers. By provisioning the storage, computation and database services you need to collect big data into the cloud, we can help you manage big data, BI and analytics while reducing costs, increasing speed of innovation, and providing high availability and durability so you can focus on making sense of your data and using it to make better business decisions. Learn more about our BI and Analytics Solutions here.
-Brent Anderson, Senior Cloud Engineer
In the past few years, Apache’s Hadoop software library has increased market share for Big Data analytics, which are useful for business intelligence (BI) today. There are several reasons why Hadoop’s had such success, but our favorites are that it was one of the first in the market and it’s led by the Open Source community.
By offering a Hadoop-based service, public cloud vendors can offer their customers rapidly scalable processing power and storage. On its own, Hadoop requires significant customization depending on the processing needs of the organization using it. Hadoop also helps manage situations that crank out large volumes of data, big enough to impact your storage resources. Yelp, a local business directory service and review site with social networking features, and AWS customer, is using Hadoop in-house, and deploying big RAID storage resources to handle the increase in their log file production. According to Yelp, they were pumping out up to 100GB of log files every day.
AWS made the Hadoop technology available via the cloud in its Elastic MapReduce (EMR) offering that came out in the early part of 2009. With AWS, customers access EMR through on-demand EC2 instances and can store data using its DynamoDB or S3. By using AWS EMR and S3, Yelp, Inc., was able to save $55,000 in upfront storage costs while meeting their performance needs. That’s a pretty compelling case for running Hadoop services in the cloud.
Recently, Microsoft released its Azure Hadoop-based service, called Azure HDInsight, which has gone through three public pre-release versions in 2012. Microsoft partnered with Hortonworks to build out HDInsight.
Azure is certainly an important and up-and-coming public cloud provider, but it’s mainly been playing a “me too” game with AWS, trying to match the competing service feature for feature. That’s a lot of catch-up; as it should be since EMR’s been in commercial operation since 2009 while HDInsight only just got off the ground.
That means there’s a maturity of both service and technology to EMR that’s not quite there yet with HDInsight. One example is that with AWS EMR, you can opt for an Elastic Load Balancer, which Azure doesn’t mention at all. And via EC2, those instances are also “available in minutes” just like Azure’s big virtualized infrastructure benefit play.
Analyzing Big Data takes massive amounts of processing power (which is why it lends itself so well to cloud-based computing clusters) and huge volumes of data. That means you’ll at least want the option to use a wide and well-managed WAN link for reliable connection up-time as well as big storage buckets. EMR lets you store up to 48TB using multiple deployment choices depending on your needs along with high-end compute cycles and up to 10Gbps worth of network throughput. EMR’s maturity provides for all that while it seems HDInsight is still learning.
Another difference is EMR’s use of the AWS management console to build and manage Elastic MapReduce clusters. Cloud-oriented IT folks are very familiar with the AWS management console, so managing EMR means a much lower learning curve than wading through a whole new set of tools via Azure. It makes use of MapR technology, which adds important features to the Hadoop platform, like data snapshots and high-availability management as well as Amazon-specific features including the ability to mirror EMR clusters across AWS availability zones. MapR has had a long time to integrate with AWS EMR, so its tools are pretty much seamless with EMR’s management capabilities at this point.
Then there’s cost. AWS has been leading the cloud cost wars for the last few years, against all competitors, not just Microsoft. Competitors are reacting to AWS rather than pushing ahead on their own. AWS has a free tier of business application operation, which includes EMR implementation that lasts for one year from sign-up. That allows you to grow your application, understand its long-term scope including spikes and dips, and then budget accordingly. After that, it goes to AWS’ pay-as-you-go model. At least, that’s the model we’d likely use for mission-critical BI, but there are plenty of customers with different priorities, so EMR supports all of AWS’ pricing models.
An example is BackType, a social analytics company and another AWS EMR customer, which uses approximately 25TB to hold over 100 billion records. To satisfy its business, BackType implemented an API that can process 400 requests per second. That was seriously straining both their in-house hardware and their budget. To help, it’s currently averaging around 60 EMR instances, but by using both the reserved and spot instance payment models it can quickly scale up to 150 instances when needed. By leveraging one pricing model against another, the company says it’s saved up to 34% in costs. Those kinds of flexible pricing options aren’t available on other services.
The one place where Azure HDInsight may pull ahead is in end-user tools. If your Big Data analytics team is using Excel as its front-end analysis tool, then Azure delivers a Hive ODBC driver and a Hive add-on for Excel. That’s a smart move on Azure’s part, but it can be duplicated on EMR with some front-end planning.
Additionally, Microsoft is only now attempting to become a serious player in the business intelligence space, and whether SQL Server and Excel can really compete against dedicated and far more mature platforms, like AWS or other established players like Karmasphere Analyst, is a big question mark. Those platforms are all allied with Amazon and available as EMR extensions and can also be had on AWS’ pricing model (including pay-as-you-use).
Whether Azure, SQL Server and Excel can really compete in BI against competition like that is definitely still up in the air. Azure needs to prove itself at the service level, not just as the infrastructure as a service (IaaS) and cloud storage provider it’s been so far. Microsoft has introduced new development tools that should allow developers to build such services, so we’ll likely see a closer race in the future. Currently, however, AWS remains our cloud service platform of choice.
-Kris Bliesner, CEO