When we talk about high performance computing we are typically trying to solve some type of problem. These problems will generally fall into one of four types:
- Compute Intensive – A single problem requiring a large amount of computation.
- Memory Intensive – A single problem requiring a large amount of memory.
- Data Intensive – A single problem operating on a large data set.
- High Throughput – Many unrelated problems that are be computed in bulk.
In this post, I will provide a detailed introduction to High Performance Computing (HPC) that can help organizations solve the common issues listed above.
Compute Intensive Workloads
First, let us take a look at compute intensive problems. The goal is to distribute the work for a single problem across multiple CPUs to reduce the execution time as much as possible. In order for us to do this, we need to execute steps of the problem in parallel. Each process—or thread—takes a portion of the work and performs the computations concurrently. The CPUs typically need to exchange information rapidly, requiring specialization communication hardware. Examples of these types of problems are those that can be found when analyzing data that is relative to tasks like financial modeling and risk exposure in both traditional business and healthcare use cases. This is probably the largest portion of HPC problem sets and is the traditional domain of HPC.
When attempting to solve compute intensive problems, we may think that adding more CPUs will reduce our execution time. This is not always true. Most parallel code bases have what we call a “scaling limit”. This is in no small part due to the system overhead of managing more copies, but also to more basic constraints.
CAUTION: NERD ALERT
This is summed up brilliantly in Amdahl’s law.
In computer architecture, Amdahl’s law is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967.
Amdahl’s law is often used in parallel computing to predict the theoretical speedup when using multiple processors. For example, if a program needs 20 hours using a single processor core, and a particular part of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours (p = 0.95) of execution time can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical one hour. Hence, the theoretical speedup is limited to at most 20 times (1/(1 − p) = 20). For this reason, parallel computing with many processors is useful only for very parallelizable programs.
Amdahl’s law can be formulated the following way:
- Slatency is the theoretical speedup of the execution of the whole task;
- s is the speedup of the part of the task that benefits from improved system resources;
- p is the proportion of execution time that the part benefiting from improved resources originally occupied.
Chart Example: If 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20 times.
Bottom line: As you create more sections of your problem that are able to run concurrently, you can split the work between more processors and thus, achieve more benefits. However, due to complexity and overhead, eventually using more CPUs becomes detrimental instead of actually helping.
There are libraries that help with parallelization, like OpenMP or Open MPI, but before moving to these libraries, we should strive to optimize performance on a single CPU, then make p as large as possible.
Memory Intensive Workloads
Memory intensive workloads require large pools of memory rather than multiple CPUs. In my opinion, these are some of the hardest problems to solve and typically require great care when building machines for your system. Coding and porting is easier because memory will appear seamless, allowing for a single system image. Optimization becomes harder, however, as we get further away from the original creation date of your machines because of component uniformity. Traditionally, in the data center, you don’t replace every single server every three years. If we want more resources in our cluster, and we want performance to be uniform, non-uniform memory produces actual latency. We also have to think about the interconnect between the CPU and the memory.
Nowadays, many of these concerns have been eliminated by commodity servers. We can ask for thousands of the same instance type with the same specs and hardware, and companies like Amazon Web Services are happy to let us use them.
Data Intensive Workloads
This is probably the most common workload we find today, and probably the type with the most buzz. These are known as “Big Data” workloads. Data Intensive workloads are the type of workloads suitable for software packages like Hadoop or MapReduce. We distribute the data for a single problem across multiple CPUs to reduce the overall execution time. The same work may be done on each data segment, though not always the case. This is essentially the inverse of a memory intensive workload in that rapid movement of data to and from disk is more important than the interconnect. The type of problems being solved in these workloads tend to be Life Science (genomics) in the academic field and have a wide reach in commercial applications, particularly around user data and interactions.
High Throughput Workloads
Batch processing jobs (jobs with almost trivial operations to perform in parallel as well as jobs with little to no inter-CPU communication) are considered High Throughput workloads. In high throughput workloads, we create an emphasis on throughput over a period rather than performance on any single problem. We distribute multiple problems independently across multiple CPU’s to reduce overall execution time. These workloads should:
- Break up naturally into independent pieces
- Have little or no inter-cpu communcation
- Be performed in separate processes or threads on a separate CPU (concurrently)
Workloads that are compute intensive jobs can likely be broken into high throughput jobs, however, high throughput jobs do not necessarily mean they are CPU intensive.
HPC On Amazon Web Services
Amazon Web Services (AWS) provides on-demand scalability and elasticity for a wide variety of computational and data-intensive workloads, including workloads that represent many of the world’s most challenging computing problems: engineering simulations, financial risk analyses, molecular dynamics, weather prediction, and many more.
– AWS: An Introduction to High Performance Computing on AWS
Amazon literally has everything you could possibly want in an HPC platform. For every type of workload listed here, AWS has one or more instance classes to match and numerous sizes in each class, allowing you to get very granular in the provisioning of your clusters.
Speaking of provisioning, there is even a tool called CfnCluster which creates clusters for HPC use. CfnCluster is a tool used to build and manage High Performance Computing (HPC) clusters on AWS. Once created, you can log into your cluster via the master node where you will have access to standard HPC tools such as schedulers, shared storage, and an MPI environment.
For data intensive workloads, there a number of options to help get your data closer to your compute resources.
EBS is even a viable option for creating large scale parallel file systems to meet high-volume, high-performance, and throughput requirements of workloads.
HPC Workloads & 2nd Watch
2nd Watch can help you solve complex science, engineering, and business problems using applications that require high bandwidth, enhanced networking, and very high compute capabilities.
Increase the speed of research by running high performance computing in the cloud and reduce costs by paying for only the resources that you use, without large capital investments. With 2nd Watch, you have access to a full-bisection, high bandwidth network for tightly-coupled, IO-intensive workloads, which enables you to scale out across thousands of cores for throughput-oriented applications. Contact us today to learn more.
2nd Watch Customer Success
Celgene is an American biotechnology company that manufactures drug therapies for cancer and inflammatory disorders. Read more about their cloud journey and how they went from doing research jobs that previously took weeks or months, to just hours. Read the case study.
We have also helped a global finance & insurance firm prove their liquidity time and time again in the aftermath of the 2008 recession. By leveraging the batch computing solution that we provided for them, they are now able to scale out their computations across 120,000 cores while validating their liquidity with no CAPEX investment. Read the case study.
– Lars Cromley, Director of Engineering, Automation, 2nd Watch
AWS has managed to transform the traditional datacenter model into a feature-rich platform and has been constantly adding new services to meet business and consumer needs. As virtualization has changed the way infrastructure is now built and managed, the ‘serverless’ execution model has become a viable method of reducing costs and simplifying management. A few years ago, the infrastructure required to host a typical application or service required the setup and management of physical hardware, operating systems and application code. AWS’ offerings have grown to include services such as RDS, SES, DynamoDB and ElastiCache which provide a subset of functionality without the requirement of having to manage the entire underlying infrastructure on which those services actually run.
Enter AWS Lambda.
Lambda is a serverless compute service that runs your code in response to events and automatically manages the underlying compute resources for you. You can use AWS Lambda to extend other AWS services with custom logic, or create your own back-end services that operate at AWS scale, performance, and security.
In a nutshell, Lambda provides a service that executes custom code without having to manage the underlying infrastructure on which that code is executed. The administration of the underlying compute resources, including server and operating system maintenance, capacity provisioning, automatic scaling, code monitoring, logging, and code and security patch deployment are eliminated. With AWS Lambda, you pay only for what you use, and are charged based on the number of requests for your functions and the time your code executes. This allows you to eliminate the overhead of paying for instances (by the hour or reserved) and their administration. Why build an entire house if all you need is a kitchen so you can cook dinner? In addition, the service also automatically scales to meet capacity requirements. Again, less complexity and overhead than managing EC2 Auto Scale Groups.
Here’s AWS’ Jeff Barr’s simple description of the service and how it works:
You upload your code and then specify context information to AWS Lambda to create a function. The context information specifies the execution environment (language, memory requirements, a timeout period, and IAM role) and also points to the function you’d like to invoke within your code. The code and the metadata are durably stored in AWS and can later be referred to by name or by ARN (Amazon Resource Name). You can also include any necessary third-party libraries in the upload (which takes the form of a single ZIP file per function).
After uploading, you associate your function with specific AWS resources (a particular S3 bucket, DynamoDB table, or Kinesis stream). Lambda will then arrange to route events (generally signifying that the resource has changed) to your function.
When a resource changes, Lambda will execute any functions that are associated with it. It will launch and manage compute resources as needed in order to keep up with incoming requests. You don’t need to worry about this; Lambda will manage the resources for you and will shut them down if they are no longer needed.
Lambda Functions can be invoked by triggers from changes in state or data from services such as S3, DynamoDB, Kinesis, SNS and CloudTrail, after which, the output can then be sent back to those same services (though it does not have to be). It handles listening, polling, queuing and auto-scaling and spins up as many workers as needed match the rate change of source data.
A few common use cases include:
- S3 + Lambda (Dynamic data ingestion) – Image re-sizing, Video Transcoding, Indexing, Log Processing
- Direct Call + Lambda (Serverless backend) – Microservices, Mobile backends, IoT backends
- Kinesis + Lambda (Live Stream Processing) – Transaction Processing, Stream analysis, Telemetry and Metering
- SNS + Lambda (Custom Messages) – Automating alarm responses, IT Auditing, Text to Email Push
Additionally, data can be sent in parallel to separate Functions to decrease the amount of time required for data that must be processed or manipulated multiple times. This could theoretically be used to perform real-time analytics and data aggregation from a source such as Kinesis.
- Memory is specified ranging from 128MB to 1GB, in 64MB increments. Disk, network and compute resources are provisioned based on the memory footprint. Lambda tells you how much memory is used, so this setting can be tuned.
- They can be invoked on-demand via the CLI and AWS Console, or subscribed to one or multiple event sources (e.g. S3, SNS). And you can reuse the same Function for those event sources.
- Granular permissions can be applied via IAM such as IAM Roles. At a minimum, logging to CloudWatch is recommended.
- Limits to resource allocation such as 512MB /tmp space, 1024 file descriptors and 50MB deployment package size can be found at http://docs.aws.amazon.com/lambda/la/dg/limits.html.
- Multiple deployment options exist including direct authoring via the AWS Console, packaging code as a zip, and 3rd party plugins (Grunt, Jenkins).
- Stateless data means depending on another service such as S3 or DynamoDB to retain persistence.
- Monitoring and debugging can be accomplished using the Console Dashboard to view CloudWatch metrics such as requests, errors, latency and throttling.
Invoking Lambda functions can be achieved using Push or Pull methods. In the event of a Push from S3 or SNS, retries occur automatically 3 times and is unordered. One event equals one Function invocation. Pull, on the other hand (Kinesis & DynamoDB), is ordered and will retry indefinitely until data expires. Resource policies (used in the Push model) can be defined per Function and allow for cross-account access. IAM roles (used for Pull), can be used to derive permission from execution role to read data from a particular stream.
Lambda uses a fine-grained pricing model based on the number of requests made AND the execution time of those requests. Each month, the first 1 million requests are free with a $0.20 charge per 1 million requests thereafter. Duration is calculated from the time your code begins executing until it returns or otherwise terminates, rounded up to the nearest 100ms and takes into account the amount of memory allocated to a function. The execution cost is $0.00001667 for every GB-second used.
Additional details regarding the service can be found https://aws.amazon.com/lambda/. If you need help getting started with the service, contact us at 2nd Watch.
-Ryan Manikowski, Cloud Consultant
IT infrastructure is the hardware, network, services and software required for enterprise IT. It is the foundation that enables organizations to deliver IT services to their users. Disaster recovery (DR) is preparing for and recovering from natural and people-related disasters that impact IT infrastructure for critical business functions. Natural disasters include earthquakes, fires, etc. People-related disasters include human error, terrorism, etc. Business continuity differs from DR as it involves keeping all aspects of the organization functioning, not just IT infrastructure.
When planning for DR, companies must establish a recovery time objective (RTO) and recovery point objective (RPO) for each critical IT service. RTO is the acceptable amount of time in which an IT service must be restored. RPO is the acceptable amount of data loss measured in time. Companies establish both RTOs and RPOs to mitigate financial and other types of loss to the business. Companies then design and implement DR plans to effectively and efficiently recover the IT infrastructure necessary to run critical business functions.
For companies with corporate datacenters, the traditional approach to DR involves duplicating IT infrastructure at a secondary location to ensure available capacity in a disaster. The key downside is IT infrastructure must be bought, installed and maintained in advance to address anticipated capacity requirements. This often causes IT infrastructure in the secondary location to be over-procured and under-utilized. In contrast, Amazon Web Services (AWS) provides companies with access to enterprise-grade IT infrastructure that can be scaled up or down for DR as needed.
The four most common DR architectures on AWS are:
- Backup and Restore ($) – Companies can use their current backup software to replicate data into AWS. Companies use Amazon S3 for short-term archiving and Amazon Glacier for long-term archiving. In the event of a disaster, data can be made available on AWS infrastructure or restored from the cloud back onto an on-premise server.
- Pilot Light ($$) – While backup and restore are focused on data, pilot light includes applications. Companies only provision core infrastructure needed for critical applications. When disaster strikes, Amazon Machine Images (AMIs) and other automation services are used to quickly provision the remaining environment for production.
- Warm Standby ($$$) – Taking the Pilot Light model one step further, warm standby creates an active/passive cluster. The minimum amount of capacity is provisioned in AWS. When needed, the environment rapidly scales up to meet full production demands. Companies receive (near) 100% uptime and (near) no downtime.
- Hot Standby ($$$$) – Hot standby is an active/active cluster with both cloud and on-premise components to it. Using weighted DNS load-balancing, IT determines how much application traffic to process in-house and on AWS. If a disaster or spike in load occurs, more or all of it can be routed to AWS with auto-scaling.
In a non-disaster environment, warm standby DR is not scaled for full production, but is fully functional. To help adsorb/justify cost, companies can use the DR site for non-production work, such as quality assurance, ing, etc. For hot standby DR, cost is determined by how much production traffic is handled by AWS in normal operation. In the recovery phase, companies only pay for what they use in addition and for the duration the DR site is at full scale. In hot standby, companies can further reduce the costs of their “always on” AWS servers with Reserved Instances (RIs).
Smart companies know disaster is not a matter of if, but when. According to a study done by the University of Oregon, every dollar spent on hazard mitigation, including DR, saves companies four dollars in recovery and response costs. In addition to cost savings, smart companies also view DR as critical to their survival. For example, 51% of companies that experienced a major data loss closed within two years (Source: Gartner), and 44% of companies that experienced a major fire never re-opened (Source: EBM). Again, disaster is not a ready of if, but when. Be ready.
-Josh Lowry, General Manager – West