
Business Intelligence and Analytics in the Public Cloud

Business intelligence (BI) is an umbrella term for the variety of software applications used to analyze an organization’s raw data. As a discipline, BI is made up of several related activities, including data mining, online analytical processing, querying and reporting. Analytics is the discovery and communication of meaningful patterns in data. This post looks at a few areas of BI, including data mining and reporting, and at how to use analytics to find the answers you need to make better business decisions.

Data Mining

Data mining is an analytic process designed to explore data. Companies of all sizes continuously collect data, often in very large amounts, in order to solve complex business problems. The purpose of that collection can range from finding out which types of soda your customers like to drink to tracking genome patterns. Processing such large amounts of data quickly takes a lot of computing power, so a system such as Amazon Elastic MapReduce (EMR) is often needed. AWS EMR can handle most use cases, from log analysis to bioinformatics, but it can only report on the data that was actually collected, so make sure that data is accurate and complete.
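To make the MapReduce idea behind EMR a little more concrete, here is a minimal, single-machine sketch of the map, shuffle and reduce steps applied to the soda example. It is an illustration only: the function names and the comma-separated survey format are assumptions made for this example, and a real EMR job would distribute these steps across a Hadoop cluster rather than run them in one Python process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (soda, 1) pair for each soda named in one survey line.

    Assumes a hypothetical format: soda names separated by commas.
    """
    return [(soda.strip().lower(), 1) for soda in record.split(",") if soda.strip()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse the values for one key into a single total."""
    return key, sum(values)


def count_sodas(records):
    """Run map, shuffle and reduce in sequence over an iterable of survey lines."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    survey = ["Cola, Root Beer", "cola", "Ginger Ale, Cola"]
    print(count_sodas(survey))  # {'cola': 3, 'root beer': 1, 'ginger ale': 1}
```

Because each survey line is mapped independently and each key is reduced independently, both steps can be spread over many machines, which is what makes the model a good fit for large data sets.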


Reporting accurate and complete data is essential for good BI. Tools like Apache Hive and Splunk’s Hunk work very well with AWS EMR for modeling, reporting, and analyzing data. Hive is data warehouse software that lets you query and summarize data in Hadoop with a SQL-like language, which makes it useful for reporting meaningful patterns in the data, while Hunk helps you interactively review logs with real-time alerts. Using the correct tools is the difference between data no one can use and data that provides meaningful BI.

Why do we collect all this data? To find answers, of course! Whether it is marketing data or application debugging output, finding answers is the reason we collect data in the first place. AWS EMR is great for processing all that data, with the right tools reporting on it. But beyond knowing what happened, we need to find out how it happened. That takes interactive queries that drill down into the data to find root causes or customer trends. Tools like Impala and Tableau work great with AWS EMR for these needs.
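As a rough illustration of what drilling down looks like, the sketch below runs a summary query and then a narrower follow-up query. It uses Python's built-in sqlite3 module purely as a local stand-in for a SQL engine such as Impala, and the table, column names and rows are made up for this example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table of hypothetical request logs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requests (service TEXT, endpoint TEXT, status INTEGER)")
    rows = [
        ("checkout", "/pay", 500),
        ("checkout", "/pay", 500),
        ("checkout", "/cart", 200),
        ("search", "/query", 200),
        ("search", "/query", 500),
    ]
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", rows)
    return conn


def errors_by_service(conn):
    """What happened: count server errors per service."""
    return conn.execute(
        "SELECT service, COUNT(*) FROM requests WHERE status >= 500 "
        "GROUP BY service ORDER BY COUNT(*) DESC, service"
    ).fetchall()


def errors_by_endpoint(conn, service):
    """How it happened: drill into one service to see which endpoints fail."""
    return conn.execute(
        "SELECT endpoint, COUNT(*) FROM requests "
        "WHERE status >= 500 AND service = ? "
        "GROUP BY endpoint ORDER BY COUNT(*) DESC, endpoint",
        (service,),
    ).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(errors_by_service(db))               # [('checkout', 2), ('search', 1)]
    print(errors_by_endpoint(db, "checkout"))  # [('/pay', 2)]
```

The first query answers the reporting question, and the second narrows it using the result of the first, which is the back-and-forth that interactive query tools are designed to make fast.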

Business intelligence and analytics boil down to collecting accurate and complete data, having a system that can process that data, being able to report on it in a meaningful way, and using it to find answers. By provisioning the storage, computation and database services you need to collect big data in the cloud, we can help you manage big data, BI and analytics while reducing costs, increasing speed of innovation, and providing high availability and durability, so you can focus on making sense of your data and using it to make better business decisions. Learn more about our BI and Analytics Solutions here.

-Brent Anderson, Senior Cloud Engineer


AWS Elastic MapReduce vs. Windows Azure HDInsight

In the past few years, Apache’s Hadoop software library has gained market share in Big Data analytics, which is useful for business intelligence (BI) today. There are several reasons why Hadoop has had such success, but our favorites are that it was one of the first to market and that it’s led by the open source community.

By offering a Hadoop-based service, public cloud vendors can offer their customers rapidly scalable processing power and storage. On its own, Hadoop requires significant customization depending on the processing needs of the organization using it. Hadoop also helps manage workloads that crank out volumes of data large enough to strain your storage resources. Yelp, a local business directory and review site with social networking features, and an AWS customer, was running Hadoop in-house and deploying big RAID storage resources to keep up with its growing log file production. According to Yelp, it was generating up to 100 GB of log files every day.

AWS made Hadoop available in the cloud through its Elastic MapReduce (EMR) offering, which came out in the early part of 2009. With AWS, customers run EMR on EC2 instances provisioned on demand and can store data in DynamoDB or S3. By using AWS EMR and S3, Yelp, Inc. was able to save $55,000 in upfront storage costs while meeting its performance needs. That’s a pretty compelling case for running Hadoop services in the cloud.

Recently, Microsoft released its own Hadoop-based service, called Azure HDInsight, which went through three public pre-release versions in 2012. Microsoft partnered with Hortonworks to build out HDInsight.

Azure is certainly an important, up-and-coming public cloud provider, but it has mainly been playing a “me too” game with AWS, trying to match the competing service feature for feature. That’s a lot of catching up, which is to be expected, since EMR has been in commercial operation since 2009 while HDInsight has only just gotten off the ground.

That means EMR has a maturity of both service and technology that HDInsight doesn’t quite have yet. For example, with AWS EMR you can opt for an Elastic Load Balancer, an option Azure doesn’t mention at all. And because EMR runs on EC2, its instances are also “available in minutes,” which is the benefit Azure touts for its own big virtualized infrastructure.

Analyzing Big Data involves huge volumes of data and takes massive amounts of processing power, which is why it lends itself so well to cloud-based computing clusters. That means you’ll at least want the option of a wide, well-managed WAN link for reliable connection uptime, as well as big storage buckets. EMR lets you store up to 48 TB using multiple deployment choices depending on your needs, along with high-end compute cycles and up to 10 Gbps of network throughput. EMR’s maturity provides for all that, while HDInsight, it seems, is still learning.

Another difference is EMR’s use of the AWS management console to build and manage Elastic MapReduce clusters. Cloud-oriented IT folks are very familiar with the AWS management console, so managing EMR means a much gentler learning curve than wading through a whole new set of tools on Azure. EMR can also make use of MapR technology, which adds important features to the Hadoop platform, like data snapshots and high-availability management, as well as Amazon-specific features, including the ability to mirror EMR clusters across AWS availability zones. MapR has had a long time to integrate with AWS EMR, so its tools are pretty much seamless with EMR’s management capabilities at this point.

Then there’s cost. AWS has been leading the cloud cost wars for the last few years, against all competitors, not just Microsoft. Competitors are reacting to AWS rather than pushing ahead on their own. AWS has a free tier of business application operation, which includes EMR implementation that lasts for one year from sign-up. That allows you to grow your application, understand its long-term scope including spikes and dips, and then budget accordingly. After that, it goes to AWS’ pay-as-you-go model. At least, that’s the model we’d likely use for mission-critical BI, but there are plenty of customers with different priorities, so EMR supports all of AWS’ pricing models.

An example is BackType, a social analytics company and another AWS EMR customer, which uses approximately 25 TB of storage to hold over 100 billion records. To serve its business, BackType implemented an API that can process 400 requests per second, which was seriously straining both its in-house hardware and its budget. On EMR, it currently averages around 60 instances, but by using both the reserved and spot instance payment models it can quickly scale up to 150 instances when needed. By leveraging one pricing model against another, the company says it has saved up to 34% in costs. Those kinds of flexible pricing options aren’t available on other services.
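To show how mixing pricing models can add up, here is a back-of-the-envelope sketch that compares running everything on demand against covering the steady baseline with reserved instances and the bursts with spot instances. Only the 60 and 150 instance counts come from the BackType example. The hourly rates, burst hours and function names are hypothetical values chosen for illustration, reserved-instance upfront fees are assumed to be folded into the hourly rate, and the result is not meant to reproduce BackType's 34% figure.

```python
def on_demand_cost(baseline, peak, burst_hours, month_hours, on_demand_rate):
    """Monthly cost if every instance hour is billed at the on-demand rate."""
    steady = baseline * month_hours * on_demand_rate
    burst = (peak - baseline) * burst_hours * on_demand_rate
    return steady + burst


def blended_cost(baseline, peak, burst_hours, month_hours, reserved_rate, spot_rate):
    """Monthly cost with the baseline on reserved pricing and bursts on spot pricing."""
    steady = baseline * month_hours * reserved_rate
    burst = (peak - baseline) * burst_hours * spot_rate
    return steady + burst


def savings_fraction(all_on_demand, blended):
    """Fraction of the all-on-demand bill saved by the blended approach."""
    return 1.0 - blended / all_on_demand


if __name__ == "__main__":
    # Instance counts from the text; every rate and hour count below is hypothetical.
    baseline, peak = 60, 150
    month_hours, burst_hours = 720, 100
    full = on_demand_cost(baseline, peak, burst_hours, month_hours, on_demand_rate=0.10)
    mixed = blended_cost(baseline, peak, burst_hours, month_hours,
                         reserved_rate=0.07, spot_rate=0.03)
    print(f"on-demand: ${full:,.2f}  blended: ${mixed:,.2f}  "
          f"saved: {savings_fraction(full, mixed):.0%}")
```

The actual savings depend on how often the cluster bursts and on current spot prices, so plugging in your own usage profile matters more than the specific numbers used here.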

The one place where Azure HDInsight may pull ahead is in end-user tools. If your Big Data analytics team is using Excel as its front-end analysis tool, then Azure delivers a Hive ODBC driver and a Hive add-on for Excel. That’s a smart move on Azure’s part, but it can be duplicated on EMR with some front-end planning.

Additionally, Microsoft is only now attempting to become a serious player in the business intelligence space, and whether SQL Server and Excel can really compete against dedicated and far more mature platforms available on AWS, such as Karmasphere Analyst, is a big question mark. Those platforms are allied with Amazon, available as EMR extensions, and can also be had under AWS’ pricing models (including pay-as-you-go).

Whether Azure, SQL Server and Excel can compete in BI against that kind of competition is still up in the air. Azure needs to prove itself at the service level, not just as the infrastructure as a service (IaaS) and cloud storage provider it has been so far. Microsoft has introduced new development tools that should allow developers to build such services, so we’ll likely see a closer race in the future. Currently, however, AWS remains our cloud service platform of choice.

-Kris Bliesner, CEO