Wednesday, 8 March 2017

Getting Started with Azure SQL Data Warehouse - Part 1


As the demand for data analytics grows so does the need for a technology or platform to process large amounts of different types of data in timely manner. A platform that can be easily scaled up or down, based on increased or decreased demand, has become an imminent need for all those organizations that want to do data analytics on a large scale of varied data.

Microsoft introduced Azure SQL Data Warehouse, a new enterprise-class, elastic petabyte-scale, data warehouse service that can scale according to organizational demands in just a few minutes. In this article series, we will dive deep into this data warehouse-as-a-service offering from Microsoft.

What is SQL Data Warehouse

Traditional data warehouses were designed to handle data with some sort of limitation on its overall volume and introduced latency during processing of those incoming data. Though these designs were acceptable to business owners for the last several decades, the demand has changed. Businesses want to analyze data in a near real-time or real-time manner, based on the data that comes from both internal sources as well as cloud applications, so that they can remain competitive and tap into new growth opportunities. Azure SQL Data Warehouse is one of the technologies that helps in meeting businesses’ changing demand by letting them create a modern data warehouse for storage and processing of vast amounts of data, as well as integrating well with existing tools and technologies to combine data from various sources so that you can visualize data in your familiar visualization tool for quicker insights.

Azure SQL Data Warehouse is a fully-managed (offered as Platform-as-a-Service(PaaS)) and highly scalable cloud based distributed relational database management system for data warehousing workloads. Azure SQL Data Warehouse is built on Massively Parallel Processing (MPP) architecture, capable of processing massive volumes of data (both relational and non-relational), processing data parallelly across multiple nodes and offering other enterprise-class features to handle enterprise data warehouse workloads.

Oracle Database Tutorials and Materials, Oracle Database Certifications
Figure 1 - An integrated application data and advanced analytics platform

Advantages of Azure SQL Data Warehouse

As you can see in the figure above, Azure SQL Data Warehouse has been designed to architect modern data warehouses, to which you can ingest data from a variety of sources, both on-premise data as well as cloud born data. Azure SQL Data Warehouse offers several advantages over traditional data warehouse technologies as well as over some of those other competitive technologies, some of which are:
  • Massively parallel processing (MPP) and Elastic scale - Azure SQL Data Warehouse is based on Massively Parallel Processing (MPP) architecture, which has been powering Analytics Platform Systems (APS) - formally known as Parallel Data Warehouse (PDW) - for several years now. One of the best part of Azure SQL Data Warehouse is, it separates the compute nodes from the data storage. Compute is deployed using Azure Virtual Machines, data is stored in Azure Page Blobs. This de-coupling allows Azure SQL Data Warehouse to be a truly elastic Data Warehouse, able to on-demand scale storage capacity and be able to scale compute capacity independently within few seconds.
  • End-to-end platform built for Modern Data Warehouse - Azure SQL Data Warehouse is based on Microsoft SQL Server and leverages much of the its capabilities, with additional capabilities to make it a distributed processing system. That means it doesn't have a steep learning curve and organizations can make use of existing resources\skills to start work on Azure SQL Data Warehouse.
  • Also, Azure SQL Data Warehouse integrates well with Azure Platform Services and other analytics solutions. With that capability in hand, you have the choice of either creating either a truly cloud or hybrid analytics solution (as shown in Figure above). Its seamless compatibility with Power BI, Azure Machine Learning, HDInsight, Azure Data Factory and other Azure Services as well as other non-Microsoft products (like Informatica, Tableau, Qlik etc.) makes it an ideal candidate for creating end to end analytics solutions. Its integration with development tools, like tools SSMS and Visual Studio SSDT, brings overall ease in development.
  • Market leading price/ performance - As mentioned earlier, Azure SQL Data Warehouse separates the compute nodes from the data storage; compute is scaled independently of the storage and vice versa. You don't need to provision enough storage or compute in advance; you can scale up or down in a few minutes based on your changing needs. When not in use, you can even pause compute resources, paying only for the storage during that time (anyway, storage is a commodity and not that expensive).
  • For example, during load time you can scale your Azure SQL Data Warehouse on tens of different compute nodes (Data Warehousing Unit - DWU is the unit of scale that I will be talking about in my next article) and scale it down to few nodes when the load decreases. Likewise, when you don't need to use it during weekends or holidays, you can pause it. This allows you to make optimal use of resources and you just pay for what you need or consume. For example, if you need your data warehouse during working hours, on working days (8*5) rather than all the time (24*7), then pausing your data warehouse when not in use will let you pay ~25% of the overall compute cost, which is a huge savings.
  • In-memory columnstore for 100x speed improvement - Like Microsoft SQL Server, Azure SQL Server Data Warehouse also supports both row-wise and column-wise storage mechanisms. Columnstore leverages state of art Vertipaq in-memory engine and improves query performance up to 100x. Unlike regular rowstore indexes or heaps, which store data in B-Tree structure (in row-wise fashion), the column store index stores data in columnar fashion and uses compression aggressively to reduce the disk I/O needed to serve the query request.
  • Polybase enables joining relational and non-relational external data - Polybase is a new technology built on top of T-SQL, which allows you to access and combine both relational and non-relational data, all within a single query in Azure SQL Data Warehouse. 
  • Security and Protection - Azure SQL Data Warehouse has multiple layers of protection. Its connection security feature allows you to restrict access using firewall rules to restrict to certain IP or IP range whereas Azure Active Directory (AAD) authentication allows connecting to Azure SQL Data Warehouse using identities in Azure Active Directory (Azure AD). Encryption at rest, in motion or in use ensures your data is not miss-used whereas built-in tools to auditing and monitoring data can help you in maintaining regulatory compliance, understand database activity, and gain insight into discrepancies and anomalies that could indicate business concerns or suspected security violations.
  • Automatic backup and geo-restore - To recover from data deletion or alteration or any sort of disaster, Azure SQL Data Warehouse backs up the database automatically every 4 hours (backup copies geo-replicated across multiple data centers) as well as provides the mechanism to take on-demand backup if there is a need.

Symmetric Multiprocessing vs. Shared-nothing Architecture

In most cases, traditional data warehouses have been deployed on Symmetric multiprocessing (SMP) machines, which have two or more identical processors\cores connected to a single, shared main memory, have full access to all I/O devices, and are controlled by a single operating system instance that treats all processors equally, reserving none for special purposes. Though they have been serving the purpose for several decades, it’s now poses challenges in terms of scale and performance that current businesses demand.

Azure SQL Data Warehouse is based on shared nothing architecture with multi-nodes, working together in parallel to process the request in a distributed manner. When you store data in Azure SQL Data Warehouse, the data gets distributed and stored across multiple locations. Each of those locations functions as an independent storage and processing unit that runs queries on its portion of the data. In the next article, I will go deeper into the architecture for Azure SQL Data Warehouse, different components and explain how you can easily scale up or down based on your need.

Oracle Database Certifications, Oracle Database Tutorials and Materials

Figure 2 - Symmetric Multiprocessing vs. Shared-nothing Architecture

Note - Distributed data is fundamental to running queries in parallel to achieve high query performance, which I will be talking in detail about in the next article of this series.