Wednesday, 15 March 2017

Getting Started with Azure SQL Data Warehouse - Part 2


Microsoft introduced Azure SQL Data Warehouse, a new enterprise-class, elastic petabyte-scale, data warehouse service that can scale according to organizational demands in just few minutes. In my last article of the series, I discussed the traditional way of creating a data warehouse and the challenges associated with it in the current scenario. Then we talked about Azure SQL Data Warehouse, and how it helps meet the changing demands for current business. In this article, I will cover the architecture of Azure SQL Data Warehouse and how you can scale up or down, based on your need.

SQL Data Warehouse – A look at its Architecture

SQL Data Warehouse leverages the proprietary Massively Parallel Processing (MPP) engine for distributed parallel processes and is based on master-slave architecture. At a minimum, you will have one control node (master) and one or more compute nodes (slaves) based on your need. Apart from control and compute nodes, you will have storage to store your data, which can scale completely independently from the size of compute.

Control node provides the endpoint for connecting to the SQL DW database and uses the regular SQL 1433 port for connectivity. The control node contains an instance of SQL Database for storage of meta data (persists no user data) and an MPP engine for distributed query optimization, distributed processing and coordination. The control node manages and optimizes queries. When a request for query execution comes in, an MPP engine on the control node, creates a distributed execution plan by breaking down the query into parallel processes and coordinating that processing across multiple compute nodes running in parallel. Apart from that, the control node also contains an instance of Data Management Service (DMS), which coordinates all the data movement with other nodes.

Oracle database Tutorials, Oracle Database Guide, Oracle Database Certifications
Figure 1 - SQL Data Warehouse - Architecture

Depending on your need, you might have one or more compute nodes. Each of these compute nodes contains SQL Database and does all the heavy lifting of executing query processes (assigned to them by the control node). Each compute node works on its local data (in certain cases there might be some data movement across compute nodes, which we will discuss later) and once it finishes processing the query, it returns its result to the control node. When the control node receives results from all the compute nodes, it then aggregates the data and returns the final result to the end-user who submitted the query for execution.

Data Movement Service (DMS) facilitates communication and data transportation across nodes in the SQL Data Warehouse. The control node and each compute node runs an instance of Data Movement Service locally for coordinating data movement between the control node and compute nodes and between the compute nodes themselves, making it possible to distribute data on load and to perform joins and aggregations across multiple compute nodes during query execution.

Azure Storage is used for storage of your data in two scenarios. In the first scenario, when we load data into the SQL DW table, data gets distributed (more on distribution in later) across compute nodes and persisted on Azure Storage. In the second scenario, you can add or load data into Azure Storage directly without incurring Azure SQL DW compute costs and then use Polybase to query data in SQL DW from the Azure storage. As discussed earlier, both compute and storage are de-coupled in SQL DW, enabling a true elasticity and letting you scale storage separately from scaling compute, and vice-versa. SQL DW also leverages Azure Storage to store backup copies of your SQL DW databases.

SQL Data Warehouse – Data Warehouse Unit (DWU)

Data Warehouse Unit (DWU) is a unit of measure of resources (CPU, memory and IOPS) assigned to your SQL DW database. You can think of a unit as one virtual machine, working as one compute node. You can start with DW100 (one compute node) and can go up to DW6000 (60 compute nodes). Increasing the number of DWUs increases resources and performance of your queries by way of parallelly executing them. You can scale SQL DW database by increasing or decreasing DWU simply by moving a slider in the Azure Portal or by the Properties page of the database in SQL Server Management Studio (SSMS) or by using PowerShell scripts.

DWU provides an abstraction and hides details on underlying hardware and software so that they can change or move without affecting your workload performance. With this abstraction in place, Microsoft can adjust the underlying architecture of the service without affecting the performance of your workload.

When you need faster performance, simply scale your SQL DW database up by increasing DWU and scale your SQL DW database down by decreasing DWU. This way, you can minimize the cost by paying only for resources in use, when needed. These are a few scenarios when increasing and decreasing DWU that will give you balance in terms of cost and performance:
  • Decrease SQL DW database to lower DWU when you have low usage; this way users will still be able to access data (though the performance might be slower) and you can still save on overall cost.
  • Pause your SQL DW database when not in use, for example during off-working hours or during weekends. Pausing your SQL DW database releases compute resources and you only pay for storage for data stored.
  • Increase SQL DW database DWU to secure more resources for heavy data loading or transformation operations and then during that time pay for greater performance.

Data Distribution and How It Relates to DWU

Distributed data is fundamental to running queries in parallel to achieve high query performance and hence when we store data into SQL DW database, it’s distributed across 60 fixed distributions (distributed location is called a distribution). Each of those distributions functions as an independent storage and processing unit that runs queries on its portion of the data.

Oracle database Tutorials, Oracle Database Guide, Oracle Database Certifications
Figure 2 - SQL DW distributions

SQL DW database has 60 fixed distributions and a flexible number of compute nodes which will work on those distributions. The number of compute nodes can change according to your workload and the computing capacity (DWUs) you specify for the SQL DW database. When the number of compute nodes changes, the number of distributions per compute node also changes. For example, in the figure below, the size of the SQL DW database is DW100 and hence there is only one compute node processing data for all those 60 distributions.

Oracle database Tutorials, Oracle Database Guide, Oracle Database Certifications
Figure 3 - SQL DW distributions assigned to DW100

Likewise, if you scale your SQL DW database to DW600, you will have 6 compute nodes, each processing data from 10 distributions.

Oracle database Tutorials, Oracle Database Guide, Oracle Database Certifications
Figure 4 - SQL DW distributions assigned to DW600

The number of compute nodes must divide 60 with zero remainder so that each compute node has an equal number of distributions to process and that’s the reason you will notice, the size does not increase in multiples of 100 but rather they are DWU100 (one compute to process 60 distributions), DWU200 (two compute nodes, each processing 30 distributions), DWU300 (three compute nodes, each processing 20 distributions), DWU400 (four compute nodes, each processing 15 distributions), DWU500 (five compute nodes, each processing 12 distributions), DWU600 (six compute nodes, each processing 10 distributions), DWU1000 (ten compute nodes, each processing 6 distributions), DWU1200 (twelve compute nodes, each processing 5 distributions), DWU2000 (twenty compute nodes, each processing 3 distributions), DWU3000 (thirty compute nodes, each processing 2 distributions) and DWU6000 (sixty compute nodes, each processing 1 distribution).

How data gets distributed across those 60 distributions is based on the types of target table and its design.