Azure Synapse Analytics Dedicated SQL Pools
Azure Synapse Analytics dedicated SQL pools is a PaaS relational database engine that is optimized for data warehouse workloads. Dedicated SQL pools use a scale-out MPP architecture to process very large amounts of data. This means that data is sharded into multiple distributions and processed across one or more compute nodes. To do this, dedicated SQL pools separate compute and storage by using a SQL engine to perform computations and Azure Storage to store the data. Even though data is stored in Azure Blob storage, dedicated SQL pools serve data to users in a relational format as tables or views.
Dedicated SQL pools shard data into 60 distributions across one or more compute nodes. There are three different distribution patterns to consider when creating tables or materialized views. The most optimal choice is going to depend on the size and nature of the table or materialized view. They include the following distribution patterns:
- Hash distribution uses a hash function to deterministically assign each row to a distribution. In the table or view definition, one of the columns is designated as the distribution column. The most optimal distribution columns have a high number of distinct values and an even amount of data skew. Hash distribution is the best option for large fact and dimension tables as it provides the best performance for joins and aggregations on large tables.
- Round-robin distribution is the simplest distribution pattern as it evenly shards data randomly across distributions. Data is loaded quickly to a table or view using round-robin distribution but it can cause performance issues as data is not organized in the most optimal manner across each distribution. Typical use cases for round-robin distribution include staging tables or using it if there are no columns with highly distinct values.
- Replicated tables or materialized views cache a full copy of the table or materialized view on the first distribution on each compute node. This provides the fastest query performance as data does not need to shuffle from one distribution to another when aggregated or joined. Because extra storage is required, replicated tables and materialized views are recommended for small tables or tables that contain static values.
Distribution design should be carefully considered since data distribution results in data being physically stored in different locations. For example, round-robin distribution tables or poorly chosen distribution columns on hash distributed tables could result in a lot of data shuffling when the data is queried. The more that data needs to be shuffled, the more time the query will take to complete.
Just as with Azure SQL Database, it is easy to scale a dedicated SQL pool up or down depending on workload demands through the Azure Portal, PowerShell, or T-SQL. Service level objectives (SLOs) represent the scalability setting of a dedicated SQL pool and determine the cost and performance level as well as the number of compute nodes allocated. These are measured by compute Data Warehouse Units (cDWUs) which are bundled compute units of CPU, memory, and I/O. Table 2.6 lists the available dedicated SQL pool SLOs.
TABLE 2.6 Dedicated SQL pool service level objectives
Performance Level | Compute Nodes | Distributions per Compute Node | Memory (GB) |
DW100c | 1 | 60 | 60 |
DW200c | 1 | 60 | 120 |
DW300c | 1 | 60 | 180 |
DW400c | 1 | 60 | 240 |
DW500c | 1 | 60 | 300 |
DW1000c | 2 | 30 | 600 |
DW1500c | 3 | 20 | 900 |
DW2000c | 4 | 15 | 1,200 |
DW2500c | 5 | 12 | 1,500 |
DW3000c | 6 | 10 | 1,800 |
DW5000c | 10 | 6 | 3,000 |
DW6000c | 12 | 5 | 3,600 |
DW7500c | 15 | 4 | 4,500 |
DW10000c | 20 | 3 | 6,000 |
DW15000c | 30 | 2 | 9,000 |
DW30000c | 60 | 1 | 18,000 |