13 August 2016

PolyBase import and export between Azure SQL Data Warehouse and Blob Storage

SQL Server PolyBase lets you mount data stored in either Azure blob storage or Hadoop as an external table in SQL Server. You can use this table to query data with normal Transact-SQL statements, as well as join it to other internally-held relational tables.

This makes any data load much more straightforward as you can move data using SQL statements rather than having to use any ETL or integration tooling. PolyBase supports two-way movement of data, so you can also use it to push data out of relational tables for archiving.

PolyBase uses parallel processing to read the underlying data, so the performance of external tables can be good enough to work on directly rather than importing them into local relational tables. PolyBase can also push computation out to Hadoop via MapReduce jobs if doing so will improve query performance.

The Hadoop aspects of PolyBase are not available in Azure SQL Data Warehouse, but you can still use it to connect to data stored in Azure Blob Storage. If you’re trying to upload a large amount of data to a SQL Data Warehouse, using Azure blobs and PolyBase is probably the most straightforward approach.

Importing data to SQL Data Warehouse

The more common use case is using PolyBase to load SQL Data Warehouse data from uploaded Azure blobs. You can create an external table based on as many files as you want, but they all need to be in the same format and live in the same location in Azure blob storage. Be aware that PolyBase also requires the files to be UTF-8 encoded.

Step one: Create the master key

The first step is to ensure that a master key has been created in the database. This is used to encrypt the credentials for external data sources. A master key is created using the following SQL statement – you only need to run it once, as it will throw an error if a key already exists:

CREATE MASTER KEY;
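If you want the script to be safely re-runnable, you can guard the statement by checking the catalog view first. This is a sketch that relies on the fixed name SQL Server gives the database master key:

-- Only create the master key if one does not already exist
IF NOT EXISTS (
    SELECT 1
    FROM sys.symmetric_keys
    WHERE name = '##MS_DatabaseMasterKey##'
)
    CREATE MASTER KEY;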

Step two: Create the data source

An external table is mapped to a container in Azure blob storage using an external data source. Before creating a source you have to create a database scoped credential containing your blob storage access key. The SQL below demonstrates this – note that the IDENTITY parameter is not sent to Azure, it’s just there for identification.

CREATE DATABASE SCOPED CREDENTIAL TestStorageCredential
WITH
    IDENTITY = 'user',
    SECRET = '[STORAGE ACCESS KEY]'
;

The data source is created with a reference to the credentials using the syntax below.

CREATE EXTERNAL DATA SOURCE TestStorageSource
WITH 
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://[CONTAINER NAME]@[STORAGE ACCOUNT NAME].blob.core.windows.net',
    CREDENTIAL = TestStorageCredential
);

The “TYPE” parameter is set to “HADOOP” as PolyBase uses Hadoop’s APIs to access data in Azure blob storage.

Step three: Create the file format

The file format provides instructions on how to interpret the files in your container. Bear in mind that PolyBase assumes that all the files in the container location will be in the same format – you cannot specify more than one format for a single table. If you do have multiple file formats then you will need to segregate them into separate external tables.

PolyBase supports several different file formats, most of which are columnar formats particular to Hadoop and MapReduce (e.g. RCFile, ORC and Parquet). The most relevant format in this context is delimited text, and you have a reasonable amount of freedom in specifying the formatting, e.g.

  • You can specify different field terminators (e.g. comma or tab) and delimiters for strings
  • You can choose which date format to apply
  • You can also choose how to deal with missing data, i.e. use the type default or a null

The example below creates a comma-delimited file format with double quote string delimiters. Note that Gzip compression is also specified – PolyBase is happy to read and process compressed files.

CREATE EXTERNAL FILE FORMAT CommaDelimitedFile
WITH
(
    FORMAT_TYPE = DelimitedText,
    FORMAT_OPTIONS 
    (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        DATE_FORMAT = 'dd/MM/yyyy',
        USE_TYPE_DEFAULT = FALSE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);

Step four: Create the external table

The CREATE EXTERNAL TABLE command is used to wire up a new external table to the data source and file format created in previous steps. The command also lets you specify the file layout in terms of name and data type for each column.

The LOCATION argument can be used to segment files within a blob container by specifying a start point. If you want to include every file in the container then use a location of “.”, otherwise you can specify a sub-folder as shown in the example below. PolyBase will process every file it finds below the start location, except for hidden files and sub-folders.

File headers can be tricky as CREATE EXTERNAL TABLE does not provide any options for ignoring them. The default behaviour is to read the header line like any other, which can lead to a lot of invalid data type errors.

The workaround for this is to use the row rejection settings to add a small amount of tolerance for invalid rows. In the example below the three REJECT_ options tell PolyBase to accept a failure rate of 1% of imported rows, sampled every 1,000 rows – that’s enough to absorb the headers.

CREATE EXTERNAL TABLE [dbo].[Example_External] 
(
    [ProductId]    int             NOT NULL
,   [ProductCode]  nvarchar(20)    NOT NULL
,   [DateSold]     datetime2       NOT NULL
,   [Price]        decimal(18, 0)  NOT NULL
)
WITH 
(
    LOCATION='/test1/',
    DATA_SOURCE = TestStorageSource,
    FILE_FORMAT = CommaDelimitedFile,
    REJECT_TYPE = percentage,
    REJECT_VALUE = 1,
    REJECT_SAMPLE_VALUE = 1000
);

Step five: Fire SQL at your new table

Your data isn’t actually accessed until you throw a SQL statement at it, at which point PolyBase gets to work processing all the files it can find in the data source. From here loading data is a matter of writing SQL scripts that move data from one table to another. Be aware that some Transact-SQL features that you might be inclined to use in this context are not supported by SQL Data Warehouse, including MERGE and SELECT INTO. In these cases, the fully parallel CREATE TABLE AS SELECT (CTAS) statement can be used to achieve the same result.
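For example, the external table from step four could be loaded into a new internal table with a single CTAS statement. This is a sketch – the distribution column and index choices are illustrative rather than recommendations:

-- Load the external table into a new internal table in one parallel operation
CREATE TABLE [dbo].[Example_Internal]
WITH
(
    DISTRIBUTION = HASH([ProductId]),  -- illustrative distribution choice
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT  [ProductId],
        [ProductCode],
        [DateSold],
        [Price]
FROM    [dbo].[Example_External];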

Bear in mind that Azure SQL Data Warehouse does not support automatic updating of table statistics. If you do load a significant amount of data into an internal table then you are expected to update the statistics on each column manually using the CREATE STATISTICS statement.
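For instance, after a load you might create single-column statistics on the columns that will appear in joins and filters – the table and column names here follow the earlier example:

-- Create statistics on columns used in joins, filters and aggregations
CREATE STATISTICS [stat_ProductId] ON [dbo].[Example_Internal] ([ProductId]);
CREATE STATISTICS [stat_DateSold]  ON [dbo].[Example_Internal] ([DateSold]);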

Exporting data using PolyBase

To export data you create a new external table using the AS SELECT form of CREATE EXTERNAL TABLE. In the example below the results of the SELECT query will be written to the location and data source defined in the WITH clause using the data format specified.

CREATE EXTERNAL TABLE [dbo].[Example_Exported]
WITH 
(
    LOCATION='/test1/export/',
    DATA_SOURCE = TestStorageSource,
    FILE_FORMAT = CommaDelimitedFile
)
AS
SELECT  [ProductId],
        [ProductCode] 
FROM    [dbo].[Example_Internal]
WHERE   [Price] > 10;

You will need to do some tidying up of the output. When PolyBase exports data to Azure blob storage it creates a separate export file for each of the sixty underlying distributions in SQL Data Warehouse. You don’t have any control over the naming of these files, but you can at least specify where they will be written via the LOCATION clause. You will also have to write your own file headers as these are not supported.

Filed under Integration, SQL Server.