
Redshift Datasets

The Amorphic Dataset portal helps you create structured, semi-structured, and unstructured Datasets. These Datasets can serve as a single source of truth across the different departments of an organization and provide complete visibility into the data lake.

Datasets

The Amorphic Dataset page provides options to list existing Datasets or create a new one. You can filter the listed Datasets by Domain, create new Datasets, or view Dataset details.

This page explains the usage of datasets with Redshift as the target location. Amorphic gives users the ability to store CSV/TSV/XLSX files in S3 without the overhead of maintaining a data warehousing solution, for cost effectiveness.

With Redshift datasets as the target location, Amorphic provides an option to perform partial data validation on uploaded files. Data validation is enabled by default for the Redshift target location and can be enabled or disabled at any point in time. Each file is sampled (read partially) and every column is validated against the schema that was uploaded to the dataset during registration. This helps users quickly detect and correct corrupt or invalid data files, but it takes a few extra seconds per file to validate and incurs an additional charge per file.

As of now, users can register structured data (i.e. CSV, TSV, XLSX, and Parquet files), and validation supports data types such as String/Varchar, Integer, Double, Boolean, Date, and Timestamp. To accommodate complex data structures, we recommend enclosing them in quote characters and registering the column schema as String/Varchar; once loaded, users can perform ETL on top of the data and cast it appropriately.

The CSV Parser/SerDe recommended by AWS Athena has the following limitations:

  • Does not support embedded line breaks in CSV files.
  • Does not support empty fields in columns defined as a numeric data type.

As per the AWS documentation, one workaround is to import such values as string columns and create views on top of them that cast the data to the required types.
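For illustration, here is a minimal sketch of this workaround, assuming a table whose columns were all registered as String/Varchar; the domain `my_domain`, the table `raw_events`, and the column names are hypothetical placeholders:

```sql
-- Sketch: load all columns as VARCHAR, then expose typed columns
-- through a view. "my_domain", "raw_events", and the column names
-- below are hypothetical placeholders.
CREATE VIEW my_domain.typed_events AS
SELECT
    CAST(event_id   AS INTEGER)          AS event_id,
    CAST(amount     AS DOUBLE PRECISION) AS amount,
    CAST(is_active  AS BOOLEAN)          AS is_active,
    CAST(event_date AS DATE)             AS event_date,
    event_note                           -- kept as VARCHAR
FROM my_domain.raw_events;
```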

Create Redshift Datasets

You can create new Datasets with a wide range of target locations. This section describes using Redshift as the target location for Datasets. Currently, only structured data with the file formats CSV or XLSX is supported. The following animation shows a detailed workflow of creating Datasets with Redshift as the target.

(Animation: creating a Dataset with Redshift as the target)

Redshift Datasets Permissions

  • The owner of the Redshift dataset has USAGE and CREATE permissions on the Redshift cluster schema (Domain), as well as SELECT, INSERT, DELETE, and UPDATE permissions on the Redshift cluster table (Dataset).
  • A read-only user has USAGE permission on the Redshift cluster schema (Domain), as well as SELECT permission on the Redshift cluster table (Dataset).
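For illustration, the equivalent Redshift grants might look like the sketch below; Amorphic applies these permissions automatically, and the schema `sales`, table `orders`, and user names are hypothetical:

```sql
-- Sketch of the grants that the permission model above corresponds to.
-- "sales", "orders", "owner_user", and "readonly_user" are hypothetical.

-- Dataset owner: schema-level access plus full table-level access
GRANT USAGE, CREATE ON SCHEMA sales TO owner_user;
GRANT SELECT, INSERT, DELETE, UPDATE ON TABLE sales.orders TO owner_user;

-- Read-only user: schema usage plus read access on the table
GRANT USAGE ON SCHEMA sales TO readonly_user;
GRANT SELECT ON TABLE sales.orders TO readonly_user;
```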

Redshift DBT support

With the 1.13 release, users can create their own tables and views while working with dbt; however, syncing data back from these new tables and views is not supported in this release.

Users need to provide an existing Domain (schema) name as the dbt schema; they can then modify and create new tables/views under that schema.
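For illustration, a minimal sketch of a dbt model that creates a view under such a schema, assuming the dbt profile's schema is set to an existing Amorphic Domain; `my_domain` and `customer_orders` are hypothetical names:

```sql
-- models/daily_order_counts.sql
-- Minimal dbt model sketch: materializes a view in the schema
-- configured in the dbt profile (an existing Amorphic Domain).
-- "my_domain" and "customer_orders" are hypothetical placeholders.
{{ config(materialized='view') }}

select
    order_date,
    count(*) as order_count
from my_domain.customer_orders
group by order_date
```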

Redshift Column/Table Constraints and Attributes Support

With the 1.14 release, users can create Redshift datasets with additional table constraints and column constraints/attributes. Check the AWS Redshift CREATE TABLE documentation to learn more. A sketch combining these options follows the list below.

  • Table Constraints:

    • UNIQUE : Specifies that the column can contain only unique values.
    • PRIMARY KEY : Specifies that the column is the primary key for the table.
    Note: UNIQUE and PRIMARY KEY constraints are informational only and aren't enforced by Redshift. Redshift automatically converts columns defined as PRIMARY KEY to NOT NULL by default.

  • Column Attributes:
    • IDENTITY : Specifies that the column is an IDENTITY column. To add an identity column, the user needs to provide values for the identity column name, seed, and step.
  • Column Constraints:
    • NOT NULL : Specifies that the column isn't allowed to contain null values.
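As referenced above, here is a minimal sketch of a Redshift table combining these constraints and attributes; the domain, table, and column names are hypothetical:

```sql
-- Sketch combining the supported constraints and attributes.
-- "my_domain", "customers", and the column names are hypothetical.
CREATE TABLE my_domain.customers (
    customer_id BIGINT IDENTITY(1, 1),   -- IDENTITY(seed, step)
    email       VARCHAR(256) NOT NULL,   -- column constraint
    country     VARCHAR(64),
    UNIQUE (email),                      -- informational only
    PRIMARY KEY (customer_id)            -- informational; implies NOT NULL
);
```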

Redshift Data Conversion Parameters Support

With the 1.14 release, users can create Redshift datasets with data conversion parameters. When data is loaded into Redshift, Amorphic implicitly converts the strings in the source data to the data type of the target column.

Data conversion parameters are used to specify a conversion that differs from the default behavior, or when the default conversion results in errors while loading data into the Redshift database. Check the AWS Data Conversion Parameters documentation for details.

The following data conversion parameters are supported:

'ACCEPTANYDATE', 'ACCEPTINVCHARS', 'BLANKSASNULL', 'DATEFORMAT', 'EMPTYASNULL', 'ENCODING', 'ESCAPE', 'EXPLICIT_IDS', 'FILLRECORD', 'IGNOREBLANKLINES', 'NULL', 'REMOVEQUOTES', 'ROUNDEC', 'TIMEFORMAT', 'TRIMBLANKS', 'TRUNCATECOLUMNS'
Note: For the data conversion parameters DATEFORMAT, TIMEFORMAT, and NULL, the user must provide a valid argument. Arguments for ENCODING and ACCEPTINVCHARS are optional. All other data conversion parameters take no arguments.
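For illustration, a minimal sketch of how such parameters appear on a Redshift COPY command, the underlying load mechanism; the table, S3 path, and IAM role are hypothetical placeholders:

```sql
-- Sketch of data conversion parameters on a Redshift COPY command.
-- "my_domain.orders", the S3 path, and the IAM role are hypothetical.
COPY my_domain.orders
FROM 's3://my-bucket/uploads/orders.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV
DATEFORMAT 'YYYY-MM-DD'     -- requires an argument
TIMEFORMAT 'auto'           -- requires an argument
ACCEPTINVCHARS '?'          -- argument optional
BLANKSASNULL                -- no argument
EMPTYASNULL                 -- no argument
TRUNCATECOLUMNS;            -- no argument
```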

Data conversion parameters are available under the profile section of Redshift-type datasets.

The image below shows how to add data conversion parameters for a registered Redshift dataset.

Redshift Data Conversion Parameters

Users can also edit or remove data conversion parameters by following the same steps shown in the image above.