Version: v1.14

Redshift Datasets

Amorphic Dataset portal helps you create structured, semi-structured and unstructured Datasets. These Datasets can be used as a single source of truth across the different departments of an organization, and they provide complete visibility into the data lake.

Datasets

The Amorphic Dataset page provides options to list existing Datasets or create a new one. From the listing you can filter Datasets by Domain, create new Datasets, or view the details of a Dataset.

This page explains the usage of datasets with Redshift as the target location. Amorphic gives users the ability to store CSV/TSV/XLSX files in S3 cost-effectively, without the overhead of maintaining a data warehousing solution themselves.

With Redshift datasets as the target location, Amorphic offers an option to perform partial data validation on the uploaded files. Data Validation is enabled by default for the Redshift target location and can be enabled or disabled at any time. Each file is sampled (read partially) and every column is validated against the schema that was registered with the dataset. This helps users quickly detect and correct corrupt or invalid data files, but it adds a few extra seconds per file and incurs an additional per-file charge.

As of now, users can register structured data (CSV, TSV, XLSX and Parquet files), and validation supports the data types String/Varchar, Integer, Double, Boolean, Date and Timestamp. To accommodate complex data structures, we recommend enclosing them in quote characters and registering the column as String/Varchar; once loaded, users can perform ETL on top of the data and cast the values appropriately.
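For intuition, the partial validation is conceptually similar to the Python sketch below, which samples the first rows of a CSV file and checks each value against the registered column type. This is an illustration only, not Amorphic's actual implementation; the type names and date/timestamp formats are assumptions.

    import csv
    from datetime import datetime

    # Parsers for the validated types; formats here are illustrative assumptions.
    PARSERS = {
        "integer": int,
        "double": float,
        "boolean": lambda v: {"true": True, "false": False}[v.lower()],
        "date": lambda v: datetime.strptime(v, "%Y-%m-%d"),
        "timestamp": lambda v: datetime.strptime(v, "%Y-%m-%d %H:%M:%S"),
        "varchar": str,
    }

    def validate_sample(path, schema, sample_rows=100):
        """Check a sample of rows; schema is a list of (column_name, type_name)."""
        errors = []
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader)  # skip the header row
            for i, row in enumerate(reader):
                if i >= sample_rows:
                    break
                for (name, typ), value in zip(schema, row):
                    try:
                        PARSERS[typ](value)
                    except (ValueError, KeyError):
                        errors.append((i, name, value))
        return errors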

The CSV Parser/SerDe recommended by AWS Athena has the following limitations:

  • Does not support embedded line breaks in CSV files.
  • Does not support empty fields in columns defined as a numeric data type.

As per the AWS documentation, one workaround is to import such fields as string columns and create views on top of them that cast the values to the required data types, as sketched below.
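As an illustration of that workaround (the schema, table and column names below are hypothetical), the raw columns are registered as strings and a view casts them on read, turning empty fields into NULLs first:

    # Hypothetical illustration of the string-column workaround.
    VIEW_SQL = """
    CREATE VIEW analytics.orders_typed AS
    SELECT
        order_id,
        CAST(NULLIF(amount_raw, '') AS DECIMAL(10,2)) AS amount,  -- empty string -> NULL, then cast
        CAST(NULLIF(qty_raw, '') AS INTEGER) AS qty
    FROM analytics.orders_raw;
    """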

Create Redshift Datasets

You can create new Datasets with a wide range of target locations. This section describes using Redshift as the target location for datasets. Currently, only structured data in CSV or XLSX format is supported. The following animation shows a detailed workflow for creating a dataset with Redshift as the target.

[Animation: creating a Redshift dataset]

Redshift Datasets Permissions

  • The owner of a Redshift dataset has USAGE and CREATE permissions on the Redshift cluster schema (Domain), as well as SELECT, INSERT, DELETE and UPDATE permissions on the Redshift cluster table (Dataset).
  • A read-only user has USAGE permission on the Redshift cluster schema (Domain), as well as SELECT permission on the Redshift cluster table (Dataset). A rough SQL equivalent of this model is sketched below.
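For illustration only, the permission model above corresponds roughly to the following SQL grants. The endpoint, credentials, schema, table and user names are placeholders; Amorphic manages these grants itself.

    import psycopg2

    conn = psycopg2.connect(
        host="my-cluster.example.redshift.amazonaws.com",  # placeholder endpoint
        port=5439, dbname="analytics", user="admin", password="...",
    )
    # Owner: USAGE + CREATE on the schema (Domain), full DML on the table (Dataset).
    owner_sql = """
        GRANT USAGE, CREATE ON SCHEMA sales TO ds_owner;
        GRANT SELECT, INSERT, DELETE, UPDATE ON sales.orders TO ds_owner;
    """
    # Read-only user: USAGE on the schema, SELECT on the table.
    reader_sql = """
        GRANT USAGE ON SCHEMA sales TO ds_reader;
        GRANT SELECT ON sales.orders TO ds_reader;
    """
    with conn, conn.cursor() as cur:
        cur.execute(owner_sql)
        cur.execute(reader_sql)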

Redshift DBT support

With the 1.13 release, users are able to create their own tables and views while working with dbt; however, syncing data back from these new tables and views is not supported in this release.

Users need to provide an existing domain (schema) name as the dbt schema; they will then be able to modify existing tables/views and create new ones under that schema.

Redshift Column/Table Constraints and Attributes Support

With the 1.14 release, users are able to create Redshift datasets with additional table constraints, column constraints and column attributes (a DDL sketch follows the list below). Check the AWS Redshift CREATE TABLE documentation to learn more.

  • Table Constraints:

    • UNIQUE : Specifies that the column can contain only unique values.
    • PRIMARY KEY : Specifies that the column is the primary key for the table.
    Note: UNIQUE and PRIMARY KEY constraints are informational only; they aren't enforced by Redshift.

  • Column Attributes:
    • IDENTITY : Specifies that the column is an IDENTITY column. To add an identity column, users need to provide values for the Identity Column Name, Seed and Step.
  • Column Constraints:
    • NOT NULL : Specifies that the column isn't allowed to contain null values.
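For illustration, a dataset registered with these options maps to Redshift DDL along the following lines. The table and column names are hypothetical, and this is a sketch rather than the exact statement Amorphic generates.

    # Hypothetical DDL corresponding to the constraints and attributes above.
    CREATE_TABLE_DDL = """
    CREATE TABLE sales.orders (
        order_id   BIGINT IDENTITY(1, 1),        -- IDENTITY(seed, step) attribute
        order_ref  VARCHAR(64) NOT NULL UNIQUE,  -- column constraint + UNIQUE
        created_at TIMESTAMP,
        PRIMARY KEY (order_id)                   -- informational only; not enforced
    );
    """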

Redshift Data Conversion Parameters Support (API Only)

With the 1.14 release, users are able to create Redshift datasets with Data Conversion Parameters. When data is loaded into Redshift, Amorphic implicitly converts the strings in the source data to the data type of the target column.

Data Conversion Parameters are used to specify a conversion that is different from the default behavior, or to handle cases where the default conversion results in errors while loading data into the Redshift database. Check the AWS Data Conversion Parameters documentation for details.

Below are the Data Conversion Parameters that are supported (an example follows the list):

'ACCEPTANYDATE', 'ACCEPTINVCHARS', 'BLANKSASNULL', 'DATEFORMAT', 'EMPTYASNULL', 'ENCODING', 'ESCAPE', 'EXPLICIT_IDS', 'FILLRECORD', 'IGNOREBLANKLINES', 'NULL', 'REMOVEQUOTES', 'ROUNDEC', 'TIMEFORMAT', 'TRIMBLANKS', 'TRUNCATECOLUMNS'
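These parameters map to options of the Amazon Redshift COPY command. For context, a hypothetical COPY using a few of them might look like the sketch below; the table, bucket and IAM role are placeholders.

    # Hypothetical COPY statement showing a few Data Conversion Parameters.
    COPY_SQL = """
    COPY sales.orders
    FROM 's3://my-bucket/uploads/orders.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV
    ACCEPTANYDATE          -- accept any parseable date format
    BLANKSASNULL           -- load blank fields as NULL
    TIMEFORMAT 'auto'      -- auto-detect the time format
    TRUNCATECOLUMNS;       -- truncate values that exceed the column width
    """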

API List

  • /datasets/{id}?request=get_conversion_params

    • GET -- Returns available Data Conversion Parameters in the format below

      Details about available Data Conversion Parameters

      {
        "DataConversionParameters": [
          {
            "param": "ACCEPTANYDATE",
            "argument_required": "no"
          },
          {
            "param": "ACCEPTINVCHARS",
            "argument_required": "optional"
          },
          {
            "param": "ENCODING",
            "argument_required": "optional",
            "valid_arguments": ["UTF8", "UTF16", "UTF16LE", "UTF16BE"]
          },
          {
            "param": "NULL",
            "argument_required": "yes"
          },
          ...
        ]
      }

      The param key specifies the Data Conversion Parameter, and argument_required specifies the argument requirement for each parameter. Possible values for argument_required are yes, no or optional. (See the example request below.)

    Note

    Please replace {id} with the DatasetId received in the backend response when creating the dataset.
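A minimal sketch of calling this endpoint, assuming a placeholder base URL and a token-based Authorization header (adjust both to your deployment):

    import requests

    BASE_URL = "https://api.example.amorphic.com"  # placeholder; use your deployment's URL
    HEADERS = {"Authorization": "Bearer <token>"}  # assumption: token auth

    dataset_id = "1234-abcd"  # DatasetId returned when creating the dataset
    resp = requests.get(
        f"{BASE_URL}/datasets/{dataset_id}",
        params={"request": "get_conversion_params"},
        headers=HEADERS,
    )
    resp.raise_for_status()
    for p in resp.json()["DataConversionParameters"]:
        print(p["param"], p["argument_required"])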

API Request Payload Details

  1. To complete dataset registration

    /datasets/{id}/ with PUT method

    Sample request payload to complete Redshift dataset registration with Data Conversion Parameters

    {
      "DatasetSchema": [
        {
          "name": <String>,
          "type": <String>
        },
        {
          "name": "name",
          "type": "varchar(256)"
        }
      ],
      "SortType": <String>,
      "SortKeys": <List of SortKeys> (Optional/Required field based on SortType),
      "DistType": <String>,
      "DistKey": <List of DistKeys> (Optional/Required field based on DistType),
      "DataConversionParams": {
        "<Param_1>": <String>,
        "<Param_2>": <String>,
        ...
      }
    }
    Note

    Please replace the variables Param_1, Param_2, ... with the param values (e.g., ACCEPTANYDATE or NULL) received from the backend when calling the API endpoint /datasets/{id} with the GET method and query parameter request=get_conversion_params, and set each parameter's value to a valid argument.

Note

When completing registration, if a Data Conversion Parameter's argument_required is yes or optional, the user must provide a valid argument. If argument_required is no, the value of the argument should be an empty string ("").
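Putting it together, a minimal sketch of completing registration with Data Conversion Parameters, assuming the same placeholder base URL and auth header as above (the schema, sort/dist values and parameter arguments are illustrative, not prescribed values):

    import requests

    BASE_URL = "https://api.example.amorphic.com"  # placeholder
    HEADERS = {"Authorization": "Bearer <token>"}  # assumption: token auth

    dataset_id = "1234-abcd"
    payload = {
        "DatasetSchema": [
            {"name": "order_id", "type": "integer"},
            {"name": "name", "type": "varchar(256)"},
        ],
        "SortType": "compound",        # illustrative values; consult the API docs
        "SortKeys": ["order_id"],
        "DistType": "even",
        "DataConversionParams": {
            "ACCEPTANYDATE": "",       # argument_required == "no" -> empty string
            "NULL": "\\N",             # argument_required == "yes" -> valid argument
        },
    }
    resp = requests.put(f"{BASE_URL}/datasets/{dataset_id}/", json=payload, headers=HEADERS)
    resp.raise_for_status()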