Version: v1.13

DynamoDB Datasets

Amorphic DynamoDB Datasets help you create structured datasets that can serve as a single source of truth across the different departments of an organization. Amorphic datasets help provide complete visibility into the data lake.

Note

As of version 1.10, Amorphic supports DynamoDB as a dataset target location only via the API. Create, update, and delete operations on DynamoDB datasets are supported only through the API; once a dataset is created, users can view its details, upload files, and perform all other dataset operations from the Amorphic portal.

Create New Dataset

You can create a new DynamoDB dataset in Amorphic using the API below.

POST https://{{gateway_id}}.execute-api.{{region}}.amazonaws.com/{{environment}}/datasets


To create a new dataset, you need information such as the Domain, Connection Type, and File Type. The main fields required to create a new dataset are listed below.

Amorphic datasets have a hierarchical structure: files belong to datasets, and datasets belong to a domain. Hence, to create a dataset, you must first create a domain using Amorphic Administration, then create the dataset, and finally upload the structured, semi-structured, or unstructured files to it.

Please check the documentation on Create New Dataset for more details and the mandatory fields required for dataset creation.

  • DatasetName: Name of the dataset. The name must be 3-70 characters long (3-64 if the DWH is AuroraMySQL) and may contain only alphanumeric and underscore (_) characters. Your dataset name must be unique across the application.

  • DatasetDescription: Description of the dataset. Please describe your dataset in as much detail as possible. The full text of this field is searchable within the application.

  • Domain: The domain groups related datasets to keep them organized. The domain name is also used when creating the DynamoDB table.

  • DataClassification: Classifies the dataset into categories (for example, PCI or PII) so that the data can be protected more effectively.

  • Keywords: Keywords for the dataset. Keywords are indexed and searchable within the application. Please choose keywords which are meaningful to you and others. You can use these to flag related datasets with the same keywords so that you can easily find them later.

  • ConnectionType: Amorphic currently supports the connection types below.

    • api : The default connection type, used for manual upload of files to the dataset. Please check the documentation on Dataset Files for manual upload of files to the dataset.
    • jdbc : For ingesting data from a JDBC connection (as source) into the Amorphic dataset. Requires a schedule for data ingestion.
    • s3 : For ingesting data from an S3 connection (as source) into the Amorphic dataset. Requires a schedule for data ingestion.
    • ext-api : For ingesting data from an external API (as source) into the Amorphic dataset. Requires a schedule for data ingestion.
  • FileType: The supported file types for DynamoDB datasets are csv, tsv, xlsx, and parquet.

  • TargetLocation:

    • dynamodb : Files uploaded to the dataset (either manually or through ingestion) will be stored in the DynamoDB table.
  • TableUpdate: DynamoDB datasets currently support the two update methods below.

    • append : With this update method, new data is appended to the existing data.

    • reload : With this update method, the dataset's data is reloaded from scratch. The two options below are exclusive to reload-type datasets.

      • TargetTablePrepMode

        • recreate
        • truncate
      • SkipTrash (Optional) : When SkipTrash is true, old data is not moved to the Trash bucket during the data reload process. Defaults to true when not provided.

      Based on the above reload settings, data reload process times can vary.
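To illustrate how these fields fit together, below is a minimal sketch of the create-dataset call in Python. The gateway ID, region, environment, authorization header, and all field values are placeholder assumptions; consult the Create New Dataset documentation for the authoritative list of mandatory fields and accepted values.

import requests

# Placeholder deployment values -- substitute your own gateway ID, region,
# environment, and authorization token.
BASE_URL = "https://abc123.execute-api.us-west-2.amazonaws.com/develop"
HEADERS = {"Authorization": "Bearer <your-token>", "Content-Type": "application/json"}

# Request body built from the fields described above; exact value types
# (e.g. for DataClassification) may differ in your Amorphic version.
create_body = {
    "DatasetName": "dynamodb_dataset_1",
    "DatasetDescription": "Sample DynamoDB dataset created via API",
    "Domain": "dataset_demo",
    "DataClassification": ["pii"],
    "Keywords": ["dynamodb", "demo"],
    "ConnectionType": "api",
    "FileType": "csv",
    "TargetLocation": "dynamodb",
    "TableUpdate": "append",
}

response = requests.post(BASE_URL + "/datasets", headers=HEADERS, json=create_body)
response.raise_for_status()
print(response.json())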

Schema Publish

When a user registers a dataset with DynamoDB as the target location, they can define the dataset schema and DynamoDB table keys using the API below.

PUT https://{{gateway_id}}.execute-api.{{region}}.amazonaws.com/{{environment}}/datasets/{{dataset_id}}


Following are the details that are required to publish a schema and complete the registration process of a dataset.

{
  "DatasetSchema": [
    {
      "name": "string",
      "type": "string"
    }
  ],
  "PartitionKey": {
    "name": "string",
    "type": "string"
  },
  "SortKey": {
    "name": "string",
    "type": "string"
  },
  "LocalSecondaryIndexes": [
    {
      "IndexName": "string",
      "SortKey": {
        "name": "string",
        "type": "string"
      },
      "ProjectionType": "string"
    }
  ],
  "GlobalSecondaryIndexes": [
    {
      "IndexName": "string",
      "PartitionKey": {
        "name": "string",
        "type": "string"
      },
      "SortKey": {
        "name": "string",
        "type": "string"
      },
      "ProjectionType": "string"
    }
  ]
}
Note
  • DynamoDB datasets support only STRING, NUMBER, and BINARY data types for columns. This applies to every column defined in the schema above.
  • If the schema publish (PUT) call times out after 30 seconds, delete the dataset from the Amorphic portal, then try creating the dataset and publishing the schema again through the API.
  • DatasetSchema: Schema of the dataset with the list of columns (column name and column type). It must contain all the columns that are expected from the input source of this dataset.

  • PartitionKey: The PartitionKey is an attribute that forms part of the primary key. If the dataset has only a PartitionKey, no two items can have the same partition key value. If a SortKey is also included, the composite of the two attributes must uniquely identify an item in the dataset. For more information on partition keys and sort keys, refer to the AWS documentation.

  • SortKey: The SortKey is an attribute that forms part of the primary key. This is an optional field; if provided, the composite of PartitionKey and SortKey must uniquely identify an item in the dataset. For more information on partition keys and sort keys, refer to the AWS documentation.

  • LocalSecondaryIndexes: An index that has the same partition key as the dataset but a different sort key. This is an optional field; if provided, the user needs to supply a SortKey that follows the rules of the primary key. LSIs can be created only during the dataset creation process and cannot be updated or deleted afterwards. A single dataset can have at most 5 local secondary indexes. For more information on local secondary indexes, refer to the AWS documentation.

    • IndexName: A unique name for the local secondary index.

    • SortKey: Similar to the table SortKey; the composite of PartitionKey and SortKey must uniquely identify an item in the dataset, and the same rules apply within local secondary indexes.

    • ProjectionType: Represents the attributes that are copied (projected) from the table into the local secondary index. One of the options below can be provided:

      • ALL: All of the table attributes are projected into the index.

      • KEYS_ONLY: Only the index and primary keys are projected into the index.

  • GlobalSecondaryIndexes: An index with a partition key and sort key that can be different from those on the table. This is an optional field; if provided, the user needs to supply a PartitionKey and SortKey that follow the rules of the primary key. GSIs can be created during the dataset creation process and can also be created or deleted later by updating the dataset. A single dataset can have a maximum of 20 global secondary indexes. For more information on global secondary indexes, refer to the AWS documentation.

    • IndexName: A unique name for the global secondary index.

    • PartitionKey: Similar to the table PartitionKey; the same rules apply within global secondary indexes.

    • SortKey: Similar to the table SortKey; the composite of PartitionKey and SortKey must uniquely identify an item in the dataset, and the same rules apply within global secondary indexes.

    • ProjectionType: Represents the attributes that are copied (projected) from the table into the global secondary index. One of the options below can be provided:

      • ALL: All of the table attributes are projected into the index.

      • KEYS_ONLY: Only the index and primary keys are projected into the index.
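Putting this together, below is a minimal sketch of the schema publish call in Python. The endpoint values, token, dataset ID, column names, and the example GSI are all illustrative assumptions.

import requests

# Placeholder deployment values -- substitute your own.
BASE_URL = "https://abc123.execute-api.us-west-2.amazonaws.com/develop"
HEADERS = {"Authorization": "Bearer <your-token>", "Content-Type": "application/json"}
DATASET_ID = "<dataset-id-returned-by-the-create-call>"

# Example schema: a movies dataset keyed on imdb_id, with an assumed GSI
# on release_year. Only STRING, NUMBER, and BINARY column types are allowed.
schema_body = {
    "DatasetSchema": [
        {"name": "imdb_id", "type": "STRING"},
        {"name": "title", "type": "STRING"},
        {"name": "release_year", "type": "NUMBER"},
    ],
    "PartitionKey": {"name": "imdb_id", "type": "STRING"},
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "release_year_index",
            "PartitionKey": {"name": "release_year", "type": "NUMBER"},
            "SortKey": {"name": "title", "type": "STRING"},
            "ProjectionType": "KEYS_ONLY",
        }
    ],
}

response = requests.put(BASE_URL + "/datasets/" + DATASET_ID, headers=HEADERS, json=schema_body)
response.raise_for_status()
print(response.json())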

Update metadata

Users can add more global secondary indexes to a DynamoDB dataset or delete existing ones using the API below.

PUT https://{{gateway_id}}.execute-api.{{region}}.amazonaws.com/{{environment}}/datasets/{{dataset_id}}/updatemetadata


Following are the details required to update global secondary indexes.

{
  "GlobalSecondaryIndexes": [
    {
      "IndexName": "string",
      "PartitionKey": {
        "name": "string",
        "type": "string"
      },
      "SortKey": {
        "name": "string",
        "type": "string"
      },
      "ProjectionType": "string"
    }
  ]
}
Note

In a single API call, the user can either add or delete only one global secondary index. If multiple operations are required, the user needs to make consecutive calls, one operation per call.

The PUT API call body must contain the full list of GSIs that the user wants on the dataset. To delete one GSI and add a new one, make two separate API calls: one with the old GSI removed from the list, and another with the new GSI added to the list, as in the sketch below.
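As a sketch of this two-call pattern (all values assumed): suppose the dataset currently has a single GSI and you want to replace it with release_year_index.

import requests

# Placeholder deployment values -- substitute your own.
BASE_URL = "https://abc123.execute-api.us-west-2.amazonaws.com/develop"
HEADERS = {"Authorization": "Bearer <your-token>", "Content-Type": "application/json"}
URL = BASE_URL + "/datasets/<dataset-id>/updatemetadata"

new_index = {
    "IndexName": "release_year_index",
    "PartitionKey": {"name": "release_year", "type": "NUMBER"},
    "SortKey": {"name": "title", "type": "STRING"},
    "ProjectionType": "ALL",
}

# Call 1: delete the existing GSI by sending the GSI list without it.
requests.put(URL, headers=HEADERS, json={"GlobalSecondaryIndexes": []}).raise_for_status()

# Call 2: add the new GSI by sending the GSI list with it included.
requests.put(URL, headers=HEADERS, json={"GlobalSecondaryIndexes": [new_index]}).raise_for_status()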

View Dataset


Upon clicking View Details under a DynamoDB dataset, the user can see all the details of the dataset. Please check the documentation on View Dataset for more details on viewing a dataset.

DynamoDB Table Name and SSM Parameter Details

After the schema registration step, the DynamoDB table creation process is triggered and a table is created with the naming convention below. An SSM parameter containing the DynamoDB table name is also created as part of the same process. This SSM parameter can be used in ETL jobs to get the latest DynamoDB table name and perform ETL operations.

DynamoDB table naming convention: <PROJECTSHORTNAME>_<DOMAINNAME>_<DATASETNAME>_<RANDOM_5_CHAR_STRING>
SSM parameter naming convention: <PROJECTSHORTNAME>_<DOMAINNAME>_<DATASETNAME>
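
For example, on a project with short name cdap, a dataset named dynamodb_dataset_1 in the domain dataset_demo would get a table name like cdap_dataset_demo_dynamodb_dataset_1_ab3xk (the 5-character suffix is random) and the SSM parameter cdap_dataset_demo_dynamodb_dataset_1, which is the parameter key used in the sample ETL script below.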


Note

If the dataset's TableUpdate method is reload, then every time a new data load happens, a new DynamoDB table is created with the same properties and the old table is deleted upon successful completion. To get the latest DynamoDB table name, query the SSM parameter and read its value.

Upload files

Users can upload files manually to the dataset ONLY if the connection type is api (the default). Please check the documentation on Dataset Files for more details. DynamoDB datasets follow the same approach with respect to files.

Delete dataset

A dataset can be deleted using the Delete (trash) icon at the right corner of the page. Please check the documentation on Delete Dataset for more details.

Access DynamoDB Datasets from ETL Jobs

Users can add DynamoDB datasets to their ETL jobs just like any other dataset, either by editing an existing ETL job or while creating a new one in the Amorphic UI. When a user gives a DynamoDB dataset read/write access in an ETL job, the read/write access is granted on both the DynamoDB table and the SSM parameter.

  • DynamoDB dataset with WRITE access: When a DynamoDB dataset is given WRITE access in an ETL job, the user can perform the operations below on the DynamoDB table:

    PutItem
    UpdateItem
    DeleteItem
    BatchWriteItem
    GetItem
    BatchGetItem
    Scan
    Query
    ConditionCheckItem
  • DynamoDB dataset with READ access: When a DynamoDB dataset is given READ access in an ETL job, the user can perform the operations below on the DynamoDB table:

    GetItem
    BatchGetItem
    Scan
    Query
    ConditionCheckItem

Sample ETL Job script

Below is a sample ETL script that retrieves the DynamoDB dataset's SSM parameter value and queries the DynamoDB table.

import json

import boto3
from boto3.dynamodb.conditions import Key

AWS_REGION = 'us-west-2'

DYNAMODB_RES = boto3.resource('dynamodb', region_name=AWS_REGION)
ssm_client = boto3.client('ssm', region_name=AWS_REGION)

# The SSM parameter name follows <PROJECTSHORTNAME>_<DOMAINNAME>_<DATASETNAME>.
parameter_key = "cdap_dataset_demo_dynamodb_dataset_1"
response = ssm_client.get_parameter(Name=parameter_key)

# The parameter value is the current DynamoDB table name (it changes on
# every data load for reload-type datasets).
dynamodb_table_name = response['Parameter']['Value']
print("DynamoDB table name - {}".format(dynamodb_table_name))

dynamodb_table = DYNAMODB_RES.Table(dynamodb_table_name)

# Query items by partition key value.
ddb_item = dynamodb_table.query(
    KeyConditionExpression=Key('imdb_id').eq('tt0047966')
)
print("Get item from DynamoDB: {}".format(json.dumps(ddb_item, indent=4, default=str)))

# Scan the full table; a single scan call returns at most 1 MB of data,
# so paginate with LastEvaluatedKey to fetch everything.
full_table_scan = dynamodb_table.scan()
data = full_table_scan['Items']
while 'LastEvaluatedKey' in full_table_scan:
    full_table_scan = dynamodb_table.scan(ExclusiveStartKey=full_table_scan['LastEvaluatedKey'])
    data.extend(full_table_scan['Items'])
print("Total items scanned: {}".format(len(data)))