
Bulk data load v2

Note

As a prerequisite for Amorphic bulk data load v2, the target datasets must already exist. Based on the user's selection, these datasets are used to ingest data from the source tables into Amorphic.

Set up bulk data load (Version 2) in Amorphic

A bulk data load task in Amorphic can be set up using the "Create New Task" option on the connection details page, under the "Tasks" tab.

The following picture depicts the connections tasks page in Amorphic

Connections Tasks Home Page

Tasks

A task is a unit of work that migrates data from a source database to a target database. There are two types of tasks: tasks with on-demand instances and Serverless replication tasks. As of now, Serverless tasks are available exclusively through the API. The steps for creating both types of tasks are outlined below.

How to create a Task with on-demand instance

Below are the steps that are required to create a bulk data load task in Amorphic.

Task Specifications

  • Task Name : Name of the task; used as an identifier of the task.

  • Migration Type : Full Load, CDC only & Full Load and CDC

    • Full Load : This option simply migrates the existing data from your source database to your target database.
    • Full Load and CDC (Migrate existing data and replicate ongoing changes) : This option performs a full data load while capturing changes on the source. After the full load is complete, the captured changes are applied to the target, and replication continues thereafter.
    • CDC only (Replicate data changes only) : In this option, only ongoing data changes are captured.
  • Target Location : Select the target where the data has to be migrated.

    Note
    • For S3Athena target type only Full Load is supported for now.
    • Redshift target type is not supported if Redshift Serverless is being used.
  • Sync To S3 : Applicable only when the target location is DWH. This option lets the user choose whether data should also be copied to S3, for both full-load and CDC tasks. For CDC tasks, to sync the data to S3 a schedule must be created on the Schedules page after selecting this option as Yes.

  • Data Format : Determines the target data format of a task. For S3 and S3Athena targets, files are created for the datasets in the format specified here. For DWH targets, this field determines the format of the data exported to S3 when Sync To S3 is set to "yes", and is shown only when applicable for DWH. This field is optional and defaults to "csv" if not specified.

  • CDC Start Time (applicable only for CDC) : Custom start time used as the starting point for capturing data from the source.

  • CDC Stop Type : Custom stop type used as the stopping point of change data capture. Available options are Server Time & Commit Time.

  • CDC Stop Time : Custom stop time used as the stopping time of change data capture.

    Note

    Both CDC Stop Type & CDC Stop Time are required to use the CDC stop functionality. This setting is supported only for DWH targets (i.e., AuroraMysql or Redshift).

  • Target Extra Connection Attributes (Optional) : Extra connection attributes to be applied to the target db for data migration jobs. Please refer to the documentation below for the available extra connection attributes.

    Note

    For S3 target datasets, Amorphic uses addColumnName=true;timestampColumnName=RecordModifiedTimeStamp;includeOpForFullLoad=true as the ExtraConnectionAttributes. When the user provides extra connection attributes in the option above, these predefined settings are overwritten; the user must make sure to include these flags when creating the task so it stays in sync with other data loads.
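    As a sketch of the behaviour described in this note, the hypothetical helper below (not part of Amorphic) merges user-supplied attributes over the predefined S3 defaults, so the required flags survive even when the user overrides or adds attributes:

    ```python
    # Hypothetical helper: merge user-supplied extra connection attributes
    # with Amorphic's predefined S3 defaults so the required flags are kept.
    AMORPHIC_S3_DEFAULTS = (
        "addColumnName=true;"
        "timestampColumnName=RecordModifiedTimeStamp;"
        "includeOpForFullLoad=true"
    )

    def merge_eca(user_attrs: str, defaults: str = AMORPHIC_S3_DEFAULTS) -> str:
        """Return a semicolon-separated attribute string where user values
        override defaults, but every default key is still present."""
        merged = {}
        for chunk in (defaults + ";" + user_attrs).split(";"):
            if chunk.strip():
                key, _, value = chunk.partition("=")
                merged[key.strip()] = value.strip()
        return ";".join(f"{k}={v}" for k, v in merged.items())

    print(merge_eca("maxFileSize=512;addColumnName=false"))
    ```

    A merged string like this can then be pasted into the Target Extra Connection Attributes field, instead of supplying only the new attributes and losing the defaults.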

  • Source Extra Connection Attributes(Optional) : Extra connection attributes to be applied to the source db for data migration jobs. Please refer below documentation for the available extra connection attributes.

    Note

    To add source extra connection attributes, they must be added on the Edit Task Payload (Beta) page only. Below is an example.

    Source extra connection attributes sample json
    {
        "SourceExtraConnectionAttributes": "parallelLoadThreads=5;maxFileSize=512"
    }
  • Target Table Prep Mode : Indicates how to handle loading the target at load startup. Available options are Truncate & Append: "Truncate" deletes the data in the dataset before the load, while "Append" adds the data to the existing data.

    Note

    This is an optional field; if no option is selected, it defaults to Append in the backend. This setting takes precedence over the setting in the Dataset configuration.

  • Use Shared Instance : To use a shared instance for a task, select Yes; a drop-down then appears where the user can select one of the shared instances available for the connection. This is an optional field and defaults to No if no option is specified.

  • Replication Instance AZ : Availability Zone in which to launch the replication instance. When an AZ is selected, all the instance classes available in that AZ are shown and one can be selected from the list.

  • Replication Instance Class : Type of DMS instance class to use for the data migration (when the user selects an instance here, the backend instance selection is nullified). Please choose an appropriate instance class based on the data volume of all the tables.

  • Allocated Storage : Amount of storage space you want for your replication instance. AWS DMS uses this storage for log files and cached transactions while replication tasks are in progress.

    Note

    Both the ‘Allocated Storage’ and ‘Replication Instance Class’ parameters are required to use a user-provided instance setting for the DMS task. If both parameters are defined, the Approx Table Size selected on the table metadata page has no effect and Amorphic uses the instance setting provided by the user; otherwise, the instance configuration is decided based on the aggregate sum of all table sizes in the task.
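The selection rule in the note above can be sketched as follows; the fallback thresholds and instance class names are illustrative assumptions, not Amorphic's actual sizing logic:

```python
# Sketch of the instance-selection rule: a user-supplied instance class
# plus allocated storage wins; otherwise the configuration is derived
# from the aggregate of the per-table sizes.
def pick_instance_config(instance_class=None, allocated_storage=None,
                         table_sizes_gb=()):
    if instance_class and allocated_storage:
        # Both user parameters present: table sizes have no effect.
        return {"class": instance_class, "storage": allocated_storage}
    # Fall back to a size-based choice (thresholds are illustrative only).
    total = sum(table_sizes_gb)
    cls = "dms.t3.medium" if total < 100 else "dms.c5.xlarge"
    return {"class": cls, "storage": max(50, total)}

print(pick_instance_config("dms.r5.large", 200, table_sizes_gb=[500]))
print(pick_instance_config(table_sizes_gb=[10, 20]))
```

The first call ignores the 500 GB table size because both user parameters are set; the second derives the configuration from the table sizes alone.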

Table(s) selection

  • Select all the tables that are needed to be loaded into Amorphic.
  • Schemas and tables can be filtered if necessary with the filter on the top.
  • After selecting the tables, click on Next which is at the end of the page to proceed with the metadata edit/update.
  • In the case of an S3 target location, a whole schema can also be selected instead of individual tables.

Connections Tasks Home Page

Ingestion Configuration

  • Bulk Edit:

    • Bulk Edit is used to map the source tables to Amorphic datasets; users will see a list of the datasets they have access to.
  • The following are the options available in Bulk edit page and their corresponding use.

    • Target Dataset Name : Selects the dataset in Amorphic where the data is to be ingested. The user can select datasets based on the selected target location. For example, if the user selects S3 as the target location, only S3 datasets are applicable, and of those, only append and update (latest record) datasets are allowed.
    • Size : Determines the type of instance to use while running the data migration task; it has nothing to do with the metadata of the dataset. Please select the approximate size of the source table so that the instance can be chosen accordingly.
    • Enable Filters, Transforms And Sort : Enables additional configuration that can be applied either to the source tables or to the target data while the task runs, during data ingestion.
  • Filter & Transformation rules

    • Filter rules are used to filter the data while ingesting the data to Amorphic. Below are the rules that are supported during ingestion.

      • Equal to
      • Greater than or Equal to
      • Less than or Equal to
    • Transformation rules are used to modify the column names or datatype of the table while ingesting the data to target, below is the list of transformation rules that are available.

      • Rename To
      • Remove Column
      • Make UpperCase
      • Make LowerCase
      • Add Prefix
      • Remove Prefix
      • Replace Prefix
      • Add Suffix
      • Remove Suffix
      • Replace Suffix
      • Change Data Type (Only Integer, Double & String data types are available for conversion now)
      • Add Column

      Below is a sample table illustrating the column names created in Amorphic when transformation rules are applied to the columns. The example column name column_name is used for all the rules.

      Rule name        | Old value     | New value       | Result
      -----------------|---------------|-----------------|-------------------
      Rename To        | N/A           | new_column_name | new_column_name
      Remove Column    | N/A           | N/A             | N/A
      Make UpperCase   | N/A           | N/A             | COLUMN_NAME
      Make LowerCase   | N/A           | N/A             | column_name
      Add Prefix       | N/A           | prefix          | prefix_column_name
      Remove Prefix    | column/COLUMN | N/A             | name
      Replace Prefix   | column        | col             | col_name
      Add Suffix       | N/A           | _suffix         | column_name_suffix
      Remove Suffix    | _name/_NAME   | N/A             | column
      Replace Suffix   | _name         | _nm             | column_nm
      Change Data Type | N/A           | N/A             | Datatype changes
      Add Column       | N/A           | new_column_name | new_column_name
Note

When multiple transformation rules are applied to a single column, AWS considers only the first rule and ignores the rest. If multiple transformation rules need to be applied to a column, consider using the Rename Column transformation rule.
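The semantics of the name-rewriting rules above can be illustrated with a small sketch (the rule-action strings follow AWS DMS naming conventions; this is not Amorphic's actual code):

```python
# Illustrative sketch of how the transformation rules rewrite a column name.
def apply_rule(name, action, value=None, old_value=None):
    if action == "rename":
        return value
    if action == "make-uppercase":
        return name.upper()
    if action == "make-lowercase":
        return name.lower()
    if action == "add-prefix":
        return value + name
    if action == "remove-prefix":
        return name[len(old_value):] if name.startswith(old_value) else name
    if action == "replace-prefix":
        return (value + name[len(old_value):]) if name.startswith(old_value) else name
    if action == "add-suffix":
        return name + value
    if action == "remove-suffix":
        return name[:-len(old_value)] if name.endswith(old_value) else name
    if action == "replace-suffix":
        return (name[:-len(old_value)] + value) if name.endswith(old_value) else name
    raise ValueError(f"unsupported action: {action}")

print(apply_rule("column_name", "add-prefix", value="prefix_"))
print(apply_rule("column_name", "replace-suffix", value="_nm", old_value="_name"))
```

Note that, per the note above, only the first matching rule is applied per column; chaining several `apply_rule` calls on one column does not reflect the actual runtime behaviour.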

Preview and Submit

  • Here all the details that were entered can be reviewed along with the Transformation and Filter rules added for each dataset.

  • If the task name is already taken by another task, it can be edited on this page; the edit option is shown after Submit Task is clicked, which validates the task name in the backend.

  • After selecting all the options click on Submit Task which does the schema conversion and registers the datasets in Amorphic.

    Please follow the below animation as a reference to create a task. Connections Task Creation

  • After successful datasets selection, the task can be started with the Start Task option.

    Connections Tasks Start Task

How to create a Serverless Task

Serverless tasks can be created via the API, using similar attributes for task creation. The necessary payload changes for creating serverless tasks are outlined below.

The following attributes are not needed for serverless task creation: Use Shared Instance, Replication Instance AZ, Replication Instance Class, and Allocated Storage.

Instead, the following attributes are required:

  • MaxDCU : Maximum DMS Capacity Units (DCU) that AWS DMS can provision, depending on your replication's capacity prediction. This is also the maximum DCU that autoscaling could scale up to
  • MinDCU : (Optional) Minimum number of DCU that AWS DMS will provision. This is also the minimum DCU that autoscaling could scale down to.
  • ServerlessReplication : Set this parameter to true to create a serverless replication task. If set to false, the system will create a task using on-demand instances.
  • MultiAZ : Setting this optional parameter creates a standby replica of your replication in another Availability Zone for failover support
  • ReplicationAZ : (Optional) Availability zone to be used to launch the serverless replication task

All other attributes remain the same as for tasks with on-demand instances. However, there is a restriction for rules in TableMappings, as mentioned below.

Serverless tasks don't support views with selection and transformation rules. Therefore, change "table-type": "all" to "table-type": "table".
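When converting an on-demand task payload for serverless use, that rewrite can be sketched as follows (a hypothetical helper, assuming the TableMappings structure shown in the sample payload below):

```python
import json

# Hypothetical helper: force "table-type": "all" to "table-type": "table"
# in every rule's object-locator, as required for serverless tasks.
def force_table_type(table_mappings: dict) -> dict:
    for rule in table_mappings.get("rules", []):
        locator = rule.get("object-locator", {})
        if locator.get("table-type") == "all":
            locator["table-type"] = "table"
    return table_mappings

mappings = {"rules": [{"rule-type": "selection",
                       "object-locator": {"schema-name": "SourceSchema",
                                          "table-name": "SourceTable",
                                          "table-type": "all"}}]}
print(json.dumps(force_table_type(mappings), indent=2))
```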

As this feature is available only through API, here is a sample payload for creating serverless tasks in Amorphic.

/connections/{connection_id}/dataloads & POST method

Sample payload for creating Serverless-Tasks
    {
        "TaskName": "ServerlessTask",
        "MigrationType": "full-load-and-cdc",
        "TargetLocation": "s3",
        "TableMappings": {
            "rules": [
                {
                    "rule-type": "transformation",
                    "rule-id": "1",
                    "rule-name": "1",
                    "rule-target": "schema",
                    "object-locator": {
                        "schema-name": "SourceSchema",
                        "table-name": "SourceTable",
                        "table-type": "table"
                    },
                    "rule-action": "rename",
                    "value": "amorphicdomain",
                    "old-value": null
                },
                {
                    "rule-type": "transformation",
                    "rule-id": "2",
                    "rule-name": "2",
                    "rule-target": "table",
                    "object-locator": {
                        "schema-name": "SourceSchema",
                        "table-name": "SourceTable",
                        "table-type": "table"
                    },
                    "rule-action": "add-prefix",
                    "value": "Amorphic_dataset_s3/schema_name=SourceSchema/table_name=",
                    "old-value": null
                },
                {
                    "rule-type": "selection",
                    "rule-id": "3",
                    "rule-name": "3",
                    "object-locator": {
                        "schema-name": "SourceSchema",
                        "table-name": "SourceTable",
                        "table-type": "table"
                    },
                    "rule-action": "include",
                    "filters": []
                }
            ]
        },
        "TaskDefinitions": [
            {
                "TableName": "SourceTable",
                "SchemaName": "SourceSchema",
                "TargetDatasetId": "9f065361-7d35-49fd-a59f-96ca98614e7d",
                "Size": "2"
            }
        ],
        "TaskS3Payload": "no",
        "MaxDCU": 2,
        "MinDCU": 1,
        "MultiAZ": false,
        "ReplicationAZ": "eu-west-1c",
        "ServerlessReplication": true
    }
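Before POSTing a payload like the one above, a client-side sanity check can catch mixed-up attributes. The sketch below is illustrative only; the on-demand attribute names checked for (SharedInstance, InstanceAZ, InstanceClass, AllocatedStorage) are taken from the editable payload key list later in this document:

```python
# Sketch of a client-side check: serverless-only attributes must be present
# and on-demand instance attributes must be absent.
def check_serverless_payload(payload: dict) -> list:
    problems = []
    if payload.get("ServerlessReplication") is not True:
        problems.append("ServerlessReplication must be true")
    if "MaxDCU" not in payload:
        problems.append("MaxDCU is required")
    for key in ("SharedInstance", "InstanceAZ", "InstanceClass", "AllocatedStorage"):
        if key in payload:
            problems.append(f"{key} is not applicable to serverless tasks")
    return problems

print(check_serverless_payload({"ServerlessReplication": True, "MaxDCU": 2}))
```

An empty list means the payload passes these basic checks; the API performs its own validation on submission.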

View Task Details

Once the task is started, its status changes to running, and the latest status can be fetched by refreshing the page.

Connections Tasks View Task Details

Task Stats

Data migration statistics can be viewed by clicking More Details option or task name on the task details.

Connections Tasks Task Stats

Task statistics can be downloaded using the 'Download Task Stats' option from the more (three vertical ellipses) option in the task details page. Also user can retrieve the latest stats using the 'Refresh Task' option.

Retrieve and Download Task Stats

For Full-load type of tasks an additional tab called Runs will be shown which gives the run history of the task and their corresponding metrics.

Connections Tasks View Task Details Full Load

Note

Schedules tab shown in above image is applicable to any task type and is visible only when a schedule is created for the task.

Additional Info

Additional info related to the task, like Datasets Registered and View/Download Logs, can be viewed in the Stats, Schedules, Datasets & Logs view.

Please follow below animation for the details

Connections Tasks Task Details

The instance configuration of an existing task can be changed if required. Editing the instance is possible only when the task is in the Stopped state.

Edit task

Edit Task is also available and is used to change the configuration of the task, such as mapping rules or any other applicable configuration.

Below image shows where the task edit option is available

Edit bulk data load task

Edit Task Payload(Beta)

Edit Payload is also available and is used to add any custom table mapping rules that are not possible via the UI. The user can click the Edit Task Payload option in the UI and add the rules as required.

Below image shows where the Edit Task Payload option is available.

Edit Payload

Additional configuration

Amorphic also supports a few additional configurations that can be applied to a task. These configurations are optional and are defaulted if not provided by the user. Below is the list.

  • TaskRetryConfig : If a task fails in its first attempt, it is retried based on this configuration. This configuration can be applied only from the Edit Task Payload option and applies to full-load tasks only. Below is the payload reference.

    Add TaskRetryConfig to the payload

    "TaskRetryConfig": {
        "AutoRetry": <string> ("enable" or "disable" are the allowed values),
        "MaxRetries": <integer> (allowed values are between 1 & 3),
        "RetryInterval": <integer> (allowed values are between 10 & 60)
    }

    Notes:
    RetryInterval is in minutes, i.e. if the user gives an input of 10, the task waits for 10 minutes before a restart.
    When "AutoRetry" is "disable", "MaxRetries" & "RetryInterval" are not applicable.
  • AdditionalTaskSettings : Any additional settings can be applied to a task with this configuration. Refer to the AWS documentation for the list of all available settings.

Below is an example payload where both TaskRetryConfig & AdditionalTaskSettings are applied on a task

Example task payload with additional configuration

"TaskRetryConfig": {
    "AutoRetry": "enable",
    "MaxRetries": 2,
    "RetryInterval": 10
},
"AdditionalTaskSettings": {
    "FullLoadSettings": {
        "MaxFullLoadSubTasks": 8,
        "CommitRate": 50000
    }
}

Example Task additional configuration
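The stated TaskRetryConfig ranges can be expressed as a small validation sketch (illustrative only, not Amorphic's backend validation):

```python
# Sketch of the TaskRetryConfig rules documented above: AutoRetry must be
# "enable" or "disable"; when enabled, MaxRetries must be 1-3 and
# RetryInterval 10-60 (minutes); when disabled, the other keys don't apply.
def validate_retry_config(cfg: dict) -> bool:
    if cfg.get("AutoRetry") not in ("enable", "disable"):
        return False
    if cfg["AutoRetry"] == "disable":
        return True  # MaxRetries / RetryInterval are not applicable
    return (1 <= cfg.get("MaxRetries", 0) <= 3
            and 10 <= cfg.get("RetryInterval", 0) <= 60)

print(validate_retry_config({"AutoRetry": "enable", "MaxRetries": 2, "RetryInterval": 10}))
print(validate_retry_config({"AutoRetry": "enable", "MaxRetries": 5, "RetryInterval": 10}))
```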

  • DmsVersion : This setting is used for the creation of the DMS replication instance; if no value is provided by the user, Amorphic picks the latest non-beta version available in that AWS region. This setting is applicable to all task types. The parameter is optional and can also be updated; both creation and update are possible exclusively through the API.

Below is an example payload where DmsVersion is applied on a task

Example task payload with DMS version
    "DmsVersion": "3.4.5"

Example Task Dms version

Reload table/s in a task

If data ingestion fails for a few tables in a task, the user can reload the ingestion for only the failed tables instead of re-running the entire task. Amorphic supports reloading tables in a task as an API-only feature. Below are the API and method to use for reloading tables in a task.

/connections/{id}/dataloads/{taskid}?action=reload-tables & PUT method

Request payload to reload table/s in a task
    {
        "ReloadTables": [
            {"SchemaName": "schema1", "TableName": "table1"},
            {"SchemaName": "schema2", "TableName": "table2"},
            {"SchemaName": "schema3", "TableName": "table2"},
            ...
        ]
    }
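A hypothetical helper for building this payload from (schema, table) pairs before issuing the PUT call above:

```python
# Hypothetical helper: build the reload-tables payload for
# PUT /connections/{id}/dataloads/{taskid}?action=reload-tables
def build_reload_payload(tables):
    return {"ReloadTables": [{"SchemaName": s, "TableName": t}
                             for s, t in tables]}

payload = build_reload_payload([("schema1", "table1"), ("schema2", "table2")])
print(payload)
```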

Additional Info

  • Only the configurations below are editable when using the Edit Task Payload feature.

    • TargetTablePrepMode
    • DatasetDefinitions
    • TaskDefinitions
    • TableMappings
    • SyncToS3
    • DataFormat
    • CdcStartTime
    • CdcStopPosition
    • TargetExtraConnectionAttributes
    • TaskRetryConfig
    • AdditionalTaskSettings
    • SourceExtraConnectionAttributes
    • InstanceConfig (InstanceMultiAZ, InstanceClass, AllocatedStorage, InstanceAZ)
    • SharedInstance
    • DmsVersion
  • By default, Amorphic assumes the My Data Files Have Headers setting is No for datasets when a task is created. The user must select the datasets accordingly when creating a task; otherwise, there will be a mismatch in counts between Redshift/AuroraMysql and S3Athena when the SyncToS3 setting is enabled for a task.