Version: v1.13 print this page

Iceberg Datasets

In Amorphic, User can create Iceberg datasets with S3Athena target location which creates Iceberg table in the backend to store the data.

info

As of Amorphic 1.14, User can only create Iceberg datasets targeted to S3Athena. If Iceberg is compatible with any other targets in future, we will try to incorporate them accordingly.

Athena supports read, time travel, write, and DDL queries for Apache Iceberg tables that use the Apache Parquet format for data and the AWS Glue catalog for their metastore.

What is Apache Iceberg ?

Apache Iceberg is an open table format for very large analytic datasets. Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. The Iceberg specification allows seamless table evolution such as schema and partition evolution, and its design is optimized for usage on Amazon S3. Iceberg also helps guarantee data correctness under concurrent write scenarios. For more information about Apache Iceberg, see Apache Iceberg Documentation

What does Amorphic support ?

Amorphic Iceberg datasets support following features:

ACID transactions
- ACID (atomic, consistent, isolated, and durable) transactions protect the integrity of Data Catalog operations such as creating or updating a table. They also enable multiple users to concurrently and reliably add and delete objects in the Amazon S3 data lake, while still allowing other users to simultaneously run analytical queries and machine learning (ML) models on the same datasets that return consistent and up-to-date results. When Iceberg tables are involved in reads from or writes to the data lake on Amazon S3, those operations occur within a transaction. Transactions protect the integrity of iceberg table metadata, including the manifest — the metadata that defines the Amazon S3 objects in the table's underlying data. Integrated AWS services such as Amazon Athena support iceberg tables to provide consistent reads in queries. To use transactions in your AWS Glue ETL jobs, you begin a transaction before you perform any reads from or writes to the data lake, and you commit the transaction upon completion. For more information about transactions, see Reading from and Writing to the Data Lake Within Transactions.
Set/Unset Table properties (User can specify attributes like Write compression, Data optimization configuration etc)
Schema evolution (Add, Drop, Rename, Update (changing data type) columns)
Hidden-Partitioning
Time travel queries to specified date and time
Version travel queries to specified snapshot ID (Table version)
Queries combining time and version travel
Iceberg table data can be managed directly on Athena using INSERT, UPDATE, and DELETE queries.
Optimizing Iceberg tables data by REWRITE DATA compaction action
Row-level deletes

For more information about Athena supported Iceberg features and limitations, Check Athena Iceberg Documentation

Limitations (Both AWS and Amorphic)

Supported data types
Applicable to ONLY 'S3Athena' TargetLocation and 'Parquet' FileType
Restricted/Non-Applicable Amorphic features for Iceberg datasets
- S3Athena DataValidation
- Skip LZ (Validation) Process
- Malware Detection
- Data Profiling
- Data Cleanup
- Data Metrics collection
- Life Cycle Policy
No Partition evolution (Changing partitions after table creation)
Only predefined list of key-value pairs allowed in the table properties for creating or altering Iceberg tables. Check AWS Documentation
Schema evolution:
- Allowed ONLY a set of update column (data type promotions) actions:
  - Change an integer column to a big integer column
  - Change a float column to a double
  - Increase the precision of a decimal type column
- Reorder columns is not supported
Views are not supported on top of Iceberg tables.

Create Iceberg Datasets

In Amorphic, User can create Iceberg datasets same as Athena dataset by selecting 'S3Athena' as target, 'parquet' as file type and 'Yes' in 'Iceberg Table' dropdown. User can add Iceberg table properties using 'Iceberg Table Properties' section by adding properties in key-value pairs. Check documentation for Iceberg supproted table properties

Create Iceberg dataset

Upon successful registration of Dataset metadata, User can specify partition related information through 'Custom Partition Options' with following attributes:

Column Name: Partition column name which should be of any column name from schema.
Transformation: Partition transform functions which are used to convert the column data into the respective method by Iceberg (Hidden partitioning). Available Iceberg Transformation functions are year, month, day, hour, bucket, truncate, None (means no transformation)
Transformation Input: If Transformation is either bucket or truncate then additional input should be provided. Input value should be a positive number.

For more information, Check documentation for Iceberg Partitioning

Iceberg partitions

Load Iceberg Datasets

Uploading data to Iceberg datasets is similar to Amorphic 'Data Reloads' functionality where the files will be in pending state once uploaded. User should select the files and process the files from 'Pending Files' option from 'File Status' dropdown in Files tab.

Once pending files are processed, then it'll start the dataload process in the backend which takes sometime compared to regular file uploads to other type of datasets.

Below options will not be available for Iceberg dataset:

'Add Tags', 'Delete' and 'Permanent Delete' options when completed files are selected in 'Complete Files' File Status dropdown.
'Truncate Dataset', 'Download File', 'Apply ML' and 'View AI/ML Results' buttons/options for completed files in Files tab.

User can delete the pending files by selecting the files from 'Pending Files' option in 'File Status' dropdown in Files tab.

Query Iceberg Datasets

Once the data is successfully loaded into Iceberg datasets, the data is readily available for the User to query and analyze the data directly from the Amorphic Query Engine feature by selecting the workgroup as AmazonAthenaIcebergPreview.

User can perform additional commands incase of Iceberg datasets for below actions:

Iceberg table data can be managed directly on Athena using below commands
- INSERT INTO, UPDATE and DELETE FROM
- For more information, Check AWS Documentation
View Metadata
- DESCRIBE, SHOW TBLPROPERTIES
- For more information, Check AWS Documentation
Optimize data by REWRITE DATA compaction action
- OPTIMIZE
- For more information, Check AWS Documentation

IMPORTANT

It's better to AVOID the above commands if user doesn't have knowledge on them as it'll change/delete the data and its metadata based on the specified command.

Below image shows result of "DESCRIBE" table command on an Iceberg dataset:

Query Iceberg dataset

What is Apache Iceberg ?​

What does Amorphic support ?​

Limitations (Both AWS and Amorphic)​

Create Iceberg Datasets​

Load Iceberg Datasets​

Query Iceberg Datasets​