Skip to main content
Version: v1.13 print this page

Iceberg Datasets

In Amorphic, User can create Iceberg datasets with S3Athena target location which creates Iceberg table in the backend to store the data.

info

As of Amorphic 1.14, User can only create Iceberg datasets targeted to S3Athena. If Iceberg is compatible with any other targets in future, we will try to incorporate them accordingly.

Athena supports read, time travel, write, and DDL queries for Apache Iceberg tables that use the Apache Parquet format for data and the AWS Glue catalog for their metastore.

What is Apache Iceberg ?

Apache Iceberg is an open table format for very large analytic datasets. Iceberg manages large collections of files as tables, and it supports modern analytical data lake operations such as record-level insert, update, delete, and time travel queries. The Iceberg specification allows seamless table evolution such as schema and partition evolution, and its design is optimized for usage on Amazon S3. Iceberg also helps guarantee data correctness under concurrent write scenarios. For more information about Apache Iceberg, see Apache Iceberg Documentation

What does Amorphic support ?

Amorphic Iceberg datasets support following features:

  • ACID transactions
    • ACID (atomic, consistent, isolated, and durable) transactions protect the integrity of Data Catalog operations such as creating or updating a table. They also enable multiple users to concurrently and reliably add and delete objects in the Amazon S3 data lake, while still allowing other users to simultaneously run analytical queries and machine learning (ML) models on the same datasets that return consistent and up-to-date results. When Iceberg tables are involved in reads from or writes to the data lake on Amazon S3, those operations occur within a transaction. Transactions protect the integrity of iceberg table metadata, including the manifest — the metadata that defines the Amazon S3 objects in the table's underlying data. Integrated AWS services such as Amazon Athena support iceberg tables to provide consistent reads in queries. To use transactions in your AWS Glue ETL jobs, you begin a transaction before you perform any reads from or writes to the data lake, and you commit the transaction upon completion. For more information about transactions, see Reading from and Writing to the Data Lake Within Transactions.
  • Set/Unset Table properties (User can specify attributes like Write compression, Data optimization configuration etc)
  • Schema evolution (Add, Drop, Rename, Update (changing data type) columns)
  • Hidden-Partitioning
  • Time travel queries to specified date and time
  • Version travel queries to specified snapshot ID (Table version)
  • Queries combining time and version travel
  • Iceberg table data can be managed directly on Athena using INSERT, UPDATE, and DELETE queries.
  • Optimizing Iceberg tables data by REWRITE DATA compaction action
  • Row-level deletes

For more information about Athena supported Iceberg features and limitations, Check Athena Iceberg Documentation

Limitations (Both AWS and Amorphic)

  • Supported data types
  • Applicable to ONLY 'S3Athena' TargetLocation and 'Parquet' FileType
  • Restricted/Non-Applicable Amorphic features for Iceberg datasets
    • S3Athena DataValidation
    • Skip LZ (Validation) Process
    • Malware Detection
    • Data Profiling
    • Data Cleanup
    • Data Metrics collection
    • Life Cycle Policy
  • No Partition evolution (Changing partitions after table creation)
  • Only predefined list of key-value pairs allowed in the table properties for creating or altering Iceberg tables. Check AWS Documentation
  • Schema evolution:
    • Allowed ONLY a set of update column (data type promotions) actions:
      • Change an integer column to a big integer column
      • Change a float column to a double
      • Increase the precision of a decimal type column
    • Reorder columns is not supported
  • Views are not supported on top of Iceberg tables.

Create Iceberg Datasets

In Amorphic, User can create Iceberg datasets same as Athena dataset by selecting 'S3Athena' as target, 'parquet' as file type and 'Yes' in 'Iceberg Table' dropdown. User can add Iceberg table properties using 'Iceberg Table Properties' section by adding properties in key-value pairs. Check documentation for Iceberg supproted table properties

Create Iceberg dataset

Upon successful registration of Dataset metadata, User can specify partition related information through 'Custom Partition Options' with following attributes:

  • Column Name: Partition column name which should be of any column name from schema.
  • Transformation: Partition transform functions which are used to convert the column data into the respective method by Iceberg (Hidden partitioning). Available Iceberg Transformation functions are year, month, day, hour, bucket, truncate, None (means no transformation)
  • Transformation Input: If Transformation is either bucket or truncate then additional input should be provided. Input value should be a positive number.

For more information, Check documentation for Iceberg Partitioning

Iceberg partitions

Load Iceberg Datasets

Uploading data to Iceberg datasets is similar to Amorphic 'Data Reloads' functionality where the files will be in pending state once uploaded. User should select the files and process the files from 'Pending Files' option from 'File Status' dropdown in Files tab.

Once pending files are processed, then it'll start the dataload process in the backend which takes sometime compared to regular file uploads to other type of datasets.

Below options will not be available for Iceberg dataset:

  • 'Add Tags', 'Delete' and 'Permanent Delete' options when completed files are selected in 'Complete Files' File Status dropdown.
  • 'Truncate Dataset', 'Download File', 'Apply ML' and 'View AI/ML Results' buttons/options for completed files in Files tab.

User can delete the pending files by selecting the files from 'Pending Files' option in 'File Status' dropdown in Files tab.

Query Iceberg Datasets

Once the data is successfully loaded into Iceberg datasets, the data is readily available for the User to query and analyze the data directly from the Amorphic Query Engine feature by selecting the workgroup as AmazonAthenaIcebergPreview.

User can perform additional commands incase of Iceberg datasets for below actions:

  • Iceberg table data can be managed directly on Athena using below commands
  • View Metadata
  • Optimize data by REWRITE DATA compaction action
IMPORTANT

It's better to AVOID the above commands if user doesn't have knowledge on them as it'll change/delete the data and its metadata based on the specified command.

Below image shows result of "DESCRIBE" table command on an Iceberg dataset:

Query Iceberg dataset