Version: v2.5

Lakeformation Datasets

Lakeformation extends S3-Athena datasets with added security and supports CSV, TSV, XLSX, JSON and Parquet files. It also checks data integrity and offers ACID transactions, data compaction, and time-travel queries.

Lakeformation offers optional partial data validation for uploaded files. Validation is turned on by default and can be turned off. When it is on, files are partially read and each column is checked against the uploaded schema to detect and fix bad or invalid data; this takes more time and costs more money. Structured files (CSV, TSV, XLSX, JSON, and Parquet) can be registered with column data types such as String/Varchar, Integer, Double, Boolean, Date, and Timestamp. Complex data structures can be enclosed in quotes and registered as a String/Varchar column. After loading, the data can be transformed and cast to the correct type.
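To make the idea of partial validation concrete, here is a minimal Python sketch that reads only the first rows of a CSV file and checks each value against a declared schema. It is an illustration under stated assumptions, not Amorphic's actual validator: the function name `validate_sample`, the `sample_rows` parameter, the lowercase type names, and the date and timestamp formats are all hypothetical. It only detects problems and does not attempt to fix them.

```python
import csv
import io
from datetime import datetime


def _parses(parser):
    """Build a checker that returns True when `parser` accepts the value."""
    def check(value):
        try:
            parser(value)
            return True
        except ValueError:
            return False
    return check


# Hypothetical checkers; the formats accepted by the real service may differ.
CHECKERS = {
    "string": lambda value: True,
    "integer": _parses(int),
    "double": _parses(float),
    "boolean": lambda value: value.lower() in {"true", "false"},
    "date": _parses(lambda v: datetime.strptime(v, "%Y-%m-%d")),
    "timestamp": _parses(lambda v: datetime.strptime(v, "%Y-%m-%d %H:%M:%S")),
}


def validate_sample(text, schema, sample_rows=100):
    """Check the first `sample_rows` data rows of CSV text against `schema`.

    `schema` maps column name to one of the CHECKERS keys.
    Returns a list of (row_number, column, value) tuples for bad values.
    """
    reader = csv.DictReader(io.StringIO(text))
    problems = []
    for row_number, row in enumerate(reader, start=1):
        if row_number > sample_rows:
            break  # partial read: stop after the sample
        for column, type_name in schema.items():
            value = row.get(column)
            if value is None or not CHECKERS[type_name](value):
                problems.append((row_number, column, value))
    return problems
```

Sampling keeps the check cheap, at the price of missing bad values that appear later in the file.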

The AWS Athena recommended CSV Parser/SerDe has the following limitations:

  • It does not support embedded line breaks in CSV files.
  • It does not support empty fields in columns defined as a numeric data type.

As a workaround, import such columns as string columns and create views on top of the dataset that cast them to the required data types.
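The workaround can be sketched as a small Python helper that builds the view statement. This is a sketch, not an Amorphic API: the function name `build_cast_view_sql` and the table, view, and column names are hypothetical. It assumes the view is created in Athena, whose SQL engine provides `TRY_CAST`, which returns NULL when a value (such as an empty string) cannot be cast. Identifiers are not validated or escaped here.

```python
def build_cast_view_sql(view_name, table_name, column_types):
    """Build a CREATE OR REPLACE VIEW statement that casts string columns.

    `column_types` maps column name to the target SQL type. Columns whose
    target type is string/varchar are passed through unchanged.
    """
    select_items = []
    for column, target_type in column_types.items():
        if target_type.lower() in ("string", "varchar"):
            select_items.append(f'"{column}"')
        else:
            # TRY_CAST yields NULL for values that cannot be converted,
            # for example empty fields in a numeric column.
            select_items.append(
                f'TRY_CAST("{column}" AS {target_type}) AS "{column}"'
            )
    columns_sql = ",\n    ".join(select_items)
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n    {columns_sql}\n"
        f"FROM {table_name}"
    )


# Hypothetical dataset with every column imported as a string.
print(build_cast_view_sql(
    "sales_typed",
    "sales_raw",
    {"order_id": "varchar", "quantity": "INTEGER", "unit_price": "DOUBLE"},
))
```

The generated statement can then be run from the query editor against the string-typed dataset.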

For Datasets with JSON file type:

  • AWS Limitations
    • It uses the OpenX JSON SerDe with the following limitations:
      • It expects JSON data to be on a single line (not formatted), with records separated by a new line character.
      • Comma character is not allowed at the end of each line.
      • The full data in the file should not be enclosed in square brackets.
    • Views are not supported on top of Lakeformation JSON datasets.
  • Amorphic feature limitations (Not Applicable)
    • Malware Detection
    • Data Profiling

Below is an example of an invalid JSON file:

    [
    {
        "EmailId": "test-cwdl@cloudwick.com",
        "IsAdmin": "no",
        "UserId": "testuser"
    },
    {
        "EmailId": "test1-cwdl@cloudwick.com",
        "IsAdmin": "no",
        "UserId": "testuser1"
    }
    ]

Below is an example of a valid JSON file:

    { "EmailId": "test-cwdl1@cloudwick.com", "IsAdmin": "no", "UserId": "testuser1" }
    { "EmailId": "test-cwdl2@cloudwick.com", "IsAdmin": "no", "UserId": "testuser2" }
    { "EmailId": "test-cwdl3@cloudwick.com", "IsAdmin": "yes", "UserId": "testuser3" }
    { "EmailId": "test-cwdl4@cloudwick.com", "IsAdmin": "no", "UserId": "testuser4" }
    { "EmailId": "test-cwdl5@cloudwick.com", "IsAdmin": "yes", "UserId": "testuser5" }

Note

For JSON files, if dataset validation is enabled, the column names in the files must exactly match the column names in the dataset schema.
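A file in the invalid layout shown earlier (a formatted array enclosed in square brackets) can be converted to the valid one-record-per-line layout before upload. The following Python sketch uses only the standard `json` module; the function name `json_array_to_lines` is hypothetical, and it assumes the whole file fits in memory.

```python
import json


def json_array_to_lines(text):
    """Convert a JSON array (or a single object) to one record per line.

    The output has no enclosing square brackets and no trailing commas,
    and each record is serialized on a single line.
    """
    records = json.loads(text)
    if isinstance(records, dict):
        records = [records]  # a lone object becomes a one-line file
    return "\n".join(json.dumps(record) for record in records) + "\n"


formatted = """[
{
"EmailId": "test-cwdl@cloudwick.com",
"IsAdmin": "no",
"UserId": "testuser"
},
{
"EmailId": "test1-cwdl@cloudwick.com",
"IsAdmin": "no",
"UserId": "testuser1"
}
]"""

print(json_array_to_lines(formatted), end="")
```

Key order and key names are preserved, so the output still matches the dataset schema if the input did.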

Datasets

Create Lakeformation Datasets

Create new Datasets with Lakeformation as the target location for structured data in CSV, TSV, XLSX, JSON, or Parquet format.

Create_Lakeformation_Datasets

Load Lakeformation Datasets

Lakeformation datasets are loaded the same way as S3-Athena datasets. For details, refer to Athena Datasets.

Fine grained permissions with Lakeformation Datasets

Lakeformation datasets provide an additional layer of security for the data stored in Amorphic. Currently, two levels of access control are available: Owners and Read-only.

Owners of a dataset have full column access by default, and this access cannot be modified. Fine-grained access control can, however, be applied to Authorized Read-only members of the dataset: owners can select individual users and choose which columns each Read-only user can access.

The following examples show how user permissions are applied based on Authorized Users and Groups.

EffectivePermissions_Lakeformation_Datasets

This feature has certain limitations. For more information, refer to the Limitations section.

The animation below shows how to apply fine-grained permissions on a Lakeformation dataset:

EffectivePermissions_Lakeformation_Datasets

Fine grained permissions with Lakeformation Datasets (TBAC)

Owners of the dataset can choose Read-Only tags and specify which columns those tags can access.

Note
  • Only Read-Only tags can be restricted.
  • A user with Owner tag access can access all columns.

TBAC_LakeformationColumn_Tag_Update

Note

Read-only members of a Lakeformation dataset can only view the columns in the dataset schema for which they have been granted permissions.
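The column-level rule described in this section can be illustrated with a short Python sketch. It only models what is stated here (owners see every column; read-only members see the columns they were granted) and is not how Amorphic or Lake Formation enforces access. The function name `visible_columns` and the access-type strings are hypothetical.

```python
def visible_columns(schema_columns, access_type, granted_columns=None):
    """Return the schema columns a user can see, in schema order.

    access_type: "owner" or "read-only" (hypothetical labels).
    granted_columns: columns granted to a read-only user; required
    for "read-only" and ignored for "owner".
    """
    if access_type == "owner":
        # Owners always have full column access.
        return list(schema_columns)
    if access_type == "read-only":
        if granted_columns is None:
            raise ValueError("granted_columns is required for read-only access")
        granted = set(granted_columns)
        return [column for column in schema_columns if column in granted]
    raise ValueError(f"unknown access type: {access_type!r}")


schema = ["EmailId", "IsAdmin", "UserId"]
print(visible_columns(schema, "owner"))
print(visible_columns(schema, "read-only", ["UserId", "EmailId"]))
```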

Query Datasets

Once data is loaded into a dataset, users can query and analyze it directly from the Run Query tab.

Query_Lakeformation_Datasets

If a user has read-only permissions, the displayed results are limited to the allowed columns.

Query_Lakeformation_Datasets

Note

Lakeformation governed datasets are deprecated as of v2.3. Use Athena Iceberg datasets instead, which provide the same features and more:

  • Read the data
  • Upsert records
  • Delete records
  • Time travel and version travel queries
  • View History and Snapshots

Limitations

  • Fine-grained permissions are not supported in Groups. If a user has fine-grained permissions applied at the authorized user level and also has read-only permissions through a group, Amorphic applies the narrower permissions, i.e., the fine-grained row and column permissions set at the authorized user level. Please refer to the "How Are Permissions Applied" section above.
  • ETL and ML services in Amorphic, such as Machine Learning notebooks, ETL notebooks, and Jobs, don't support attaching read-only Lakeformation datasets to the user.
  • Views:
    • View permissions need to be aligned with Dataset permissions, i.e., the owner of the view must grant access to the underlying Lakeformation dataset before granting access to the view.
    • When the owner of the Lakeformation dataset updates the access control using authorized users or groups, querying the view fails with a message saying "view is stale; it must be re-created". The owner of the view needs to either use the CREATE OR REPLACE statement to recreate the view, or delete and re-create the view with the necessary user permissions. For more details, please check the AWS Documentation on this topic.
    • Users are not allowed to create views on top of governed datasets.
  • DMS tasks don't support loading data into Lakeformation target datasets.
  • Currently, due to character limitations on IAM policies, AWS can only register up to 500 Lakeformation datasets. If you receive the error message "DS-1061 - Failed to register dataset in the lakeformation catalog, error message: Unable to register the following path: s3://...", this may be due to the limit. As a workaround, remove unnecessary Lakeformation datasets from Amorphic and then try again to publish the new Lakeformation dataset.
  • For more information about restrictions on datasets governed by Lake Formation, refer to Governed Table restrictions in the documentation.