Version: v1.13

Athena Datasets

The Amorphic Dataset portal helps you create structured, semi-structured, and unstructured Datasets. These Datasets can serve as a single source of truth across the departments of an organization. Amorphic Datasets give complete visibility into the data held in the data lake.

Datasets

The Amorphic Dataset page provides options to list existing Datasets or create a new one. Existing Datasets are shown in a list from which you can select one. You can filter the list by Domain, create new Datasets, or view Dataset details.

This page explains how to use Datasets with Athena as the target location. Amorphic lets users store CSV, TSV, and XLSX files in S3 cost-effectively, without the overhead of maintaining a data warehousing solution.

With Athena as the target location, Amorphic offers partial data validation of uploaded files. Data Validation is enabled by default for the s3Athena target location and can be enabled or disabled at any time. Each file is sampled (read partially), and every column is validated against the schema that was uploaded when the Dataset was registered. This helps users quickly detect and correct corrupt or invalid data files, but validation takes a few extra seconds per file and incurs an additional charge per file.

Currently, users can register structured data (CSV, TSV, XLSX, and Parquet files), and validation covers the data types String/Varchar, Integer, Double, Boolean, Date, and Timestamp. To accommodate complex data structures, we recommend enclosing them in quote characters and registering the column as String/Varchar. Once the data is loaded, users can run ETL on top of it and cast the columns appropriately.
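The sampling-and-validation idea described above can be sketched as follows. This is only an illustration, not Amorphic's actual implementation: the function name `validate_sample`, the type names in `PARSERS`, the sample size, and the date and timestamp formats are all assumptions made for the example.

```python
import csv
import io
from datetime import datetime
from itertools import islice


def _parse_bool(value):
    # Assumed boolean spelling: "true" / "false", case-insensitive.
    if value.strip().lower() not in {"true", "false"}:
        raise ValueError(value)


# Hypothetical type names mapped to parsers that raise ValueError on bad input.
PARSERS = {
    "string": str,
    "integer": int,
    "double": float,
    "boolean": _parse_bool,
    "date": lambda v: datetime.strptime(v, "%Y-%m-%d"),
    "timestamp": lambda v: datetime.strptime(v, "%Y-%m-%d %H:%M:%S"),
}


def validate_sample(text, schema, sample_rows=100, delimiter=","):
    """Check the first `sample_rows` data rows of a delimited file.

    `schema` is a list of (column_name, type_name) pairs in column order.
    Returns a list of (line_number, column_name, bad_value) tuples.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader, None)  # skip the header row
    errors = []
    # Line numbers start at 2 because line 1 is the header.
    for line_number, row in enumerate(islice(reader, sample_rows), start=2):
        if len(row) != len(schema):
            errors.append((line_number, None, row))
            continue
        for (name, type_name), value in zip(schema, row):
            try:
                PARSERS[type_name](value)
            except ValueError:
                errors.append((line_number, name, value))
    return errors
```

Calling `validate_sample(file_text, [("id", "integer"), ("price", "double")])` returns an empty list when every sampled value parses, and otherwise one tuple per offending cell. Because only the first rows are read, errors further down a large file would go unnoticed, which is the trade-off of partial validation.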

The CSV Parser/SerDe recommended by AWS Athena has the following limitations:

  • Does not support embedded line breaks in CSV files.
  • Does not support empty fields in columns defined as a numeric data type.

As per the AWS documentation, one workaround is to import such columns as strings and create views on top of the table that cast them to the required data types.
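As a rough sketch of this workaround, the helper below builds a view definition that casts string columns to their intended types. The helper name `build_cast_view_sql` and the table, view, and column names are hypothetical. Wrapping each column in `NULLIF(..., '')` so that empty fields become NULL before the cast is one possible approach, not necessarily what the AWS documentation prescribes.

```python
def build_cast_view_sql(view_name, table_name, column_types):
    """Build a CREATE VIEW statement that casts string columns.

    `column_types` is a list of (column_name, target_type) pairs.
    Columns whose target type is STRING or VARCHAR are passed through.
    """
    select_items = []
    for name, target_type in column_types:
        quoted = f'"{name}"'
        if target_type.lower() in ("string", "varchar"):
            select_items.append(quoted)
        else:
            # NULLIF maps empty strings to NULL so the cast does not fail on them.
            select_items.append(
                f"CAST(NULLIF({quoted}, '') AS {target_type}) AS {quoted}"
            )
    columns = ",\n  ".join(select_items)
    return (
        f'CREATE OR REPLACE VIEW "{view_name}" AS\n'
        f"SELECT\n  {columns}\n"
        f'FROM "{table_name}"'
    )


# Hypothetical names, for illustration only.
print(build_cast_view_sql(
    "sales_typed",
    "sales_raw",
    [("id", "INTEGER"), ("amount", "DOUBLE"), ("customer", "VARCHAR")],
))
```

The generated statement could then be run as a query against the Dataset. Values that are non-empty but still not castable would make the query fail, so `TRY_CAST` may be preferable where bad values should become NULL instead.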

Create Athena Datasets

You can create new Datasets with a wide range of target locations. This section describes using Athena as the target location. Currently, only structured data in CSV or XLSX file formats is supported. The following animation shows a detailed workflow for creating Datasets with Athena as the target.

Create_Athena_Datasets

Load Athena Datasets

Athena Datasets give users a cost-effective way to store their structured data. All loaded Datasets are immediately available for analysis using the Run Query tab in the Amorphic console. An advantage of using Athena Datasets in Amorphic is automatic data validation of each uploaded file, without the need for any additional ETL.

The following animation shows a detailed workflow for loading data into Athena Datasets.

Load_Athena_Datasets

Query Datasets

Once data has been loaded into an Athena Dataset, it is readily available to query and analyze directly from the Run Query tab. The following animation shows how a user can run a sample query on Athena Datasets.

Query_Athena_Datasets