Version: v2.2

Data Profiling

info

From version 2.2, encryption (in-flight and at-rest) is enabled for all jobs and the catalog. All existing jobs (both user-created and system-created) have been updated with encryption-related settings, and all newly created jobs will have encryption enabled automatically.

Data profiling is the process of analyzing an existing data source to collect statistics or summaries about the data. It helps to identify anomalies and evaluate data quality.

/img/datasets/datasets/Data_profiling.png

The image above displays a data profile for a cross-sectional MRI dataset from a sample of people diagnosed with Alzheimer's disease.

Enable Data profiling

Data profiling can be enabled or disabled for a dataset at any time. It is available only for structured datasets (e.g. datasets hosted on S3-Athena or Redshift).

/img/datasets/datasets/enable-data-profiling.gif

The following fields are derived in a data profile:

Property	Description
Files	Number of files in the dataset.
Dataset Size	Size of the dataset on S3 for datasets with target location 'S3-Athena', and size in the data warehouse for datasets with target location 'Redshift'.
Rows	Number of rows present in the dataset.
Duplicate Rows	Number of non-unique rows present in the dataset.
Columns	Number of columns present in the dataset.
Missing Values	Number of empty cells in the dataset.
Last Modified	Time when the dataset was last edited by a user.
Last Profiled	Time when the data profile was last extracted.
Data Type	Data types inferred when a dataset is registered.
Min Value	Minimum value of each column.
Max Value	Maximum value of each column.
Sample Rows	A random sample of 10 rows from the dataset.
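
To make the row-level properties in the table concrete, here is a minimal sketch of how statistics such as Rows, Duplicate Rows, Columns, Missing Values, Min Value, Max Value, and Sample Rows could be computed for a small in-memory table. It is an illustration only, not the platform's implementation: the function name `profile_rows`, the list-of-dicts input, the choice to treat `None` and empty strings as missing, and the choice to count each repeat of an earlier row as one duplicate are all assumptions.

```python
import random
from collections import Counter


def profile_rows(rows, sample_size=10, seed=None):
    """Compute profile-style statistics for a list of dict rows (hypothetical helper)."""
    columns = list(rows[0].keys()) if rows else []

    # Assumption: a "duplicate" is any row that repeats an earlier identical row.
    counts = Counter(tuple(row.get(c) for c in columns) for row in rows)
    duplicate_rows = sum(n - 1 for n in counts.values())

    # Assumption: None and empty strings both count as missing cells.
    def is_missing(value):
        return value is None or value == ""

    missing_values = sum(1 for row in rows for c in columns if is_missing(row.get(c)))

    # Min and max per column, ignoring missing cells.
    min_values, max_values = {}, {}
    for c in columns:
        present = [row.get(c) for row in rows if not is_missing(row.get(c))]
        if present:
            min_values[c] = min(present)
            max_values[c] = max(present)

    # Random sample of up to `sample_size` rows.
    rng = random.Random(seed)
    sample_rows = rng.sample(rows, min(sample_size, len(rows)))

    return {
        "rows": len(rows),
        "columns": len(columns),
        "duplicate_rows": duplicate_rows,
        "missing_values": missing_values,
        "min_values": min_values,
        "max_values": max_values,
        "sample_rows": sample_rows,
    }


example = [
    {"id": 1, "age": 70},
    {"id": 2, "age": None},
    {"id": 1, "age": 70},
]
profile = profile_rows(example, seed=0)
print(profile["rows"], profile["duplicate_rows"], profile["missing_values"])
```

The sketch assumes each column holds comparable scalar values; file-level properties such as Files and Dataset Size depend on the storage layer and are not modelled here.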

Note

Users with read-only access will not be able to view data profiling details.

Update frequency of Data Profiling

Data profiling jobs run at 12 AM UTC every day.

When is the data profile for a dataset updated?

If data profiling is enabled by the user and there have been additions to the dataset in the last 24 hours, the data profile will be updated. This avoids wasting resources, since the data profile remains the same if no new files are added.
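
The rule above can be expressed as a small predicate. This is a hedged sketch, not the scheduler's actual code: the function name `should_update_profile` and its parameters are hypothetical, and it assumes the 24-hour window is measured back from the time the daily job runs.

```python
from datetime import datetime, timedelta, timezone


def should_update_profile(profiling_enabled, last_addition, run_time,
                          window=timedelta(hours=24)):
    """Return True if a dataset's profile should be refreshed in this run.

    profiling_enabled: whether the user has enabled profiling for the dataset.
    last_addition: timezone-aware time of the most recent addition, or None.
    run_time: timezone-aware time at which the profiling job runs.
    """
    if not profiling_enabled or last_addition is None:
        return False
    age = run_time - last_addition
    # Refresh only if the latest addition falls inside the look-back window.
    return timedelta(0) <= age <= window


run = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)  # a 12 AM UTC run
recent = run - timedelta(hours=5)
stale = run - timedelta(days=3)
print(should_update_profile(True, recent, run))   # addition 5 hours ago
print(should_update_profile(True, stale, run))    # addition 3 days ago
print(should_update_profile(False, recent, run))  # profiling disabled
```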

Concurrency of data profiling jobs

Currently, all data profiling jobs run with a concurrency factor of 5, meaning that up to 5 datasets are profiled at the same time.

How long does it take for all data profiles to get updated?

If there are 100 datasets to be profiled, each taking approximately 3 minutes (the actual time depends on the dataset size), a concurrency factor of 5 means the jobs run in 100 / 5 = 20 batches, for a total of 20 * 3 = 60 minutes. Since profiling starts at 12 AM UTC, all data profiles should be updated by about 1:00 AM UTC.
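
The same estimate can be reproduced for other dataset counts with a short calculation. This is a rough sketch under simplifying assumptions: the function name `estimated_total_minutes` is hypothetical, every dataset is assumed to take the same time, and jobs are assumed to run in full batches of the concurrency factor.

```python
import math


def estimated_total_minutes(dataset_count, minutes_per_dataset, concurrency=5):
    """Estimate the wall-clock minutes needed to profile all datasets."""
    # Number of batches needed when `concurrency` jobs run side by side.
    batches = math.ceil(dataset_count / concurrency)
    return batches * minutes_per_dataset


print(estimated_total_minutes(100, 3))  # 20 batches * 3 minutes = 60
```

In practice datasets vary in size, so the real completion time can be shorter or longer than this figure.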

What happens in case of failures?

If a data profile fails to be extracted, an error will be displayed on the profile tab and an email alert will be sent to the subscribed user.

Data profile failure
