Version: v2.2

Data Profiling

info

From version 2.2, encryption (in-flight and at-rest) is enabled for all jobs and the catalog. All existing jobs (both user-created and system-created) have been updated with encryption-related settings, and all newly created jobs will have encryption enabled automatically.

Data profiling is the process of analyzing an existing data source to collect statistics or summaries about the data. It helps to identify anomalies and evaluate data quality.

/img/datasets/datasets/Data_profiling.png

The image above displays a data profile for a cross-sectional MRI dataset from a sample of people diagnosed with Alzheimer's disease.

Enable Data profiling

Data profiling can be enabled or disabled for a dataset at any time. It is only available for structured datasets (e.g. datasets hosted on S3-Athena or Redshift).

/img/datasets/datasets/enable-data-profiling.gif

The following fields are derived in a data profile:

| Property | Description |
| --- | --- |
| Files | Number of files in the dataset. |
| Dataset Size | Size of the dataset on S3 for datasets with target location 'S3-Athena', and size on the data warehouse for datasets with target location 'Redshift'. |
| Rows | Number of rows present in the dataset. |
| Duplicate Rows | Number of non-unique rows present in the dataset. |
| Columns | Number of columns present in the dataset. |
| Missing Values | Number of empty cells in the dataset. |
| Last Modified | Time when the dataset was last edited by a user. |
| Last Profiled | Time when the data profile was extracted. |
| Data Type | Data types inferred when a dataset is registered. |
| Min Value | Minimum value of each column. |
| Max Value | Maximum value of each column. |
| Sample Rows | A random sample of 10 rows from the dataset. |

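
To make a few of the fields above concrete, here is a minimal Python sketch that computes Rows, Columns, Duplicate Rows, Missing Values, and Min/Max Value for a small in-memory CSV. It is an illustration only and not the platform's profiling job: the function names and the sample columns (`subject_id`, `age`, `mmse`) are made up, and it assumes that "duplicate rows" means rows beyond the first occurrence of an identical row and that an empty string is a missing value.

```python
import csv
import io

# Hypothetical sample data; the column names are invented for illustration.
SAMPLE_CSV = """subject_id,age,mmse
S1,74,29
S2,55,
S1,74,29
S3,,27
"""


def column_min_max(values):
    """Return (min, max) of the non-empty values, numerically when possible."""
    present = [v for v in values if v != ""]
    if not present:
        return None, None
    try:
        numeric = [float(v) for v in present]
        return min(numeric), max(numeric)
    except ValueError:
        # Fall back to string comparison for non-numeric columns.
        return min(present), max(present)


def build_profile(csv_text):
    """Compute a handful of profile fields for CSV text (illustrative only)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = [tuple(row) for row in reader]
    columns = list(zip(*rows)) if rows else [() for _ in header]
    return {
        "rows": len(rows),
        "columns": len(header),
        # Assumption: count rows beyond the first copy of each identical row.
        "duplicate_rows": len(rows) - len(set(rows)),
        # Assumption: an empty cell is represented by an empty string.
        "missing_values": sum(cell == "" for row in rows for cell in row),
        "min_max": {name: column_min_max(col) for name, col in zip(header, columns)},
    }


profile = build_profile(SAMPLE_CSV)
print(profile["rows"], profile["duplicate_rows"], profile["missing_values"])
```

For the sample data this reports 4 rows, 1 duplicate row, and 2 missing values. The actual profile may define duplicates or missing values differently.
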
Note

Users with read-only access cannot view data profiling details.

Update frequency of Data Profiling

Data profiling jobs run every day at 12:00 AM UTC.

When is the data profile for a dataset updated?

The data profile for a dataset is updated only if the user has enabled data profiling and there have been additions to the dataset in the last 24 hours. This avoids wasting resources, since the data profile remains the same if no new files are added.
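
The update rule above can be expressed as a small predicate. This is a sketch under stated assumptions, not the platform's implementation: the function name `should_update_profile` and its parameters are hypothetical, and it assumes the time of the most recent file addition is known as a timezone-aware timestamp.

```python
from datetime import datetime, timedelta, timezone


def should_update_profile(profiling_enabled, last_file_added_at, now):
    """Return True if the nightly run should refresh this dataset's profile.

    Hypothetical helper: profiling must be enabled and at least one file
    must have been added within the 24 hours before `now`.
    """
    if not profiling_enabled or last_file_added_at is None:
        return False
    age = now - last_file_added_at
    return timedelta(0) <= age <= timedelta(hours=24)


# Example: a nightly run at midnight UTC.
run_time = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
recent_add = datetime(2024, 1, 1, 15, 30, tzinfo=timezone.utc)
old_add = datetime(2023, 12, 30, 9, 0, tzinfo=timezone.utc)

print(should_update_profile(True, recent_add, run_time))   # file added 8.5 hours ago
print(should_update_profile(True, old_add, run_time))      # no additions in the last 24 hours
print(should_update_profile(False, recent_add, run_time))  # profiling disabled
```
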

Concurrency of data profiling jobs

Currently, all data profiling jobs run with a concurrency factor of 5, meaning that up to five profiling jobs run at the same time.

How long does it take for all data profiles to get updated?

If there are 100 datasets to be profiled, each taking approximately 3 minutes (the actual time depends on the dataset size), then with a concurrency factor of 5 the jobs run in 100 / 5 = 20 batches, for a total of 20 * 3 minutes, or 60 minutes. In this example, all data profiles should be updated by 1:00 AM UTC.
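
The same estimate can be written as a short calculation. The sketch below assumes that jobs run in equal-length batches of up to five and that every job takes the same time; both are simplifications, and the function name is hypothetical.

```python
import math
from datetime import datetime, timedelta, timezone


def estimated_total_minutes(dataset_count, minutes_per_dataset, concurrency=5):
    """Rough total runtime, assuming equal-length batches of `concurrency` jobs."""
    batches = math.ceil(dataset_count / concurrency)
    return batches * minutes_per_dataset


total = estimated_total_minutes(100, 3)  # 20 batches * 3 minutes
start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)  # nightly run at 12:00 AM UTC
finish = start + timedelta(minutes=total)

print(total)                     # 60
print(finish.strftime("%H:%M"))  # 01:00
```

Because real profiling times vary with dataset size, treat the result as an approximation rather than a guaranteed completion time.
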

What happens in case of failures?

If a data profile fails to be extracted, an error will be displayed on the profile tab and an email alert will be sent to the subscribed user.

Data profile failure