Version: v2.5

Data Profiling

info

From version 2.2 onward, encryption (in flight and at rest) is enabled for all jobs and for the catalog. All existing jobs, both user-created and system-created, were updated with encryption-related settings, and all newly created jobs have encryption enabled automatically.

Data profiling is the process of analyzing an existing data source to collect statistics or summary information about the data. It helps identify anomalies and evaluate data quality.

/img/datasets/datasets/Data_profiling.png

The image above displays a data profile for a cross-sectional MRI dataset from a sample of people diagnosed with Alzheimer's disease.

Enable Data profiling

You can enable or disable data profiling for a dataset at any time. Data profiling is available only for structured datasets (e.g., datasets hosted on S3-Athena or Redshift).

/img/datasets/datasets/enable-data-profiling.gif

The following fields are derived in a data profile:

Property | Description
Files | Number of files in the dataset.
Dataset Size | Size of the dataset in S3 when the target location is 'S3-Athena', or size of the dataset in the data warehouse when the target location is 'Redshift'.
Rows | Number of rows present in the dataset.
Duplicate Rows | Number of non-unique rows present in the dataset.
Columns | Number of columns present in the dataset.
Missing Values | Number of empty cells in the dataset.
Last Modified | Time when the dataset was last edited by a user.
Last Profiled | Time when the data profile was extracted.
Data Type | Data types inferred when the dataset is registered.
Min Value | Minimum value of each column.
Max Value | Maximum value of each column.
Sample Rows | A random sample of 10 rows from the dataset.

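
To make the row-level and column-level fields above more concrete, here is a minimal Python sketch that computes a few of them for an in-memory list of rows. It is only an illustration and not the platform's implementation: the function name `profile_rows`, the rule that `None` and empty strings count as missing, and the rule that a row is a duplicate when an identical row appeared earlier are all assumptions.

```python
import random
from collections import Counter


def profile_rows(rows, sample_size=10, seed=None):
    """Compute a few profile fields for a list of dict rows (hypothetical helper)."""
    columns = list(rows[0].keys()) if rows else []

    # Assumption: a row counts as a duplicate if an identical row occurred earlier.
    counts = Counter(tuple(row.get(c) for c in columns) for row in rows)
    duplicate_rows = sum(n - 1 for n in counts.values())

    # Assumption: None and empty strings are treated as missing values.
    missing_values = sum(
        1 for row in rows for c in columns if row.get(c) in (None, "")
    )

    # Minimum and maximum of the non-missing values in each column.
    min_max = {}
    for c in columns:
        values = [row[c] for row in rows if row.get(c) not in (None, "")]
        min_max[c] = (min(values), max(values)) if values else (None, None)

    # Random sample of up to sample_size rows.
    rng = random.Random(seed)
    sample_rows = rng.sample(rows, min(sample_size, len(rows)))

    return {
        "rows": len(rows),
        "columns": len(columns),
        "duplicate_rows": duplicate_rows,
        "missing_values": missing_values,
        "min_max": min_max,
        "sample_rows": sample_rows,
    }
```

For example, calling `profile_rows` on three rows where one row repeats another and one cell is empty reports 3 rows, 1 duplicate row, and 1 missing value.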
Note

Users with read-only access cannot view data profiling details.

Update frequency of Data Profiling

Data profiling jobs run every day at 12 AM UTC.

When is the data profile for a dataset updated?

If data profiling is enabled for a dataset and new files have been added to it in the last 24 hours, its data profile is updated accordingly. This avoids wasting resources, because the data profile stays the same when no new files are added.
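
The update rule above can be sketched as a small Python check. This is a hedged illustration only: the function `should_update_profile` and its parameters are hypothetical names, and how the platform actually tracks file additions is not described here.

```python
from datetime import datetime, timedelta, timezone


def should_update_profile(profiling_enabled, last_file_added, now):
    """Return True if the profile should be refreshed (hypothetical helper).

    profiling_enabled: whether the user enabled profiling for the dataset.
    last_file_added:   timezone-aware time of the most recent file addition,
                       or None if no file has ever been added.
    now:               timezone-aware time at which the daily run starts.
    """
    if not profiling_enabled or last_file_added is None:
        return False
    # Refresh only if a file was added within the last 24 hours.
    age = now - last_file_added
    return timedelta(0) <= age <= timedelta(hours=24)


run_time = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)  # arbitrary example date
recent = run_time - timedelta(hours=5)
stale = run_time - timedelta(days=3)
print(should_update_profile(True, recent, run_time))  # True
print(should_update_profile(True, stale, run_time))   # False
```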

Concurrency of data profiling jobs

Currently, all data profiling jobs run with a concurrency factor of 5.

How long does it take for all data profiles to get updated?

For example, suppose 100 datasets are to be profiled and each takes approximately 3 minutes (the actual time depends on the dataset size). With a concurrency factor of 5, the jobs run in 100 / 5 = 20 batches, so the total time is about 20 * 3 minutes, or 60 minutes. In that case, all data profiles should be updated by 1:00 AM UTC.
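
The same estimate can be reproduced with a short Python sketch. It assumes, as the example does, that every dataset takes the same time and that jobs run in full batches of the concurrency factor; the function name `estimate_profiling_minutes` is hypothetical, and real run times will vary with dataset size.

```python
import math
from datetime import datetime, timedelta, timezone


def estimate_profiling_minutes(num_datasets, minutes_per_dataset=3, concurrency=5):
    """Estimate the total run time in minutes (hypothetical helper)."""
    # Jobs run in batches of `concurrency`; a partial batch still takes a full slot.
    batches = math.ceil(num_datasets / concurrency)
    return batches * minutes_per_dataset


start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)  # 12 AM UTC, arbitrary date
total = estimate_profiling_minutes(100)
finish = start + timedelta(minutes=total)
print(total, finish.strftime("%H:%M UTC"))  # prints: 60 01:00 UTC
```

Under these assumptions, one extra dataset (101 in total) adds a 21st batch and pushes the estimate to 63 minutes.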

What happens in case of failures?

If a data profile cannot be extracted, an error is displayed on the profile tab and an email alert is sent to the subscribed user.

Data profile failure