Version: v1.14

Data Quality checks (Beta)

Amorphic data quality checks help you 'unit-test' data to find errors early, before the data is fed to consuming systems or machine learning algorithms.

Data quality checks

Using Amorphic data quality checks, users can create a set of constraints (rules) on the columns of a structured dataset. When executed, a data quality check produces a report of which constraints succeeded and which failed. An execution is considered a failure if even one constraint fails.
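The pass/fail rule above can be illustrated with a minimal Python sketch. The `ConstraintResult` class, the `execution_status` function, and the column names are hypothetical and chosen only for illustration; they are not part of Amorphic's API. The sketch only shows that a single failed constraint marks the whole execution as failed.

```python
from dataclasses import dataclass


@dataclass
class ConstraintResult:
    """Outcome of one constraint (hypothetical structure, for illustration)."""
    constraint: str
    column: str
    passed: bool


def execution_status(results):
    """Return 'SUCCESS' only if every constraint passed, otherwise 'FAILED'."""
    return "SUCCESS" if all(r.passed for r in results) else "FAILED"


# Example report: two constraints pass, one fails.
results = [
    ConstraintResult("isComplete", "customer_id", True),
    ConstraintResult("isUnique", "customer_id", True),
    ConstraintResult("isNonNegative", "order_total", False),
]
print(execution_status(results))
```

Because one of the three constraints failed, the execution as a whole is reported as failed.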

The Amorphic data quality checks page provides options to list existing data quality checks or create a new one. You can sort the list by fields such as name, created by, and creation time.

Create Data Quality Check

You can create new data quality checks using the "Create Data Quality Check" functionality of the Amorphic application. To create a data quality check on a dataset, you need access to at least one structured dataset.

The following fields are needed to create a data quality check:

  • Data Quality Check Name: Must be 3-50 characters long and contain only alphanumeric characters and underscores (_). It must be unique across the application.
  • Description: Description of the data quality check being created.
  • Domain: Logical grouping of datasets. Selecting a domain shortlists the datasets in that domain.
  • Dataset Name: Structured dataset on which the data quality check is to be performed.
  • Auto-Constraint Suggestions Enabled: Enables or disables automatic constraint suggestions. A major hurdle in data validation is that someone needs to come up with the actual constraints to apply to the data. This can be very difficult for large, real-world datasets, especially if they are complex and contain information from many different sources. Enabling this option helps users find reasonable constraints for their data.
  • Keywords: Comma-separated list of keywords. Keywords are indexed and searchable within the application. Choose keywords that are meaningful to you and others. You can use them to flag related datasets with the same keywords so that you can easily find them later.
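The naming rule in the field list above can be pre-checked on the client side with a short Python sketch. The function name `is_valid_check_name` and the sample names are hypothetical, and the sketch assumes that "alphanumeric" means ASCII letters and digits and that uniqueness is case-sensitive; the application itself remains the authority on both points.

```python
import re

# 3-50 characters, each an ASCII letter, digit, or underscore (assumed reading of the rule).
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]{3,50}")


def is_valid_check_name(name, existing_names=()):
    """Check the format rule and that the name is not already taken."""
    return bool(NAME_PATTERN.fullmatch(name)) and name not in existing_names


print(is_valid_check_name("orders_dq_1"))               # well-formed name
print(is_valid_check_name("ab"))                        # too short
print(is_valid_check_name("orders-dq"))                 # hyphen is not allowed
print(is_valid_check_name("orders_dq", {"orders_dq"}))  # name already exists
```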

The image below shows how to create a new data quality check:

Create_data_quality_check

Users can also create a data quality check using the "Navigator", which directs the user to the data quality check creation page from anywhere in the application. To display the Navigator, double-tap the "Ctrl" key on the keyboard.

Below is a simple graphic to demonstrate Navigator.

Navigator

Edit Data Quality Check

Use the edit data quality check button to change a data quality check's metadata and to add, modify, or remove constraints.

Edit_data_quality_check

Execute a Data Quality Check

Amorphic data quality checks can be triggered on demand or on a schedule. The execute data quality check button is on the data quality check details page.

Data_quality_check_execution

On-demand execution

A data quality check can be triggered on demand using the execute button. Executions are listed under the Executions tab, as shown below:

Ondemand_data_quality_check_execution

Scheduled execution

A schedule can be created to trigger a data quality check periodically. The schedule can be enabled or disabled at any time.

Schedule_data_quality_check_execution

Stop Data Quality Check execution

A data quality check execution can be stopped using the 'Stop Execution' option under the more options icon.

Stop data quality check execution

View Data Quality Check executions

All executions of a data quality check are listed under the "Executions" tab on the data quality check details page.

Data_quality_check_executions

Clicking View Results for a particular execution shows a report of the constraints that succeeded or failed.

Data_quality_check_execution_details

Auto constraint suggestions can be viewed by clicking View Auto Suggestions on the data quality check execution.

Data_quality_check_suggestions
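To give a feel for what a constraint suggestion can look like, here is a minimal Python sketch of a profiling heuristic. It is not Amorphic's actual suggestion algorithm, which this page does not describe; the function name `suggest_constraints`, the three rules, and the sample columns are assumptions made for illustration. The idea is to inspect a column's values and propose constraints that the current data already satisfies.

```python
def suggest_constraints(column_name, values):
    """Suggest constraint names for one column from simple profile statistics.

    Illustrative heuristic only; None stands for a null value.
    """
    suggestions = []
    non_null = [v for v in values if v is not None]

    # No nulls observed: completeness is a reasonable rule to propose.
    if values and len(non_null) == len(values):
        suggestions.append(("isComplete", column_name))

    # Every non-null value occurs once: the column looks like a key.
    if non_null and len(set(non_null)) == len(non_null):
        suggestions.append(("isUnique", column_name))

    # Numeric column (bool excluded) with no negative values.
    numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    if numeric and min(non_null) >= 0:
        suggestions.append(("isNonNegative", column_name))

    return suggestions


print(suggest_constraints("order_id", [1, 2, 3]))
print(suggest_constraints("discount", [5, None, 5]))
```

Suggestions produced this way describe the data as it is today, so they are best treated as candidates to review rather than rules to accept blindly.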

Authorized Users

This tab lists the users authorized to perform operations on the data quality check. An owner (the user who created the data quality check, or any user with owner access to it) can grant access to any other user in the system.

There are two access types:

  • Owner: This user has permission to view, edit, and run the data quality check, and to provide access to it to other users.
  • Read-only: This user has limited permissions on the data quality check, such as viewing its details and running it.

Authorized Groups

This tab lists the groups authorized to perform operations on the data quality check. A group is a list of users given access to a resource. Groups are created under User Profile -> Profile & Settings -> Groups.

There are two access types:

  • Owner: This group of users has permission to view, edit, and run the data quality check, and to provide access to it to other users and groups.
  • Read-only: This group has limited permissions on the data quality check, such as viewing its details and running it.
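The two access types, applied to users and to groups, can be summarized in a small Python sketch. The names `Access`, `PERMISSIONS`, and `can`, as well as the sample users and group, are hypothetical. The sketch also assumes that a user's effective permissions are the union of their direct access and the access of every group they belong to, which this page does not state explicitly.

```python
from enum import Enum


class Access(Enum):
    OWNER = "owner"
    READ_ONLY = "read-only"


# Actions each access type allows, taken from the two lists above.
PERMISSIONS = {
    Access.OWNER: {"view", "edit", "run", "grant_access"},
    Access.READ_ONLY: {"view", "run"},
}


def can(user, action, user_access, group_access, memberships):
    """True if the user's direct access or any of their groups allows the action."""
    levels = []
    if user in user_access:
        levels.append(user_access[user])
    for group in memberships.get(user, ()):
        if group in group_access:
            levels.append(group_access[group])
    return any(action in PERMISSIONS[level] for level in levels)


user_access = {"alice": Access.OWNER}          # Authorized Users tab
group_access = {"analysts": Access.READ_ONLY}  # Authorized Groups tab
memberships = {"bob": ["analysts"]}            # bob belongs to the analysts group

print(can("alice", "edit", user_access, group_access, memberships))
print(can("bob", "run", user_access, group_access, memberships))
print(can("bob", "edit", user_access, group_access, memberships))
```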

List Data Quality Checks

Users can see the list of data quality checks they have access to. They can limit the number of results shown per page using the Results Per Page option and sort the results by a chosen field and order.

List_of_data_quality_checks

Clone Data Quality Checks

Users can clone a data quality check in Amorphic by clicking the clone button in the top right corner of the data quality check details page.

The clone data quality check page is auto-populated with the metadata of the data quality check being cloned, which reduces the effort of filling in every field required to register the data quality check.

The only field the user must change is the "Data Quality Check Name", because a data quality check cannot be created with a name that already exists. The user can edit any other field before clicking the "Submit" button at the bottom right corner of the form.

Below is a graphic showing the populated fields in the clone data quality check form:

Clone_data_quality_check

Once the user clicks the "Submit" button, a new data quality check will be created. The created data quality check will show up in the data quality checks page.

Delete Data Quality Check

A data quality check can be deleted using the "Delete" (trash) icon in the right corner of the page. Once deletion is triggered, all the related metadata is deleted immediately.

Delete_data_quality_check

Constraint Definitions

Name of the constraint | Definition of the constraint
hasMax | Creates a constraint that asserts on the maximum of the column. The column contains either a long, int or float datatype.
hasMin | Creates a constraint that asserts on the minimum of the column. The column contains either a long, int or float datatype.
hasMaxLength | Creates a constraint that asserts on the maximum length of a string datatype column.
hasMinLength | Creates a constraint that asserts on the minimum length of a string datatype column.
hasMean | Creates a constraint that asserts on the mean of the column.
hasSum | Creates a constraint that asserts on the sum of the column.
hasStandardDeviation | Creates a constraint that asserts on the standard deviation of the column.
hasApproxCountDistinct | Creates a constraint that asserts on the approximate count distinct of the given column.
isComplete | Creates a constraint that asserts on a column's completion.
isUnique | Creates a constraint that asserts on a column's uniqueness.
containsCreditCardNumber | Checks the compliance of a column against a credit card number pattern.
containsEmail | Checks the compliance of a column against an e-mail pattern.
containsURL | Checks the compliance of a column against a URL pattern.
isPositive | Creates a constraint which asserts that a column contains only values greater than 0.
containsSocialSecurityNumber | Checks the compliance of a column against the Social Security number pattern for the US.
isNonNegative | Creates a constraint which asserts that a column contains no negative values.
hasCompleteness | Creates a constraint that asserts column completion. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider.
hasEntropy | Creates a constraint that asserts on a column's entropy. Entropy is a measure of the level of information contained in a message.
hasMutualInformation | Creates a constraint that asserts on the mutual information between two columns. Mutual information describes how much information about one column can be inferred from another.
hasCorrelation | Creates a constraint that asserts on the Pearson correlation between two columns.
isLessThan | Asserts that, in each row, the value of columnA is less than the value of columnB.
isLessThanOrEqualTo | Asserts that, in each row, the value of columnA is less than or equal to the value of columnB.
isGreaterThan | Asserts that, in each row, the value of columnA is greater than the value of columnB.
isGreaterThanOrEqualTo | Asserts that, in each row, the value of columnA is greater than or equal to the value of columnB.
hasUniqueness | Creates a constraint that asserts on the uniqueness in a single or combined set of key columns. Uniqueness is the fraction of values of the column(s) that occur exactly once.
hasDistinctness | Creates a constraint on the distinctness in a single or combined set of key columns. Distinctness is the fraction of distinct values of the column(s).
hasUniqueValueRatio | Creates a constraint on the unique value ratio in a single or combined set of key columns.
haveCompleteness | Creates a constraint that asserts column completion. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider.
haveAnyCompleteness | Creates a constraint that asserts on any completion in the combined set of columns.
areComplete | Creates a constraint that asserts completion in a combined set of columns.
areAnyComplete | Creates a constraint that asserts any completion in the combined set of columns.
isContainedIn | Asserts that every non-null value in a column is contained in a set of predefined values.
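Several of the metrics behind these constraints are easy to confuse, so the following Python sketch computes them on a tiny in-memory column. It is not Amorphic's implementation. The function names are hypothetical, null handling is simplified, the unique value ratio is assumed to be unique values divided by distinct values (the table does not define it), and entropy is assumed to use the natural logarithm.

```python
from collections import Counter
import math


def completeness(values):
    """Fraction of values that are not null (None)."""
    return sum(v is not None for v in values) / len(values)


def distinctness(values):
    """Number of distinct values divided by the number of rows."""
    return len(set(values)) / len(values)


def uniqueness(values):
    """Number of values that occur exactly once divided by the number of rows."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(values)


def unique_value_ratio(values):
    """Values occurring exactly once divided by distinct values (assumed definition)."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(counts)


def entropy(values):
    """Shannon entropy of the value distribution, in nats (natural log assumed)."""
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in Counter(values).values())


def is_contained_in(values, allowed):
    """Every non-null value must be one of the allowed values."""
    return all(v in allowed for v in values if v is not None)


def is_less_than(column_a, column_b):
    """Row-wise check that each value of columnA is less than that of columnB."""
    return all(a < b for a, b in zip(column_a, column_b))


status = ["new", "new", "shipped", "returned"]
print(distinctness(status))        # 3 distinct values over 4 rows
print(uniqueness(status))          # 2 values occur once, over 4 rows
print(unique_value_ratio(status))  # 2 unique values over 3 distinct values
print(completeness(["a", None, "b", "c"]))
print(is_contained_in(status, {"new", "shipped", "returned"}))
```

In this example the same column yields three different numbers for distinctness, uniqueness, and unique value ratio, which is why the corresponding constraints need different thresholds.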