Version: v2.0

Data Quality Checks

Amorphic data quality checks help you 'unit-test' your data to find errors early, before the data is fed to consuming systems or machine learning algorithms.

With Amorphic data quality checks, users can create a set of constraints (rules) on the columns of a structured dataset. When a check is executed, it produces a report showing which constraints succeeded and which failed. An execution is considered a failure if even one constraint fails.
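
The pass/fail rule above can be illustrated with a minimal sketch. The `ConstraintResult` class, the `execution_status` function, and the column names are hypothetical and only model the rule; they are not part of the Amorphic API.

```python
from dataclasses import dataclass


@dataclass
class ConstraintResult:
    name: str     # constraint name, e.g. "isComplete"
    column: str   # column the constraint was applied to (hypothetical names below)
    passed: bool  # True if the constraint succeeded


def execution_status(results):
    """Return 'failed' if any single constraint failed, otherwise 'succeeded'."""
    return "succeeded" if all(r.passed for r in results) else "failed"


results = [
    ConstraintResult("isComplete", "order_id", True),
    ConstraintResult("isUnique", "order_id", True),
    ConstraintResult("isNonNegative", "amount", False),
]

# One failed constraint is enough to fail the whole execution.
print(execution_status(results))  # failed
```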

The Amorphic data quality checks page lets you list existing data quality checks or create a new one. You can sort the list by attributes such as name, created by, and creation time.

Create Data Quality Check

You can create a new data quality check by using the "Create Data Quality Check" functionality of the Amorphic application. To create one, you need access to at least one structured dataset.

The following fields are needed to create a data quality check:

  • Data Quality Check Name: The name must be 3-50 characters long and may contain only alphanumeric characters and underscores (_). It must be unique across the application.
  • Description: Description of the data quality check being created.
  • Domain: Logical grouping of datasets. Selecting a domain narrows the dataset list to the datasets in that domain.
  • Dataset Name: Structured dataset on which data quality check is to be performed.
  • Auto-Constraint Suggestions Enabled: Enables or disables suggestion of auto constraints. A major hurdle in data validation is that someone needs to come up with the actual constraints to apply on the data. This can be very difficult for large, real-world datasets, especially if they are very complex and contain information from a lot of different sources. Enabling this functionality assists users in finding reasonable constraints for their data.
  • Keywords: Comma-separated list of keywords. Keywords are indexed and searchable within the application. Choose keywords that are meaningful to you and others. You can use them to flag related datasets with the same keywords so that you can easily find them later.
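
The naming rule for the Data Quality Check Name field can be pre-checked on the client side with a sketch like the following. The function name `is_valid_check_name` is hypothetical, and two details are assumptions: "alphanumeric" is read as ASCII letters and digits, and uniqueness is compared case-sensitively. The application performs the authoritative validation.

```python
import re

# 3-50 characters, letters, digits and underscore only (assumed ASCII).
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]{3,50}")


def is_valid_check_name(name, existing_names=()):
    """Check the length/character rule and uniqueness against known names."""
    return bool(NAME_PATTERN.fullmatch(name)) and name not in existing_names


existing = {"orders_dq"}
print(is_valid_check_name("customer_dq_v2", existing))  # True
print(is_valid_check_name("dq", existing))              # False: too short
print(is_valid_check_name("orders-dq", existing))       # False: '-' not allowed
print(is_valid_check_name("orders_dq", existing))       # False: already in use
```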

Users can also create a data quality check through the "Navigator", which opens the data quality check creation page from anywhere in the application. To display the Navigator, double-tap the "Ctrl" key on the keyboard.

Edit Data Quality Check

Use the edit data quality check button to change a data quality check's metadata and to add, modify, or remove constraints.

Execute a Data Quality Check

Amorphic data quality checks can be triggered on demand or on a schedule. The Execute data quality check button is on the data quality check details page.

Stop Data Quality Check execution

A data quality check execution can be stopped by using the 'Stop Execution' option under the more options icon.

View Data Quality Check executions

Clicking View Results for a particular execution shows a report of the constraints that succeeded and failed.

Auto constraint suggestions can be viewed by clicking View Auto Suggestions on the data quality check execution.

Clone Data Quality Checks

Users can clone a data quality check in Amorphic by clicking the clone button in the top right corner of the data quality check details page.

The clone data quality check page is auto-populated with the metadata of the data quality check being cloned, so you do not have to fill in every required field again. You only have to give it a unique name.

Delete Data Quality Check

A data quality check can be deleted using the "Delete" (trash) icon in the right corner of the page. Once deletion is triggered, all related metadata is deleted immediately.

Authorized Users

This tab shows the list of users authorized to perform operations on the data quality check. An owner, that is, the user who created the data quality check or any user with owner access to it, can grant access to any other user in the system.

There are two access types:

  • Owner: This user has permission to view, edit, and run the data quality check, and to grant other users access to it.
  • Read-only: This user has limited permissions on the data quality check, such as viewing its details and running it.
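
The two access types can be summarized as a small permission table. This is only an illustrative sketch: the `PERMISSIONS` mapping, the action names, and the `is_allowed` helper are hypothetical, and the table lists only the operations named above.

```python
# Hypothetical action names; only the operations listed above are modeled.
PERMISSIONS = {
    "owner": {"view", "edit", "run", "grant_access"},
    "read-only": {"view", "run"},
}


def is_allowed(access_type, action):
    """Return True if the given access type includes the action."""
    return action in PERMISSIONS.get(access_type, set())


print(is_allowed("owner", "edit"))      # True
print(is_allowed("read-only", "run"))   # True
print(is_allowed("read-only", "edit"))  # False
```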

Authorized Groups

This tab shows the list of groups authorized to perform operations on data quality checks. A group is a list of users given access to a resource. Groups are created under User Profile -> Profile & Settings -> Groups.

There are two access types:

  • Owner: This group of users has permission to view, edit, and run the data quality checks, and to grant other users or groups access to them.
  • Read-only: This group has limited permissions on data quality checks, such as viewing the data quality check details and running the checks.

Constraint Definitions

The following constraints are available. Each entry gives the name of the constraint followed by its definition:

  • hasMax: Creates a constraint that asserts on the maximum of the column. The column contains either a long, int or float datatype.
  • hasMin: Creates a constraint that asserts on the minimum of the column. The column contains either a long, int or float datatype.
  • hasMaxLength: Creates a constraint that asserts on the maximum length of a string datatype column.
  • hasMinLength: Creates a constraint that asserts on the minimum length of a string datatype column.
  • hasMean: Creates a constraint that asserts on the mean of the column.
  • hasSum: Creates a constraint that asserts on the sum of the column.
  • hasStandardDeviation: Creates a constraint that asserts on the standard deviation of the column.
  • hasApproxCountDistinct: Creates a constraint that asserts on the approximate count distinct of the given column.
  • isComplete: Creates a constraint that asserts on a column completion.
  • isUnique: Creates a constraint that asserts on a column uniqueness.
  • containsCreditCardNumber: Check to run against the compliance of a column against a credit card pattern.
  • containsEmail: Check to run against the compliance of a column against an e-mail pattern.
  • containsURL: Check to run against the compliance of a column against a URL pattern.
  • isPositive: Creates a constraint which asserts that every value in a column is greater than 0.
  • containsSocialSecurityNumber: Check to run against the compliance of a column against the Social Security number pattern for the US.
  • isNonNegative: Creates a constraint which asserts that a column contains no negative values.
  • hasCompleteness: Creates a constraint that asserts on a column completion. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider.
  • hasEntropy: Creates a constraint that asserts on a column entropy. Entropy is a measure of the level of information contained in a message.
  • hasMutualInformation: Creates a constraint that asserts on the mutual information between two columns. Mutual information describes how much information about one column can be inferred from another.
  • hasCorrelation: Creates a constraint that asserts on the Pearson correlation between two columns.
  • isLessThan: Asserts that, in each row, the value of columnA is less than the value of columnB.
  • isLessThanOrEqualTo: Asserts that, in each row, the value of columnA is less than or equal to the value of columnB.
  • isGreaterThan: Asserts that, in each row, the value of columnA is greater than the value of columnB.
  • isGreaterThanOrEqualTo: Asserts that, in each row, the value of columnA is greater than or equal to the value of columnB.
  • hasUniqueness: Creates a constraint that asserts on the uniqueness in a single or combined set of key columns. Uniqueness is the fraction of values in the column(s) that occur exactly once.
  • hasDistinctness: Creates a constraint on the distinctness in a single or combined set of key columns. Distinctness is the fraction of distinct values of the column(s).
  • hasUniqueValueRatio: Creates a constraint on the unique value ratio in a single or combined set of key columns.
  • haveCompleteness: Creates a constraint that asserts on column completion. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider.
  • haveAnyCompleteness: Creates a constraint that asserts on any completion in the combined set of columns.
  • areComplete: Creates a constraint that asserts completion in the combined set of columns.
  • areAnyComplete: Creates a constraint that asserts any completion in the combined set of columns.
  • isContainedIn: Asserts that every non-null value in a column is contained in a set of predefined values.
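
To make the metrics behind a few of these constraints concrete, the sketch below computes them in plain Python on a small sample. It is illustrative only and is not how Amorphic evaluates constraints. The function names are hypothetical, and several details are assumptions: completeness is taken as the fraction of non-null values, distinctness and uniqueness are divided by the number of rows, the unique value ratio is taken as the number of values occurring exactly once divided by the number of distinct values, and entropy uses a base-2 logarithm. Inputs are assumed to be non-empty.

```python
import math
from collections import Counter


def completeness(values):
    """Fraction of values that are not null (None)."""
    return sum(v is not None for v in values) / len(values)


def distinctness(values):
    """Number of distinct values divided by the number of rows."""
    return len(set(values)) / len(values)


def uniqueness(values):
    """Number of values occurring exactly once divided by the number of rows."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(values)


def unique_value_ratio(values):
    """Values occurring exactly once divided by the number of distinct values."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(counts)


def entropy(values):
    """Shannon entropy of the value distribution, in bits (base 2 is assumed)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


status = ["new", "new", "paid", "void"]      # hypothetical column without nulls
emails = ["a@x.io", None, "b@x.io", "c@x.io"]  # hypothetical column with one null

print(distinctness(status))        # 0.75 (3 distinct values / 4 rows)
print(uniqueness(status))          # 0.5  ("paid" and "void" occur once / 4 rows)
print(unique_value_ratio(status))  # 2 values occurring once / 3 distinct values
print(entropy(status))             # 1.5
print(completeness(emails))        # 0.75

# A constraint "asserts on" a metric: it compares the metric with a condition.
print(completeness(emails) >= 0.7)  # True
```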