Version: v2.5

Data Quality Checks

info

From version 2.2, encryption (in-flight and at-rest) is enabled for all jobs and the catalog. All existing jobs (both user-created and system-created) were updated with encryption-related settings, and all newly created jobs have encryption enabled automatically.

Amorphic provides data quality checks that help you detect errors in your data before it is utilized by other systems or machine learning algorithms. You can create rules for the columns of your structured datasets, and then run the checks to see if there are any rules that are broken. If one rule is broken, the whole check will fail. In the Amorphic data quality checks page, you can view a list of checks, create new checks, and sort through the list of checks using various criteria such as name, creator, and creation time.

How to Create a Data Quality Check


To create a new data quality check:

  1. Click on Create Data Quality Check. To create a new data quality check on a dataset, you need access to at least one structured dataset.
  2. Fill in the fields shown below:

| Property | Description |
| --- | --- |
| Data Quality Check Name | The name must be 3 to 120 characters long and may contain only alphanumeric characters and underscores (_). It must be unique across the application. |
| Description | Description of the data quality check being created. |
| Domain | Logical grouping of datasets. Selecting a domain shortlists the datasets that belong to it. |
| Dataset Name | Structured dataset on which the data quality check is to be performed. |
| Auto-Constraint Suggestions Enabled | Enables or disables automatic constraint suggestions. Finding suitable constraints can be challenging for large and complex datasets that contain information from multiple sources, and enabling this option helps users find suitable constraints for their data. |
| Keywords | Comma-separated keywords used to index and search within the application. Use keywords to flag related datasets so they are easier to find later. |
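
The naming rule above can be checked locally before submitting the form. The following is a minimal sketch, assuming the rule means "3 to 120 characters, each a letter, digit, or underscore"; the pattern and the helper name `is_valid_check_name` are illustrative and are not part of Amorphic. Uniqueness across the application cannot be verified this way and is still enforced by Amorphic itself.

```python
import re

# Assumed reading of the naming rule: 3 to 120 characters, each one a letter,
# digit, or underscore. The pattern and helper name are illustrative only.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]{3,120}")


def is_valid_check_name(name: str) -> bool:
    """Return True if the name satisfies the assumed naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


print(is_valid_check_name("customer_email_check"))  # True
print(is_valid_check_name("dq"))                     # False: too short
print(is_valid_check_name("sales check!"))           # False: space and "!"
```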

Edit Data Quality Check

You can modify, add, or remove constraints in a data quality check's metadata using the "Edit Data Quality Check" button.

Execute a Data Quality Check

You can execute data quality checks either on demand or on a schedule. Once a data quality check completes, you will receive an email and a push notification with the execution results.


Stop Data Quality Check Execution

A data quality check execution can be stopped by using the 'Stop Execution' option under the more options icon.

View Data Quality Check Executions


You can view the results of a particular execution. The report displays the number of constraints that succeeded and the number that failed.


To view auto constraint suggestions, click on View Auto Suggestions during a data quality check execution.

Clone Data Quality Checks

When you clone a data quality check, Amorphic auto-populates the clone page with the original's metadata. You only need to give the clone a unique name.

Constraint Definitions

| Name of the constraint | Definition of the constraint |
| --- | --- |
| hasMax | Creates a constraint that asserts on the maximum value of a column. The column must have a long, int, or float datatype. |
| hasMin | Creates a constraint that asserts on the minimum value of a column. The column must have a long, int, or float datatype. |
| hasMaxLength | Creates a constraint that asserts on the maximum length of a string datatype column. |
| hasMinLength | Creates a constraint that asserts on the minimum length of a string datatype column. |
| hasMean | Creates a constraint that asserts on the mean of the column. |
| hasSum | Creates a constraint that asserts on the sum of the column. |
| hasStandardDeviation | Creates a constraint that asserts on the standard deviation of the column. |
| hasApproxCountDistinct | Creates a constraint that asserts on the approximate distinct count of the given column. |
| isComplete | Creates a constraint that asserts that a column is complete, that is, contains no null values. |
| isUnique | Creates a constraint that asserts that a column's values are unique. |
| containsCreditCardNumber | Checks the compliance of a column's values with a credit card number pattern. |
| containsEmail | Checks the compliance of a column's values with an email pattern. |
| containsURL | Checks the compliance of a column's values with a URL pattern. |
| isPositive | Creates a constraint which asserts that a column contains only values greater than 0. |
| containsSocialSecurityNumber | Checks the compliance of a column's values with the US Social Security number pattern. |
| isNonNegative | Creates a constraint which asserts that a column contains no negative values. |
| hasCompleteness | Creates a constraint that asserts on column completion. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider. |
| hasEntropy | Creates a constraint that asserts on a column's entropy. Entropy is a measure of the level of information contained in a message. |
| hasMutualInformation | Creates a constraint that asserts on the mutual information between two columns. Mutual information describes how much information about one column can be inferred from another. |
| hasCorrelation | Creates a constraint that asserts on the Pearson correlation between two columns. |
| isLessThan | Asserts that, in each row, the value of columnA is less than the value of columnB. |
| isLessThanOrEqualTo | Asserts that, in each row, the value of columnA is less than or equal to the value of columnB. |
| isGreaterThan | Asserts that, in each row, the value of columnA is greater than the value of columnB. |
| isGreaterThanOrEqualTo | Asserts that, in each row, the value of columnA is greater than or equal to the value of columnB. |
| hasUniqueness | Creates a constraint that asserts on the uniqueness of a single column or a combined set of key columns. Uniqueness is the fraction of values in the column(s) that occur exactly once. |
| hasDistinctness | Creates a constraint on the distinctness of a single column or a combined set of key columns. Distinctness is the fraction of distinct values in the column(s). |
| hasUniqueValueRatio | Creates a constraint on the unique value ratio of a single column or a combined set of key columns. |
| haveCompleteness | Creates a constraint that asserts on column completion. Uses the given history selection strategy to retrieve historical completeness values on this column from the history provider. |
| haveAnyCompleteness | Creates a constraint that asserts on any completion in the combined set of columns. |
| areComplete | Creates a constraint that asserts completion in the combined set of columns. |
| areAnyComplete | Creates a constraint that asserts any completion in the combined set of columns. |
| isContainedIn | Asserts that every non-null value in a column is contained in a set of predefined values. |
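
The completeness, uniqueness, distinctness, and unique value ratio metrics are easy to confuse, so a small worked example may help. This is a minimal sketch in plain Python, not Amorphic's implementation. It assumes the definitions used by the open-source Deequ library, whose constraint names these match: uniqueness is the number of values occurring exactly once divided by the row count, distinctness is the number of distinct values divided by the row count, and unique value ratio is the number of values occurring exactly once divided by the number of distinct values. The function names are illustrative, and the inputs are assumed to be non-empty.

```python
from collections import Counter


def completeness(values):
    """Fraction of rows whose value is not None."""
    return sum(v is not None for v in values) / len(values)


def uniqueness(values):
    """Values occurring exactly once, divided by the number of rows."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(values)


def distinctness(values):
    """Number of distinct values, divided by the number of rows."""
    return len(Counter(values)) / len(values)


def unique_value_ratio(values):
    """Values occurring exactly once, divided by the number of distinct values."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Sample column without nulls: "a" appears twice, "b" and "c" once each.
column = ["a", "a", "b", "c"]
print(uniqueness(column))          # 0.5  (2 of 4 rows hold a value seen once)
print(distinctness(column))        # 0.75 (3 distinct values over 4 rows)
print(unique_value_ratio(column))  # 2 of the 3 distinct values occur once

# Sample column with one null out of four rows.
print(completeness(["x", None, "y", "z"]))  # 0.75
```

Under these assumed definitions, a constraint such as isUnique corresponds to a uniqueness of 1.0, and isComplete corresponds to a completeness of 1.0.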

Data Quality Check Use Case

A retail company has a large database of customer information, including name, address, email, and purchase history. Before running any data analysis or machine learning algorithms on this data, the company wants to ensure the quality of the data by checking for errors and inconsistencies.

To do this, the company sets up a data quality check in Amorphic, with constraints such as:

  - The email column must contain a valid email address format.
  - The address column must contain a valid postal code.
  - The purchase history column must contain only positive numbers.

The company runs the data quality check, which reads the entire database and applies these constraints to each record. If any constraint fails, the data quality check execution is considered a failure, and the report details which constraint failed and for which record.

The company can then use this information to correct the errors in the database and ensure that the data is of high quality before running any further data analysis or machine learning algorithms on it.
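
The pass/fail behaviour in this use case can be sketched in a few lines of plain Python. This is an illustration only and does not use Amorphic's API. The sample records, the field names (`email`, `postal_code`, `purchase_total`), the simplified email pattern, and the 5-digit postal code pattern are all assumptions made for the example.

```python
import re

# Hypothetical sample records; field names and values are assumptions.
records = [
    {"email": "ana@example.com", "postal_code": "94105", "purchase_total": 120.5},
    {"email": "not-an-email", "postal_code": "10001", "purchase_total": 42.0},
    {"email": "li@example.org", "postal_code": "ABCDE", "purchase_total": -5.0},
]

EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # deliberately simplified
POSTAL_CODE = re.compile(r"\d{5}")  # assumes 5-digit US ZIP codes

# Each constraint is a per-record rule that returns True when the record passes.
constraints = {
    "email has a valid format": lambda r: EMAIL.fullmatch(r["email"]) is not None,
    "address has a valid postal code": lambda r: POSTAL_CODE.fullmatch(r["postal_code"]) is not None,
    "purchase value is positive": lambda r: r["purchase_total"] > 0,
}


def run_check(rows):
    """Return (passed, failures); failures maps constraint name to failing row indexes."""
    failures = {}
    for name, rule in constraints.items():
        bad = [i for i, row in enumerate(rows) if not rule(row)]
        if bad:
            failures[name] = bad
    # The whole check fails as soon as any single constraint has a failing record.
    return len(failures) == 0, failures


passed, failures = run_check(records)
print("check passed:", passed)
for name, rows in failures.items():
    print(f"failed: {name} (record indexes {rows})")
```

With these sample records, the check fails because the second record has a malformed email and the third has both an invalid postal code and a negative purchase value.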