Validates the input statistics against the provided input schema.
tfdv.validate_statistics(
statistics: statistics_pb2.DatasetFeatureStatisticsList,
schema: schema_pb2.Schema,
environment: Optional[Text] = None,
previous_statistics: Optional[statistics_pb2.DatasetFeatureStatisticsList] = None,
serving_statistics: Optional[statistics_pb2.DatasetFeatureStatisticsList] = None,
custom_validation_config: Optional[custom_validation_config_pb2.CustomValidationConfig] = None
) -> anomalies_pb2.Anomalies
This method validates the statistics against the schema. If an optional
environment is specified, the schema is filtered using the environment and
the statistics are validated against the filtered schema. The optional
previous_statistics and serving_statistics are the statistics computed over
the control data for drift and skew detection, respectively. If drift or
skew detection is conducted, the raw skew/drift measurements for each
compared feature are recorded in the drift_skew_info field of the returned
Anomalies proto.
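For example, a minimal end-to-end sketch (the CSV paths, the 'payment_type'
feature name, and the comparator thresholds are illustrative placeholders,
not part of this API):

import tensorflow_data_validation as tfdv

# Compute statistics over the current, previous, and serving data.
# The CSV paths are placeholders.
train_stats = tfdv.generate_statistics_from_csv('train_day2.csv')
previous_stats = tfdv.generate_statistics_from_csv('train_day1.csv')
serving_stats = tfdv.generate_statistics_from_csv('serving.csv')

# Infer a schema from the current statistics and configure drift and
# skew comparators on a (hypothetical) categorical feature.
schema = tfdv.infer_schema(train_stats)
tfdv.get_feature(schema, 'payment_type').drift_comparator.infinity_norm.threshold = 0.01
tfdv.get_feature(schema, 'payment_type').skew_comparator.infinity_norm.threshold = 0.01

# Drift is checked against previous_statistics and skew against
# serving_statistics; raw measurements land in anomalies.drift_skew_info.
anomalies = tfdv.validate_statistics(
    statistics=train_stats,
    schema=schema,
    previous_statistics=previous_stats,
    serving_statistics=serving_stats)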
Args:
  statistics: A DatasetFeatureStatisticsList protocol buffer denoting the
    statistics computed over the current data. Validation is currently
    supported only for lists with a single DatasetFeatureStatistics proto,
    or for lists with multiple DatasetFeatureStatistics protos corresponding
    to data slices that include the default slice (i.e., the slice with all
    examples). If a list with multiple DatasetFeatureStatistics protos is
    used, this function will validate the statistics corresponding to the
    default slice.
  schema: A Schema protocol buffer.
    Note that TFDV does not currently support validation of the following
    messages/fields in the Schema protocol buffer:
    - FeaturePresenceWithinGroup
    - Schema-level FloatDomain and IntDomain (validation is supported for
      Feature-level FloatDomain and IntDomain)
  environment: An optional string denoting the validation environment.
    Must be one of the default environments specified in the schema.
    By default, validation assumes that all examples in a pipeline adhere
    to a single schema. In some cases, introducing slight schema variations
    is necessary; for instance, features used as labels are required during
    training (and should be validated) but are missing during serving.
    Environments can be used to express such requirements. For example,
    assume a feature named 'LABEL' is required for training but is expected
    to be missing from serving. This can be expressed by defining two
    distinct environments in the schema (["SERVING", "TRAINING"]) and
    associating 'LABEL' only with the "TRAINING" environment. See the
    sketch after this argument list for one way to set this up.
  previous_statistics: An optional DatasetFeatureStatisticsList protocol
    buffer denoting the statistics computed over earlier data (for example,
    the previous day's data). If provided, the validate_statistics method
    detects whether there is drift between the current data and the
    previous data. Drift detection is configured by specifying a
    drift_comparator in the schema.
  serving_statistics: An optional DatasetFeatureStatisticsList protocol
    buffer denoting the statistics computed over the serving data. If
    provided, the validate_statistics method identifies whether there is
    distribution skew between the current data and the serving data. Skew
    detection is configured by specifying a skew_comparator in the schema.
  custom_validation_config: An optional config that can be used to specify
    custom validations to perform. For single-feature validations, the test
    feature comes from statistics and is mapped to feature in the SQL query.
    For feature-pair validations, the test feature comes from statistics and
    is mapped to feature_test in the SQL query, while the base feature comes
    from previous_statistics and is mapped to feature_base in the SQL query.
    See the sketch at the end of this reference for an example
    configuration.
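As a sketch of the environment workflow described under environment above
(the 'LABEL' feature name is hypothetical, and schema and serving_stats are
assumed to come from the earlier example):

# Mark all features as belonging to both environments by default.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

# 'LABEL' is a hypothetical feature that is required for training but
# is expected to be missing from serving data.
tfdv.get_feature(schema, 'LABEL').not_in_environment.append('SERVING')

# The schema is filtered by the SERVING environment, so the absent
# 'LABEL' feature is not reported as an anomaly.
serving_anomalies = tfdv.validate_statistics(
    serving_stats, schema, environment='SERVING')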
Returns:
  An Anomalies protocol buffer.
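The result can be rendered with tfdv.display_anomalies in a notebook, or
inspected directly; the sketch below assumes the anomalies object from the
earlier example and reads the anomaly_info and drift_skew_info fields of
the Anomalies proto:

# Render a human-readable summary in a notebook environment.
tfdv.display_anomalies(anomalies)

# Or walk the proto directly: anomaly_info maps feature names to
# per-feature anomaly details, and drift_skew_info carries the raw
# drift/skew measurements for each compared feature.
for feature_name, info in anomalies.anomaly_info.items():
    print(feature_name, info.short_description)
for drift_skew in anomalies.drift_skew_info:
    print(drift_skew.path, drift_skew.drift_measurements,
          drift_skew.skew_measurements)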
Raises:
  TypeError: If any of the input arguments is not of the expected type.
  ValueError: If the input statistics proto contains multiple datasets, none
    of which corresponds to the default slice.
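Finally, a sketch of a custom validation configuration, referenced in the
custom_validation_config description above. The proto text format, the
import path, and the 'age' feature are assumptions based on the
CustomValidationConfig proto and should be adapted to your TFDV
installation; train_stats and schema are assumed from the earlier example:

from google.protobuf import text_format
# Import path is an assumption; adjust to wherever CustomValidationConfig
# is exposed in your TFDV version.
from tensorflow_data_validation.anomalies.proto import custom_validation_config_pb2

# A single-feature validation: the tested feature's statistics are bound
# to feature in the SQL expression.
custom_config = text_format.Parse(
    """
    feature_validations {
      feature_path { step: 'age' }
      validations {
        sql_expression: 'feature.num_stats.min >= 0'
        severity: ERROR
        description: 'age must be non-negative.'
      }
    }
    """,
    custom_validation_config_pb2.CustomValidationConfig())

anomalies = tfdv.validate_statistics(
    statistics=train_stats,
    schema=schema,
    custom_validation_config=custom_config)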