Validates the input statistics against the provided input schema.
tfdv.validate_statistics(
statistics: statistics_pb2.DatasetFeatureStatisticsList,
schema: schema_pb2.Schema,
environment: Optional[Text] = None,
previous_statistics: Optional[statistics_pb2.DatasetFeatureStatisticsList] = None,
serving_statistics: Optional[statistics_pb2.DatasetFeatureStatisticsList] = None,
custom_validation_config: Optional[custom_validation_config_pb2.CustomValidationConfig] = None
) -> anomalies_pb2.Anomalies
This method validates the statistics against the schema. If an optional
environment is specified, the schema is filtered using the environment and
the statistics are validated against the filtered schema. The optional
previous_statistics and serving_statistics are the statistics computed over
the control data for drift and skew detection, respectively. If drift or
skew detection is conducted, the raw skew/drift measurements for each
compared feature are recorded in the drift_skew_info field of the returned
Anomalies proto.
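For example, a minimal end-to-end sketch (the CSV paths, the 'payment_type'
feature name, and the comparator thresholds are illustrative placeholders,
not part of this API):

import tensorflow_data_validation as tfdv

# Compute statistics over the current, previous, and serving data.
# The CSV paths are placeholders.
train_stats = tfdv.generate_statistics_from_csv('train_day2.csv')
previous_stats = tfdv.generate_statistics_from_csv('train_day1.csv')
serving_stats = tfdv.generate_statistics_from_csv('serving.csv')

# Infer a schema from the current statistics and configure drift and
# skew comparators on a (hypothetical) categorical feature.
schema = tfdv.infer_schema(train_stats)
tfdv.get_feature(schema, 'payment_type').drift_comparator.infinity_norm.threshold = 0.01
tfdv.get_feature(schema, 'payment_type').skew_comparator.infinity_norm.threshold = 0.01

# Drift is checked against previous_statistics and skew against
# serving_statistics; raw measurements land in anomalies.drift_skew_info.
anomalies = tfdv.validate_statistics(
    statistics=train_stats,
    schema=schema,
    previous_statistics=previous_stats,
    serving_statistics=serving_stats)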
Args:
  statistics: A DatasetFeatureStatisticsList protocol buffer denoting the
    statistics computed over the current data. Validation is currently
    supported only for lists with a single DatasetFeatureStatistics proto,
    or for lists with multiple DatasetFeatureStatistics protos corresponding
    to data slices that include the default slice (i.e., the slice with all
    examples). If a list with multiple DatasetFeatureStatistics protos is
    used, this function will validate the statistics corresponding to the
    default slice.
  schema: A Schema protocol buffer.
    Note that TFDV does not currently support validation of the following
    messages/fields in the Schema protocol buffer:
    - FeaturePresenceWithinGroup
    - Schema-level FloatDomain and IntDomain (validation is supported for
      Feature-level FloatDomain and IntDomain)
  environment: An optional string denoting the validation environment.
    Must be one of the default environments specified in the schema.
    By default, validation assumes that all examples in a pipeline adhere
    to a single schema. In some cases, introducing slight schema variations
    is necessary; for instance, features used as labels are required during
    training (and should be validated) but are missing during serving.
    Environments can be used to express such requirements. For example,
    assume a feature named 'LABEL' is required for training but is expected
    to be missing from serving. This can be expressed by defining two
    distinct environments in the schema (["SERVING", "TRAINING"]) and
    associating 'LABEL' only with the "TRAINING" environment. See the
    sketch after this argument list for one way to set this up.
  previous_statistics: An optional DatasetFeatureStatisticsList protocol
    buffer denoting the statistics computed over earlier data (for example,
    the previous day's data). If provided, the validate_statistics method
    detects whether there is drift between the current data and the
    previous data. Drift detection is configured by specifying a
    drift_comparator in the schema.
  serving_statistics: An optional DatasetFeatureStatisticsList protocol
    buffer denoting the statistics computed over the serving data. If
    provided, the validate_statistics method identifies whether there is
    distribution skew between the current data and the serving data. Skew
    detection is configured by specifying a skew_comparator in the schema.
  custom_validation_config: An optional config that can be used to specify
    custom validations to perform. For single-feature validations, the test
    feature comes from statistics and is mapped to feature in the SQL query.
    For feature-pair validations, the test feature comes from statistics and
    is mapped to feature_test in the SQL query, while the base feature comes
    from previous_statistics and is mapped to feature_base in the SQL query.
    See the sketch at the end of this reference for an example
    configuration.
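As a sketch of the environment workflow described under environment above
(the 'LABEL' feature name is hypothetical, and schema and serving_stats are
assumed to come from the earlier example):

# Mark all features as belonging to both environments by default.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

# 'LABEL' is a hypothetical feature that is required for training but
# is expected to be missing from serving data.
tfdv.get_feature(schema, 'LABEL').not_in_environment.append('SERVING')

# The schema is filtered by the SERVING environment, so the absent
# 'LABEL' feature is not reported as an anomaly.
serving_anomalies = tfdv.validate_statistics(
    serving_stats, schema, environment='SERVING')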
Returns:
  An Anomalies protocol buffer.
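The result can be rendered with tfdv.display_anomalies in a notebook, or
inspected directly; the sketch below assumes the anomalies object from the
earlier example and reads the anomaly_info and drift_skew_info fields of
the Anomalies proto:

# Render a human-readable summary in a notebook environment.
tfdv.display_anomalies(anomalies)

# Or walk the proto directly: anomaly_info maps feature names to
# per-feature anomaly details, and drift_skew_info carries the raw
# drift/skew measurements for each compared feature.
for feature_name, info in anomalies.anomaly_info.items():
    print(feature_name, info.short_description)
for drift_skew in anomalies.drift_skew_info:
    print(drift_skew.path, drift_skew.drift_measurements,
          drift_skew.skew_measurements)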
Raises:
  TypeError: If any of the input arguments is not of the expected type.
  ValueError: If the input statistics proto contains multiple datasets, none
    of which corresponds to the default slice.
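Finally, a sketch of a custom validation configuration, referenced in the
custom_validation_config description above. The proto text format, the
import path, and the 'age' feature are assumptions based on the
CustomValidationConfig proto and should be adapted to your TFDV
installation; train_stats and schema are assumed from the earlier example:

from google.protobuf import text_format
# Import path is an assumption; adjust to wherever CustomValidationConfig
# is exposed in your TFDV version.
from tensorflow_data_validation.anomalies.proto import custom_validation_config_pb2

# A single-feature validation: the tested feature's statistics are bound
# to feature in the SQL expression.
custom_config = text_format.Parse(
    """
    feature_validations {
      feature_path { step: 'age' }
      validations {
        sql_expression: 'feature.num_stats.min >= 0'
        severity: ERROR
        description: 'age must be non-negative.'
      }
    }
    """,
    custom_validation_config_pb2.CustomValidationConfig())

anomalies = tfdv.validate_statistics(
    statistics=train_stats,
    schema=schema,
    custom_validation_config=custom_config)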