sliceguard API

class sliceguard.sliceguard.SliceGuard

The main class for detecting issues in your data.

explain(df: DataFrame, precomputed_embeddings: Dict[str, array] = {}, max_display: int = 20) Any

Generate SHAP values for explaining the model’s predictions on the given dataframe.

Parameters:
  • df – A pandas DataFrame containing the data to be explained.

  • precomputed_embeddings – Optional. A dictionary of precomputed embeddings for the data. This should be in the format {"column_name": embedding_array}.

  • max_display – The maximum number of features to display in SHAP plots. Defaults to 20.

Returns:

The SHAP values corresponding to the features in the dataframe.

find_issues(data: DataFrame, features: List[str], y: str = None, y_pred: str = None, metric: Callable = None, min_support: int = None, min_drop: float = None, n_slices: int = None, criterion: Literal['drop', 'support', 'drop*support'] = None, metric_mode: Literal['min', 'max'] = None, drop_reference: Literal['overall', 'parent'] = 'overall', remove_outliers: bool = False, feature_types: Dict[str, Literal['raw', 'nominal', 'ordinal', 'numerical', 'embedding']] = {}, feature_orders: Dict[str, list] = {}, disable_scaling: List[str] = [], precomputed_embeddings: Dict[str, array] = {}, embedding_models: Dict[str, str] = {}, embedding_weights: Dict[str, float] = {}, hf_auth_token: str | None = None, hf_num_proc: int | None = None, hf_batch_size: int = 1, automl_task: Literal['classification', 'regression'] = 'classification', automl_split_key: str | None = None, automl_train_split: str | None = None, automl_time_budget: float = 20.0, automl_use_full_embeddings: bool = False, automl_hf_model: str | None = None, automl_hf_model_architecture: str | None = None, automl_hf_model_output_dir: str = './hf_model', automl_hf_model_epochs: int = 5) List[dict]

Find slices that are classified badly by your model.

Parameters:
  • data – A pandas dataframe containing your data.

  • features – A list of feature column names sliceguard should use for identifying problematic data clusters.

  • y – Name of the dataframe column containing your ground-truth labels.

  • y_pred – Name of the dataframe column containing your model’s predictions.

  • metric – A callable metric function of the form metric(y_true, y_pred), following scikit-learn conventions.

  • min_support – Minimum support for a cluster to be listed as an issue.

  • min_drop – Minimum metric drop for a cluster to be listed as an issue.

  • n_slices – Number of problematic clusters find_issues should return after sorting them by a criterion specified by the “criterion” parameter.

  • criterion – Criterion by which the slices are sorted when n_slices is used. One of drop, support or drop*support.

  • metric_mode – Optimization goal for your metric. Use max for metrics where higher is better, such as accuracy, and min for metrics where lower is better, such as regression error.

  • drop_reference – Reference value for calculating the drop of a cluster. The default, “overall”, calculates the difference to the overall metric. “parent” calculates the drop relative to each cluster’s parent cluster. Use “parent” to get more diverse results, e.g., problematic clusters in each class of an image classification task instead of focusing on the most difficult class.

  • remove_outliers – Filter metric outliers in identified clusters. Especially useful if the metric is unbounded, because outliers can then heavily distort a cluster’s overall metric. Significantly more computationally expensive.

  • feature_types – Specify the types of your features if sliceguard doesn’t detect them properly. Can be “nominal”, “ordinal”, “numerical” for scalar values. Can be “raw” for filepaths to unstructured data. Can be “embedding” for embedding vectors.

  • feature_orders – Specify the order of ordinal feature values that should be used for encoding. This is required for EVERY ordinal feature that is not specified by pandas categorical ordered datatypes.

  • disable_scaling – List of features that should not be scaled for automl and clustering. Right now only applied to numerical features.

  • precomputed_embeddings – Supply precomputed embeddings for raw columns. Form should be precomputed_embeddings={"image": image_embeddings}. This is especially useful if you run repeated checks on your data and you want to compute embeddings only once.

  • embedding_models – Supply huggingface model identifiers or locally saved models used for computing embeddings on specific columns. Format should be embedding_models={"image": "google/vit-base-patch16-224"}.

  • embedding_weights – Specify how much each computed embedding is weighted in the cluster search. Useful to lower the influence of an embedding by setting the parameter lower than 1.0.

  • hf_auth_token – The authentication token used to download embedding models from the huggingface hub.

  • hf_num_proc – Number of processes used in embedding computation.

  • hf_batch_size – Batch size used for embedding computation.

  • automl_task – The task specification for training a model. Has to be one of classification or regression. Used when only supplying labels.

  • automl_split_key – Name of column used for splitting the data when sliceguard trains a model.

  • automl_train_split – The value used for marking the train split when sliceguard trains a model. If supplied, the rest of the data will be used as the validation set. If not supplied, cross-validation is used.

  • automl_time_budget – The time budget used by sliceguard for training a model.

  • automl_use_full_embeddings – Whether to use the raw embeddings instead of the pre-reduced ones when training a model. Can potentially improve performance.

  • automl_hf_model – A pre-trained model that can be used instead of the default xgboost model.

  • automl_hf_model_architecture – Model architecture used to train a model on “features”. Right now supports only image classification.

  • automl_hf_model_output_dir – Output directory for training deep learning models via the huggingface transformers library.

  • automl_hf_model_epochs – When fine-tuning a huggingface model, the number of epochs to train for.

Return type:

List of issues, represented as python dicts.
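
The metric, metric_mode and min_drop parameters interact, so a small sketch may help. It shows a custom metric in the metric(y_true, y_pred) form and one plausible way a "drop" can be defined for each metric_mode. The sign convention in metric_drop, the column names and the threshold values are assumptions made for illustration; they are not taken from the sliceguard source.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference. Lower is better, so pair it with metric_mode="min"."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute a metric on an empty slice")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def metric_drop(slice_value: float, reference_value: float, metric_mode: str) -> float:
    """Assumed convention: a positive result means the slice is worse than the reference."""
    if metric_mode == "max":
        return reference_value - slice_value
    if metric_mode == "min":
        return slice_value - reference_value
    raise ValueError("metric_mode must be 'min' or 'max'")


# Keyword arguments named after the find_issues signature.
# The column names and thresholds are hypothetical.
find_issues_kwargs = {
    "features": ["age", "image"],
    "y": "label",
    "y_pred": "prediction",
    "metric": mean_absolute_error,
    "metric_mode": "min",
    "min_support": 10,
    "min_drop": 0.5,
    "drop_reference": "overall",
}

# Toy numbers: the last two samples form a slice with larger errors.
overall = mean_absolute_error([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 5.0, 8.0])
in_slice = mean_absolute_error([3.0, 4.0], [5.0, 8.0])
print(metric_drop(in_slice, overall, "min"))
```

With a SliceGuard instance sg and a dataframe df, these arguments would be passed as sg.find_issues(df, **find_issues_kwargs).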

fit(df: DataFrame, y: str, features: List[str] | None = None, task: Literal['classification', 'regression'] = 'classification', use_full_embeddings: bool = False, feature_types: Dict[str, Literal['raw', 'nominal', 'ordinal', 'numerical', 'embedding']] = {}, feature_orders: Dict[str, list] = {}, disable_scaling: List[str] = [], split_key: str | None = None, train_split: str | None = None, time_budget: float = 20.0, precomputed_embeddings: Dict[str, array] = {}, embedding_models: Dict[str, str] = {}, embedding_weights: Dict[str, float] = {}, hf_auth_token: str | None = None, hf_num_proc: int | None = None, hf_batch_size: int = 1) List[dict]

Fit the model to the provided dataframe and identify problematic data clusters.

Parameters:
  • df – A pandas DataFrame containing your data.

  • y – Name of the DataFrame column containing your ground-truth labels.

  • features – Optional. A list of feature column names the model should use for identifying problematic data clusters.

  • task – The task specification for training a model, either ‘classification’ or ‘regression’.

  • use_full_embeddings – Whether to use the raw embeddings instead of pre-reduced ones when training the model, which can potentially improve performance.

  • feature_types – Specify the types of your features if they are not detected properly. Types can be ‘raw’, ‘nominal’, ‘ordinal’, ‘numerical’, or ‘embedding’.

  • feature_orders – Specify the order of ordinal feature values for encoding. This is required for every ordinal feature not specified by pandas categorical ordered datatypes.

  • disable_scaling – List of features that should not be scaled for automl and clustering, typically applied only to numerical features.

  • split_key – Name of column used for splitting the data when training the model.

  • train_split – The value marking the train split when training the model. If supplied, the rest of the data will be used as the validation set. If not, crossvalidation is used.

  • time_budget – The time budget used for training the model.

  • precomputed_embeddings – Provide precomputed embeddings for raw columns, useful for repeated checks on your data to avoid re-computation.

  • embedding_models – Supply Hugging Face model identifiers or locally saved models for computing embeddings on specific columns.

  • embedding_weights – Specify the weighting of each computed embedding in the cluster search. Lower values reduce the influence of an embedding.

  • hf_auth_token – Authentication token for downloading models from the Hugging Face hub.

  • hf_num_proc – Number of processes used in embedding computation.

  • hf_batch_size – Batch size used for embedding computation.

Returns:

A list of identified issues, represented as python dicts.
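
The feature_orders argument maps each ordinal column to its values listed from lowest to highest. The sketch below shows what such an order implies when values are encoded by their position in the list. This is an illustration under that assumption; the encoder sliceguard uses internally may differ, and the column name shirt_size is hypothetical.

```python
from typing import Dict, Hashable, List, Sequence

# One entry per ordinal column, values listed from lowest to highest.
feature_orders: Dict[str, List[str]] = {"shirt_size": ["S", "M", "L", "XL"]}


def ordinal_codes(values: Sequence[Hashable], order: Sequence[Hashable]) -> List[int]:
    """Encode each value by its position in the given order (assumed encoding)."""
    position = {value: index for index, value in enumerate(order)}
    unknown = [value for value in values if value not in position]
    if unknown:
        raise ValueError(f"values missing from the order: {unknown}")
    return [position[value] for value in values]


print(ordinal_codes(["M", "S", "XL"], feature_orders["shirt_size"]))
```

A value missing from the order cannot be placed on the scale, which is why the sketch raises an error instead of guessing a position.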

predict(df: DataFrame, precomputed_embeddings: Dict[str, array] = {}) array

Perform predictions on a given dataframe using the trained model.

Parameters:
  • df – A pandas DataFrame containing the data for prediction.

  • precomputed_embeddings – Optional. A dictionary of precomputed embeddings for the data. This should be in the format {"column_name": embedding_array}.

Returns:

An array of predictions generated by the model.

predict_proba(df: DataFrame, precomputed_embeddings: Dict[str, array] = {}) array

Compute probability estimates for each class on the given dataframe using the trained model.

Parameters:
  • df – A pandas DataFrame containing the data for which probability estimates are needed.

  • precomputed_embeddings – Optional. A dictionary of precomputed embeddings for the data. This should be in the format {"column_name": embedding_array}.

Returns:

An array of probability estimates for each class.
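
To relate predict_proba to predict, the sketch below turns per-class probabilities into labels by picking the most likely class in each row. It assumes one row per sample and one column per class, with columns in the same order as a class list supplied by the caller. That layout and the helper name labels_from_probabilities are assumptions, not documented sliceguard behavior.

```python
from typing import Hashable, List, Sequence


def labels_from_probabilities(
    probabilities: Sequence[Sequence[float]], classes: Sequence[Hashable]
) -> List[Hashable]:
    """Pick the most probable class per row (assumes columns follow `classes`)."""
    labels = []
    for row in probabilities:
        row = list(row)
        if len(row) != len(classes):
            raise ValueError("each row needs one probability per class")
        # Index of the largest probability; ties resolve to the first maximum.
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best])
    return labels


print(labels_from_probabilities([[0.1, 0.9], [0.7, 0.3]], ["cat", "dog"]))
```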

report(spotlight_dtype: Dict[str, Any] = {}, issue_portion: int | float | None = None, non_issue_portion: int | float | None = None, host: str = '127.0.0.1', port: int = 'auto', no_browser: bool = False) Tuple[DataFrame, List[DataIssue], Dict[str, Any]]

Create an interactive report on the found issues in spotlight.

Parameters:
  • spotlight_dtype – Define a datatype mapping for the interactive spotlight report. Will be passed to the dtypes parameter of spotlight.show. Form is spotlight_dtype={"image": spotlight.Image}.

  • issue_portion – The absolute number or relative share of samples belonging to an issue that are shown in the report (for downsampling).

  • non_issue_portion – The absolute number or relative share of samples not belonging to an issue that are shown in the report (for downsampling).

  • host – The host spotlight should be started on. Default is 127.0.0.1.

  • port – The port spotlight should be started on. Default is “auto”.

  • no_browser – Do not start spotlight but just return the dataframe and issues. Useful for programmatic issue evaluation.

Return type:

Tuple in the format (enriched dataframe, list of spotlight DataIssues, spotlight datatype mapping dict, spotlight layout).
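
The issue_portion and non_issue_portion parameters accept either an int or a float. One plausible reading, sketched below, is that an int is an absolute sample count and a float is a fraction of the available samples. The helper resolve_portion is hypothetical, and treating None as "keep everything" is an assumption; sliceguard may resolve these values differently.

```python
from typing import Optional, Union


def resolve_portion(portion: Optional[Union[int, float]], n_available: int) -> int:
    """Translate a portion setting into a sample count (assumed semantics)."""
    if portion is None:
        return n_available  # assumption: no downsampling when unset
    if isinstance(portion, bool):
        raise TypeError("portion must be an int or a float, not a bool")
    if isinstance(portion, int):
        if portion < 0:
            raise ValueError("an absolute portion must be non-negative")
        return min(portion, n_available)  # never ask for more than exists
    if not 0.0 <= portion <= 1.0:
        raise ValueError("a relative portion must lie between 0.0 and 1.0")
    return int(round(portion * n_available))


print(resolve_portion(50, 200), resolve_portion(0.25, 200))
```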