XGBTreeHFD

TreeHFD decomposition of an xgboost model.

class treehfd.XGBTreeHFD(xgb_model: XGBModel)

XGBTreeHFD is the TreeHFD decomposition of an xgboost tree ensemble. It is defined as the Hoeffding functional decomposition of the target tree ensemble, where the hierarchical orthogonality constraints are discretized over the Cartesian tree partitions. For each tree, the TreeHFD algorithm solves a least squares problem to find the coefficients defining the functional components of the decomposition, which are all piecewise constant on the Cartesian tree partitions.
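Written out up to second order, the decomposition has the shape sketched below. The notation is chosen here for illustration and does not come from the library: f is the tree ensemble, p the number of variables, \eta_j and \eta_{jk} the main effects and interactions, and \mathcal{I} the set of variable pairs in interaction_list. The orthogonality conditions are the usual hierarchical form, given as an assumption about what TreeHFD discretizes over the tree partitions.

```latex
f(x) \approx \eta_0 + \sum_{j=1}^{p} \eta_j(x_j)
      + \sum_{(j,k) \in \mathcal{I}} \eta_{jk}(x_j, x_k),
\qquad
\mathbb{E}\big[\eta_j(X_j)\big] = 0,
\qquad
\mathbb{E}\big[\eta_{jk}(X_j, X_k)\, g(X_j)\big]
  = \mathbb{E}\big[\eta_{jk}(X_j, X_k)\, g(X_k)\big] = 0
\quad \text{for all functions } g .
```

In this reading, \eta_0 corresponds to the eta0 attribute, and the values of \eta_j and \eta_{jk} at new points correspond to the y_main and y_order2 arrays returned by predict.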

Parameters:

xgb_model (xgb.sklearn.XGBModel) – The xgboost model for regression or classification to be decomposed with TreeHFD.

xgb_model

The input xgboost model.

Type:

xgb.sklearn.XGBModel

config

The config of the xgboost model with all settings and parameters, from xgb_model.get_booster().save_config().

Type:

dict

max_depth

Tree depth parameter of xgb_model (must be greater than 0).

Type:

int

n_estimators

Number of trees of xgb_model (must be greater than 0). For multiclass classification, there are n_estimators trees for each logit.

Type:

int

base_score

Base scores of xgb_model.

Type:

np.ndarray

num_feature

The number of variables of the data used to fit xgb_model.

Type:

int

num_parallel_tree

The number of trees in random forests (one for gradient boosting models).

Type:

int

num_outputs

The number of model outputs: one for regression and binary classification, and the number of classes for multiclass classification.

Type:

int

xgb_table

The table with the tree structures, obtained from the xgboost call xgb_model.get_booster().trees_to_dataframe().

Type:

pd.core.frame.DataFrame

interaction_order

Set to 1 to fit only main effects, or to 2 to also include second-order interactions in the TreeHFD decomposition.

Type:

int, default=2

interaction_list

The list of interactions, defined as variable pairs.

Type:

np.ndarray, default=np.empty((0, 0))

depth_variable

Variables are selected at the first depth_variable levels of the tree for the components of the decomposition. Set to max_depth by default. Reducing depth_variable strongly speeds up computations for deep trees.

Type:

int, default=max_depth

treehfd_list

The list of TreeHFD decompositions, one per tree. For multiclass classification, all trees are stacked together following the index of xgb_table: for gradient boosting models, the trees of the same boosting round are stored next to each other, whereas for random forests, the trees of the same logit are stored next to each other.

Type:

list, default=[]

eta0

Intercept of the TreeHFD decomposition of xgb_model. For multiclass classification, eta0 is an array with the intercept of each logit decomposition.

Type:

float | np.ndarray, default=0

Examples

>>> import numpy as np
>>> import xgboost as xgb
>>> from treehfd import XGBTreeHFD
>>> from treehfd.validation import sample_data
>>> np.random.seed(11)
>>> X, y = sample_data(nsample=100)
>>> xgb_model = xgb.XGBRegressor()
>>> xgb_model = xgb_model.fit(X, y)
>>> treehfd_model = XGBTreeHFD(xgb_model)
>>> treehfd_model.fit(X, interaction_order=2)
>>> X_new, y_new = sample_data(nsample=3)
>>> y_main, y_order2 = treehfd_model.predict(X_new)
>>> print(f'TreeHFD intercept: {treehfd_model.eta0}')
>>> print(f'TreeHFD main effect predictions: {y_main}')
>>> print(f'TreeHFD interaction predictions: {y_order2}')
>>> interactions = treehfd_model.interaction_list
>>> print(f'TreeHFD interactions: {interactions}')

treehfd.XGBTreeHFD.fit(self, X: ndarray, interaction_order: int = 2, interaction_list: ndarray | None = None, depth_variable: int | None = None, verbose: bool = True) -> None

Fit TreeHFD decomposition of the provided xgboost model.

Parameters:
  • X (np.ndarray) – The input data used to train the xgboost model.

  • interaction_order (int, default=2) – Set to 1 to fit only main effects, or to 2 to also include second-order interactions in the TreeHFD decomposition.

  • interaction_list (np.ndarray, default=None) – Predefined list of second-order interactions to be estimated in the decomposition. Each row defines an interaction with two integers for the variable indices. When None (the default), interactions are automatically extracted from tree paths.

  • depth_variable (int, default=None) – Variables are selected at the first depth_variable levels of the tree for the components of the decomposition. Default is None, and all variables are selected.

  • verbose (bool, default=True) – Set to False to deactivate the console display of computation progress (% of trees).
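As an illustration of the interaction_list format described above, the sketch below builds a predefined list for a hypothetical dataset with four variables. The chosen pairs are made up, and writing each pair with the smaller index first is a precaution taken here, not a documented requirement. The fit call is shown as a comment because it needs a fitted xgboost model, as in the Examples section.

```python
import numpy as np

# Hypothetical choice of interactions for a dataset with four variables:
# each row holds the two variable indices of one second-order interaction.
interaction_list = np.array([[0, 1], [0, 3], [2, 3]])

# Sanity checks on the layout: one row per interaction, two integer indices.
assert interaction_list.ndim == 2 and interaction_list.shape[1] == 2
assert np.issubdtype(interaction_list.dtype, np.integer)

# With treehfd_model built as in the Examples section, the list would be
# passed to fit as follows (not run here):
# treehfd_model.fit(X, interaction_order=2, interaction_list=interaction_list)
```

Restricting the decomposition to a few known pairs like this may be useful when only specific interactions are of interest.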

treehfd.XGBTreeHFD.predict(self, X_new: ndarray, verbose: bool = True) -> tuple

Predict TreeHFD components for new input data.

Parameters:
  • X_new (np.ndarray) – New input data where TreeHFD predictions are computed.

  • verbose (bool, default=True) – Set to False to deactivate the console display of computation progress (% of trees).

Returns:

y_main (np.ndarray)

Array for the predictions of main effects. For multiclass classification, y_main is an array of order three, with the prediction matrix for each label (axis 0: data samples, axis 1: labels, axis 2: input variables).

y_order2 (np.ndarray)

Array for predictions of second-order interactions (columns are ordered according to interaction_list). For multiclass classification, y_order2 is an array of order three, with the prediction matrix for each label.

Return type:

tuple
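The two returned arrays can be summed to recombine the components. The sketch below uses made-up arrays with the shapes described above for a single-output model (3 samples, 2 input variables, 1 interaction). It assumes that the intercept plus the sum of all components approximates the raw prediction of the xgboost model; that relationship is an assumption of this sketch, not a guarantee stated here.

```python
import numpy as np

# Made-up values standing in for treehfd_model.eta0 and for the output of
# treehfd_model.predict(X_new) on a single-output model.
eta0 = 0.5
y_main = np.array([[0.1, -0.2], [0.3, 0.0], [-0.1, 0.4]])  # samples x variables
y_order2 = np.array([[0.05], [-0.05], [0.0]])              # samples x interactions

# Sum over the last axis, which indexes the components (variables or
# interactions), then add the intercept.
y_hfd = eta0 + y_main.sum(axis=-1) + y_order2.sum(axis=-1)
print(y_hfd)
```

Summing over the last axis also fits the multiclass layout of y_main described above, where axis 2 indexes the input variables, provided eta0 is an array with one intercept per label.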

treehfd.XGBTreeHFD._tree_predict(self, X: ndarray) -> ndarray

Compute the original predictions of each tree.

Parameters:

X (np.ndarray) – The input data where tree predictions are computed.

Returns:

tree_predictions – Array with the predictions of each tree of the ensemble for X, where each column stores the predictions of a tree. For multiclass classification, all trees are stacked together following the index of xgb_table (see treehfd_list doc).
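To make the column layout concrete, here is a sketch with a made-up tree_predictions array for a three-tree ensemble. It assumes a gradient boosting regression model with a squared-error objective, where the raw ensemble prediction is the base score plus the sum of the tree outputs; the values and the base score are hypothetical.

```python
import numpy as np

# Made-up stand-in for the output of _tree_predict:
# 2 samples (rows), 3 trees (columns).
tree_predictions = np.array([[0.2, -0.1, 0.05],
                             [0.4, 0.1, -0.05]])
base_score = 0.5  # hypothetical base score of the model

# Each column is one tree, so summing across columns aggregates the ensemble.
ensemble_prediction = base_score + tree_predictions.sum(axis=1)
print(ensemble_prediction)
```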

Return type:

np.ndarray

treehfd.XGBTreeHFD._get_output_idx(self, tree_idx: int) -> int

Get output index from tree_idx.

Parameters:

tree_idx (int) – The tree index from xgb_table.

Returns:

output_idx – Index of the output modeled by the tree of index tree_idx. Note that output_idx is always 0 for regression and binary classification.
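The stacking convention described for treehfd_list suggests how a tree index could map to an output index. The helper below is hypothetical and is not the library's implementation. It encodes the two orderings under stated assumptions: gradient boosting interleaves the outputs within each boosting round, and a random forest fitted in a single round groups its num_parallel_tree trees by output.

```python
def output_index(tree_idx: int, num_outputs: int, num_parallel_tree: int = 1) -> int:
    """Hypothetical mapping from a tree index to the output it models."""
    if num_outputs == 1:
        # Regression and binary classification have a single output.
        return 0
    if num_parallel_tree == 1:
        # Gradient boosting: trees of one round are adjacent, one per output.
        return tree_idx % num_outputs
    # Random forest (single round assumed): trees of one output are adjacent.
    return tree_idx // num_parallel_tree


# Three-class gradient boosting: tree 4 is the second tree of round 2.
print(output_index(4, num_outputs=3))
# Three-class random forest with 4 trees per class: tree 5 belongs to class 1.
print(output_index(5, num_outputs=3, num_parallel_tree=4))
```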

Return type:

int