XGBTreeHFD

TreeHFD decomposition of an xgboost model.

class treehfd.XGBTreeHFD(xgb_model: XGBModel)

XGBTreeHFD is the TreeHFD decomposition of an xgboost tree ensemble, defined as the Hoeffding functional decomposition of the target tree ensemble, where the hierarchical orthogonality constraints are discretized over the Cartesian tree partitions. For each tree, the TreeHFD algorithm solves a least squares problem to find the coefficients defining the functional components of the decomposition, which are all piecewise constant on the Cartesian tree partitions.
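As a rough illustration of the idea described above, the following sketch decomposes a toy tree defined on a 2 x 2 Cartesian partition into an intercept, two main effects, and one interaction, by stacking the reconstruction equations and the discretized orthogonality constraints into a single least squares system. The cell values `f`, the cell probabilities `p`, and the system layout are hypothetical choices made for this example. It is not the library's implementation, which may formulate and solve the problem differently.

```python
import numpy as np

# Toy tree on two variables with one split each: 2 x 2 Cartesian partition.
# f[i, j] is the tree value in cell (i, j); p[i, j] is the empirical
# probability of that cell (hypothetical numbers, correlated inputs).
f = np.array([[1.0, 3.0],
              [2.0, 8.0]])
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])

# Unknown vector: [eta0, a0, a1, b0, b1, c00, c01, c10, c11]
rows, rhs = [], []

# Reconstruction equations: eta0 + a[i] + b[j] + c[i, j] = f[i, j]
for i in range(2):
    for j in range(2):
        r = np.zeros(9)
        r[0] = 1.0
        r[1 + i] = 1.0
        r[3 + j] = 1.0
        r[5 + 2 * i + j] = 1.0
        rows.append(r)
        rhs.append(f[i, j])

# Main effects are centered under the marginal cell probabilities.
r = np.zeros(9)
r[1:3] = p.sum(axis=1)
rows.append(r)
rhs.append(0.0)
r = np.zeros(9)
r[3:5] = p.sum(axis=0)
rows.append(r)
rhs.append(0.0)

# The interaction is orthogonal to functions of x1 alone (one constraint
# per row of the grid) and of x2 alone (one constraint per column).
for i in range(2):
    r = np.zeros(9)
    r[5 + 2 * i: 7 + 2 * i] = p[i, :]
    rows.append(r)
    rhs.append(0.0)
for j in range(2):
    r = np.zeros(9)
    r[[5 + j, 7 + j]] = p[:, j]
    rows.append(r)
    rhs.append(0.0)

A = np.vstack(rows)
coef, *_ = np.linalg.lstsq(A, np.array(rhs), rcond=None)
eta0, a, b = coef[0], coef[1:3], coef[3:5]
c = coef[5:].reshape(2, 2)

# Sum of all components, evaluated on each cell of the partition.
recon = eta0 + a[:, None] + b[None, :] + c
```

In this toy setting the system is consistent, so `recon` matches `f` and every component is constant on the cells of the partition.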

Parameters: xgb_model (xgb.sklearn.XGBModel) – The xgboost model for regression or classification to be decomposed with TreeHFD.

xgb_model

The input xgboost model.

Type: xgb.sklearn.XGBModel

config

The config of the xgboost model with all settings and parameters, from xgb_model.get_booster().save_config().

Type: dict

max_depth

Tree depth parameter of xgb_model (must be greater than 0).

Type: int

n_estimators

Number of trees of xgb_model (must be greater than 0). For multiclass classification, there are n_estimators trees for each logit.

Type: int

base_score

Base scores of xgb_model.

Type: np.ndarray

num_feature

The number of variables of the data used to fit xgb_model.

Type: int

num_parallel_tree

The number of trees in random forests (one for gradient boosting models).

Type: int

num_outputs

The number of model outputs: one for regression and binary classification, and the number of classes for multiclass classification.

Type: int

xgb_table

The table with the tree structures, obtained from the xgboost method xgb_model.get_booster().trees_to_dataframe().

Type: pd.core.frame.DataFrame

interaction_order

Set to 1 to fit only main effects, or to 2 to also include second-order interactions in the TreeHFD decomposition.

Type: int, default=2

interaction_list

The list of interactions, defined as variable pairs.

Type: np.ndarray, default=np.empty((0, 0))

depth_variable

Variables are selected at the first depth_variable levels of the tree for the components of the decomposition. Set to max_depth by default. Reducing depth_variable strongly speeds up computations for deep trees.

Type: int, default=max_depth

treehfd_list

The list of TreeHFD decompositions, one for each tree. For multiclass classification, all trees are stacked together following the index of xgb_table: for gradient boosting models, the trees of the same boosting round are stored next to each other, whereas for random forests, the trees of the same logit are stored next to each other.

Type: list, default=[]

eta0

Intercept of the TreeHFD decomposition of xgb_model. For multiclass classification, eta0 is an array with the intercept of each logit decomposition.

Type: float | np.ndarray, default=0

Examples

>>> import numpy as np
>>> import xgboost as xgb
>>> from treehfd import XGBTreeHFD
>>> from treehfd.validation import sample_data
>>> np.random.seed(11)
>>> X, y = sample_data(nsample=100)
>>> xgb_model = xgb.XGBRegressor()
>>> xgb_model = xgb_model.fit(X, y)
>>> treehfd_model = XGBTreeHFD(xgb_model)
>>> treehfd_model.fit(X, interaction_order=2)
>>> X_new, y_new = sample_data(nsample=3)
>>> y_main, y_order2 = treehfd_model.predict(X_new)
>>> print(f'TreeHFD intercept: {treehfd_model.eta0}')
>>> print(f'TreeHFD main effect predictions: {y_main}')
>>> print(f'TreeHFD interaction predictions: {y_order2}')
>>> interactions = treehfd_model.interaction_list
>>> print(f'TreeHFD interactions: {interactions}')

treehfd.XGBTreeHFD.fit(self, X: ndarray, interaction_order: int = 2, interaction_list: ndarray | None = None, depth_variable: int | None = None, verbose: bool = True) → None

Fit TreeHFD decomposition of the provided xgboost model.

Parameters:

X (np.ndarray) – The input data used to train the xgboost model.
interaction_order (int, default=2) – Set to 1 to fit only main effects, or to 2 to also include second-order interactions in the TreeHFD decomposition.
interaction_list (np.ndarray, default=None) – Predefined list of second-order interactions to be estimated in the decomposition. Each row defines an interaction with two integers for the variable indices. If None (the default), interactions are automatically extracted from tree paths.
depth_variable (int, default=None) – Variables are selected at the first depth_variable levels of the tree for the components of the decomposition. If None (the default), all variables are selected.
verbose (bool, default=True) – Set to False to deactivate the console display of computation progress (% of trees).
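To make the expected layout of interaction_list concrete, the sketch below builds such an array from a toy tree by collecting the pairs of variables that share a root-to-leaf path, restricted to splits in the first depth_variable levels. The tree encoding, the helper `path_interactions`, and the pairing rule are assumptions made for this example. The documentation only states that interactions are extracted from tree paths, so the library's actual extraction logic may differ.

```python
from itertools import combinations

import numpy as np

# Hypothetical toy tree: internal nodes map to
# (feature index, left child, right child); ids absent from the dict are leaves.
toy_tree = {
    0: (0, 1, 2),  # root (depth 0) splits on variable 0
    1: (1, 3, 4),  # depth 1, splits on variable 1
    2: (2, 5, 6),  # depth 1, splits on variable 2
    3: (3, 7, 8),  # depth 2, splits on variable 3
}


def path_interactions(tree, depth_variable):
    """Collect variable pairs sharing a root-to-leaf path, using only
    the splits located in the first `depth_variable` levels."""
    pairs = set()

    def walk(node, depth, seen):
        if node not in tree:  # reached a leaf
            pairs.update(combinations(sorted(seen), 2))
            return
        feature, left, right = tree[node]
        if depth < depth_variable:
            seen = seen | {feature}
        walk(left, depth + 1, seen)
        walk(right, depth + 1, seen)

    walk(0, 0, frozenset())
    # One row per interaction, two integer variable indices per row.
    return np.array(sorted(pairs), dtype=int).reshape(-1, 2)


interaction_list = path_interactions(toy_tree, depth_variable=2)
```

An array with this shape (one row per pair, two integer columns) is what the interaction_list argument of fit expects according to the parameter description above.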

treehfd.XGBTreeHFD.predict(self, X_new: ndarray, verbose: bool = True) → tuple

Predict TreeHFD components for new input data.

Parameters:

X_new (np.ndarray) – New input data where TreeHFD predictions are computed.
verbose (bool, default=True) – Set to False to deactivate the console display of computation progress (% of trees).

Returns:

y_main (np.ndarray) – Array for the predictions of main effects. For multiclass classification, y_main is an array of order three, with the prediction matrix for each label (axis 0: data samples, axis 1: labels, axis 2: input variables).
y_order2 (np.ndarray) – Array for predictions of second-order interactions (columns are ordered according to interaction_list). For multiclass classification, y_order2 is an array of order three, with the prediction matrix for each label.

Return type:

tuple
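The array layouts described above can be illustrated with placeholder data. The sketch below uses random arrays in place of real predict outputs, and assumes that the single-output case returns two-dimensional arrays with one column per component, that y_order2 follows the same axis order as y_main in the multiclass case, and that the intercept plus the sum of all components gives the decomposed model output. None of these assumptions are confirmed by this page beyond the stated axis order of y_main.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_pairs, n_labels = 5, 3, 2, 4

# Regression or binary classification (assumed layout: one column per component).
eta0 = 0.5
y_main = rng.normal(size=(n_samples, n_features))
y_order2 = rng.normal(size=(n_samples, n_pairs))
total = eta0 + y_main.sum(axis=1) + y_order2.sum(axis=1)

# Multiclass: axis 0 = data samples, axis 1 = labels, axis 2 = components.
eta0_mc = rng.normal(size=n_labels)
y_main_mc = rng.normal(size=(n_samples, n_labels, n_features))
y_order2_mc = rng.normal(size=(n_samples, n_labels, n_pairs))
total_mc = eta0_mc + y_main_mc.sum(axis=2) + y_order2_mc.sum(axis=2)
```

Summing over the last axis collapses the components, leaving one value per sample, or one value per sample and label in the multiclass case.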

treehfd.XGBTreeHFD._tree_predict(self, X: ndarray) → ndarray

Compute the original predictions of each tree.

Parameters: X (np.ndarray) – The input data where tree predictions are computed.
Returns: tree_predictions – Array with the predictions of each tree of the ensemble for X, where each column stores the predictions of a tree. For multiclass classification, all trees are stacked together following the index of xgb_table (see treehfd_list doc).
Return type: np.ndarray

treehfd.XGBTreeHFD._get_output_idx(self, tree_idx: int) → int

Get output index from tree_idx.

Parameters: tree_idx (int) – The tree index from xgb_table.
Returns: output_idx – Index of the output modeled by the tree of index tree_idx. Note that output_idx is always 0 for regression and binary classification.
Return type: int
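One way to picture the mapping from tree index to output index is the sketch below. The function `output_index` is a hypothetical helper, not the library's private method. It assumes the tree ordering described for treehfd_list: within a boosting round, the num_parallel_tree trees of each output are stored next to each other, and outputs follow one another.

```python
def output_index(tree_idx: int, num_outputs: int, num_parallel_tree: int) -> int:
    """Hypothetical helper: map a tree index to the output it models.

    Assumed layout: for each boosting round, for each output,
    `num_parallel_tree` consecutive trees.
    """
    return (tree_idx // num_parallel_tree) % num_outputs


# Gradient boosting with 3 classes: one tree per class in each round.
boosting_layout = [output_index(t, num_outputs=3, num_parallel_tree=1) for t in range(6)]

# Random forest with 3 classes and 2 trees per class in a single round.
forest_layout = [output_index(t, num_outputs=3, num_parallel_tree=2) for t in range(6)]

# Regression or binary classification: a single output.
single_layout = [output_index(t, num_outputs=1, num_parallel_tree=1) for t in range(4)]
```

Under this assumed layout, gradient boosting cycles through the outputs within each round, while a random forest groups the trees of the same logit together.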