XGBTreeHFD
- class treehfd.XGBTreeHFD(xgb_model: XGBModel)
TreeHFD decomposition of an xgboost model.
XGBTreeHFD is the TreeHFD decomposition of an xgboost tree ensemble, defined as the Hoeffding functional decomposition of the target tree ensemble, where the hierarchical orthogonality constraints are discretized over the Cartesian tree partitions. For each tree, the TreeHFD algorithm solves a least squares problem to find the coefficients defining the functional components of the decomposition, which are all piecewise constant on the Cartesian tree partitions.
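As a rough illustration of what such a decomposition looks like, the sketch below computes the classical Hoeffding decomposition of a toy piecewise constant function on a small Cartesian grid. It assumes two independent inputs, in which case the orthogonality constraints reduce to zero conditional means. It is not the TreeHFD algorithm itself, which handles the general case through a least squares problem per tree. The grid values and cell probabilities are made up for the example.

```python
import numpy as np

# Toy piecewise constant function on a 2 x 3 Cartesian partition:
# f[i, j] is the value taken on cell (i, j) of the variables (X1, X2).
f = np.array([[1.0, 2.0, 4.0],
              [3.0, 0.0, 5.0]])

# Assumption: X1 and X2 are independent, so cell probabilities factorize.
p1 = np.array([0.4, 0.6])       # P(X1 falls in cell i)
p2 = np.array([0.2, 0.5, 0.3])  # P(X2 falls in cell j)

eta0 = p1 @ f @ p2              # intercept: mean of f
eta1 = f @ p2 - eta0            # main effect of X1, one value per cell i
eta2 = p1 @ f - eta0            # main effect of X2, one value per cell j
eta12 = f - eta0 - eta1[:, None] - eta2[None, :]  # second-order interaction

# Each component is centered, and the interaction has zero conditional
# mean given either variable (orthogonality to lower-order components).
print(np.isclose(p1 @ eta1, 0.0), np.isclose(eta2 @ p2, 0.0))
print(np.allclose(eta12 @ p2, 0.0), np.allclose(p1 @ eta12, 0.0))
```

Every component here is piecewise constant on the same grid, and the four terms add back to `f` exactly by construction.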
- Parameters:
xgb_model (xgb.sklearn.XGBModel) – The xgboost model for regression or classification to be decomposed with TreeHFD.
- xgb_model
The input xgboost model.
- Type:
xgb.sklearn.XGBModel
- config
The config of the xgboost model with all settings and parameters, from xgb_model.get_booster().save_config().
- Type:
dict
- max_depth
Tree depth parameter of xgb_model (must be greater than 0).
- Type:
int
- n_estimators
Number of trees of xgb_model (must be greater than 0). For multiclass classification, there are n_estimators trees for each logit.
- Type:
int
- base_score
Base scores of xgb_model.
- Type:
np.ndarray
- num_feature
The number of variables of the data used to fit xgb_model.
- Type:
int
- num_parallel_tree
The number of trees in random forests (one for gradient boosting models).
- Type:
int
- num_outputs
The number of model outputs: one for regression and binary classification, and the number of classes for multiclass classification.
- Type:
int
- xgb_table
The table with the tree structures, obtained from the xgboost method xgb_model.get_booster().trees_to_dataframe().
- Type:
pd.core.frame.DataFrame
- interaction_order
Set to 1 to fit only main effects, or to 2 to also include second-order interactions in the TreeHFD decomposition.
- Type:
int, default=2
- interaction_list
The list of interactions, defined as variable pairs.
- Type:
np.ndarray, default=np.empty((0, 0))
- depth_variable
Variables are selected at the first depth_variable levels of the tree for the components of the decomposition. Set to max_depth by default. Reducing depth_variable strongly speeds up computations for deep trees.
- Type:
int, default=max_depth
- treehfd_list
The list of the TreeHFD decomposition for each tree. For multiclass classification, all trees are stacked together following the index of xgb_table: for gradient boosting models, the trees of the same boosting round are stored next to each other, whereas for random forests, the trees of the same logit are stored next to each other.
- Type:
list, default=[]
- eta0
Intercept of the TreeHFD decomposition of xgb_model. For multiclass classification, eta0 is an array with the intercept of each logit decomposition.
- Type:
float | np.ndarray, default=0
Examples
>>> import numpy as np
>>> import xgboost as xgb
>>> from treehfd import XGBTreeHFD
>>> from treehfd.validation import sample_data
>>> np.random.seed(11)
>>> X, y = sample_data(nsample=100)
>>> xgb_model = xgb.XGBRegressor()
>>> xgb_model = xgb_model.fit(X, y)
>>> treehfd_model = XGBTreeHFD(xgb_model)
>>> treehfd_model.fit(X, interaction_order=2)
>>> X_new, y_new = sample_data(nsample=3)
>>> y_main, y_order2 = treehfd_model.predict(X_new)
>>> print(f'TreeHFD intercept: {treehfd_model.eta0}')
>>> print(f'TreeHFD main effect predictions: {y_main}')
>>> print(f'TreeHFD interaction predictions: {y_order2}')
>>> interactions = treehfd_model.interaction_list
>>> print(f'TreeHFD interactions: {interactions}')
- treehfd.XGBTreeHFD.fit(self, X: ndarray, interaction_order: int = 2, interaction_list: ndarray | None = None, depth_variable: int | None = None, verbose: bool = True) → None
Fit the TreeHFD decomposition of the provided xgboost model.
- Parameters:
X (np.ndarray) – The input data used to train the xgboost model.
interaction_order (int, default=2) – Set to 1 to fit only main effects, or to 2 to also include second-order interactions in the TreeHFD decomposition.
interaction_list (np.ndarray, default=None) – Predefined list of second-order interactions to be estimated in the decomposition. Each row defines an interaction with two integers for the variable indices. If None (the default), interactions are automatically extracted from tree paths.
depth_variable (int, default=None) – Variables are selected at the first depth_variable levels of the tree for the components of the decomposition. If None (the default), all variables are selected.
verbose (bool, default=True) – Set to False to deactivate the console display of computation progress (% of trees).
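The row format expected for interaction_list can be sketched as follows. This is a minimal example assuming a hypothetical choice of variables (indices 0, 2 and 3); only NumPy and the standard library are used, and the call to fit is shown as a comment because it needs a fitted xgboost model.

```python
from itertools import combinations

import numpy as np

# Hypothetical choice: restrict second-order interactions to variables 0, 2 and 3.
selected_variables = [0, 2, 3]

# One row per interaction, two integer variable indices per row.
interaction_list = np.array(list(combinations(selected_variables, 2)), dtype=int)
print(interaction_list)

# The array can then be passed to fit (not run here):
# treehfd_model.fit(X, interaction_order=2, interaction_list=interaction_list)
```

Leaving interaction_list at None lets the library extract the pairs from the tree paths instead.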
- treehfd.XGBTreeHFD.predict(self, X_new: ndarray, verbose: bool = True) → tuple
Predict TreeHFD components for new input data.
- Parameters:
X_new (np.ndarray) – New input data where TreeHFD predictions are computed.
verbose (bool, default=True) – Set to False to deactivate the console display of computation progress (% of trees).
- Returns:
- y_main (np.ndarray)
Array for the predictions of main effects. For multiclass classification, y_main is an array of order three, with the prediction matrix for each label (axis 0: data samples, axis 1: labels, axis 2: input variables).
- y_order2 (np.ndarray)
Array for predictions of second-order interactions (columns are ordered according to interaction_list). For multiclass classification, y_order2 is an array of order three, with the prediction matrix for each label.
- Return type:
tuple
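The shapes described above can be handled uniformly by summing over the last axis. The sketch below uses synthetic arrays in place of real predict outputs, and rests on two assumptions that this reference does not state explicitly: that the intercept and components add up to an approximation of the model output on its raw scale, and that y_order2 uses the same axis layout as y_main for multiclass classification. The helper name sum_components is hypothetical.

```python
import numpy as np


def sum_components(eta0, y_main, y_order2):
    """Add the intercept and all components along the last axis.

    Works for 2D arrays (samples x components) and for the 3D multiclass
    layout (samples x labels x components), since eta0 broadcasts over
    the leading axes.
    """
    return eta0 + y_main.sum(axis=-1) + y_order2.sum(axis=-1)


# Synthetic stand-ins for predict outputs: 2 samples, 3 variables, 1 interaction.
y_main = np.array([[0.5, -0.2, 0.1], [0.0, 0.3, -0.1]])
y_order2 = np.array([[0.05], [-0.02]])
total = sum_components(1.0, y_main, y_order2)

# Synthetic multiclass case: 2 samples, 3 labels, 4 variables, 2 interactions.
y_main_mc = np.ones((2, 3, 4))
y_order2_mc = np.zeros((2, 3, 2))
eta0_mc = np.array([0.1, 0.2, 0.3])
total_mc = sum_components(eta0_mc, y_main_mc, y_order2_mc)

print(total.shape, total_mc.shape)
```

With real outputs, comparing such a sum against the raw predictions of xgb_model would be one way to gauge how closely the decomposition tracks the model.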
- treehfd.XGBTreeHFD._tree_predict(self, X: ndarray) → ndarray
Compute the original predictions of each tree.
- Parameters:
X (np.ndarray) – The input data where tree predictions are computed.
- Returns:
tree_predictions – Array with the predictions of each tree of the ensemble for X, where each column stores the predictions of a tree. For multiclass classification, all trees are stacked together following the index of xgb_table (see treehfd_list doc).
- Return type:
np.ndarray
- treehfd.XGBTreeHFD._get_output_idx(self, tree_idx: int) → int
Get output index from tree_idx.
- Parameters:
tree_idx (int) – The tree index from xgb_table.
- Returns:
output_idx – Index of the output modeled by the tree of index tree_idx. Note that output_idx is always 0 for regression and binary classification.
- Return type:
int
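One way such a mapping could work is sketched below. It is an illustrative guess consistent with the tree ordering described for treehfd_list, not the library's implementation: the function output_index and its formula are hypothetical, and assume that within each boosting round the trees are grouped by output, with num_parallel_tree consecutive trees per output.

```python
def output_index(tree_idx: int, num_outputs: int, num_parallel_tree: int) -> int:
    """Hypothetical mapping from a tree index to the output it models.

    Assumed layout: within each boosting round, trees are grouped by output,
    with num_parallel_tree consecutive trees per output.
    """
    return (tree_idx // num_parallel_tree) % num_outputs


# Gradient boosting, 3 classes: trees of the same round sit next to each other.
boosting = [output_index(i, num_outputs=3, num_parallel_tree=1) for i in range(6)]

# Random forest, 3 classes, 2 trees per class: trees of the same logit are adjacent.
forest = [output_index(i, num_outputs=3, num_parallel_tree=2) for i in range(6)]

# Regression or binary classification: a single output.
single = [output_index(i, num_outputs=1, num_parallel_tree=1) for i in range(4)]

print(boosting, forest, single)
```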