Hi, I am building a HoeffdingTree classifier on a heavily imbalanced data stream (only ~1 in 1000 data points are of the positive class). Using the EvaluatePrequential evaluator I am able to plot the precision and recall; however, the recall is extremely low because the model learns to almost always predict the negative class (only 50 positive predictions in my stream of 10 million data points).
Tree classifiers often give me class probabilities rather than discrete class outputs, and the actual recall (and precision) is of course threshold-dependent. Is there a way to control the threshold for which I am evaluating the recall?
You can get probabilities via the predict_proba method of the HoeffdingTree; however, there is currently no support for this in the EvaluatePrequential class. In this case you might want to implement the prequential evaluation loop yourself. Something like this:
# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.trees import HoeffdingTreeClassifier
# Setting up a data stream
stream = SEAGenerator(random_state=1)
# Setup Hoeffding Tree estimator
ht = HoeffdingTreeClassifier()
# Setup variables to control loop and track performance
n_samples = 0
correct_cnt = 0
max_samples = 200
# Train the estimator with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = ht.predict(X)
    if y[0] == y_pred[0]:
        correct_cnt += 1
    ht = ht.partial_fit(X, y)
    n_samples += 1
The metrics can be calculated using the ClassificationPerformanceEvaluator and WindowClassificationPerformanceEvaluator available in the development branch.
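To answer the threshold part of the question: once you have the scores from predict_proba, the decision threshold can be applied by hand inside that loop. A minimal sketch (pure NumPy; the synthetic scores stand in for ht.predict_proba(X)[:, 1] so it runs without scikit-multiflow):

```python
import numpy as np

def precision_recall_at_threshold(y_true, pos_proba, threshold):
    """Precision and recall when a positive is predicted
    only if P(class=1) >= threshold."""
    y_pred = (pos_proba >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = float(tp / (tp + fp)) if (tp + fp) > 0 else 0.0
    recall = float(tp / (tp + fn)) if (tp + fn) > 0 else 0.0
    return precision, recall

# Synthetic stand-in for the positive-class scores over a window
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
pos_proba = np.array([0.1, 0.4, 0.45, 0.2, 0.8, 0.3, 0.05, 0.35])

# The default 0.5 threshold misses most positives...
print(precision_recall_at_threshold(y_true, pos_proba, 0.5))
# ...lowering the threshold trades precision for recall
print(precision_recall_at_threshold(y_true, pos_proba, 0.3))
```

Sweeping the threshold over a grid of values inside the prequential loop gives the precision/recall trade-off for the current window.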
@nuwangunasekara
Is there a way to achieve something similar to sklearn.preprocessing.OneHotEncoder() in scikit-multiflow in a streaming setting?
The StreamTransform class can be used as a base class to implement it. There are two main scenarios (with challenges) that I can see:
- If you know the number of distinct values in the nominal attribute, then it should be as simple as mapping each value to the corresponding binary attribute (a dict would help).
- If you don't know the distinct values in the nominal attribute, this is more challenging: first, the mapping must be maintained dynamically; second, if a new value appears, the length of the sample will change as a new binary attribute is added. This is complex, as it is not guaranteed that methods will support "emerging" attributes in this fashion.
I would explore the first scenario first as the second one seems more like a corner case.
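For the first scenario, here is a minimal sketch of the dict-based mapping (plain NumPy, not an actual StreamTransform subclass; the class name and category values are made up for illustration):

```python
import numpy as np

class StreamingOneHotEncoder:
    """One-hot encode a single nominal attribute whose distinct
    values are known in advance (scenario 1 above)."""

    def __init__(self, categories):
        # Fixed mapping: nominal value -> binary column index
        self.index = {value: i for i, value in enumerate(categories)}

    def transform(self, X_nominal):
        """Encode a batch of nominal values into binary columns."""
        out = np.zeros((len(X_nominal), len(self.index)))
        for row, value in enumerate(X_nominal):
            out[row, self.index[value]] = 1.0
        return out

# Hypothetical nominal attribute with three known values
encoder = StreamingOneHotEncoder(categories=['red', 'green', 'blue'])
print(encoder.transform(['green', 'red']))
# [[0. 1. 0.]
#  [1. 0. 0.]]
```

Wrapping this logic in a StreamTransform subclass would mainly mean implementing its transform/partial_fit interface around the same dict.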
Thanks for the tip @jacobmontiel !
# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.anomaly_detection import HalfSpaceTrees
# Setup a data stream
stream = SEAGenerator(random_state=1)
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5)
# Pre-train the model with one sample
X, y = stream.next_sample()
half_space_trees.partial_fit(X, y)
# Setup variables to control loop and track performance
n_samples = 0
max_samples = 5000
anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = half_space_trees.predict(X)
    if y_pred[0] == 1:
        anomaly_cnt += 1
    half_space_trees = half_space_trees.partial_fit(X, y)
    n_samples += 1
# Display results
print('{} samples analyzed.'.format(n_samples))
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
Some comments:
- The pre-train phase is needed in this case to avoid an error when predicting while the model is still empty.
- The SEA generator does not really provide data with actual anomalies; we just use it to show how the detector interacts with a Stream object. You must replace the generator with your actual data.
- This example corresponds to the development version, where the parameter n_features has been removed from the signature.
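Since SEAGenerator contains no real anomalies, one way to sanity-check the detector is to build a small dataset with injected outliers to use in place of the generator. A hedged sketch (pure NumPy; the contamination rate, shift, and distributions are arbitrary choices, not from this thread):

```python
import numpy as np

def make_anomaly_data(n_normal=950, n_anomalies=50, n_features=3, seed=1):
    """Normal points drawn from N(0, 1); anomalies shifted far away.
    Returns (X, y) with y == 1 marking the injected anomalies."""
    rng = np.random.RandomState(seed)
    X_normal = rng.randn(n_normal, n_features)
    X_anom = rng.randn(n_anomalies, n_features) + 8.0  # clearly out of range
    X = np.vstack([X_normal, X_anom])
    y = np.concatenate([np.zeros(n_normal, dtype=int),
                        np.ones(n_anomalies, dtype=int)])
    # Shuffle so anomalies are interleaved in the stream
    order = rng.permutation(len(y))
    return X[order], y[order]

X, y = make_anomaly_data()
print(X.shape, int(y.sum()))  # (1000, 3) 50
```

The resulting arrays can be iterated sample-by-sample exactly like the stream in the loop above, and the known labels let you check the detector's hit rate.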
Thank you, appreciate it.
# Imports
from skmultiflow.data import SEAGenerator
from skmultiflow.anomaly_detection import HalfSpaceTrees
# Setup a data stream
stream = SEAGenerator(random_state=1)
stream.prepare_for_use()
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5, n_features=2)
# Pre-train the model with one sample
X, y = stream.next_sample()
half_space_trees.partial_fit(X, y)
# Setup variables to control loop and track performance
n_samples = 0
max_samples = 5000
anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
while n_samples < max_samples and stream.has_more_samples():
    X, y = stream.next_sample()
    y_pred = half_space_trees.predict(X)
    if y_pred[0] == 1:
        anomaly_cnt += 1
    half_space_trees = half_space_trees.partial_fit(X, y)
    n_samples += 1
# Display results
print('{} samples analyzed.'.format(n_samples))
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
# Imports
from skmultiflow.anomaly_detection import HalfSpaceTrees
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(30, 3), columns=['x', 'y', 'z'])
# Access raw numpy array inside the dataframe
X_array = df.values
# Setup Half-Space Trees estimator
half_space_trees = HalfSpaceTrees(random_state=1, n_estimators=5) #, n_features=2)
# Pre-train the model with one sample
# the sample is a 1D array and we must pass a 2D array, thus np.asarray([X_array[0]])
half_space_trees.partial_fit(np.asarray([X_array[0]]), [0])
anomaly_cnt = 0
# Train the estimator(s) with the samples provided by the data stream
for X in X_array[1:]:
    y_pred = half_space_trees.predict([X])
    if y_pred[0] == 1:
        anomaly_cnt += 1
    half_space_trees = half_space_trees.partial_fit(np.asarray([X]), [0])
# Display results
print('Half-Space Trees anomalies detected: {}'.format(anomaly_cnt))
from skmultiflow.data import SEAGenerator
import pandas as pd
import numpy as np
X, y = SEAGenerator(random_state=12345).next_sample(1000)
df = pd.DataFrame(np.hstack((X, y.reshape(-1, 1))),
                  columns=['attr_{}'.format(i) for i in range(X.shape[1])] + ['target'])
df.target = df.target.astype(int)
df.to_csv('stream.csv', index=False)  # index=False avoids writing a spurious index column
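A quick round-trip check of the CSV layout (pandas only; an in-memory buffer stands in for stream.csv, and the data sizes are reduced for illustration). It also shows why index=False matters: without it the unnamed index is written as an extra column and would later be read back as if it were a feature:

```python
import io
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
X = rng.rand(10, 3)
y = (X.sum(axis=1) > 1.5).astype(int)

df = pd.DataFrame(np.hstack((X, y.reshape(-1, 1))),
                  columns=['attr_{}'.format(i) for i in range(X.shape[1])] + ['target'])
df['target'] = df['target'].astype(int)

# In-memory stand-in for 'stream.csv'
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back: features first, target as the last column,
# which is the layout a CSV-backed stream expects.
df_back = pd.read_csv(buffer)
print(list(df_back.columns))  # ['attr_0', 'attr_1', 'attr_2', 'target']
print(df_back.shape)          # (10, 4)
```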
RandomForest is the batch version, based on decision trees. AdaptiveRandomForest is the stream version, based on Hoeffding Trees. AdaptiveRandomForest can be used with or without drift detection. If you want to use AdaptiveRandomForest without drift detection, you must initialize it as AdaptiveRandomForest(drift_detection_method=None).
Thank you so much @jacobmontiel !
from skmultiflow.data.data_stream import DataStream
from skmultiflow.evaluation import EvaluatePrequential
from skmultiflow.trees import HoeffdingTree
stream = DataStream(X_train, y=y_train)
stream.prepare_for_use()
ht = HoeffdingTree()
evaluator = EvaluatePrequential(show_plot=True,
                                pretrain_size=5000,
                                max_samples=20000,
                                metrics=['accuracy', 'running_time', 'model_size'],
                                output_file='results.csv')
evaluator.evaluate(stream=stream, model=ht)
With show_plot=False your code runs normally (is that correct?). It seems that your problem is related to the matplotlib backend used in Jupyter. The solution is probably to set a proper backend for your interactive plot.
@jacobmontiel .. I think I figured it out by taking your advice to use AdaptiveRandomForest(drift_detection_method=None). Thank you!
Glad to help.
@jacobmontiel Hi Jacob, is there a way that I can get access to the actual values predicted per data segment during the evaluations? I have 1 million SEAGen data points and need to perform McNemar's statistical significance test, which requires knowing which labels classifier A got incorrect vs. classifier B, etc. As such, I need to record the actual values predicted by each classifier.
If you are using an evaluator, you can add true_vs_predicted to metrics to get the predicted values. In this case you also need to set n_wait=1. As a suggestion, deactivate the plot in this case, since n_wait=1 implies a high refresh rate in the plot, which is a lot of overhead.
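Once the true-vs-predicted records are collected for both classifiers, McNemar's test only needs the counts of samples where exactly one of the two was wrong. A self-contained sketch (plain Python, continuity-corrected chi-square statistic; the labels and predictions are made up for illustration):

```python
def mcnemar_statistic(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar statistic from paired predictions.

    b = samples A got right and B got wrong;
    c = samples A got wrong and B got right.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b)
            if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b)
            if a != t and p == t)
    if b + c == 0:
        return 0.0  # the classifiers never disagree on correctness
    return (abs(b - c) - 1) ** 2 / (b + c)

# Made-up labels and predictions for two classifiers
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
pred_a = [0, 1, 1, 0, 0, 0, 1, 1]
pred_b = [0, 1, 0, 1, 0, 0, 1, 0]

stat = mcnemar_statistic(y_true, pred_a, pred_b)
# Compare against the chi-square critical value (3.84 at alpha = 0.05)
print(stat)  # 1.3333333333333333
```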
@automater0 I'm guessing the T in Kappa T stands for temporal. Bifet refers to it as κ_per; see p. 91 of Bifet, A., Gavaldà, R., Holmes, G., & Pfahringer, B. (2017). Machine Learning for Data Streams: With Practical Examples in MOA (Adaptive Computation and Machine Learning series). MIT Press.
That is correct.
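As a quick illustration of Kappa Temporal, here is a sketch under the assumption (as in the MOA book's formulation) that the baseline is a no-change classifier predicting each sample's label as the previous true label; the label sequences are made up:

```python
def kappa_temporal(y_true, y_pred):
    """Kappa Temporal: compares the classifier's accuracy p0 against the
    accuracy pe of a no-change (persistent) classifier that always
    predicts the previous true label: kappa = (p0 - pe) / (1 - pe)."""
    n = len(y_true)
    p0 = sum(p == t for p, t in zip(y_pred, y_true)) / n
    # No-change baseline: prediction for sample i is the label of i - 1
    # (the first sample has no predecessor, so start from index 1)
    pe = sum(y_true[i] == y_true[i - 1] for i in range(1, n)) / (n - 1)
    if pe == 1.0:
        return 0.0  # degenerate case: the baseline is already perfect
    return (p0 - pe) / (1 - pe)

# Made-up stream labels and predictions
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 1]
print(kappa_temporal(y_true, y_pred))
```

A positive value means the classifier beats the naive persistent baseline, which plain accuracy can hide on autocorrelated streams.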