Abstract:
This blog explores incremental learning, the preferred ML scheme for production environments with a continuous stream of data. It also gives a brief introduction to scikit-learn's partial_fit and River, two popular Python options for incremental learning.
Introduction:
According to Polaris Market Research, the streaming data industry, valued at $15.65B in 2021, is predicted to grow at a robust CAGR of 27.0% from 2021 to 2030. In various streaming data use cases, such as traffic steering in the cellular industry and personalized recommendation systems, real-time decisions must be made while managing limited storage resources. This necessitates an incremental model that learns on the go and improves its knowledge by leveraging the continuous availability of input data. Such a model refines its parameters based on the knowledge gathered from the old data, seamlessly adjusting to the new data as learning progresses.
Application scenarios:
- Limited memory resources – When the entire dataset cannot fit into the available memory, incremental learning makes it possible to split the data into chunks and fit them one by one.
- Model or data drift – A model trained in a static environment might face challenges during deployment if the distribution of production data differs vastly from that of the training data, or if the relationship between inputs and outputs changes. Incremental learning addresses this by placing a drift detector in the pipeline and triggering learning on new data without refitting from scratch.
- Streaming data – For a data stream, incremental learning emerges as a natural solution for on-the-go learning and prediction, without requiring the full dataset to be accumulated before fitting.
Incremental learning scheme:
A typical incremental learning scheme entails the following steps:
- First training: A batch of data is split into training and test sets. An initial warm-up step is performed before the entire training split is used for model fitting; this ensures that the model parameters are reasonably initialized for the data.
- Evaluation: The model is evaluated on the test split and then saved.
- Further batch learning: The previous two steps are repeated on the saved model for each newly collected batch of data.
- Model registry: Model versioning keeps track of and stores the changes in trained models over time.
- Predictions: Predictions can be generated from the latest saved model at any stage after the first training, as illustrated in the sketch below.
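The following is a minimal sketch of this scheme, assuming mini-batches of tabular data arrive over time; the synthetic batches (generated with make_classification) and the file names used as a stand-in for a model registry are purely illustrative.

```python
# A minimal sketch of the incremental learning scheme above; the synthetic
# batches and file names are illustrative stand-ins for a real data stream
# and model registry.
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

model = SGDClassifier()
classes = np.array([0, 1])  # all labels must be declared before the first fit

for batch_id in range(3):  # each iteration stands in for a newly collected batch
    X, y = make_classification(n_samples=500, random_state=batch_id)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model.partial_fit(X_train, y_train, classes=classes)   # first / further batch learning
    score = accuracy_score(y_test, model.predict(X_test))  # evaluation on the test split
    joblib.dump(model, f"model_v{batch_id}.joblib")        # versioned snapshot (model registry)
    print(f"batch {batch_id}: accuracy = {score:.3f}")

# Predictions can be served from the latest saved model at any stage:
latest_model = joblib.load("model_v2.joblib")
```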
Some of the libraries that support incremental learning are scikit-learn, scikit-multiflow, creme, Vowpal Wabbit and River; scikit-multiflow and creme were merged to form River. The upcoming sections give a brief overview of the incremental learning features in scikit-learn and River.
scikit-learn:
scikit-learn is one of the most popular Python libraries for ML, distributed under the permissive BSD-3 license. Some scikit-learn estimators expose a partial_fit method that facilitates fitting data in batches. The linear models among them use an optimization technique known as Stochastic Gradient Descent (SGD), in which the parameters are learned iteratively from small random subsets of the training data rather than the whole dataset. Some of the incremental learning features in scikit-learn are mentioned below:
- Normalization – Normalization brings features of varying scales onto the same scale. Transformers such as StandardScaler, MinMaxScaler and MaxAbsScaler support partial_fit and maintain running statistics of the data across batches. StandardScaler is recommended when SGD-based models are used in the training stage.
- Categorical encoding – For streaming data, it becomes essential that a categorical feature be encoded into a fixed number of columns. scikit-learn's FeatureHasher does exactly this through a built-in hash function. Unlike encoders such as one-hot encoding, the number of encoded features stays the same even if a new category appears. It is recommended to set the number of output features of FeatureHasher to a sufficiently large power of two to keep hash collisions rare.
- Training – SGDClassifier/SGDRegressor and PassiveAggressiveClassifier/PassiveAggressiveRegressor are available for basic ML tasks like classification and regression. For advanced tasks like anomaly detection, SGDOneClassSVM (one-class Support Vector Machine) is useful. A short sketch combining these pieces follows this list.
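Below is a minimal sketch that ties these pieces together; the synthetic mini-batches, the three-colour categorical feature and the hash width of 16 columns are assumptions made purely for illustration.

```python
# A minimal sketch of batch-wise preprocessing and training with scikit-learn;
# the synthetic mini-batches and feature layout are illustrative assumptions.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

hasher = FeatureHasher(n_features=2**4, input_type="string")  # fixed-width categorical encoding
scaler = StandardScaler()                                     # maintains running mean/variance
clf = SGDClassifier()                                         # linear model trained with SGD
classes = np.array([0, 1])                                    # every possible label, listed up front

for _ in range(5):  # five synthetic mini-batches standing in for a data stream
    X_num = rng.normal(size=(100, 3))                         # numerical features
    X_cat = rng.choice(["red", "green", "blue"], size=100)    # one categorical feature
    y = rng.integers(0, 2, size=100)                          # binary labels

    X_scaled = scaler.partial_fit(X_num).transform(X_num)     # update running stats, then scale
    X_hashed = hasher.transform([[c] for c in X_cat])         # hash each category into 16 columns
    X_batch = hstack([X_hashed, X_scaled])                    # combine into one sparse matrix

    clf.partial_fit(X_batch, y, classes=classes)              # classes is only required on the first call
```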
partial_fit has the following limitations:
- Consistent batches – Every mini-batch must have the same number of features; fitting fails if the feature count varies in a new batch.
- Prior knowledge of labels – In classification tasks, all output classes that may ever appear must be known in advance and passed via the classes argument on the first partial_fit call.
- Multiple passes over data – While dealing with numerous batches, there is a possibility that by the time the last batch arrives, the learning from the first batch has been forgotten. Hence, it may require multiple passes over the shuffled data with a low learning rate, as sketched below.
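A common workaround, sketched below under the assumption that earlier batches can be kept around and replayed, is to make several shuffled passes over them with a small constant learning rate; the epoch count, batch size and learning rate shown are illustrative.

```python
# A sketch of mitigating forgetting via several shuffled passes over stored
# batches; the epoch count, batch size and learning rate are assumptions.
import random
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)
batches = [(X[i:i + 100], y[i:i + 100]) for i in range(0, 1000, 100)]

clf = SGDClassifier(learning_rate="constant", eta0=0.001)  # small, fixed learning rate
for epoch in range(5):                  # several passes over the accumulated batches
    random.shuffle(batches)             # present the batches in a different order each pass
    for X_b, y_b in batches:
        clf.partial_fit(X_b, y_b, classes=np.array([0, 1]))
```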
River:
River has a permissive BSD-3 license and is a one-stop solution for the majority of incremental learning use cases. Its versatile API covers various parts of the ML pipeline, including data ingestion, preprocessing, model training, evaluation and monitoring.
Features of River that are apt for incremental learning:
- Inconsistent batch shape supported – One can add/remove/permute features between batches.
- Robust to targets – It handles missing target values and the addition of new target labels.
- Adapting to drift – River integrates drift detectors like ADWIN (ADaptive WINdowing) and Page-Hinkley to monitor model performance. Concept drift happens when the relationship between data and output changes due to an external factor; it has to be compensated for with new data, which might carry additional features. River facilitates such changes with classes like DriftRetrainingClassifier.
- Single and batch learning – Data can be learned one sample at a time with the *_one methods (e.g., learn_one, predict_one), and mini-batch learning is supported with the *_many methods (e.g., learn_many, predict_many).
- Convertibility – It also provides wrappers for making a River estimator compatible with scikit-learn.
- Pipelining – All the stages of incremental learning can be implemented in a single pipeline with ease, as shown in the sketch below.
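The following is a minimal sketch of such a pipeline, using River's bundled Phishing dataset and the usual test-then-train loop; exact method names and return values can vary slightly between River versions.

```python
# A minimal sketch of an incremental River pipeline on the bundled Phishing
# dataset, following the test-then-train pattern (predict first, then learn).
from river import compose, datasets, linear_model, metrics, preprocessing

model = compose.Pipeline(
    preprocessing.StandardScaler(),      # online feature scaling
    linear_model.LogisticRegression(),   # online logistic regression
)
metric = metrics.Accuracy()

for x, y in datasets.Phishing():         # x is a dict of features, y is a boolean label
    y_pred = model.predict_one(x)        # predict before seeing the label
    metric.update(y, y_pred)             # running evaluation
    model.learn_one(x, y)                # then update the model with this sample

print(metric)                            # running accuracy after one pass over the stream
```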
Recommendations:
- scikit-learn's partial_fit is easier to implement for numerical data with known class labels and allows the use of NumPy for faster data processing.
- River is better suited to data with uncertain characteristics, such as missing values and unknown class labels, and supports more algorithms, including time-series forecasting and recommender systems.
- The choice between the two depends on the specific use case and the preferred incremental learning workflow.
References:
- MathWorks, "Incremental Learning Overview": https://www.mathworks.com/help/stats/incremental-learning-overview.html
- Polaris Market Research, "Streaming Analytics Market Size Global Report, 2022 - 2030" (polarismarketresearch.com)
- Deepchecks, "Data Drift vs. Concept Drift" (deepchecks.com)
- River documentation: https://riverml.xyz/0.14.0/
- scikit-learn, "Strategies to scale computationally": https://scikit-learn.org/0.15/modules/scaling_strategies.html
- scikit-learn, "1.5. Stochastic Gradient Descent" (scikit-learn 1.3.1 documentation)
- Stack Overflow, "Why does partial_fit in SGDClassifier suffer from a gradual reduction in model accuracy"
- Data Science Stack Exchange, "How should I choose n_features in FeatureHasher in sklearn?"