Oversampling Ratio

The oversampling ratio is a quantitative parameter in machine learning and data mining that defines the proportional increase of minority class instances relative to the majority class when applying oversampling techniques to address class imbalance. In datasets with an imbalanced class distribution, where the number of instances in one class (the majority) significantly exceeds another (the minority), standard learning algorithms tend to be biased toward the majority class, often maximizing overall accuracy at the expense of minority class performance [1][2]. The oversampling ratio is central to resampling strategies aimed at mitigating this bias by artificially increasing the representation of the minority class before model training [4].

Oversampling is a core technique within the broader field of imbalanced learning, which remains a compelling research area due to the prevalence of skewed datasets in real-world applications [1]. The process involves generating synthetic or duplicate examples of the minority class until a desired class balance is achieved. The specific ratio—for example, 1:1 for perfect balance or a smaller proportion—determines the final distribution of classes in the training set.

Key methods for implementing this ratio range from simple random replication to more sophisticated algorithms like the Synthetic Minority Oversampling Technique (SMOTE), which creates new instances by interpolating between existing minority examples [7]. Other techniques, such as ADASYN, adaptively generate samples based on the local density of the minority class, influencing the effective oversampling ratio in different regions of the feature space [8]. These methods are often contrasted with undersampling, which reduces the majority class, and can be integrated into ensemble frameworks like boosting [4][7].
The significance of the oversampling ratio extends across numerous domains where rare events are critical, including fraud detection, medical diagnosis, fault prediction, and building energy load forecasting [1][6]. Selecting an appropriate ratio is not trivial, as it interacts with dataset size, concept complexity, and the chosen classifier's characteristics [3]. In deep learning, the impact and necessity of explicit oversampling are active research topics, with studies investigating whether network depth and regularization can inherently alleviate problems caused by imbalance [3]. Modern, flexible toolkits enable the application of various resampling techniques, including those defined by specific oversampling ratios, even in distributed computing environments for large-scale or imbalanced regression problems [5]. Consequently, understanding and tuning the oversampling ratio is a fundamental step in developing robust predictive models from skewed data.
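As an illustration of how a target ratio reshapes the class distribution, the following sketch implements plain random oversampling in NumPy; the function name `random_oversample` and the 95:5 toy data are hypothetical illustrations, not taken from any cited library:

```python
import numpy as np

def random_oversample(X, y, minority_label, ratio):
    """Duplicate minority rows (with replacement) until the minority count
    reaches `ratio` times the majority count. A minimal sketch of random
    oversampling; ratio=1.0 yields a perfectly balanced 1:1 training set."""
    rng = np.random.default_rng(0)
    min_idx = np.flatnonzero(y == minority_label)
    maj_count = np.sum(y != minority_label)
    n_new = int(ratio * maj_count) - len(min_idx)
    if n_new <= 0:
        return X, y
    extra = rng.choice(min_idx, size=n_new, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

# 95 majority vs 5 minority instances, oversampled to a 1:1 ratio
X = np.arange(100, dtype=float).reshape(-1, 1)
y = np.array([0] * 95 + [1] * 5)
X_res, y_res = random_oversample(X, y, minority_label=1, ratio=1.0)
print(np.sum(y_res == 0), np.sum(y_res == 1))  # 95 95
```

Setting `ratio` to a value below 1.0 would stop short of perfect balance, which, as discussed below, is sometimes preferable in practice.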


Comparative Performance Analysis of SMOTE and ADASYN

The comparative evaluation of SMOTE and ADASYN in terms of classification performance and computational efficiency is a central research question in imbalanced learning [14]. Empirical studies typically assess these techniques across multiple benchmark datasets with varying imbalance ratios, using performance metrics that are robust to class skew, such as the F1-score, geometric mean (G-mean), and area under the Receiver Operating Characteristic curve (AUC-ROC) [14]. A key distinction lies in their approach to generating synthetic samples: while SMOTE performs uniform interpolation between a minority class instance and its k-nearest neighbors, ADASYN employs a density distribution to assign higher generation weights to minority instances that are harder to learn, i.e., those surrounded by more majority class neighbors [14]. This adaptive mechanism often leads ADASYN to produce more synthetic data in the feature space regions where the minority class is most sparse, which can improve the learning of difficult-to-classify boundaries but may also increase the risk of generating noise or outliers [14]. In terms of raw classification performance, results are highly dataset-dependent. On datasets where the minority class clusters are relatively well-defined and separable, standard SMOTE often yields comparable or slightly better results than ADASYN [14]. However, on datasets with highly overlapping classes or where the minority class is distributed across small subclusters, ADASYN's adaptive approach frequently demonstrates superior performance, particularly in improving the recall (true positive rate) for the minority class [14]. The computational cost of ADASYN is generally higher than that of SMOTE due to its two-phase algorithm: the first phase calculates the density distribution and generation weights for each minority instance, and the second phase performs the weighted synthetic sample generation [14]. 
The time complexity for SMOTE is primarily O(n_minority * k * d), where n_minority is the number of minority instances, k is the number of nearest neighbors, and d is the dimensionality. ADASYN adds an overhead for computing the density distribution, making it roughly 1.2 to 1.5 times more computationally expensive than SMOTE on average, though this factor varies with the specific implementation and dataset sparsity [14].
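ADASYN's first phase, computing per-instance generation weights from the local density of majority neighbours, can be sketched as follows; the helper `adasyn_allocation` and the toy two-cluster data are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def adasyn_allocation(X, y, minority_label=1, k=5, n_synthetic=100):
    """Sketch of ADASYN's first phase: allocate the synthetic-sample budget
    across minority instances in proportion to how many of each instance's
    k nearest neighbours belong to the majority class."""
    min_idx = np.flatnonzero(y == minority_label)
    r = np.zeros(len(min_idx))
    for j, i in enumerate(min_idx):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                       # exclude the point itself
        nn = np.argsort(d)[:k]              # k nearest neighbours overall
        r[j] = np.mean(y[nn] != minority_label)
    if r.sum() == 0:                        # no borderline points: uniform
        r = np.ones_like(r)
    weights = r / r.sum()
    return np.round(weights * n_synthetic).astype(int)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(1.5, 1, (10, 2))])
y = np.array([0] * 40 + [1] * 10)
alloc = adasyn_allocation(X, y)
# Hard-to-learn minority points (more majority neighbours) get larger shares
print(alloc)
```

The extra nearest-neighbour pass over the minority set is exactly the overhead that makes ADASYN costlier than SMOTE, which skips the weighting and spreads generation uniformly.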

Integration with Ensemble Learning Frameworks

Oversampling techniques are frequently embedded within ensemble learning paradigms to enhance their effectiveness. As noted earlier, these methods can be integrated into frameworks like boosting. A specialized implementation of this concept is SMOTEBoost, which modifies the AdaBoost algorithm to directly address class imbalance during the boosting process [13]. In standard boosting, all misclassified examples receive equal weight increases regardless of class, which can perpetuate bias toward the majority class. SMOTEBoost alters this dynamic by introducing synthetic minority class examples at each boosting iteration before the learner is trained [13]. This process effectively changes the weight distribution by augmenting the representation of the minority class, thereby forcing the weak learner to focus more on the decision boundaries critical for minority class identification [13]. The synthetic instances are generated using the SMOTE algorithm within the feature space of the current weighted dataset, ensuring that the boosting algorithm's focus adapts to the increasingly refined boundary regions [13].

The algorithmic steps of SMOTEBoost, for each iteration t, are: (1) generate synthetic minority class samples using SMOTE based on the current data distribution; (2) combine these with the original dataset to form an augmented training set; (3) train a weak classifier (e.g., a decision stump) on this augmented set; (4) calculate the error of the weak classifier; and (5) update the weights of the training instances, assigning higher weights to misclassified instances for the next iteration [13]. This approach compensates for skewed distributions not by reweighting alone, but by structurally altering the training data presented to each weak learner, leading to a more robust final ensemble model [13].
The performance gain is particularly notable on datasets with extreme imbalance ratios (e.g., 1:100 or higher), where simple cost-sensitive boosting may still struggle [13].
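The five steps above can be sketched as a toy boosting loop. In this hedged simplification, random duplication of minority seeds stands in for full SMOTE generation and a weighted decision stump serves as the weak learner; all names and the 1-D toy data are hypothetical, not the published SMOTEBoost algorithm:

```python
import numpy as np

def stump_fit(X, y, w):
    """Weighted decision stump on one feature: best (error, threshold, polarity)."""
    best = (np.inf, 0.0, 1)
    for thr in np.unique(X):
        for pol in (1, -1):
            pred = np.where(pol * (X - thr) >= 0, 1, 0)
            err = np.sum(w[pred != y])
            if err < best[0]:
                best = (err, thr, pol)
    return best

def smoteboost_sketch(X, y, T=5, n_syn=20):
    """Toy SMOTEBoost-style loop: augment the minority class each round
    (duplication stands in for SMOTE), fit a weighted stump, then update
    instance weights as in AdaBoost."""
    rng = np.random.default_rng(0)
    w = np.full(len(y), 1 / len(y))
    ensemble = []
    for t in range(T):
        # (1)+(2) augment: duplicate minority seeds (SMOTE stand-in)
        min_idx = np.flatnonzero(y == 1)
        extra = rng.choice(min_idx, size=n_syn)
        Xa, ya = np.concatenate([X, X[extra]]), np.concatenate([y, y[extra]])
        wa = np.concatenate([w, np.full(n_syn, w[min_idx].mean())])
        wa /= wa.sum()
        # (3)+(4) weighted weak learner and its training error
        err, thr, pol = stump_fit(Xa, ya, wa)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, thr, pol))
        # (5) reweight originals: misclassified instances gain weight
        pred = np.where(pol * (X - thr) >= 0, 1, 0)
        w *= np.exp(np.where(pred == y, -alpha, alpha))
        w /= w.sum()
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(p * (X - t) >= 0, 1, -1) for a, t, p in ensemble)
    return (score >= 0).astype(int)

X = np.concatenate([np.linspace(0, 1, 45), np.linspace(1.2, 2, 5)])
y = np.array([0] * 45 + [1] * 5)
model = smoteboost_sketch(X, y)
print(predict(model, np.array([0.1, 1.8])))  # [0 1]
```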

Challenges in High-Dimensional and Sparse Data

The application of oversampling techniques like SMOTE and ADASYN becomes significantly more challenging in high-dimensional feature spaces, such as those created by text processing pipelines. The Bag-of-Words (BoW) representation, for example, converts documents into a high-dimensional sparse matrix, often with tens of thousands of features where most entries are zero [14]. In such spaces, the concept of distance, which is fundamental to nearest-neighbor algorithms used in SMOTE and ADASYN, becomes less meaningful due to the "curse of dimensionality" [14]. The interpolation mechanism used to generate synthetic samples may produce points that do not lie on the true data manifold, creating ambiguous or noisy instances that degrade classifier performance [14]. Furthermore, the sparsity of the data matrix exacerbates the problem. When two minority class documents share only a few non-zero terms, linear interpolation between their feature vectors can result in a synthetic document with diluted term frequencies that do not correspond to any coherent semantic content [14]. Research has shown that applying standard SMOTE directly to high-dimensional text data without dimensionality reduction or specialized distance metrics often yields minimal improvement or can even reduce classification accuracy compared to training on the original imbalanced set [14]. Adaptations such as using cosine similarity instead of Euclidean distance, or first applying techniques like Latent Semantic Analysis (LSA) to project the data into a lower-dimensional dense space before oversampling, have been proposed to mitigate these issues [14].
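The dilution effect of interpolating sparse BoW vectors can be seen directly in a toy example; the 10-term vocabulary and the counts below are invented purely for illustration:

```python
import numpy as np

# Two toy Bag-of-Words count vectors over a 10-term vocabulary.
# The documents share only one non-zero term (index 4).
doc_a = np.array([3, 0, 0, 0, 1, 0, 0, 0, 0, 0], dtype=float)
doc_b = np.array([0, 0, 2, 0, 1, 0, 0, 4, 0, 0], dtype=float)

# SMOTE-style linear interpolation at the midpoint (gap = 0.5)
synthetic = doc_a + 0.5 * (doc_b - doc_a)
print(synthetic)  # fractional, diluted counts mixing both documents' terms
print(np.count_nonzero(synthetic), "non-zero terms vs",
      np.count_nonzero(doc_a), "and", np.count_nonzero(doc_b))
```

The synthetic "document" blends term counts from two unrelated texts at half strength, which is exactly the kind of semantically incoherent instance the passage above warns about.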

The Persistent Research Challenge of Class Imbalance

Despite continuous research advancement over the past decades, learning from data with an imbalanced class distribution remains a compelling research area [14]. The core difficulty, as new practitioners of machine learning and deep learning soon discover, is that class imbalance is both highly prevalent and hard to address comprehensively [14]. This persistence is due to several intrinsic factors: imbalance is not merely a property of the dataset but interacts with other data characteristics like noise, class overlap, and the presence of small disjuncts (multiple small subconcepts within a class) [14]. A technique that works well for one type of imbalance (e.g., moderate imbalance with low overlap) may fail catastrophically for another (e.g., extreme imbalance with high overlap) [14]. Numerous methods have been proposed to tackle imbalanced data, encompassing data pre-processing (like oversampling and undersampling), modification of existing classifiers (like cost-sensitive learning), and algorithmic parameter tuning [14]. However, many of these algorithms are designed to maximize classification accuracy, which is a metric that is skewed in favor of the majority class [14]. For example, on a dataset with a 99:1 imbalance ratio, a trivial classifier that always predicts the majority class would achieve 99% accuracy, rendering this metric useless for evaluation [14]. This has driven the adoption of alternative metrics and the development of algorithms that optimize for them, but no single solution has emerged as universally superior, ensuring that class imbalance continues to be an active and nuanced field of study [14].
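The 99:1 accuracy pathology is easy to reproduce numerically; a minimal sketch:

```python
import numpy as np

# 99:1 imbalance; a trivial classifier always predicts the majority class
y_true = np.array([0] * 99 + [1] * 1)
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
minority_recall = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
print(accuracy, minority_recall)  # 0.99 0.0
```

The classifier scores 99% accuracy while detecting none of the minority events, which is why skew-robust metrics such as recall, F1, and G-mean are preferred.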

Historical Development

The historical development of oversampling ratio as a concept is intrinsically linked to the broader evolution of techniques for handling class imbalance in statistical analysis and machine learning. Its journey from rudimentary data replication to sophisticated algorithmic generation reflects the changing computational paradigms and theoretical understandings of data distribution challenges.

Early Foundations in Statistical Sampling (Pre-1990s)

The conceptual roots of manipulating sample ratios predate modern computing, emerging from classical statistics and survey methodology. Statisticians working with biased populations or rare events in fields like epidemiology and quality control developed early forms of weighting and stratification to correct for unequal representation [14]. These methods, however, were primarily analytical adjustments applied during inference rather than preprocessing transformations of the dataset itself. The manual replication of rare event records in analytical datasets was a known, albeit ad-hoc, practice to stabilize variance estimates for minority subgroups. The formalization of "oversampling" as a discrete preprocessing step with a defined "ratio" awaited the advent of database technology and the systematic application of computers to pattern recognition, where the cost of collecting truly balanced data became prohibitive [14].

The Rise of Algorithmic Resampling in Machine Learning (1990s-2000s)

The 1990s marked a pivotal era with the proliferation of machine learning algorithms for classification, such as decision trees and neural networks. Researchers quickly identified that severe class imbalance, where the ratio of majority to minority class instances could exceed 100:1, led to models with high accuracy but poor practical utility, as they would overwhelmingly predict the majority class [14]. This period saw the formal introduction of oversampling and undersampling as deliberate preprocessing techniques to adjust the class distribution before model training.

The seminal breakthrough arrived in 2002 with the publication of the Synthetic Minority Over-sampling Technique (SMOTE) by Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE fundamentally redefined oversampling from mere duplication to synthetic generation. Instead of replicating existing minority instances, SMOTE algorithmically created new synthetic examples by interpolating between feature vectors of neighboring minority class instances [14]. This innovation addressed the critical pitfall of overfitting associated with simple random oversampling. The technique introduced key parameters that directly influenced the effective oversampling ratio, such as the k nearest neighbors for interpolation and the percentage of oversampling to perform. For example, setting a 200% oversampling rate in SMOTE would generate two synthetic samples for each selected seed instance. The widespread adoption of SMOTE established a clear paradigm: the oversampling ratio was no longer just a count of duplicated records but a controllable parameter governing the volume and nature of synthetic data generation aimed at achieving a desired class balance.

Integration with Ensemble Methods and Cost-Sensitive Learning (2000s-2010s)

Following SMOTE's success, the 2000s and 2010s focused on integrating oversampling into more robust learning frameworks. Researchers combined SMOTE with ensemble methods to enhance performance further. Notably, SMOTEBoost (2003) and similar algorithms integrated synthetic oversampling directly into the boosting iteration process. Building on the concept discussed above regarding standard boosting, these algorithms specifically oversampled the minority class in each boosting round, thereby altering the data distribution presented to each successive weak learner to focus more on the minority class boundaries [14]. This integration represented an evolution where the oversampling ratio could be dynamically adjusted during the training process rather than being fixed at the preprocessing stage. Parallel developments in cost-sensitive learning provided a theoretical counterpart to resampling. Instead of manipulating the data distribution, these methods assigned higher misclassification costs to the minority class within the learning algorithm's objective function. The oversampling ratio and the misclassification cost weight became seen as two sides of the same coin—different mechanisms for achieving a similar effect of increasing the influence of the minority class during model training [14]. Research during this period often involved comparative studies to determine whether adjusting the data space (via oversampling ratio) or the algorithm's cost function was more effective for specific problem types.

Expansion into the Deep Learning Era (2010s-Present)

The resurgence of deep learning with complex neural network architectures introduced new challenges and considerations for the application of oversampling ratios. Deep learning models, particularly deep convolutional neural networks (CNNs) for image data, require large volumes of training data and are highly susceptible to overfitting. As noted earlier, the analysis of resampling approaches has been extended to deep learning settings, an increasingly relevant topic in machine learning research [15]. The application of traditional oversampling techniques like SMOTE to high-dimensional, raw data spaces such as pixel arrays is often computationally inefficient and can generate semantically meaningless synthetic images (e.g., blurred or incoherent interpolations between two distinct images). Consequently, the field has evolved in three primary directions concerning the oversampling ratio in deep learning:

  • Latent Space Oversampling: Modern approaches frequently perform oversampling not in the original high-dimensional input space (e.g., pixel space) but in a lower-dimensional latent feature space learned by the neural network. An encoder network transforms inputs into a compressed feature representation; synthetic minority samples are generated within this semantically meaningful latent space using SMOTE or similar generators, before being decoded back for training [15]. This makes the synthetic data more coherent and effective.
  • Batch-Level Ratio Adjustment: A common practice in deep learning involves dynamically controlling the class ratio within each training mini-batch, a form of batch-wise oversampling. Techniques like class-balanced sampling ensure that each mini-batch contains a predetermined number of instances from each class (e.g., an equal ratio), even if this requires oversampling the minority class within the batch. This method directly manages the exposure ratio the model sees during each gradient update step [15].
  • Integration with Generative Models: The most recent frontier involves using powerful generative models, such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs), specifically trained on the minority class to create highly realistic synthetic data. Here, the "oversampling ratio" is governed by the number of samples drawn from the trained generator, offering potentially infinite oversampling capacity with high fidelity [15]. For instance, a GAN trained on a small set of medical images showing a rare condition can generate thousands of new, realistic variants to balance the dataset.

The historical trajectory demonstrates that the concept of the oversampling ratio has evolved from a simple count multiplier to a sophisticated, context-dependent parameter deeply intertwined with the architecture of the learning algorithm itself. Its application has shifted from static dataset preprocessing to dynamic, integrated components of training pipelines, especially within modern deep learning frameworks where it is strategically applied in latent spaces or at the batch level to combat the data hunger and overfitting tendencies of complex models [15][14].
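The batch-level ratio adjustment described above can be sketched as a simple class-balanced batch sampler; the function `balanced_batches` is a hypothetical illustration, not any specific framework's API:

```python
import numpy as np

def balanced_batches(y, batch_size, n_batches, rng=None):
    """Yield index arrays in which every mini-batch holds an equal number
    of instances per class, sampling with replacement so that a small
    class is oversampled within the batch (batch-wise oversampling)."""
    rng = rng or np.random.default_rng(0)
    classes = np.unique(y)
    per_class = batch_size // len(classes)
    idx_by_class = {c: np.flatnonzero(y == c) for c in classes}
    for _ in range(n_batches):
        parts = [rng.choice(idx_by_class[c], size=per_class, replace=True)
                 for c in classes]
        yield np.concatenate(parts)

y = np.array([0] * 990 + [1] * 10)   # 99:1 imbalance in the full dataset
batch = next(balanced_batches(y, batch_size=32, n_batches=1))
print(np.bincount(y[batch]))         # [16 16]: balanced exposure per update
```

Each gradient step then sees a 1:1 class ratio even though the underlying dataset remains 99:1.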

# Assume dataset D with majority class instances M and minority class instances m, where |M| > |m|

This foundational assumption establishes the core problem domain for oversampling techniques, defining a dataset D where the cardinality of the majority class instances M exceeds that of the minority class instances m [22]. This imbalance, quantified by the ratio |M|:|m|, presents a significant challenge for standard machine learning algorithms, which are often designed to optimize metrics like overall accuracy that are inherently skewed toward the majority class [19]. As noted earlier, this renders such metrics insufficient for evaluating model performance on imbalanced tasks. The prevalence of such datasets spans numerous critical application domains, including fraud detection, medical diagnosis, and customer churn prediction, where the minority class often represents the event of primary interest [16][22].

Defining and Calculating the Oversampling Ratio

The oversampling ratio is a critical hyperparameter that dictates the degree of synthetic augmentation applied to the minority class. Formally, for a target minority class, the ratio specifies the desired number of samples post-resampling relative to the original count. In the imbalanced-learn library, a primary implementation for such techniques, this is often expressed as a dictionary where keys are class labels and values are the target sample counts [19]. For instance, if the original minority class m contains 100 instances and the desired oversampling ratio is 200%, the target value would be set to 200. The library's resampling algorithms then generate the necessary number of synthetic samples—in this case, 100 new instances—to meet this target [19]. It is crucial to distinguish this from a simple multiplier on the generated samples per seed instance, a distinction that clarifies the operational mechanics of algorithms like SMOTE.
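The dictionary-style target semantics can be mimicked with a small helper. This is a sketch of the arithmetic only; `synthetic_counts` is a hypothetical name and not part of the imbalanced-learn API:

```python
import numpy as np

def synthetic_counts(y, targets):
    """Given per-class target counts (the dict-style sampling strategy
    described above), return how many synthetic samples each class needs."""
    current = {c: int(np.sum(y == c)) for c in np.unique(y)}
    return {c: max(0, n - current.get(c, 0)) for c, n in targets.items()}

y = np.array([0] * 900 + [1] * 100)
# A 200% ratio for class 1: target 200 samples, so 100 synthetic instances
print(synthetic_counts(y, {1: 200}))  # {1: 100}
```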

Strategic Considerations for Ratio Selection

Selecting an appropriate oversampling ratio is not a trivial task and requires careful consideration of the dataset characteristics, the chosen algorithm, and the ultimate learning objective. An excessively high ratio can lead to overfitting, where the model learns the patterns of the synthetic noise rather than the true underlying distribution, while a ratio that is too low may fail to mitigate the classifier's bias toward the majority class [18][20]. Research indicates that the optimal ratio often does not aim for perfect 1:1 balance but rather a more nuanced equilibrium that improves model generalization for the specific task [20]. Furthermore, the effectiveness of a chosen ratio is interdependent with other modeling decisions, such as the probability threshold for classification: a threshold adjusted below 0.5 is frequently necessary to achieve satisfactory recall for the minority class, even after resampling [20].
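The interplay between the decision threshold and minority recall can be demonstrated on invented probability scores; a minimal sketch:

```python
import numpy as np

# Illustrative predicted minority-class probabilities on a skewed test set
# (the last two instances are the true minority examples)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
p_min = np.array([0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6])

def recall_at(threshold):
    """Minority recall when predicting class 1 for p_min >= threshold."""
    y_pred = (p_min >= threshold).astype(int)
    return np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

print(recall_at(0.5), recall_at(0.4))  # 0.5 1.0
```

Lowering the threshold from 0.5 to 0.4 doubles minority recall here, at the cost of one false positive, exactly the trade-off the threshold tuning discussed above is meant to manage.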

Algorithmic Sensitivity and the Role of Data Geometry

Different oversampling algorithms exhibit varying sensitivities to the specified ratio due to their underlying methodologies. For example, while SMOTE generates samples along line segments between k-nearest neighbors, ADASYN (Adaptive Synthetic Sampling) allocates more synthetic samples to minority instances that are harder to learn, i.e., those surrounded by more majority class neighbors. This adaptive approach means that for a given target ratio, the distribution of synthetic samples across the feature space will differ between SMOTE and ADASYN, potentially impacting model performance [21]. This highlights that the oversampling ratio controls the quantity of new data, but the algorithm's mechanism controls their quality and placement. Practitioners must also be aware of technical limitations; specifying a ratio that requires generating more synthetic samples than the available seed instances for a given neighborhood can trigger warnings or failures in resampling filters [18].

Integration with Model Validation and Evaluation

The choice of oversampling ratio must be rigorously validated using evaluation frameworks suitable for imbalanced data. Standard k-fold cross-validation can introduce bias if the resampling is applied before splitting the data, potentially leaking information from the validation set into the training set via synthetic samples. To address this, techniques like Distribution-Balanced Stratified Cross-Validation (DB-SCV) have been proposed. Studies comparing validation methods show that DB-SCV often yields slightly higher F1 and AUC scores when evaluating models trained with resampled data, as it provides a more reliable estimate of performance on the true data distribution [23]. This underscores that the oversampling ratio is one component in a pipeline that includes proper validation and metrics like precision, recall, F1-score, and AUC-ROC, which provide a more complete picture than accuracy alone [20][23].
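The leakage-safe ordering, split first and then resample only the training portion, can be sketched as follows; the index arithmetic and the `oversample` helper are illustrative assumptions, not a particular library's validation routine:

```python
import numpy as np

def oversample(idx, y, rng):
    """Randomly duplicate minority-class indices within a training split
    until the two classes are balanced (a stand-in for SMOTE here)."""
    min_idx = idx[y[idx] == 1]
    maj_idx = idx[y[idx] == 0]
    if len(min_idx) == 0 or len(min_idx) >= len(maj_idx):
        return idx
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    return np.concatenate([idx, extra])

y = np.array([0] * 90 + [1] * 10)
# Split FIRST (a manual stratified hold-out: 18 majority + 2 minority)
val_idx = np.concatenate([np.arange(0, 18), np.arange(90, 92)])
train_idx = np.setdiff1d(np.arange(100), val_idx)   # 72 majority + 8 minority
# THEN resample only the training portion; the validation set stays untouched
train_idx = oversample(train_idx, y, np.random.default_rng(0))
print(np.bincount(y[train_idx]))            # [72 72]: balanced training set
print(np.bincount(y[val_idx], minlength=2))  # [18  2]: original skew preserved
```

Resampling before the split would instead let duplicated (or synthetic) minority points leak into the hold-out fold, inflating the estimated performance.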

Practical Implementation and Ecosystem Support

The practical application of oversampling ratios is supported by comprehensive software libraries. The imbalanced-learn library for Python, which endorses the Scientific Python specification, offers a unified API for a wide array of resampling techniques [17][21]. This library allows users to define the sampling strategy directly via the sampling_strategy parameter, which accepts the desired sample counts for targeted classes [19]. This ecosystem enables systematic experimentation with different ratios across multiple algorithms, from simple random oversampling to advanced hybrid methods that combine oversampling of the minority class with undersampling of the majority class [21]. The availability of these tools has been instrumental in advancing research and application in imbalanced learning, providing a standardized platform for developing and benchmarking solutions to this pervasive problem [17][21][22].

# Assume dataset D with majority class instances M and minority class instances m, where |M| > |m|

The formal definition of an imbalanced dataset, denoted as D, begins with partitioning it into two distinct subsets based on class membership: the majority class M and the minority class m. The defining characteristic is the inequality in their cardinalities, expressed as |M| > |m|. This imbalance ratio, often quantified as |M| / |m|, is a fundamental parameter that dictates the severity of the learning challenge and influences the selection and configuration of remedial techniques [1]. For instance, a dataset with 10,000 majority instances and 100 minority instances has an imbalance ratio of 100:1, presenting a more extreme scenario than a dataset with a 10:1 ratio. The core objective of oversampling is to algorithmically increase the size of m to mitigate the bias introduced by this disparity, thereby improving a model's ability to learn the characteristics of the underrepresented class [2].

### Mathematical Formulation of the Oversampling Ratio

The oversampling ratio is a precise, user-defined parameter that governs the number of synthetic instances to generate for the minority class. It is most commonly expressed as a percentage relative to the original count of m. If N_m represents the original number of minority instances (|m|), and R represents the desired oversampling ratio (e.g., 200% or 2.0), then the target number of minority instances after oversampling, N_m', is calculated as: N_m' = N_m × (1 + R/100) when R is a percentage, or simply N_m' = N_m × R when R is a decimal multiplier [1]. The number of synthetic samples to generate is therefore N_m' - N_m. This formulation provides direct control over the final class distribution. For example, applying a 150% oversampling ratio to a minority class of 200 instances results in a target of 500 instances (200 × 2.5), requiring the generation of 300 synthetic samples. Practitioners may also define the target as an absolute number or as a dictionary specifying desired counts for multiple classes, as noted earlier in the context of libraries like imbalanced-learn [1]. The choice of ratio is not arbitrary; it is often determined through empirical validation or guided by heuristics to avoid overfitting to the synthetic data or creating excessive overlap between classes [2].
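The formulation above reduces to a one-line computation; this minimal sketch reproduces the 150% worked example (the function name is illustrative):

```python
def target_minority_count(n_m, R, percentage=True):
    """Target minority size after oversampling:
    N_m' = N_m * (1 + R/100) for a percentage ratio R, or
    N_m' = N_m * R          for a decimal multiplier."""
    return int(n_m * (1 + R / 100)) if percentage else int(n_m * R)

n_m = 200
target = target_minority_count(n_m, 150)   # 150% oversampling ratio
print(target, target - n_m)                # 500 target, 300 synthetic samples
```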

### Strategic Implications of Ratio Selection

Selecting an appropriate oversampling ratio is a critical strategic decision that balances the need for better minority class representation against the risks of overfitting and increased computational cost. A ratio that is too low (e.g., 50%) may be insufficient to overcome the classifier's inherent bias toward the majority class, leading to poor recall for the minority class [2]. Conversely, an excessively high ratio (e.g., 1000%) can lead to several negative outcomes:

  • Overfitting: The decision boundaries may become overly specific to the synthetic instances and the particular regions of the feature space from which they were generated, harming generalization performance on unseen data [1].
  • Loss of Generalization: The artificial inflation of the minority class can distort the true underlying data manifold, causing the model to learn an inaccurate representation of the class's true variability [2].
  • Increased Computational Overhead: Generating and processing a large number of synthetic samples increases memory usage and training time, which can be prohibitive for large-scale datasets [1].

A common strategy is to aim for a balanced dataset, setting the ratio so that |M| ≈ |m'|. However, the optimal ratio is often dataset- and algorithm-dependent, and may be slightly less than 100% balance to preserve some informational value from the original imbalance [2]. This necessitates the use of robust evaluation metrics beyond accuracy, such as the F1-score, geometric mean, or area under the Precision-Recall curve, to properly assess the trade-offs [1].

### Integration with Algorithmic Sampling Mechanisms

The oversampling ratio serves as the primary input to the data generation engine of algorithms like SMOTE (Synthetic Minority Over-sampling Technique) and its variants. In standard SMOTE, for each minority instance (the "seed"), k nearest neighbors from the same class are identified. The ratio then determines how many synthetic instances are created along the line segment between the seed and each selected neighbor [1]. If the ratio dictates that 300 synthetic samples are needed from an m of 100, SMOTE might select each seed instance multiple times. More advanced algorithms like ADASYN (Adaptive Synthetic Sampling) adapt the generation process based on the local density of the majority class. ADASYN assigns a higher sampling weight to minority instances that are harder to learn, such as those surrounded by more majority class neighbors [2]. Consequently, it generates more synthetic data in those "boundary" regions, effectively creating a non-uniform distribution of new instances guided by an adaptive mechanism that itself is influenced by the overall target ratio. This contrasts with SMOTE's more uniform approach, where the ratio is applied more evenly across the minority class, though both use the same fundamental ratio parameter to determine the total volume of new data [1][2].
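The seed-and-neighbour interpolation mechanism of standard SMOTE can be sketched in a few lines; `smote_sketch` and the four-point toy minority set are a simplified illustration under stated assumptions, not a reference implementation:

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=3, rng=None):
    """Minimal SMOTE: for each synthetic sample, pick a random minority
    seed, one of its k nearest minority neighbours, and interpolate at a
    random point on the line segment between them."""
    rng = rng or np.random.default_rng(0)
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))                    # random seed instance
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                                   # exclude the seed
        nn = np.argsort(d)[:k]                          # k nearest minority pts
        j = rng.choice(nn)
        gap = rng.random()                              # position on segment
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

# Four minority points at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
S = smote_sketch(X_min, n_synthetic=300)
print(S.shape, S.min(), S.max())  # all samples stay within the seeds' span
```

Because every sample lies on a segment between two seeds, the seeds are reused many times when the requested volume exceeds the minority count, matching the 300-from-100 scenario described above.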

### Comparative Analysis of SMOTE and ADASYN

Building on the concept of algorithmic integration, a direct comparison between SMOTE and ADASYN reveals distinct performance and computational profiles tied to their methodologies. In terms of performance, ADASYN is specifically designed to focus learning on difficult-to-classify minority examples by generating more synthetic data in their vicinity [2]. This adaptive focus can lead to improved classification performance, particularly for complex, nonlinear boundaries, as measured by metrics like the F-measure or G-mean, compared to the more generalized approach of SMOTE [1]. However, this benefit is not universal and depends heavily on the dataset's specific structure. Regarding computational cost, ADASYN incurs a higher overhead than vanilla SMOTE. This is due to its two-phase process: first, it must calculate the density distribution for each minority instance to determine sampling weights, and second, it performs the targeted synthetic generation [2]. The additional step of computing local imbalance densities makes ADASYN more computationally intensive, especially for large datasets or high-dimensional feature spaces. SMOTE, with its simpler, uniform sampling logic, is generally faster and less resource-intensive [1]. The choice between them therefore involves a trade-off: ADASYN may offer potential performance gains for complex imbalances at the cost of increased computation, while SMOTE provides a robust, efficient baseline. As noted earlier, the efficacy of both is ultimately constrained by the quality of the original feature space and must be evaluated using metrics that are not skewed toward the majority class [1][2].

### Practical Considerations and Limitations

The practical application of an oversampling ratio must account for several inherent limitations of synthetic data generation. A primary concern is the "blind" application of oversampling in high-dimensional spaces, such as those created by the Bag-of-Words technique, which produces sparse, high-dimensional matrices [1]. In such spaces, the concept of distance becomes less meaningful (the "curse of dimensionality"), causing nearest-neighbor calculations used by SMOTE and ADASYN to become unreliable. This can lead to the generation of noisy or meaningless synthetic samples that degrade classifier performance [2]. Furthermore, oversampling does not create new information; it interpolates existing data. If the original minority class m has very few instances or lacks diversity, the synthetic samples will merely replicate existing patterns, failing to capture the true variability of the class. This underscores why oversampling is often part of a broader pipeline that may include feature selection, dimensionality reduction, or cost-sensitive learning algorithms [1]. Finally, the optimal oversampling ratio is not a universal constant. It must be determined through careful experimentation, typically using cross-validation with appropriate evaluation metrics on a validation set, to find the point that maximizes model robustness without introducing the drawbacks of over-sampling [2].

### References

  1. [1] A survey on imbalanced learning: latest research, applications and future directions. https://doi.org/10.1007/s10462-024-10759-6
  2. [2] Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering. https://doi.org/10.1186/s40537-024-00943-4
  3. [3] The class imbalance problem in deep learning. https://link.springer.com/article/10.1007/s10994-022-06268-8
  4. [4] Under Sampling Techniques for Handling Unbalanced Data (PDF). https://thesai.org/Downloads/Volume15No8/Paper_124-Under_Sampling_Techniques_for_Handling_Unbalanced_Data.pdf
  5. [5] DistResampleR-Lite: Light Distributed Resampler for Imbalanced Regression Problems. https://doi.org/10.1007/978-981-96-8889-0_33
  6. [6] Problem of data imbalance in building energy load prediction: Concept, influence, and solution. https://www.sciencedirect.com/science/article/abs/pii/S0306261921005791
  7. [7] A novel oversampling method based on Wasserstein CGAN for imbalanced classification. https://cybersecurity.springeropen.com/articles/10.1186/s42400-024-00290-0
  8. [8] A review on over-sampling techniques in classification of multi-class imbalanced datasets: insights for medical problems. https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2024.1430245/full
  9. [9] Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-024-00943-4
  10. [10] On Supervised Class-Imbalanced Learning: An Updated Perspective and Some Key Challenges. https://ieeexplore.ieee.org/document/9738474/
  11. [11] Handling imbalanced medical datasets: review of a decade of research. https://link.springer.com/article/10.1007/s10462-024-10884-2
  12. [12] Learning Confidence Bounds for Classification with Imbalanced Data. https://arxiv.org/html/2407.11878v2
  13. [13] SMOTEBoost: Improving Prediction of the Minority Class in Boosting. https://link.springer.com/chapter/10.1007/978-3-540-39804-2_12
  14. [14] Oversampling and undersampling in data analysis. https://grokipedia.com/page/Oversampling_and_undersampling_in_data_analysis
  15. [15] Resampling approaches to handle class imbalance: a review from a data perspective. https://journalofbigdata.springeropen.com/articles/10.1186/s40537-025-01119-4
  16. [16] Customer churn prediction with hybrid resampling and ensemble learning. https://www.abacademies.org/articles/customer-churn-prediction-with-hybrid-resampling-and-ensemble-learning-13867.html
  17. [17] imbalanced-learn. https://pypi.org/project/imbalanced-learn/
  18. [18] Oversampling and Undersampling. https://waikato.github.io/weka-blog/posts/2019-01-30-sampling/
  19. [19] SMOTE — Version 0.14.1. https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
  20. [20] Should You Use Imbalanced-Learn in 2025? https://www.blog.trainindata.com/should-you-use-imbalanced-learn-in-2025/
  21. [21] Imbalanced Learn: the Python library for rebuilding ML datasets. https://datascientest.com/en/imbalanced-learn-the-python-library-for-rebuilding-ml-datasets
  22. [22] A survey on imbalanced learning: latest research, applications and future directions. https://link.springer.com/article/10.1007/s10462-024-10759-6
  23. [23] A Comparative Study of the Use of Stratified Cross-Validation and Distribution-Balanced Stratified Cross-Validation in Imbalanced Learning. https://pmc.ncbi.nlm.nih.gov/articles/PMC9967638/