Handling class overlapping to detect noisy instances in classification

Shivani Gupta; Atul Gupta

doi:10.1017/S0269888918000115

2018 Volume 33


RESEARCH ARTICLE Open Access


Shivani Gupta¹,
Atul Gupta¹

Jabalpur


Published online: 10 July 2018
The Knowledge Engineering Review 33, Article number: e8 (2018)

Abstract

Automated classification plays a vital role in machine learning and data mining. Any given classifier is likely to work well on some data sets and less well on others, which makes evaluation all the more important. The performance of learning models depends strongly on the characteristics of the data sets. Previous results suggest that overlap between classes and the presence of noise have the strongest impact on the performance of learning algorithms. Class overlap is a critical problem in which data samples appear as valid instances of more than one class, and it may be responsible for the presence of noise in data sets. The objective of this paper is to better understand the data used in machine learning problems, so as to identify learning issues and to analyze the instances that are highly overlapped, using newly proposed overlap measures. The proposed overlap measures are Nearest Enemy Ratio, SubConcept Ratio, Likelihood Ratio and Soft Margin Ratio. For the experiment, we created 438 binary classification data sets from real-world problems and computed the values of 12 data complexity metrics to find highly overlapped data sets. We then applied the proposed measures to identify the overlapped instances and four noise filters to find the noisy instances. Across the four noise filters, we found that 60–80% of the overlapped instances are noisy instances. We conclude that class overlap is a principal contributor to class noise in data sets.
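The pipeline described in the abstract (score instances for overlap, flag noisy instances with a filter, then measure how many overlapped instances are also noisy) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's method: the abstract does not define the Nearest Enemy Ratio, so it is taken here as the distance to the nearest same-class instance divided by the distance to the nearest other-class instance, with values above 1 treated as overlapped. The abstract also does not name its four noise filters, so a simple edited nearest-neighbour rule (flag an instance whose label disagrees with the majority of its k nearest neighbours) stands in for them. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import dist


def nearest_enemy_ratio(X, y):
    """Per-instance overlap score (assumed definition, not taken from the paper):
    distance to nearest same-class point / distance to nearest other-class point.
    Assumes every class has at least two instances and no cross-class duplicates."""
    ratios = []
    for i, xi in enumerate(X):
        friend = min(dist(xi, X[j]) for j in range(len(X)) if j != i and y[j] == y[i])
        enemy = min(dist(xi, X[j]) for j in range(len(X)) if y[j] != y[i])
        ratios.append(friend / enemy)
    return ratios


def enn_noisy(X, y, k=3):
    """Edited nearest-neighbour style filter (a stand-in for the unnamed filters):
    an instance is flagged when its label differs from the majority label
    of its k nearest neighbours."""
    flags = []
    for i, xi in enumerate(X):
        neighbours = sorted((j for j in range(len(X)) if j != i),
                            key=lambda j: dist(xi, X[j]))[:k]
        majority = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        flags.append(majority != y[i])
    return flags


# Hypothetical toy data: two separated clusters plus one class-1 point
# sitting at the edge of the class-0 cluster.
X = [(0, 0), (1, 0), (0, 1), (1, 1), (1.5, 0.5),
     (10, 10), (11, 10), (10, 11), (11, 11)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

ratios = nearest_enemy_ratio(X, y)
noisy_flags = enn_noisy(X, y, k=3)

# Instances whose nearest enemy is closer than their nearest friend.
overlapped = [i for i, r in enumerate(ratios) if r > 1.0]
# Overlapped instances that the filter also flags as noisy.
noisy = [i for i in overlapped if noisy_flags[i]]
share = len(noisy) / len(overlapped)
print(overlapped, noisy, share)
```

On this toy data the sketch marks instances 1, 3 and 4 as overlapped and only instance 4 as noisy, so one third of the overlapped instances are flagged. The threshold of 1 and the choice k=3 are arbitrary; the paper's own measures and filters may behave quite differently.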
Rights and permissions
© Cambridge University Press, 2018


About this article

Cite this article

Shivani Gupta, Atul Gupta. 2018. Handling class overlapping to detect noisy instances in classification. The Knowledge Engineering Review. 33:115 doi: 10.1017/S0269888918000115

