A comparison of machine learning methods for extremely unbalanced industrial quality data

doi:10.1007/978-3-030-86230-5_44

Please use this identifier to cite or link to this record: https://hdl.handle.net/1822/73976

Title:	A comparison of machine learning methods for extremely unbalanced industrial quality data
Author(s):	Pereira, Pedro José; Pereira, Adriana; Cortez, Paulo; Pilastri, André Luiz
Keywords:	Anomaly Detection; Industrial Data; Random Forest
Date:	Sep-2021
Publisher:	Springer
Journal:	Lecture Notes in Computer Science (including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Citation:	Pereira P.J., Pereira A., Cortez P., Pilastri A. (2021) A Comparison of Machine Learning Methods for Extremely Unbalanced Industrial Quality Data. In: Marreiros G., Melo F.S., Lau N., Lopes Cardoso H., Reis L.P. (eds) Progress in Artificial Intelligence. EPIA 2021. Lecture Notes in Computer Science, vol 12981. Springer
Abstract(s):	The Industry 4.0 revolution is impacting manufacturing companies, which need to adopt more data intelligence processes in order to compete in the markets they operate. In particular, quality control is a key manufacturing process that has been addressed by Machine Learning (ML), aiming to improve productivity (e.g., reduce costs). However, modern industries produce a tiny portion of defective products, which results in extremely unbalanced datasets. In this paper, we analyze recent big data collected from a major automotive assembly manufacturer and related with the quality of eight products. The eight datasets include millions of records but only a tiny percentage of failures (less than 0.07%). To handle such datasets, we perform a two-stage ML comparison study. Firstly, we consider two products and explore four ML algorithms, Random Forest (RF), two Automated ML (AutoML) methods and a deep Autoencoder (AE), and three balancing training strategies, namely None, Synthetic Minority Oversampling Technique (SMOTE) and Gaussian Copula (GC). When considering both classification performance and computational effort, interesting results were obtained by RF. Then, the selected RF was further explored by considering all eight datasets and five balancing methods: None, SMOTE, GC, Random Undersampling (RU) and Tomek Links (TL). Overall, competitive results were achieved by the combination of GC with RF.
Type:	Conference paper
URI:	https://hdl.handle.net/1822/73976
ISBN:	978-3-030-86229-9
e-ISBN:	978-3-030-86230-5
DOI:	10.1007/978-3-030-86230-5_44
ISSN:	0302-9743
Publisher version:	https://link.springer.com/chapter/10.1007/978-3-030-86230-5_44
Peer reviewed:	yes
Access:	Open access
Appears in collections:	CAlg - Artigos em livros de atas/Papers in proceedings
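
Illustrative note: the abstract names several balancing strategies for training data, among them Random Undersampling (RU) and SMOTE. The sketch below shows the core idea of those two using only the Python standard library. It is not the authors' implementation: the function names `random_undersample` and `smote_like`, the toy data, and all parameter values are assumptions made up for illustration. A real study would more likely rely on a library such as imbalanced-learn.

```python
import math
import random


def random_undersample(X, y, majority=0, seed=0):
    """Keep every minority row plus an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    maj = [i for i, label in enumerate(y) if label == majority]
    mino = [i for i, label in enumerate(y) if label != majority]
    keep = sorted(rng.sample(maj, min(len(maj), len(mino))) + mino)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_like(minority_rows, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row and
    one of its k nearest minority neighbours (the core idea behind SMOTE).
    Requires at least two minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        # Nearest minority neighbours of `base`, excluding `base` itself.
        others = [r for r in minority_rows if r is not base]
        neighbours = sorted(others, key=lambda r: math.dist(base, r))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (hypothetical): 1000 passing units (label 0) and 5 failures (label 1).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1000)]
X += [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(5)]
y = [0] * 1000 + [1] * 5

# Random Undersampling: shrink the majority class to the minority size.
X_ru, y_ru = random_undersample(X, y)

# SMOTE-style oversampling: grow the minority class to the majority size.
minority = [row for row, label in zip(X, y) if label == 1]
X_sm = X + smote_like(minority, n_new=995)
y_sm = y + [1] * 995

print(len(y_ru), sum(y_ru))  # 10 5
print(len(y_sm), sum(y_sm))  # 2000 1000
```

Undersampling discards almost all majority rows, whereas oversampling keeps them and adds synthetic failures; either way, the balanced set would be used only for training, with evaluation done on the original unbalanced data.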