A comparison of machine learning methods for extremely unbalanced industrial quality data

doi:10.1007/978-3-030-86230-5_44

Use this identifier to cite this record: https://hdl.handle.net/1822/73976

Full record

DC Field	Value	Language
dc.contributor.author	Pereira, Pedro José	por
dc.contributor.author	Pereira, Adriana	por
dc.contributor.author	Cortez, Paulo	por
dc.contributor.author	Pilastri, André Luiz	por
dc.date.accessioned	2021-09-09T11:18:34Z	-
dc.date.available	2021-09-09T11:18:34Z	-
dc.date.issued	2021-09	-
dc.identifier.citation	Pereira P.J., Pereira A., Cortez P., Pilastri A. (2021) A Comparison of Machine Learning Methods for Extremely Unbalanced Industrial Quality Data. In: Marreiros G., Melo F.S., Lau N., Lopes Cardoso H., Reis L.P. (eds) Progress in Artificial Intelligence. EPIA 2021. Lecture Notes in Computer Science, vol 12981. Springer	por
dc.identifier.isbn	978-3-030-86229-9	-
dc.identifier.issn	0302-9743	por
dc.identifier.uri	https://hdl.handle.net/1822/73976	-
dc.description.abstract	The Industry 4.0 revolution is impacting manufacturing companies, which need to adopt more data intelligence processes in order to compete in the markets they operate. In particular, quality control is a key manufacturing process that has been addressed by Machine Learning (ML), aiming to improve productivity (e.g., reduce costs). However, modern industries produce a tiny portion of defective products, which results in extremely unbalanced datasets. In this paper, we analyze recent big data collected from a major automotive assembly manufacturer and related with the quality of eight products. The eight datasets include millions of records but only a tiny percentage of failures (less than 0.07%). To handle such datasets, we perform a two-stage ML comparison study. Firstly, we consider two products and explore four ML algorithms, Random Forest (RF), two Automated ML (AutoML) methods and a deep Autoencoder (AE), and three balancing training strategies, namely None, Synthetic Minority Oversampling Technique (SMOTE) and Gaussian Copula (GC). When considering both classification performance and computational effort, interesting results were obtained by RF. Then, the selected RF was further explored by considering all eight datasets and five balancing methods: None, SMOTE, GC, Random Undersampling (RU) and Tomek Links (TL). Overall, competitive results were achieved by the combination of GC with RF.	por
dc.description.sponsorship	This work is supported by: European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project n 39479; Funding Reference: POCI-01-0247-FEDER-39479].	por
dc.language.iso	eng	por
dc.publisher	Springer	por
dc.rights	openAccess	por
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	por
dc.subject	Anomaly Detection	por
dc.subject	Industrial Data	por
dc.subject	Random Forest	por
dc.title	A comparison of machine learning methods for extremely unbalanced industrial quality data	por
dc.type	conferencePaper	por
dc.peerreviewed	yes	por
dc.relation.publisherversion	https://link.springer.com/chapter/10.1007/978-3-030-86230-5_44	por
oaire.citationStartPage	561	por
oaire.citationEndPage	572	por
oaire.citationVolume	LNCS 12981	por
dc.identifier.doi	10.1007/978-3-030-86230-5_44	por
dc.identifier.eisbn	978-3-030-86230-5	-
dc.subject.fos	Natural Sciences::Computer and Information Sciences	por
dc.subject.wos	Science & Technology	por
sdum.journal	Lecture Notes in Computer Science (including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)	por
sdum.conferencePublication	20th EPIA Conference on Artificial Intelligence (EPIA 2021)	por
oaire.version	AM	por
dc.subject.ods	Industry, innovation and infrastructure	por
Appears in collections:	CAlg - Artigos em livros de atas/Papers in proceedings
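
The abstract names several strategies for balancing the training data, among them Random Undersampling (RU) and SMOTE. As a rough illustration of what these two ideas do, the following is a minimal standard-library sketch. It is not the authors' implementation: the function names, parameters (`ratio`, `n_new`, `k`, `seed`) and toy data are all assumptions made for this example, and the oversampler is a simplified SMOTE-style interpolation rather than a faithful reproduction of the published algorithm.

```python
import random


def random_undersample(X, y, minority_label=1, ratio=1.0, seed=0):
    """Keep every minority row plus a random subset of majority rows.

    `ratio` (hypothetical parameter) is the target majority/minority size ratio.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    kept = sorted(minority + rng.sample(majority, n_keep))
    return [X[i] for i in kept], [y[i] for i in kept]


def smote_like_oversample(X, y, minority_label=1, n_new=10, k=3, seed=0):
    """Add synthetic minority rows by interpolating toward a near neighbour.

    Simplified SMOTE-style sketch: pick a minority row, pick one of its k
    nearest minority neighbours, and place a new row on the segment between them.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        # Sort the remaining minority rows by squared Euclidean distance to base.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return X + synthetic, y + [minority_label] * n_new


# Toy data (made up for illustration): 995 "pass" rows and 5 "fail" rows.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(995)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(5)]
y = [0] * 995 + [1] * 5

X_ru, y_ru = random_undersample(X, y)
X_sm, y_sm = smote_like_oversample(X, y, n_new=20)
print(len(y_ru), sum(y_ru))
print(len(y_sm), sum(y_sm))
```

In line with the abstract's phrase "balancing training strategies", such resampling would be applied only to the training split, so that the test data keeps its original, extremely unbalanced class distribution.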