UNDERSTANDING THE IMPACT OF EXPERIMENTAL DESIGN CHOICES ON MACHINE LEARNING CLASSIFIERS IN SOFTWARE ANALYTICS

dc.contributor.author: Rajbahadur, Gopi [en]
dc.contributor.department: Computing [en]
dc.contributor.supervisor: Hassan, Ahmed E.
dc.date.accessioned: 2020-09-30T18:45:24Z
dc.date.available: 2020-09-30T18:45:24Z
dc.degree.grantor: Queen's University at Kingston [en]
dc.description.abstract: Software analytics is the process of systematically analyzing software engineering data to generate actionable insights that help software practitioners make data-driven decisions. Machine learning classifiers lie at the heart of software analytics pipelines and help automate the generation of insights from large volumes of low-level software engineering data (e.g., static code metrics of software projects). However, the results generated by these classifiers are extremely sensitive to the experimental design choices (e.g., the choice of feature removal technique) that one makes when constructing a software analytics pipeline. Despite this sensitivity, prior studies explore the impact of only a few experimental design choices on the results of classifiers, and the impact of many other choices remains unexplored. It is critical to understand how the various experimental design choices affect the insights generated by a classifier, because such an understanding helps ensure the accuracy and validity of those insights. Therefore, in this PhD thesis, we further our understanding of how several previously unexplored experimental design choices affect the results generated by a classifier. Our case studies on various software analytics datasets and contexts yield three results. 1) We find that the common practice of discretizing the dependent feature can be avoided in some cases (where the defective ratio of the dataset is <15%) by using regression-based classifiers. 2) For cases where discretization of the dependent feature cannot be avoided, we propose a framework that researchers and practitioners can use to mitigate its impact on the insights generated by a classifier. 3) We find that feature importance methods should not be used interchangeably, because different methods produce vastly different interpretations even for the same classifier. Based on these findings, we provide several guidelines for future software analytics studies. [en]
dc.description.degree: PhD [en]
dc.identifier.uri: http://hdl.handle.net/1974/28167
dc.language.iso: eng [en]
dc.relation.ispartofseries: Canadian theses [en]
dc.rights: Attribution-ShareAlike 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-sa/3.0/us/
dc.subject: Machine learning [en]
dc.subject: Software engineering [en]
dc.subject: Data mining [en]
dc.subject: Software analytics [en]
dc.subject: mining software repositories [en]
dc.subject: Defect prediction [en]
dc.subject: Explainable machine learning [en]
dc.title: UNDERSTANDING THE IMPACT OF EXPERIMENTAL DESIGN CHOICES ON MACHINE LEARNING CLASSIFIERS IN SOFTWARE ANALYTICS [en]
dc.type: thesis [en]

Files

Original bundle

Name: Rajbahadur_GopiKrishnan_202009_PhD.pdf
Size: 3.43 MB
Format: Adobe Portable Document Format
Description: Thesis Document

License bundle

Name: license.txt
Size: 2.6 KB
Format: Item-specific license agreed upon to submission