Achieving Consumable Big Data Analytics by Distributing Data Mining Algorithms

dc.contributor.author: Khalifa, Shady (en)
dc.contributor.department: Computing (en)
dc.contributor.supervisor: Martin, Patrick (en)
dc.date.accessioned: 2017-03-22T19:46:14Z
dc.date.available: 2017-03-22T19:46:14Z
dc.degree.grantor: Queen's University at Kingston (en)
dc.description.abstract: Businesses look at Big Data as an opportunity to gain insights for improving their services. Deriving such insights requires a range of data mining techniques. Mature data mining tools like WEKA or R have been in development for years. They implement a large number of data mining algorithms and can support sophisticated Analytics. However, these mature tools are designed to run on a single machine, which makes them unsuitable for handling Big Data. Using these tools requires data mining and statistics knowledge, and some of them, like R, are hard to learn. Businesses do not always have the technical skills required to carry out such Analytics. Even if they do, it is challenging to find a tool that both offers the needed algorithms and supports distributed processing to handle the high arrival velocity and large volumes of Big Data. Businesses’ analytical requirements can be addressed by Consumable Big Data Analytics, that is, solutions that allow businesses to do Big Data Analytics themselves using their in-house expertise. In this work, we provide a Consumable Analytics solution to meet businesses’ analytical needs. First, we conduct a survey of existing Analytics solutions to identify possible areas of improvement for providing Consumable Analytics. Second, instead of developing distributed data mining algorithms to handle Big Data, we develop the Data Mining Distribution (DMD) algorithm and the Label-Aware Disjoint Partitioning (LADP) algorithm to distribute the execution of all existing single-machine data mining algorithms without rewriting a single line of their code. This gives users the flexibility to use any available data mining library, lets algorithms like Hoeffding Tree run 70% to 95% faster, and achieves up to an 18% increase in prediction accuracy. Third, we develop the free and open source QDrill solution to implement our DMD and LADP algorithms for distributed Analytics. QDrill implements our proposed Distributed Analytics Query Language (DAQL) interface, which adds Analytics capabilities to the regular SQL syntax and allows integration with Business Intelligence (BI) tools. This allows businesses to use their in-house expertise to do Big Data Analytics using the spreadsheets and visualizations of their BI tools. (en)
dc.description.degree: PhD (en)
dc.identifier.uri: http://hdl.handle.net/1974/15460
dc.language.iso: eng (en)
dc.relation.ispartofseries: Canadian theses (en)
dc.rights: Attribution-ShareAlike 3.0 United States (en)
dc.rights.uri: http://creativecommons.org/licenses/by-sa/3.0/us/
dc.subject: Big Data (en)
dc.subject: Analytics (en)
dc.subject: Data Mining (en)
dc.subject: Distributed (en)
dc.subject: Drill (en)
dc.subject: Machine Learning (en)
dc.subject: Classifier Ensembles (en)
dc.subject: Consumable Analytics (en)
dc.subject: Query Language (en)
dc.subject: Weka (en)
dc.title: Achieving Consumable Big Data Analytics by Distributing Data Mining Algorithms (en)
dc.type: thesis (en)

Files

Original bundle

Name: Khalifa_Shady_201703_PhD.pdf
Size: 2.91 MB
Format: Adobe Portable Document Format
Description: Thesis document

License bundle

Name: license.txt
Size: 2.6 KB
Format: Item-specific license agreed upon to submission