Knowledge Discovery and Data Mining (KDD) Lab

Knowledge Discovery and Data Mining (KDD) is the nontrivial process of extracting implicit, novel, and useful information from large volumes of data. It has emerged as a unique combination of several fields of science and technology, including statistics, database systems, computer programming, machine learning, and artificial intelligence. KDD spans a wide range of applications in engineering (intrusion detection and network security, flow classification, Web mining), business (fraud detection, decision support systems, risk analysis, forecasting market trends), medicine and population health (study of drug implications, disease outbreaks), bioinformatics (protein interactions, gene sequence analysis), and environmental science (flood prediction, satellite image processing).

Dr. Raahemi has established the Knowledge Discovery and Data Mining (KDD) Lab at the University of Ottawa, hosting graduate students and researchers from the multidisciplinary areas of Computer Science, Computer Engineering, and E-Business Technologies. Students and researchers from the related fields of Mathematics & Statistics are also welcome in his research group.

The research projects in the KDD lab focus on the following two main areas:

1- Novel techniques in data analytics and machine learning, as well as emerging applications of data mining and machine learning in engineering, healthcare, and business. In particular, the projects focus on the study and development of advanced algorithms for (a) big data analytics; (b) outlier detection in high-dimensional data; and (c) stream data mining; as well as emerging applications of the proposed solutions in engineering (network security, intrusion detection), business (business analytics, fraud detection, intelligent prediction of the stock market), and healthcare (study of health coverage, predicting high-cost patients and risk of hospitalization, predicting immune-based diseases).

2- Information systems and technologies; Data communications networks and services; and applications of information systems in business and healthcare.

A partial list of the projects in the KDD Lab includes:

Outlier Detection in High-Dimensional Big Data using Bio-Inspired Algorithms (Supported by NSERC Discovery)

In his research program, funded by NSERC Discovery since 2007, Dr. Raahemi explores innovative algorithms for feature engineering and analysis of large data sets using bio-inspired and machine learning approaches, with a particular focus on outlier detection. He investigates the effectiveness of the proposed algorithms in various applications, including (a) intrusion detection systems and anomaly detection for network security; (b) protocol identification of Internet traffic for resource allocation and quality of service assurance; and (c) maritime vessel scheduling.

Cyber Threat and Malware Detection in Network Traffic using Big Data Analytics (supported by Bell Canada and MITACS)

Maintaining Quality of Service in a network requires traffic monitoring and security control measures. Classification of Internet traffic (e.g., peer-to-peer, web server, mail server, and attacks including malware, viruses, and worms) is a fundamental requirement in areas such as network provisioning, network security, traffic engineering, and network management.

In close collaboration with the Bell Canada Cyber Threat Intelligence (CTI) team, Dr. Raahemi and his group developed solutions, using big data analytics techniques, to classify cyber threats, including malware and attacks, based on their behavioral characteristics.


Metaheuristic Optimization in Maritime Vessel Scheduling:
Big-Data-Enabled Multi-Objective Modelling of Vessel Scheduling Recovery Problem
(supported by Larus Technologies and NSERC-CRD)

Seaborne transport carries 90% of international trade and therefore has a significant impact on the global economy. Because services in this industry are hardly differentiated, stakeholders compete mainly on cost.
This research explores multi-objective optimization techniques to address an optimization problem with 3 objectives:
-  minimize financial loss
-  minimize delay time
-  maximize average speed compliance

Port traffic, traffic on major world sea routes, and special atmospheric conditions at a geospatial location are the parameters affecting sailing speed.
This research employs metaheuristic techniques to solve the optimization problems. In particular, we use distributed cooperative coevolution methods on the Apache Spark framework to improve the performance and quality of solutions.

Our proposed solution generated a Pareto front which reflects the trade-off among the three objectives.
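To illustrate what a Pareto front means for the three objectives listed above, here is a minimal sketch that filters a set of candidate schedules down to the non-dominated ones. It is not the lab's coevolution method: the scoring tuples, the function names `dominates` and `pareto_front`, and the sample numbers are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Each candidate schedule is scored as
# (financial_loss, delay_time, speed_compliance).
# The first two objectives are minimized, the third is maximized.
Solution = Tuple[float, float, float]


def dominates(a: Solution, b: Solution) -> bool:
    """True if a is at least as good as b on every objective
    and strictly better on at least one."""
    no_worse = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
    strictly_better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
    return no_worse and strictly_better


def pareto_front(solutions: List[Solution]) -> List[Solution]:
    """Keep only the schedules that no other schedule dominates."""
    return [
        s for s in solutions
        if not any(dominates(other, s) for other in solutions)
    ]


# Hypothetical candidate schedules (illustrative numbers only).
candidates = [
    (100.0, 5.0, 0.90),
    (120.0, 4.0, 0.85),
    (110.0, 6.0, 0.80),  # worse than the first candidate on all three objectives
    (90.0, 7.0, 0.95),
]
front = pareto_front(candidates)
```

Every schedule left in `front` trades one objective against another, for example a lower financial loss at the price of a longer delay, which is the trade-off a decision maker would then weigh.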

Estimating Bus Passengers' Origin-Destination of Travel Route using Data Analytics on Wi-Fi and Bluetooth Signals (supported by SMATS Traffic Solutions and OCE/NSERC-Engage)

The solutions we propose in this research improve the efficiency of public transportation systems by facilitating efficient bus scheduling and route planning, improving ride comfort, and lowering operating costs for cities.

The TrafficBox sensor collects mobile devices’ MAC addresses, Received Signal Strength Indication (RSSI), timestamps, and GPS data, and stores them on its internal storage.

The main challenge in using Wi-Fi and Bluetooth sensors is distinguishing passengers’ signals from non-passengers’ signals, as the sensors detect all signals transmitted in the surrounding environment. To address this issue, we employed K-Means and hierarchical clustering methods, building on our previous experiment, to automatically differentiate between passengers’ and others’ signals.
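As a rough illustration of the clustering idea, the sketch below runs a plain K-Means (Lloyd's algorithm) with k = 2 on per-device features. The two features used here, dwell time on the bus and mean RSSI, are assumptions made for this example, as are the function names and the sample readings; the features actually used in the project are not stated above. Initialization uses a deterministic farthest-first rule so that the example is reproducible.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def min_max_scale(rows: Sequence[Point]) -> List[Point]:
    """Rescale every feature to [0, 1] so no single unit dominates the distance."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [tuple((v - lo) / sp for v, lo, sp in zip(row, lows, spans))
            for row in rows]


def farthest_first(points: Sequence[Point], k: int) -> List[Point]:
    """Deterministic initialization: start from the first point, then keep
    adding the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def k_means(points: Sequence[Point], k: int, iterations: int = 50) -> List[int]:
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels


# Hypothetical per-device readings: (dwell time in minutes, mean RSSI in dBm).
devices = [
    (18.0, -52.0), (22.0, -48.0), (15.0, -55.0), (25.0, -50.0),  # long dwell, strong signal
    (1.0, -82.0), (2.0, -88.0), (1.5, -79.0), (0.5, -85.0),      # brief, weak signal
]
labels = k_means(min_max_scale(devices), k=2)
```

K-Means only returns cluster indices. Reading the long-dwell, strong-signal cluster as "passengers" is an interpretation step that would need ground-truth validation, such as the earlier experiment mentioned above.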


Managing and Analysing Data for Concrete Building Infrastructure (Supported by Giatec Scientific and NSERC-Engage) 

Dr. Raahemi led a research project in collaboration with Giatec to collect and store the data generated by wireless sensors on a cloud infrastructure, and then manage and analyze the data using data mining and machine learning techniques to detect anomalies and explore hidden patterns in the data.


Analyzing EEG signals for depression diagnosis (supported by IBM, Royal Ottawa Hospital and MITACS)

Dr. Raahemi and his team, in collaboration with researchers at the Royal Ottawa Hospital and supported by IBM Canada and MITACS, have undertaken a project to analyze electroencephalogram (EEG) signals collected from patients with major depressive disorder, in order to build predictive models and identify brain biomarkers for the diagnosis of depression.

Predicting Immune-based Disease with Reliable Data Mining on Population-Based Health Administrative Data (supported by Children's Hospital of Eastern Ontario Research Institute CHEO-RI, ICES and MITACS)

The prevalence of immune-mediated chronic diseases has increased worldwide, including in Canada, in recent years.

In this project, sponsored by the Institute for Clinical Evaluative Sciences, the Children's Hospital of Eastern Ontario Research Institute, and MITACS, Dr. Raahemi and his colleagues are investigating exploratory data analysis and predictive modelling to build risk prediction models for chronic immune-mediated diseases such as IBD, asthma, multiple sclerosis, and type-1 diabetes.

The case study is implemented on real-world health data (ToH, OHIP, ICES) to tackle a rising population-level issue, immune-mediated diseases among children in Ontario, Canada, and the results are validated in consultation with domain experts.


Classification of Peer-to-Peer traffic using data mining techniques (supported by Alcatel Networks (now Nokia) and ORNEC)

Telecommunication equipment vendors and Internet Service Providers are very interested in solutions to classify Peer-to-Peer (P2P) traffic. P2P applications consume significant bandwidth and exhaust network resources, resulting in network congestion and affecting the availability, reliability, and quality of service.

Supported by Alcatel Networks, and in collaboration with its Research and Innovation Centre (R&I), we collected real Internet traffic, performed pre-processing on the data, and prepared a training data set based on which we built several models including decision tree, neural networks, incremental neural networks, incremental Tri-Training, fast decision tree, and concept-drift fast decision tree to identify P2P traffic.
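The simplest member of the decision tree family mentioned above is a one-level tree, or decision stump. The sketch below fits one by minimizing weighted Gini impurity on a toy training set. The two flow features (number of distinct remote hosts and mean packet size), the helper names, and the data are hypothetical; they stand in for the real traffic features, which are not listed here.

```python
from typing import List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of binary labels (1 = P2P, 0 = other traffic)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def majority(labels: Sequence[int]) -> int:
    """Most common binary label; ties resolve to 1."""
    return int(sum(labels) * 2 >= len(labels))


def fit_stump(rows: Sequence[Sequence[float]],
              labels: Sequence[int]) -> Tuple[int, float, int, int]:
    """Find the (feature, threshold) split with the lowest weighted Gini impurity.

    Returns (feature_index, threshold, label_if_at_or_below, label_if_above).
    Assumes at least one feature takes two or more distinct values.
    """
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    return best[1:]


def predict(stump: Tuple[int, float, int, int], row: Sequence[float]) -> int:
    """Classify one flow with a fitted stump."""
    feature, threshold, below, above = stump
    return below if row[feature] <= threshold else above


# Hypothetical flows: (distinct remote hosts, mean packet size in bytes).
flows: List[Tuple[float, float]] = [
    (45, 900), (60, 1100), (52, 400), (70, 1200),  # labeled P2P
    (3, 1000), (5, 500), (2, 1300), (8, 350),      # labeled other
]
is_p2p = [1, 1, 1, 1, 0, 0, 0, 0]

stump = fit_stump(flows, is_p2p)
```

On this toy data the stump splits on the first feature, so `predict(stump, (40, 800))` labels a flow with many remote hosts as P2P. A full decision tree repeats the same split search recursively on each side of the threshold.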