Data Science Projects

This page showcases the data science projects that I’ve completed either on my own or as part of the Johns Hopkins University Data Science Specialization Track offered through Coursera. Each entry has a brief description of the project, followed by links to applications, repositories, presentations and/or reports, where applicable. (Image credit: Drew Conway)

Personal Projects

Data Science Cross Reference Notes [On-going]

This is a collection of notes from my learning journey that attempts to serve as a cross-reference between language implementations of common data science tasks. I am starting with R and Python.

Highlights
– Example implementations of common data science tasks

Github

Exploring Focal Lengths

This is an attempt to visualize my photography style by looking at the focal lengths I’ve used over the past few years.

Highlights
– EXIF metadata extraction
– Cleaned and visualized data using R packages such as dplyr, lubridate and ggplot2
– Interactive visualization using Tableau

Blog Post
Github
Tableau Visualization

JHU Data Science Specialization

Capstone Project

In this inaugural run of the capstone, offered in partnership with SwiftKey, I created a text prediction application that takes a phrase entered by the user and predicts the next word the user might type.

Highlights
– Applied Natural Language Processing concepts such as tokenization, stemming and language modelling
– Created N-grams from a text corpus using R packages such as tm and RWeka
– Created the presentation and application using knitr, HTML5 R Presentation (created with RStudio) and Shinyapps.io

App on Shinyapps.io
Presentation on Rpubs
Milestone report on Rpubs
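
The capstone app itself was built in R with tm and RWeka. As a rough illustration of the N-gram idea behind it, here is a minimal Python sketch of a bigram next-word predictor. The toy corpus, the whitespace tokenizer and the function names `build_bigram_model` and `predict_next` are assumptions made for this example and are not taken from the app's code.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_bigram_model(sentences):
    """Count, for every word, which words follow it and how often."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for current_word, next_word in zip(tokens, tokens[1:]):
            model[current_word][next_word] += 1
    return model


def predict_next(model, phrase):
    """Return the most frequent follower of the phrase's last word, or None."""
    tokens = tokenize(phrase)
    if not tokens or tokens[-1] not in model:
        return None
    return model[tokens[-1]].most_common(1)[0][0]


# Hypothetical toy corpus, for illustration only.
corpus = [
    "thank you very much",
    "thank you for coming",
    "see you very soon",
    "thank you very kindly",
]
model = build_bigram_model(corpus)
print(predict_next(model, "Thank you"))  # prints "very"
```

A real model would use longer N-grams with a fallback to shorter ones when a phrase has not been seen, but the counting step stays the same.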

Developing Data Products

In this project, I created a data product and deployed it on Shinyapps.io, a platform for hosting web applications built in R and published from the RStudio IDE.

Highlights
– Created a query tool to convert postal addresses or place names to map coordinates
– Integrated the Google Maps API for geocoding
– Created the presentation and application using the slidify R package and Shinyapps.io

App on Shinyapps.io
Presentation on Github Pages

Practical Machine Learning

For this project, I worked with the Human Activity Recognition dataset, which contains data recorded by sensors in wearable activity trackers similar to the products made by Nike and Fitbit. I built a predictive model that recognizes human activities such as sitting down and standing up.

Highlights
– Built a prediction model with the Random Forest classifier using the caret R package
– Applied prediction study design principles such as creating training, validation and test sets, model selection and cross-validation
– Created an HTML report with R Markdown and the knitr R package

Report on Github Pages
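
The project did this in R with caret. As a language-neutral illustration of the training, validation and test split mentioned in the highlights, here is a minimal Python sketch. The 60/20/20 proportions, the fixed seed and the function name `split_dataset` are assumptions for this example, not the settings used in the report.

```python
import random


def split_dataset(rows, train_frac=0.6, validation_frac=0.2, seed=42):
    """Shuffle rows reproducibly and cut them into train/validation/test lists."""
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


# Hypothetical stand-in for 100 sensor observations.
rows = list(range(100))
train, validation, test = split_dataset(rows)
print(len(train), len(validation), len(test))  # prints "60 20 20"
```

The model is fitted on the training set, candidate models are compared on the validation set, and the test set is held back for a single final estimate of out-of-sample error.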

Regression Models

In this project, I analyzed the provided dataset and created a regression model to answer questions about motor car trends. The analysis is presented as a PDF report.

Highlights
– Built a multivariate linear regression model with R
– Applied statistical techniques like t-tests and stepwise regression
– Created a PDF report using R Markdown and the knitr package

Github

Reproducible Research

In this project, I worked with the Storm Events Database to produce an analysis of the impact of weather events in the United States. The report concludes by identifying the top 10 event types that cause the most casualties and the greatest monetary damage.

Highlights
– Created a reproducible report using R Markdown and the knitr package
– Cleaned and visualized data using R and the ggplot2 package

Report on Rpubs
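
The report was written in R. As a rough illustration of the ranking step, here is a minimal Python sketch that totals casualties per event type and keeps the top entries. The field names (`event_type`, `fatalities`, `injuries`), the label clean-up and the sample records are hypothetical and do not come from the actual database.

```python
from collections import defaultdict


def top_events_by_casualties(records, n=10):
    """Sum fatalities and injuries per event type and return the n largest totals."""
    totals = defaultdict(int)
    for record in records:
        # Normalize the label so "tornado " and "Tornado" are counted together.
        event = record["event_type"].strip().upper()
        totals[event] += record["fatalities"] + record["injuries"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical sample records, for illustration only.
records = [
    {"event_type": "Tornado", "fatalities": 2, "injuries": 10},
    {"event_type": "FLOOD", "fatalities": 1, "injuries": 3},
    {"event_type": "tornado ", "fatalities": 0, "injuries": 5},
    {"event_type": "Heat", "fatalities": 4, "injuries": 0},
]
print(top_events_by_casualties(records, n=2))
```

The same pattern applies to monetary damage by summing a damage field instead of casualties.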

Exploratory Data Analysis

In this course, I performed exploratory data analysis on the “Individual household electric power consumption Data Set” provided by the UC Irvine Machine Learning Repository and created plots using the base R graphics system and the ggplot2 package.

Highlights
– Visualized data using base R graphics and the ggplot2 package

Github

Getting and Cleaning Data

In this project, I cleaned a raw data source and produced a tidy dataset. You can see the analysis file, tidy dataset and codebook on Github.

Highlights
– Created a tidy dataset after cleaning raw data
– Created a complementary codebook for the tidy dataset

Github