This page contains a list of course and independent projects.

Overview

Topics include (from most recent to oldest):

spatial, temporal and spatio-temporal point processes; big data; computing and simulation; GLM; statistical/machine learning; time series; multivariate analysis; linear models; EDA (exploratory data analysis); regression analysis

Datasets include (from most recent to oldest):

1D and 2D simulations of point/stochastic processes; NYC Taxi & Limousine Commission trip records; GMM (Gaussian mixture model); brain injury recovery stages; Porto Seguro insurance claims; U.S. Census Bureau’s American Community Survey (ACS) data; stock prices; wine quality; Kaggle’s Ames housing prices; Colorado state standardized test scores; World Bank data



Master’s Project

Research


Prequel to Hawkes Processes: An Overview of Spatial, Temporal and Spatio-Temporal Point Processes and Some Simulations, Portland, OR, Mar 2021 – June 2021

Master’s Project for MS in statistics at OSU

Title: Prequel to Hawkes Processes: An Overview of Spatial, Temporal and Spatio-Temporal Point Processes and Some Simulations

Description: We briefly introduce selected spatial and temporal point processes, review their definitions, and discuss their properties and applications, leading up to spatio-temporal self-exciting point processes (SEPP). We also simulate some of the processes in 1D and 2D, in the hope that interested readers gain the background needed to follow the existing SEPP literature and explore the field further.
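As an illustration of the simplest process in this family, here is a minimal sketch of simulating a homogeneous Poisson process in 1D by accumulating exponential waiting times. It is not taken from the project repository; the function name `simulate_poisson_process` and the values of `rate`, `t_max`, and `seed` are assumptions chosen for the example.

```python
import random


def simulate_poisson_process(rate, t_max, seed=None):
    """Return sorted event times of a homogeneous Poisson process on [0, t_max].

    Hypothetical helper for illustration: gaps between consecutive events
    are drawn as independent Exponential(rate) waiting times.
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate)  # first waiting time
    while t <= t_max:
        times.append(t)
        t += rng.expovariate(rate)  # next independent gap
    return times


# Illustrative parameters: the expected number of events is rate * t_max.
events = simulate_poisson_process(rate=2.0, t_max=1000.0, seed=1)
print(len(events))
```

Self-exciting processes such as Hawkes processes replace the constant rate with an intensity that depends on past events, which is why simulating them usually requires a thinning step rather than plain exponential gaps.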

[PDF] [Repository]






Data Science Projects

Data Science Tools and Programming/Big Data


Big Data Analysis of NYC TLC Trip Record (Yellow, Green Taxi, For-Hire Vehicle, HFHV) Data, Portland, OR, Feb 2021 – Mar 2021

Final Project for CS 512 Data Science Tools and Programming/Big Data

Title: Big Data Analysis of NYC TLC Trip Record (Yellow, Green Taxi, For-Hire Vehicle, HFHV) Data

Description: We use Google Cloud Platform (GCP) services (e.g. Compute Engine, BigQuery, Cloud Dataproc) and PySpark/Apache Spark to explore and analyze the NYC Taxi & Limousine Commission's trip records for 2019-2020 (~35.26 GB). The project was completed using cloud computing services, R, Python, and a query language.

[PDF] [Repository]






Statistical Modeling Projects

Spatial Statistics

in R, Portland, OR, Pending

Survival Analysis/GLMs II

in R, Portland, OR, Pending



Probability, Computing, and Simulation


EM (Expectation-Maximization) Algorithm with An Application for Gaussian Mixture Model, Portland, OR, Nov 2020 – Dec 2020

Final Project for ST 541 Probability, Computing, and Simulation in Statistics

Title: EM (Expectation-Maximization) Algorithm with An Application for Gaussian Mixture Model

Description: We first discuss the conditions under which the EM algorithm can be used to find a (local) maximum likelihood estimate (MLE) of the parameters, as compared to other iterative methods such as Newton-Raphson, Fisher scoring, and IRLS (iteratively reweighted least squares). Then, we briefly review MLE and posterior probability. Next, we introduce the EM algorithm in the context of the Gaussian mixture model and expand the E-step and the M-step of the algorithm in detail. Finally, we apply the algorithm to a simulated dataset that follows a two-component Gaussian mixture distribution and evaluate its performance.
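The E-step and M-step for a two-component, one-dimensional Gaussian mixture can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the crude initialisation (sample minimum and maximum as starting means), the fixed iteration count, and the simulated data (40% from N(0, 1), 60% from N(5, 1)) are all choices made for the example.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, n_iter=100):
    """Fit a two-component 1D Gaussian mixture by EM (illustrative sketch)."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    # Assumed initialisation: extreme points as means, pooled sd, equal weights.
    pi1, mu1, mu2, s1, s2 = 0.5, min(data), max(data), sd, sd
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from component 1.
        resp = []
        for x in data:
            a = pi1 * normal_pdf(x, mu1, s1)
            b = (1.0 - pi1) * normal_pdf(x, mu2, s2)
            resp.append(a / (a + b))
        # M-step: weighted maximum likelihood updates.
        n1 = sum(resp)
        n2 = n - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        s1 = math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1)
        s2 = math.sqrt(sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2)
        pi1 = n1 / n
    return pi1, (mu1, s1), (mu2, s2)


# Simulated data with hypothetical true parameters.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) if rng.random() < 0.4 else rng.gauss(5.0, 1.0)
        for _ in range(2000)]
pi1, (mu1, s1), (mu2, s2) = em_two_gaussians(data)
print(round(pi1, 2), round(mu1, 2), round(mu2, 2))
```

Because EM only guarantees convergence to a local maximum, the result can depend on the starting values; well-separated components like these are a forgiving case.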

[PDF] [Repository]




Generalized Regression Models/GLMs I


Multinomial GLMs for Multinomial Response with An Example of Brain Injury Recovery Stages, Portland, OR, Nov 2020 – Dec 2020

Final Project for ST 623 Generalized Regression Models (GLMs) I

Title: Multinomial GLMs for Multinomial Response with An Example of Brain Injury Recovery Stages

Description: We expand beyond binary (or binomial) responses to focus on polychotomous (or multinomial) responses. We first differentiate ordinal responses from nominal responses. Then, we briefly review the multinomial distribution and latent variables. Next, we define the models and discuss model assumptions and estimation. Finally, we include an example of traumatic brain injury outcomes to illustrate how the proportional-odds cumulative logit model and the baseline-category logit model are used for estimation and prediction in practice.
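To make the two model forms concrete, the sketch below turns linear predictors into category probabilities. It assumes the parameterisation logit P(Y <= j) = cutpoint_j - eta for the proportional-odds model and log(P(Y = j) / P(Y = baseline)) = eta_j for the baseline-category model; the function names and numeric values are hypothetical and are not estimates from the brain injury data.

```python
import math


def logistic(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def proportional_odds_probs(cutpoints, eta):
    """Category probabilities under logit P(Y <= j) = cutpoints[j] - eta.

    cutpoints must be increasing; J - 1 cutpoints give J ordered categories.
    """
    cumulative = [logistic(c - eta) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs


def baseline_category_probs(etas):
    """Category probabilities with category 0 as the baseline.

    etas[j] is the log-odds of category j + 1 against the baseline.
    """
    expo = [1.0] + [math.exp(e) for e in etas]
    total = sum(expo)
    return [e / total for e in expo]


# Hypothetical values for illustration only.
print(proportional_odds_probs([-1.0, 0.5, 2.0], eta=0.3))
print(baseline_category_probs([0.2, -0.4]))
```

The proportional-odds form uses a single linear predictor shared across all cutpoints, whereas the baseline-category form needs a separate linear predictor for each non-baseline category.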

[PDF] [Repository]




Time Series


Vector Autoregressive Models for Multivariate Time Series with An Example of Stock Prices, Corvallis, OR, Feb 2020 – Mar 2020

Final Project for ST 565 Time Series

Title: Vector Autoregressive Models for Multivariate Time Series with An Example of Stock Prices

Description: In many real-world datasets, multiple variables are at play, and they often interact with one another. Multivariate time series models such as vector autoregression (VAR) thus become a natural extension of the univariate autoregressive (AR) model, offering more flexibility and better forecasting. First, we discuss the differences between univariate time series and multivariate time series. Then, we introduce several key properties of multivariate time series such as stationarity, ergodicity, and cross-covariance. Next, we define the VAR model and discuss model estimation and causality. Finally, we include an example of stock prices to illustrate how the VAR model is used for estimation and forecasting in practice.
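As a small illustration of VAR estimation, the sketch below simulates a bivariate, zero-mean VAR(1) process x_t = A x_{t-1} + e_t and recovers A by least squares. It is a toy example under stated assumptions, not the project's analysis: the function names, the coefficient matrix `A_true`, the noise scale, and the series length are hypothetical, and the model has no intercept.

```python
import random


def simulate_var1(A, n, noise_sd=1.0, seed=None):
    """Simulate n steps of a bivariate VAR(1) with Gaussian noise, from x_0 = 0."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    series = []
    for _ in range(n):
        x = [
            A[0][0] * x[0] + A[0][1] * x[1] + rng.gauss(0.0, noise_sd),
            A[1][0] * x[0] + A[1][1] * x[1] + rng.gauss(0.0, noise_sd),
        ]
        series.append(x)
    return series


def fit_var1(series):
    """Least-squares estimate of A: (sum x_t x_{t-1}') (sum x_{t-1} x_{t-1}')^{-1}."""
    s10 = [[0.0, 0.0], [0.0, 0.0]]
    s00 = [[0.0, 0.0], [0.0, 0.0]]
    for prev, curr in zip(series[:-1], series[1:]):
        for i in range(2):
            for j in range(2):
                s10[i][j] += curr[i] * prev[j]
                s00[i][j] += prev[i] * prev[j]
    # Explicit inverse of the 2x2 matrix s00.
    det = s00[0][0] * s00[1][1] - s00[0][1] * s00[1][0]
    inv = [[s00[1][1] / det, -s00[0][1] / det],
           [-s00[1][0] / det, s00[0][0] / det]]
    return [[sum(s10[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Hypothetical stationary coefficient matrix (both eigenvalues inside the unit circle).
A_true = [[0.5, 0.2], [0.1, 0.3]]
series = simulate_var1(A_true, n=5000, seed=42)
A_hat = fit_var1(series)
print([[round(a, 2) for a in row] for row in A_hat])
```

The off-diagonal entries of A are what distinguish a VAR from two separate AR models: they let the past of one series help predict the other, which is the basis of Granger-causality tests.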




Machine Learning Projects

Machine Learning

in Python, Corvallis, OR, Pending

Class Projects for CS 534 Machine Learning





Data Analysis Projects

Statistical Methods for Large and Complex Data Sets/Statistical Learning


Auto Insurance Claim Prediction of Porto Seguro, Brazil’s Auto and Home Insurance Company, Corvallis, OR, May 2020 – Jun 2020

Final Project for ST 538 Modern Statistical Methods for Large and Complex Data Sets/Statistical Learning

Title: Auto Insurance Claim Prediction of Porto Seguro, Brazil’s Auto and Home Insurance Company

Description: We use several supervised and unsupervised learning algorithms such as PCA, penalized logistic regression, bagging, boosting, and random forests to determine the factors that lead to claims being filed and to predict whether drivers would file claims. Data were extracted from Porto Seguro’s Safe Driver Prediction competition on Kaggle and included Claim_Initiated and 56 other explanatory variables, with 476k observations (n=476k) in the training dataset and 119k observations (n=119k) in the test dataset. Project was completed in R.





Does Speaking Additional Language Predict Higher Earning Levels? A Case Study Using The U.S. Census, Corvallis, OR, Mar 2020 – Apr 2020

Final Project for ST 538 Modern Statistical Methods for Large and Complex Data Sets/Statistical Learning

Title: Does Speaking Additional Language Predict Higher Earning Levels? A Case Study Using The U.S. Census

Description: We use several regression models to explore the relationship between wages and speaking a language other than English at home in the United States. Data were extracted from the U.S. Census Bureau's American Community Survey. For this analysis, we limited the scope of inference to individuals who speak English, Spanish, or Chinese in the West region of the U.S. and selected from 700+ variables to include only Wages and 7 other explanatory variables, for a total of 1.1 million individuals (n=1.1m). Project was completed in R.




Multivariate Analysis


Red vs. White Wine: Inference, Classification, and Clustering Based On Wine Quality and Chemical Attributes, Corvallis, OR, Nov 2019 – Dec 2019

Final Project for ST 557 Multivariate Analysis

Title: Red vs. White Wine: Inference, Classification, and Clustering Based On Wine Quality and Chemical Attributes

Description: I use inferential and predictive methods such as Hotelling's T² test, multivariate analysis of variance (MANOVA), principal component analysis (PCA), k-nearest neighbors (k-NN) classification, and k-means clustering to distinguish white wine from red wine and identify key variables in determining red wine quality. Data were provided as part of the project and included quality score, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol for a total of 1599 red wines (n1=1599) and 4898 white wines (n2=4898). Project was completed in R.




Linear Models


A Look At Residential Housing Price in the Ames, Iowa Area, Corvallis, OR, Feb 2020 – Mar 2020

Final Project for ST 552 Statistical Methods II

Title: A Look At Residential Housing Price in the Ames, Iowa Area

Description: I use several inferential and predictive methods to test the relationship between SalePrice and other variables thought to be related to it and to predict the final prices of residential houses in Ames, Iowa. Data were provided as part of the project; they were derived and modified from the Ames Housing Price Challenge on Kaggle and included SalePrice and 55 other explanatory variables, with 1078 observations (n=1078) in the modified dataset and 120 observations (n=120) in the test dataset. Project was completed in R.




Regression Analysis


Regression Analysis of Education data in R, Fort Collins, CO, Apr 2019 – May 2019

Final Project for STAA 566 Data Visualization

Title: An Investigation Into State Standardized Tests for Elementary School Students in the Denver, Colorado Area

Description: I use a locally estimated scatterplot smoothing (LOESS) model and an analysis of covariance (ANCOVA) model to explore the relationship between state standardized tests and several related variables. Data were provided as part of the project, which included test scores, English portion of the score, math portion of the score, student’s gender, student’s grade, and parents’ employment for a total of 1402 elementary school students in the Denver, Colorado area (n=1402). Project was completed in R: base R, car, and ggplot2.

[PDF] [Repository]





Regression Analysis of World Bank data in R, Fort Collins, CO, Apr 2019 – May 2019

Final Project for STAT 512 Design and Data Analysis II

Title: Environment vs. Economy: The Relationship Between CO2 Per Capita And Other Indicators and The Environmental Kuznets Curve

Description: I use a log-log multiple linear regression model with interaction terms and a log-log one-variable polynomial regression model to explore the relationship between CO2 per capita and several related variables and test the environmental Kuznets curve (EKC). Data were extracted from the World Bank (WB) database, which included CO2 per capita, GDP per capita, GNI per capita, energy use per capita, and electric power consumption per capita for a total of 264 countries (n=264). Project was completed in R: base R, dplyr, car, and ggplot2.

[PDF] [Repository]