Richard Ryu - Data Product Manager

I wanted to create a simple webpage to help you better understand how UC Berkeley's Master of Information and Data Science (MIDS) program transformed a non-technical student who only knew how to read SQL queries.

To be more specific, this is a short review of my time in the UC Berkeley Master of Information and Data Science program. The sections are ordered chronologically by the classes I took and include code samples, screenshots, and links to my work. If any of the links are broken, please let me know at richard.ryu@ischool.berkeley.edu

W200 Python Fundamentals for Data Science (2018 Summer)

This was my first “coding” class in an academic setting. Coming in with only 2+ years of SQL and some decent Excel skills, I found some parts easy to understand, like if conditions and data types such as float vs. int. However, what I remember most from this class are two things:

Great community. Our class was made up of students of all ages and genders with no Python background. With a common goal of learning, everybody (including the TAs and the instructors) helped each other out and was readily available!
Like any language, whether English, French, Spanish, or Chinese, Python has unbelievable depth and nuance. To name a quick example, a simple list comprehension can replace a for loop that’s 10+ lines long. Productivity can’t be measured by the volume of code; instead, we should try to evaluate the quality of the code that’s being produced.
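To make the list comprehension point concrete, here is a minimal sketch. The orders data and the variable names are made up for illustration and are not from the class material.

```python
# Hypothetical (item, price) pairs, invented for this example
orders = [("bulgogi", 18.0), ("galbi", 25.0), ("kimchi", 4.5), ("soju", 6.0)]

# Loop version: collect the upper-cased names of items priced above 5
expensive_loop = []
for name, price in orders:
    if price > 5:
        expensive_loop.append(name.upper())

# List comprehension version: the same filter and transform in one line
expensive_comp = [name.upper() for name, price in orders if price > 5]

print(expensive_comp)
```

Both versions build the same list; the comprehension simply states the filter and the transformation in a single expression.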

Highlights

Object Oriented Programming, Functional Programming
Learned about Data Structures
EDA in Python (pandas, numpy, matplotlib)
Learned how to Git
Pseudo-coding

Mid Term Project: BBQ Management System

The first Python application I ever wrote! A restaurant management application for my friend’s Korean BBQ restaurant

Final Project: Bechdel Analysis

Used pandas on web-scraped data from Box Office Mojo and IMDb

W203 Statistics for Data Science (2019 Summer)

I took a year break after my first semester at UC Berkeley MIDS due to my commitments as a product manager. However, I decided to re-prioritize and return to school full-time by taking Statistics and Data Engineering during the summer of 2019.

Highlights

Probability Theory and foundational understanding of classical statistics and how it fits within the broader context of data science
Learned and used R for most assignments

Final Project: Linear Regression on North Carolina Crime data

Our project group, Significant Effects, consisting of Jeff Li, Vasanth Ramani, and Richard Ryu, was tasked by an imaginary political campaign in North Carolina with identifying the determinants of crime
As a result of our analysis, we came up with 3 different linear regression models to predict crime in North Carolina
We evaluated the models on two criteria: the CLM (classical linear model) assumptions and model fit
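As a rough illustration of what "fit" means here, the sketch below fits a one-predictor least-squares line and computes R-squared. It is written in Python rather than the R used in class. The data points are invented, and the function names are hypothetical. It is not the North Carolina crime data or our actual models, which used several predictors.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimize squared error for y ~ a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Invented toy data, roughly following y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_ols(x, y)
fit = r_squared(x, y, intercept, slope)
print(intercept, slope, fit)
```

A high R-squared alone does not validate a model, which is why the assumptions were checked alongside the fit.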

W205 Fundamentals of Data Engineering (2019 Summer)

I was always fascinated by the fast-moving lines of logs on a terminal. I got a chance to watch a lot of that at work, when I participated in a project to switch the host infrastructure of our web platform from AWS to Microsoft Azure. By the time I was done with W205 Data Engineering, I felt I had a general idea of what all those fast-moving lines of logs were doing.

Highlights

First introduction to the world of Docker Containers, cloud services (GCP), data pipeline, query, transformation, and streaming
Learned how to use Kafka, Spark, and Flask

Final Project: Instrument an API server to catch and analyze events for a mobile gaming company

Each weekly assignment built up to the final project, where we put it all together in a single data pipeline. Tools and packages leveraged in the final pipeline:

Docker Containers
Kafka
Flask
Hadoop HDFS
PySpark SQL
Pandas, Jupyter Notebook, Numpy

W209 Data Visualization (2019 Fall)

I was a bit hesitant to take this class since I already had a decent business intelligence/data visualization background. I'm proficient with Tableau and have delivered data visualizations to managers of global pharmaceutical companies through SiSense and Microsoft Power BI. However, this class made me realize that we're all actually spoiled by the data visualization tools out there. Many of the fundamental concepts and theories behind designing a good data visualization that we learned in this class are already chosen for us, or at least guided, by tools such as Tableau or SiSense. We not only learned the fundamentals of good data visualization, but also learned how to code our own data visualizations in D3.js. After all, a picture is worth more than a thousand words.

Highlights

Main exposure to D3.js, with general front-end concepts
Data visualization concepts, theory, etc.
A picture is worth more than a thousand words
Tableau
D3.js examples

Final Project: MOSS Dashboard

Created a dashboard that will visualize the outcome of LIME
github
Final Presentation

W207 Applied Machine Learning (2019 Fall)

This was my official introduction to Machine Learning. Yes, it took about a year before I actually got the chance to do Machine Learning. I got to learn from Yacov Salomon, and what I will remember most is his whiteboard and always-dry markers… and how he always emphasized the importance of understanding the intuition behind any algorithm. To get a better understanding of what I mean, take a look at the Q&A below from one of the assignments:

Q: Any ideas why logistic regression doesn't work as well as Naive Bayes?

A: The difference between Naive Bayes and logistic regression is that Naive Bayes assumes independence amongst features while logistic regression looks for relationships between features. Since we're dealing with more than 2,000 features, the logistic regression will perform badly due to complexity. If we want to improve the performance of our logistic regression model, we must reduce the number of features.

Highlights

Heavy Exposure: sklearn, numpy, matplotlib
Code Sample from one of the Assignments analyzing poisonous mushrooms:

Thoughts: I think this is the level of technical skill expected from us about halfway into the program. Many differences and improvements can be seen compared with my first Python code :D. However, I do want to emphasize that my experience at UCB MIDS would not have been possible without the positive and constructive participation of my fellow classmates, the TAs, and the instructors. Thank you to them.

Final Project: Kaggle - Facial Keypoint Detection Challenge

This was my first Kaggle competition
Used Keras, TensorFlow, skimage, sklearn, pandas, numpy, matplotlib, CUDA, cuDNN
presentation
github
final notebook

W251 Deep Learning in the Cloud at the Edge (2020 Spring)

Despite the fact that data is growing at an exponential rate, there is still so much uncaptured data in the wild. This is where edge computing comes in, and it is what this class is all about. Our first task was to set up an NVIDIA Jetson TX-2 with Ubuntu. With the Jetson and IBM Cloud, we got to experiment with various Deep Learning algorithms and truly appreciate the possibilities of big data.

NOTE: If you're not sure what Jetson TX-2 is, Bin Wang provided a nice video below that basically describes the first week of class for us

Highlights

Heavy exposure to Docker Containers, IBM Cloud, DevOps, IoT, edge computing, deep learning, NVIDIA Jetson TX-2
Tuned models to safely land a lunar lander on the moon github

Trained a Transformer-based Machine Translation Network on a small English to German WMT corpus github
Captured facial images from a webcam attached to the Jetson TX-2, then sent the captured images to cloud object storage on IBM virtual servers using MQTT github

Final Project: Kaggle - Deep Fake Detection Challenge

Participated in the Deepfake Detection Challenge to predict whether or not a particular video is a deep fake
Used IBM Cloud for the pipeline
Leveraged MTCNN, mixnet_m, LSTM, PyTorch, CUDA 10.0 for training
Achieved an accuracy of ~ 88.81% and log loss of 0.25, when predicting videos with a single face captured
github
presentation
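For readers unfamiliar with the two metrics reported above, here is a minimal sketch of how accuracy and binary log loss are computed. The labels and probabilities are invented for illustration, the function names are hypothetical, and this is not our competition code.

```python
import math


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    return sum(int(p == t) for p, t in zip(preds, y_true)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (binary case)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Invented labels (1 = fake, 0 = real) and predicted probabilities of "fake"
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4]

print(accuracy(y_true, y_prob))
print(log_loss(y_true, y_prob))
```

Unlike accuracy, log loss penalizes confident wrong predictions heavily, so a model can have high accuracy and still score poorly if its probabilities are badly calibrated.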

Below is an example of a real video

Below is an example of a fake video

Can you tell the difference? :)

W261 Machine Learning at Scale (2020 Spring)

This course teaches the underlying principles required to develop scalable machine learning pipelines for structured and unstructured data at the petabyte scale. Students gain hands-on experience in Apache Hadoop and Apache Spark.

Highlights

Learn to recognize and apply key concepts in parallel computation and MapReduce design
Design stateless parallelizable implementations of core machine learning algorithms from scratch
Gain hands-on experience using Apache Hadoop and Apache Spark to analyze large datasets
OLS, Ridge, and Lasso with Spark RDDs
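The idea of a stateless, parallelizable implementation can be sketched in plain Python. Below, each partition computes a partial gradient for OLS (the map step) and the partial results are summed (the reduce step). This is a stand-in for the pattern only: it uses functools.reduce rather than Spark RDDs, and the function names and toy data are assumptions made for this example.

```python
from functools import reduce


def partition_gradient(partition, weights):
    """Map step: sum of squared-error gradients over one partition."""
    grad = [0.0] * len(weights)
    for features, target in partition:
        error = sum(w * f for w, f in zip(weights, features)) - target
        for j, f in enumerate(features):
            grad[j] += 2 * error * f
    return grad, len(partition)


def combine(a, b):
    """Reduce step: add partial gradients and row counts."""
    return [ga + gb for ga, gb in zip(a[0], b[0])], a[1] + b[1]


def gradient_descent(partitions, n_features, lr=0.05, steps=2000):
    """Batch gradient descent where each step is one map and one reduce."""
    weights = [0.0] * n_features
    for _ in range(steps):
        partials = (partition_gradient(p, weights) for p in partitions)
        grad_sum, n = reduce(combine, partials)
        weights = [w - lr * g / n for w, g in zip(weights, grad_sum)]
    return weights


# Toy data from y = 1 + 2x, split across two "partitions";
# the leading 1.0 in each feature vector is the bias term
partitions = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)],
]

weights = gradient_descent(partitions, n_features=2)
print(weights)
```

Because the combine step is associative and commutative, the partial gradients can be computed on separate workers and merged in any order, which is what makes the computation parallelizable.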

Final Project: Predict Flight Delay at Scale

Can weather data and basic airline metadata be used to predict whether or not a given flight will be delayed, at a level of accuracy that is practically useful? We used VectorAssembler from pyspark.ml.feature to prepare feature-engineered data for Linear Regression and Gradient Boosted Decision Tree models. Although the results for our final model were not promising (RMSE ~42.15 and ~78.2% accuracy), it was a good data science experience with my teammates Andrew Webb, Suzy Choi, and Pierce Coggins. To be specific, we got to experience the process of preparing our own data from scratch for the analysis, and the invaluable experience of explaining why it wasn't effective. (This is the hard part!)

notebook

presentation

W210 Capstone (2020 Summer)

This is it. I teamed up with Hong, Michelle, and Rachael to come up with AccessiPark Denver

Our Plan:

First, train a model to detect street-level accessibility obstacles like lamps, fire hydrants, and no-parking signs
Second, find areas at the zip-code level with the total count of accessibility parking signs
Third, use feature engineering, googlemaps, and data visualization to disseminate this information

Highlights

Train YOLOv5 on custom labelled images via Google Colab
Used googlemaps reverse_geocode to look up the zip code for each coordinate pair

                          
# You will need an active Google Maps API key.
# `david` is a pandas DataFrame with 'new_lat' and 'new_long' columns, e.g.
# david = pd.DataFrame(columns=['new_lat', 'new_long'])

import googlemaps

gmaps = googlemaps.Client(key='your API key')

# Reverse-geocode each coordinate pair and store its postal code
for i in range(len(david)):
    temp_result = gmaps.reverse_geocode(
        (david.loc[i, 'new_lat'], david.loc[i, 'new_long']),
        result_type="postal_code",
    )
    # Skip rows where no postal code was returned
    if temp_result:
        david.loc[i, 'zipcode'] = temp_result[0]['address_components'][0]['long_name']

For more details and code samples of all the feature engineering required for this project, please refer to the README.md on our project github