About Me
My name is Scott, and I love working with data.
Throughout my career as a water/wastewater engineer and consultant, I've cultivated a profound appreciation for the transformative potential of data. In my pursuit of continuous learning and growth, I've dedicated my free time to exploring innovative tools and resources for parsing, analyzing, and harnessing data effectively. This journey led me to share my insights and discoveries through blogging on Medium. Feel free to explore some of my projects showcased below!
Why Data Science?
My enthusiasm for data science stems from my commitment to using data in ways that have a real-world impact. I was initially drawn to civil engineering by a desire to design solutions that improve people's everyday lives, and my passion for that field remains strong. Data science gives me the opportunity to extend those solutions by:
- Crafting systems that streamline daily tasks and enhance user experiences
- Uncovering actionable insights that enable businesses to connect with their target audiences effectively
- Creating visually compelling and informative data visualizations that drive informed decision-making
Project Portfolio
Hover over each project to see more details, including links to code and an associated blog article summarizing the project.
Urban Environmental Audio Classification Using Mel Spectrograms
Language: Python
Description: Using mel spectrograms generated from audio files included in the UrbanSound8K dataset, trained a CNN to classify urban sounds.
Skills: Audio Classification, Time/Frequency Domain Representations of Audio Data, Mel Spectrograms, Fourier Transforms, Cross Validation, Convolutional Neural Networks (CNNs), Learning Rate Schedulers
Visualizations: Matplotlib, Seaborn, Librosa
Tools: NumPy, Pandas, Librosa, Scikit-learn, PIL, PyTorch
Categorical Clustering of Pittsburgh Car Accidents Using K-Modes
Language: Python
Description: Implemented a categorical clustering algorithm (k-modes) to cluster data on car accidents that occurred in Pittsburgh, PA, from 2010 to 2019. Conducted an EDA comparing the assigned clusters to one another.
Skills: Unsupervised Learning, Categorical Data Clustering (k-Modes), Elbow Method for Cluster # Selection, Chi-square Test, Exploratory Data Analysis
Visualizations: Matplotlib, Seaborn, GeoPandas
Tools: NumPy, Pandas, kmodes, SciPy, Scikit-learn
San Francisco Crime Classification
Language: Python
Description: Developed models to predict the category of a crime based on the geographical location where it occurred, using 12 years of historical crime data from San Francisco.
Skills: Multiclass Classification, Gradient-Boosting Algorithms, Random Forest, Fully Connected Neural Network, Clustering (Gaussian Mixture Models), Log-Odds Ratios, Word Embeddings (Word2Vec), Cross Validation, Ensembling
Visualizations: Matplotlib, Seaborn, Folium
Tools: NumPy, Pandas, Scikit-learn, PIL, LightGBM, XGBoost, CatBoost, TensorFlow, Keras
Fine-Tuning Language Models for Sentiment Analysis
Language: Python
Description: Developed sentiment classification models by fine-tuning pre-trained language models (BERT, RoBERTa, DistilBERT) using financial news statements.
Skills: Pre-Trained Language Models, Transformers, Sentiment Classification, Fine-Tuning, Evaluation Metrics (Accuracy / Precision / Recall / F1 Score), Natural Language Processing
Visualizations: Matplotlib, Seaborn
Tools: NumPy, Pandas, Scikit-learn, PyTorch, Transformers (BERT, DistilBERT, RoBERTa), NLTK
Predicting Energy Consumption (Part 1)
Language: Python
Description: Conducted an EDA of hourly energy consumption data and compared performance of ARIMA time series forecasting methods.
Skills: Exploratory Data Analysis, Regression, Seasonal Decomposition, Stationarity, Moving Averages, Autoregression & Autocorrelation, ARIMA Models, Augmented Dickey-Fuller Test
Visualizations: Matplotlib, Seaborn
Tools: NumPy, Pandas, statsmodels, Scikit-learn
Predicting Energy Consumption (Part 2)
Language: Python
Description: Expanded on the work completed in Part 1, evaluating advanced time series forecasting methods using hourly energy consumption data.
Skills: Regression, Simple Exponential Smoothing, Triple Exponential Smoothing (Holt-Winters Method), LSTM Neural Networks, Prophet
Visualizations: Matplotlib, Seaborn
Tools: NumPy, Pandas, statsmodels, Scikit-learn, Keras, TensorFlow, Prophet
Audio Analysis: How to Impress (or Disappoint) Pitchfork
Language: Python
Description: Using the Spotify Web API, extracted audio features for songs featured on albums listed in the Pitchfork album review dataset. Compared songs from the top and bottom 10% of albums in the dataset, sorted by review score, to identify specific features that may be correlated with a high album review score.
Skills: Web Scraping, Exploratory Data Analysis, Mann-Whitney U Test, Common Language Effect Size
Visualizations: Matplotlib, Seaborn
Tools: NumPy, Pandas, Spotipy, SciPy
Generating an Edgar Allan Poe-Styled Poem Using GPT-2
Language: Python
Description: Fine-tuned a pre-trained language model (GPT-2) using the complete poetical works of Edgar Allan Poe to generate poetry in the author’s style.
Skills: Pre-Trained Language Models, Web Scraping, Unsupervised Language Models, Text Generation, Transformers, Natural Language Processing
Tools: NumPy, Pandas, BeautifulSoup, Transformers (GPT-2), PyTorch
UFO Sighting Explorer
Language: R
Description: Developed an interactive web application that allows the user to explore data for over 60,000 UFO sightings reported in the U.S. from 1949 to 2013.
Skills: Web Application Development, Data Cleaning
Visualizations: Plotly
Tools: Shiny, dplyr