Note: Update the code (e.g., paths, configs) before using it. Some scripts are outdated and may not work, depending on your version of Python, R, etc.
Machine learning for management research
Python script for topic modeling: (1) pre-process texts from software domains, (2) load an LLM, (3) tune the HDBSCAN hyperparameters using a pseudo grid search, and (4) test the model in mini-batches, which suits big data (CUDA version) (created in 2024)
Python & Bash script to replicate and adjust Cross-Domain Data Augmentation (Li et al., 2022) to overcome data availability issues when fine-tuning an LLM (created in 2023)
You will need to (1) clone the repo, (2) update the code for your purpose, and then (3) run the script above
Python script to run ChatGPT prompts for brand recognition on Instagram advertising posts (Batched prompts) (created in 2023)
Python script to run sentiment & emotion analysis using an LLM (CUDA version) (created in 2023)
Python script to run stochastic gradient descent to derive coefficients of non-linear models (created in 2020)
Python script to run a convolutional neural network (CNN) for handwritten digit recognition on the MNIST database (created in 2020)
Python script to run various clustering algorithms (e.g., DBSCAN, Louvain) on the COVID-19 Open Research Dataset (created in 2020)
R script to perform data-driven LDA topic modeling (created in 2020)
Instagram-related projects
Python library to download metadata from Instagram
Python script to connect Python to a MySQL database
Python script to parse post data from JSON-formatted Instagram metadata (make sure you adjust the multi-threading configs & dataframe schema)
Python script to parse comment data from JSON-formatted Instagram metadata
Python script to run ChatGPT prompts for brand recognition on Instagram advertising posts (Batched prompts)
Python script to run sentiment & emotion analysis using an LLM (CUDA version)
Python script to measure degrees between nodes (e.g., influencers and users)
R script to perform data-driven LDA topic modeling
Python & Bash script to replicate and adjust Cross-Domain Data Augmentation (Li et al., 2022) to overcome data availability issues when fine-tuning an LLM
You will need to (1) clone the repo, (2) update the code for your purpose, and then (3) run the script above
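As a rough illustration of the "degrees between nodes" item in the list above, the sketch below counts the hops on the shortest path between two accounts with a breadth-first search. It assumes that "degrees" means degrees of separation in an undirected graph; the function name `degrees_between` and the sample edge list are hypothetical and are not taken from the repository.

```python
from collections import deque


def degrees_between(edges, source, target):
    """Return the number of hops on the shortest path, or None if unreachable."""
    # Build an undirected adjacency map from the edge list.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    if source not in adjacency or target not in adjacency:
        return None
    if source == target:
        return 0

    # Breadth-first search visits nodes in order of increasing distance,
    # so the first time we reach the target we have the shortest path.
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Hypothetical follower/interaction edges (illustrative data only).
edges = [
    ("influencer_a", "user_1"),
    ("user_1", "user_2"),
    ("user_2", "influencer_b"),
    ("user_3", "user_4"),
]
print(degrees_between(edges, "influencer_a", "influencer_b"))
```

For large graphs, a dedicated graph library would likely be a better fit than this plain-Python version.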
GitHub-related projects
SQL code to create main tables from the GH Archive dataset (via BigQuery)
SQL code to create variables from main tables created above
Python script to export variables from BigQuery to Google Cloud Storage
Python script for topic modeling: (1) pre-process texts from software domains, (2) load an LLM, (3) tune the HDBSCAN hyperparameters using a pseudo grid search, and (4) test the model in mini-batches, which suits big data (e.g., 11M texts) (CUDA version)
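The parameter search in step (3) of the topic-modeling script might look something like the sketch below. It assumes that "pseudo grid search" means scoring each parameter combination (or a random subset of them) with a user-supplied function instead of cross-validation; that reading is an assumption. The names `pseudo_grid_search` and `toy_score` and the grid values are hypothetical, and `toy_score` is only a stand-in for fitting HDBSCAN on the embeddings and returning a clustering quality score.

```python
import itertools
import random


def pseudo_grid_search(param_grid, score_fn, n_trials=None, seed=0):
    """Score parameter combinations and return the best (params, score)."""
    names = sorted(param_grid)
    combos = [
        dict(zip(names, values))
        for values in itertools.product(*(param_grid[name] for name in names))
    ]
    # Optionally evaluate only a random subset of the full grid.
    if n_trials is not None and n_trials < len(combos):
        combos = random.Random(seed).sample(combos, n_trials)

    best_params, best_score = None, float("-inf")
    for params in combos:
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(min_cluster_size, min_samples):
    # Stand-in score (assumption): in practice this would fit HDBSCAN with
    # these parameters and return a quality measure for the clustering.
    return -abs(min_cluster_size - 50) - abs(min_samples - 10)


# Hypothetical grid over two commonly tuned HDBSCAN parameters.
grid = {"min_cluster_size": [10, 50, 100], "min_samples": [5, 10, 20]}
best_params, best_score = pseudo_grid_search(grid, toy_score)
print(best_params, best_score)
```

Passing `n_trials` keeps the number of model fits manageable when each fit is expensive, at the cost of possibly missing the best combination.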
Patent, clinical trial, and VC network related projects
Python script to download and parse the USPTO patent dataset
Python script to parse clinical trial data from ClinicalTrials.gov
Python script to clean entity names from different datasets before matching
Python script to create network measures from an alliance dataset
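To give a sense of what cleaning entity names before matching can involve (see the name-cleaning item in the list above), here is a minimal sketch using only the standard library. The normalization rules, the `LEGAL_SUFFIXES` set, and the function name `clean_entity_name` are illustrative assumptions, not the rules used in the actual script.

```python
import re
import unicodedata

# Hypothetical list of legal-form suffixes to drop; extend as needed.
LEGAL_SUFFIXES = {
    "inc", "incorporated", "corp", "corporation", "co", "company",
    "ltd", "limited", "llc", "plc", "gmbh",
}


def clean_entity_name(name):
    """Normalize an organization name so that variants compare equal."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, spell out ampersands, and replace punctuation with spaces.
    text = text.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    # Strip trailing legal-form suffixes, but never empty the name entirely.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


for raw in ["Pfizer, Inc.", "PFIZER INC", "Johnson & Johnson Co., Ltd."]:
    print(raw, "->", clean_entity_name(raw))
```

After this step, exact or fuzzy matching across datasets operates on the cleaned strings instead of the raw names.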
Code for fun (e.g., co-authorship networks during Covid-19, replication)
(To-Do) Multi-agent system for research: Python script to experiment with a multi-agent system (MAS) based on AutoGen to run analyses
Visualization functions and diff-in-diff: Python script to play with data provided by Hakobyan & McLaren (2016)
SGD: Python script to run stochastic gradient descent to derive coefficients of non-linear models
CNN: Python script to run a convolutional neural network (CNN) for handwritten digit recognition on the MNIST database
Clustering: Python script to run various clustering algorithms (e.g., DBSCAN, Louvain) on the COVID-19 Open Research Dataset
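For the SGD item in the list above, the following sketch shows one way stochastic gradient descent can recover the coefficients of a non-linear model. The model form `y = a * exp(b * x)`, the starting values, the learning rate, and the synthetic noiseless data are all assumptions chosen for illustration; the actual script may use a different model and setup.

```python
import math
import random


def sgd_fit_exponential(xs, ys, lr=0.02, epochs=500, seed=0):
    """Fit y = a * exp(b * x) by SGD on the per-sample loss 0.5 * error**2."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # arbitrary starting values (assumption)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit samples in a new random order each epoch
        for i in indices:
            growth = math.exp(b * xs[i])
            error = a * growth - ys[i]
            # Partial derivatives of 0.5 * error**2 with respect to a and b.
            grad_a = error * growth
            grad_b = error * a * xs[i] * growth
            a -= lr * grad_a
            b -= lr * grad_b
    return a, b


# Synthetic, noiseless data generated from a = 2.0 and b = 0.5.
xs = [i / 19 for i in range(20)]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_hat, b_hat = sgd_fit_exponential(xs, ys)
print(round(a_hat, 3), round(b_hat, 3))
```

With noisy data or a wider range of `x`, the learning rate would likely need to be smaller or decayed over time to keep the updates stable.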