TT 2026 · Department of Economics

Applied Machine Learning & AI for Economics

Elodie Chervin
·
Daniel Barbosa

About the Course

This course bridges the gap between traditional econometrics and modern data science, offering both a theoretical understanding of machine learning and the practical skills to apply it. We examine how techniques like supervised and unsupervised learning, natural language processing, and the emerging field of Causal ML allow economists to tackle large, complex datasets. We also explore the transformative role of AI and Large Language Models in social science research.

🎓
Level
3rd-year Economics undergraduates
🐍
Language
Python (no prior experience required)
📝
Assessment
Weekly Questions (40%) + Presentations (30%) + Report (30%)

Prerequisites

Students should have completed a course in econometrics or statistics covering multivariate regression and hypothesis testing. No prior programming experience is required, but students are expected to invest time and effort in learning the basics of Python, both in and outside class.

Weekly Schedule

00

Pre-Course Preparation


Mandatory Pre-Readings

  • James, G., Witten, D., Hastie, T. & Tibshirani, R. (2021). An Introduction to Statistical Learning. (Download ISL PDF). Read Chapter 1 and Chapter 2 (up to section 2.1.3).
  • Cunningham, S. (2021). Causal Inference: The Mixtape. (Read Chapter 2 Online). Focus on Sections 2.1–2.4, 2.7–2.17, and 2.24–2.25.
  • ISL, Chapter 3 (Sections 3.1–3.4 inclusive).

Questions

  1. Prediction vs. Inference: Explain the fundamental difference between Prediction (forecasting a future outcome) and Inference (understanding a causal effect). Provide one economic example where pure prediction is sufficient, and one where causal inference is required.
  2. Parametric vs. Non-parametric: Define "parametric" and "non-parametric" models within a statistical context. Why might an economist actively choose a rigid parametric model over a flexible non-parametric one?
  3. Multiple Variable Regression: In a multivariate regression model, we interpret a coefficient as the effect of X "holding all other variables constant." At a high level, why does this interpretation become practically difficult when your dataset contains hundreds of overlapping variables?
  4. ML in real life: Consider a real-world scenario where a firm uses a Machine Learning algorithm trained on historical data to automatically screen loan applications or job candidates. What are the economic or ethical risks of blindly deploying this model without understanding its internal logic?
  5. OLS Geometry: Using ISL Section 3.4, show that in the case of simple linear regression, the least squares line always passes through the point (x̄, ȳ).
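Question 3 can be made concrete with a short simulation. The sketch below (made-up data, illustrative only) builds two nearly identical regressors; OLS then struggles to attribute the effect to either variable individually, even though their combined effect is well estimated.

```python
import numpy as np

# Illustrative sketch: when two regressors are almost copies of each other,
# "holding the other constant" barely ever happens in the data, and the
# individual OLS coefficients become very unstable.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # almost a copy of x1
y = x1 + x2 + rng.normal(size=n)      # true combined effect is 2

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The individual coefficients beta[1] and beta[2] vary wildly across
# simulated samples, but their sum stays close to the true value of 2.
print(beta[1], beta[2], beta[1] + beta[2])
```

Re-running with different seeds shows the individual coefficients swinging while their sum barely moves, which is exactly the practical difficulty the question points at.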
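Question 5 asks for an algebraic argument, but the claim is easy to check numerically. A minimal sketch on simulated data (the coefficients 2 and 3 are arbitrary choices):

```python
import numpy as np

# Numeric check (not a proof): fit simple OLS and verify the fitted line
# passes through (x-bar, y-bar).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

# Least-squares slope and intercept (ISL Section 3.1 formulas)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()   # the intercept formula already encodes the result

y_hat_at_xbar = beta0 + beta1 * x.mean()
print(np.isclose(y_hat_at_xbar, y.mean()))  # True
```

The intercept formula β̂₀ = ȳ − β̂₁x̄ is the whole argument: evaluating the line at x = x̄ returns ȳ by construction.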
01

Regression & Regularisation

Download Slides (PDF)
02

Classification & Validation

Download Slides (PDF)

Preparation Questions

  1. The "Naive" Assumption in NLP: Consider using a Multinomial Naive Bayes classifier for Sentiment Analysis (predicting if a tweet is positive or negative). What assumption does the "Naive" in Naive Bayes refer to? Provide a real-world example of a short phrase or sentence where this "naive" assumption dramatically fails, leading the classifier to make the wrong prediction.
  2. Naive Bayes Calculation: Assume the following likelihoods for each word being part of a positive or negative movie review, and equal prior probabilities for each class.
    Word      P(word | pos)   P(word | neg)
    I              0.09            0.16
    always         0.07            0.06
    like           0.29            0.06
    foreign        0.04            0.15
    films          0.08            0.11
    What class will Naive Bayes assign to the sentence "I always like foreign films"?
  3. The Cost of Being Wrong in Medical AI: Consider an AI system deployed in a hospital to screen patients for a rare but aggressive form of cancer. The model makes two types of mistakes: False Positives (an unnecessary, stressful biopsy) and False Negatives (sending a sick patient home without treatment). Are these errors equally costly? Why does a single overarching "Accuracy" metric fail to capture the real-world utility and ethical implications of this AI system? If you were the developer, how would you mathematically "tune" the classifier to prioritize saving lives, even if it means more false alarms?
  4. LLMs and Data Leakage: Consider a tech company training a new Large Language Model (LLM) to act as a coding assistant. They evaluate the model on a test set of challenging 2024 coding problems and achieve a 95% success rate. However, the model was pre-trained on the entire internet, inadvertently including the solutions to those very problems. If this model is deployed to users writing completely new code, what will happen? Why is it mathematically dangerous to evaluate an AI on data it has already seen?
  5. The Bootstrap Intuition: You have trained a complex statistical model to predict regional housing prices, but your manager asks: "What is the margin of error for these predictions?" Unlike simple linear regression, complex non-linear models don't have a neat mathematical formula for standard errors. Without collecting more data, how could you use the data you already have, along with computing power, to simulate "new" datasets and estimate this uncertainty?
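A hand calculation answers Question 2, but it is easy to check in Python. The sketch below hard-codes the likelihood table from the question and compares log-scores (with equal priors, the prior terms cancel out of the comparison):

```python
import math

# Naive Bayes score for each class: sum of log-likelihoods of the words.
# Equal priors mean the comparison depends only on the likelihoods.
p_pos = {"I": 0.09, "always": 0.07, "like": 0.29, "foreign": 0.04, "films": 0.08}
p_neg = {"I": 0.16, "always": 0.06, "like": 0.06, "foreign": 0.15, "films": 0.11}

sentence = ["I", "always", "like", "foreign", "films"]
log_pos = sum(math.log(p_pos[w]) for w in sentence)
log_neg = sum(math.log(p_neg[w]) for w in sentence)

prediction = "positive" if log_pos > log_neg else "negative"
print(prediction)
```

Working in log-space avoids multiplying many small probabilities, which underflows quickly on longer documents.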
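Question 3 hints at threshold tuning. The toy sketch below uses invented probabilities rather than a real model, but it shows the mechanism: lowering the decision cut-off trades false negatives for false positives.

```python
import numpy as np

# Invented screening data: 1 = has cancer, and the model's predicted
# probability of cancer for each patient.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
p_cancer = np.array([0.9, 0.45, 0.3, 0.4, 0.2, 0.1, 0.35, 0.05])

def confusion(threshold):
    """Count (false negatives, false positives) at a given cut-off."""
    y_pred = (p_cancer >= threshold).astype(int)
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # sick patients sent home
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # unnecessary biopsies
    return fn, fp

print(confusion(0.5))   # default threshold: (2, 0) -- two patients missed
print(confusion(0.25))  # lower threshold:   (0, 2) -- no misses, more alarms
```

Choosing the threshold is an economic decision about the relative cost of each error, not a purely statistical one.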
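Question 5 describes the bootstrap. A minimal sketch, assuming fabricated house-price data and using the sample median as a stand-in for the "complex model" (the same recipe applies to any fitted statistic):

```python
import numpy as np

# Bootstrap: resample the data with replacement, recompute the statistic
# each time, and use the spread of those recomputed values as an estimate
# of its standard error.
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12.5, sigma=0.4, size=200)  # fake regional prices

B = 2000
boot_medians = np.empty(B)
for b in range(B):
    resample = rng.choice(prices, size=prices.size, replace=True)
    boot_medians[b] = np.median(resample)

se_hat = boot_medians.std(ddof=1)  # bootstrap standard error of the median
print(se_hat)
```

No new data is collected; computing power substitutes for the repeated sampling that a standard-error formula implicitly assumes.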
03

Trees & Ensembles

04

Unsupervised Learning

05

Causal ML

06

Text as Data (NLP)

07

Deep Learning & AI Foundations

08

Large Language Models in Economics

Assessment

Weekly Questions (40%): Conceptual and practical questions assigned each week to consolidate learning.

Presentations (30%): Students will present a mini data project applying the course methods (two presentations per week during Weeks 5–8).

Report (30%): A written report due alongside the presentation, containing all reproducible Python code used for the application.