The rise of massive self-supervised (pre-trained) models has transformed various data-driven fields such as natural language processing, computer vision, robotics, and medical imaging. This advanced graduate course aims to provide a holistic view of the issues related to these models: We will start with the history of how we got here, and then delve into the latest success stories. We will then focus on the implications of these technologies: social harms, security risks, legal issues, and environmental impacts. The class ends with reflections on the future implications of this trajectory.

Prerequisites: Students must have extensive experience with deep learning, machine learning, artificial intelligence, and natural language processing. Familiarity with linear algebra, statistics, and probability is necessary, as is experience with the design and implementation of learning models (via one of the learning libraries, such as PyTorch, TensorFlow, Keras, JAX). Students must be comfortable with reading papers and extracting their key concepts and ideas.



For much of the semester, each class will involve the presentation and discussion of recent important papers on pre-trained (self-supervised) statistical models. The objective of the course is to instill a holistic view of the latest developments in various fields (NLP, computer vision, biology, etc.), and to help the participants understand their broad implications.


Each paper will be presented by a group of students each with an assigned "role". This role defines the lens through which they read the paper and determines what they prepare for the group in-class discussion. Here are the roles we will experiment with:

The presentation of each role can be done individually or in groups of ≤3. If done as a group, you and your partners should decide how to divide the work equally for a given paper presentation session.

Who presents what role and when? At the beginning of the semester, students will be divided into two halves, one half presenting on Tuesdays and the other on Thursdays. In a given class session, the students in the presenting half will each be given a random role (determined the week before, at the end of class). Each role group, irrespective of how many students are assigned to it, should stay within the time budget specified for its role. You're encouraged to have slides for your role, though they are not mandatory. If you make slides, I recommend no more than 7-10 to make sure you stay within the time budget.

What slides? To minimize time spent context switching or fighting with screen sharing/projector dongles, we will have a shared pool of slides (hosted on Google Slides and shared a week before class). Each role group is encouraged to title its slides with "[role emoji]: [student name]" (as in "🏺: Jane,John") so that the slides are quickly identified during the session. If you choose to make slides, you're not expected to prepare a full-blown presentation; slides are meant as a visual aid to facilitate the presentation.


If you aren't in the presenting group during a given class period:

Before the class: Please provide a short answer to a prompt posed by the instructor a few days before the class.

The beginning of each class: Come up with one question about the paper (either something you're confused about or something you'd like to hear discussed more).

During the class: While only a subset of the class will participate in presenting a paper, the rest of the class is expected to come ready to participate in the discussions.


The current class schedule is below (subject to change):

Date Topic Course Materials
#1 - Tue Aug 30 Course overview, plan and expectations Slides: PPTX, PDF
#2 - Thu Sept 1 Preliminaries: Past, Architectures, Pre-training, Capabilities Slides: PPTX, PDF

Additional Reading(s):
  1. Attention Is All You Need
  2. The Annotated Transformer
  3. The Illustrated Transformer
#3 - Tue Sept 6 Pretraining Language Models Slides: PPTX, PDF
Main Reading: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Additional Reading(s):
  1. The Illustrated BERT, ELMo, and co
  2. RoBERTa: A Robustly Optimized BERT Pretraining Approach
  3. Exploring the limits of transfer learning with a unified text-to-text transformer
  4. BART: Denoising Sequence-to-Sequence Pre-training
#4 - Thu Sept 8 Pretraining Language Models Slides: PPTX, PDF
Main Reading: Language Models are Few-Shot Learners
Additional Reading(s):
  1. Language Models are Unsupervised Multitask Learners
  2. OPT: Open Pre-trained Transformer Language Models
  3. GPT-NeoX-20B: An Open-Source Autoregressive Language Model
  4. The Illustrated GPT-2
#5 - Tue Sept 13 Architectures Slides: PPTX, PDF
Main Reading: What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
Additional Reading(s):
  1. Unifying Language Learning Paradigms
  2. Do transformer modifications transfer across implementations and applications?
  3. Staged Training for Transformer Language Models
#6 - Thu Sept 15 In-context Learning Slides: PPTX, PDF
Main Reading: Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
Additional Reading(s):
  1. Reframing Instructional Prompts to GPTk's Language
  2. On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model
  3. What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
  4. In-context Learning and Induction Heads
#7 - Tue Sept 20 Limits of In-context Learning Slides: PPTX, PDF
Main Reading: Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
Additional Reading(s):
  1. Data Distributional Properties Drive Emergent In-Context Learning in Transformers
  2. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  3. How does in-context learning work? A framework for understanding the differences from traditional supervised learning
#8 - Thu Sept 22 Limits of In-context Learning Slides: PPTX, PDF
Main Reading: Impact of Pretraining Term Frequencies on Few-Shot Reasoning
Additional Reading(s):
  1. Do Prompt-Based Models Really Understand the Meaning of Their Prompts?
  2. How transferable are features in deep neural networks?
  3. Frequency Effects on Syntactic Rule Learning in Transformers
#9 - Tue Sept 27 Social Harms: Bias Slides: PPTX, PDF
Main Reading: Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models
Additional Reading(s):
  1. UnQovering Stereotypical Biases via Underspecified Questions
  2. CommunityLM: Probing Partisan Worldviews from Language Models
  3. Robots Enact Malignant Stereotypes
  4. Fewer Errors, but More Stereotypes? The Effect of Model Size on Gender Bias
  5. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
  6. Red Teaming Language Models with Language Models
#10 - Thu Sept 29 Social Harms: Toxicity Slides
Main Reading: RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Additional Reading(s):
  1. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
  2. TruthfulQA: Measuring How Models Mimic Human Falsehoods
  3. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Fri Sept 30 Project proposal submission deadline [ proposal document ]
#11 - Tue Oct 4 Memorization and Privacy Slides
Main Reading: Quantifying Memorization Across Neural Language Models
Additional Reading(s):
  1. Extracting Training Data from Large Language Models
  2. Counterfactual Memorization in Neural Language Models
  3. Data Contamination: From Memorization to Exploitation
  4. Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
#12 - Thu Oct 6 Memorization and Privacy Slides
Main Reading: Deduplicating Training Data Mitigates Privacy Risks in Language Models
Additional Reading(s):
  1. Differentially Private Fine-tuning of Language Models
  2. Can a Model Be Differentially Private and Fair?
  3. Large Language Models Can Be Strong Differentially Private Learners
  4. What Does it Mean for a Language Model to Preserve Privacy?
#13 - Tue Oct 11 External Speaker: Anjalie Field
#14 - Thu Oct 13 Project Proposal Presentation Slides
#15 - Tue Oct 18 Pretraining Coding Models Slides
Main Reading: Evaluating Large Language Models Trained on Code
Additional Reading(s):
  1. Competition-Level Code Generation with AlphaCode
  2. InCoder: A Generative Model for Code Infilling and Synthesis
  3. Solving Quantitative Reasoning Problems with Language Models
  4. Copilot’s impact on developer productivity
#16 - Thu Oct 20 No Class - Fall Break
#17 - Tue Oct 25 Pretraining Vision-Language Models Slides
Main Reading: Learning Transferable Visual Models From Natural Language Supervision
Additional Reading(s):
  1. MERLOT: Multimodal Neural Script Knowledge Models
  2. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
  3. LiT: Zero-Shot Transfer with Locked-image text Tuning
  4. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  5. Emerging Properties in Self-Supervised Vision Transformers
  6. CM3: A Causal Masked Multimodal Model of the Internet
#18 - Thu Oct 27 Pretraining Vision-Language Models Slides
Main Reading: Hierarchical Text-Conditional Image Generation with CLIP Latents
Additional Reading(s):
  1. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
  2. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
  3. Discovering the Hidden Vocabulary of DALLE-2
  4. High-Resolution Image Synthesis with Latent Diffusion Models
#19 - Tue Nov 1 Pretraining Speech/Audio Models Slides
Main Reading: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Additional Reading(s):
  1. Jukebox: A Generative Model for Music
  2. mSLAM: Massively multilingual joint pre-training for speech and text
  3. Towards Learning Universal Audio Representations
  4. AudioLM: a Language Modeling Approach to Audio Generation
  5. Robust Speech Recognition via Large-Scale Weak Supervision
#20 - Thu Nov 3 Retrieval from Memory Slides
Main Reading: REALM: Retrieval-Augmented Language Model Pre-Training
Additional Reading(s):
  1. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  2. Unsupervised Dense Information Retrieval with Contrastive Learning
  3. Improving language models by retrieving from trillions of tokens
  4. Few-shot Learning with Retrieval Augmented Language Models
  5. Relational Memory Augmented Language Models
#21 - Tue Nov 8 Evolving Memory Slides
Main Reading: Memorizing Transformers
Additional Reading(s):
  1. Fast Model Editing at Scale
  2. SERAC: Memory-based Model Editing at Scale
  3. Towards Teachable Reasoning Systems
#22 - Thu Nov 10 External Speaker: Jared Kaplan
#23 - Tue Nov 15 Midway Project Presentation Slides
#24 - Thu Nov 17 Generalism Slides
Main Reading: A Generalist Agent
Additional Reading(s):
  1. UnifiedQA: Crossing Format Boundaries With a Single QA System
  2. Perceiver IO: A General Architecture for Structured Inputs & Outputs
  3. Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks
  4. Multitask Prompted Training Enables Zero-Shot Task Generalization
  5. Finetuned Language Models Are Zero-Shot Learners
  6. BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning
#25 - Tue Nov 22 No Class - Fall Recess
#26 - Thu Nov 24 No Class - Fall Recess
#27 - Tue Nov 29 Calibration Slides
Main Reading: Language Models (Mostly) Know What They Know
Additional Reading(s):
  1. How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering
  2. An Information-theoretic Approach to Prompt Engineering Without Ground Truth Labels
#28 - Thu Dec 1 Environmental Impact Slides
Main Reading: Energy and Policy Considerations for Deep Learning in NLP
Additional Reading(s):
  1. Foundation Models report: Environment (section 5.3)
  2. Green AI
#29 - Tue Dec 6 External Speaker: Colin Raffel
#30 - Thu Dec 8 Last class: open-ended reflections on power, limitations, and future of self-supervised statistical models Slides

Additional Reading(s):
  1. What Do NLP Researchers Believe? Results of the NLP Community Metasurvey
  2. Climbing towards NLU: On meaning, form, and understanding in the age of data
  3. Is it possible for language models to achieve language understanding?
  4. AI And The Limits Of Language
  5. Could A Computer Ever Be Conscious
Fri Dec 9 Final report submission deadline
Thu Dec 22, 9 AM - 12 PM * Final project presentation Slides


All students in the class will write a "mini-paper" as a final project. The topic of this project is open-ended. For example, the project can focus on demonstrating systemic limitations of prior work or on suggesting improvements to methods or benchmarks discussed in the class.

Relevant Resources

Here are several resources available for free:

Besides these resources, we will try our best to satisfy individual needs through discussion.


Since this is a discussion class, it's especially important that we respect everyone's perspective and input. In particular, I value the perspectives of individuals from all backgrounds reflecting the diversity of our students. I will strive to make this classroom an inclusive space for all students. Please let me know if there is anything I can do to improve.

This course will have a zero-tolerance philosophy regarding plagiarism or other forms of cheating, and incidents of academic dishonesty will be reported. A student who has doubts about how the Honor Code applies to this course should obtain specific guidance from the course instructor before submitting the respective assignment.

The Johns Hopkins University is committed to equal opportunity for its faculty, staff, and students. To that end, the university does not discriminate on the basis of sex, gender, marital status, pregnancy, race, color, ethnicity, national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, military status, immigration status or other legally protected characteristic. The University's Discrimination and Harassment Policy and Procedures provides information on how to report or file a complaint of discrimination or harassment based on any of the protected statuses listed in the earlier sentence, and the University’s prompt and equitable response to such complaints.

Relevant Courses