CSCI 601.771: Self-supervised Models

Instructor

Logistics

Classes: on Tuesday/Thursday 1:30 - 2:45 pm EST (Malone 228)
Office hours: by appointment
Contact: If you have any questions about the course email the instructor.
Class Structure: The class will be in-person. Each session will involve the presentation and/or discussion of recent important papers on the self-supervised statistical models. The course also involves a project.
Coursework: Your grade is based on three activities: paper presentation by undertaking your role and present it in a clear and compelling way (33%), in-class participation in discussions for any class that you're not presenting (33%), the class project (33%). Everyone gets one 1% bonus point!
Changes: The professor reserves the right to make changes to the syllabus or project due dates. These changes will be announced as early as possible.
Recordings: Recorded versions of the sessions will be available online after each class on Canvas.
Attendance and late work: You can miss 3 sessions. Additionally, you get 2 sessions of presentation relief (i.e., you can skip 2 presentation assignments) to accommodate any deadlines you might have. If you decide to use these, make sure to email the instructor in advance (at least two days). Beyond that limit, you'd lose the attendance/presentation credits for any class you miss. There's really no way to accept late work.
COVID: Students who report symptoms associated with COVID-19 are expected not to attend class and to isolate themselves for at least five days and until they have been symptom-free for 24 hours.

Content

For much of the semester, each class will involve the presentation and discussion of recent important papers on pre-trained (self-supervised) statistical models. The objective of the course is to instill a holistic view of the latest developments in various fields (NLP, computer vision, biology. etc.), and help the participants understand their broad implications.

Presenters

Each paper will be presented by a group of students each with an assigned "role". This role defines the lens through which they read the paper and determines what they prepare for the group in-class discussion. Here are the roles we will experiment with:

Stakeholder ✍️: Act as if you're the authors of this paper. Describes their motivation, problem definition, method and experimental findings of this paper. (time budget: 15 minutes)
Scientific Reviewer 🔎: Act like you're a reviewer of this work. Be critical of the work, though not necessarily negative. You can follow the guidelines for NeurIPS reviewers (under "Review Content"), taking note of the example reviews included therein. (time budget: 10 minutes)
Empiricist 👩🏽‍🔬: Implement something related to the paper. This could be a small part of the paper on a small dataset or toy problem. If you download an existing code or model, you should try least (a toy version of a) the experiments from the paper. Prepare to share your core code and some empirical intuitions on your implementation (what worked, what didn't, what you think is the best way to do it, etc.) (time budget: 10 minutes)
Archaeologist 🏺: Determine where this paper sits in the context of previous and subsequent work. Find and report on one prior paper that substantially influenced the current paper and one newer paper that cites this current paper. (time budget: 10 minutes)
Visionary 🔭: Propose an imaginary follow-up research project or a new application -- not just based on the current but only possible due to the existence and success of the current paper. (time budget: 10 minutes)

The presentation of each role can be done individually or in groups of ≤3. If done as a group, you and your partner should decide how to equally divide the work for a given paper presentation session.

Who presents what role and when? At the beginning of the semester, students will be divided into two halves, one half presenting on Tuesdays and the other on Thursdays. In a given class session, the students in the presenting half will each be given a random role (determined the week before at the end of the classes). Each role group (irrespective of how many students are assigned to it) should aim for specified time budgets for each role. You're encouraged to have slides for your role, though it is not mandatory. If you do so, I would recommend less than 7-10 slides to make sure stay within our time budget.

What slides? To minimize time spent context switching or fighting with screen sharing/projector dongles, we will have a shared pool of slides (hosted on Google presentations, will be shared a week before). Each role group are encouraged to title their slides with "[role emoji]: [student name]" (as in "🏺: Jane,John") so that the slides are quickly identified during the session. If you choose to make slides, you're not expected to prepare a full-blown presentation -- they're encouraged for visual aid and facilitating the presentation.

Non-Presenters

If you aren't in the presenting group during a given class period:

Before the class Please provide a short answer to a prompt posed by the instructor a few days before the class.

The beginning of each class Come up with one question about the paper (either something you're confused about or something you'd like to hear discussed more).

During the class While only a subset of the class will participate in presenting a paper, the rest of the class is expected to come to class ready to participate in the discussions.

Schedule

The current class schedule is below (subject to change):

Date	Topic	Course Materials
#1 - Tue Aug 30	Course overview, plan and expectations	Slides: PPTX, PDF
#2 - Thu Sept 1	Preliminaries: Past, Architectures, Pre-training, Capabilities	Slides: PPTX, PDF Additional Reading(s): Attention Is All You Need The Annotated Transformer The Illustrated Transformer
#3 - Tue Sept 6	Pretraining Language Models	Slides: PPTX, PDF Main Reading: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Additional Reading(s): The Illustrated BERT, ELMo, and co RoBERTa: A Robustly Optimized BERT Pretraining Approach Exploring the limits of transfer learning with a unified text-to-text transformer BART: Denoising Sequence-to-Sequence Pre-training
#4 - Thu Sept 8	Pretraining Language Models	Slides: PPTX, PDF Main Reading: Language Models are Few-Shot Learners Additional Reading(s): Language Models are Unsupervised Multitask Learners OPT: Open Pre-trained Transformer Language Models GPT-NeoX-20B: An Open-Source Autoregressive Language Model The Illustrated GPT-2
#5 - Tue Sept 13	Architectures	Slides: PPTX, PDF Main Reading: What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? Additional Reading(s): Unifying Language Learning Paradigms Do transformer modifications transfer across implementations and applications? Staged Training for Transformer Language Models
#6 - Thu Sept 15	In-context Learning	Slides: PPTX, PDF Main Reading: Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity Additional Reading(s): Reframing Instructional Prompts to GPTk's Language On the Effect of Pretraining Corpora on In-context Learning by a Large-scale Language Model What Can Transformers Learn In-Context? A Case Study of Simple Function Classes In-context Learning and Induction Heads
#7 - Tue Sept 20	Limits of In-context Learning	Slides: PPTX, PDF Main Reading: Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? Additional Reading(s): Data Distributional Properties Drive Emergent In-Context Learning in Transformers Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models How does in-context learning work? A framework for understanding the differences from traditional supervised learning
#8 - Thu Sept 22	Limits of In-context Learning	Slides: PPTX, PDF Main Reading: Impact of Pretraining Term Frequencies on Few-Shot Reasoning Additional Reading(s): Do Prompt-Based Models Really Understand the Meaning of Their Prompts? How transferable are features in deep neural networks? Frequency Effects on Syntactic Rule Learning in Transformers
#9 - Tue Sept 27	Social Harms: Bias	Slides: PPTX, PDF Main Reading: Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models Additional Reading(s): UnQovering Stereotypical Biases via Underspecified Questions. CommunityLM: Probing Partisan Worldviews from Language Models. Robots Enact Malignant Stereotypes. Fewer Errors, but More Stereotypes? The Effect of Model Size on Gender Bias. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP Red Teaming Language Models with Language Models
#10 - Thu Sept 29	Social Harms: Toxicity	Slides: PPTX, PDF Main Reading: RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models Additional Reading(s): On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? TruthfulQA: Measuring How Models Mimic Human Falsehoods Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Fri Sept 30	Project proposal submission deadline [ proposal document ]
#11 - Tue Oct 4	Memorization and Privacy	Slides: PPTX, PDF Main Reading: Quantifying Memorization Across Neural Language Models Additional Reading(s): Extracting Training Data from Large Language Models Counterfactual Memorization in Neural Language Models Data Contamination: From Memorization to Exploitation Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
#12 - Thu Oct 6	Memorization and Privacy	Slides: PPTX, PDF Main Reading: Deduplicating Training Data Mitigates Privacy Risks in Language Models Additional Reading(s): Differentially Private Fine-tuning of Language Models Can a Model Be Differentially Private and Fair? Large Language Models Can Be Strong Differentially Private Learners What Does it Mean for a Language Model to Preserve Privacy?
#13 - Tue Oct 11	External Speaker: Anjalie Field (Stanford)	Social Applications of Pre-trained Language Models [video - slides]
#14 - Thu Oct 13	Project Proposal Presentation	Slides
#15 - Tue Oct 18	Pretraining Coding Models	Slides: PPTX, PDF Main Reading: Evaluating Large Language Models Trained on Code Additional Reading(s): Competition-Level Code Generation with AlphaCode InCoder: A Generative Model for Code Infilling and Synthesis Solving Quantitative Reasoning Problems with Language Models Copilot’s impact on developer productivity Grounded Copilot: How Programmers Interact with Code-Generating Models Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models
#16 - Thu Oct 20	No Class - Fall Break
#17 - Tue Oct 25	Pretraining Vision-Language Models	Slides: PPTX, PDF Main Reading: Learning Transferable Visual Models From Natural Language Supervision Additional Reading(s): An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision Emerging Properties in Self-Supervised Vision Transformers MERLOT: Multimodal Neural Script Knowledge Models LiT: Zero-Shot Transfer with Locked-image text Tuning An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale CM3: A Causal Masked Multimodal Model of the Internet
#18 - Thu Oct 27	Pretraining Vision-Language Models	Slides: PPTX, PDF Main Reading: Hierarchical Text-Conditional Image Generation with CLIP Latents Additional Reading(s): Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks Discovering the Hidden Vocabulary of DALLE-2 High-Resolution Image Synthesis with Latent Diffusion Models
#19 - Tue Nov 1	External Speaker: Jason Wei (Google)	Emergence and reasoning in large language models [video - slides]
#20 - Thu Nov 3	Pretraining Speech/Audio Models	Slides: PPTX, PDF Main Reading: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations Additional Reading(s): Jukebox: A Generative Model for Music mSLAM: Massively multilingual joint pre-training for speech and text Towards Learning Universal Audio Representations AudioLM: a Language Modeling Approach to Audio Generation Robust Speech Recognition via Large-Scale Weak Supervision ESPnet: End-to-End Speech Processing Toolkit Unsupervised Speech Recognition Masked Autoencoders that Listen Towards End-to-end Unsupervised Speech Recognition SSAST: Self-Supervised Audio Spectrogram Transformer
#21 - Tue Nov 8	Retrieval from Memory	Slides: PPTX, PDF Main Reading: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks Additional Reading(s): REALM: Retrieval-Augmented Language Model Pre-Training Unsupervised Dense Information Retrieval with Contrastive Learning Improving language models by retrieving from trillions of tokens Few-shot Learning with Retrieval Augmented Language Models Relational Memory Augmented Language Models An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks
#22 - Thu Nov 10	External Speaker: Jared Kaplan
#23 - Tue Nov 15	Midway Project Presentation	Slides
#24 - Thu Nov 17	Calibration	Slides: PPTX, PDF Main Reading: Language Models (Mostly) Know What They Know Additional Reading(s): How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering An Information-theoretic Approach to Prompt Engineering Without Ground Truth Labels
#25 - Tue Nov 22	No Class - Fall Recess
#26 - Thu Nov 24	No Class - Fall Recess
#27 - Tue Nov 29	Environmental Impact	Slides: PPTX, PDF Main Reading: Measuring the Carbon Intensity of AI in Cloud Instances Additional Reading(s): Energy and Policy Considerations for Deep Learning in NLP Foundation Models report: Environment (section 5.3) Green AI
#28 - Thu Dec 1	Last class: open-ended reflections on power, limitations, and future of self-supervised statistical models	Slides: PPTX, PDF Additional Reading(s): What Do NLP Researchers Believe? Results of the NLP Community Metasurvey Climbing towards NLU: On meaning, form, and understanding in the age of data Is it possible for language models to achieve language understanding? AI And The Limits Of Language Could A Computer Ever Be Conscious DALLE2 fails to reliably capture common syntactic processes Inverse Scaling Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality
#29 - Tue Dec 6	External Speaker: Colin Raffel	Building Better Language Models [video - slides]
#30 - Thu Dec 8	Final project presentation (I)	Slides
Thu Dec 22 9 AM - 12 PM ^*	Final project presentation (II)	Slides
Thu Dec 22, 9 AM	Final report submission deadline

CSCI 601.771: Self-supervised Statistical Models

Johns Hopkins University - Fall 2022

Instructor

Logistics

Content

Presenters

Non-Presenters

Schedule

Additional topics

Project

Relevant Resources

Conduct

Relevant Courses