Large self-supervised (pre-trained) models have transformed various data-driven fields such as natural language processing (NLP). In this course, students will gain a thorough introduction to self-supervised learning techniques for NLP applications. Through lectures, assignments, and a final project, students will learn the necessary skills to design, implement, and understand their own self-supervised neural network models, using the PyTorch framework.
Note: This course is different from 601.771 (offered in the fall semesters), which involves reading recent papers and is geared toward grad students who want to specialize in the latest developments in self-supervised models.
Prerequisites: (1) Data Structures (601.226). (2) Python/PyTorch: all class assignments will be in Python/PyTorch. If you don’t know Python or PyTorch but have experience in other programming languages (Java, C++, etc.), you can probably pick up Python/PyTorch pretty quickly. (3) Calculus and linear algebra: you should be comfortable with matrix operations (matrix multiplication, transpose, inverse, dot product, gradients). (4) Probability: basic probability properties (conditionals, marginals, mean, standard deviation) and distributions (Gaussian, categorical). (5) Background in Natural Language Processing & Machine Learning, or having finished one of the relevant courses such as Machine Learning (475.675), Artificial Intelligence (464.664), Natural Language Processing (600.465), Machine Translation (600.468), or Introduction to HLT (601.467/667).
Relevant Courses at Hopkins: This course has some overlap with "Natural Language Processing" (EN.601/665), "Introduction to Human Language Technology" (601.467/667), and "Artificial Agents" (EN.601.470/670), though the courses have different focuses.
The homework is your opportunity to practice doing the thing. The lectures and office hours hopefully provide good intuition and motivation and justification for the skills we want you to develop, but the best way to develop those skills is by trying to solve the problems yourself. The practice is far more important than the solution.
The course has ~12 weekly assignments which will improve both your theoretical understanding and your practical skills. All assignments contain both written questions and programming parts (mainly in Python). They will be released on this website, and submissions should be uploaded to Gradescope.
Here is a tentative list of topics for the assignments:
# | Focus |
---|---|
#1 | Algebra, calculus, probability recap, implementing Skip-Gram model, classification, evaluation, comparison to basic features (unigrams, bigrams) and existing word embeddings. |
#2 | Understanding softmax function, classification via vector representations, playing with gradient descent. |
#3 | PyTorch introduction, automatic differentiation, computation graph, how to use PyTorch on GPUs, basic feedforward network and backpropagation, Word2vec as a feedforward net with automatic differentiation |
#4 | Neural language model with feedforward network, evaluating language modeling, count-based models, decoding language models |
#5 | Recurrent neural language model and evaluation; Transformers |
#6 | Fine-tuning and prompting language models, distributed tuning. |
#7 | Prompt engineering, in-context learning; Retrieval-augmented language models |
There will be one in-class midterm. The midterm exam will be paper-based and held during the usual class time. This midterm exam aims to evaluate students' progress and understanding of ideas presented in the first half of the semester, which will serve as a foundation for the material covered in the second half of the semester. The exam will assess students' mastery of the topics discussed in the lectures and weekly homework assignments. The exam will also provide feedback to both the student and the instructor, and identify areas that need improvement to inform further learning and teaching. The midterm will cover all material until the end of "Transformer Language Models", just before "Doing Things with Language Models". There will be no homework assignment in the week leading up to the midterm.
The objective of the final project is to make use of what you have learned during this course to solve a hard problem.
The project deliverables are: (1) a project proposal, (2) a final report, and (3) a final project poster summarizing the technical aspects of the project.
Each session will involve an instructor-led presentation on a focused topic in self-supervised models. There will be weekly assignments related to class presentations, a midterm exam, and a final project.
The current class schedule is below (subject to change):
Date | Topic | Course Materials | Events | Deadlines | |
---|---|---|---|---|---|
#1 - Tue Jan 24 | Course overview; Plan and expectations [slides: pdf, pptx] | HW1 released! [tex] [pdf] [colab] | |||
⬇️ -- Self-supervised Word Representations | |||||
#2 - Thu Jan 26 | Word meaning and representation [slides: pdf, pptx] | Suggested Reading: Jurafsky & Martin Chapter 6 | |||
Fri Jan 27 | TA Review Session (virtual over Zoom): Math background + Python [zoom link] [slides] | Time: 9 - 9:50 AM | |||
#3 - Tue Jan 31 | Word2vec objective function (continued), inspecting and evaluating word vectors [slides: pdf, pptx] | Suggested Reading: Jurafsky & Martin Chapter 6 | HW2 released! [tex] [pdf] [colab] | HW1 due | |
⬇️ -- Self-Supervised Representation of Feedforward Neural Language Models | |||||
#4 - Thu Feb 2 | Word2vec limitations and modeling context feedforward networks; Neural nets: brief history; Word2vec as simple feedforward net [slides: pdf, pptx] | Suggested Reading: Jurafsky & Martin Chapter 7 | |||
#5 - Tue Feb 7 | Analytical backpropagation; Automatic differentiation; Practical tips for training neural networks [slides: pdf, pptx] | Suggested Reading: Jurafsky & Martin Chapter 7 | HW3 released! [tex] [pdf] [colab] | HW2 due | |
#6 - Thu Feb 9 | Language modeling, N-gram models, evaluating LMs [slides: pdf, pptx] [slides: pdf, pptx] | Suggested Reading: Jurafsky & Martin Chapter 3 | |||
Fri Feb 10 | TA Review Session (virtual over Zoom): Backpropagation and PyTorch | Time: 9 - 9:50 AM | |||
#7 - Tue Feb 14 | Measuring LM quality; Fixed-window language modeling with FFNs [slides: pdf, pptx] | Suggested Reading: Jurafsky & Martin Chapter 7 | HW4 released! [tex] [pdf] [colab] | HW3 due | |
⬇️ -- Self-Supervised Representation of Recurrent Neural Language Models | |||||
#8 - Thu Feb 16 | Text generation algorithms; Recurrent neural networks; Encoder-decoder models [slides: pdf, pptx] | Suggested Reading: CS231N course notes on RNNs | |||
#9 - Tue Feb 21 | RNN continued: ELMo; Language units and subwords [slides: pdf, pptx] | Suggested Reading: Deep contextualized word representations (ELMo paper) | HW5 released! [tex] [pdf] [colab] | HW4 due | |
⬇️ -- Self-Supervised Representation of Transformer Language Models | |||||
#10 - Thu Feb 23 | Self-attention; Transformer [slides: pdf, pptx] | Suggested Reading: Attention Is All You Need | |||
⬇️ -- Large Language Models | |||||
#11 - Tue Feb 28 | Encoder family (BERT, RoBERTa, ...) [slides: pdf, pptx] | Suggested Reading: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | HW5 due | ||
#12 - Thu Mar 2 | Encoder-Decoder family (T5, BART), Decoder family (GPTk) [slides: pdf, pptx] | Suggested Reading: Exploring the limits of transfer learning with a unified text-to-text transformer (T5 paper) | |||
#13 - Tue Mar 7 | Midterm exam | HW6 released! [tex] [pdf] | |||
#14 - Thu Mar 9 | Final projects [slides]; Decoder family (GPTk); In-context learning [slides: pdf, pptx] | Suggested Reading: Language Models are Few-Shot Learners (GPT3 paper) | |||
Mar 9-17 | Project teaming extravaganza | ||||
#15 - Tue Mar 14 | In-context learning; Adapting models with prompting (prompt engineering); Failure modes of in-context learning [slides: pdf, pptx] | Suggested Reading: Calibrate Before Use: Improving Few-Shot Performance of Language Models | HW7 released! [tex] [pdf] | HW6 due | |
#16 - Thu Mar 16 | Multi-step reasoning via prompts; Adapting models with parameter change (head-tuning, prompt-tuning, adaptors) [slides: pdf, pptx] | Suggested Reading: The Power of Scale for Parameter-Efficient Prompt Tuning | |||
#17 - Tue Mar 21 | No Class - Spring Break | ||||
#18 - Thu Mar 23 | No Class - Spring Break | ||||
#19 - Tue Mar 28 | Scaling laws; Modifying self-attention for long context; Retrieval augmented language models [slides: pdf, pptx] | Suggested Reading: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | |||
#20 - Thu Mar 30 | Social concerns about LMs: Bias, fairness and toxic language; Hallucination, truthfulness, and veracity [slides: pdf, pptx] | Suggested Reading: Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models | HW7 due + Project proposals deadline | ||
#21 - Tue Apr 4 | Alignment via language instructions: existing solutions and challenges [slides: pdf, pptx] | Suggested Reading: Training language models to follow instructions with human feedback (GPT3 + RLHF paper) | |||
#22 - Thu Apr 6 | Vision-language models [slides: pdf, pptx] | Suggested Reading: Multimodal Few-Shot Learning with Frozen Language Models | |||
#23 - Tue Apr 11 | Guest speaker: Jim Fan (NVIDIA) | ||||
#24 - Thu Apr 13 | Guest speaker: Tim Dettmers (UW) [slides] | Title: 8-bit Methods for Efficient Deep Learning Abstract: Large language models are effective tools for many tasks but are difficult to train and inference due to their size. Moving from 32-bit models to 16-bit models resulted in considerable efficiency gains that made training and inference of large models easier. Can we train and inference in 8-bit to make further gains? In this talk, I will show that 8-bit inference and training can be used without degrading performance while improving efficiency. To make 8-bit methods work, it is essential to understand how quantization precision affects model performance and training stability as we scale the model size. I will talk about how these factors change with scale and how we need to adjust 8-bit methods to make them work. In particular, I will speak about 8-bit optimizers for training and Int8 inference for large language models with up to 175B parameters. These methods make training and inference more efficient and make large models more accessible to researchers. | |||
#25 - Tue Apr 18 | Guest speaker: Yizhong Wang (UW) [slides] | ||||
#26 - Thu Apr 20 | Guest speaker: Ruiqi Zhong (UC Berkeley) [slides] | Title: Getting AI to Do Things I Can't: Scalable Oversight via Indirect Supervision Abstract: Can we tame powerful AI systems even when we struggle to determine the ground truth ourselves? In this talk, I will cover two example NLP tasks: 1) automatically searching for patterns in large text collections and explaining them to humans in natural language; 2) labeling complex SQL programs using non-programmers with the aid of our AI system and achieving accuracy on par with database experts. In both cases, we build tools that help humans to indirectly scrutinize the AI’s output with high effectiveness but low effort, bringing new insights that human experts have not anticipated. | Midway reports deadline | ||
#27 - Tue Apr 25 | Guest speaker: Max Ryabinin (Yandex) [slides] | ||||
#28 - Thu Apr 27 | Guest speaker: Rogério Bonatti (Microsoft) [slides] | ||||
#29 - Tue May 2 | No Class - Reading Days | ||||
#30 - Thu May 4 | No Class - Reading Days | ||||
May 15 EOD | Final project reports | ||||
May 15 | Final project poster session (9 am-12 pm, Hackerman Hall) |
There is no required text, though the following can be useful:
Here are several resources available for free:
Besides these resources, we will try our best to satisfy individual needs through discussion.
The strength of the university depends on academic and personal integrity. In this course, you must be honest and truthful, abiding by the Computer Science Academic Integrity Policy:
Cheating is wrong. Cheating hurts our community by undermining academic integrity, creating mistrust, and fostering unfair competition. The university will punish cheaters with failure on an assignment, failure in a course, permanent transcript notation, suspension, and/or expulsion. Offenses may be reported to medical, law or other professional or graduate schools when a cheater applies. Violations can include cheating on exams, plagiarism, reuse of assignments without permission, improper use of the Internet and electronic devices, unauthorized collaboration, alteration of graded assignments, forgery and falsification, lying, facilitating academic dishonesty, and unfair competition. Ignorance of these rules is not an excuse.
Academic honesty is required in all work you submit to be graded. Except where the instructor specifies group work, you must solve all homework and programming assignments without the help of others. For example, you must not look at anyone else’s solutions (including program code) to your homework problems. However, you may discuss assignment specifications (not solutions) with others to be sure you understand what is required by the assignment. If your instructor permits using fragments of source code from outside sources, such as your textbook or on-line resources, you must properly cite the source. Not citing it constitutes plagiarism. Similarly, your group projects must list everyone who participated.
In the above paragraph, "outside sources" also includes content that was produced by an AI assistant like ChatGPT. This follows either by treating the AI assistant as a person for the purposes of this policy (controversial) or acknowledging that the AI assistant was trained directly on people's original work. Thus, while you are not forbidden from using these tools, you should consider the above policy carefully and quote where appropriate. Assignments that are in large part quoted from an AI assistant are very unlikely to be evaluated positively. In addition, if a student's work is substantially identical to another student's work, that will be grounds for an investigation of plagiarism regardless of whether the prose was produced by an AI assistant.
Falsifying program output or results is prohibited. Your instructor is free to override parts of this policy for particular assignments. To protect yourself: (1) Ask the instructor if you are not sure what is permissible. (2) Seek help from the instructor, TA or CAs, as you are always encouraged to do, rather than from other students. (3) Cite any questionable sources of help you may have received.
Report any violations you witness to the instructor. You can find more information about university misconduct policies on the web for undergraduate and graduate students.
Johns Hopkins University is committed to equal opportunity for its faculty, staff, and students. To that end, the university does not discriminate on the basis of sex, gender, marital status, pregnancy, race, color, ethnicity, national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, military status, immigration status or other legally protected characteristic. The University's Discrimination and Harassment Policy and Procedures provides information on how to report or file a complaint of discrimination or harassment based on any of the protected statuses listed in the earlier sentence, and the University’s prompt and equitable response to such complaints.