Large self-supervised (pre-trained) models, such as Large Language Models (LLMs), have transformed various data-driven fields, including natural language processing (NLP). This advanced course aims to provide a holistic view of the issues surrounding these models. The class will mainly involve reading and discussing recent papers in the field.

The class will focus on a range of issues: data efficiency, robustness, long context, multi-modality, reasoning grounded in the web or the physical world, and security, legal, and privacy concerns.

Note: This course is different from (and more advanced than) 601.471/671 (offered in the spring semesters), which focuses on building the foundational concepts.

Prerequisites: Natural Language Processing (CS 465/665), NLP: Self-Supervised Models (CS 471/671), or instructor consent.

Relevant Courses at Hopkins: This course has some overlap with "Natural Language Processing" (EN.601.465/665) and "Artificial Agents" (EN.601.470/670), though the courses have different focuses.

Logistics

Expectations and Deliverables




Final project

The objective of the final project is to use what you have learned during this course to solve a hard problem.

The final project milestones include: (1) a project proposal, (2) a project midway report, (3) a progress-update presentation, (4) a final report, and (5) a final project poster summarizing the technical aspects of the project. See the course calendar for the due dates.



Content Schedule

The current class schedule is below (subject to change). You can also see this spreadsheet containing a larger set of papers we considered.

Date Topic Course Materials
#1 - Tue Aug 27 Reviewing the foundations Course introduction:
  • Course overview
  • Plan and expectations
[slides: pptx, pdf]
Reviewing the foundations:
  • Language modeling
[slides: pptx, pdf]
#2 - Thu Aug 29 Reviewing the foundations
  • Transformers
  • Pre-training
[slides: pptx, pdf]
#3 - Tue Sept 3 Reviewing the foundations
  • Prompting
  • Tuning
[slides: pptx, pdf]
#4 - Thu Sept 5 Reviewing the foundations
  • Retrieval
  • Alignment
[slides: pptx, pdf]
#5 - Tue Sept 10 Pre-training Main Reading(s): The Llama 3 Herd of Models (Sec 3 and the relevant portion of Sec 5)
Additional Suggested Reading:
  1. OLMo: Accelerating the Science of Language Models
  2. Apple Intelligence Foundation Language Models (up to Sec 4 and the relevant portions of Sec 6)
[slides: pptx, pdf]
#6 - Thu Sept 12 Alignment Main Reading(s): The Llama 3 Herd of Models (Sec 4 and the relevant portion of Sec 5)
Additional Suggested Reading:
  1. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
  2. Fundamental Limitations of Alignment in Large Language Models
[slides: pptx, pdf]
#7 - Tue Sept 17 Data for Pre-training Main Reading(s): The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
Additional Suggested Reading:
  1. Dated Data: Tracing Knowledge Cutoffs in Large Language Models
  2. A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
[slides: pptx, pdf]
#8 - Thu Sept 19 Evaluation Main Reading(s): Are Emergent Abilities of Large Language Models a Mirage?
Additional Suggested Reading:
  1. Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?
  2. MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
  3. Understanding Emergent Abilities of Language Models from the Loss Perspective
[slides: pptx, pdf]
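For intuition on the "mirage" argument in the main reading, here is a toy numerical illustration (all numbers synthetic): a smooth power-law improvement in per-token accuracy looks far more abrupt when measured with an all-or-nothing metric such as exact match over a multi-token answer.

```python
import numpy as np

# Synthetic numbers for illustration only.
scales = np.array([1e8, 1e9, 1e10, 1e11, 1e12])   # model sizes (params)
token_acc = 1 - 0.45 * scales ** -0.08            # smooth power-law gain
exact_match = token_acc ** 20                     # all-or-nothing, 20-token answers
for s, t, e in zip(scales, token_acc, exact_match):
    print(f"params={s:.0e}  token_acc={t:.3f}  exact_match={e:.3f}")
```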
#9 - Tue Sept 24 Scalable oversight Main Reading(s): Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Additional Suggested Reading:
  1. On scalable oversight with weak LLMs judging strong LLMs
  2. Measuring Progress on Scalable Oversight for Large Language Models
  3. Prover-Verifier Games improve legibility of LLM outputs
[slides: pptx, pdf]
#10 - Thu Sept 26 Reasoning and inference-scaling Main Reading(s): Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Additional Suggested Reading:
  1. An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
  2. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
  3. Are More LM Calls All You Need? Towards the Scaling Properties of Compound AI Systems
  4. Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
[slides: pptx, pdf]
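A unifying theme of these readings is spending more compute at inference time, most simply by sampling many candidates and selecting one with a scorer. Below is a minimal best-of-N sketch; `generate` and `score` are hypothetical stand-ins for a model sampler and a verifier/reward model.

```python
import random

def best_of_n(prompt, generate, score, n=16):
    """Sample n candidates and return the one the scorer prefers."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy usage: the "model" guesses integers; the "verifier" prefers 42.
answer = best_of_n(
    "Guess a number",
    generate=lambda p: random.randint(0, 100),
    score=lambda y: -abs(y - 42),
)
print(answer)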
#11 - Tue Oct 1 Interpreting LLM activations via SAEs Main Reading(s): Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Section 1-4, i.e. From the top to "Features as Computational Intermediates")
Additional Suggested Reading:
  1. Scaling and evaluating sparse autoencoders
  2. Disentangling Dense Embeddings with Sparse Autoencoders
[slides: pptx, pdf]
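As background for the discussion, here is a minimal sparse autoencoder of the kind used in this line of work: an overcomplete dictionary trained to reconstruct residual-stream activations under an L1 sparsity penalty. The sizes and loss coefficient below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_dict=4096):   # overcomplete dictionary
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(f), f         # reconstruction, features

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # reconstruction error + L1 penalty encouraging sparse features
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

sae = SparseAutoencoder()
x = torch.randn(8, 512)                   # stand-in for model activations
x_hat, f = sae(x)
print(sae_loss(x, x_hat, f).item())
```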
Oct 3 Project proposals deadline
#12 - Thu Oct 3 Security Main Reading(s): Stealing Part of a Production Language Model (summary)
Additional Suggested Reading:
  1. Logits of API-protected LLMs leak proprietary information
[slides: pptx, pdf]
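The core observation behind this attack can be simulated in a few lines: because the final logits are a linear map of the hidden state, logit vectors collected over many queries span a subspace whose rank reveals the (supposedly secret) hidden dimension. A toy simulation with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 1000, 64                  # toy sizes; `hidden` is "secret"
W = rng.normal(size=(vocab, hidden))      # unembedding matrix of the victim
H = rng.normal(size=(hidden, 500))        # hidden states for 500 queries
logits = W @ H                            # what a logit-exposing API returns
print(np.linalg.matrix_rank(logits))      # -> 64: the hidden size leaks
```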
#13 - Tue Oct 8 Guest Speaker Ziang Xiao, Assistant Professor of Computer Science at JHU
Title: Towards Human-Centered Evaluation of Generative Models
#14 - Thu Oct 10 Weight Quantization Main Reading(s): AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Additional Suggested Reading:
  1. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
  2. Optimal brain compression: A framework for accurate post-training quantization and pruning
  3. Extreme Compression of Large Language Models via Additive Quantization
  4. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
[slides: pptx, pdf]
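For orientation before the discussion, a baseline round-to-nearest, per-channel weight quantizer is sketched below; AWQ's contribution is to rescale salient channels using activation statistics before this step. All sizes are toy values.

```python
import numpy as np

def quantize_per_channel(w, bits=4):
    """Symmetric round-to-nearest quantization per output channel."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)         # toy weight matrix
q, s = quantize_per_channel(w)
print(np.abs(w - dequantize(q, s)).max())            # quantization error
```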
#15 - Tue Oct 15 Fast n-gram membership/counting Main Reading(s): Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Additional Suggested Reading:
  1. Data Portraits: Recording Foundation Model Training Data
  2. N-gram Is Back: Residual Learning of Neural Text Generation with n-gram Language Model
[slides: pptx, pdf]
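The trick that makes unbounded-n counting feasible is binary search over a suffix array rather than storing explicit n-gram tables. Below is a naive toy version; the real system uses heavily engineered suffix arrays over trillions of tokens.

```python
import bisect

def build_suffix_array(tokens):
    # O(n^2 log n) construction; fine for a toy corpus
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_ngram(tokens, sa, query):
    # Occurrences of `query` = width of its interval in the suffix array.
    prefixes = [tokens[i:i + len(query)] for i in sa]   # sorted prefixes
    return bisect.bisect_right(prefixes, query) - bisect.bisect_left(prefixes, query)

corpus = "the cat sat on the mat".split()
sa = build_suffix_array(corpus)
print(count_ngram(corpus, sa, ["the"]))   # -> 2
```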
Oct 15 Project proposal revision deadline
#16 - Thu Oct 17 No Class - Fall break
#17 - Tue Oct 22 Efficient decoding Main Reading(s): Fast Inference from Transformers via Speculative Decoding
Additional Suggested Reading:
  1. Nearest Neighbor Speculative Decoding for LLM Generation and Attribution
  2. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
  3. Few-Shot Semantic Parsing with Language Models Trained on Code
  4. A semi-comprehensive collection of papers around "speculative decoding"
[slides: pptx, pdf]
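As a primer on the main reading, the sketch below shows the draft-then-verify structure of speculative decoding, simplified to greedy verification. Here `draft_next` and `target_next` are hypothetical single-token callables; the real algorithm verifies all draft tokens in one batched target pass and accepts them probabilistically so the output distribution matches the target model exactly.

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_len=24):
    seq = list(prompt)
    while len(seq) < max_len:
        # 1) draft k tokens cheaply with the small model
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2) verify against the target model, keeping the agreeing prefix
        #    (a real system scores all k positions in one batched pass)
        for tok in draft:
            if target_next(seq) != tok:
                break
            seq.append(tok)
        # 3) always gain at least one token from the target model
        seq.append(target_next(seq))
    return seq

# Toy usage: both "models" count upward, so every draft token is accepted.
print(speculative_decode(lambda s: s[-1] + 1, lambda s: s[-1] + 1, [0]))
```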
#18 - Thu Oct 24 Long-context training Main Reading(s): How to Train Long-Context Language Models (Effectively)
Additional Suggested Reading:
  1. Effective Long-Context Scaling of Foundation Models
  2. Data Engineering for Scaling Language Models to 128K Context
[slides: pptx, pdf]
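One recurring ingredient in these recipes is rescaling rotary position embeddings so a short-context model can be adapted to longer inputs. Here is a minimal sketch of RoPE angle computation with simple position interpolation (toy sizes; the readings use more refined frequency-scaling schemes).

```python
import torch

def rope_angles(seq_len=8, d_head=64, base=10000.0, scale=1.0):
    # Standard RoPE inverse frequencies...
    inv_freq = 1.0 / (base ** (torch.arange(0, d_head, 2).float() / d_head))
    # ...with position interpolation: dividing positions by scale > 1
    # squeezes long sequences into the position range seen in pre-training.
    positions = torch.arange(seq_len).float() / scale
    return torch.outer(positions, inv_freq)          # (seq_len, d_head / 2)

print(rope_angles(scale=4.0).shape)                  # torch.Size([8, 32])
```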
#19 - Tue Oct 29 Model merging Main Reading(s): TIES-Merging: Resolving Interference When Merging Models
Additional Suggested Reading:
  1. RE-Adapt: Reverse Engineered Adaptation of Large Language Models
  2. TBD
[slides: pptx, pdf]
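For intuition, here is a heavily simplified version of the TIES-Merging recipe over flat parameter vectors: trim each task vector to its largest-magnitude entries, elect a per-parameter sign, and average only the entries that agree with it. This is a toy sketch, not a faithful reimplementation.

```python
import numpy as np

def ties_merge(base, finetuned, density=0.3):
    task_vecs = [ft - base for ft in finetuned]          # per-model deltas
    trimmed = []
    for tv in task_vecs:
        k = max(1, int(density * tv.size))               # keep top-k entries
        thresh = np.sort(np.abs(tv))[-k]
        trimmed.append(np.where(np.abs(tv) >= thresh, tv, 0.0))
    stacked = np.stack(trimmed)
    sign = np.sign(stacked.sum(axis=0))                  # elected sign
    agree = (np.sign(stacked) == sign) & (stacked != 0)  # sign-consistent entries
    counts = np.maximum(agree.sum(axis=0), 1)
    return base + (stacked * agree).sum(axis=0) / counts

base = np.zeros(6)
ft_a = base + np.array([1.0, -0.5, 0.0, 2.0, 0.0, 0.1])
ft_b = base + np.array([0.8, 0.4, 0.0, -1.0, 0.0, 0.0])
print(ties_merge(base, [ft_a, ft_b]))
```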
#20 - Thu Oct 31 Representing the world Main Reading(s): The Platonic Representation Hypothesis
Additional Suggested Reading:
  1. The Linear Representation Hypothesis and the Geometry of Large Language Models
  2. Emergent Linear Representations in World Models of Self-Supervised Sequence Models
[slides: pptx, pdf]
#21 - Tue Nov 5 Weight Adaptation Main Reading(s): DoRA: Weight-Decomposed Low-Rank Adaptation
Additional Suggested Reading:
  1. QLoRA: Efficient Finetuning of Quantized LLMs
  2. RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation
[slides: pptx, pdf]
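As a baseline for the discussion, a minimal LoRA layer is sketched below: the pretrained weight is frozen and a low-rank update B·A is learned. DoRA, the main reading, further decomposes the weight into magnitude and direction and applies the low-rank update to the direction component. Sizes and initialization scales are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)           # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(64, 64)
print(layer(torch.randn(2, 64)).shape)                   # torch.Size([2, 64])
```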
#22 - Thu Nov 7 Representation Adaptation Main Reading(s): Steering Language Models With Activation Engineering
Additional Suggested Reading:
  1. In-context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering
  2. ReFT: Representation Finetuning for Language Models
[slides: pptx, pdf]
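Here is a minimal sketch of activation steering, assuming the common contrastive construction in which the steering vector is the difference of mean activations on two sets of prompts. All tensors below are random placeholders for real model activations.

```python
import torch

# Contrastive construction: random tensors stand in for activations
# collected at one layer on "positive" vs. "negative" prompts.
acts_pos = torch.randn(16, 512)
acts_neg = torch.randn(16, 512)
steer_vec = acts_pos.mean(0) - acts_neg.mean(0)

def apply_steering(hidden, vec, coeff=4.0):
    # Added to the hidden state at the chosen layer during generation.
    return hidden + coeff * vec

print(apply_steering(torch.randn(1, 512), steer_vec).shape)
```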
#23 - Tue Nov 12 Episodic memory Main Reading(s): CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization
Additional Suggested Reading:
  1. Language Modeling with Editable External Knowledge
  2. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models
[slides: pptx, pdf]
#24 - Thu Nov 14 Compression Main Reading(s): Training LLMs over Neurally Compressed Text
Additional Suggested Reading:
  1. Rethinking LLM Memorization through the Lens of Adversarial Compression
[slides: pptx, pdf]
Nov 14 Midway reports deadline
#25 - Tue Nov 19 Tool use Main Reading(s): LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error
Additional Suggested Reading:
  1. CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models
  2. WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment
[slides: pptx, pdf]
#26 - Thu Nov 21 Mixture of Experts Main Reading(s): Mixtral of Experts
Additional Suggested Reading:
  1. ST-MoE: Designing Stable and Transferable Sparse Expert Models
  2. Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
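For reference during the discussion, here is a minimal Mixtral-style sparse MoE layer: a linear router selects the top-2 experts per token and mixes their outputs with renormalized router weights. The per-token loop is for clarity; real implementations batch by expert.

```python
import torch
import torch.nn as nn

class Top2MoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_experts)
        )

    def forward(self, x):                                # x: (tokens, d_model)
        weights, idx = self.router(x).topk(2, dim=-1)    # top-2 router logits
        weights = torch.softmax(weights, dim=-1)         # renormalize over the pair
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):                      # per-token loop, for clarity
            for slot in range(2):
                expert = self.experts[int(idx[t, slot])]
                out[t] += weights[t, slot] * expert(x[t])
        return out

moe = Top2MoE()
print(moe(torch.randn(4, 64)).shape)                     # torch.Size([4, 64])
```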
#27 - Tue Nov 26 No Class - Fall Recess
#28 - Thu Nov 28 No Class - Fall Recess
#29 - Tue Dec 3 Robotic planning Main Reading(s): VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
Additional Suggested Reading:
  1. TBD
  2. TBD
#30 - Thu Dec 5 TBD Main Reading(s): TBD
Additional Suggested Reading:
  1. TBD
  2. TBD
Dec 9-10 Reading Days
Dec 17 Final project reports
Dec 17 Final project poster session (6-9pm; per the final exam schedule)

Relevant Resources

Here are several resources available for free:

Besides these resources, we will try our best to accommodate individual needs through discussion.


Code of Conduct

The strength of the university depends on academic and personal integrity. In this course, you must be honest and truthful, abiding by the Computer Science Academic Integrity Policy:

Cheating is wrong. Cheating hurts our community by undermining academic integrity, creating mistrust, and fostering unfair competition. The university will punish cheaters with failure on an assignment, failure in a course, permanent transcript notation, suspension, and/or expulsion. Offenses may be reported to medical, law or other professional or graduate schools when a cheater applies. Violations can include cheating on exams, plagiarism, reuse of assignments without permission, improper use of the Internet and electronic devices, unauthorized collaboration, alteration of graded assignments, forgery and falsification, lying, facilitating academic dishonesty, and unfair competition. Ignorance of these rules is not an excuse.

Academic honesty is required in all work you submit to be graded. Except where the instructor specifies group work, you must solve all homework and programming assignments without the help of others. For example, you must not look at anyone else’s solutions (including program code) to your homework problems. However, you may discuss assignment specifications (not solutions) with others to be sure you understand what is required by the assignment. If your instructor permits using fragments of source code from outside sources, such as your textbook or on-line resources, you must properly cite the source. Not citing it constitutes plagiarism. Similarly, your group projects must list everyone who participated.

In the above paragraph, "outside sources" also includes content produced by an AI assistant such as ChatGPT. This follows either from treating the AI assistant as a person for the purposes of this policy (controversial) or from acknowledging that the AI assistant was trained directly on people's original work. Thus, while you are not forbidden from using these tools, you should consider the above policy carefully and quote where appropriate. Assignments that are in large part quoted from an AI assistant are very unlikely to be evaluated positively. In addition, if a student's work is substantially identical to another student's work, that will be grounds for a plagiarism investigation regardless of whether the prose was produced by an AI assistant.

Falsifying program output or results is prohibited. Your instructor is free to override parts of this policy for particular assignments. To protect yourself: (1) Ask the instructor if you are not sure what is permissible. (2) Seek help from the instructor, TA or CAs, as you are always encouraged to do, rather than from other students. (3) Cite any questionable sources of help you may have received.

Report any violations you witness to the instructor. You can find more information about university misconduct policies on the web for undergraduate and graduate students.

Johns Hopkins University is committed to equal opportunity for its faculty, staff, and students. To that end, the university does not discriminate on the basis of sex, gender, marital status, pregnancy, race, color, ethnicity, national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, military status, immigration status or other legally protected characteristic. The University's Discrimination and Harassment Policy and Procedures provides information on how to report or file a complaint of discrimination or harassment based on any of the protected statuses listed in the earlier sentence, and the University’s prompt and equitable response to such complaints.