CS 601.771: Advances in Self-supervised Models

Large self-supervised (pre-trained) models (such as Large Language Models or LLMs) have transformed various data-driven fields, such as natural language processing (NLP). This advanced course aims to provide a holistic view of the issues related to these models. The class will mainly involve reading and discussing recent papers in the field.

This class will focus on various issues regarding "scaling": data efficiency, model social, long context, multi-modality, reasoning grounded in the web or physical world, and security/legal/privacy issues.

Note: This course is different from 601.471/671 (offered in the spring semesters), which focuses on building the foundations for self-supervised models.

Prerequisites: Natural Language Processing (CS 465/665), NLP: Self-Supervised Models (CS 471/671), or instructor consent.

Relevant Courses at Hopkins: This course has some overlap with "Natural Language Processing" (EN.601/665), and "Artificial Agents" (EN.601.470/670), though the courses have different focuses.

Logistics

Classes: on Tuesday/Thursday 9 - 10:15 am EST (room: TBD, or zoom meeting)
Office hours: Daniel's office hours: Thursdays 12 - 1 pm EST, or by appointment (Hackerman Hall, 316B).
Contact: If you have any questions about the course, you can post them on Slack.
Virtual or in-person: The class will be in-person.
Changes: The instructor reserves the right to make changes to the syllabus or project due dates. These changes will be announced as early as possible.
News and announcements: All the news and announcements will be made on Slack.
COVID: Students who report symptoms associated with COVID-19 are expected not to attend class and to isolate themselves for at least five days and until they have been symptom-free for 24 hours.
Course grade: Your grade is based on the following activities: (1) one assignment (10%) -- done individually, (2) in-class participation (40%) -- done individually, (3) a final project (40%) -- done in groups, and (4) attendance (10%) -- participation in class is our chance to learn more effectively. Up to 3% additional credit is available for any actions taken to improve the course that are brought to the instructors' attention.
Late days: Each student has 10 late days to use for assignments. A late day extends the deadline 24 hours. Once you have used all your late days, the penalty is 5% off the final homework grade for each additional late day. For example, if you're late for 3 days on an assignment (beyond your legitimate "late days" capacity, which is 7 days per homework and 10 days in total), you will lose 15% of the points for that assignment. The deadline cutoffs are at 12pm of each day. There are no fractional late days. If you're late for 1 hour, you lose a full day.
You can use up to 7 late days per assignment (so, if you're late on HW1, you can submit it until the release of HW2). Assignments submitted after 7 late days will not be graded (unless explicit permission is given in advance by the instructor).
Grading: Homeworks are graded by the entire course staff, directly within Gradescope. To keep grading consistent, each numbered problem is graded by a single grader, under the supervision of one of the TAs, using a detailed rubric developed within Gradescope. Under normal circumstances, all homework should be graded within 10 calendar days of submission.
Regrading: Regrade requests can be submitted directly within Gradescope and must include a brief written justification for the request. We encourage students who have questions or concerns about their grades to talk with the course staff before submitting a regrade request. However, no grades will be changed in any student's presence.
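
The late-day arithmetic above can be sketched in a few lines of Python. This is only an illustration of one reading of the policy, not an official grading tool: the function names (`late_days`, `late_penalty`) and the `free_days_remaining` parameter are hypothetical, and the sketch assumes that free late days are consumed before any penalty applies and that the 7-day cap counts both free and penalized days.

```python
import math

MAX_LATE_DAYS_PER_ASSIGNMENT = 7   # later submissions are not graded
PENALTY_PERCENT_PER_DAY = 5        # applied once free late days run out


def late_days(hours_late: float) -> int:
    """Round lateness up to whole days (there are no fractional late days)."""
    if hours_late <= 0:
        return 0
    return math.ceil(hours_late / 24)


def late_penalty(hours_late: float, free_days_remaining: int):
    """Return (free_days_used, penalty_percent), or None if not graded.

    Assumption: free late days are spent first, and each remaining late
    day costs 5% of the assignment grade.
    """
    days = late_days(hours_late)
    if days > MAX_LATE_DAYS_PER_ASSIGNMENT:
        return None  # past the 7-day cutoff: not graded without permission
    free_used = min(days, free_days_remaining)
    penalized_days = days - free_used
    return free_used, penalized_days * PENALTY_PERCENT_PER_DAY
```

For instance, under these assumptions `late_penalty(1, 10)` uses one full free late day with no penalty, while `late_penalty(72, 0)` yields a 15% penalty, matching the 3-day example in the policy.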

Key links

Slack for discussion and announcements. Sign up, follow, ask questions, and participate in discussions!
Gradescope for submitting your assignments.

Assignment

The course has ONE assignment to measure your understanding of the foundational concepts of self-supervised learning. This is to make sure that, coming in, you know all the prerequisites needed for the class. It will be released on this website, and submissions should be uploaded to Gradescope.

Pre/in-class Participation

TBD

Final project

The objective of the final project is to make use of what you have learned during this course to solve a hard problem.

The final project milestones include: (1) a project proposal, (2) a project midway report, (3) a progress update presentation, (4) a final report, and (5) a final project poster summarizing the technical aspects of the project. See the course calendar for the due dates.

Topic: The topic of this project is open-ended. This project, for example, can focus on demonstrating systemic limitations of prior work or suggesting improvements on methods or benchmarks discussed in the class.
Group work: Students are encouraged to work in groups on the final project (team sizes limited to 2 or 3 people).
Project proposals: All groups will be required to submit a project proposal (due date on the class calendar). The project proposal is a 2-page description of what you intend to do (experiments, datasets, methods, etc.). All documents should follow the template. The instructor(s) will provide feedback on these ideas to help the teams find a concrete idea. Here is an example of a project proposal from a previous year:
- Ensemble Domain-Specific Knowledge Distillation
Midway progress reports: A report that discusses the progress made thus far and elaborates on the remaining work (at most 5 pages; use this template). Describe the progress made, experiments you have run, preliminary results you have obtained, how you plan to spend the rest of your time, etc. While this is called "midway", in practice it should be considered more than halfway! By this milestone, you’re expected to have implemented some system and to have some experimental results to show.
Final poster presentations: All students will present their findings at a poster presentation during the final exam period.
Final report: Students should write code and carry out additional experiments and then write up the results in a standard conference paper format (at most 8 pages; use this template). References don't count toward the page limit. Note that longer reports are not necessarily better. Students in groups are required to include a “contributions” section that concretely lists each author’s contributions (see Section 8 of this paper, for example). The final report should concisely summarize your findings and answer the following questions: 1. What approach did you take to address this problem, and why? 2. How did you explore the space of solutions? 3. How did you evaluate the performance of the approach(es) you investigated? 4. What worked, what did not work, and why?
Here are examples of final reports from the previous year:
- Ensemble Domain-Specific Knowledge Distillation
- Efficient Distillation of Transformers via Self-Teaching
Project grading: The goal of the project is to demonstrate the group’s understanding of the tools and challenges when using self-supervised models. Grading will reflect the quality of the approach, the rigor of evaluation, and reasoning about successes and failures. Grading will also depend on the completeness of the project, the clarity of the writeup, the level of complexity/difficulty of the approach, and your ability to justify the choices you made. Here is the grade breakdown for the projects:
- Project proposal: 15%
- Midway report: 15%
- Progress update presentation: 10%
- Quality of final report write-up, implementation and results: 40%
- Final poster and its presentation: 20%
Collaboration: Study groups are allowed, but students must understand and complete their own assignments, and hand in one assignment per student. If you worked in a group, please put the names of the members of your study group at the top of your assignment. Please ask if you have any questions about the collaboration policy. Again, you must understand and complete your own assignments in your own words, and hand in one assignment per student.
Using Other Resources: We strongly encourage you to use any outside source at your disposal, provided you use your sources properly and give them proper credit. If you get an idea from an outside source, citing that source will not lower your grade. Failing to properly cite an outside source—thereby taking credit for ideas that are not your own—is plagiarism.
Appropriate Citations: You must write everything in your own words, and properly cite every outside source you use, including other students. Using ideas from other sources or people without citation is plagiarism. Copying other sources verbatim, even with proper citation, is plagiarism. Don't do that. The only sources that you are not required to cite are the official course materials (lectures, slides, and assignments).
Honor code: We expect students to not look at solutions or implementations online. We take the student Honor Code seriously. We sometimes use automated methods to detect overly similar assignment solutions.
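
To make the weighting concrete, the project grade breakdown and the overall course grade weights listed earlier can be combined as in the sketch below. It is a rough illustration only: the names (`weighted_score`, `course_grade`, the dictionary keys) are hypothetical, component scores are assumed to be on a 0-100 scale, and treating the extra credit as up to 3 additive percentage points is an assumption rather than a stated rule.

```python
# Weights in percent, copied from the syllabus.
PROJECT_WEIGHTS = {
    "proposal": 15,
    "midway_report": 15,
    "progress_presentation": 10,
    "final_report": 40,
    "poster": 20,
}
COURSE_WEIGHTS = {
    "assignment": 10,
    "participation": 40,
    "project": 40,
    "attendance": 10,
}
MAX_EXTRA_CREDIT = 3  # assumption: additive percentage points


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of 0-100 scores using percent weights."""
    assert sum(weights.values()) == 100, "weights must sum to 100%"
    return sum(weights[name] * scores[name] for name in weights) / 100


def course_grade(assignment, participation, attendance,
                 project_scores, extra_credit=0.0) -> float:
    """Combine the four graded activities, then add capped extra credit."""
    project = weighted_score(project_scores, PROJECT_WEIGHTS)
    base = weighted_score(
        {
            "assignment": assignment,
            "participation": participation,
            "project": project,
            "attendance": attendance,
        },
        COURSE_WEIGHTS,
    )
    return base + min(max(extra_credit, 0), MAX_EXTRA_CREDIT)
```

Because the project carries 40% of the course grade and the final report carries 40% of the project grade, the final report alone accounts for 16% of the overall grade under this reading.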

What papers should we read?

Which papers should we read? What are the important topics in the field? Use this form to suggest papers and topics for the class.

Content Schedule

The current class schedule is below (subject to change):

Date	Topic	Course Materials	Events	Deadlines
#1 - Tue Jan 23	Course introduction: Course overview Plan and expectations [slides: pptx, pdf]	Suggested reading: Dive into Deep Learning: Linear Algebra in PyTorch Additional Reading: Python / Numpy Tutorial (with Jupyter and Colab) Optimization: Stochastic Gradient Descent	HW1 is released! [tex]
#2 - Thu Jan 25	Reviewing the foundations:	Suggested Reading: TBD Additional Reading: TBD TBD
#3 - Tue Jan 30	TBD	Suggested Reading: TBD Additional Reading: TBD TBD		HW1 due
#4 - Thu Feb 1	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#5 - Tue Feb 6	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#6 - Thu Feb 8	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#8 - Thu Feb 15	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#9 - Tue Feb 20	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#10 - Thu Feb 22	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#11 - Tue Feb 27	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#12 - Thu Feb 29	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#13 - Tue Mar 5	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#14 - Thu Mar 7	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#15 - Tue Mar 12	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
Apr 1	Project proposals deadline
#17 - Tue Mar 19	No Class - Spring Break
#18 - Thu Mar 21	No Class - Spring Break
#19 - Tue Mar 26	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#20 - Thu Mar 28	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#21 - Tue Apr 2	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#22 - Thu Apr 4	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
TBD	Midway reports deadline
#23 - Tue Apr 9	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#24 - Thu Apr 11	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#25 - Tue Apr 16	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#26 - Thu Apr 18	TBD	Suggested Reading: TBD Additional Reading: TBD TBD
#27 - Tue Apr 23	Project progress presentation
#28 - Thu Apr 25	Project progress presentation
#29 - Tue Apr 30	No Class - Reading Days
#30 - Thu May 2	No Class - Reading Days
May 13	Final project reports
May 13	Final project poster session (6-9pm)

Relevant Resources

Here are several resources available for free:

Compute resources:
- Google Colab provides free GPU usage for up to 12 hours/day for academic purposes. One can obtain more compute on Colab for a relatively small fee.
- Google offers research TPU credits.
- AWS and Azure both offer welcome credits to students.
- If you need credits to use GPT3/GPT4 or other APIs, discuss it with the instructor.
Demos:
Tutorials:

A course on Huggingface's Transformers library.
Tutorials on Learn with Torch Lightning

Besides these resources, we will try our best to satisfy individual needs through discussion.

Code of Conduct

The strength of the university depends on academic and personal integrity. In this course, you must be honest and truthful, abiding by the Computer Science Academic Integrity Policy:

Cheating is wrong. Cheating hurts our community by undermining academic integrity, creating mistrust, and fostering unfair competition. The university will punish cheaters with failure on an assignment, failure in a course, permanent transcript notation, suspension, and/or expulsion. Offenses may be reported to medical, law or other professional or graduate schools when a cheater applies. Violations can include cheating on exams, plagiarism, reuse of assignments without permission, improper use of the Internet and electronic devices, unauthorized collaboration, alteration of graded assignments, forgery and falsification, lying, facilitating academic dishonesty, and unfair competition. Ignorance of these rules is not an excuse.

Academic honesty is required in all work you submit to be graded. Except where the instructor specifies group work, you must solve all homework and programming assignments without the help of others. For example, you must not look at anyone else’s solutions (including program code) to your homework problems. However, you may discuss assignment specifications (not solutions) with others to be sure you understand what is required by the assignment. If your instructor permits using fragments of source code from outside sources, such as your textbook or on-line resources, you must properly cite the source. Not citing it constitutes plagiarism. Similarly, your group projects must list everyone who participated.

In the above paragraph, "outside sources" also includes content that was produced by an AI assistant like ChatGPT. This follows either by treating the AI assistant as a person for the purposes of this policy (controversial) or acknowledging that the AI assistant was trained directly on people's original work. Thus, while you are not forbidden from using these tools, you should consider the above policy carefully and quote where appropriate. Assignments that are in large part quoted from an AI assistant are very unlikely to be evaluated positively. In addition, if a student's work is substantially identical to another student's work, that will be grounds for an investigation of plagiarism regardless of whether the prose was produced by an AI assistant.

Falsifying program output or results is prohibited. Your instructor is free to override parts of this policy for particular assignments. To protect yourself: (1) Ask the instructor if you are not sure what is permissible. (2) Seek help from the instructor, TA or CAs, as you are always encouraged to do, rather than from other students. (3) Cite any questionable sources of help you may have received.

Report any violations you witness to the instructor. You can find more information about university misconduct policies on the web for undergraduate and graduate students.

Johns Hopkins University is committed to equal opportunity for its faculty, staff, and students. To that end, the university does not discriminate on the basis of sex, gender, marital status, pregnancy, race, color, ethnicity, national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, military status, immigration status or other legally protected characteristic. The University's Discrimination and Harassment Policy and Procedures provides information on how to report or file a complaint of discrimination or harassment based on any of the protected statuses listed in the earlier sentence, and the University’s prompt and equitable response to such complaints.

Johns Hopkins University - Fall 2024