Topics in CS – Human-AI Alignment

(CSCI-SHU 205)

Autumn 2025 | New York University Shanghai

Instructor: Hua Shen
Time: Mon/Wed, 2:15 - 3:30 PM
Location: EB227
Welcome! 🤗
This course provides an overview of (Bidirectional) Human-AI Alignment, emphasizing both how AI systems can be designed to reflect human values and how humans can be empowered to critically engage with and collaborate with AI. Topics include human-centered data collection and curation, reinforcement learning from human feedback (RLHF), human-in-the-loop evaluation, and human-AI interaction. By focusing on this two-way alignment, you will be equipped to shape AI systems responsibly while developing the skills to navigate and contribute to both HCI and AI research.
Class Schedule
See NYU Shanghai's Course Syllabus for the tentative schedule, which is subject to change.
Week 1 · Sep 1 (M) · Foundations
Overview: Introduction to Human-AI Alignment
slides · video
Week 1 · Sep 3 (W) · Foundations
Overview: Evolving Challenges of AI Alignment and Humans' Role
🎓 Project Proposal Begins
slides · video
Week 2 · Sep 8 (M) · Foundations
Values & Morals in LLMs: Theories and Evaluation (a questionnaire-style evaluation sketch follows the readings below)
slides
β€’ Gabriel, Iason. "Artificial intelligence, values, and alignment." Minds and Machines, 2020.
β€’ Sorensen, Taylor, Jared Moore et al. "A roadmap to pluralistic alignment." ICML 2024.
β€’ Ye, Haoran, Jing Jin, Yuhang Xie, Xin Zhang, and Guojie Song. "Large language model psychometrics: A systematic review of evaluation, validation, and enhancement." arXiv:2505.08245.
Week 2 · Sep 10 (W) · Foundations
Values & Morals in LLMs: Misalignment Mitigation and Social Impacts
slides
β€’ Feng, Shangbin, Taylor Sorensen, Yuhan Liu, Jillian Fisher, Chan Young Park, Yejin Choi, and Yulia Tsvetkov. "Modular pluralism: Pluralistic alignment via multi-llm collaboration." arXiv:2406.15951.
β€’ Shen, Hua, Nicholas Clark, and Tanushree Mitra. "Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values?." arXiv:2501.15463.
β€’ AlKhamissi, Badr, Muhammad ElNokrashy, Mai AlKhamissi, and Mona Diab. "Investigating cultural alignment of large language models." arXiv:2402.13231.
β€’ Rao, Abhinav, Aditi Khandelwal, Kumar Tanmay, Utkarsh Agarwal, and Monojit Choudhury. "Ethical reasoning over moral alignment: A case and framework for in-context ethical policies in LLMs." arXiv:2310.07251.
β€’ Santurkar, Shibani, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. "Whose opinions do language models reflect?." ICML 2023.
β€’ Kapoor, Sayash, Rishi Bommasani, Kevin Klyman, Shayne Longpre, Ashwin Ramaswami, Peter Cihon, Aspen Hopkins et al. "On the societal impact of open foundation models." arXiv:2403.07918.
Week 3 · Sep 15 (M) · Foundations
Paper Presentation: Values and Morals in LLMs
Week 3 · Sep 17 (W) · Methods
Data is Gold: Human-in-the-loop Collection and Co-Annotation (a short code sketch follows the readings below)
β€’ Kim, Sungdong, Sanghwan Bae, Jamin Shin, Soyoung Kang, Donghyun Kwak, Kang Min Yoo, and Minjoon Seo. "Aligning large language models through synthetic feedback." EMNLP 2023.
β€’ Li, Minzhi, Taiwei Shi, Caleb Ziems, Min-Yen Kan, Nancy F. Chen, Zhengyuan Liu, and Diyi Yang. "Coannotating: Uncertainty-guided work allocation between human and large language models for data annotation." EMNLP 2023.
β€’ Aher, Gati V., Rosa I. Arriaga, and Adam Tauman Kalai. "Using large language models to simulate multiple humans and replicate human subject studies." ICML 2023.
Week 4 · Sep 22 (M) · Methods
Data is Gold: Human Preference Styles
🎓 Project Proposal Due
β€’ Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen et al. "Constitutional ai: Harmlessness from ai feedback." arXiv:2212.08073.
β€’ Shen, Hua, Vicky Zayats, Johann C. Rocholl, Daniel D. Walker, and Dirk Padfield. "Multiturncleanup: A benchmark for multi-turn spoken conversational transcript cleanup." EMNLP 2023.
β€’ Huang, Saffron, Divya Siddarth, Liane Lovitt, Thomas I. Liao, Esin Durmus, Alex Tamkin, and Deep Ganguli. "Collective constitutional ai: Aligning a language model with public input." FAccT 2024.
β€’ Gordon, Mitchell L., Michelle S. Lam, Joon Sung Park, Kayur Patel, Jeff Hancock, Tatsunori Hashimoto, and Michael S. Bernstein. "Jury learning: Integrating dissenting voices into machine learning models." CHI 2022.
Week 4 · Sep 24 (W) · Methods
Data is Gold: Human Preference Substance and Validation
🎯 Assignment 1 Released (LLM Post-Training Alignment); an example preference record follows the readings below
β€’ Bradley, Herbie, Andrew Dai, Hannah Teufel, Jenny Zhang, Koen Oostermeijer, Marco Bellagente, Jeff Clune, Kenneth Stanley, GrΓ©gory Schott, and Joel Lehman. "Quality-diversity through AI feedback." ICLR 2024.
β€’ Zhao, Dora, Jerone TA Andrews, Orestis Papakyriakopoulos, and Alice Xiang. "Position: measure dataset diversity, don't just claim it." ICML 2024.
Week 5 · Sep 29 (M) · Methods
Post-Training for Alignment: LLM Prompting and Supervised Fine-Tuning (SFT); a minimal SFT sketch follows the reading below
β€’ Schulhoff, Sander, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li et al. "The prompt report: a systematic survey of prompt engineering techniques." arXiv:2406.06608.
Week 5 · Oct 1 (W)
💃🏻 National Holiday - No Class
Week 6 · Oct 6 (M)
💃🏻 National Holiday - No Class
Week 6 · Oct 8 (W) · Methods
Paper Presentation: Data is Gold
Week 7 · Oct 13 (M) · Methods
Post-Training for Alignment: RLHF and Similar Approaches (a DPO loss sketch follows the readings below)
β€’ Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang et al. "Training language models to follow instructions with human feedback." NeurIPS 2022.
β€’ Rafailov, Rafael, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. "Direct preference optimization: Your language model is secretly a reward model." NeurIPS 2024.
Week 7 · Oct 15 (W) · Methods
Post-Training for Alignment: Interactive Alignment and LLM Adaptation
β€’ Petridis, Savvas, Benjamin D. Wedin, James Wexler, Mahima Pushkarna, Aaron Donsbach, Nitesh Goyal, Carrie J. Cai, and Michael Terry. "Constitutionmaker: Interactively critiquing large language models by converting feedback into principles." IUI 2024.
β€’ Terry, Michael, Chinmay Kulkarni, Martin Wattenberg, Lucas Dixon, and Meredith Ringel Morris. "Interactive AI alignment: Specification, process, and evaluation alignment." arXiv:2311.00710.
Week 8 · Oct 20 (M) · Methods
Paper Presentation: Post-Training for Alignment
Week 8 · Oct 22 (W) · Methods
Evaluation and Ecosystem: Automatic and Human-in-the-loop Evaluation Approaches (an LLM-as-judge sketch follows the readings below)
β€’ Felkner, Virginia K., Ho-Chun Herbert Chang, Eugene Jang, and Jonathan May. "Winoqueer: A community-in-the-loop benchmark for anti-lgbtq+ bias in large language models." ACL 2023.
β€’ Sun, Zhiqing, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. "Principle-driven self-alignment of language models from scratch with minimal human supervision." NeurIPS 2024.
β€’ Lee, Noah, Na Min An, and James Thorne. "Can large language models capture dissenting human voices?." EMNLP 2023.
β€’ Zhou, Xuhui, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency et al. "Sotopia: Interactive evaluation for social intelligence in language agents." ICLR 2024.
Week 9 · Oct 27 (M) · Methods
Evaluation and Ecosystem: Platforms to Empower Human-AI Alignment
🎯 Assignment 1 Due
β€’ Dubois, Yann, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S. Liang, and Tatsunori B. Hashimoto. "Alpacafarm: A simulation framework for methods that learn from human feedback." NeurIPS 2023.
β€’ Yuan, Yifu, Jianye Hao, Yi Ma, Zibin Dong, Hebin Liang, Jinyi Liu, Zhixin Feng, Kai Zhao, and Yan Zheng. "Uni-rlhf: Universal platform and benchmark suite for reinforcement learning with diverse human feedback." ICLR 2024.
Week 9 · Oct 29 (W) · Methods
Paper Presentation: Evaluation and Ecosystem
Week 10 · Nov 3 (M)
🎓 Midway Project Showcase
Week 10 · Nov 5 (W) · Practice
Human-AI Interaction: Design and Development
🎯 Assignment 2 Released (Human-LLM Interactive Alignment System)
β€’ Lee, Mina, Katy Ilonka Gero, John Joon Young Chung, Simon Buckingham Shum, Vipul Raheja, Hua Shen, Subhashini Venugopalan et al. "A design space for intelligent and interactive writing assistants." CHI 2024.
β€’ Zagalsky, Alexey, Dov Te'eni, Inbal Yahav, David G. Schwartz, Gahl Silverman, Daniel Cohen, Yossi Mann, and Dafna Lewinsky. "The design of reciprocal learning between human and artificial intelligence." CSCW 2021.
β€’ Park, Joon Sung, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. "Generative agents: Interactive simulacra of human behavior." IUI 2023.
β€’ Friedman, Batya, David G. Hendry, and Alan Borning. "A survey of value sensitive design methods." Foundations and Trends in Human–Computer Interaction, 2017.
Week 11 · Nov 10 (M) · Practice
Human-AI Interaction: Evaluation and Intervention
β€’ Shen, Hua, and Tongshuang Wu. "Parachute: Evaluating interactive human-lm co-writing systems." CHI 2023 In2Writing.
β€’ Birhane, Abeba, William Isaac, Vinodkumar Prabhakaran, Mark Diaz, Madeleine Clare Elish, Iason Gabriel, and Shakir Mohamed. "Power to the people? Opportunities and challenges for participatory AI." EAAMO 2022.
β€’ Wu, Sherry, Hua Shen, Daniel S. Weld, Jeffrey Heer, and Marco Tulio Ribeiro. "Scattershot: Interactive in-context example curation for text transformation." IUI 2023.
Week 11 · Nov 12 (W) · Practice
Human-AI Interaction: Empirical Applications and Use Cases
β€’ Ma, Qianou, Hua Shen, Kenneth Koedinger, and Sherry Tongshuang Wu. "How to teach programming in the ai era? using llms as a teachable agent for debugging." AIED 2024.
β€’ Yang, Diyi, Caleb Ziems, William Held, Omar Shaikh, Michael S. Bernstein, and John Mitchell. "Social skill training with large language models." arXiv:2404.04204.
Week 12 · Nov 17 (M) · Practice
Paper Presentation: Human-AI Interaction
Week 12 · Nov 19 (W) · Practice
LLM Interpretability: Mechanistic Interpretability Approaches (a sparse autoencoder sketch follows the readings below)
β€’ Bereska, Leonard, and Efstratios Gavves. "Mechanistic interpretability for AI safety--a review." arXiv preprint arXiv:2404.14082 (2024). TMLR 2024.
β€’ Sharkey, Lee, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill et al. "Open problems in mechanistic interpretability." arXiv:2501.16496.
β€’ Cunningham, Hoagy, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. "Sparse autoencoders find highly interpretable features in language models." ICLR 2024.
Week 13 · Nov 24 (M) · Practice
LLM Interpretability: Human Evaluation and Interactive Explanation
β€’ Shen, Hua, Chieh-Yang Huang, Tongshuang Wu, and Ting-Hao Kenneth Huang. "ConvXAI: Delivering heterogeneous AI explanations via conversations to support human-AI scientific writing." CSCW 2023 Demo.
β€’ Gebreegziabher, Simret Araya, Zheng Zhang, Xiaohang Tang, Yihao Meng, Elena L. Glassman, and Toby Jia-Jun Li. "Patat: Human-ai collaborative qualitative coding with explainable interactive rule synthesis." CHI 2023.
Week 13 · Nov 26 (W)
🦃 Thanksgiving Holiday - No Class
Week 14 · Dec 1 (M)
🦃 Thanksgiving Holiday - No Class
Week 14 · Dec 3 (W) · Practice
Paper Presentation: LLM Interpretability
🎯 Assignment 2 Due
Week 15 · Dec 8 (M) · Practice
Risks, Trust and Safety: Taxonomies and Benchmarks
β€’ Bengio, Yoshua, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari et al. "Managing extreme AI risks amid rapid progress." Science 2024.
Week 15 · Dec 10 (W) · Practice
Risks, Trust and Safety: Human-in-the-loop Auditing and Mitigation (a perturbation-audit sketch follows the reading below)
β€’ Prabhudesai, Snehal, Ananya Prashant Kasi, Anmol Mansingh, Anindya Das Antar, Hua Shen, and Nikola Banovic. "'Here the GPT made a choice, and every choice can be biased': How Students Critically Engage with LLMs through End-User Auditing Activity." CHI 2025.
Week 16 · Dec 15 (M)
🎓 Final Project Presentations
Week 16 · Dec 17 (W)
🎓 Final Project Presentations
Overview
Office Hours
Hua Shen: Tue/Thu, 11:00 AM - 12:00 PM; Office: S749; Zoom: https://nyu.zoom.us/my/hua.shen (Passcode: enQb9h)
Prerequisites
Students are expected to (1) be proficient in Python and relevant libraries (e.g., Hugging Face, the OpenAI API) to complete the assignments, and (2) know basic Machine Learning concepts (and, optionally, Human-Computer Interaction concepts) to comprehend the research papers.