I am Harsh Raj, currently pursuing a Master of Science in Computer Science at Northeastern University, Boston. I am also the Co-Founder of the open-source organization Ontocord AI and a Researcher at the AI Risk and Vulnerability Alliance (ARVA). Previously, I worked as an Applied Scientist at VijilAI, where I focused on making AI agents trustworthy.
I am passionate about making language models safe, useful, and controllable. Over the past few years, I have taken my first steps as a researcher, thanks to some wonderful collaborators and mentors.
I am currently working with David Bau on understanding reasoning models through mechanistic interpretability. Most recently, I co-led the Preventing Adversarial Reward Optimization project (in collaboration with Dom) at AI Safety Camp. As an Applied Scientist at VijilAI, I worked alongside Leif to develop a database of red-teaming prompts.
Previously, I collaborated with Subho, Dom, and Vipul on evaluating and improving the consistency of language models.
Through the MLC community, I was fortunate to work with Yash and Laura on quantifying the robustness transfer from pretraining to downstream tasks.
News and Timeline
2024
- December: Our work on Improving Consistency in Large Language Models through Chain of Guidance was accepted to Transactions on Machine Learning Research (TMLR).
- December: Our work on Mitigating Unsafe Feedback with Learning Constraints was accepted for a poster presentation at the AAAI-25 Workshop on Artificial Intelligence for Cyber Security.
- September: Published my first LessWrong blog post, on Interpreting the effects of Jailbreaking in LLMs.
- June: Released the preprint of our work on Reverse Preference Attack, led by Domenic.
- January: Joined VijilAI as an Applied Scientist.
- January: Our work on Vision-and-Language Navigation ranked 3rd on the R2R leaderboard (team name: MLR_Lab_DTU).
2023
- December: Presented our work on robustness transfer at NeurIPS 2023 in New Orleans.
- May: Our work on robustness transfer, led by Laura and mentored by Yash, was accepted to NeurIPS 2023.
2022
- November: Our work on consistency evaluation won the Best Paper Award, which came with a $5,000 cash prize.
- April: Two papers accepted to NeurIPS 2022 workshops: one on consistency evaluation and another on evaluating the robustness of biomedical concept normalization.