3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS)
6 Dec 2023 @ EMNLP 2023 in Singapore
With great scientific breakthrough comes solid engineering and open communities. The Natural Language Processing (NLP) community has benefited greatly from the open culture in sharing knowledge, data, and software. The primary objective of this workshop is to further the sharing of insights on the engineering and community aspects of creating, developing, and maintaining NLP open source software (OSS), which we seldom talk about in scientific publications. Our secondary goal is to promote synergies between different open source projects and encourage cross-software collaborations and comparisons.
There are many workshops focusing on the creation and curation of open language resources and annotations (e.g. BUCC, GWN, LAW, LOD, WAC). Moreover, we have the flagship LREC conference dedicated to linguistic resources. However, the engineering aspects of NLP-OSS are overlooked and under-discussed within the community. There are open source conferences and venues (such as FOSDEM, OSCON, Open Source Summit) where discussions range from operating system kernels to air traffic control hardware, but the representation of NLP-related presentations is limited. In the Machine Learning (ML) field, the Journal of Machine Learning Research - Machine Learning Open Source Software (JMLR-MLOSS) is a forum for discussions and dissemination of ML OSS topics. We envision that the Workshop for NLP-OSS becomes a similar avenue for NLP-OSS discussions.
Recently, the successful BigScience workshop series has examined and promoted open science in NLP. While important and complementary, the goals of BigScience are distinct from those of NLP-OSS, which focuses more on the community of practice in open-source software in support of NLP and language technologies. We expect many who participated in the BigScience workshop to participate in NLP-OSS, as many of the participants are former PC members of past editions of NLP-OSS. Another grassroots community movement, EleutherAI, started with researchers attempting to replicate commercial language models and has since grown into an active decentralized community of volunteer researchers, engineers, and developers focused on AI alignment, scaling, and open source AI research.
With the rise of open source startups like Hugging Face, the democratization of NLP gives researchers and the general public easy access to language models once available only to a handful of industrial research labs. This accelerating availability of NLP tools creates new synergies through cloud integrations, e.g. Hugging Face x AWS SageMaker, that allow engineers and researchers to train and deploy live applications with minimal infrastructure setup. Standing on the shoulders of giants, the scikit-learn and Hugging Face ecosystems are now interoperable under the skops framework. We want to highlight these emergent communities and synergies in the third NLP-OSS workshop and promote future collaborations among like-minded open source NLP researchers. We hope that the NLP-OSS workshop could also be hosted in an *ACL conference, and be the intellectual forum to collate this type of knowledge, announce new software/features, and promote the open source culture and OSS best practices.
Call for Papers
We invite full papers (8 pages) or short papers (4 pages) on topics related to NLP-OSS broadly categorized into (i) software development, (ii) scientific contribution and (iii) NLP-OSS case studies.
- Software Development
- Designing and developing NLP-OSS
- Licensing issues in NLP-OSS
- Backward compatibility and stale code in NLP-OSS
- Growing, maintaining and motivating an NLP-OSS community
- Best practices for NLP-OSS documentation and testing
- Contribution to NLP-OSS without coding
- Incentivizing OSS contributions in NLP
- Commercialization and Intellectual Property of NLP-OSS
- Defining and managing NLP-OSS project scope
- Issues in API design for NLP
- NLP-OSS software interoperability
- Analysis of the NLP-OSS community
- Scientific Contribution
- Surveying OSS for specific NLP task(s)
- Demonstrations, introductions and/or tutorials of NLP-OSS
- Small but useful NLP-OSS
- NLP components in ML OSS
- Citations and references for NLP-OSS
- OSS and experiment replicability
- Gaps between existing NLP-OSS
- Task-generic vs task-specific software
- Case studies
- Case studies of how a specific bug is fixed or feature is added
- Writing wrappers for other NLP-OSS
- Writing open-source APIs for open data
- Teaching NLP with OSS
- NLP-OSS in the industry
Submission information
Authors are invited to submit a
- Full paper up to 8 pages of content or
- Short paper up to 4 pages of content
Submissions can be non-archival and be presented in the NLP-OSS workshop, but we still require at least a 4-page submission so that reviewers have enough information to make the acceptance/rejection decision. This non-archival option is helpful for authors who want to publish or have published the work elsewhere and would like to present and discuss pertinent NLP-OSS related work with the workshop PCs and attendees.
All papers are allowed unlimited but sensible pages for references. Final camera-ready versions will be allowed an additional page of content to address reviewers’ comments.
Due to the nature of open source software, we find it a bit tricky to “anonymize” “open source”. For this reason, we don’t require your publication to be anonymous. However, if you prefer your paper to be anonymized, please mask any identifiable phrase with REDACTED.
Submissions should be formatted according to the EMNLP 2023 LaTeX or MS Word templates at https://2023.emnlp.org/calls/style-and-formatting/
Submissions should be uploaded to the OpenReview conference management system at https://openreview.net/group?id=EMNLP/2023/Workshop/NLP-OSS
Important dates
The 3rd NLP-OSS workshop will be co-located with the EMNLP 2023 conference.
- Paper Submission: 09 August 2023
- Paper Reviews Start: 25 August 2023
- Paper Reviews Due: 01 October 2023
- Notification of Acceptance: 10 October 2023
- Camera-Ready Version: 25 October 2023
- Workshop: 6 Dec 2023
Invited Speakers
trlX: A Framework for Large Scale Open Source RLHF
Louis Castricato
Reinforcement learning from human feedback (RLHF) utilizes human feedback to better align large language models with human preferences via online optimization against a learned reward model. Current RLHF paradigms rely on Proximal Policy Optimization (PPO), which quickly becomes a challenge to implement and scale up to large architectures. To address this difficulty we created the trlX library as a feature-complete open-source framework for RLHF fine-tuning of models up to and exceeding 70 billion parameters. This talk presents the trlX implementation, which supports multiple types of distributed training including distributed data parallel, model sharded, as well as tensor, sequential, and pipeline parallelism.
Bio
Louis Castricato is a research scientist at EleutherAI, working on RLHF infrastructure and engineering. Previously, Louis was head of LLMs at Stability AI and team lead at CarperAI, the largest open source RLHF group, as well as a PhD student at Brown University.
Southeast Asia LLMs: SEA-LION and Wangchan-LION
David Tat-Wee Ong and Peerat Limkonchotiwat
SEA-LION (Southeast Asian Languages In One Network) is a family of multilingual LLMs that is specifically pre-trained and instruct-tuned for the Southeast Asian (SEA) region, incorporating a custom SEABPETokenizer which is specially tailored for SEA languages. The first part of this talk will cover our design philosophy and pre-training methodology for SEA-LION. The second part of this talk will cover PyThaiNLP’s work on Wangchan-LION, an instruct-tuned version of SEA-LION for the Thai community.
Bio
David Tat-Wee Ong is presently the Head of Engineering in AI Singapore’s (AISG) Products pillar, managing a team of software engineers to support AISG’s Products research implementation. He holds an M.Sc. in Computer Science and an M.Sc. in Financial Engineering from NUS.
Bio
Peerat Limkonchotiwat is a Ph.D. student in information science and technology (IST) at VISTEC, Thailand. He contributes to the WangchanX project, a Thai NLP group developing applications, whose work comprises Thai sentence embedding benchmarks, Thai text processing datasets (VISTEC-TPTH-2021 and NNER-TH), and the WangchanGLM and Wangchan-Sealion generative models.
Towards Explainable and Accessible AI
Brandon Duderstadt and Yuvanesh Anand
Large language models (LLMs) have recently achieved human-level performance on a range of professional and academic benchmarks. Unfortunately, the explainability and accessibility of these models have lagged behind their performance. State-of-the-art LLMs require costly infrastructure, are only accessible via rate-limited, geo-locked, and censored web interfaces, and lack publicly available code and technical reports. Moreover, the lack of tooling for understanding the massive datasets used to train and produced by LLMs presents a critical challenge for explainability research. This talk will be an overview of Nomic AI’s efforts to address these challenges through its two core initiatives: GPT4All and Atlas.
Bio
Brandon Duderstadt is the founder and CEO of Nomic AI, a startup whose mission is to improve the explainability and accessibility of AI. His experiences at Rad AI and Johns Hopkins convinced him of both the profound impact that this new wave of AI technology would have and the need to improve the explainability and accessibility of AI models.
Bio
Yuvanesh Anand is a freshman computer science student at the Virginia Institute of Technology with broad interests in natural language processing and open-source software development. While still in high school, Yuvanesh joined Nomic AI as a software engineering intern, where he led the data collection and early development of the GPT4All project.
Workshop Program
The program schedule below is in Singapore Time (GMT +8).
09:00 - 09:15 Opening Remarks
09:15 - 10:15 Secret Breakfast!
10:30 - 11:00 Coffee Break
11:00 - 11:30 Lightning Session 1
11:30 - 12:15 Poster Session 1
An Open-source Web-based Application for Development of Resources and Technologies in Underresourced Languages
Siddharth Singh, Shyam Ratan, Neerav Mathur and Ritesh Kumar
AWARE-TEXT: An Android Package for Mobile Phone Based Text Collection and On-Device Processing
Salvatore Giorgi, Garrick Sherman, Douglas Bellew, Sharath Chandra Guntuku, Lyle Ungar and Brenda Curtis
Beyond the Repo: A Case Study on Open Source Integration with GECToR
Sanjna Kashyap, Zhaoyang Xie, Kenneth Steimel and Nitin Madnani
calamanCy: A Tagalog Natural Language Processing Toolkit
Lester James Validad Miranda
Deepparse: A State-Of-The-Art Library for Parsing Multinational Street Addresses
David Beauchemin
Empowering Knowledge Discovery from Scientific Literature: A novel approach to Research Artifact Analysis
Petros Stavropoulos, Ioannis Lyris, Natalia Manola, Ioanna Grypari, Haris Papageorgiou
Kani: A Lightweight and Highly Hackable Framework for Language Model Applications
Andrew Zhu, Liam Dugan, Alyssa Hwang and Chris Callison-Burch
nanoT5: Fast & Simple Pre-training and Fine-tuning of T5 Models with Limited Resources
Piotr Nawrot
PyThaiNLP: Thai Natural Language Processing in Python
Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip and Can Udomcharoenchaikit
SOTASTREAM: A Streaming Approach to Machine Translation Training
Matt Post, Thamme Gowda, Roman Grundkiewicz, Huda Khayrallah, Rohit Jain and Marcin Junczys-Dowmunt
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo and Nghi D. Q. Bui
12:15 - 13:45 Lunch Break
13:45 - 14:45 Invited Talk - SEA-LION (Southeast Asian Languages In One Network): A Family of Southeast Asian Language Models
David Ong and Peerat Limkonchotiwat
14:45 - 15:15 Lightning Session 2
15:15 - 15:30 Coffee Break
15:30 - 16:15 Poster Session 2
Antarlekhaka: A Comprehensive Tool for Multi-task Natural Language Annotation
Hrishikesh Terdalkar and Arnab Bhattacharya
DeepZensols: A Deep Learning Natural Language Processing Framework for Experimentation and Reproducibility
Paul Landes, Barbara Di Eugenio and Cornelia Caragea
EDGAR-CRAWLER: Automating Financial Data Harvesting and Preprocessing for NLP
Lefteris Loukas, Manos Fergadiotis and Prodromos Malakasiotis
GPT4All: An Ecosystem of Open Source Compressed Language Models
Yuvanesh Anand, Zach Nussbaum, Adam Treat, Aaron Miller, Richard Guo, Benjamin M Schmidt, Brandon Duderstadt and Andriy Mulyar
GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings
Fu Bang
Improving NER Research Workflows with SeqScore
Constantine Lignos, Maya Kruse and Andrew Rueda
Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models
Michael Günther, Georgios Mastrapas, Bo Wang, Han Xiao and Jonathan Geuter
LaTeX Rainbow: Open Source Document Layout Semantic Annotation Framework from LaTeX to PDF
Changxu Duan and Sabine Bartsch
nerblackbox: A High-level Library for Named Entity Recognition in Python
Felix Stollenwerk
News Signals: An NLP Library for Text and Time Series
Chris Hokamp, Demian Gholipour Ghalandari and Parsa Ghaffari
PyTAIL: An Open Source Tool for Interactive and Incremental Learning of NLP Models with Human in the Loop for Online Data
Shubhanshu Mishra, Jana Diesner
Rumour Detection in the Wild: A Browser Extension for Twitter
Andrej Jovanovic and Björn Ross
Two Decades of the ACL Anthology: Development, Impact, and Open Challenges
Marcel Bollmann, Nathan Schneider, Arne Köhn and Matt Post
torchdistill Meets Hugging Face Libraries for Reproducible, Coding-free Deep Learning Studies: A Case Study on NLP
Yoshitomo Matsubara
Using Captum to Explain Generative Language Models
Vivek Miglani, Aobo Yang, Aram H. Markosyan, Diego Garcia-Olano and Narine Kokhlikyan
16:15 - 17:15 Invited Talk - Towards Explainable and Accessible AI
Brandon Duderstadt and Yuvanesh Anand
17:15 - 17:30 Closing Remarks
Note: Please refer to the “Hybrid In-Person & Virtual Workshop Format” section for instructions on how to access both the in-person and virtual workshop content.
Organizers
- Geeticka Chauhan, Massachusetts Institute of Technology
- Dmitrijs Milajevs, Grayscale AI
- Elijah Rippeth, University of Maryland
- Jeremy Gwinnup, Air Force Research Laboratory
- Liling Tan, Amazon
Programme Committee
- Aakanksha Naik, Allen Institute for Artificial Intelligence
- Abhinav Arora, Meta AI
- Aitor Soroa, HiTZ Center, University of the Basque Country UPV/EHU
- Akintunde Oladipo, University of Waterloo
- Alexander Rush, Cornell / Hugging Face
- Aline Paes, Universidade Federal Fluminense
- Amittai Axelrod, Apple AI
- Anish Mohan, NVIDIA
- Arun Balajiee Lekshmi Narayanan, University of Pittsburgh
- Atnafu Lambebo Tonja, University of Colorado
- Atul Kr. Ojha, University of Galway
- Cassandra Jacobs, University at Buffalo
- Christoph Teichmann, Bloomberg L.P.
- Daniel Braun, University of Twente
- David M. Howcroft, Edinburgh Napier University
- Fabio Kepler, Unbabel
- Flammie a Pirinen, UiT The Arctic University of Norway
- Francis Bond, Palacký University Olomouc
- Gérard Dupont, Mavenoid
- Guillaume Becquin, Bloomberg L.P.
- Ignatius Ezeani, Lancaster University
- Jana Götze, University of Potsdam
- Jack Morris, Cornell University
- John X. Morris, Cornell University
- Jörg Tiedemann, University of Helsinki
- Kamile Lukosiute, Anthropic
- Karin Sim, Language Weaver
- Kevin Cohen, University of Colorado
- Lane Schwartz, University of Alaska Fairbanks
- Leo Boytsov, Amazon
- Lucy Park, Upstage
- Maarten van Gompel, Radboud University
- Maheshwar Ghankot, Indian Space Research Organization
- Mallika Singh, Boston Children’s Hospital - Harvard Medical School
- Marcel Bollmann, Linköping University
- Marco Cognetta, Tokyo Institute of Technology, Google
- Marzieh Fadaee, Cohere for AI
- Matt Post, Microsoft
- Micah Shlain, Allen Institute for Artificial Intelligence
- Michael Wayne Goodman, LivePerson Inc.
- Mohd Sanad Zaki Rizvi, University of Edinburgh
- Nelson F. Liu, Stanford University
- Nitin Madnani, Educational Testing Service
- Ogundepo Odunayo, University of Waterloo
- Pasquale Lisena, EURECOM
- Philipp Koehn, Johns Hopkins University
- Phu Mon Htut, AWS AI Labs
- Raeid Saqur, Princeton University
- Raphael Tang, Comcast
- Sagnik Ray Choudhury, University of Michigan
- Shilpa Suresh, Boston Children’s Hospital - Harvard Medical School
- Shubhanshu Mishra, Shubhanshu.com
- Sina Ahmadi, George Mason University
- Steve DeNeefe, Language Weaver - RWS Group
- Steven Bethard, University of Arizona
- Sudhakar Singh, NVIDIA
- Taha Zerrouki, Bouira University Algeria
- Tenzin Singhay Bhotia, Georgia Institute of Technology
- Thomas Kober, Zalando
- Tomas Mikolov, Czech Institute of Informatics
- Tommaso Teofili, Roma Tre University
- Vijay Murari Tiyyala, Johns Hopkins University
- Vlad Niculae, University of Amsterdam
- Won Ik Cho, Samsung Advanced Institute of Technology
- Zaid Alyafeai, King Fahad University of Petroleum and Minerals
- Ziv Litmanovitz, University of Haifa
Previous Workshops
Second Workshop for Natural Language Processing Open Source Software (NLP-OSS 2020)
[Proceedings]
First Workshop for Natural Language Processing Open Source Software (NLP-OSS 2018)
[Proceedings]
Hybrid In-Person & Virtual Workshop Format
The most up-to-date and canonical schedule for the workshop will be at https://nlposs.github.io/2023/index.html#workshop-program and all sessions will be listed in Singapore time (GMT +8).
To access the virtual conference,
- Attendees should have at least one author registered through the EMNLP registration page at https://2023.emnlp.org/registration/
- The main page to access the virtual conference is https://virtual2023.emnlp.org/
- The main page to access the content of the NLP-OSS workshop is https://underline.io/events/431/sessions?searchGroup=lecture&eventSessionId=16447
During the workshop, all live in-person sessions will be streamed through the Zoom link on https://underline.io/events/431/sessions?searchGroup=lecture&eventSessionId=16447 (through the “Join Live Session” button).
Live sessions include (i) opening and closing remarks, (ii) invited talks (presenter will be in person but streamed live), and (iii) lightning sessions. To keep our schedule on time, we’ll try to keep a strict maximum of 5 minutes per presentation in the lightning sessions.
Then during the poster sessions,
- In-person presenters will proceed to the poster boards assigned by the workshop organisers.
- Virtual presenters will log in to their individual channels in https://underline.io/events/431/sessions?searchGroup=lecture&eventSessionId=16447 and interact through chat or video call (we are seeing that the video call option on underline.io isn’t working yet; we’ll reconfirm with the conference admin asap)
Thank you everyone again for the various procedures and considerations taken to make sure that both in-person and virtual attendees get the most out of the workshop. This is our first time organising a hybrid-mode workshop, and we ask for your understanding for any inconvenience caused.