Problem
Hausa is the second most spoken language in Africa, with an estimated 64 million speakers. Despite its widespread use, natural language processing (NLP) for Hausa remains significantly underdeveloped. HausaNLP is an open-source community of academics, researchers, students, ML engineers, and NLP enthusiasts dedicated to democratizing Hausa natural language processing by developing Hausa language resources, promoting natural language processing research, and advancing collaboration among relevant stakeholders.
Solution
We collaborated with HausaNLP to build a comprehensive Hausa language data catalog, a centralized platform where researchers and developers can easily access datasets and models. The platform also allows users to upload and share their own contributions, ensuring that resources remain accessible, community-driven, and relevant to Hausa-speaking communities across West Africa and the diaspora.
Design
Our primary users are researchers who often need to create and curate specialized datasets, so we centered the design on what they need in order to find and share particular datasets. We designed a straightforward upload process that lets users add their datasets and tag them with key metadata for easy discovery in the catalog, and we implemented advanced filtering to help users quickly find what they need. To encourage contributions, we added a contributors page where users can choose to be acknowledged for their work on the platform.
Tech Stack
HausaNLP Data Catalog is a full-stack Next.js application that was designed and developed from scratch. The backend is built with MongoDB, Prisma, and tRPC for API calls, and the frontend is built with React, Tailwind CSS, and Material UI. TypeScript is used throughout the application. The catalog is fully deployed on Vercel, with the database hosted on MongoDB Atlas.
Features
Data Catalog
To help researchers quickly find the public datasets most relevant to their projects, this feature provides a structured way to browse, filter, and navigate dataset collections. Each dataset is listed with detailed metadata (name, link, year, language, collection style, size, unit, and associated task) so that users can compare options and move from high-level exploration to the dataset itself.
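As a rough illustration of how metadata-based filtering like this can work, the sketch below models a catalog entry with the fields listed above and filters a list of entries in memory. The type names, field names, the `filterDatasets` helper, and the sample entries are all assumptions made for this example; they are not taken from the project's code, and the real application may well filter in the database instead.

```typescript
// Hypothetical shape of a catalog entry, based on the metadata fields named above.
interface DatasetEntry {
  name: string;
  link: string;
  year: number;
  language: string;
  collectionStyle: string;
  size: number;
  unit: string;
  task: string;
}

// Every filter field is optional; an omitted field matches all entries.
interface DatasetFilter {
  year?: number;
  language?: string;
  task?: string;
  nameContains?: string;
}

function filterDatasets(entries: DatasetEntry[], filter: DatasetFilter): DatasetEntry[] {
  // Lower-case the search text once so the name match is case-insensitive.
  const needle = filter.nameContains === undefined ? undefined : filter.nameContains.toLowerCase();
  return entries.filter(
    (e) =>
      (filter.year === undefined || e.year === filter.year) &&
      (filter.language === undefined || e.language === filter.language) &&
      (filter.task === undefined || e.task === filter.task) &&
      (needle === undefined || e.name.toLowerCase().includes(needle))
  );
}

// Placeholder entries for illustration only; these are not real catalog records.
const sampleEntries: DatasetEntry[] = [
  { name: "Example Sentiment Set", link: "https://example.org/sentiment", year: 2021,
    language: "Hausa", collectionStyle: "crowdsourced", size: 1000, unit: "sentences",
    task: "sentiment analysis" },
  { name: "Example Parallel Corpus", link: "https://example.org/parallel", year: 2022,
    language: "Hausa-English", collectionStyle: "web-crawled", size: 5000, unit: "sentence pairs",
    task: "machine translation" },
  { name: "Example News Topics", link: "https://example.org/news", year: 2021,
    language: "Hausa", collectionStyle: "curated", size: 300, unit: "documents",
    task: "topic classification" },
];

// Combine two criteria: entries from 2021 that are tagged as Hausa.
const hausa2021 = filterDatasets(sampleEntries, { year: 2021, language: "Hausa" });
console.log(hausa2021.map((e) => e.name));
```

Treating each criterion as optional keeps the function usable for both a single search box and a full set of filter controls, since the caller only passes the fields the user has actually set.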
Dataset/Model Upload Form
This workflow provides a straightforward three-step process for contributors to upload their datasets: users fill in the necessary information about their dataset, submit the form, and then wait for an admin to review the contribution. Once the review is complete, the dataset can be approved and published on the website.
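The first two steps of this workflow (fill in the form, then submit) might be sketched as a validation function that either reports errors or produces a submission marked as pending review. The `UploadForm` fields, the validation rules, and the `submitUpload` name are assumptions for illustration; the actual form may collect different fields and apply different checks.

```typescript
// Hypothetical form fields; the real upload form may differ.
interface UploadForm {
  name: string;
  link: string;
  year: number;
  language: string;
  task: string;
}

// A validated submission waits in the "pending" state until an admin reviews it.
interface PendingSubmission extends UploadForm {
  status: "pending";
}

type SubmitResult =
  | { ok: true; submission: PendingSubmission }
  | { ok: false; errors: string[] };

function submitUpload(form: UploadForm): SubmitResult {
  const errors: string[] = [];
  if (form.name.trim() === "") errors.push("name is required");
  // Assumed rule: the dataset link must be an http or https URL without spaces.
  if (!/^https?:\/\/\S+$/.test(form.link)) errors.push("link must be an http(s) URL");
  // Assumed rule: the year is a positive whole number that is not in the future.
  const currentYear = new Date().getFullYear();
  if (!Number.isInteger(form.year) || form.year < 1 || form.year > currentYear) {
    errors.push("year must be a whole number no later than the current year");
  }
  if (form.language.trim() === "") errors.push("language is required");
  if (form.task.trim() === "") errors.push("task is required");

  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return { ok: true, submission: { ...form, status: "pending" } };
}
```

Returning all validation errors at once, rather than stopping at the first, lets a form show every problem in a single pass.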
Admin Dataset Review
Admins can review all pending dataset submissions and choose to either approve or deny them. When a submission is approved, the dataset is added to the catalog; when it is denied, the admin supplies a reason for the denial so the contributor knows what changes are needed.
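One way to model this approve-or-deny step is a small state transition in which a denial cannot be recorded without a reason. The sketch below is an assumption about how such a rule could be expressed in TypeScript; the `Submission` and `ReviewDecision` types and the `reviewSubmission` function are hypothetical names, not the project's actual API.

```typescript
// Hypothetical submission states; only a denied submission carries a reason.
type Submission =
  | { id: string; name: string; status: "pending" }
  | { id: string; name: string; status: "approved" }
  | { id: string; name: string; status: "denied"; reason: string };

// The type itself requires a reason whenever the action is "deny".
type ReviewDecision =
  | { action: "approve" }
  | { action: "deny"; reason: string };

function reviewSubmission(submission: Submission, decision: ReviewDecision): Submission {
  // Only pending submissions can be reviewed.
  if (submission.status !== "pending") {
    throw new Error("Submission " + submission.id + " has already been reviewed");
  }
  if (decision.action === "approve") {
    return { id: submission.id, name: submission.name, status: "approved" };
  }
  // Reject blank reasons so the contributor always learns what to change.
  const reason = decision.reason.trim();
  if (reason === "") {
    throw new Error("A denial must include a reason");
  }
  return { id: submission.id, name: submission.name, status: "denied", reason };
}
```

Encoding the reason in the `deny` variant means the compiler flags a denial without a reason, and the runtime check covers the remaining case of an empty or whitespace-only string.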
Admin Dashboard
The admin dashboard enables ongoing maintenance of the platform by letting admins review and update existing datasets in the catalog. It also provides an easy way to switch between pending submissions and the main catalog, allowing admins to manage new contributions and maintain published datasets in one place.
Statistics Page
The statistics page aggregates information from all current datasets in the HausaNLP catalog and breaks it down by attributes such as year and language. By compiling this data across the entire catalog, the page helps users quickly understand the distribution and characteristics of the datasets available on the platform.
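An aggregation of this kind can be sketched as a simple count of datasets per year and per language. The `DatasetSummary` type and the `countBy` and `summarizeCatalog` helpers below are illustrative assumptions; the real page may compute its figures differently, for example with a database aggregation.

```typescript
// Hypothetical minimal view of a dataset: only the fields being aggregated.
interface DatasetSummary {
  year: number;
  language: string;
}

// Count how many items share each key.
function countBy<T>(items: T[], key: (item: T) => string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of items) {
    const k = key(item);
    counts[k] = (counts[k] ?? 0) + 1;
  }
  return counts;
}

interface CatalogStats {
  total: number;
  byYear: Record<string, number>;
  byLanguage: Record<string, number>;
}

function summarizeCatalog(datasets: DatasetSummary[]): CatalogStats {
  return {
    total: datasets.length,
    byYear: countBy(datasets, (d) => String(d.year)),
    byLanguage: countBy(datasets, (d) => d.language),
  };
}
```

The same `countBy` helper could be reused for other metadata fields, such as task or collection style, by passing a different key function.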
Contributors Page
The contributors page highlights individuals who have added datasets to the HausaNLP data catalog and serves to recognize and acknowledge their work. Contributors are displayed on the page only if they provide consent, ensuring that recognition is both accurate and voluntary.
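The consent rule can be illustrated with a short sketch that builds the contributor list only from contributions whose author agreed to be shown. It assumes, purely for illustration, that consent is recorded per contribution in a `consentToDisplay` flag; the `Contribution` type and `listContributors` function are hypothetical names.

```typescript
// Hypothetical record linking a dataset to its contributor and their consent choice.
interface Contribution {
  contributorName: string;
  datasetName: string;
  consentToDisplay: boolean;
}

interface ContributorCredit {
  name: string;
  datasets: number;
}

function listContributors(contributions: Contribution[]): ContributorCredit[] {
  const counts = new Map<string, number>();
  for (const c of contributions) {
    // Skip anything the contributor did not agree to have displayed.
    if (!c.consentToDisplay) continue;
    counts.set(c.contributorName, (counts.get(c.contributorName) ?? 0) + 1);
  }
  // Sort alphabetically so the page order is stable.
  return Array.from(counts, ([name, datasets]) => ({ name, datasets })).sort((a, b) =>
    a.name.localeCompare(b.name)
  );
}
```

Filtering before counting means a contributor who declined recognition never appears in the output at all, not even with a zero count.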
About Us Page
The About Us page describes the HausaNLP website and its mission, giving visitors a clear sense of what the project aims to achieve. It also offers ways for users to stay connected and follow the project's work.