Language Documentation, Ethics, and Artificial Intelligence: Technical-Ethical Challenges for Minority Languages
Keywords:
language documentation, low-resource languages, machine learning, dataset datasheets, model cards, data provenance, informed consent, indigenous data governance
Abstract
The digital preservation of endangered and low-resource languages (LRLs) increasingly intersects with the training and deployment of large multilingual machine-learning models. This intersection raises technical and ethical challenges that are amenable to empirical and engineering treatment, not only to normative debate. In this paper we (1) synthesize relevant literature from language documentation and machine-learning transparency practices; (2) identify four measurable problem domains: dynamic consent, provenance traceability, layered rights (individual vs. collective), and benefit allocation; and (3) propose a program of testable technical interventions (machine-readable provenance records, dataset datasheets, model cards, and prototype dynamic-consent mechanisms) together with experimental designs to evaluate their efficacy. Our contribution is methodological and operational: we reframe ethical requirements as concrete engineering and evaluation tasks that archives, researchers, and model developers can implement and measure. We conclude with a prioritized research agenda and practical recommendations for archival practice and model documentation.
License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.