On data trust in low-resource field contexts

The dataset says the village has 340 households. The enumerator counted 180. Both numbers are somewhere between wrong and right, and neither of them speaks Zarma.

Data quality frameworks tend to be written in English, designed in Geneva (or Washington, D.C.), and applied in places where the gap between the framework and the field is wide enough to lose a programme in. This is not a complaint about frameworks. It's an observation that the concept of "data trust" has a geography. How far you can rely on a piece of data to mean what it claims to mean differs between Niamey and Amsterdam. The tools we've built to assess it mostly don't account for that.

I want to work through what data trust actually means when you're operating at the edge of the system: intermittent connectivity, multilingual enumerator teams, communities that may have rational reasons for underreporting, administrative boundaries that don't map to lived reality, and an AI-assisted analysis pipeline that has never been trained on a single sentence in the local language. Each of those conditions is common. Together, they describe the normal operating environment for a large share of humanitarian and development data collection.

01

The language gap is not a footnote

Take Zarma. It's a Songhay language (a family often grouped under Nilo-Saharan, though that classification is disputed) spoken by roughly four million people across Niger, Benin, Burkina Faso, and Nigeria. It's the primary language of the Zarma people of western Niger and a widespread lingua franca in Niamey. If you're running a health survey in the Dosso or Tillabéri regions, there is a meaningful chance that the community member you're interviewing thinks primarily in Zarma, that the enumerator is working in French through a Zarma-French interpreter, and that the questionnaire was originally written in English. Three translation steps between the data and its meaning. None of them documented.

What exists in Zarma for a large language model, the kind increasingly embedded in analysis, translation, and summarisation tools used across the sector? Almost nothing. There is no substantial Zarma corpus in any major LLM training set. A search for Zarma-language content on Common Crawl, the web scrape that underpins most LLM pretraining, returns mostly metadata. The Bible has been translated into Zarma; there are some government documents; there are scattered linguistic studies. That is approximately the full extent of the written record that a model could learn from.

Fig. 01 · Estimated training corpus size by language

Rough orders of magnitude from publicly available sources. "Tokens" here approximates words. The disparity between high-resource languages and regional African languages is not a rounding error. It is structural.

This is not a Zarma-specific problem. It's the shape of the whole landscape. Hausa has more written material, partly because it has a long literary tradition and a larger internet-connected population. Fulfulde (Fula) has some. Dendi, Tasawaq, Tuareg languages, Kanuri: essentially none in any usable digital form. The pattern holds broadly: the languages spoken by the communities most likely to be subjects of humanitarian data collection are the languages least represented in the tools increasingly used to process that data.

The practical consequence is subtle but serious. An LLM asked to translate a Zarma response will not translate it; it will hallucinate a plausible-sounding approximation based on related languages, French loanwords, and whatever structural patterns it can infer. It may be mostly right. It may introduce systematic bias in one direction. You will likely not be able to tell which, because you don't speak Zarma either, which is why you were using the LLM in the first place.

Zarma · field survey context
Ay ga nda bonbonante se.
ay ga nda bon-bon-ante se
"I come for the vaccination." (Literally: I am coming with/for the small-medicine thing.)
Zarma · expressing uncertainty about enumeration
Ay mana hincine ka bay.
ay mana hin-ci-ne ka bay
"I cannot say/count exactly." (A common response to household census questions that an untrained model may map to "I don't know" rather than the more specific epistemic hedge it represents.)
Zarma · kinship and household structure
Ay kwaara ga bara hinka.
ay kwaara ga bara hin-ka
"My compound has two entrances." (Often used to signal two semi-independent households under one roof. This distinction matters enormously for census work, and maps poorly onto standard survey categories.)

That third example is the one that keeps me up at night. The distinction between a household and a compound in Sahelian contexts is not a translation problem. It is a conceptual problem that the data model itself cannot represent. The survey says "household." The community says "kwaara." Those are not the same thing. When an enumerator maps one onto the other, they're not being sloppy; they're doing the best they can with a framework that wasn't designed for this context. But the mapping introduces error that is invisible in the final dataset.
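One way to keep that mapping visible, rather than buried in an enumerator's judgement call, is to record the compound and its cooking fires separately and apply the "household" rule afterwards. The following is only a sketch: it assumes cooking fires serve as the household proxy (as described in note 4), and every class and field name is hypothetical rather than taken from any survey standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hearth:
    """One cooking fire, used here as the proxy for a separate household."""
    members: int


@dataclass
class Compound:
    """A kwaara recorded as the community describes it, before any mapping."""
    compound_id: str
    hearths: List[Hearth] = field(default_factory=list)
    enumerator_note: str = ""  # free text, e.g. the respondent's own wording

    def household_count(self, rule: str) -> int:
        """Apply an explicit, named mapping rule instead of an implicit one."""
        if rule == "compound":   # one kwaara counts as one household
            return 1
        if rule == "hearth":     # one cooking fire counts as one household
            return len(self.hearths)
        raise ValueError(f"unknown mapping rule: {rule}")


# Hypothetical compound with two cooking fires.
kwaara = Compound(
    "K-001",
    [Hearth(members=7), Hearth(members=5)],
    enumerator_note="two entrances",
)
print(kwaara.household_count("compound"))  # 1
print(kwaara.household_count("hearth"))    # 2
```

The same compound yields one household or two depending on the rule. Storing the raw structure means the choice of rule is at least recorded, and can be revisited, instead of being made silently in the field.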

02

Why communities underreport, and why that's rational

Data trust runs in two directions. There's the question of whether we can trust the data. There's also the question of whether the community trusts us: specifically, whether they trust that providing accurate information is in their interest. In many field contexts, it isn't, or at least it hasn't been historically.

The literature on this is consistent and underacknowledged. Households underreport assets when they associate surveys with taxation or means-testing. They underreport children when birth registration has historically been used for conscription. They overreport displacement when they associate it with aid eligibility. They underreport health symptoms when they fear quarantine. None of this is irrational. It is a calibrated response to the actual incentive structures they have experienced. And it produces data that is systematically biased in ways that are very difficult to detect from the dataset alone.

"The community isn't lying to you. They're being rational about information disclosure in a context where disclosure has historically had unpredictable consequences. That's not a data quality problem. It's a trust deficit accumulated over decades."

Standard data quality checks (outlier detection, internal consistency, enumerator clustering) are designed to catch errors of measurement. They are not designed to catch systematic and coherent misreporting by respondents who have agreed on what to say. The two look completely different in the data. The first produces noise; the second produces a clean, consistent, wrong dataset.
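A toy simulation makes the contrast concrete. Everything in it is invented: a hypothetical village of 200 households, a simple z-score outlier rule standing in for "standard QC", three keying errors in one scenario, and in the other an assumption that every household reports 60% of its true livestock count.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

# Invented "true" livestock counts for 200 households.
true_counts = [random.gauss(20, 4) for _ in range(200)]

# Scenario 1: measurement error. Three entries are mis-keyed as 200.
noisy = list(true_counts)
for i in (3, 57, 120):
    noisy[i] = 200.0

# Scenario 2: coherent misreporting. Everyone reports 60% of the truth.
coherent = [0.6 * x for x in true_counts]


def flag_outliers(values, z_threshold=3.0):
    """Return the indices whose z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > z_threshold]


noisy_flags = flag_outliers(noisy)
baseline_flags = flag_outliers(true_counts)
coherent_flags = flag_outliers(coherent)
bias_ratio = statistics.fmean(coherent) / statistics.fmean(true_counts)

print("flags with keying errors:     ", noisy_flags)
print("flags with truthful reporting:", baseline_flags)
print("flags with coherent 60%:      ", coherent_flags)
print("reported mean / true mean:    ", round(bias_ratio, 3))
```

The keying errors are caught. The uniformly shaded-down reports trigger exactly the flags that truthful data would, because scaling every value leaves the z-scores unchanged, while the reported mean sits 40% below the true one.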

Fig. 02 · Typical reporting bias direction by incentive context

Generalised patterns from field literature. Direction and magnitude vary significantly by context, programme history, and community-enumerator relationship. The "detectable by standard QC" column is the uncomfortable one.

03

The administrative boundary problem

A third layer. Administrative boundaries used in data collection are almost always inherited from colonial-era administrative divisions, subsequently reorganised by post-independence governments, and then frozen into shapefiles that haven't been updated since the last GIS officer left. They may or may not correspond to how people in the area think about where they live, who their community is, or where they go to access services.

In Niger, the commune (municipality) is the primary unit of local government and the unit at which most humanitarian data is aggregated. A commune in the Sahel can cover thousands of square kilometres and contain communities that have nothing in common except a shared administrative classification. The health data says "commune of X has 12% acute malnutrition." That number is an average across a geography that may contain both settled farming communities near the river and mobile pastoral communities in the bush, with different access patterns, different risk profiles, and different relationships to the enumeration process. Aggregating them tells you something. It might not tell you what you think it tells you.
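A short piece of arithmetic shows how a single commune-level figure can sit on top of very different subgroup rates. The group sizes and case counts below are made up purely to reproduce a 12% average; they are not drawn from any survey.

```python
# Hypothetical commune with two groups (all figures invented for illustration).
groups = {
    "settled farming, near the river": {"screened": 800, "cases": 64},
    "mobile pastoral, in the bush": {"screened": 200, "cases": 56},
}

total_screened = sum(g["screened"] for g in groups.values())
total_cases = sum(g["cases"] for g in groups.values())
commune_rate = total_cases / total_screened

# Per-group rates, computed from the same counts.
group_rates = {name: g["cases"] / g["screened"] for name, g in groups.items()}

for name, rate in group_rates.items():
    print(f"{name}: {rate:.0%}")
print(f"commune average: {commune_rate:.0%}")
```

With these invented numbers the commune reports 12%, while one group is at 8% and the other at 28%. The average is arithmetically correct and still describes neither community.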

Fig. 03 · Where trust can break in a standard data pipeline

Each node is a potential point of attrition. The further right you go, the harder it is to trace a degraded signal back to its source.

04

What would actually help

Some things are structural and slow. Building Zarma corpora takes years of community-led transcription, linguist involvement, and sustained funding. That work is happening in places: the pan-African Masakhane community, ORCAS in the Sahel, individual academic linguistics departments across francophone Africa. But it is chronically underfunded relative to the tools that will eventually consume it. Talking about the language gap without talking about who pays for closing it is an exercise in gesturing at problems.

Some things are operational and more tractable now. Back-translation as a standard practice, not a spot-check: if your questionnaire went English to French to Zarma, route a sample back through Zarma to French to English and see what arrived. It won't catch everything, but it catches conceptual drift. Community validation sessions that specifically surface the category-mismatch problem: not "did we count you right" but "does our word for household mean the same thing as yours." Documentation of the translation chain in the metadata, so that anyone using the dataset downstream knows how many hands the concepts passed through.
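Documenting the translation chain can be as light as a small structured record attached to each question. The sketch below shows one possible shape. The field names and the question ID are hypothetical, not a metadata standard, and the similarity score uses Python's difflib only as a crude triage signal for back-translated text, not as a substitute for someone who speaks the language.

```python
from dataclasses import dataclass, asdict
from difflib import SequenceMatcher
import json


@dataclass
class TranslationStep:
    source_lang: str
    target_lang: str
    method: str       # e.g. "written translation", "oral interpretation"
    documented: bool  # was the wording used at this step recorded anywhere?


def chain_summary(question_id, steps):
    """Build a metadata record of how a question reached the respondent."""
    if not steps:
        raise ValueError("at least one translation step is required")
    return {
        "question_id": question_id,
        "languages": [steps[0].source_lang] + [s.target_lang for s in steps],
        "n_steps": len(steps),
        "undocumented_steps": sum(not s.documented for s in steps),
        "steps": [asdict(s) for s in steps],
    }


def drift_score(original, back_translated):
    """Surface similarity in [0, 1] between the original and its back-translation."""
    return SequenceMatcher(None, original.lower(), back_translated.lower()).ratio()


steps = [
    TranslationStep("English", "French", "written translation", documented=True),
    TranslationStep("French", "Zarma", "oral interpretation", documented=False),
]
record = chain_summary("Q12_household_size", steps)
print(json.dumps(record, indent=2))

score = drift_score(
    "How many people live in your household?",
    "How many people live in your compound?",
)
print(round(score, 2))
```

The second call is a caution as much as a demonstration. A back-translation that swaps "household" for "compound" still scores high on surface similarity, even though that swap is exactly the conceptual drift that matters. A string comparison can prioritise what a human reviews; it cannot replace the review.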

The most useful thing may be the most unfashionable: building relationships with community informants who can flag when a survey category does not map onto local reality, and treating that feedback as a data quality input rather than an implementation obstacle. This requires time and trust, neither of which is abundant in emergency response timelines. But the alternative is a dataset that looks clean and has biases baked in at the conceptual level, which is a worse outcome than a messier dataset that at least knows what it doesn't know.

05

A note on what AI tools can and cannot do here

The current generation of LLMs is genuinely useful for a range of tasks in field data work: restructuring messy exports, writing query logic, summarising secondary literature, drafting reports from structured data. These are real gains and I use them.

They are not useful, and are potentially harmful, for tasks that require reliable performance in low-resource languages or culturally specific conceptual mapping. A model that has seen ten billion words of English and approximately 200,000 words of Zarma (a generous estimate) does not have comparable competence in both. It has deep competence in one and a thin, structurally inferred approximation in the other. Using it for Zarma translation is not like using a slightly less capable version of the English translation. It's a different category of uncertainty, and one that the model itself is not reliably calibrated to signal.

That uncertainty needs to be upstream in the workflow, not downstream. The question is not whether the translation looks right; it's whether anyone in the loop is in a position to know if it's wrong. In most field data pipelines, the honest answer is no. Treating AI-assisted analysis in low-resource language contexts as a decision-support tool rather than a decision tool is not excessive caution. It is accurate calibration to what the tool can actually do.
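In workflow terms, putting the uncertainty upstream could mean deciding how an output may be used before the model is ever called. What follows is a minimal sketch under stated assumptions: a hypothetical list of languages the team has judged the tool to handle reliably, and a flag for whether a fluent reviewer is actually in the loop. None of the names come from a real tool.

```python
from enum import Enum


class Use(Enum):
    ROUTINE = "routine use with the team's normal checks"
    DECISION_SUPPORT = "advisory only; a fluent reviewer must confirm"
    DO_NOT_USE = "no one in the loop can tell if the output is wrong"


# Hypothetical: languages the team has judged the model to handle reliably.
WELL_SUPPORTED = {"English", "French"}


def allowed_use(language: str, fluent_reviewer_available: bool) -> Use:
    """Decide, before calling the model, how its output may be used."""
    if language in WELL_SUPPORTED:
        return Use.ROUTINE
    if fluent_reviewer_available:
        return Use.DECISION_SUPPORT
    return Use.DO_NOT_USE


print(allowed_use("French", fluent_reviewer_available=False).name)  # ROUTINE
print(allowed_use("Zarma", fluent_reviewer_available=True).name)    # DECISION_SUPPORT
print(allowed_use("Zarma", fluent_reviewer_available=False).name)   # DO_NOT_USE
```

The rule is deliberately blunt. Its only job is to force the question of who can check the output to be answered before the output exists.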

Notes
1. Zarma (also spelled Djerma or Zerma) is part of the Songhay language family. Written resources are sparse but growing: the SIL International Zarma dictionary, the CNRE (Centre National des Ressources Educationnelles) in Niger, and a handful of academic grammars are the primary sources. The language examples above are simplified for illustrative purposes and may not reflect all dialectal variation.
2. The Masakhane Research Foundation is one of the most important organisations working on African NLP and low-resource language datasets. Their work spans over 50 African languages. The disparity in corpus sizes shown in Fig. 01 is derived from estimates published in the "State of African Languages in NLP" literature, including the AfricaNLP workshop proceedings at ACL/EMNLP.
3. The reporting bias patterns in Fig. 02 draw on the literature on strategic survey response in development contexts, including Deaton (1997) on measurement in poor countries, and more recent field experiment work on respondent behaviour in humanitarian assessments. The specific patterns are stylised; the literature on this is much more varied and context-specific than any table can represent.
4. The kwaara / household distinction is real and discussed in the literature on Sahelian household surveys. A kwaara (compound) in Zarma social structure typically houses an extended patrilineal family unit and may contain multiple cooking fires, the usual proxy for a separate household. Different enumerators resolve this differently, producing inconsistency even within a single survey round.