Surface Data Collective: First Steps
Artificial Intelligence has been advancing rapidly, especially language-based models. Everyone is releasing large models that produce impressive imitations of human behavior. OpenAI has its GPT-* line of language generation models. Google has BERT and the T5 suite of models, which can understand and generate all kinds of content. The work being accomplished is impressive and useful to a lot of people, as long as they use one of the supported languages.
But these models face a wide range of challenges. First, most of them support only around 100 languages. Even language detection models cover a fairly small set of languages relative to the 7,193 languages known to exist. When a language does get supported, these models typically train on whatever data is available on the internet: the suite of corpora on OPUS, research corpora like The Pile, or simply Wikipedia. These corpora only cover the content visible on the internet and leave out language variants we may consider mundane or not worth writing about, but which represent many of the ways we use language all the time.
Beyond that, the companies building these language models aim to make quite a lot of profit. Sam Altman, for example, has said OpenAI may make the world’s first trillionaire. Other companies may not be that ambitious, but they certainly expect to provide impressive returns to their venture capital investors. By design, these gains leave out the many people who helped create the datasets critical to these models. Much of The Pile is scraped from websites without tracking provenance. Wikipedia usually gets used, and while companies do help support Wikipedia, it’s rarely in proportion to how useful the dataset has been. Collectively, that leaves out every one of us who has helped produce data.
This is why the Surface Data Collective is starting. Speaking personally, I was previously part of Google Translate, where I helped build and gather datasets like those mentioned above. We aimed to move fast and move scalably. Rarely did we stop to check whether the data covered the languages users wanted or needed. Rarely did we ask whether we had captured only a biased version of a given language. And never did we ask how the data’s original creators should benefit from the models we create, other than, of course, getting access to a free service.
This effort is still brand new, so before getting too specific, let’s break down each part of this group’s name.
Going beyond the Surface
Every one of us has interacted with some kind of surface. The surface of our skin constantly comes into contact with clothes, tables, objects, and more. The surfaces of our tables hold the many things we place upon them. The surface of the water separates the water from the sky and keeps us afloat. Everything has a surface that marks the boundary where that thing ends and something else begins. But that surface is a fiction we’ve created because it simplifies life. Deep down, the surface of our skin is just a set of atoms arranged to push other things away. There’s a lot of empty space between those atoms. In the same way, there’s a lot of empty space between the datasets we create. They are just carefully aligned and arranged compilations of information. The same goes for companies: the surface of a company is really just a set of policies deciding who is in and who is out. What if we could go beyond that surface? What if we could decompose that surface?
This line of reasoning has fueled my ongoing love for the broad metaphor of surfaces. But why name a group after decomposing surfaces? The best way to improve or go beyond something is to fully understand all its aspects. You should never move a fence until you know why the fence exists. It’s why someone may name their child Narciso or Rahula: the names don’t represent who the person is, but serve as a reminder of qualities we all need to be cautious of.
What to do with all the Data
Without data, machine learning models wouldn’t be able to do much. They’d be random piles of numbers that generate noise. With data, well-thought-out data, they can learn and replicate a lot. That’s what makes data critical. The Surface Data Collective will experiment with the many ways we could make that data better, and with how we could give data authors more governance and ownership over it. If done right, we can make data that better represents all of the world’s languages and all the ways we need and depend upon those languages. If done right, we wouldn’t see the world’s first trillionaire minted by an A.I. model, but would instead see a lot of people living with less poverty.
While we’ll start with language data, data that just about anyone can help create and produce, one day we’ll need to expand into other types of data. That adventure will happen naturally once the collective grows and builds up an intuitive understanding of what issues and shortcomings exist.
You can’t be a Collective forever
We’re picking the label Collective to make clear that the exact nature and structure of this effort has yet to be formed. One of the major decisions, made with the help of whoever joins and contributes, will be settling on the right lasting structure. Should this be a Cooperative? A Data Union? A Distributed Autonomous Organization? Those are questions that a small set of people can’t and shouldn’t answer. Only a suitably large collective with many viewpoints, many needs, and many intentions can answer them after deep rounds of deliberation.
So to ensure there’s room for change, a vague and broad word like Collective sets the best groundwork.
What’s next?
This effort is just starting. Very little has been accomplished so far. That’s why the first steps will be to narrow the focus to something that can become an early small win: a first demonstration of making data better governed and owned.
One group of first steps will be to discuss a wide range of questions with big implications:
- Learning from Mozilla’s Data Futures Lab and discovering who else to learn from.
- Deciding which type of Data Stewardship makes the most sense.
- Deciding how to govern things.
Another first step will be finding groups already making progress toward aligned goals. Groups like Masakhane and Lanfrica have started cataloging the data that already exists for a wider range of languages and have found ways to include those languages in A.I. models.
And of course, a functioning prototype should probably be built.
If this aligns with your intentions or needs, reach out! Email me (keith@surface-data-collective.com) or follow us on Twitter (@….).