
Work Data Is the Next Frontier for GenAI


Work data, the work output of knowledge workers, is the single most valuable data source for LLM training, uniquely capable of propelling LLM performance to unprecedented heights. In this article, I will present nine supporting arguments for this claim. Then I will reflect on the current conflict of interest between the owners of work data and the AI companies that want to train on it, and discuss potential resolutions and a win-win scenario.

While publicly accessible training data is predicted to run out, there is still an abundance of untapped private data. Within private data, the biggest and best opportunity is—I think—work data: work outputs of knowledge workers, from the code of devs, through the conversations of support agents, to the pitch decks of salespeople.

Many of these insights draw from Dara B Roy’s Sobering Talking Points for Knowledge Workers on Generative AI which extensively discusses the use of work data in the context of LLM training as well as its effects on the labor market of knowledge workers.

So, why is work data so valuable for LLM training? For nine reasons.

Work data is the best quality data humanity has ever produced

Work data is obviously much better quality than our public internet content.

In fact, if we look at the public internet content used in pretraining: the best quality sources (the ones you would upsample during training) are the ones that are the work outputs of someone: articles of the New York Times, books of professional authors.

Why is work data so much better quality than non-work internet content?

  • More factual and trustworthy. What we say and produce at work is both more factual and trustworthy. After all, as employees, we are accountable for it and our livelihood depends on it.
  • Produced by vetted professionals: public internet content is produced by self-proclaimed experts. Work data, however, is produced by professionals who have been carefully selected from a vast talent pool through multiple rounds of job interviews, tests, and background checks. Imagine if the same were true for internet content: you could only post on Reddit if a board of professionals first evaluated your credentials and skills.
  • Reflects vetted knowledge: workers’ output reflects battle-tested ideas and industry best practices that proved their worth under real-life business conditions. Compare this to internet content, which typically only aims to grab the attention of the reader, featuring clever-sounding but ultimately untested ideas.
  • Reflects human preferences more closely: The way we express ourselves in our work products is more eloquent, more thoughtful, and more tactful. We make an extra effort to follow the norms (aka human preferences) of our culture. If pretraining were done solely on work data, we might not need RLHF and alignment training at all, because all of that already permeates the training data.
  • Reflects more complex patterns, and reveals deeper connections: Public internet content often only scratches the surface of any topic. After all, it’s for the public. Professional matters are discussed in much more depth within companies, revealing much deeper connections between concepts. It’s better quality of thought, better reasoning, and more thorough consideration of facts and possibilities. If current foundation models grew as good as they are on crappy public internet data, imagine what they would be able to learn from work data, which contains several more layers of complexity, nuance, meaning, and patterns.

What’s more, work data is often labeled by quality. In some cases, there is data on whether the work was produced by a junior or a senior. In some cases, the work is labeled by performance metrics, so it’s clear which sample is worth more for training purposes. E.g. you may have data on which marketing content resulted in more conversions; you may have data on which support agent response produced higher customer satisfaction ratings.
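To make this concrete, here is a minimal sketch of how such quality labels could be turned into sampling weights during training, so higher-quality samples are seen more often. All sample texts, field names, and weighting numbers are invented for illustration:

```python
import random

# Hypothetical work-data samples with quality labels (all values invented).
samples = [
    {"text": "junior draft of a support reply", "seniority": "junior", "csat": 3.1},
    {"text": "senior agent's support reply",    "seniority": "senior", "csat": 4.8},
    {"text": "high-converting marketing copy",  "seniority": "senior", "csat": 4.5},
]

def sampling_weight(sample):
    """Turn quality labels into an upsampling weight: senior work and
    high-satisfaction outputs are drawn more often during training."""
    base = 2.0 if sample["seniority"] == "senior" else 1.0
    return base * (sample["csat"] / 5.0)

weights = [sampling_weight(s) for s in samples]

# Draw a small training batch with probability proportional to quality.
batch = random.choices(samples, weights=weights, k=2)
```

The exact weighting scheme is a design choice; the point is that explicit quality labels give you a principled knob that public internet data rarely offers.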

Overall, I think work data is probably the best quality data humanity has ever produced, because the incentives are aligned. Workers are literally rewarded for their work outputs’ performance.

To put it differently:

On the open internet, good quality content is the exception. In the world of work, good quality content is the rule.

There are legendary stories of YOLO runs: big models trained on astronomical budgets while you hope the training samples are good enough not to lead your model astray and blow your budget. Perhaps training on work data would end the age of YOLO runs, making AI training much more predictable and financially feasible for less capitalized companies too.

Work data manifests the most valuable human knowledge

LLMs can extract valuable skills from reading the New York Times or practicing math test batteries. Writing like a NYT columnist is a nice skill to have; acing an AP Calculus exam is a great achievement.

But the real business value lies in the skills that real businesses are willing to pay for. Obviously, those skills are best extracted from the data that contains them: work outputs.

Work data is readily available for AI training

If you are working for a SaaS that helps a certain group of knowledge workers perform their tasks, naturally, their work outputs live in your cloud storage.

Technically, that data is readily available for AI training. Whether you have a legal basis to use it for that purpose is another question.

Work data is orders of magnitude bigger than public internet content

Intuitively, if you think about your public internet footprint (e.g. how much you post or publish online) it is dwarfed by the amount that you produce for work. I, for one, probably churn out 100x more words for work than for my public internet presence.

Work data is huge. A caveat is that any SaaS only has access to its slice of work data. That may be more than enough for fine-tuning, but may not be enough for pretraining general purpose models.

Naturally, incumbents have an advantage: the more users you have, the more data you have at your disposal.

Some companies are especially well positioned to take advantage of work data: Microsoft, Google, and some of the other generic work software providers (mail, docs, sheets, messages, etc.) have access to tremendous amounts of work data.

Work data manifests unique insights

Businesses are like trees in a forest: each one is trying to find a sunny niche in the dense canopy, a place it can uniquely fill, so the data each business produces is unique. Businesses call this “differentiation.” From a data standpoint, it means a business’s data contains insights that only ever accrued to that particular business.

This is one of the reasons why businesses are so protective of their data: it reflects their trade secrets and the insights that set them apart from their competition. If they gave it up, their competitors could quickly take their place.

Work data has hidden gems

From time to time, human workers have an epiphany and recognize a pattern that has been in front of them all along.

If AI had access to the same data, it could recognize patterns that no human has ever recognized so far.

This, again, is an important difference from public internet content. On the internet, there are only insights that humans have already recognized and taken the effort to publish. Work data contains insights that no one has discovered yet.

Work data is clean(er) and structured

How much structure it has depends on the field, but it definitely has more structure than internet content.

At the bare minimum, work products are organized in neat folders and appropriately named files. After all, work is a collaborative effort, so workers make an effort to grease this collaboration for their peers.

Some work data is even better structured and cleaned: it is generated through rigorous processes and goes through many rounds of approvals until it is put into a standard format. Think of database architectures that go from rough sketches to Terraform configuration files.

And if that isn’t enough, your company sets the rules. If you want, you can nudge or even force your users to follow certain conventions. You have all the tools to do so: you can constrain their inputs, guide their workflow, and incentivize them to give you extra data points just to make your data cleaning easier.

Work data is—in many cases—explicitly labeled

In many cases, work data comes in input-output pairs. Challenge-solution.

E.g.

  • Translation: Original text -> translated text
  • Customer support: customer query -> resolution by the support agent.
  • Sales: data on a prospective customer -> winning sales pitch and final deal details.
  • Software engineering: backlog item + existing code -> new code in the repository.
  • Interface design: jobs-to-be-done + persona + design system -> new design.

If work is created with LLM assistance, you even have the prompt, the LLM’s answer, and the human-corrected final version. Could an LLM wish for a better personal trainer than hundreds of thousands of human professionals who are experts in the given field?
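As a sketch of what such pairs might look like as a fine-tuning dataset, here is a minimal example serializing hypothetical input-output records to JSONL, a common on-disk format for fine-tuning data. All tasks, field names, and contents are invented for illustration:

```python
import json

# Hypothetical work outputs expressed as supervised fine-tuning pairs.
# Every task name, field name, and text below is invented for illustration.
pairs = [
    {"task": "translation",
     "input": "Original text in the source language ...",
     "output": "Translated text ..."},
    {"task": "customer_support",
     "input": "Customer query: my invoice shows the wrong amount.",
     "output": "Agent resolution: corrected the invoice and issued a refund."},
    {"task": "software_engineering",
     "input": "Backlog item plus the relevant existing code ...",
     "output": "New code committed to the repository ..."},
]

# One JSON object per line: the JSONL layout most fine-tuning
# pipelines expect as their dataset input.
jsonl = "\n".join(json.dumps(p) for p in pairs)
```

The schema here is deliberately simplified; real pipelines add metadata such as the quality and performance labels discussed earlier.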

Work data is grounded data

Work outputs are often labeled by business metrics and KPIs. There is a way to tell which customer support resolutions tend to produce the highest customer lifetime value. There is a way to tell which sales offers produce the highest conversions or the shortest lead times. There is a way to tell if a piece of code led to incidents or performance issues.

KPIs and metrics are a business’s sensors to the outside world, providing a feedback loop that evaluates the performance of its work outputs. This is better than human ratings: it’s not “soft data,” like a human trying to guess how other people will like a marketing message. It’s “hard data” that directly reflects how well that marketing copy converts.

Work data is more valuable for AI than workers think

Despite all the above benefits, in my experience, knowledge workers grossly underestimate the value of their work. These misconceptions include:

  • If it’s not original, it’s not valuable: they don’t know that machine learning prefers repetition with slight variations, because that’s how it extracts underlying patterns: the unchanged features beneath the surface noise.
  • If it’s easy work, it’s not valuable: people have a hard time grasping that just because a skill comes easily to them doesn’t mean it comes easily to AI. These skills feel natural to us only because they became second nature through millions of years of evolutionary history, or through decades of upbringing and education.
  • If it’s not peak performance, it’s not valuable: employees only get praise and bonuses when they go above and beyond. That leads them to think that only their peak performance matters. They seem to forget that mundane acts, such as simply responding to a colleague’s message, are just as essential to running the business and making a profit – a very valuable skill for AI to learn.

Ethical considerations

Unfortunately, using work data for AI training comes with strings attached.

  • That data is the paid work of someone: using those works to profit a third party probably qualifies as unpaid work or labor exploitation.
  • Not fair use: one of the defining factors of “fair use” is that the resulting work shouldn’t compete with the original work in the market. I am not a legal expert, but a Service as a Software offering the same service in the same market in which its data contributors operate is a clear case of a competing offer. Not fair use.
  • Producing this data costs its owners real money: a company paid everyone on its payroll to have this data produced, and knowledge workers put in years of study, student loans, and lots of effort. Even if we put aside the fear of AI making workers redundant and focus solely on capitalist self-interest, it’s unlikely that workers would want to give up this valuable asset for free, only for the benefit of some private shareholders in Silicon Valley.
  • This data reveals trade secrets and proprietary insights of a business. What business would like to train an AI on its processes only to hand it over to its competitors? What business would like to level the playing field for its challengers?!
  • This data is someone’s intellectual property. Usually, it is the company’s intellectual property. And companies have armies of lawyers to protect their interests.

Next up: your opportunity here and now

If you are a software engineer or a data professional, you have a unique opportunity to change the course of AI and humanity for the better.

As a representative of your company, as someone who understands the role of data in the company’s AI efforts, and as someone who is striving to build the best and greatest, you can push for the acquisition of the right kind of data: work data.

On the other hand, while you are working to automate your users’ tasks, there are people out there working to automate your tasks as a knowledge worker. They want to take your effort and hard-earned skills for granted so they can further grow the wealth of their investors.

All in all, you are sitting on both sides of the negotiation table. But that is not all: given your knowledge and insights, you just might be the person who holds the keys to a win-win resolution in this conflict of interest.

Is there a business model in which AI models get the data they need and knowledge workers get a fair share for their valuable contribution, rather than being squeezed and then dumped?

Pondering about a win-win scenario

Currently, we see a lot of fighting between AI companies and data owners. AI companies claim they can’t operate and innovate without training data. Data owners argue AI ruins their businesses and takes their jobs. There are legal issues around the rights of using data for AI training and there are communities rallying people to opt out of AI training entirely. It’s a real battleground and that is not good for anyone. We should know better!

What would the ideal scenario look like? From the perspective of an AI company, we should imagine a world in which data owners are happy to contribute their data to AI models. Moreover, they go above and beyond to satisfy the data needs of AI training: providing extra data points, perhaps labeling and cleaning their data, and making sure it’s really good quality.

What would enable this scenario? It seems obvious: if the success of the AI company were the success of the data owners, they would be happy to contribute. In other words, the data owners must have a stake in the AI model; they must own a part of the model and participate in the profits it makes.

To incentivize quality contributions, the data owners’ stake should be proportional to the value of their contributions.

Essentially, we would be treating data as capital, and data contribution as capital investment. That’s what training data is, after all: a human-made asset that is used in the production of goods and services.
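As a toy illustration of data as capital, here is a minimal sketch in which contributors receive a stake proportional to the assessed value of their data, and dividends are split by stake once the model turns a profit. All names and numbers are invented:

```python
# Toy model of "data as capital": stakes proportional to the assessed
# value of each contribution, dividends split by stake once the model
# turns a profit. All names and numbers below are invented.
contributions = {"alice": 40_000, "bob": 10_000, "acme_corp": 50_000}

total = sum(contributions.values())
stakes = {who: value / total for who, value in contributions.items()}

def dividends(profit):
    """Split one period's model profit according to data stakes."""
    return {who: profit * stake for who, stake in stakes.items()}

payout = dividends(1_000_000)  # e.g. a $1M profit period
```

How the value of a contribution is assessed is the hard, open part of the scheme; the arithmetic of stakes and payouts is the easy part.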

Interestingly, this model of treating data contribution as capital investment also addresses the biggest fear of knowledge workers: losing their livelihood to AI. White-collar workers live off the returns of their human capital. If a model extracts their human capital (knowledge and skills) from their work, that human capital loses its market value, as AI will perform the same skills and tasks faster and cheaper. If, however, knowledge workers get equity in exchange for their data contribution, they effectively exchange their human capital for equity capital, which keeps producing returns for them, and thus a livelihood.

This is an opportunity for a positive reinforcement loop. As a knowledge worker, your work contributes to better AI models, which increases AI company revenues, which increases your rewards, so you are even more incentivized to contribute. Simultaneously, improving the AI model inside your work software directly improves the quantity and quality of your work outputs, further improving your contribution and thus the AI model. It’s a double reinforcement loop with the potential to become a runaway process leading to winner-take-all dynamics.

Treating data as capital not only unlocks more and better training data, it also enables rapid and cheap experimentation. Say you want to try an innovative new product with an AI model at its core. If you treat training data as an investment, you don’t need to pay for that data upfront. You only pay dividends once your product starts making a profit, and only in proportion to that profit. If your idea fails, no problem: no one got hurt or lost money. Innovation is cheap and risk-free.

Trade secrets vs AI training

Now let’s turn to the conflict of interest between AI companies and Employers: companies whose knowledge workers produce the training data.

Employers don’t seem to have a problem with turning over their employees’ work to AI companies if they can get an AI service in exchange that does the same job as humans but better and cheaper.

The real conflict of interest stems from the fact that the AI model would distribute the Employer’s trade secrets and know-how to its competitors. If the AI company enables any other company, from fresh upstarts to large competitors, to execute the same strategies and processes at the same quality, speed, and scale as the incumbent, it eliminates much of the incumbent’s competitive advantage.

In every company, there are know-how and processes that “don’t make their beer taste better”; they are just common processes. I bet companies would love to contribute the data about these processes (with the consent and participation of their knowledge workers) to an AI model in exchange for an ownership stake. It’s a mutually beneficial exchange. As for the know-how and processes that differentiate the Employer from its competitors, its competitive advantages, the only option is custom model training or white-label AI development, in which the AI company helps create and operate the AI model, but the model is exclusively used and fully owned by the Employer and its knowledge workers.

I hope this article sparked your interest in positive AI training data scenarios. Maybe you will contribute the next piece to this puzzle.

Thank you for reading,

Zsombor

Other articles from me:

GenAI is wealth transfer from workers to capital owners. AI models are tools to turn human capital (knowledge and skills) into traditional capital: an object (the model) that a corporation can own.

SAP is not volunteering my data to Figma AI and I am proud of SAP for that. Should UX Designers contribute their designs to Figma to help them build better AI features? Who would this benefit? Figma investors? Designers? Designers’ employers?

The lump of labor fallacy does not save human work from genAI. The fallacy only suggests that there will always be more work. It doesn’t suggest that humans would do the work — a significant detail.

The 80/20 problem of generative AI – a UX research insight. When an LLM solves a task 80% correctly, that often only amounts to 20% of the user value.
