Is GenAI’s Impact on Productivity Overblown?

Originally published on HBR.org / January 08, 2024 / Reprint H07YT4
By Ben Waber and Nathanael J. Fast

What/focus

This article questions the hype around generative AI, in the form of large language models (LLMs), as a boon for collective productivity. Not all of the research is positive, and much of the positive assessment of productivity is focused at the task level, in particular on how quickly tasks are performed. While changes in task completion speed are easy to measure, changes in accuracy are harder to detect. The authors caution against wholesale company-wide adoption of the technology, instead suggesting cautious experimentation.

How (methods/details)

Two core problems with medium- to long-term business implications are identified: 1) the persistent tendency of LLMs to produce information that is false, and 2) the likely long-term negative effects of using LLMs on employees and internal processes.

The term plausible fabrication is used to explain how the answers provided by LLMs may sound authoritative but are not necessarily accurate. Extremely large LLMs such as ChatGPT are big enough, and trained on enough text, to learn the statistical properties of syntax. They then use next-word prediction to provide answers that are statistically likely to occur in public text. However, the models have no concept of truth or fact, and it is difficult to restrict them to factual retrieval.
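For illustration only (not taken from the article, and a drastic simplification of how real LLMs work), the toy Python sketch below builds a bigram model from a tiny made-up corpus and generates the statistically most likely continuation of a prompt. The corpus and function names are hypothetical; the point is that the output reads fluently because it mirrors word statistics, not because anything checks it against facts.

    from collections import defaultdict, Counter

    # Tiny made-up corpus standing in for "public text" (hypothetical).
    corpus = (
        "the product ships in march the product ships worldwide "
        "the product is discontinued the launch is in march"
    ).split()

    # Count which word tends to follow which: the crudest next-word predictor.
    following = defaultdict(Counter)
    for current_word, next_word in zip(corpus, corpus[1:]):
        following[current_word][next_word] += 1

    def generate(start, length=5):
        """Emit the most statistically likely continuation; truth never enters into it."""
        words = [start]
        for _ in range(length):
            candidates = following.get(words[-1])
            if not candidates:
                break
            words.append(candidates.most_common(1)[0][0])
        return " ".join(words)

    print(generate("the"))  # fluent, authoritative-sounding, accurate only by accident

Scaled up by many orders of magnitude, with neural networks in place of counts, this is essentially why LLM answers can sound plausible even when they are fabricated.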

Related to this, LLMs can get stuck in the past because the language they are trained on was produced in the past. Products change, however, meaning that a call centre's model, for example, needs to be retrained on new chat logs. Other changes can be more complex, requiring companies to implement extensive new processes to keep the LLM current, which calls for constant vigilance. At the same time, the company may be losing the people who could contribute to retraining these tools. For example, a study reported a call centre losing top performers after the introduction of an LLM that improved chat completion times. Automating such work can take away motivation and innovation.

Another issue is model collapse, as the data LLMs are trained on degenerates over time. The original models were trained on an internet's worth of human-generated text. Now models are being trained on their own output, leading to a loss of value over successive training cycles.
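Again purely as an illustration (a toy statistical sketch, not the research behind the article): if each new "model" is trained only on text generated by the previous one, rare words that happen to be missed in one generation can never reappear, so the variety of the data erodes over successive cycles. The vocabulary and numbers below are hypothetical.

    import random
    from collections import Counter

    random.seed(42)

    # Generation 0: "human" data with a long tail of rare words (all hypothetical).
    vocabulary = [f"word{i}" for i in range(200)]
    weights = [1.0 / (i + 1) for i in range(200)]   # a few common words, many rare ones
    data = random.choices(vocabulary, weights=weights, k=2000)

    for generation in range(1, 6):
        # "Train" a model on the current corpus: its empirical word frequencies.
        counts = Counter(data)
        words, freqs = zip(*counts.items())
        # The next corpus is generated entirely by that model. Rare words missing
        # from the sample get probability zero and are gone for good.
        data = random.choices(words, weights=freqs, k=2000)
        print(f"generation {generation}: distinct words = {len(set(data))}")

Run repeatedly, the count of distinct words can only stay level or fall, which is the qualitative point about training on model output.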

Perpetuating prejudice is another longer-term issue, with LLMs recognised as reinforcing and amplifying biases, undermining the established benefits of a diverse and inclusive workforce and alienating marginalised employees.

So what?

The technology is useful for certain classes of work, but users and developers must be clear about when LLMs can be used effectively and confidently. There are large classes of work where their unreliability makes them risky, for example summarising and synthesising evidence as required in the legal profession.

Unthinking organisation-wide application of these models could lead to significant losses in productivity in the wider sense, due to effects on employees and internal processes. LLMs also require constant monitoring and updating, which is a significant cost. Assessing LLMs therefore requires a long-term perspective, as their pros and cons cannot be properly quantified at the task level or in the short term.