No.
If we were to believe the mainstream predictions made less than 10 years ago, the legal profession should have been revolutionised/transformed by the introduction of IBM Watson. Pick your favourite term, but the point was that if a fancy computer program could win at Jeopardy, then it could also revolutionise/transform the legal profession.
IBM Watson turned out to be a total disaster.
It had zero impact on the legal profession. IBM tried to apply it in the medical field but had zero success. Total waste of time.
It is now predicted that large language models, or “LLMs,” will have a massive impact on the legal profession. This prediction is, however, based on some naïve and uninformed assumptions:
LLMs generate text. Lawyers consume and produce a lot of text.
Therefore:
LLMs will revolutionize/transform the work of lawyers.
This prediction is usually not made by lawyers but by computer scientists, journalists, and bankers: professionals who are not necessarily familiar with what lawyers actually do or how they do it.
Remember how Geoffrey Hinton, the father of deep learning, predicted that machine learning would obviate the need for radiologists? To date, not a single radiologist has been replaced. Professor Hinton excels at backpropagation and Boltzmann machines, not at radiology. I excel at drafting contracts, negotiating indemnities and drinking vodka. I shut up about the rest. Maybe we should stop listening to people from one profession making wild predictions about another. Maybe we should stop making predictions in the first place: we tend to overestimate the effect of a technology in the short run and underestimate its effect in the long run (aka Amara’s Law).
Equating legal work with text generation fails to acknowledge the reasons why lawyers “produce text,” not to mention the expertise required to do so. Lawyers do not generate text. Lawyers use language to achieve specific goals. It may take 10 minutes to write 5 sentences — but those 10 minutes often leverage 10 years of legal practice combined with 4–10 years of legal studies.
Moreover, once we examine the technical literature, the transformative potential of LLMs becomes less obvious.
Below are my observations on the “transformative” potential of LLMs in legal practice. They are pretty high-level and the technical details are simplified. No one will read a blog post of 25,000 words.
Two assumptions:
(0) technology sets the factual limits of what can be done. If, for example, we establish that LLMs cannot understand text, then we need not debate whether LLMs can understand legal language.
(1) while certain aspects of legal work are repetitive, most tasks performed by lawyers require understanding, common sense, and legal knowledge.
What LLMs can do
LLMs owe their name to the large number of parameters in their underlying neural networks and to the vast amounts of data they are trained on. LLMs generate text by predicting the likelihood of a token (a character, word or word fragment) given its preceding or surrounding context. As Wolfram puts it: they produce a “reasonable continuation” of the preceding text by estimating the probabilities of the next word, based on the prior words, including those the model has generated itself.
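If you want to see what “predicting the next word” actually means, here is a deliberately silly sketch in Python. It uses a hand-made toy corpus and simple word counts (a bigram model), whereas real LLMs use giant neural networks over subword tokens, but the principle of producing a “reasonable continuation” from statistics alone is the same.

```python
from collections import Counter, defaultdict

# Toy corpus; a real LLM is trained on terabytes of text.
corpus = (
    "the contract is governed by english law . "
    "the contract is binding on both parties . "
    "the agreement is governed by english law ."
).split()

# Count how often each word follows each preceding word (a bigram model).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def continuation(prompt_word: str, length: int = 6) -> list:
    """Greedily append the statistically most likely next word."""
    words = [prompt_word]
    for _ in range(length):
        candidates = following.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return words

print(continuation("the"))
# A "reasonable continuation" based purely on word statistics,
# with no notion of what a contract or a law actually is.
```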
Language models have been around for a while and many of us have been using them daily. Think: Google Translate. The latter operates in the background, so we don’t really appreciate its inner workings and how amazing it is. After a brief period of fascination, we have just accepted the fact that Google can translate, say, English into Mandarin pretty seamlessly. The public was introduced to a more advanced language model in November 2022, with the brilliantly designed ChatGPT: a (seemingly) simple chat interface that can be prompted to generate coherent and plausible text.
Fun fact 1: ChatGPT relies on a technological breakthrough made by Google in 2017: the introduction of transformers. The latter (in super-simple terms) are deep learning models capable of producing human-like text by, quite literally, “looking around” larger portions of text to evaluate the relationships, or dependencies, between words. Unlike traditional n-gram models, which predict a word only from a short, fixed window of preceding words, transformer-based models apply a so-called attention mechanism, which evaluates the relationships between all words in the input, regardless of their respective positions (a toy sketch of this computation appears below). Transformers are super-amazing but they excel at one task:
Word Prediction.
Let this sink in.
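For the technically curious, here is a bare-bones numerical sketch of the attention computation described in Fun fact 1. The four tokens and their vectors are made up for illustration; real transformers learn these representations and stack dozens of attention layers.

```python
import numpy as np

# Toy example: 4 tokens, each represented by a 3-dimensional vector.
# In a real transformer these representations are learned, not hand-picked.
tokens = ["the", "licence", "is", "revocable"]
X = np.array([[0.1, 0.2, 0.0],
              [0.9, 0.1, 0.3],
              [0.0, 0.3, 0.1],
              [0.8, 0.2, 0.7]])

# In practice X is first projected into separate query, key and value
# matrices with learned weights; here we reuse X to keep the sketch short.
Q, K, V = X, X, X

# Scaled dot-product attention: every token "looks around" at every
# other token and weighs how relevant each one is.
scores = Q @ K.T / np.sqrt(K.shape[1])
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax
output = weights @ V

for token, w in zip(tokens, weights):
    print(token, np.round(w, 2))
# Each row shows how much attention a token pays to every other token,
# regardless of their distance in the sentence.
```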
LLMs excel at generating coherent and plausible text based on their ability to predict the next word in a sequence. This is particularly amazing given that LLMs do NOT know the rules of grammar, syntax and semantics.
But:
Generating text is not the same as understanding and using language.
Generating legalese is not the same as giving legal advice.
We need to understand that LLMs do not understand. We need to understand that LLMs have some inherent limitations that cannot be overcome with more data or more compute…
Understanding
In essence, a language model is “a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.” LLMs understand neither their training data nor the output they generate. They deal in statistics, not semantics. The problem is that meaning cannot be learned from text alone — irrespective of the amount of text LLMs “see” during their training.
Understanding requires grounding: a connection between text and reality, between the words and the world. For LLMs to learn to understand text, we would have to solve the symbol grounding problem. We would have to teach LLMs to establish associations between symbols, such as words, and their real-world referents, such as physical objects. The symbolic representations of objects and concepts are always arbitrary. For example, the word “cheese” is a string of letters, or text, with no inherent meaning. “Cheese” has meaning for us because we associate it with the physical object commonly schmeared on fancy sourdough. This association, this grounding, happens in our heads.
In the case of abstract referents, like “justice” or “obligation,” meaning does not derive from associations with physical objects but from relationships between concepts. Even then, however, such relationships require internal representations that exist independently from text. Without grounding, words are just strings of letters. Without solving the symbol grounding problem, LLMs will not progress beyond shallow and probabilistic pattern recognition.
LLMs cannot learn meaning because they do not have access to the world. Irrespective of the amount of text and the length of training, language models trained on text cannot connect words to the world. They can generate text, not solutions or purpose-driven communications. Knowing the statistical regularities of language differs from knowing how to use language. In fancier terms: formal linguistic competence differs from functional linguistic competence.
Lawyers do not generate text for the sake of filling pages with legalese. (ok, sometimes we do…especially some judges) Lawyers use language as a tool to advise, to defend, to adjudicate and to argue. In contrast, LLMs cannot use language to achieve goals or to solve problems in a situated context. To purposefully use language, LLMs would also require “functional grounding:” the ability to model, predict and control external physical and social processes.
I will spare you a very long paragraph describing the intricate relationships between knowledge, common sense and reasoning, as well as a discussion of the different types of knowledge. The point is simple: using language involves reasoning and world knowledge. We cannot reason about something we do not know, and we cannot make use of knowledge without the ability to reason. Reasoning cannot be learned from text alone, as the rules of reasoning cannot be fully written down or replicated by feeding LLMs formal logic. A system devoid of world knowledge and common sense can hardly be expected to augment legal work, let alone revolutionise the legal profession.
Fun fact 2: building machines capable of reasoning is regarded as one of the most complex challenges in AI. Chollet states: anything that requires reasoning remains beyond the reach of deep-learning models, “no matter how much data you throw at them.”[1]
In sum: without basic world knowledge and common sense, we cannot even dream of any form of legal reasoning…
I know what you are going to say.
“But surely having been trained on so much text, LLMs must know a lot!”
During training, LLMs are given terabytes of text. Unfortunately, LLMs can only learn to extract statistical information about words, not facts about the world. Given the sheer size of the language corpora used in their training, they “acquire” and reflect some world knowledge. Such knowledge is, however, limited to concepts and facts explicitly stated in those corpora. While it may include mathematical (e.g., “two plus one is three”), factual (e.g., “Berlin is the capital of Germany”) or even legal (“contracts are enforceable agreements”) knowledge, it does not include basic information like “if you drop a glass of red wine, the glass will break, the wine will stain your carpet, and your mother-in-law will kick you out of the house.” A lot of knowledge is implicit. It cannot be learned from text.
Fun fact 3. There have been attempts to create a knowledge base with the general rules of common sense and world knowledge. The famous CYC project, which contains rules stated in a domain-independent form to provide a universal “reasoning backbone,” has occupied a team of knowledge engineers for over 35 years! Arguably, LLMs could be integrated with CYC in the future. The practical success of such integration will, however, depend on the completeness and accuracy of CYC.
Fun fact 4: GPT-4 and ChatGPT (I mean the whole GPT-family from OpenAI) are autoregressive decoders. A small but important detail. Autoregressive decoders are unidirectional and can only use information from the left of the predicted token. GPTs are thus inherently limited to predicting the next word and, by definition, cannot understand the broader context of the generated text. Bidirectional autoencoders, such as BERT, predict masked words from a sentence, given context from before and after the missing word. They are more suitable for tasks that require an understanding of the entire input, such as sentence classification and information extraction. So, if LLMs are ever to be really useful to lawyers — I do not think any of the GPTs will be a winner…
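To make the difference concrete, here is a toy sketch with an invented five-token input, showing which context positions each architecture is allowed to use. The masks below are simplifications of what happens inside the models.

```python
import numpy as np

tokens = ["this", "licence", "is", "hereby", "granted"]
n = len(tokens)

# Autoregressive (GPT-style) decoder: token i may only attend to
# positions 0..i, i.e. the text to its left.
causal_mask = np.tril(np.ones((n, n), dtype=int))

# Bidirectional (BERT-style) encoder: every token may attend to the
# whole input, both left and right context.
bidirectional_mask = np.ones((n, n), dtype=int)

print("causal (GPT-like):")
print(causal_mask)
print("bidirectional (BERT-like):")
print(bidirectional_mask)
# Row i, column j == 1 means "token i can use token j as context".
# The zero upper triangle is why a GPT-style model never sees what
# comes after the word it is currently predicting.
```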
The terabytes of text that models have been trained on contain a lot of nonsense, misinformation, and fiction. And, irrespective of the amount of text they were trained on, LLMs can’t tell fact from fiction. They will learn that “Berlin is the capital of Germany” and that “Mordor lies in the southeast of Middle-earth.”
To clarify:
We often read that language models understand or reason. This seems to contradict the broader consensus that LLMs can do neither. The terms “understand” and “reason” are, however, used either colloquially or in a narrow technical sense. Technical papers often equate both terms with the ability to perform specific, narrowly defined tasks such as textual entailment, pronoun disambiguation or document classification. If an LLM can perform any of these tasks above random chance, it is said to “understand” or “reason.” We may also read that LLMs demonstrate “improved logic and reasoning skills” only to realize, upon examining the source, that such skills concern “computation-based word and math problems.”
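To make that narrowness concrete, here is an invented example of a textual entailment item, the kind of task a model must label correctly, above random chance, to be said to “understand”:

```python
# An invented example of the kind of narrowly defined task meant above:
# "textual entailment" asks only whether one sentence follows from another.
example = {
    "premise": "The tenant shall pay rent on the first day of each month.",
    "hypothesis": "The tenant has a payment obligation.",
    "label": "entailment",   # the other options are "contradiction" and "neutral"
}

# A model that labels enough of these pairs correctly (above random chance)
# is said, in the technical literature, to "understand" the text.
print(example["label"])
```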
We have established that LLMs do not understand anything (apart from word prediction!) and that they have no common sense and no world knowledge. They excel at generating text. That’s really it.
What does this mean in practice?
Hallucinations
LLMs often “hallucinate:” generate text that is nonsensical or factually incorrect. The main problem with hallucinations is that they should not be called hallucinations. Hallucinations are commonly associated with delusions or imaginary things and are easy to detect. We can normally tell when the other person talks nonsense or, after ingesting some illegal substance, well, hallucinates. The problem with LLMs is that the “hallucinated text” generally appears plausible, informative, and credible! Cute terminology notwithstanding, hallucinations are falsehoods or outright lies. Even OpenAI acknowledges that GPT-4 tends to “make up facts, to double-down on incorrect information. […] Moreover, it often exhibits these tendencies in ways that are more convincing and believable than earlier models (e.g., due to authoritative tone or to being presented in the context of highly detailed information that is accurate).” You don’t believe me? GPT-4 technical report. Page 19.
It gets worse.
LLMs generate true and false statements with identical confidence. They can produce incorrect answers followed by incorrect explanations but, when given the incorrect explanation alone, they recognize it as incorrect! Such “hallucination snowballing” derives from the fact that LLMs try to stay consistent: they aim to generate a plausible continuation of the text they have generated previously, including any hallucinations contained therein. Once they start producing fiction — they can’t stop! LLMs may acknowledge their mistake but will not always revise their answer. I speak from experience: on countless occasions I have pointed out to ChatGPT, very politely, that a particular answer was wrong. The model apologised, also very politely, acknowledged the correctness of my explanation, and continued to hallucinate…
Before you ask: I only enquired about legal topics that are pretty straightforward and uncontroversial.
Falsehoods/hallucinations are commonly attributed to the large amount of false information in the training data. An LLM trained on Lord of the Rings and The Witcher would confidently generate text about dragons, elves, and monsters! Having ZERO common sense and world knowledge, it assumes such “objects” exist! They are in the text, after all. Chomsky (yes, the Chomsky) observed that language models “are incapable of distinguishing the possible from the impossible. Whereas humans are limited in the kinds of explanations we can rationally conjecture, machine learning systems can learn both that the earth is flat and that the earth is round.” Their learning ability is unconstrained by the laws of physics, common sense or causality.
The second reason for hallucinations is that many LLMs, such as the GPT family, are specifically trained for open-ended text generation. As LLMs prioritise fluency over factuality, hallucinations may be a by-product of language generation. Their aim is to produce text, not to use language for specific purposes that may require a solid grounding in reality. This is beneficial when writing fiction, love letters or poetry, but detrimental in tasks such as question answering or document summarization, which require models to be factual and/or faithful to the source text. Hallucinations can, in some instances, be equated with creativity, and creativity could be useful in some legal contexts, such as when devising novel arguments or litigation strategies. The value of any “creative solutions” generated by an LLM must, however, be balanced against the cost of their verification.
Overreliance
The plausibility of the generated text creates the risk of overreliance: most users have a limited (if any!) understanding of what LLMs are trained to do and tend to trust the generated content. If it looks good, it must be good. The problem is not that machine-generated text is indistinguishable from human-generated text, but that its increased fluency creates an appearance of adequacy. Brooks (the famous roboticist, not the actor!) stated that performance in language generation is generally mistaken for competence.
The problem: factually incorrect output can still be — or appear to be — relevant, plausible and informative. As falsehoods become harder to detect, users are less likely to challenge or verify the generated responses. This can even happen in domains where users have expertise but simply fail to notice one falsehood “hidden” within a longer sequence of correct statements. The potential for disaster increases where users lack domain expertise. When faced with long paragraphs of coherent legalese, the average user will assume that the text contains correct legal information.
An aside: the idea of using language models to democratize access to the law is simply dangerous.
Of course, the risk of overreliance can be reduced with careful reviews and verifications of the output generated by the LLM. I think, however, that it is unrealistic to expect the average user to verify each generated sentence. Why would you verify something that looks correct? The superficial plausibility of the text discourages verification. Paradoxically, the need to carefully review and verify seems to defeat the very purpose of deploying LLMs: saving time and optimizing human performance. After all, verification and review require competence and expertise…
It may take one minute to generate a “wonderful” legal opinion. But then we need to spend six hours to establish its accuracy… which brings me to the next point….
Explainability
You might say it is irrelevant whether LLMs actually understand anything as long as the generated text is correct. The problem is that ALL generated text must be verified, and such verification requires an explanation of how the output was arrived at. We cannot trust the output, even if it appears correct, if we cannot examine the underlying reasoning! We may agree with the output, but we may be wrong. We may also disagree with the output, but if the underlying reasoning is sound we may have to accept it and revise our own assumptions. Much will depend on the level of our expertise in a given topic. Explainability is particularly important when evaluating responses to open-ended questions, which may not have a single correct answer. Verification requires explainability, unless we are willing to spend considerable resources to research and evaluate the multiple reasoning paths that could have produced a given response. Interestingly, GPT-4 can generate plausible and convincing explanations even where the generated text is 100% wrong.
Another point: an important aspect of explainability is the indication of the sources relied on. While search engines, by definition, provide detailed references in their search results, LLMs often generate a single answer without referring to any sources. Such references can be generated on request but, again, without any assurance as to their correctness. Quite the opposite: ChatGPT is known to provide non-existent sources to “support” its responses.
Hilarious.
Again, I know what you are going to say.
“But surely LLMs must be awesome because they have demonstrated unbelievable performance on some benchmarks!”
Yes, but…. Benchmarks measure performance on artificially created and narrowly defined tasks, which bear little resemblance to their real-world equivalents. Stellar benchmark performance does not mean that a model can perform in real-world scenarios or that it developed the ability to reason or to understand. Unfortunately, LLM research often focuses on achieving success on benchmarks, not on developing LLMs that actually understand language or reason. Moreover, excellent performance on benchmarks is often attributable to “technical shortcuts” as LLMs rely on heuristics to exploit certain biases in datasets without developing any understanding of the text. The LLM’s amazing performance falls back to chance when given adversarial examples, that is, inputs designed to fool the model. In sum: I find benchmarks very useful when comparing models in general but not when evaluating whether they can assist in legal work.
To examine whether superior performance on a specific benchmark translates into actual usability, whether it is actually relevant for legal tasks, we must investigate the benchmark itself: do the tasks comprised therein adequately represent the skill the benchmark is supposed to measure? Note that benchmarks are composed of dozens, sometimes hundreds, of component tasks, each of which captures a different aspect of the skill in question. At present, we have no benchmark that contains a sufficiently diverse set of tasks to represent general understanding or reasoning. Arguably, such a benchmark would include hundreds if not thousands of discrete tasks. Arguably, such a list of tasks may be impossible to create in the first place. The benchmarks in the legal domain seem to confirm this very point.
Of particular interest is LegalBench, which examines whether LLMs can be guided to execute tasks involving legal reasoning. LegalBench consists of an online repository where domain experts can submit tasks to evaluate different forms of legal reasoning. While this domain-specific effort seems commendable, it also confirms the complexity of legal reasoning. Most of the tasks included in the benchmark are extremely narrow, often confined to a yes/no question about the contents of a single clause in a specific type of contract.
As an academic, I love LegalBench. As a (former) transactional lawyer, not so much. I am not sure whether it is worthwhile to measure the performance of a model on dozens of super-narrow and very domain-specific legal tasks instead of focusing on the fundamental tasks of causal or deductive reasoning. Why, for example, investigate whether LLMs can identify whether a clause describes how user information is protected if it is clear that none of the LLMs can reason by analogy? If LLMs struggle with the basic building blocks of reasoning, such as causality and common sense, testing more specific forms of legal reasoning seems questionable.
But there is an elephant in the room.
GPT-4 passed the Bar Exam!
Unlike benchmarks, the Bar Exam is a test designed for humans. It does not test performance on a single task but the general ability to solve legal problems and apply the law. Unsurprisingly, the fact that GPT-4 passed the Bar Exam has caused quite some panic in the legal profession. This result must, however, be approached with great caution. To date, GPT-4 has passed the exam once, in sanitized lab conditions, and its achievement has not been replicated. This may seem like a small detail, but replicability is extremely important both from an academic and from a practical perspective, given that GPT-4 was created by a commercial entity. (Thank you, Microsoft, for the 10,000,000,000 USD investment!) Academic endeavors must be distinguished from promotional activities. To repeat: GPT-4 has passed the bar exam once. To evaluate the practical significance of this achievement, we would require insights into how the model was trained and how it was used at the time of inference. Citing competitive pressure, OpenAI provided no “details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” Without such details, we cannot evaluate or trust the result. In particular, we must have assurances that the training data were not contaminated with data from the test set, that is, that the model has not seen the test data during training.
Nonetheless, I do not want to downplay GPT-4’s performance because it is unquestionably impressive. Some components of the Bar Exam required, for example, the drafting of memos in the context of complex legal scenarios, including domestic relations as well as criminal law and legal ethics. There is another interesting detail: GPT-4 excelled at the multiple-choice component of the exam, showing the largest improvement in the areas of contract law and evidence. Answering questions about contract law, however, differs from drafting or interpreting contracts. Apart from legal knowledge, the latter require an understanding of context and common sense. Knowing the rules is not synonymous with applying the rules against the background of a specific transaction. In practice, the drafting of contracts as well as the collection and evaluation of evidence require real-world knowledge. It would be interesting to observe GPT-4’s performance on tasks involving the interpretation of statutes and contracts…
Again, you will say:
“But, of course, we can solve all those problems with more…”
Many of you anticipate that, “with more training data and more parameters,” LLMs will ‘scale up’ to acquire understanding, common sense, and other essential abilities. After all, LLMs have already displayed emergent abilities that they were not explicitly trained for. Think of few-shot (in-context) learning and chain-of-thought reasoning (illustrated below): amazing capabilities that the LLMs were not specifically trained for. We must not assume, however, that more training data or more powerful processors will remedy the shortcomings of LLMs. It is becoming clear that understanding, common sense, and reasoning will not emerge from more powerful LLMs. If something has been trained to generate text, it will not suddenly start to reason. Scaling up cannot solve the grounding problem or equip LLMs with the world model required for common sense and causal reasoning.
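For readers unfamiliar with those terms, here is an invented illustration of what such prompts look like; the clauses, labels and dates are all made up:

```python
# An invented illustration of the two prompting styles mentioned above.

few_shot_prompt = """Classify the clause as 'indemnity' or 'limitation of liability'.

Clause: The Supplier shall hold the Customer harmless against third-party claims.
Answer: indemnity

Clause: In no event shall either party be liable for indirect damages.
Answer: limitation of liability

Clause: The Licensor shall reimburse the Licensee for losses caused by IP infringement claims.
Answer:"""

chain_of_thought_prompt = (
    "A contract was signed on 1 March and terminates after 90 days. "
    "What is the termination date? Let's think step by step."
)

print(few_shot_prompt)
print(chain_of_thought_prompt)
```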
Fun fact 5: many tasks requiring causal reasoning and general world knowledge exhibit flat scaling curves, that is, when we add compute and/or training data, the performance of the scaled model does not improve.[4] It has also been established that the quality of the generated output often decreases with increased model size,[5] that larger models are often less truthful than their smaller predecessors[6] and that the performance of transformer-based LLMs rapidly decays with increased task complexity.[7] Making models larger makes them more fluent but not more trustworthy or reliable.[8] Consequently, when we make a model bigger, “bad output” may not become less frequent but more difficult to detect.
At your own risk…
The use of LLMs in legal practice requires a cost-benefit assessment in terms of time and quality. When evaluating the speed with which an LLM generates its output, we must consider the time and human resources required for its verification. The risk of hallucinations must not be underestimated. What would be an acceptable “hallucination rate” for your law firm? How would it translate into cost savings?
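Here is a purely hypothetical back-of-the-envelope calculation. Every number below is made up; plug in your own figures before drawing any conclusions.

```python
# Hypothetical numbers -- adjust to your own practice.
minutes_to_draft_manually = 90        # lawyer drafts the document from scratch
minutes_to_generate = 1               # LLM produces a draft
minutes_to_verify_per_page = 20       # careful review of the generated text
pages = 6
hallucination_rate = 0.05             # share of generated statements that are wrong
statements_per_page = 15
minutes_to_fix_one_falsehood = 10     # research + correction, if it is caught at all

verification = minutes_to_verify_per_page * pages
corrections = hallucination_rate * statements_per_page * pages * minutes_to_fix_one_falsehood
total_with_llm = minutes_to_generate + verification + corrections

print(f"manual drafting: {minutes_to_draft_manually} min")
print(f"LLM + verification + corrections: {total_with_llm:.0f} min")
# With these (made-up) figures the LLM workflow costs 166 minutes --
# and that is before pricing in the falsehoods nobody caught.
```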
Given the risk of overreliance, the question is not whether LLMs can perform a particular task but whether they should be used to perform such task.
What is the benefit of text generation that is unconstrained by common sense and world knowledge?
In the next post, I will focus on the technical difficulties of adapting LLMs to legal tasks, pre-training, fine-tuning (including RLHF), prompt-engineering, RAG etc…
Possible fun fact: did you know that GPT-4 is (most likely) not a single model but a hybrid? The implications of this might be pretty significant — especially when it comes to its stellar performance…
[1] F. Chollet, Deep Learning with Python (Manning 2018) 325.
[4] M. Suzgun et al., “Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them” (2022) arXiv:2210.09261.
[5] A. V. Miceli-Barone et al., “The Larger They Are, the Harder They Fail: Language Models Do Not Recognize Identifier Swaps in Python” (2023).
[6] S. Lin et al., “TruthfulQA: Measuring How Models Mimic Human Falsehoods” (2022).
[7] N. Dziri et al., “Faith and Fate: Limits of Transformers on Compositionality” (2023).
[8] R. Thoppilan et al., “LaMDA: Language Models for Dialog Applications” (2022).