The Limits of Modern AI: A Story

A Brief History of Artificial Intelligence

The dream of thinking machines goes back centuries, at least to Gottfried Wilhelm Leibniz in the 17th century. Leibniz helped invent mechanical calculators, developed the integral calculus independently of Isaac Newton, and had a lifelong fascination with reducing thinking to calculation. His Mathesis Universalis was a vision of a universal science made possible by a mathematical language more precise than natural languages like English. (The Mathesis was never finished, but as a posthumous consolation, it helped usher in modern symbolic logic in later work by George Boole and others.)

Gottfried Wilhelm Leibniz

In the 18th century, the Enlightenment philosopher and proto-psychologist Étienne Bonnot de Condillac imagined a statue outwardly appearing like a man and endowed with what he called “the inward organization.” In an example of supreme armchair speculation, Condillac imagined pouring facts – bits of knowledge – into its head, wondering when intelligence would emerge. Condillac’s musings drew inspiration from the early mechanical philosophy of Thomas Hobbes, who had famously declared that thinking was nothing but ratiocination – calculation. Yet precise ideas of computation, along with the technology to realize it, weren’t yet available.

In the 19th century, Charles Babbage took the first real steps toward Artificial Intelligence as technology, as an engineering project aimed at building a thinking machine. Babbage was a world-famous scientist, a recognized genius, and a polymath. His blueprint for his Analytical Engine was the first design for a general-purpose computer, incorporating components that are now part of modern computers: an arithmetic logic unit, program control flow in the form of loops and branches, and integrated memory. The design extended work on his earlier Difference Engine – a device for automating classes of mathematical calculations – but though the completion of decades of work was at hand, the British Association for the Advancement of Science refused additional funding for the project. The project languished, and Babbage himself is now a mostly forgotten chapter in the history of computing and Artificial Intelligence.

The idea of a general calculating machine was finally realized in the 20th century, with the work of British mathematician and code-breaker Alan Turing. Turing, more than anyone else, also launched what we now call AI. Though the term “Artificial Intelligence” wouldn’t be officially coined for another five years, AI as an engineering discipline emerged in Turing’s seminal paper “Computing Machinery and Intelligence.” The year was 1950.

Turing left no doubt about his purpose. The paper begins: “I propose to consider the question ‘Can machines think?'” He then proposed the Imitation Game, or what’s now referred to as the Turing Test: pose questions to a machine and a human hidden in separate rooms, and if we can’t tell the difference in their responses, then the machine can be said to “think,” as we do. Early researchers and scientists accepted Turing’s proposal as a real-world challenge, and the quest to engineer a thinking machine began.

John McCarthy coined the phrase “Artificial Intelligence” in 1955, and by 1956 Artificial Intelligence, or AI, was officially launched as an independent research field at the now-famous Dartmouth Conference, a meeting of minds that included notable scientists such as Allen Newell and Herbert Simon (Carnegie Mellon), Marvin Minsky (MIT), and John McCarthy (MIT, and later Stanford).

Herbert Simon
Herbert Simon

Right from the get-go, the field was hyped. Conference organizers declared that “if a carefully selected group of scientists work on it together for a summer,” AI would be on its way to answering Turing’s question. This never happened, of course, but it didn’t stop the Dartmouth scientists from proclaiming that victory was close at hand – a now-familiar drumbeat from AI enthusiasts that continues into the present day.

Herbert Simon declared in 1957 that AI had arrived, with machines that as he put it “can think.” And MIT computer scientist Marvin Minsky too, by the 1960s, thought that the problems would be ironed out “within a generation.” All of this was of course wildly off the mark. (McCarthy later remarked, simply, that “AI was harder than we thought.”)

But why, exactly, was AI so hard? Why was it necessary to backpedal? According to long-time AI critic and then-MIT (later Berkeley) philosopher Hubert Dreyfus, the research program launched at Dartmouth went through three distinct, roughly ten-year phases beginning in the 1950s. Each phase highlighted a different research agenda, and each fizzled out after a decade of work, following a pattern of early success on easy, controlled problems followed by failure when the approaches were scaled up to more realistic scenarios.

Phase One was known as Cognitive Simulation and began with Newell and Simon’s work on mimicking human intelligence on concrete tasks like game-playing and proving mathematical theorems. At the Rand Corporation in 1957, Simon and Newell, together with J.C. Shaw, implemented the Logic Theorist, which used simple means-ends analysis to prove theorems in math. The program worked, proving 38 of 52 theorems from Russell and Whitehead’s famous treatise on mathematical logic, Principia Mathematica (1910). But it didn’t scale to prove interesting theorems in areas outside of the Principia. This latter task was left to Simon, Shaw, and Newell’s follow-up program, the General Problem Solver (GPS). The GPS was released in 1959 and solved more complicated problems like the well-known “missionaries and cannibals” puzzle, and others of similar difficulty. But like the earlier Logic Theorist, the GPS didn’t scale. Outside of its intended domain it wasn’t useful, contrary to what Simon and Newell had hoped.

Further, even at this early stage of AI work, the field was already showing marked signs of the bluster that would come to dominate and embarrass later researchers and research efforts. Part of the hype was simple over-confidence. Mathematician and linguist Yehoshua Bar-Hillel dubbed this the “fallacy of the successful first step,” pointing out that early progress does not imply that subsequent steps of the same kind guarantee an eventual solution. It’s always possible that the full problem requires methods of an entirely different kind than those used to solve the initial, relatively easier parts of the problem. It certainly proved so with early efforts on simulating cognitive problem solving, as first the Logic Theorist, then GPS were officially “retired” by Simon, Newell, and Shaw a few years after they generated such optimism.

The other part of the hype was . . . well, hype. As AI scientist Drew McDermott points out, programs like the “General Problem Solver” were in fact specific algorithms for solving constrained problems, but their names tended to impart an air of generality and robustness that encouraged a misunderstanding about their actual modest, even uninteresting, capabilities. Says McDermott:

Many instructive examples of wishful mnemonics by AI researchers come to mind once you see the point. Remember GPS? By now, “GPS” is a colorless term denoting a particularly stupid program to solve puzzles. But it originally meant “General Problem Solver,” which caused everybody a lot of needless excitement and distraction. It should have been called LFGNS – “Local Feature-Guided Network Searcher.”

In fact, the tendency of AI researchers to endlessly hype and overrate their programs has been part and parcel of the field up to the present day (more on this later). The over-confidence behind Bar-Hillel’s first-step fallacy, along with McDermott’s “wishful mnemonics,” is in part understandable; AI, after all, is also a “brand” whose successes are money-makers for their designers as well as their companies or institutions. But boasting and trickery are also a tacit admission that, on the merits, AI is often not as impressive as it’s billed. This brings us to early work on machine translation: the task of automatically translating one natural language into another.

Understanding natural language (such as English or Russian) was an immediate target of early work in AI. As with problem-solving, work began by tackling what looked like a tractable engineering task: designing systems that could translate most or all of one language into another. Machine Translation (MT) work thus began in earnest. Predictably, it met with immediate success, then began encountering difficulties as the range of unsolved problems continued to grow, not shrink, with further effort. Language translation, like intelligent thinking itself, turned out to be harder to automate than anyone had suspected.

Hubert Dreyfus

Dreyfus, the AI gadfly who wrote the definitive critique of AI, What Computers Can’t Do (1972), noted (with typical acerbity) that “. . . attempts at language translation by computers had the earliest success, the most extensive and expensive research, and the most unequivocal failure.”

Dreyfus was right. So-called Fully Automatic High-Quality Translation (FAHQT) was a well-publicized bust. By 1966, the National Research Council had poured $20 million into high-visibility research on machine translation – part of the Cold War effort – but had, effectively, nothing to show for it. The notorious Automatic Language Processing Advisory Committee (ALPAC) report, released in 1966, declared that the computer systems for MT were too slow, too expensive, and generally just didn’t work well. Many of the brightest stars in AI hoping to crack the natural language problem found themselves instead without funding, as the NRC killed its support in the wake of the report’s findings. A decade after a near-ecstatic beginning at Dartmouth, AI suffered its first major and unequivocal blow.

The problem with early MT work wasn’t incompetence, but rather (once again) that AI turned out to be harder than anyone imagined. Bar-Hillel noted the essential challenge ignored in the designs of the early systems as early as 1960, in a now-famous example:

Little John was looking for his toy box. Finally he found it. The box was in the pen. John was very happy.

The basic problem here was obvious once one zeroed in on it: what meaning do we assign to the English word “pen”? As Bar-Hillel explained, there were at least two candidates:

(1) A certain writing utensil, or
(2) An enclosure where small children can play.

But the difference between (1) and (2) matters, of course, since one object has to “fit” into the other in the example. Can little John’s toy “box” fit in the “pen,” as with (1)? Well, no. A person immediately sees that sense (2) is meant, but precisely this appreciation of word meaning is what automated systems lack. Since the computer has no actual knowledge of the relative sizes of objects, it can’t decipher the correct meaning, and gets stuck (or produces the wrong answer). Relying only on shallow statistical facts about the words in the sentences, it lacks this deeper understanding. But without it, how is it to properly translate Bar-Hillel’s example, and countless other sentences in natural language?
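To make the point concrete, here is a toy sketch (my own illustration, not any historical system) of what disambiguating Bar-Hillel's sentence actually requires: hard-coded world knowledge about the relative sizes of objects. The size table and function names are invented for illustration.

```python
# Hypothetical knowledge base: rough object sizes in centimeters.
# A real system would need vast amounts of such commonsense knowledge.
SIZES = {
    "toy box": 40,
    "pen (writing utensil)": 14,
    "pen (play enclosure)": 150,
}

def plausible_container(inner: str, outer: str) -> bool:
    """An object can only be 'in' a strictly larger object."""
    return SIZES[inner] < SIZES[outer]

def disambiguate_pen(inner: str) -> str:
    """Pick the sense of 'pen' that could actually contain the inner object."""
    for sense in ("pen (writing utensil)", "pen (play enclosure)"):
        if plausible_container(inner, sense):
            return sense
    return "no consistent sense"

# "The box was in the pen" -> only the play-enclosure sense survives.
print(disambiguate_pen("toy box"))  # pen (play enclosure)
```

The punchline is Bar-Hillel's: the rule itself is trivial, but the size facts it depends on are endless, and no purely statistical analysis of the sentence's words supplies them.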

After ALPAC and the failure of early efforts on MT, the second phase of AI work began in the mid-’60s and lasted until roughly the mid-’70s. Understandably, research now centered on the facts and rules – the knowledge – to spoon-feed computers until they understood enough about the world to begin disambiguating the language we use to describe it. By the end of the 1960s, the quest to meet Bar-Hillel’s challenge with knowledge-based AI programs had begun.

Marvin Minsky and Seymour Papert at MIT led the way. Their approach became known as “symbolic,” after the famous Physical Symbol Systems Hypothesis (PSSH) formulated by Newell and Simon:

A physical symbol system has the necessary and sufficient means for general intelligent action.

The symbolic approach to AI, which philosopher John Haugeland later called Good Old-Fashioned Artificial Intelligence (GOFAI) was based on the insight that symbols in computer programs could be made to stand for anything, including objects in the physical world. Programs weren’t just manipulating numbers; or rather, they were just manipulating numbers, but these representations could be interpreted as things. The manipulations in code, then, became a way of reasoning about arbitrary things. The PSSH provided the theoretical foundation upon which so-called Symbolic AI could progress. (A separate, major approach was to model the brain with simple constructs resembling neurons, called “perceptrons.” We’ll get to this approach later, as it has resurfaced as the major paradigm in AI today.)

Minsky and Papert championed the development of methods for handling, processing, manipulating, and “dealing with” knowledge in isolated domains known as “micro-worlds.” Micro-worlds were supposed to provide the initial insights that would lead to more general programs that could scale up to real-world thinking. Successes in micro-worlds could be combined, perhaps, and like Legos, we could piece together a human-level general intelligence. Or, methods that proved useful in some specific domain could be generalized to larger scenarios. Micro-worlds were a first response to the back-to-the-drawing-board attitude adopted by AI – its proponents, its critics, and its funders – in the wake of the ALPAC debacle.

For instance, Terry Winograd at MIT – who was later to mentor Sergey Brin and Larry Page of the nascent Google at Stanford – proposed the Blocks World, a constrained domain consisting only of a set of wooden blocks, essentially, toy blocks in a room. In the Blocks World, a physical robot arm would pick up and stack the wooden blocks in response to English commands typed into a console by a human operator.

The entire world could be described by as few as 50 English words: nouns such as “block” or “cone,” verbs such as “move to” or “place on,” and adjectives such as “blue” or “green.” Via a program Winograd devised called SHRDLU, an operator could ask the robot to “pick up the green cone and place it on the blue block,” for example.

The system worked great on blocks. But the actual language processing Winograd’s SHRDLU system performed was woefully inadequate for anything outside this limited domain. In other words, the system didn’t scale. Winograd was later to admit that the problem was much more difficult than anyone had imagined, echoing other early researchers’ assessments, such as McCarthy’s. (Tellingly, Winograd today heads the “Human Computer Interaction” [HCI] group at Stanford, a much more applied area focusing on tractable issues like user interface design.)

Minsky himself was to experience a profound change of mood, admitting in 1982 to a reporter that “the AI problem is one of the hardest science has ever undertaken.” Yet in the late 1960s, the failure of the Blocks World simply suggested to him yet other, quite similar, strategies. He and other researchers at the MIT Artificial Intelligence Lab were convinced that GOFAI was obviously right in spirit, with its theoretical underpinnings in centuries-old theories about the mind as representing the world it perceived and thought about. We must be representing the world somehow if we’re thinking about it at all, and so there must be some structure to the world that we can exploit in our programs. AI – and language understanding – must be about getting the right knowledge into the system, structured in the right way, so that relatively simple programming strategies could access and render usable this knowledge in the performance of intelligent tasks. In other words, thinking like a person is ultimately the challenge of representing and using the knowledge available to people.

Roger Schank, at Yale, was an early and forceful champion of the post–Blocks World approach. Schank thought that the proper focus of AI wasn’t blocks or other micro-worlds but rather social situations. Such situations were stereotypical: they involved a set of prior expectations, and were therefore tractable for computation. His assumption that much of everyday life – ordering food at a restaurant, say – follows a pre-planned or typical sequence of events led to his work on the so-called Conceptual Dependency Theory. Schank used this framework to develop what he called “scripts” – templates, basically, that captured expected scenarios like ordering food, or paying a parking ticket. Dreyfus would later call his scripts “predetermined, bounded, and game-like.” Schank defined them as follows:

We define a script as a predetermined causal chain of conceptualizations that describe the normal sequence of things in a familiar situation. Thus there is a restaurant script, a birthday-party script, a football game script, a classroom script, and so on. Each script has in it a minimum number of players and objects that assume certain roles within the script . . . [E]ach primitive action given stands for the most important element in a standard set of actions.

Scripts were intended to give the computer a starting point: a set of expected actions, players, and scenes to work with:

Script: restaurant
Roles: customer; waitress; chef; cashier
Reason: to get food so as to go down in hunger and up in pleasure

From here, Schank reasoned, AI systems could scale up to handle more and more complicated, ambiguous situations.

At MIT, Minsky too propounded a version of scripts. He called them “frames.” Frames, like scripts, were supposed to represent everyday knowledge. Where micro-worlds were tractable but relatively uninteresting domains, frames were capable of capturing big pieces of real life – the typical events of attending a party, walking into a living room, eating out, and so on. Minsky explains:

A frame is a data-structure for representing a stereotyped situation, like being in a certain kind of living room, or going to a child’s birthday party . . .

We can think of a frame as a network of nodes and relations. The top levels of a frame are fixed, and represent things that are always true about the supposed situation. The lower levels have many terminals – “slots” that must be filled by specific instances or data. Each terminal can specify conditions its assignments must meet . . .

Much of the phenomenological power of the theory hinges on the inclusion of expectations and other kinds of presumptions. A frame’s terminals are normally already filled with ‘default’ assignments.
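Minsky's description translates almost directly into a data structure. The sketch below is my own minimal illustration (not Minsky's code): fixed top-level facts, plus terminal slots pre-filled with defaults that specific observations can override. All the names and default values are invented.

```python
class Frame:
    """A stereotyped situation: fixed facts plus default-filled slots."""

    def __init__(self, name, fixed, defaults):
        self.name = name
        self.fixed = fixed            # always true of this kind of situation
        self.slots = dict(defaults)   # 'default' assignments, overridable

    def fill(self, slot, value):
        # A specific instance or datum replaces the default assignment.
        self.slots[slot] = value

birthday = Frame(
    "child-birthday-party",
    fixed={"occasion": "birthday"},
    defaults={"food": "cake", "activity": "games", "gift": "present"},
)

birthday.fill("food", "ice cream")  # an observation overrides one default
print(birthday.slots["food"])       # ice cream
print(birthday.slots["activity"])   # games (the default is still assumed)
```

The defaults are what give frames their "phenomenological power": the system presumes cake and games until told otherwise. The trouble, as the next paragraph explains, is deciding which presumptions are relevant when the situation shifts.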

Scripts, and frames, didn’t work. The problem wasn’t one of scale, like micro-worlds, but relevance – what might be relevant from one situation to another. Stereotyping left too much out. This was a knowledge problem again, but a “commonsense” one. Systems using scripts or frames to understand stories, their original application, didn’t need complex, scientific knowledge but rather simple, everyday knowledge even young children had acquired: “Barack Obama is President of the United States,” or “Barack Obama wears underwear,” or even “When Barack Obama is in Washington, his left foot is also in Washington.” But simple knowledge like this seemed endless; even worse, the bits and pieces that became relevant kept changing, depending on context.

By the end of the 1970s it was clear that our language, or rather the interpretation of language, lay at the root of the problem with scripts and frames (indeed, language understanding was emerging as the key problem for all of AI). Schank intended his scripts to be used by physical systems – robots – but his initial work was on programs run on mainframes or desktops that analyzed textual stories about social scenarios. This was language understanding, once again. But it became clear that language understanding was really a symptom of a more profound mystery: how do we grasp what’s relevant in varying situations? It seemed like no one in AI was directly grappling with this problem of relevance, but it was surfacing everywhere in the field, over and over. Bar-Hillel, once again, had been prescient here: “The number of facts we human beings know is, in a certain very pregnant sense, infinite.”

Infinitude was not a promising concept for a supposedly practical, engineering-based field. Indeed, the failure of scripts and frames highlighted problems in AI that were to lead scientists and engineers into even deeper, more puzzling, and seemingly more intractable waters in the years ahead. By the 1980s, the so-called Frame Problem – the problem of grasping what is relevant and ignoring what is not, in real-time thinking – had added a seemingly mysterious and intractable conundrum to the already puzzling issue of how to give computers knowledge.


Philosopher and cognitive scientist Daniel Dennett called it no less than a “smoking gun”; relevance, not just knowledge, was apparently at the center of the very idea of Artificial Intelligence.

We’ll take a look at the Frame Problem and its discontents in a final section. First, however, we turn to the modern shift in AI toward statistical, data-driven methods – a shift that disguised the Frame Problem and the deep (perhaps insoluble) challenge it poses for AI: yesterday, today, and seemingly tomorrow.

Rise of Modern AI on the Web

The rapid emergence of the World Wide Web in the mid-1990s gave AI researchers a resource that was previously unavailable: data. GOFAI projects continued – such as former Stanford and Carnegie Mellon computer scientist Douglas Lenat’s “Cyc” project (short for “encyclopedia”), which focused on hand-coding more and more computer-readable knowledge into large knowledge bases intended to somehow solve the problem of relevant knowledge – but suddenly thousands and then millions of people were giving AI “big data” in the form of Web pages. Web pages were rich sources of text that could be analyzed by computational systems. This was an exploitable form of knowledge, expressed in natural language, added by humans, and available to anyone to download, aggregate, and analyze. Empirical methods, as they came to be called – computational approaches that exploited words and the surface features of text – exploded in the 1990s and quickly displaced the deep knowledge-based efforts. By the late 1990s, empirical or data-driven methods seemed the only game in town.

Tim Berners-Lee

The Web’s success also created an immediate need for computational systems to solve practical problems like search, retrieval, and classification. Suddenly, everyone online needed automated services to search, organize, and filter the rapidly exploding quantities of text as new Web pages were added daily. (In the early days of the Web, a major concern was whether anyone could ever find relevant information: it was seemingly a needle-in-a-haystack problem.)

Hand-crafted rules – the old approach of engineering knowledge bases, and rules to draw conclusions from them, using human experts – clearly couldn’t be scaled quickly enough for such an effort. The Web was growing exponentially by the day, seemingly by the minute. Millions of pages were created each year, and there were suddenly hundreds of millions of Web users worldwide. In 1995 a scant 16 million people, total, used the Web. By 2000 there were 361 million users. Into the next decade, there would soon be billions. The old focus on rule-based or expert systems seemed hopeless; statistical techniques were popular again. And this time around, they had volumes of data to work on.

The approach made possible by volumes of data from the Web was known as “data-driven,” or “Empirical AI.” Empirical AI today is the only game in town; it has been, in fact, since roughly the turn of the century. To understand why another AI “winter” is likely coming (though not yet recognized), even amid the success of Web behemoths like Google, Yahoo!, Facebook, Twitter, and others, we’ll need to unpack the statistical or data-driven approach brought back to life by the modern Web. From there, the inherent limits of natural language processing and AI reveal themselves alive and well.

Modern or data-driven AI is also known as Empirical AI because it embraces a view of knowledge that has its roots in philosophical empiricism: knowledge about the world is largely acquired, or learned, from experience. By contrast, traditional AI, which philosopher and AI researcher John Haugeland called “Good Old Fashioned AI” (GOFAI), assumed that a significant part of human knowledge is not derived from experience but is “fixed” in advance in the capabilities of the brain or mind. Noam Chomsky’s work on transformational grammar in the 1950s and ’60s exemplified this rationalist approach. Chomsky was an outspoken critic of Empirical AI, an approach that had taken root early in AI’s history with the work of Claude Shannon and others but was later sidelined.

Chomsky, for his part, excoriated the use of statistical methods in natural language understanding, which he saw as hopelessly inadequate. Chomsky himself played a large role in dismissing early statistical approaches to machine translation with his “poverty of the stimulus” arguments against empiricist, learning-based approaches like that of the celebrated behaviorist B.F. Skinner, arguing that native linguistic faculties must account for language understanding.

Noam Chomsky

He argued too, contra Shannon and the statistical tradition, that meaningless statements like “Colorless green ideas sleep furiously” are useless for statistical inference (predicting the next word given a context of prior words) but are nonetheless grammatical. Chomsky saw this as proof that “counting” methods were in fact too simplistic to capture the deep underlying structure of natural language. Dismal results from early statistical systems further buttressed Chomsky’s critique, as statistical attempts at machine translation proved unfit for the task, just as he had predicted. Thus Empirical AI, at least as a success story, has a relatively recent origin.

By the mid-1990s, however, the limitations of the rationalist or “GOFAI” approach had also become apparent. Commonsense reasoning – the everyday inferences humans make – seemed impossibly hard to code into machines. Several well-funded projects, including Japan’s “Fifth Generation” effort in the 1980s aimed at engineering robots with common sense, had all failed. By the 1990s, AI was experiencing one of its notorious “winters,” where funding had dried up on the heels of failed attempts to deliver on promises. The GOFAI projects still surviving represented, in essence, “Hail Mary” attempts to vindicate GOFAI, as evidence mounted that AI researchers had “stumbled into a game of three-dimensional chess, thinking it was tic-tac-toe,” as philosopher and cognitive scientist Jerry Fodor put it.

Claude Shannon

Today, however, Claude Shannon’s ideas, not Chomsky’s, dominate efforts in AI. The story here is somewhat circuitous. In his early work at Bell Labs, Shannon pioneered modern information theory, but he also made important contributions to the fledgling field of AI in the 1950s by showing that seemingly semantic problems in language understanding could sometimes be reduced to purely statistical analysis.

His famous “Shannon game,” for one, predicted English words from a small number of prior words. Information theory helped explain this: a natural language like English is redundant, so predicting the next letter in a word (or the next word in a sentence) can be modeled as a statistical inference conditioned on prior context. Given “q,” for instance, the next character in an English word sequence is easy to predict: it’s “u.” Shannon showed the value of making simplifying assumptions about language, fitting it into a mathematical framework, and treating it as “mere” information. If one viewed language as a simple “Markov” process, where the next element in a sequence can be predicted by considering only a local context of n prior elements, problems that seemed difficult could be reduced to simple mathematics.
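The Shannon game reduces to counting. Here is a minimal Markov-style sketch (the tiny corpus is a stand-in; Shannon used samples of printed English): count which characters follow each n-character context, then predict the most frequent continuation.

```python
from collections import Counter, defaultdict

def train(text: str, n: int = 1):
    """Count which characters follow each n-character context in the corpus."""
    model = defaultdict(Counter)
    for i in range(len(text) - n):
        context, nxt = text[i:i + n], text[i + n]
        model[context][nxt] += 1
    return model

def predict(model, context: str) -> str:
    """Return the most frequently observed continuation of the context."""
    return model[context].most_common(1)[0][0]

# A toy stand-in corpus; any English text shows the same redundancy.
corpus = "the queen quietly questioned the quiet quest"
model = train(corpus, n=1)
print(predict(model, "q"))  # 'u' — every 'q' in the corpus is followed by 'u'
```

Larger n gives sharper predictions at the cost of needing far more text, which is exactly why the Web's "big data," discussed below, mattered so much to this style of AI.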

Shannon’s methods worked well enough predicting words and characters in sentences, but on complicated tasks like machine translation, the value of a statistical approach was less clear. In fact, a notorious early attempt at using Shannon-inspired statistical methods for machine translation had failed miserably, just as Chomsky had been insisting. As we discussed in the history of AI section above, Fully Automatic High-Quality Translation (FAHQT) was an early Cold War attempt to use computers to translate captured Russian communications into English. The National Research Council poured millions of dollars into efforts at MIT, Carnegie Mellon, Stanford, and other universities, but the gamble didn’t pay off. Statistical analysis of simple word-to-word mappings failed to produce quality translations, and attempts at incorporating syntactic evidence (including Chomsky’s new “transformational” grammars) failed, too. The problem was enormously complex, with no workable solutions in sight. Thus, the initially promising statistical approach was sidelined for decades, and Shannon’s early work on AI was largely forgotten, or dismissed as irrelevant.

AI in the 1970s and ’80s focused on “rationalist” problems thought to require real-world knowledge: commonsense reasoning, knowledge representation, and knowledge-based inference. Progress here was slow as well. AI critic Hubert Dreyfus summed it up in his widely-read critique, What Computers Can’t Do: it seemed that competing approaches to AI were all getting their “deserved chance to fail.” By the 1980s, GOFAI was faltering on problems concerning relevance, as we’ve seen. The problems were clear enough to have names – the Frame Problem, the problem of Commonsense Knowledge, the Qualification Problem, and the Ramification Problem (the latter two being off-shoots of the Frame Problem) – but lacked workable solutions in spite of decades of focused effort in the AI community. It was also becoming clear that Shannon-inspired ideas about empirical methods had been given short shrift after the FAHQT debacle. For one thing, the increasing availability of data for training and testing empirical or “learning” methods was beginning to tip the scales, as erstwhile statistical failures were showing signs of success. This brings us back to Modern AI.

Work on machine translation, fittingly, helped revive Empirical AI. In the early 1990s, IBM began working on a statistical system known as “Candide.” Candide used an approach known as the “Noisy Channel Model.” The model took its name from Claude Shannon’s early work in communications theory, where messages were encoded, transmitted through a channel, and decoded into their original form for a receiver. The encoded messages, also known as the “signal,” were frequently distorted by noise in the transmission channel. Channel noise was a concern for Bell Labs where Shannon worked, and so his early work focused on reducing or eliminating “crackles” in telephone lines that affected the quality and comprehensibility of spoken communication for its customers. Shannon’s breakthrough at Bell was the modern theory of information, and the key to information in Shannon’s sense was redundancy.


If one added redundancy to a message, for example, the noise in the channel wouldn’t degrade it as much. Shannon proved that, in theory, the error rate could be driven arbitrarily close to zero (given certain assumptions about the messages and the channel) by adding redundant bits to the message. This became known as the Noisy Channel Model in communications and information theory.
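The simplest illustration of redundancy defeating noise (a textbook example, not Shannon's own construction) is a repetition code: send each bit three times, and let the receiver take a majority vote, which corrects any single flipped copy of a bit.

```python
def encode(bits):
    """Add redundancy: transmit each bit three times."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(received):
    """Majority vote over each group of three copies recovers the bit."""
    out = []
    for i in range(0, len(received), 3):
        group = received[i:i + 3]
        out.append(1 if sum(group) >= 2 else 0)
    return out

message = [1, 0, 1, 1]
sent = encode(message)
sent[4] ^= 1                     # the noisy channel flips one transmitted bit
print(decode(sent))              # [1, 0, 1, 1] — redundancy absorbs the noise
```

Real codes are far more efficient than tripling the message, but the principle is the same one the IBM translation work below borrows: a "noisy" signal can be decoded back to its intended form.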

Marcel Proust

Researchers at IBM realized that this Noisy Channel Model could be applied to the task of translating messages from one natural language to another. For an input language of, say, French and a target language of, say, English, some system S can “decode” messages in French into English by assuming that the input language is full of noise. French is just “noisy” English, so to speak.

(English speakers may find this a pleasing idea!) Hence, the model is as follows. A sender intends to send a message in English, but during encoding, the message ends up looking like French. The “noisy” French message must then be “decoded” by recovering the original English message hidden in the French and delivering it to the receiver. The “noisy channel” idea may seem odd, but the math is relatively straightforward. Given F as the observed (and, on this view, “corrupted”) expression in French, and E′ ranging over candidate expressions in English, the task of the translation system S is simply to find E* = argmax over E′ of p(F | E′) p(E′). This is simply the “maximum a posteriori probability” (known as “MAP”) English expression, given the (observed) French expression. As the IBM team noted, “[the] approach involves constructing a model of likely English sentences, and a model of how English sentences translate to French sentences. Both these tasks are accomplished automatically with the help of a large amount of bilingual text.” Indeed.
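The decoding rule can be sketched in a few lines. Everything below (the sentences and the probability tables) is invented for illustration; real systems estimate these models from millions of sentence pairs:

```python
# Toy noisy-channel decoder in the spirit of Candide. All probabilities
# below are made-up illustrative numbers, not real model estimates.

# p(E'): a "language model" over candidate English sentences
p_english = {
    "the cat is here": 0.6,
    "cat the here is": 0.4,
}

# p(F | E'): the "channel" model for one observed French sentence
p_french_given_english = {
    "the cat is here": 0.5,
    "cat the here is": 0.1,
}

def decode(p_lm, p_channel):
    """Return the MAP English sentence: argmax over E' of p(F|E') * p(E')."""
    return max(p_lm, key=lambda e: p_channel[e] * p_lm[e])

best = decode(p_english, p_french_given_english)
print(best)  # -> "the cat is here"
```

The language model rewards fluent English; the channel model rewards faithfulness to the observed French; the product balances the two.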

IBM’s Candide used transcripts of Canadian parliament proceedings in French translated into English. This amounted to millions of words available for learning the mappings (removing the “noise”), and Candide performed markedly better than earlier statistical translation systems. It took another decade, but by the mid-2000s, empirical methods had been fully vindicated. Suddenly, too, the multimillion-word corpus Candide used was relatively small fry. Google compiled a trillion-word corpus in 2006; Peter Norvig, the Berkeley AI scientist turned Director of Research at Google, remarked a few years later (in 2010) that corpora of this scale created different kinds of capabilities for empirical AI systems. Problems that had seemed impossible suddenly had solutions.

Norvig was half right. What’s clear is that the availability of huge datasets breathed new life into work in AI, and given the manifest difficulties encountered by rationalist “GOFAI” efforts, Modern AI has been recast as an empirical discipline using “big data” (very large computer-readable datasets) to learn sophisticated models. Hence, machine learning and data are two sides of the same coin for Modern AI. Big data/machine learning–inspired approaches have moved the ball on scores of practical AI tasks, like machine translation, voice recognition, credit card fraud detection, spam filtering, information extraction, and even sci-fi projects like self-driving cars.

A sober look at AI today, however, reveals a more complicated picture than a linear narrative of imminent success once the Web gave us volumes of analyzable data. What’s left out of the current litany of success stories are “hard” problems that do seem to require common sense and knowledge. Ironically, what Modern AI frequently ignores, or pretends it is solving, are just those problems that GOFAI tackled, yet foundered upon, like commonsense knowledge. Such problems, requiring real-world knowledge that is irreducible to statistical or ground-up learning methods, remain unsolved today, as yesterday. Why this is so, and why it remains so even with all the data and all the computational power available in the modern technological world, is an interesting question to which we will soon turn. First, however, we need to unpack in more detail some concepts that thus far we’ve introduced but not fully explained. Machine learning, in particular, is at the heart of Modern (Empirical) AI, and so it’s to machine learning that we turn next.

Machine Learning and AI

The dream of computers capable of learning dates back to the beginning of modern computation. Computer pioneer Alan Turing discussed learning machines in his seminal 1950 “Turing Test” paper, “Computing Machinery and Intelligence,” which opens by asking “Can machines think?” And his wartime colleague, the statistician I.J. Good, remarked famously in the 1960s that learning was the secret ingredient for creating truly intelligent machines, even “ultra-intelligent” ones having superhuman smarts:

. . . an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.


Implicit in Good’s sci-fi suggestion is the concept of improvement on tasks over time: learning. Small wonder then that machine learning so captivates the modern mind, married as it is to data on the one hand, and to today’s superfast computers on the other, affordable systems that surpass the multi-million dollar supercomputers of a decade ago. Learning, though, is a term having many meanings. Its use in the context of machine learning is circumscribed by mathematics and by a field known as “computational learning theory.”

“Machine learning” means that a system improves on a specific task, according to a quantifiable measure, as a function of time. Carnegie Mellon’s Tom Mitchell, a professor of Computer Science and an expert in the subfield of machine learning, defines it as follows:

Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

For a computer program, to “learn from experience E” is to learn from data: computer programs can’t “experience” human language use directly, for instance, but written or transcribed communication can be digitized and datafied for them, as we’ve seen. A task T is a problem such as ranking search results, or facial recognition (pick one), and a performance measure P is our assessment of how good a machine actually gets on a particular task.

Tasks in natural language processing (NLP – a subfield of AI) are often cast as learning problems in just the sense that Mitchell describes. IBM’s Candide system was one; there are many others. Voice recognition (identifying morphemes from phonemes), handwritten character recognition, machine translation, collocation identification, syntactic parse generation, information extraction, and many other problems in NLP are reduced to classification or regression problems. Systems devised to solve these problems “learn” in Mitchell’s sense by optimizing a set of model parameters to increase performance over time. “Learning,” in other words, may sound mind-like, but its meaning in AI admits of a straightforward and unproblematic mathematical treatment.

Still, machine learning, because it does involve improvement in performance over time, is obviously relevant to the broader vision of AI as creating machines with human-like (or greater than human-like) intelligence. For one thing, learning systems are plausible counter-examples to an age-old objection to the possibility of true AI: that computers can only “do what they’re programmed to do.” The objection was first suggested in the 19th century by Ada Lovelace (right), a brilliant and largely self-taught mathematician (and daughter of Lord Byron) who worked with Charles Babbage on his world-famous Analytic Engine, the monstrous mechanical calculator mentioned earlier that never fully worked but in important ways still anticipated the age of modern computation arriving in the next century.

Ada Lovelace
Ada Lovelace

Lovelace recognized the prima facie challenge to the possibility of “smart” computers, their seeming inescapable determinism: “The Analytical Engine has no pretensions to originate anything. It can do whatever we know how to order it to perform.”

Machine learning seems to offer a way out of the Lovelace objection. If learning, powered by previously unimaginable quantities of data, can improve a system as a function of time, why can’t the system “improve” to become, eventually, human-like? In other words, machine learning might suggest that commonsense intuitions about the differences between minds and machines might be wrong. Maybe Lovelace was wrong; in her day, machine learning hadn’t yet been invented.

It turns out, however, that machine learning has known limitations; it isn’t a magic bullet that can render Lovelace-like objections otiose. Data, too, doesn’t save machine learning here. To explain exactly why, we turn to the most powerful machine learning approach known today, supervised machine learning – learning from data that has been explicitly labeled by humans.

Supervised Machine Learning – A Brief Tutorial

Supervised machine learning (SML) is a catch-all term describing all types of machine learning that receive “supervised,” or human-labeled, data for training. Human-labeled data is considered the “gold standard,” so SML is considered superior to unsupervised or semi-supervised variants. A single training example is known as an “instance,” and a set of N labeled instances is denoted L = {(x1, y1), . . . , (xN, yN)}, where each x is a vector of features extracted from the training data and each y is the human-inspected label. Given a task such as Part Of Speech (POS) recognition or POS “tagging,” for instance, a training instance might be (1) a sentence, tokenized into its constituent words stored as elements in a vector, along with (2) each word’s corresponding part of speech (POS) label, or tag:

[The/DT cat/NN is/VBZ on/IN the/DT mat/NN]

Here, the feature data is the sentence “The cat is on the mat,” and the label data are the POS tags, using Penn Treebank-style abbreviations for parts of speech (DT is a “determiner,” and so on). In a typical learning task, training data consisting of thousands or even millions of such example sentences along with corresponding POS tags is extracted from a corpus, converted into training instances, and input to a learning algorithm.
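A minimal sketch of how such instances might be constructed, assuming simple hand-rolled dictionary features (the particular feature choices here are invented for illustration):

```python
# Turning a tagged sentence into one (features, label) training instance
# per word. Tags are Penn Treebank-style; features are illustrative choices.

sentence = ["The", "cat", "is", "on", "the", "mat"]
tags     = ["DT",  "NN",  "VBZ", "IN", "DT",  "NN"]

def extract_features(words, i):
    """A minimal per-word feature vector for POS tagging."""
    return {
        "word":      words[i].lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "suffix2":   words[i][-2:],
        "is_cap":    words[i][0].isupper(),
    }

training_instances = [
    (extract_features(sentence, i), tags[i]) for i in range(len(sentence))
]

print(training_instances[1])
# -> ({'word': 'cat', 'prev_word': 'the', 'suffix2': 'at', 'is_cap': False}, 'NN')
```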

The task of learning with training data amounts to approximating a function, or “model,” that best “fits” the feature-value pairs. Various learning algorithms can be used, depending on the particulars of a learning task. In general, classification approaches have discrete or non-continuous output; in the POS tagging case, the labels correspond to parts of speech for the target language (in this case, English). Regression, by contrast, learns models that output real or continuous values, like a real number in the interval [0–1].

Many language understanding tasks like POS tagging, entity recognition, or phrase-mapping tasks like machine translation are based on (or can be cast as) classification problems, and hence use generative sequence models such as Hidden Markov Models (HMMs), or discriminative methods like Maximum Entropy (MaxEnt) or Conditional Random Fields (CRFs). CRFs are a class of powerful learning algorithms based on an undirected graphical model; they solve known weaknesses in maximum entropy classification, like the “label bias problem,” but otherwise are similar to maximum entropy and other conditional models. So-called large-margin classifiers, like Support Vector Machines (SVMs), are “true” classifiers that can be trained on sequential data (like the word sequences for POS tagging) by using “tricks” like a pre-processing step, where a “sliding window” or other algorithmic technique converts the sequence of input into separate, binary classification problems. Artificial neural networks (ANNs), nearest-neighbor approaches, and decision trees are yet other methods under the rubric of SML.
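The “sliding window” trick can be sketched directly: each word becomes its own classification instance whose features are the words in a fixed-size window around it. The window size and padding symbols below are arbitrary choices; the resulting per-token instances could then feed a one-vs-rest binary classifier such as an SVM:

```python
# Sketch of the "sliding window" pre-processing step: convert a
# tag-the-whole-sequence problem into independent per-token problems.

sentence = ["the", "cat", "sat"]
tags     = ["DT",  "NN",  "VBD"]

def window(words, i, size=1):
    """The words within `size` positions of index i, padded at the edges."""
    padded = ["<s>"] * size + words + ["</s>"] * size
    return tuple(padded[i : i + 2 * size + 1])

# Each instance pairs a window of words with the tag of the center word.
instances = [(window(sentence, i), tags[i]) for i in range(len(sentence))]
print(instances[0])  # -> (('<s>', 'the', 'cat'), 'DT')
```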

A model (or “target function,” often denoted by F) is generated from the training phase. The model is then run in a test phase, where its performance is calculated on unseen data. Often this process is iterative, so that the performance of the model is updated or averaged on non-overlapping segments of test data. If the learning problem is well-defined, model accuracy will increase with additional input until it flattens (the accuracy curve is often asymptotic); the model is then saturated, and is released to be used in a production phase. Testing is a well-defined procedure in SML, as well.

To measure performance, a loss function is selected. “Loss” tells a designer of the system (and a customer) how to score incorrect answers against system performance. “0/1” loss is typical; this means that a misclassified test point receives a loss of 1, and a correctly classified test point receives a loss of 0. Accuracy is typically calculated using an F-Measure, the harmonic mean between system precision (the number of correct answers over the number attempted) and recall (the number of correct answers over the total possible). For p = precision, r = recall, the F-measure is computed as (2pr) / (p + r). Using a straightforward loss function like 0/1 loss, a measure of the performance of the system is relatively easy to calculate.
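Putting the pieces together, 0/1 loss and the F-measure are a few lines of arithmetic. The gold labels and system outputs below are a made-up toy evaluation:

```python
# Scoring a toy classifier with 0/1 loss, then precision, recall, and
# F-measure. The labels below are invented, not from any real system.

gold      = ["Spam", "Good", "Spam", "Good", "Spam", "Spam"]
predicted = ["Spam", "Spam", "Spam", "Good", "Good", "Spam"]  # system output

loss = [0 if p == g else 1 for p, g in zip(predicted, gold)]  # 0/1 loss
print(sum(loss))  # -> 2 misclassified test points

# Precision and recall for the "Spam" class:
tp = sum(1 for p, g in zip(predicted, gold) if p == "Spam" and g == "Spam")
fp = sum(1 for p, g in zip(predicted, gold) if p == "Spam" and g != "Spam")
fn = sum(1 for p, g in zip(predicted, gold) if p != "Spam" and g == "Spam")

precision = tp / (tp + fp)        # correct answers / answers attempted
recall    = tp / (tp + fn)        # correct answers / total possible
f_measure = 2 * precision * recall / (precision + recall)
print(round(f_measure, 3))  # -> 0.75
```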

A major problem with SML, and learning approaches generally, is known as “over-fitting,” where the model mistakes irrelevant “noise” in the training data for signal. To mitigate over-fitting, testing is often performed iteratively in a process known as cross-validation. For instance, 90 percent of the total data (corpus) may be used for training a model, with 10 percent of unseen data set aside for testing. As this 10 percent may somehow “fit” the training data well by chance (i.e., the resemblance is spurious), the training data can be repeatedly divided into “90–10” training/testing splits, where each new split selects another 10 percent of previously unseen data. This is known as “n-fold” cross-validation of the data. Thus, 10-fold cross-validation means that 10 independent, non-overlapping splits of the corpus were used to successively train and test a model. Performance scores (measured by an F-measure) are then averaged over the total number of splits used, and the mean performance is reported for the model. In this way, the actual performance of the model on unseen data can be estimated (in practice, of course, the actual performance can never be fully known). When the model accuracy is adequate (or as good as can be achieved), the model is released into production. This is the entire supervised machine learning framework, in a nutshell.
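The splitting logic itself is simple. A sketch of the 10-fold loop, with the actual training and evaluation steps left as stand-ins:

```python
# A minimal 10-fold cross-validation loop over a toy "corpus": each fold
# holds out a different, non-overlapping 10% of the data for testing.

corpus = list(range(100))  # stand-in for 100 labeled instances
k = 10
fold_size = len(corpus) // k

scores = []
for fold in range(k):
    test  = corpus[fold * fold_size : (fold + 1) * fold_size]
    train = corpus[: fold * fold_size] + corpus[(fold + 1) * fold_size :]
    assert not set(test) & set(train)       # splits never overlap
    # model = train_model(train); score = evaluate(model, test)  # stand-ins
    score = 0.9                             # placeholder per-fold accuracy
    scores.append(score)

mean_performance = sum(scores) / k          # the number actually reported
print(mean_performance)
```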

It’s also the approach in widest use today. Big-data powerhouses like Google, Facebook, Amazon, and many others use it, powered by their volumes of customer and Web traffic data. Amazon’s recommendation system, for instance, uses data about book and product purchases by customers, where the “gold standard” labels are the record of what was purchased rather than passed over. The system is thus “trained” by human customers on Amazon’s site. Facebook, too, classifies user posts according to topical and other information, and uses additional “label” information from the social graph itself – connections between users are a powerful, supervised “signal” that Facebook gets for “free,” so to speak.

Supervised learning and indeed machine learning generally have known limitations, in spite of their current popularity. Importantly, the provision of data, even gargantuan datasets or “big data”, can’t fix these limitations, which are fundamental to the approach itself, regardless of how much data is made available for training. We consider next these limiting factors.

Machine Learning and Its Discontents

Model Bias

All machine learning, regardless of specifics, suffers from model bias. Bias has a technical meaning in machine learning research and all models have a bias in this technical sense. In a broader but equally valid sense, however, machine learning approaches suffer from an “inductive” bias that we can call a “Frequentist Assumption.” The frequentist assumption in Empirical AI is so commonplace that it often goes unnoticed. “Frequentist” here means simply the “counts” or occurrences of target phenomena (features) in training data. The assumption is often merited. If we see examples of nouns following a definite determiner like “the,” for instance, we have inductive evidence that nouns typically follow determiners (“the car”).

The problem with frequentist assumptions in learning, however, is that unlikely events do happen. As Nassim Nicholas Taleb argued in his best-selling critique of inductive methods in financial analysis, The Black Swan (Random House, 2007), learning methods that are based on tallying frequencies in past data can be blind to unlikely events departing from such normal past behavior. When humans rely on such models, the results can be catastrophic, as in the 2007–8 financial crash. The problem with computational methods here is quite blunt: the more learning models rely on a Frequentist Assumption, the less capable they become of recognizing or classifying unlikely events. Yet this assumption, ironically, is often what is used to justify such methods as intuitive. It also often accounts for their successes, which may nonetheless later be revealed as spurious.

To see this issue in the context of AI, take document classification, a well-known problem in a field known as “information extraction.” Given some set of topics T = {t1, . . . , tn} and some set of documents D, for each d in D we classify d by assigning some tk in T representing the best or most relevant topic describing the content of d. A common application of document classification is spam detection: given an email document, classify it as “Good” or “Spam” (the classification is binary, hence “Spam” or “Not-Spam”).

Imagine now a relatively common scenario where a document, ostensibly about some popular topic like “Crime,” is actually a humorous, odd, or sarcastic story and not really a serious “Crime” document at all. Consider a story about a man who is held up at gunpoint for two tacos he’s holding on a street corner (this is an actual story from Yahoo’s “Odd News” section a few years ago). Given a supervised learning approach to document classification, however, the frequencies of “crime” words can be expected to be quite high: words like “held up,” “gun,” “robber,” “victim,” and so on will no doubt appear in such a story. The Frequentist-biased algorithm will thus assign a high numeric score to the label “Crime.” But it’s not a “Crime” story: the intended semantics and pragmatics of the story mark it as humor. Thus the classification learner has not only missed the intended (human) classification; precisely because the story fits “Crime” so well given the Frequentist assumption, the intended classification has become less likely – it’s been ignored because of the bias of the model.
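A deliberately crude frequency-based scorer makes the failure concrete. The topic word lists and the story text below are invented; the point is only that counting “crime” vocabulary swamps the story’s humorous intent:

```python
# A crude frequency-based topic scorer. Word lists and story text are
# invented for illustration; real classifiers use learned weights, but
# the frequentist bias shown here is the same.

topic_words = {
    "Crime": {"gun", "robber", "victim", "held", "gunpoint", "stolen"},
    "Humor": {"funny", "joke", "laugh", "absurd"},
}

story = ("a robber held a man at gunpoint on a street corner "
         "demanding the two tacos the victim was holding").split()

def score(topic):
    """Count how many story words belong to the topic's word list."""
    return sum(1 for w in story if w in topic_words[topic])

best_topic = max(topic_words, key=score)
print(best_topic, score("Crime"), score("Humor"))  # -> Crime wins; Humor scores 0
```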

We might assign multiple (possibly even contradictory) labels to documents, so that our taco thief story receives “Crime” and “Odd” labels, of course. But note that this isn’t really correct, either. We don’t expect to see the story in actual criminal accounts if it’s really written with the pragmatic intent of making us laugh. And if we’ve already seen it in “Odd” news, we don’t want to read it as a serious crime story after that, anyway. This doesn’t really work; the algorithm has to choose, but given the seemingly innocuous Frequentist Assumption, it can’t.

The Frequentist bias actually affects Google’s search engine non-trivially, too. Web pages containing blatantly sarcastic mentions, like “Awesome weather!” posted from Minnesota during a December blizzard, can end up as search results for vacation spots. Again, algorithms that have “thrown in their chips” for building powerful models of language using a frequency bias are very ill-suited to handling such cases, and importantly they become less and less capable the more powerful they become. Big data makes the Frequentist bias worse; more data means more bias to “learn.”

The flip side of model bias is, ironically, data sparseness. Texts on machine learning mention sparseness, but pass over it too quickly, given its importance. Christopher Manning and Hinrich Schütze, for instance, in their definitive text Foundations of Statistical Natural Language Processing (MIT Press, 1999), offer the following:

. . . for most words our data about their use will be exceedingly sparse. Only for a few words will we have lots of examples.

The sparseness of natural languages is often attributed to early observations by American linguist George Zipf. “Zipf’s Law” states that there exists some constant k such that f * r = k, where f is the frequency of a word and r is its position in a list ordered by frequency, known as its rank. Thus frequency and rank are inversely proportional; the 50th-most-common word (in some typical corpus) should occur three times more often than the 150th-most-common word, and so on. In practice, Zipf’s Law is only a rough description of the frequency of words in a language, but it is a useful approximation. At any rate, the vast majority of words in a corpus don’t occur very frequently: this is data sparseness.
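Zipf’s relationship f * r = k is easy to tabulate. The frequencies below are constructed to follow the law exactly (with an arbitrary k), since the point is the shape of the relationship rather than real corpus counts:

```python
# Tabulating Zipf's law f * r = k on an idealized rank-frequency table.
# k = 12000 is an arbitrary constant chosen so all divisions are exact.

k = 12000
freq_by_rank = {rank: k // rank for rank in (1, 2, 3, 50, 150, 1000)}

for rank, freq in freq_by_rank.items():
    print(rank, freq, rank * freq)  # the product stays constant at k

# The inverse proportionality: rank 50 vs. rank 150 -> 3x frequency ratio
print(freq_by_rank[50] / freq_by_rank[150])  # -> 3.0
```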

Sparseness might seem to be an argument for big data, on its face, since larger corpora will contain more and more counts of words, regardless. But given that word frequency follows the power-law distribution that Zipf outlines (Benoît Mandelbrot later refined it, but the details aren’t of interest here), there is always an effectively infinite “long tail” of rare, low-frequency words. Sensitizing learning methods relying on Frequentist Assumptions to ferret out this long tail is a central concern of modern research in big-data methods and AI.

Long-tail worries become even more pronounced when considering sequences of words together, known as “n-grams.” A bigram, for instance, is two words in sequence, while a trigram is three, and so on. Statistical inference on n-grams is a common language-processing task; in particular, the task of predicting the next word in a sequence from the words before it involves a conditional probability, which an n-gram model approximates using only the most recent words of the history:

P(wn | w1, . . . , wn−1) ≈ P(wn | wn−2, wn−1)   (the trigram approximation)

This is the Markov principle stated as a conditional probability. Interestingly (though rarely discussed), the Markov assumption here is actually false, as non-local dependencies in language are commonplace: “John swallowed the large green . . .” is a sentence fragment where the semantics of the verb “swallowed” clearly influences the selection of nouns following the adjective “green.” As Manning and Schütze note, nouns like “mountain” or “train” would not be preferred here, given the selectional preferences on the verb. Bigram or trigram models don’t capture this non-local dependency. As a practical matter, however, n-gram Markov models often perform quite well on many useful language tasks, particularly when big-data corpora are available.
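A bigram model of this kind reduces to counting. The tiny “corpus” below is invented; real systems estimate the same quantities from millions of words:

```python
# Estimating P(w_n | w_{n-1}) from bigram counts, then predicting the
# next word. The toy corpus is invented for illustration.

from collections import Counter

corpus = "the cat sat on the mat . the cat ate . the dog sat .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p_next(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# After "the", which word is most likely?
candidates = {w for (p, w) in bigrams if p == "the"}
best = max(candidates, key=lambda w: p_next(w, "the"))
print(best, p_next(best, "the"))  # -> cat 0.5
```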

N-grams, however, are even rarer than individual words, so the big-data approach has limits, too. The IBM Laser Patent Text corpus has 1.5 million words, yet researchers report that almost a quarter (23 percent) of the trigrams found in held-out test splits of the corpus had never been seen in training. Even massive corpora such as Google’s trillion-word compilation can’t fully mitigate the problem of sparseness. There will always be word combinations that are unseen in training data but which end up in test data – this is just a re-statement of the open-endedness and extensibility of language, along with the empirical observations captured (if incompletely) by Zipf’s Law, which tells us that the vast majority of words in any corpus are actually rare.

A “count” based method like maximum likelihood estimation (MLE) assigns a zero probability to words that don’t appear in training data, making such a simple method unusable, in effect. For this reason, MLE methods are typically supplanted by a technique known as “smoothing.”

Data smoothing is a mathematical “trick” that distributes probability (called the probability “mass”) from training data to previously unseen words in test data, giving them some non-zero probability. Smoothing methods in statistics are relatively well-known. Laplace’s Law, for instance, is a simple smoothing technique known as “adding one” that is equivalent to assuming a “uniform prior,” such that every n-gram is equally likely, prior to training (the Bayes assumption). Suppose C (w1 . . . wn) is the frequency of the n-gram denoted by w1 . . . wn in training data, N is the number of training instances, and B is the number of values in the (multinomial) target feature distribution. We then have:

PLap(w1 . . . wn) = (C(w1 . . . wn) + 1) / (N + B)

Other, more advanced smoothing techniques are also used, such as Lidstone’s Law and the Jeffreys-Perks Law (the former involves adding some positive quantity less than 1; in the latter, the value is ½, which corresponds to the expected likelihood, and so the method is also known as Expected Likelihood Estimation, or ELE). These “laws,” however, don’t capture observed knowledge about the actual probabilities of rare events. Rather, they “smooth,” or lessen, the blind spot in statistical inferences when observations of target phenomena aren’t (and can’t be) complete. They’re guesses, in other words, and have roughly the value of guesses when unexpected occurrences crop up. There aren’t any known “fixes” other than smoothing techniques for pure statistical learning methods. Such methods typically perform well for tasks that exploit frequencies, like part of speech tagging, and fail miserably for more complicated natural language tasks requiring knowledge and inference, like resolving anaphoric or other references. The former cases are typically touted as examples of the superiority of learning-based approaches, while the latter are typically ignored or dismissed.
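The difference between MLE and add-one smoothing is easy to see on a toy corpus. The sentence below is invented, and B is taken here as the number of possible bigram types over the observed vocabulary:

```python
# MLE vs. Laplace ("add-one") smoothing for bigram probabilities.
# The toy training sentence is invented; B is the number of possible
# bigram types (the text's B), computed over the observed vocabulary.

from collections import Counter

training = "the cat sat on the mat".split()
counts = Counter(zip(training, training[1:]))   # C(w1 w2)
N = sum(counts.values())                        # number of training bigrams
B = len(set(training)) ** 2                     # possible bigram types

def p_mle(bigram):
    return counts[bigram] / N

def p_laplace(bigram):
    return (counts[bigram] + 1) / (N + B)

unseen = ("cat", "mat")                         # never occurs in training
print(p_mle(unseen))       # -> 0.0 (MLE rules the bigram out entirely)
print(p_laplace(unseen))   # -> small but nonzero (1/30 here)
```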

Model Saturation

Model saturation is another known limitation of machine learning approaches. Machine learning models saturate when they plateau on a learning problem, and cannot learn additional information as the model is “full,” and its parameter values are then fixed. All models saturate, and once they do, further training data is useless, as the model has fit the data as best it can, and no further provision of examples will improve it.

financial crash

Saturation is often equated with model over-fitting, which occurs when a system begins mistaking “noise” in the training data for actual “signal,” as it’s often put. Over-fitting has received extensive attention in statistical and learning theory and remains a live problem in spite of decades of attention.

Complex phenomena like earthquakes or financial markets are fertile ground for generating “over-fit” models: the inherent complexity of such systems (varying pressures on vast areas under the earth’s crust, or the feedback loops and human choices forming the modern economy) makes discerning signal from noise very difficult.

Ironically, very complicated models from a mathematical standpoint (so-called “non-parametric density approximation models”) are often more vulnerable to over-fitting: signal plus noise is often more complex than signal alone. The key point is: over-fit models don’t generalize to unseen examples except by happy circumstance, as the real underlying distribution representing the signal (not the noise with it) was never learned. There is a growing awareness of the problem of over-fitting in the machine learning community (cf. Nate Silver’s The Signal and the Noise [Penguin, 2012]).

But as important as over-fitting is, it’s not the same as saturation. Saturation can occur even when a model “fits” the data well, showing good generalization performance on unseen data, yet can’t learn further patterns due to the design constraints on the learning model itself (choice of parameters, features, etc.). Given the importance of saturation, its relative lack of media attention or discussion even in the AI research community is puzzling. Peter Norvig, to his credit, acknowledges the issue. In a 2012 interview for The Atlantic, Norvig conceded:

We could draw this curve: as we gain more data, how much better does our system get? . . . And the answer is, it’s still improving, but we are getting to the point where we get less benefit than we did in the past.

Norvig here points to the natural tendency of model performance to level off as more and more data is added, approaching a final accuracy (often asymptotically), as more and more data yields less and less by way of results. Well-defined tasks that are relatively easy to learn often show relatively high performance before saturating (for instance, part of speech tagging), but other, more knowledge-dependent tasks vanquish learning models well before human-level performance can be reached. Regardless, when models saturate as their performance approaches some limit, more data doesn’t help, and the performance of the models is fixed, whether impressive or not. Saturation means that learning doesn’t continue indefinitely; it often terminates before interesting or powerful results can be achieved, a point that is for obvious reasons under-mentioned by AI and big data enthusiasts today.
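The leveling-off Norvig describes can be simulated with a stand-in learning curve. The functional form accuracy(n) = ceiling − c/n below is purely illustrative, not any real system’s curve:

```python
# Simulating diminishing returns from data: accuracy as a function of
# training-set size, approaching a ceiling asymptotically. The curve
# accuracy(n) = ceiling - c/n is an illustrative stand-in, not a real model.

ceiling, c = 0.95, 50.0

def accuracy(n_examples):
    return ceiling - c / n_examples

sizes = [100, 1_000, 10_000, 100_000]
gains = [accuracy(b) - accuracy(a) for a, b in zip(sizes, sizes[1:])]
print([round(g, 5) for g in gains])  # each 10x more data buys 10x less gain

# Past saturation, more data changes essentially nothing:
assert abs(accuracy(10**9) - ceiling) < 1e-6
```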

Bias, sparseness, and saturation are all known limits on machine learning/big data approaches that receive little attention in today’s hype-driven climate. Yet even thornier problems in natural language wait. Language understanding has plagued efforts in AI since its inception, and we turn to it next in our final section, as we bring to a close our discussion of the scope and limits of Modern AI.

Modern AI and the Hard Problem of Language

Limitations to Empirical AI demonstrate that nothing “magical” is afoot with recent successes from powerhouses like Google, Amazon, Facebook, and other data-driven companies. Big data works well on some problems, where bias, sparseness, and saturation aren’t (or aren’t yet) an issue. In such cases, the provision of data to develop statistical or learning methods does indeed help; one might even say that successes here have been greater than originally anticipated. When Frequentist Assumptions aren’t valid, however, or data for a particular task is inherently sparse, or models have already saturated, Modern AI has little to say. Indeed, it can become part of the problem, as we’ve seen with the inductive bias (Frequentist Assumption).

John Haugeland
John Haugeland

But such limitations are actually mellow compared with problems in language interpretation that require world or commonsense knowledge. Philosopher John Haugeland (right), in a 1979 article titled “Understanding Natural Language,” anticipated the centrality of the problem for AI, still with us today. Haugeland argued that holism is what makes understanding language difficult. When language is holistic, compositionality fails: understanding the words in a sentence doesn’t “scale up” to an understanding of the sentence itself – the parts don’t equal the whole. Humans resolve these perplexities by piecing together what the discourse means, using knowledge of the world (semantics) and knowledge about how people communicate (pragmatics). Computers today still lack such capabilities.

The semantic and pragmatic mysteries in language are not new. Again, early voices like that of Yehoshua Bar-Hillel were prescient. Bar-Hillel, in addition to his notorious skepticism about automated language-processing efforts generally, pointed out decades ago that simple sentences such as “The box is in the pen” can’t be understood using statistical methods. The problem, again, begins with ambiguity: words like “pen” are polysemous, or “many sensed.” A pen might mean a writing instrument, or a small enclosure for holding children or animals, depending on the context. But the context here isn’t statistical in nature. It requires knowledge about the relative sizes of boxes and pens in the real world. Such examples may be relatively uncommon in natural language texts, but that they occur at all spells trouble for machine translation – uncommon yet relevant disambiguations are precisely the issue that statistical methods seem ill-equipped even to address, let alone “solve.” A decade later, Haugeland posed similar questions about the necessity of somehow using relevant knowledge to tame holism, and drew similarly skeptical conclusions.

Decades-old debates may seem otiose today, until one realizes that a modern, world-class system like Google Translate gets Bar-Hillel’s simple sentence wrong, too: “pen” is translated by Google as “a writing instrument.” (The reader can verify this for himself, of course.) The failure is particularly telling for Modern AI proponents because Google’s translation system is touted as state of the art precisely because it maps phrases across languages using data: books and other pages on the Web that have been translated into other languages. It’s particularly poignant, then, that a simple sentence offered by a researcher in 1966 must be accounted a failure. Again, the problem is not one of data, but of relevant knowledge. How is a system trained on data to be made to “see” that the pen in the example must be an enclosure, rather than a writing instrument? Traditional researchers like Haugeland called this an example of commonsense “holism,” because one must have some general, usable knowledge about the relative sizes of “boxes” and “pens” in varying contexts, as humans surely do. Computers, apparently, do not; not even Google Translate.

Situational holism is even trickier for Modern AI and its embrace of data and shallow learning approaches. Haugeland offers a pair of seemingly simple examples that also confound today’s approach:

(1) When Daddy came home, the boys stopped their cowboy game. They put away their guns and ran out back to the car.
(2) When the police drove up, the boys called off their robbery attempt. They put away their guns and ran out back to the car.

If we translate Haugeland’s examples from English to German, then back to English using Google Translate, we get something of a “gist” for (1):

(1′) When Dad came home, heard the boys with her cowboy game. They laid down their arms and ran back to the car.

The illusion of understanding is quickly shattered, however, when one notes that Google translates (2) using the same phrase: “laid down their arms.” As the two situations are radically different, that Google assigns them the same idiomatic phrase perfectly illustrates the difference between data and knowledge. Much of the “power” of today’s approaches is an illusion, punctured by seemingly mundane language examples offered by researchers from the ’60s and ’70s.

Once one realizes the difference between grasping what’s relevant – actual, usable, context-dependent knowledge – and induction, which requires “counting up” previous examples and patterns, the real scope and limits of Modern AI become clear. Examples are suddenly easy to offer: “Mary saw a dog in the window. She wanted it.” Does “it” refer to the dog or the window? A six-year-old quickly uses context-sensitive or “abductive” inference to pick out a relevant sliver of world knowledge and resolve the anaphor “it” to its correct antecedent. Modern AI, however, fails on this seemingly trivial example, too.
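The failure mode can be made concrete with a small sketch. The resolver below is hypothetical, a stand-in for purely surface-level, pattern-based methods generally, not any real coreference system:

```python
# Toy sketch (not any production coreference system): resolve a pronoun with
# a purely surface-level recency heuristic. The heuristic has no access to
# the world knowledge a six-year-old uses ("people want dogs, not windows"),
# so it picks the wrong antecedent for "it".

def resolve_by_recency(candidate_nouns):
    """Return the antecedent a last-mentioned-noun heuristic would choose."""
    return candidate_nouns[-1]

# "Mary saw a dog in the window. She wanted it."
# Candidate nouns preceding "it", in order of mention:
antecedent = resolve_by_recency(["dog", "window"])
print(antecedent)  # "window": the recent noun, but the wrong answer
```

The point is not that better heuristics can’t be engineered for this sentence, but that no amount of surface statistics supplies the missing fact that children want dogs, not windows.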

In the final section of our story about the limits of AI, we can revisit the “smoking gun” of AI that emerged in the 1970s and ’80s and has now been pushed aside in the excitement and hype of Modern AI and its public success on the Web. It’s to a final look at relevant knowledge in AI, to the Frame Problem and its continuing mystery, that we now turn.

The Frame Problem

Haugeland wrote his “Understanding Natural Language” in November of 1979. By that time, another, and perhaps worse, problem than language holism was emerging in the embattled field of AI. It was known as the “Frame Problem.” The Frame Problem affected systems designed to perform intelligently in so-called “real time”: the unceasing stream of events in our everyday experience. Real-time problems affected robotics systems, as well as AI programs designed to reason about events (such programs were often designed for robotics systems). The problem here was time, or rather the fact that things change in time. Whereas Haugeland’s analysis of language was “static,” because holism pertains to what words mean when they’re written down and hence no longer in flux, the Frame Problem was dynamic; it was about reasoning in the changing world around us.

Haugeland pointed to features of our language that challenged, and seem destined to keep challenging, our attempts at developing an AI. With the Frame Problem, the challenge extended, quite literally, to the very world around us. While AI systems confronted holism in language, they confronted holism plus the additional complexity of time in real life. This was the Frame Problem. And, as with Haugeland, it was another technically trained philosopher (not an official AI engineer) who first exposed the Frame Problem in its full mystery, and started a discussion that was to have profound implications for future work in AI.

Daniel Dennett is an unlikely spokesperson for what amounts to an impossibility proof in AI. An Oxford-educated philosopher who also directs the Center for Cognitive Studies at Tufts University, Dennett (right) is known mainly for his advocacy of atheism and evolutionary theory and for his materialist theories of the human mind.

Daniel Dennett
Daniel Dennett

In 1984, however, Dennett realized that AI researchers working on the Commonsense Reasoning problem had stumbled into a puzzle about how humans managed to think, no less than the machines humans dreamt of building. They had uncovered, said Dennett, the Frame Problem.

Dennett didn’t coin the term. It was John McCarthy, once at Dartmouth, by now at Stanford, who introduced the Frame Problem in a 1969 paper (co-authored with Patrick Hayes) addressing some technical issues in logical systems with a temporal operator – that is, a variable for time, such as “t”. But as Dennett saw it, McCarthy’s technical discussion of the Frame Problem was really the tip of an epistemic iceberg. In a widely read and controversial paper titled “Cognitive Wheels: The Frame Problem of AI,” Dennett expanded the technical discussion of McCarthy’s Frame Problem to include the philosophical question of how intelligent agents – any intelligent agents, whether human or machine – understand what’s relevant when the world is constantly changing around them. Relevance, after all, is not just holistic. It changes. What’s relevant one moment becomes irrelevant the next. Grasping relevance in the real world seemed like “a magic trick,” as Dennett put it. And it was no wonder, then, that AI scientists were discovering impasse after impasse with their robots. We didn’t know how to solve the Frame Problem, Dennett argued in effect. How, then, could we know how to program a machine to do it?

Look Before You Leap

Dennett describes the Frame Problem as how to get a computer to “look before it leaps, or, better, to think before it leaps.” Intelligence, at least partly, is about considering the consequences of our actions. We all know how to think about consequences, to greater or lesser degrees, but the skill is notoriously difficult to program. Dennett tells the story:

Once upon a time there was a robot, named R1 by its creators. Its only task was to fend for itself. One day its designers arranged for it to learn that its spare battery, its precious energy supply, was locked in a room with a time bomb set to go off soon. R1 located the room, and the key to the door, and formulated a plan to rescue its battery. There was a wagon in the room, and the battery was on the wagon, and R1 hypothesized that a certain action which it called PULLOUT (Wagon, Room, t) would result in the battery being removed from the room. Straightaway it acted, and did succeed in getting the battery out of the room before the bomb went off. Unfortunately, however, the bomb was also on the wagon. R1 knew that the bomb was on the wagon in the room, but didn’t realize that pulling the wagon would bring the bomb out along with the battery. Poor R1 had missed that obvious implication of its planned act.

Poor R1, indeed. But the problems get worse. For, Dennett continues, the hapless robot designers soon discover that simple “solutions” to R1’s shortcomings introduce yet other perplexities. They propose another, more advanced robot, only to fall deeper into the Frame Problem:

. . . `The solution is obvious,’ said the designers. `Our next robot must be made to recognize not just the intended implications of its acts, but also the implications about their side-effects, by deducing these implications from the descriptions it uses in formulating its plans.’ They called their next model, the robot-deducer, R1D1. They placed R1D1 in much the same predicament that R1 had succumbed to, and as it too hit upon the idea of PULLOUT (Wagon, Room, t) it began, as designed, to consider the implications of such a course of action. It had just finished deducing that pulling the wagon out of the room would not change the colour of the room’s walls, and was embarking on a proof of the further implication that pulling the wagon out would cause its wheels to turn more revolutions than there were wheels on the wagon, when the bomb exploded.

What’s going wrong here? Why doesn’t R1D1 simply ignore all this, and zero in on the relevant implications of its actions? To put things another way: lots of things change, but only some changes really “matter,” so why can’t it just “get” this? Dennett continues:

. . . `We must teach it the difference between relevant implications and irrelevant implications,’ said the designers, `and teach it to ignore the irrelevant ones.’ So they developed a method of tagging implications as either relevant or irrelevant to the project at hand, and installed the method in their next model, the robot-relevant-deducer, or R2D1 for short. When they subjected R2D1 to the test that had so unequivocally selected its ancestors for extinction, they were surprised to see it sitting, Hamlet-like, outside the room containing the ticking bomb, the native hue of its resolution sicklied o’er with the pale cast of thought, as Shakespeare (and more recently Fodor) has aptly put it. `Do something!’ they yelled at it. ‘I am,’ it retorted. `I’m busily ignoring some thousands of implications I have determined to be irrelevant. Just as soon as I find an irrelevant implication, I put it on the list of those I must ignore, and . . .’ the bomb went off.

“All these robots suffer from the frame problem,” Dennett concludes, a bit anti-climactically.
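R2D1’s plight can be caricatured in a few lines of code. The sketch below is purely illustrative (the deducer and its facts are invented); the structural point is that even a perfect relevance classifier must still be run over every implication, so the robot’s deliberation grows with the total number of implications rather than with the handful that matter:

```python
# Illustrative only: a "robot-relevant-deducer" that tags each implication
# of an action as relevant or irrelevant. Classifying any one implication is
# easy; the trouble is that the classifier must examine *every* implication.

def implications_of(action, facts):
    # Hypothetical deducer: each known fact yields one (mostly trivial)
    # implication, e.g. "PULLOUT does not change the colour of the walls."
    return [f"{action} bears on: {fact}" for fact in facts]

def deliberate(action, facts, is_relevant):
    examined, relevant = 0, []
    for imp in implications_of(action, facts):
        examined += 1              # the clock ticks for every implication...
        if is_relevant(imp):
            relevant.append(imp)   # ...though only a handful ever matter
    return examined, relevant

facts = [f"trivial fact {i}" for i in range(10_000)] + ["the bomb is on the wagon"]
examined, relevant = deliberate("PULLOUT(Wagon, Room, t)", facts,
                                lambda imp: "bomb" in imp)
print(examined, len(relevant))  # 10001 implications examined, 1 relevant
```

However fast the loop, the bomb’s timer doesn’t wait for it; and in the real world the list of deducible implications has no obvious bound at all.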

After the publication of “Cognitive Wheels,” Dennett was accused of turning an engineering question into a philosophical discussion. His accusers were right; this is exactly what he had done. But he had a point, and in the end, no one could deny it. The Frame Problem emerged out of AI research, but it really exposed the deep mystery of intelligence generally:

It appears at first to be at best an annoying technical embarrassment in robotics, or merely a curious puzzle for the bemusement of people working in Artificial Intelligence (AI). I think, on the contrary, that it is a new, deep epistemological problem, accessible in principle but unnoticed by generations of philosophers, brought to light by the novel methods of AI, and still far from being solved.

Regardless of whether the Frame Problem was “solvable,” in other words, it was certainly a profound mystery, and one that AI researchers needed to accept in order to make real progress.

John Haugeland agreed. Like Dennett, he believed that McCarthy’s original engineering puzzle, once one understood it correctly, ended up threatening the very possibility of AI. “What should I think?” he asked about the Frame Problem in 1989, a decade after the publication of “Understanding Natural Language.” Haugeland dedicated a chapter, “An Overview of the Frame Problem,” to the Frame Problem in his widely read Artificial Intelligence: The Very Idea (MIT Press, 1985). Haugeland, like Dennett, believed that relevance was at the center of the Frame Problem. And, like Dennett, he was profoundly perplexed about how to deal with relevance using the tools of machines. If one knew what to think about, then the techniques and tools of AI applied. Things could proceed. But the very issue with the Frame Problem was figuring out what to think about in the first place: zeroing in on the relevant facts that might sit like needles in large haystacks of irrelevant ones. Such was the world around us. But it was this world – not the micro, toy worlds of earlier AI efforts – that AI scientists wanted to conquer. Only how?

Haugeland began his discussion of the Frame Problem with Bar-Hillel’s “box is in the pen” example. If AI systems do not “know” the relative sizes of objects such as boxes and pens, why not simply put the knowledge in? This is in fact exactly what AI later attempted, but Bar-Hillel ridiculed the approach as hopeless, even silly:

Understandable as this reaction is, it is very easy to show its futility. What such a suggestion amounts to, if taken seriously, is the requirement that a translation machine should not only be supplied with a dictionary but also with a universal encyclopedia. This is surely utterly chimerical and hardly deserves any further discussion.

Haugeland realized that Bar-Hillel was really warning the AI community that putting the knowledge in wasn’t enough. He was saying, in effect, that scientists hadn’t understood the real problem yet. The AI systems didn’t just need knowledge about the world; they had to know what knowledge was relevant, from one situation to the next. This question went to the very core of the notion of intelligence; it wasn’t just a question of engineering approaches. A knowledge base chock full of commonsense facts, in other words, would still be plagued with the problem of relevance. In fact, the problem would be worse, since determining which bits of knowledge were relevant would now involve a vastly larger search space. The Frame Problem, in other words, was Bar-Hillel’s ghost coming back to haunt AI, decades later.

Haugeland tried to break the Frame Problem down, to show exactly why it was so hard. He began by discussing system design issues, of which, he said, there were two: internalization and control. “Internalization” refers to how to represent knowledge in a system so that it’s accessible and usable. “Control” refers to the algorithms that can find the right bits of knowledge in different contexts – those that can determine relevance, in other words.

AI researchers assumed machine translation was “hard” because most of our knowledge about sentences in a language is hidden; what we see is only the tip of an iceberg. What was hidden was the part, as Haugeland put it, that “AI scientists have to build.” Hence, machine translation failed, originally, because it didn’t adequately address internalization: there wasn’t enough of the hidden part of the iceberg available to the early systems. Bar-Hillel had argued that the task of building the rest of the “iceberg” would be difficult because such knowledge would be contextual; relevance wasn’t a “nice to have” feature of AI, it was the entire story when it came to understanding language.

What Haugeland understood, however, was that Bar-Hillel’s worries about internalizing knowledge were only half the story. Real-time AI systems were also faced with questions of control: how to access relevant facts in changing environments. And the question of control had not been addressed by AI at all, except in toy domains where the issue of relevance didn’t arise in the first place. Not so with the Frame Problem. The Frame Problem was, in effect, the whole of Bar-Hillel’s critique of internalization plus a new puzzle about how to think about what one has internalized in order to make use of it. Machine translation was hard, in other words, but true AI was vastly, infinitely, harder:

In a typical logical system, subsequent steps are constrained by previous ones, but seldom uniquely determined. Hence, something else has to determine which premises actually get invoked, and which inferences get drawn. In other words, over and above an internal formal language, a functioning AI system also needs some determinate control, to guide and choose the actual processes in real-time.

Control was a mystery added on to internalization. Control was inference – what to believe based on what you already believe, and what you see in front of you. No one had the slightest idea how real-time inference could be reliably programmed. We still don’t, we should say, because the sort of inference required to make real-time, relevance-based decisions from everyday observations and background knowledge is also hopelessly removed from induction. As we’ve seen, the frequency assumptions inherent in the machine learning approaches dominating Modern AI work against recognizing this sort of relevance. The more data (evidence) Modern AI brings to bear on a particular problem today – recommending articles or products based on a user’s past choices, personalizing content, or major language-engineering efforts like Google Translate – the more likely it is to miss the unexpected outcomes that are the whole point: the hallmark of intelligent thinking.

The Modern AI argument thus faces a fatal flaw in its approach. While performance on many language-engineering tasks has indeed increased in recent years, the inevitable errors from systems using inductive approaches will no doubt come from the very difficult or unexpected (or statistically rare, but nonetheless valid given circumstances) examples whose solution requires a solution to the Frame Problem. For instance, in modern language engineering, the disambiguation of Bar-Hillel’s historic “the box is in the pen” example may succeed perhaps nine out of ten times today (Google cites a 90 percent success rate on sense disambiguation using its trillion-word corpus, for instance). Yet the rare meaning of “pen” as an enclosure in that sentence, in spite of the far greater frequency of its meaning “a writing instrument” in data, is exactly the rarer yet correct inference we need such modern systems to make, to show any real progress on the question of human thinking. Such answers aren’t forthcoming in today’s systems, ironically, because the value of Modern AI lies precisely in exploiting the “good enough” approach at the expense of outliers: getting the majority of answers correct for its users.
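The arithmetic of the “good enough” strategy is easy to sketch. The sense counts below are invented for illustration (they are not Google’s statistics); the point is structural: a disambiguator that always returns the corpus-majority sense is wrong every single time the rare sense is the one the situation demands.

```python
# Invented counts for illustration: in running text, "pen" as a writing
# instrument vastly outnumbers "pen" as an enclosure.
SENSE_COUNTS = {"pen": {"writing instrument": 9000, "enclosure": 1000}}

def most_frequent_sense(word):
    """The frequency strategy: always pick the corpus-majority sense."""
    senses = SENSE_COUNTS[word]
    return max(senses, key=senses.get)

# "The box is in the pen": the correct sense is "enclosure" (boxes don't
# fit inside writing instruments), but frequency alone can never say so.
print(most_frequent_sense("pen"))  # "writing instrument": frequent, wrong here
```

On these toy numbers the strategy would score 90 percent overall while scoring zero on precisely the sentences, like Bar-Hillel’s, where world knowledge is doing the work.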

It’s no wonder, to come full circle, that the Turing Test remains unsolved today. By the standard of the Turing Test, a system that recognizes most of what you mean (failing “only” on cases involving an appreciation of contextual relevance) isn’t intelligent at all. And given the unbounded, fluid nature of human conversations, where context constantly changes in real time in a feedback loop of meaning (“Oh, that’s interesting, it reminds me of such-and-such . . .”), the Frame Problem is precisely the same “smoking gun” today as it was yesterday. How to access facts in changing environments just is passing the Turing Test, to paraphrase Haugeland’s trenchant, earlier treatment of the issue. The “outlier” errors when models saturate today contain the “hard” problems: those requiring an escape from induction and a grasping of relevance in spite of frequency-based evidence. The Frame Problem thus bedevils modern efforts even more than older approaches; our success is, in a real sense, a beguiling yet inevitable illusion.

So, the question remains: how far have we really come? By any reasonable, unbiased assessment, the success of Modern AI rests on methods that don’t illuminate core challenges, let alone make progress on their solutions. Indeed, the lessons of AI seem specifically to indict data-driven methods as too shallow to really capture what’s required for human thinking.


The Turing Test is one example – perhaps still the most important – of a clear casualty of the Modern AI approach. For all the data-driven might of powerhouse companies like Google, the really interesting part of AI – what has captivated and frustrated researchers since its inception – remains untouched, now more than ever.

And so we’ve arrived here: as unpopular as the view might be, a critical look at Modern AI licenses the skeptical, not the optimistic conclusion about progress. We’re likely heading for another winter, not imminent machine intelligence. We might also draw a conclusion that has been lurking just under the surface of our discussion all along: human minds and the machines we build are not the same. Turing, for all his brilliance, may have succeeded at proposing a test that no one can solve. That may be the lingering story of AI into our futures, indefinitely.

Frequently Asked Questions

Question: What is AI and how did it come to be?

Answer: AI, or artificial intelligence, refers to the development of computer systems that can perform tasks that would typically require human intelligence, such as understanding natural language, recognizing images, and making decisions. The history of AI can be traced back to the 1950s, when computer scientists first began exploring the idea of creating intelligent machines.

Question: What is Machine Learning and how does it relate to AI?

Answer: Machine learning is a subset of AI that involves the development of algorithms and statistical models that enable computers to learn from and make predictions or decisions without being explicitly programmed. It is one of the most important and rapidly growing areas of AI.

Question: How do Machine Learning and AI impact our society and the economy?

Answer: Machine learning and AI have the potential to significantly impact our society and economy by improving efficiency and productivity, creating new jobs, and leading to new and innovative products and services. However, they also raise concerns related to job displacement, privacy, and data security.

Question: What are some potential applications of AI and machine learning?

Answer: AI and machine learning have a wide range of potential applications, including natural language processing, computer vision, self-driving cars, healthcare, finance, and more. Additionally, they can be used to automate repetitive tasks, to improve customer service, and to make predictions about future events.

Question: What are some of the ethical concerns related to AI and machine learning?

Answer: Some ethical concerns related to AI and machine learning include issues such as bias, transparency, privacy, and accountability. Additionally, they raise concerns about the impact of AI on job displacement and the potential for misuse of the technology.

Ruben Harutyunyan
