Your LLM Gets Its Data From Where??

Knowing why AI does what it does means knowing where the LLM data came from. It’s like finding one grain of sugar in a cake. Is that what makes the cake taste a certain way? No, it’s all the ingredients. [Getty/Studio Science]

Off-the-shelf LLMs pull in data from all over the internet. That can be a good thing, except when it’s not.

Do you trust generative AI to help you with tasks at work? While AI-powered tools are great at projecting confidence and certainty in their responses, chances are you know little about the origin of the AI’s data. Did it come from a trusted source, like a peer-reviewed journal in your industry, or a dubious one, like a smart-sounding comment scraped from social media? 

If your generative artificial intelligence (AI) uses an off-the-shelf large language model (LLM) like OpenAI’s ChatGPT or Google’s Gemini, the data you’re relying on is a grab bag of stuff pulled from a multitude of sources.  

“Data traceability is super important. Sometimes knowing that you don’t know where the data came from is just as important as being able to trace the data,” said William Dressler, regional vice president of AI and data architecture, and the head of innovation in the global AI practice, at Salesforce. 

You can’t know why the AI does what it does until you understand how the LLM was trained and where its data came from. “It’s like finding one grain of sugar in a cake,” said Dressler. “Is that what makes the cake taste a certain way? No, it’s all the ingredients.” 

So, where is your AI data coming from? Let’s take a look at some of the main sources.  

Corpus data

Corpus data includes written or spoken data from books, newspapers, articles, websites (including blogs), academic papers, and more. As the name implies, there’s a lot of it. Wikipedia, where anyone can write and edit an entry, is a major data source. It’s estimated that Wikipedia makes up 3% to 5% of the scraped data used to train off-the-shelf LLMs. 

The good: Corpus data captures a diverse snapshot of writing styles and subject matter. LLMs trained on it are versatile and can generalize their knowledge to a range of topics. Corpus data also makes LLMs more scalable: the volume of data means you can train ever-larger models with more parameters, which can yield better performance because larger models are better at capturing complex relationships and nuances in the data. With a larger corpus, the model can learn more and make better predictions.

The bad: Because this data is collected from such a broad array of sources, it can lack the specificity to be truly useful in specialized fields such as law, healthcare or finance. Second, if this data is not meticulously curated, it can contain biased, unethical or offensive information that can be reputationally damaging for a company. Finally, as with all data pulled from the internet, it can just be plain wrong, leading to erroneous outputs like hallucinations. In short, it must be fact-checked.

Web scraping

Web scraping involves using code, or web crawlers, to automatically retrieve information from websites. This can include everything from Reddit forums to social media posts, computer code, product pages, and even personal information. The main difference between scraped and corpus data is how it’s collected. Corpus data is often selected and collected systematically, and curated for specific research or analysis, while scraped data uses tools to absorb as much information as possible. 
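
To make the idea concrete, here is a minimal sketch of the extraction step a scraper performs, using only Python's standard library. It parses a hard-coded sample page rather than fetching anything from the network. The names `TextExtractor`, `scrape`, and `SAMPLE_PAGE`, and the example URL, are illustrative assumptions, not part of any particular crawler.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and link targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not code or styling.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def scrape(html, source_url):
    """Return the page text plus where it came from (hypothetical record shape)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "source": source_url,
        "text": " ".join(parser.text_parts),
        "links": parser.links,
    }


# Hypothetical page content standing in for a real forum thread.
SAMPLE_PAGE = """
<html><head><title>Forum thread</title><style>p { color: red; }</style></head>
<body><p>LLMs are trained on <a href="/wiki/data">lots of data</a>.</p>
<script>console.log("tracking");</script></body></html>
"""

record = scrape(SAMPLE_PAGE, "https://example.com/thread/1")
print(record["source"])
print(record["text"])
print(record["links"])
```

Note that the sketch stores the source URL alongside the text. A scraper that discards that field is one reason scraped training data can be hard to trace later.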

The good: Like corpus data, scraped data comes from diverse sources and therefore covers a broad range of topics. It also reflects real human language, including colloquialisms and slang, making the LLM capable of producing content that represents different voices and tones. Two distinct benefits: Scraping tools can be designed to run in real time, so the information stays current, and they can also be tailored to specific domains or areas of expertise. 

The bad: Scraped data may include copyrighted or trademarked material, so the LLM’s outputs may be similar or even identical to that material. This is at the heart of lawsuits against OpenAI and Nvidia. The risk of copyright infringement exists with corpus data too, but is more pronounced with scraped data because of how it is collected. Picture an enormous fishing net, indiscriminately grabbing everything it can whether it’s relevant to the task or not. There are other issues. According to the Organization for Economic Cooperation and Development, LLM outputs based on scraped data “can mimic the style of artists or the likeness of people whose works, images, or voices were used to train the LLM.” These outputs could be used to falsely portray people. 

Public datasets

Public datasets are those created and shared by organizations, researchers, government agencies and others, and made available to the public for free. These include repositories like the nonprofit Common Crawl, an open archive of crawled web pages that covers a wide range of topics. Depending on the dataset, public data can also be multimodal, spanning text, audio and video.  

The good: Anyone can access this data for free, and it generally comes with few usage restrictions, though license terms vary by dataset and are worth checking. Public datasets usually conform to standard formats and structures, making them easier to work with. 

The bad: Some public datasets, like the government census (collected every 10 years) or public health data (a field still trying to fix its old, siloed records), may be incomplete or outdated. They may also be biased or limited in scope, failing to represent the entire population or over-indexing on certain groups. Further, the rigor and detail of documentation varies, and not all public datasets fully describe how the data was collected, which can make it hard for users to dig in and analyze the data correctly. 

Your most valuable AI data is your own

Your own data (service and sales histories, demographic, behavioral and personal information about customers, etc.) is the lifeblood of your organization. To tailor off-the-shelf LLMs to your company’s needs, you’ll need to incorporate your own company data into the AI model or prompt. 

The good: The biggest upside is relevance. By feeding your company-specific data into the LLM, you tailor the model to understand and generate content directly relevant to your industry, products, and target audience. This results in more useful, company-specific outputs. 

The bad: Dropping proprietary company data into an LLM can present privacy and security risks if the information in the prompt is retained and then unintentionally exposed. Another challenge is integrating your data with the LLM, which can be time-consuming and costly. You first need a solid data foundation, including identifying which data sources to unlock and how they will feed your LLM.

These challenges can be mitigated with two AI-era innovations. The first is a vector database, which can alleviate privacy concerns, help unify data, and save time and money. 
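
As a rough illustration of what a vector database does, the sketch below stores text as vectors and returns the entries most similar to a query. It assumes a toy embedding (plain word counts) and an in-memory list. A production system would use a learned embedding model and a dedicated database, and every name here (`embed`, `cosine`, `TinyVectorStore`, the sample records) is hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. Real systems use a learned model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self._rows = []  # each row: (vector, text, metadata)

    def add(self, text, metadata):
        self._rows.append((embed(text), text, metadata))

    def search(self, query, k=2):
        """Return the k stored entries most similar to the query."""
        q = embed(query)
        scored = [(cosine(q, vec), text, meta) for vec, text, meta in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
store.add("refund policy: customers may return items within 30 days",
          {"source": "policy.pdf"})
store.add("our office is closed on public holidays", {"source": "hr-handbook"})
store.add("shipping takes five business days", {"source": "faq"})

hits = store.search("how many days to return an item for a refund", k=1)
print(hits[0][1], hits[0][2])
```

Keeping a `source` field in the metadata means each retrieved passage can be traced back to the document it came from.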

Another effective approach is retrieval augmented generation, or RAG. It’s an AI framework that lets you automatically embed your most current and relevant proprietary data directly into the LLM prompt. That can draw on all available data, including unstructured data like emails, PDFs, chat logs, social media posts, and other information that could lead to a better AI output.
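
A minimal sketch of the RAG flow, under simplifying assumptions: retrieval here is plain word overlap rather than a real embedding search, and the code stops at building the prompt instead of calling any model. The function names, the `DOCS` records, and the prompt wording are all made up for illustration.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words, ignoring basic punctuation."""
    for ch in "?.,":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def retrieve(question, documents, k=2):
    """Rank documents by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(q & tokenize(doc["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=2):
    """Embed the top-k retrieved passages, with their sources, into the prompt."""
    lines = [
        "Answer using only the context below. Cite the source in brackets.",
        "",
        "Context:",
    ]
    for doc in retrieve(question, documents, k):
        lines.append(f"[{doc['source']}] {doc['text']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Hypothetical company records standing in for emails, chat logs, and memos.
DOCS = [
    {"source": "email-001", "text": "The customer asked to upgrade to the premium plan."},
    {"source": "chat-log-17", "text": "Premium plan pricing changed to 40 dollars per month."},
    {"source": "hr-memo", "text": "The cafeteria menu rotates weekly."},
]

prompt = build_prompt("What is the premium plan pricing?", DOCS, k=2)
print(prompt)
```

The resulting string is what would be sent to the LLM. Because each passage carries its source label, the answer can be checked against the record it was drawn from.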

Data traceability is just one piece of the puzzle

Generative AI is still so new that many organizations are only now coming to understand the importance of data provenance. LLM data comes from a multitude of sources. Understanding what those are will help you judge whether the LLM and its outputs are trustworthy. 

“If you don’t have trust in your AI, if you’re not comfortable with it, you cannot have great adoption or get great results,” Dressler said. “It might actually become a liability.”
