What Does AI-Ready Content Mean? Role of Data Structure

The History Factory Podcast · S6E7: What Does AI-Ready Data Really Mean?

Tune in to episode two of our Chroniqle™ AI miniseries on the importance of AI-ready content for knowledge management you can trust. Host Erin Narloch and Fred D’Silva, History Factory’s head of technology enablement, are back, discussing the role of data structure in providing AI models with information they can read, History Factory’s data conversion process for Chroniqle that keeps humans in the loop, and the broader implications of AI readiness.

Transcript:

Erin Narloch 00:11

Hi, History Factory listeners. We’re continuing on with the Chroniqle, AI-ready archives conversation today by taking on the topic of: what do we mean when we say, “AI-ready archives?” During this conversation, Fred really provides us with a detailed answer of what that means and how we’re approaching it at History Factory. I really hope you enjoy today’s conversation. Let’s get into it.

Erin Narloch 00:45

Hey, Fred. Great to be with you today talking about Chroniqle in this three-part series. I thought for today’s conversation we can talk about the concept of AI-readiness and knowledge management. I do want to shout out that we did a whole podcast on this back in Season 5, Episode 10, “The AI Data Gap: Why Historical Assets Matter More Now Than Ever.” This was a conversation between Jason and Chris. So, listeners today, if you like what you’ve heard us talk about, give that previous podcast a listen. I’m sure there’s much more you can learn from those two experts. I thought today we could start by me asking you the question, you know, what does AI-ready assets mean, or AI-ready archives mean, to you? I would love to hear your thoughts on that, Fred.

Fred D’Silva 01:44

Yeah, for sure. That’s actually such a big topic in the industry as a whole today, from a technology standpoint, just because what that means, essentially, is how ready are your assets to be able to be used by, like, these generative AI models, like, something like a ChatGPT or a Google Gemini? Essentially, it’s—these models are able to synthesize based on text you give it or content you give it, but not all content is created equal. So, generally, what you have to do is you have to have data that’s structured in a way that makes it easier for these AI models to read and understand nuance more—nuance meaning, like, structure, meaning, like, context behind a paragraph, context behind a quote, numbers inside of a table—all of that is vital information for the structure of a document itself, but how the AI reads that data, especially in today’s world, is very important. And I do want to mention that a lot of these systems make it so easy and quick to use, where you don’t really know all the nuance that’s happening in the background. You obviously have to convert a document into, like, a format that AI systems can read, but a lot of that is invisible to the user using, like, say, something like a ChatGPT. But the challenge is that that conversion process is very important in making sure your numbers and your tables and your pictures are conveyed meaningfully and correct to these AI systems. So, all that to say, AI-ready assets really means: how compatible is your data with these AI systems? And is it compatible in a level where you can plug it into, like, say, any AI system, and it could read and understand all of those nuances—context—correctly, or will you run into issues where information gets lost?
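
Editor's note: Fred's point that "numbers inside of a table" depend on structure can be illustrated with a small sketch. It is not History Factory's code. The table values and the function names `flatten` and `to_markdown` are invented for illustration, and Markdown is only one assumed choice of plain-text format that keeps row and column boundaries.

```python
def flatten(rows):
    """Naive extraction: join every cell into one stream of words."""
    return " ".join(cell for row in rows for cell in row)


def to_markdown(rows):
    """Keep the header row and the column boundaries as a Markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Hypothetical table, invented for this example only.
table = [["Year", "Stores"], ["1990", "12"], ["2000", "85"]]

flat = flatten(table)            # "Year Stores 1990 12 2000 85"
structured = to_markdown(table)  # header, separator, then one line per row

print(flat)
print(structured)
```

In the flattened string nothing says which number belongs to which column, while the structured version still pairs each value with its header, which is the kind of context the conversion step is meant to preserve.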

Erin Narloch 04:18

Yeah, I think that, you know, AI-readiness is that bedrock to get you prepared for this technological revolution we’re in. And I think of the work that we do to prepare, you know, knowledge bases for ingestion into Chroniqle, and that process of conversion is so important. Can you just talk a little bit about that, and how we think about it at History Factory?

Fred D’Silva 04:47

Yeah, for sure. So, the best way to describe it is: your AI system takes whatever you give it and then responds based off what you give it. So, the concept of, ‘you put garbage in and you’ll get garbage out.’ Obviously, with data being such a gold mine, and History Factory being built on the concept of verifiable truth and authenticity within our organization, we really wanted to challenge ourselves in making a system that can be—I should say, that people have a confidence in using. So, a lot of the times when, like, say, you use something like a ChatGPT, or even a Copilot, you ask it a question, and it might hallucinate some sources. It might, like, mix or mingle around some percentage or some number, or it might conflate, like, an origin story. And a lot of those challenges is not only due to how, like, those models search that information, but when it searches those sources of information, it might be searching and intaking that information in a way that those contexts get lost, or the original source might have not formatted that data correctly in a way that AI can read it. So, the process that we use in Chroniqle is essentially converting it into—say, like, your PDF, your Word docs, your PowerPoints, even your transcripts. We essentially go through a process of converting that original data format—whether it’s, like, a PDF, a Word doc, like, something that is well-formatted and well-defined—we have to get it into a format that is, quote, unquote, ‘plain text.’ So, it’s just text that AI models and AI systems can easily parse or recognize. And the challenge is: how do you take, like, say, a magazine spread, where there’s, like, content at the upper left corner, upper right corner, middle of the page, or there might be, like, a picture in the bottom left that, like, conveys the story of what that article has? When a human reads something like that, we immediately know, like, where our eyes should be looking at to read that data correctly. 
The challenge is when you use, like, these well-known systems called, like, OCR, Optical Character Recognition systems, they’ll essentially extract that text out. It might not extract, like, picture descriptions. It might not get, like, tables correctly. And when you just do, like, that simple OCR in getting the text, you have a bunch of text but, like, the order of how that information is displayed might not be correct, or a lot of the times is incorrect when you see that OCR. So, when you have, like, something garbled up like that, where the sectioning is incorrect, tables aren’t conveyed correctly, there might be percentages missing, and you feed that to, like, an AI system, when you ask it about the data, it’s going to give you incorrect data. And that’s what a lot of these systems are doing is they’re doing kind of the very fast approach in extracting text out, and these contents of these files out, where it could make a lot of mistakes. And so, with Chroniqle, we actually go through a whole conversion pipeline—or orchestration, if you will—where we pass it through, like, kind of like a first draft of converting using some well-known, like, Large Language Models, or even, like, machine-learning services that, like, Microsoft Azure or Google Cloud might provide. We get, like, that first pass of that document, but obviously that document, when it’s converted over, will still have those mistakes where a table might have, like, an extra cell in it or extra row, or, like, some of the numbers might have shifted over, or some of the text is completely lost because it might have been handwritten. We essentially employ a staff at History Factory to then go through a validation process of correcting those errors, basically correcting the tables so that there isn’t an extra row, making sure, like, the numbers are in the right place, making sure that the content is structured in a way that flows, just as if a human was reading that original magazine. 
So, instead of having, like, a content box at the upper left corner, upper right corner, middle of the page—like you would in a PDF—we’re now organizing those content blocks in the correct order, where it’s going as if, like, a human was reading it. All that to say, we finally get that converted document after it’s been validated, and now it’s in a format that a machine could read, but then, obviously, like, that document now you could include, say, like, in your ChatGPT, like, custom bot, you can attach it as a document to, like, a chat, and you could start using that. But in the form of something like Chroniqle, where you have hundreds and hundreds and hundreds of documents and you want the AI to work on it, you’ll need a much bigger system that can handle that. And that’s where—that’s another differentiator of Chroniqle. But at least now you get to that first step of having, like, an AI-ready document that can be used in any system.
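
Editor's note: the two checks Fred describes, putting content blocks back into reading order and catching tables with an extra cell or row, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Chroniqle pipeline: the `Block` type, the two-column page layout, the `column_split` parameter, and the sample data are all hypothetical, and real layouts would need more than a single sort.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One extracted content block with its position on the page."""
    text: str
    x: float  # left edge of the block
    y: float  # top edge of the block, 0 at the top of the page


def reading_order(blocks, column_split):
    """Order blocks column by column, top to bottom within each column.

    Assumes a simple two-column page divided at x == column_split.
    """
    return sorted(blocks, key=lambda b: (b.x >= column_split, b.y, b.x))


def ragged_rows(rows):
    """Return indexes of rows whose cell count differs from the header row.

    These rows are only flagged; a human reviewer decides how to fix them.
    """
    width = len(rows[0])
    return [i for i, row in enumerate(rows) if len(row) != width]


# Hypothetical OCR output, listed in the wrong order.
blocks = [
    Block("Right column start", x=320, y=50),
    Block("Left column end", x=40, y=400),
    Block("Headline", x=40, y=50),
]
ordered = [b.text for b in reading_order(blocks, column_split=300)]

# Hypothetical converted table where the last row picked up an extra cell.
table = [["Year", "Stores"], ["1990", "12"], ["2000", "85", "85"]]
flagged = ragged_rows(table)

print(ordered)
print(flagged)
```

The sketch only flags suspicious rows rather than repairing them, in keeping with the human validation step described in the conversation.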

Erin Narloch 11:13

Yeah, I think that’s a wonderful way of describing the process that we take at History Factory, and it’s also a service we offer, like, in conjunction with, separate from, Chroniqle, is, like, that preparing the archives to be AI-ready. I know that’s something Chris and the team in archives has done and continues to work on. Well, I think you’ve provided me with just great insight into why AI-readiness is so important. And I think, just to tack on at the very end of this conversation, something that, you know, this underpins is this concept of knowledge management and how, you know, compatible knowledge is between systems, and readable knowledge is between systems, and how we can really help to reinforce the knowledge an organization has, you know, within their, quote, unquote, ‘four walls,’ and really reinforce that. And I think taking on AI-readiness and thinking of archives as really being foundational for knowledge management within an organization is really, really important. Thanks for your time today, Fred.

Fred D’Silva 12:28

Of course. Thanks for having me, Erin.

Erin Narloch 12:30

Yeah.

Erin Narloch 12:30

I really enjoyed that conversation, and I hope you join us for our next installation on Chroniqle and AI-ready archives at History Factory. Alright, until next time.
