As artificial intelligence continues to reshape how organizations access and apply information, many companies are overlooking one of their most powerful resources: corporate archives. Chris Juhasz, senior director of archives at History Factory, explains how archival collections—when properly digitized, structured and made accessible—can serve as a trusted foundation for AI tools and enterprise-wide knowledge systems. These records hold critical context, institutional memory and source-verified data that large language models often lack.
Rather than viewing archives as static repositories, Chris makes the case for treating them as dynamic, intelligence-enabling platforms. For brands looking to lead in the AI era, activating heritage may be the smartest move they haven’t made yet.
Transcript:
Jason Dressel 0:11
Today on The History Factory Podcast: corporate archives and their role in the new era of AI, with History Factory’s Chris Juhasz.
Jason Dressel 0:25
I’m Jason Dressel, and welcome to The History Factory Podcast, the podcast at the intersection of business and history. Today, I’m joined by my longtime friend and colleague, Chris Juhasz, senior director of archives, and we’re here to talk about archives from a very different dimension, and that is in the context of AI. As companies race to train smarter models and produce content that is accurate, relevant and enriching, we’ll explore how historical assets, often overlooked, may become some of the most valuable data sets of all. So let’s jump right into the conversation with Chris.
Jason Dressel 1:09
Mr. Chris Juhasz, welcome to The History Factory Podcast.
Chris Juhasz 1:12
Well, thanks for having me, I’m happy to be here.
Jason Dressel 1:15
Yeah, thrilled to have you. So, amazing, this is your first time on. But Chris, why don’t you, for our listeners, provide a little bit of background on who you are and your role here at History Factory.
Chris Juhasz 1:27
Of course. Well, I’ll start by saying I’m a professional archivist first and foremost. I have held certification from the Academy of Certified Archivists. I’m educated as a librarian, so I came right from library school to History Factory in 2003. I’ve been with this organization for 22 years now. I started out as an entry-level archivist, and now I have the privilege of leading our archival team here in our archival facility in Virginia. So my work centers on, you know, overseeing the day-to-day here and helping out where I can with project operations. But also I’m heavily involved in strategy work, helping our clients plan archival projects and programs and doing a lot of the upfront work that happens before an archival program comes into being.
Jason Dressel 2:26
Yeah, and maybe we can start there. We’ve had a number of different archivists on the podcast over the years, and particularly this season. So I can maybe pull the curtain back a little bit. Since we don’t have to, I don’t have to appear so smart with you. So maybe for our listeners, let’s start with the big picture. When we talk about corporate archives, what are we talking about in terms of what an archive is and how it’s used by an organization?
Chris Juhasz 2:58
Sure, I like to think of archives broadly as pretty much any information source that gives an organization the means to remember and to interpret and understand and communicate the past. So the specific kinds of information kept as archives vary widely, you know, based on what the organization does. But what those information sources all have in common is the potential to be used for lots of new purposes, from creating new communications to maintaining organizational culture, clarifying legal issues, enriching learning processes. So you know, it’s material that’s kept because it’s special, because there are ways that we anticipate using it in the future, beyond just for some, you know, compliance need.
Jason Dressel 3:55
Yeah, yeah, and so that’s a great kind of setup for what we really wanted to talk about, which is this notion of the intersection of what is happening with AI and what is happening with information and data. And you know, I shouldn’t say they assume, but most organizations are working very hard to essentially leverage and maximize the data at their disposal to build out their AI capability. And I’m kind of curious what your thoughts are in the context of what they are overlooking in that search for data to build with?
Chris Juhasz 4:49
Yes, so I think many organizations are not accounting for the fact that there’s a significant amount of important historical data missing from, you know, the past three or so decades of digitally created content that they’re using in their AI integration. So I’m talking about information that’s still stored in physical formats and isn’t accessible to AI models. I think organizations might not be thinking about how much domain-specific knowledge is locked up in hard-copy text and images and video. So you know, if they have repositories of older information, like archival collections, or any stores of physical business records that are still in that, you know, kind of period of inactive retention, they should be considering how to tap into them as part of their AI projects.
Jason Dressel 5:48
So are you talking specifically about digital data that is associated with analog formats, formats essentially older than the last 30 or 40 years, when obviously most of the information in an organization is born digital? Or are you talking specifically about data that needs to be created, because it essentially represents a huge treasure trove of material from the pre-digital-native era, if that makes sense.
Chris Juhasz 6:29
Primarily, yeah, primarily talking about the latter, although you’re right about the former, because I think there is some electronic data that is being overlooked as well. I think the major miss here, or missed opportunity here, is in this, you know, physical format. You know, this material that is actually on paper and on tape and printed on photographic paper. You know, all these are the things that are completely unavailable to AI and require a bit more strategic thinking and work to get them to a place where they are able to be leveraged or fed into those models.
Jason Dressel 7:21
So why? What would the business case be for organizations doing that? What are the sort of implications of this gap in the data, or the discrepancy in terms of, you know, the data that organizations are using to build and train large language models, and now we’re hearing about these models called small language models, but regardless of the sort of specific types of deployment, if you have organizations that are basically looking to build proprietary data models that are being used to enable AI capability, what are the implications if they’re not using some of the sources of data that you’re talking about?
Chris Juhasz 8:08
Sure, yeah, so the data that organizations are using to create that kind of contained environment for AI that you’re talking about are largely coming from available text and images that have been sort of scraped from legacy systems and file shares. And this is a potential problem because these data are too limited in scope and coverage, in addition to being repetitive and rife with inconsistencies and biases. But we’re not going to talk about that today. We’re primarily talking about the issue of limitations of scope and coverage. AI models trained on, you know, insufficient or what I call chronologically biased data sets can experience performance problems like reduced accuracy, biased decision making, and, you know, lower ability to generalize. So AI systems, as we all know now, or at least as we’re all coming to recognize, are heavily dependent on the data they’re taught with, and that significantly impacts their performance and decision-making capabilities. So higher quality, more diverse and better curated data are essential to developing these models effectively for private AI implementation. So yeah, it’s pretty straightforward: garbage in, garbage out. The more high quality data you feed into your model, the better its outputs are going to be.
Jason Dressel 9:33
Yeah, I’m curious, have you and our team seen examples of this in the real world, like with our clients or elsewhere? Have we seen scenarios where this kind of garbage in, garbage out, or these big gaps in data, have led to less optimal outcomes in terms of how someone has used AI?
Chris Juhasz 9:57
Yeah, I mean, I should say that our use of AI tools in our research practice, I’d say, is still relatively new and untested, and most of what we’re doing is with the large models, you know, the commercially available large models. But we are noticing that these large models can be unreliable sources for answers to questions about, you know, when something first occurred, or about the time span of an organization’s activities and relationships. One example: we had a client who had a partnership with a nonprofit organization, and we had physical archival records clearly indicating the start date or the origins of that relationship. But, you know, ChatGPT told us that it began, you know, almost 20 years later than what we knew to be true. And we’re also seeing some limits to how well the models can answer questions about why something occurred, what events led to an event, or what were contributing factors and consequences. And we’ve also encountered some instances of flaws and inconsistencies in reasoning. So if you ask a question one way, say about an organization’s history of, you know, potential mergers with another organization, the model will say a merger never happened. If you ask it a different way, it will pinpoint the date of the actual merger between those two companies. So there’s a lot of, you know, unreliability that we’re noticing in the large models, and I think a lot of that is attributable to, you know, the amount of data and the quality of the data that’s being used to train those models.
Jason Dressel 11:41
So what kinds of assets make sense, from your view? Where are some of the gaps and opportunities where companies, you know, whether it’s an initiative being driven by a, you know, a CIO or a knowledge management group or a digital team, what are the kinds of categories of materials that they are sort of overlooking or missing that could help solve for some of these less ideal outcomes?
Chris Juhasz 12:13
We think, you know, text-based materials are the best starting place. When I say text-based material, I’m talking about things like, you know, publications, newsletters, reports, you know, that high-level communication, like your executive memos, press releases, the proceedings of board and executive meetings, you know, financial statements, speeches, transcripts. Any extensive source of sort of concentrated and trustworthy content is a good starting place. And after that, the next biggest decision factor is about the physical characteristics of that original material. It’s important to identify older printed and textual records that are in good condition, from which, you know, we can make good input images for OCR. OCR is the technology that converts, you know, images of text into computer-readable data. So documents that are higher contrast, have, you know, no irregular fonts, fewer imperfections and simpler layouts can be recognized with fewer errors. That’s important, because OCR accuracy is really the key to creating that high quality, AI-ready data from analog formats. So yeah.
Jason Dressel 13:37
And what does that process look like? I mean, it’s one thing we talk to our clients about, right? It’s not just scanning materials. Like, what is the process that has to take place to actually transform, you know, analog content into AI-ready data?
Chris Juhasz 13:53
Yeah, well, you know, once you’ve identified the right original source material, and you have scanned it and that text has been extracted with the OCR software, that resulting data needs to be pulled into shape. It has to be structured so that it can be stored in a predictable and standardized format. So basically, AI models don’t know how to interpret and process unstructured information correctly, and there’s a number of ways to prepare this legacy data for AI, but we’re experimenting with one approach that involves converting the OCR text to something called a Markdown format, for kind of portability purposes. So Markdown is basically plain text with simple symbols and conventions for preserving the structure of the original document, so you get a machine-parsable version of the document that still maintains the fidelity of all its formatting characteristics. And then the Markdown is combined with metadata, which we’ll say are kind of labels that are intended to teach AI about the meaning of the data, the logical meaning of the data, and this is critical to AI’s ability to understand the context of the data and to generate meaningful results from it. So the Markdown and the metadata are then outputted to something called a JSON file, which is a machine-readable kind of exchange format that’s widely used for integration with AI applications.
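As a rough illustration of the pipeline Chris describes (OCR text converted to Markdown, combined with metadata labels, then outputted to JSON), here is a minimal sketch. The sample text, field names and metadata values are all invented for illustration; this is not History Factory’s actual schema or tooling.

```python
import json

# Simulated output from an OCR pass over a scanned document.
# The document and its contents are invented for illustration.
ocr_text = (
    "ACME CORP ANNOUNCES PARTNERSHIP\n"
    "March 12, 1987\n"
    "Acme Corp today announced a new community partnership."
)

def to_markdown(raw: str) -> str:
    """Convert raw OCR text into simple Markdown, treating the
    first line as the document title and keeping the rest as body."""
    lines = raw.splitlines()
    title, body = lines[0], lines[1:]
    return "# " + title.title() + "\n\n" + "\n".join(body)

# Metadata labels supply the context the document text alone lacks.
record = {
    "metadata": {
        "title": "Acme Corp Announces Partnership",
        "doc_type": "press release",
        "date": "1987-03-12",
        "source_format": "paper, scanned",
    },
    "markdown": to_markdown(ocr_text),
}

# JSON is the machine-readable exchange format fed to AI applications.
print(json.dumps(record, indent=2))
```

The point of the sketch is the shape of the output, not the conversion logic: one JSON object per document, pairing structure-preserving Markdown with descriptive metadata.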
Jason Dressel 15:41
Yeah, awesome, and it’s changing so fast. We’re obviously, like everyone else, investing in our own AI capabilities to essentially create systems that will be able to perform better and deliver higher quality content. I think one of the things that most of us recognize with AI is, as incredible as it is, it’s still not great with context, right? And you sort of touched on that before. That context, that ability for it to function better, really is, to your point, focused on the better data that you put into it, both in terms of the way you can put in high quality, accurate content, but also data that can help it with that sort of contextual piece. That is really critical, and we’re seeing that. But I’m curious, just, you know, as we continue on this journey, what are the things you’re sort of the most excited about in terms of where this is heading over the next year or two?
Chris Juhasz 16:44
Yeah, it is definitely an exciting moment. We, you know, we’re very much aware that this is kind of a turning point for our profession on the archive side, and certainly for our clients it’s a major turning point. But I think the thing that excites me the most is that we believe it’s going to result in greater motivation and interest in digitizing historical collections. I mean, we’re going to be able to take advantage of this confluence: the need to support mission-critical projects around AI and to help them be successful, which will also enable us to achieve, you know, preservation and access goals that we as archivists and librarians have had for years, and we’ll be able to achieve those goals potentially faster and on a much larger scale than we may have previously imagined. It is still very much an expensive and time-consuming process to digitize physical material and then to make it ready for AI, but I think there’s going to be an increased, you know, impetus to get that work done. And that is exciting to me. I think also it represents, you know, a moment where there may be renewed appreciation for the role that librarians and archivists play in information access processes, and our value in supporting these kinds of tectonic changes in information technology. So, as we talked about with the metadata element, archivists and librarians have always been good stewards of information by being able to add that context and to create access points and to enable understanding, and I think there’ll just be a new light shined on that, which is, you know, exciting to me as well.
And then the last thing that I have on the list of sources of excitement, which may be a little bit far-fetched and maybe a little less realistic, but I do have some hope that there will be the possibility of collaboration between organizations to share their data, in a way, of course, that preserves privacy, but that also improves, you know, availability and diversity of data, and thereby benefits AI research and development. And, of course, this, if it were to happen, could also potentially further drive investment in this kind of digital transformation of archival collections. So there’s a number of things that make me optimistic and excited about the future as a result of the shift toward AI.
Jason Dressel 19:38
Yeah, without getting too in the weeds, the proliferation of digital content over the last 30 years has been, in many ways, a havoc-creating outcome for traditional archives and library management processes. And we certainly, you know, as a company that’s been around for almost 50 years, we certainly have made dramatic shifts in how we do our work. And it feels like with AI, suddenly, for archives and the library sciences, it’s like, I don’t know if I would say it levels the playing field, but it feels like, you know, we have some new tools that are going to help us meet the scale, because that’s been one of the challenges, right? It’s just the sheer scale of content, and with AI, it feels like there’s, hopefully, some new tools that help solve for that.
Chris Juhasz 20:43
Yeah, absolutely. I mean, I think the kind of manual approaches and processes that we’ve always relied upon, obviously, technology is going to find a way to eliminate those as barriers to getting AI, you know, running effectively. And so I think the role that we’ll play is not so much continuing to apply the same manual methodologies and approaches, but to be able to be contributors to the design of automated approaches to this work. And that’s certainly something that we’re experimenting with here a lot more now as well: using AI as a tool for automating metadata extraction, as an example, because creating metadata is one of the most time-consuming aspects of, you know, enabling understanding of this content. And AI holds an enormous amount of potential to streamline and quicken that process and still, you know, generate the kinds of data, and the amount and richness of data, that’s necessary for material to be usable, both by people and by machines.
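As a loose illustration of the kind of metadata capture that could be automated, here is a rule-based sketch that pulls a date and a document type out of OCR text. An AI model would replace these brittle hand-written patterns with something far more capable, but the output, contextual labels attached to the content, has the same shape. The patterns, phrases and labels here are invented for illustration.

```python
import re

def extract_metadata(text: str) -> dict:
    """Rule-based baseline for automated metadata extraction.
    An AI model would handle far messier cases, but the goal is
    the same: labels that give the content machine-usable context."""
    # Match dates like "March 12, 1987" (illustrative pattern only).
    date = re.search(
        r"(January|February|March|April|May|June|July|August|"
        r"September|October|November|December)\s+\d{1,2},\s+\d{4}",
        text,
    )
    # Guess a document type from telltale phrases.
    doc_type = "unknown"
    for phrase, label in [("memorandum", "memo"),
                          ("press release", "press release"),
                          ("minutes", "meeting minutes")]:
        if phrase in text.lower():
            doc_type = label
            break
    return {"date": date.group(0) if date else None, "doc_type": doc_type}

print(extract_metadata("PRESS RELEASE\nMarch 12, 1987\nAcme Corp today..."))
# → {'date': 'March 12, 1987', 'doc_type': 'press release'}
```

In practice the time savings come from running this kind of extraction across an entire digitized collection rather than keying metadata by hand, with an archivist reviewing and correcting the results.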
Jason Dressel 22:01
So, for folks listening who are, you know, in organizations working in the space of being accountable for archives or historical materials, and maybe for many companies we know there may be more of an informal existence of an archives program, but if you’re, for instance, in corporate communications or marketing or another group where you have some role or accountability with respect to archives and management of these materials, what would you recommend in terms of, you know, how to better access those materials and, again, the intersection of how that’s relevant to what companies are doing with respect to AI?
Chris Juhasz 22:54
So, I mean, the first thing is getting a handle on it. I mean, consolidation of the information is the first step, no matter what the reason is. If material is not centralized and in one place, it will be very hard to get it to a state where, you know, it can become part of the data being used to teach an AI system. So starting first with getting everything in one place and consolidating is an important consideration. And beyond that, it isn’t necessary to jump right into a lot of time-intensive and expensive metadata capture. You can begin by simply applying something that we would call an ontology or taxonomy to organize the content, to get the baseline amount of metadata that’s necessary for an AI system to contextualize and understand the relationships between objects and, you know, pieces of information. So centralization and basic organization are still kind of the foundational elements, and it’s definitely worth starting down the path toward those goals sooner rather than later. And then beyond that, beginning to think about how to invest dollars smartly in converting some of that information to digital. And that does require, you know, some of the analytical expertise of a person or an organization that works with historical materials, to be able to, you know, know where the most benefit is going to be derived, which materials are going to give you the most sort of, you know, value for dollar.
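To make the baseline-taxonomy idea concrete, here is a minimal sketch of what applying a simple taxonomy to consolidated material might look like. The category and series names are invented for illustration; a real corporate taxonomy would be organization-specific and much larger.

```python
# A small, invented taxonomy: top-level categories mapped to the
# record series that roll up under them. This is the baseline
# organization that lets an AI system relate pieces of information
# to each other before any detailed metadata capture happens.
taxonomy = {
    "Governance": ["board minutes", "bylaws", "annual reports"],
    "Communications": ["press releases", "newsletters", "speeches"],
    "Finance": ["financial statements", "audit reports"],
}

def categorize(series: str) -> str:
    """Return the top-level category for a record series,
    or 'Uncategorized' if the taxonomy doesn't cover it yet."""
    for category, members in taxonomy.items():
        if series.lower() in members:
            return category
    return "Uncategorized"

print(categorize("Press Releases"))  # → Communications
```

Even this coarse labeling gives every item one contextual access point, which can later be enriched with fuller metadata as budget allows.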
Jason Dressel 24:47
Yeah, but to your point, it’s opening up the playing field with respect to, you know, where funding for archives programs can potentially come from in enterprises. Because suddenly, in the context of AI and data, you know, I think the point you’ve been making is there’s this very large opportunity, especially for, you know, older legacy companies, meaning companies over 40 years old. You know, as organizations continue to need data, and even now, you know, creating synthetic data to continue to feed these AI beasts, it may really open up, you know, the resources in terms of the places where archives and digitization can be funded, which is also pretty exciting.
Chris Juhasz 25:37
Yeah, absolutely. I think there’s going to be, I mean, you said it, that the scarcity of data is going to become a problem. At least most of the things that I’m reading are suggesting that these models will begin to collapse under their own weight, and it’ll be the same scenario for these private implementations for organizations: if they don’t have a robust enough data set to learn from, they will ultimately decline in performance and their effectiveness will degrade over time. So I think there’s going to be a movement toward unlocking, you know, this information that is in physical formats, and making more of an investment in efforts to feed these models with good quality, you know, reliable data sources. And that’s going to require digitization. I think that getting to digitization on the scale that’s going to be necessary is going to require some collaboration and some partnership, potentially with outside sources. So I think that all of these things are probably going to materialize in the near term.
Jason Dressel 26:49
Awesome. Well, Chris, thank you so much. Keep up the good work, and we’ll see you soon.
Chris Juhasz 26:55
Yeah, thanks very much. Appreciate it.
Jason Dressel 27:02
Thanks again to Chris Juhasz. Thanks again to all of you for listening to The History Factory Podcast. If you’re interested in learning more about archives or any of the topics that we cover here on the podcast, drop us a line at [email protected] or spend more time on our website, historyfactory.com. I’m Jason Dressel. Thanks again, and be well.