How Data Teams Are Making Unstructured Data Usable for AI

Technology news headlines are dominated by AI companies, hardware vendors, and cloud providers racing to build faster processors, larger LLMs, and new automated agentic AI services for every function and industry. Meanwhile, inside the enterprise, data storage managers and data engineers are working on a more fundamental problem: making unstructured data usable in lakehouses so AI pipelines have something worth running on.

The scale of the problem is staggering. Gartner estimates that 80% of enterprise data is unstructured. IDC puts the figure at 80 to 90% and projects that it will grow three times faster than structured data. Yet most of it remains dark to AI because it lacks the consistent schema that analytics and AI tools require to understand content and context.

Ingesting raw, unfiltered unstructured data at scale is prohibitively expensive and slow. Moving a single petabyte can take weeks or months. Most large enterprises manage 10 petabytes or more, scattered across multiple sites, storage and cloud silos.
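A rough back-of-the-envelope calculation shows why a petabyte can take weeks or months to move. The sketch below assumes decimal units (1 PB = 10^15 bytes) and a sustained link efficiency of 70%. The link speeds and the efficiency figure are illustrative assumptions, not measurements from any particular environment.

```python
SECONDS_PER_DAY = 86_400
PETABYTE_BYTES = 10**15  # decimal petabyte (assumption: not the binary pebibyte)


def transfer_days(size_bytes: int, link_gbps: float, efficiency: float = 0.7) -> float:
    """Days needed to move size_bytes over a link at a sustained efficiency.

    efficiency is an assumed fraction of the nominal line rate that is
    actually achieved after protocol overhead and contention.
    """
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return (size_bytes * 8) / usable_bits_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    for gbps in (1, 10):
        days = transfer_days(PETABYTE_BYTES, gbps)
        print(f"{gbps} Gbps: {days:.1f} days per PB")
```

Under these assumptions, one petabyte takes roughly 132 days on a 1 Gbps link and about 13 days on a 10 Gbps link, before accounting for retries, throttling, or small-file overhead.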

More importantly, ingesting raw data leaves the gnarly problem of rationalizing it into a schema. Bridging structured, semi-structured, and unstructured data cost-effectively and accurately for use in data lakehouse platforms like Databricks and Snowflake remains a persistent barrier to delivering ROI on analytics and AI.

The Substantial, Unknown Opportunity for Customers, R&D and Operations

Before examining the methods, it is worth illustrating what becomes possible when structured and unstructured data are successfully merged in lakehouse and AI environments.

Customer relationship predictions. Merging a customer’s purchasing history with unstructured data from their emails and social media interactions to predict churn or future needs.

Operational intelligence. Combining automated IoT sensor logs with written maintenance notes or video footage from a production line to flag operational bottlenecks before they escalate.

Advanced financial forecasting. Enhancing traditional predictive revenue models by appending market sentiment scores extracted from unstructured sources like earnings call transcripts.

Financial services compliance. A data governance lead can build a unified data estate map by querying unstructured content across NAS, S3, and object stores and joining it with structured catalog data from Snowflake, Databricks and other data platforms. The result is a single view of where sensitive data lives, who owns it, and how it flows across systems.

Healthcare AI. A machine learning engineer at a healthcare provider can curate a high-quality dataset for fine-tuning a radiology model by querying DICOM files and associated reports, then joining with structured patient cohort data from the EHR to scope the right subset for RAG pipeline ingestion.

Raw, Unclassified, Expensive: The Unstructured Data Ingestion Challenge

Existing ingestion techniques were built primarily for structured and semi-structured data such as JSON and CSV. Applied to unstructured data, they share a common limitation: they copy raw data in bulk, without classification or segmentation. This creates cost and quality problems at scale when it comes to ingesting the right data into the data lakehouse. Here are a few common techniques and where each falls short:

File-based batch ingestion from cloud storage. Files are staged in cloud object storage (S3, Azure Blob, GCS) and loaded into the platform on a scheduled or incremental basis. Data lands as raw strings, binary blobs, or semi-structured formats and is then refined through processing layers. It is straightforward but moves everything, regardless of relevance or quality.

External table references and in-place querying. Files remain in cloud storage and are queried via external table references from Databricks, Snowflake, or similar platforms. This avoids data duplication and is practical for large unstructured stores where ingestion overhead and storage costs are a concern. It is limited to cloud-native storage environments and still requires users to do the difficult work of preprocessing to provide structured descriptions of unstructured data before lakehouse tools can act on it.

Real-time streaming ingestion. Tools like Apache Kafka and Amazon Kinesis support event-driven and log data ingestion in near real time. However, streaming is not well suited to traditional unstructured file types such as documents, images, video, or audio. It addresses a specific subset of the unstructured data problem.

Each of these approaches can ingest raw unstructured data and pass the processing work downstream into the lakehouse. The challenge is that downstream processing is expensive, time-consuming, and produces inconsistent results when the data arriving has no classification, no enriched metadata, and no quality filtering.

The Core Challenges Data Teams Face

These limitations are especially acute for enterprises managing large unstructured data storage footprints, because bulk ingestion techniques copy all raw data from the source regardless of its value or relevance. This adds time and hassle for data engineers and analysts who need to focus on creating data models and pipelines for analytics projects.

A looming issue is that while unstructured data is diverse and contains immense potential value, it has no schema, which is required for analytics. That means there is no context for search, query, analysis, or governance. In most enterprises, 50 to 90% of unstructured data consists of duplicates, rarely accessed files, and non-authoritative copies created for backups or project work. Most of this file and object data has not been enriched with metadata that would help qualify it for projects.
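One way to see how much of a file share is duplicated is to group files by a hash of their contents. The following is a minimal sketch using only the Python standard library. The function names `file_digest` and `find_duplicates` are hypothetical, and it only detects byte-for-byte identical files on a locally mounted path; near-duplicates and non-authoritative copies that differ slightly would need other techniques.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file's contents, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    # Cheap first pass: files with a unique size cannot be duplicates.
    by_size = defaultdict(list)
    for p in root.rglob("*"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)

    # Second pass: hash only the files that share a size with another file.
    by_digest = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) > 1:
            for p in same_size:
                by_digest[file_digest(p)].append(p)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Grouping by size before hashing avoids reading files that cannot possibly match, which matters when the scan itself runs against slow or remote storage.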

Ingesting petabytes from multi-vendor storage environments, particularly NAS, is the other major obstacle because it is operationally complex. It can take months to complete and generates significant ongoing compute and storage costs inside the lakehouse, much of it for data that should never have been ingested in the first place.

The result is AI pipelines fed with low-quality inputs that clog up tools and storage. Poor data quality erodes model performance and undermines the business case for AI investment.

Emerging Approaches to Deliver Unstructured Data to the Lakehouse

The limitations of bulk ingestion have driven development of new architectural approaches that prioritize metadata and classification over raw data movement.

Metadata-first classification and curated ingestion. Rather than moving files, new approaches first classify, extract and enrich metadata without moving any raw data. Files are cataloged with system metadata (size, format, location, timestamps), content metadata (extracted text, sensitive data classifications), and custom tags (DICOM headers or image metadata such as copyright, artist and GPS coordinates).

A structured, query-ready representation of the data estate is built without requiring physical data movement. Data teams query the metadata layer and retrieve only the subset of data that meets their criteria for a specific use case. This dramatically reduces ingestion volume and costs and improves data quality before AI pipelines ever see the data.
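The metadata-first idea can be sketched in a few lines: walk the storage, record only system metadata, and select a subset from the records without opening or copying any file. This is an illustrative sketch, not any vendor's implementation. `FileRecord`, `build_catalog`, and `select_subset` are hypothetical names, and it assumes the storage is reachable as a local or mounted path.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class FileRecord:
    """System metadata for one file; the file's contents are never read."""
    location: str
    format: str
    size_bytes: int
    modified: datetime


def build_catalog(root: Path) -> list[FileRecord]:
    """Catalog every file under root using only stat() information."""
    records = []
    for p in sorted(root.rglob("*")):
        if p.is_file():
            st = p.stat()
            records.append(FileRecord(
                location=str(p),
                format=p.suffix.lower().lstrip("."),
                size_bytes=st.st_size,
                modified=datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
            ))
    return records


def select_subset(records, formats, modified_after):
    """Return only the records that match a use case's criteria."""
    return [
        r for r in records
        if r.format in formats and r.modified >= modified_after
    ]
```

Only the locations returned by `select_subset` would then be handed to an ingestion job, so the volume moved is set by the query rather than by the size of the share.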

Open table formats as a metadata bridge to data lakehouses. Apache Iceberg has emerged as an industry standard for expressing structured, versioned, query-ready metadata over data stored in object stores. Iceberg provides ACID transactions, schema evolution, and efficient scan planning through a metadata tree that allows query engines to identify exactly which files to read without scanning entire directories. Platforms like Databricks, Snowflake, Microsoft Fabric, Google BigLake, and Amazon Redshift all support Iceberg, making it a practical interoperability layer between storage and analytics. Unstructured data catalogs represented as Iceberg tables of metadata can be queried directly in these platforms using standard SQL, without loading bulk copies of the data.
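To illustrate the kind of SQL this enables, the sketch below joins a file metadata table with a structured cohort table, echoing the healthcare example earlier. It uses an in-memory SQLite database purely as a stand-in so the example runs anywhere; it does not use Iceberg or a lakehouse engine. The table names, columns, paths, and rows are all hypothetical.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create two tiny tables: file metadata and structured cohort data."""
    conn = sqlite3.connect(":memory:")  # stand-in for a lakehouse SQL engine
    conn.executescript(
        """
        CREATE TABLE file_catalog (path TEXT, format TEXT, patient_id TEXT);
        CREATE TABLE patient_cohort (patient_id TEXT, cohort TEXT);
        INSERT INTO file_catalog VALUES
            ('/nas/rad/p1_scan.dcm', 'dcm', 'p1'),
            ('/nas/rad/p2_scan.dcm', 'dcm', 'p2'),
            ('/nas/rad/p1_report.pdf', 'pdf', 'p1');
        INSERT INTO patient_cohort VALUES ('p1', 'study_a'), ('p2', 'study_b');
        """
    )
    return conn


def curated_paths(conn: sqlite3.Connection, file_format: str, cohort: str) -> list[str]:
    """Paths of files in one format that belong to patients in one cohort."""
    query = """
        SELECT f.path
        FROM file_catalog AS f
        JOIN patient_cohort AS c ON c.patient_id = f.patient_id
        WHERE f.format = ? AND c.cohort = ?
        ORDER BY f.path
    """
    return [row[0] for row in conn.execute(query, (file_format, cohort))]


if __name__ == "__main__":
    print(curated_paths(build_demo_db(), "dcm", "study_a"))
```

The query touches only metadata rows. The files themselves stay where they are until a downstream job fetches the returned paths.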

Selective, governed data delivery. Rather than ingesting everything and filtering later, emerging approaches let IT and governance teams define which subsets of the data estate are exposed to analytics and AI tools. Specific dataset subsets, defined by file type, classification, sensitivity, or business unit, are made available to data engineering and AI teams while preserving access controls and governance policies.
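A governed exposure rule of this kind can be expressed as a small policy object that is evaluated against catalog entries. The sketch below is hypothetical throughout: the sensitivity ladder, the `ExposurePolicy` fields, and the entry attributes are assumptions chosen to mirror the criteria named above (file type, sensitivity, business unit), not a description of any product's policy model.

```python
from dataclasses import dataclass

# Hypothetical sensitivity ladder, ordered from least to most sensitive.
SENSITIVITY_RANK = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


@dataclass(frozen=True)
class CatalogEntry:
    path: str
    file_type: str
    sensitivity: str
    business_unit: str


@dataclass(frozen=True)
class ExposurePolicy:
    """What one consuming team is allowed to see."""
    file_types: frozenset
    max_sensitivity: str
    business_units: frozenset

    def allows(self, entry: CatalogEntry) -> bool:
        """True if the entry satisfies every condition of the policy."""
        return (
            entry.file_type in self.file_types
            and entry.business_unit in self.business_units
            and SENSITIVITY_RANK[entry.sensitivity]
            <= SENSITIVITY_RANK[self.max_sensitivity]
        )


def exposed_subset(entries, policy: ExposurePolicy) -> list:
    """Filter catalog entries down to what the policy exposes."""
    return [e for e in entries if policy.allows(e)]
```

Because the policy is applied to metadata before delivery, data that fails the rule never reaches the analytics or AI environment at all.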

AI-assisted metadata extraction. Transformer-based models and LLMs are increasingly used to extract semantic metadata from unstructured content at scale, including content type classification, entity recognition, and contextual tagging. This makes it practical to enrich large file repositories automatically, building the metadata layer that makes downstream query and curation reliable.

Conclusion: Classify First, Deliver Precisely

Together, these approaches shift the architecture from “ingest everything, process later” to “classify first, extract consistent schema for unstructured data, deliver precisely.” For enterprises managing petabyte-scale unstructured data estates, that shift has a positive impact on AI pipeline quality, the ability to unify analytics on structured and unstructured data, and infrastructure cost. As unstructured data takes center stage in fueling enterprise AI, storage teams have an elevated role to play by bringing structured, high-quality, governed access to unstructured data for data engineering and AI teams.

Krishna Subramanian

Krishna Subramanian is the co-founder of Komprise.