Apache Iceberg V3: Moving the Ecosystem Toward Unification

Apache Iceberg V3, now approved by the Apache Iceberg community, introduces advanced new features and data types. Iceberg V3 includes major improvements such as deletion vectors, row lineage, and new types for semi-structured data and geospatial use cases. These features let customers process and query data efficiently. In addition, these improvements are aligned across Delta Lake, Apache Parquet and Apache Spark, so customers can interoperate between Delta Lake and Apache Iceberg without rewriting data or row-level delete files.

In this blog post we cover the latest developments in Iceberg V3:

  • Deletion vectors
  • Row lineage
  • Semi-structured data and geospatial types
  • Interoperability across Delta Lake, Apache Parquet and Apache Spark

Deletion vectors

Iceberg V3 introduces a new format for row-level deletes that improves read performance: deletion vectors. Row-level deletes reduce write amplification by optimizing how deleted rows are stored and tracked, which makes them well suited to fast ETL and ingestion. In Iceberg V2, engines were not required to compact delete files during writes. The intent was for customers to run asynchronous maintenance. However, many customers did not schedule maintenance jobs, so their tables accumulated too many uncompacted delete files. This led to slow reads, because engines had to merge many files at read time.

Iceberg V3 introduces a new deletion vector format and new compaction requirements for delete files. The new format avoids translating between Parquet files and the in-memory representations used to apply deletes. In addition, engines must maintain a single deletion vector per data file at write time. This requirement improves read performance and data file statistics. It also makes it easier to compare previous and current deletes, which simplifies processing a table's row-level changes as a stream.
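The single-vector-per-file idea can be illustrated with a minimal Python sketch. This is a toy model, not the Iceberg V3 format: the `DataFile` class, its field names, and the helper functions are hypothetical, and a plain Python set stands in for the compact bitmap a real engine would use.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Toy data file: rows plus ONE deletion vector (hypothetical model)."""
    path: str
    rows: list
    # Positions of deleted rows. A plain set stands in for the compact
    # bitmap a real engine would use.
    deletion_vector: set = field(default_factory=set)


def delete_positions(data_file, positions):
    """Fold new deletes into the file's single vector at write time."""
    valid = {p for p in positions if 0 <= p < len(data_file.rows)}
    previous = set(data_file.deletion_vector)
    data_file.deletion_vector |= valid
    # Rows deleted by this write only: current vector minus previous vector.
    return sorted(data_file.deletion_vector - previous)


def read_live_rows(data_file):
    """Readers apply one vector per file; there are no delete files to merge."""
    return [row for pos, row in enumerate(data_file.rows)
            if pos not in data_file.deletion_vector]


f = DataFile("data-0001.parquet", ["a", "b", "c", "d", "e"])
first = delete_positions(f, [1, 3])   # first write deletes positions 1 and 3
second = delete_positions(f, [3, 4])  # second write overlaps on position 3
print(first, second, read_live_rows(f))
```

Because each write merges into the existing vector, a reader only ever consults one structure per data file, and the difference between two versions of the vector is exactly the set of rows deleted in between.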

Row lineage

Another major feature of Iceberg V3 is row lineage, which simplifies incremental processing. With row lineage, engines find row-level changes by matching row versions across commits.

Iceberg V3 implements row lineage with row-level metadata: a row ID and the sequence number at which the row was last modified or added. Row IDs identify the same row across versions. Sequence numbers record when a row last changed, not merely when it was moved between files. This allows engines to process changes selectively and simplifies downstream updates, making workflows faster and cheaper.

Row ID information is particularly beneficial when combined with incremental processing features such as materialized views. These features are optimized to compute only the data that is new or changed since the last processing cycle.
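As a rough illustration of how row IDs and sequence numbers support incremental processing, here is a minimal Python sketch. The `Row` type and its field names (`row_id`, `last_updated_seq`) are hypothetical stand-ins for the row-level metadata, and snapshots are modeled as plain lists rather than real table files.

```python
from typing import NamedTuple


class Row(NamedTuple):
    row_id: int            # hypothetical name: stable identity across versions
    last_updated_seq: int  # hypothetical name: sequence number of last change
    value: str


def diff_snapshots(old, new):
    """Classify row-level changes between two snapshots by matching row IDs."""
    old_by_id = {r.row_id: r for r in old}
    new_by_id = {r.row_id: r for r in new}
    inserted = sorted(new_by_id.keys() - old_by_id.keys())
    deleted = sorted(old_by_id.keys() - new_by_id.keys())
    # A row counts as updated only if its sequence number advanced.
    updated = sorted(
        rid for rid in old_by_id.keys() & new_by_id.keys()
        if new_by_id[rid].last_updated_seq > old_by_id[rid].last_updated_seq
    )
    return {"inserted": inserted, "updated": updated, "deleted": deleted}


snapshot_1 = [Row(0, 1, "alice"), Row(1, 1, "bob"), Row(2, 1, "carol")]
# Commit 2: row 1 is updated, row 2 is deleted, row 3 is added. Row 0 was
# only rewritten into another file, so its sequence number is unchanged.
snapshot_2 = [Row(0, 1, "alice"), Row(1, 2, "bobby"), Row(3, 2, "dave")]
changes = diff_snapshots(snapshot_1, snapshot_2)
print(changes)
```

Row 0 does not appear in any change list even though it was physically moved, which is the property that lets a downstream consumer such as a materialized view skip it.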

Semi-structured data and geospatial types

Iceberg V3 also adds new data types for semi-structured data and geospatial data.

Semi-structured data is difficult to handle because it has varying schemas that do not fit into structured table columns. One solution is to extract individual fields from the data into a structured format. However, this creates extremely wide tables with many columns and null values caused by inconsistent schemas. Another alternative is to store JSON in string columns. This is costly and results in poor read performance, because engines must parse the strings to extract data from them. Without semi-structured data types, engines cannot push filters down, so they have to read every row in every data file. Iceberg V3 introduces VARIANT to represent semi-structured data efficiently. VARIANT encodes the structure of the data to improve performance while retaining schema flexibility.
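The cost of the JSON-in-strings approach can be seen in a small Python sketch. This is not the VARIANT encoding; it is a toy model, with hypothetical data and function names, that assumes an engine able to see a field's per-file maximum can skip files that cannot match a filter.

```python
import json

# Two toy "data files", each a list of JSON strings (hypothetical data).
files = [
    ['{"user": "a", "latency_ms": 12}', '{"user": "b", "latency_ms": 40}'],
    ['{"user": "c", "latency_ms": 950}', '{"user": "d", "region": "eu"}'],
]


def scan_strings(files, field, threshold):
    """String column: every row in every file must be parsed."""
    hits, parsed = [], 0
    for rows in files:
        for raw in rows:
            doc = json.loads(raw)
            parsed += 1
            if field in doc and doc[field] > threshold:
                hits.append(doc["user"])
    return hits, parsed


def field_max(rows, field):
    """Computed once at write time: max of a field within a file, or None."""
    values = [doc[field] for doc in map(json.loads, rows) if field in doc]
    return max(values) if values else None


def scan_with_stats(files, maxima, field, threshold):
    """Structure-aware scan: skip files whose maximum cannot pass the filter."""
    hits, parsed = [], 0
    for rows, file_max in zip(files, maxima):
        if file_max is None or file_max <= threshold:
            continue  # no row in this file can match; skip it unread
        for raw in rows:
            doc = json.loads(raw)
            parsed += 1
            if field in doc and doc[field] > threshold:
                hits.append(doc["user"])
    return hits, parsed


maxima = [field_max(rows, "latency_ms") for rows in files]
slow_hits, slow_parsed = scan_strings(files, "latency_ms", 500)
fast_hits, fast_parsed = scan_with_stats(files, maxima, "latency_ms", 500)
print(slow_hits, slow_parsed, fast_hits, fast_parsed)
```

Both scans return the same rows, but the second one parses only the file that can contain a match. Documents with different fields (such as the extra `region` key) still fit without a schema change.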

Similarly, geospatial data (information tied to locations on the Earth's surface, such as roads, parks or city boundaries) is difficult to work with and query efficiently. Without geospatial types, customers had to store location data in binary columns. However, this representation did not support geographic search, because binary columns cannot be filtered to find objects within an area. Iceberg V3 solves this problem by introducing new geometry and geography data types. Geometry types are designed for planar spatial data, while geography types are designed for global data that accounts for the curvature of the Earth. With these types, customers can easily find data using bounding boxes that delimit geographic regions and search for geospatial objects efficiently.
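A bounding-box search of the kind described above can be sketched in a few lines of Python. The `BBox` type, the feature data, and the helper names are hypothetical, and the sketch assumes planar coordinates (the geometry case); geography data on a curved surface would need additional handling. A bounding-box test is only a coarse filter, so a real engine would typically follow it with an exact geometric check.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def bbox_of(points):
    """Smallest axis-aligned box containing all (x, y) points of a feature."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return BBox(min(xs), min(ys), max(xs), max(ys))


def intersects(a, b):
    """True if two boxes overlap or touch on both axes."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x
            and a.min_y <= b.max_y and b.min_y <= a.max_y)


# Hypothetical planar features, each given as a list of vertices.
features = {
    "park": [(1.0, 1.0), (2.0, 1.0), (2.0, 2.0), (1.0, 2.0)],
    "road": [(0.0, 5.0), (10.0, 5.5)],
    "city_limit": [(-3.0, -3.0), (4.0, -3.0), (4.0, 4.0), (-3.0, 4.0)],
}
boxes = {name: bbox_of(pts) for name, pts in features.items()}

query = BBox(0.0, 0.0, 3.0, 3.0)
candidates = sorted(n for n, box in boxes.items() if intersects(box, query))
print(candidates)
```

With an opaque binary column, none of this filtering is possible without decoding every value; with typed columns, the boxes can be compared before any geometry is read.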

Interoperability with Delta Lake, Apache Parquet and Apache Spark

The new features and data types in Iceberg V3 expand functionality and improve performance. They also matter because they advance interoperability between lakehouse formats.

Historically, customers were forced to choose between the two most popular lakehouse formats: Delta Lake and Apache Iceberg. This is because most platforms support only one format. Rewriting data can be expensive and impractical at scale, so the choice is a long-term one. The formats are very similar: both are metadata layers on top of Parquet data files that provide table semantics. However, small differences between the table formats cause compatibility problems for customers.

Iceberg V3 unifies the data layer across formats. Thanks to this unification, customers can interoperate between Delta Lake and Iceberg without having to rewrite data or delete files. This is because Iceberg V3 has compatible implementations across Delta Lake, Apache Parquet and Apache Spark:

  • Deletion vectors use the same binary encoding across table formats
  • Row lineage in Iceberg V3 is compatible with row tracking in Delta Lake
  • VARIANT and geospatial types are developed in the upstream Apache Parquet and Apache Spark™ communities and flow into both Apache Iceberg and Delta Lake

Because Iceberg V3 has compatible features across open-source projects, customers no longer have to choose a format. Instead, they can work freely across formats on a single copy of their data.

More information about Iceberg V3

Iceberg V3 moves the entire industry toward a more efficient, capable and interoperable world. We will bring Iceberg V3 to the Databricks Data Intelligence Platform and look forward to other vendors adopting Iceberg V3. Open source is a core value at Databricks, where we actively contribute features such as deletion vectors to Iceberg V3. To support a thriving open-source ecosystem, we support and encourage contributions to Apache Iceberg. For new contributors, we recommend starting with issues labeled "good first issue".

To learn how we plan to integrate Iceberg V3 features into our managed tables, and about the future of open table formats, register for Data + AI Summit, June 9-12, 2025.
