In 2020, Snowflake announced a major roadmap. Two years later, the multicloud data warehouse vendor has delivered almost all of the listed features. During Snowflake Summit 2022, it laid out plans covering at least the next two years.
First come performance optimizations, one of the vendor’s original promises. On AWS, Snowflake showed new instances based on Graviton3 processors, with performance said to be roughly 10% better than the Intel-based instances typically deployed in that cloud. More importantly, the vendor improved storage compression by 30%. Data ingestion latency was reduced by 50%, while replication latency reportedly dropped by 55%.
Then there is the need for Snowflake to offer better data consistency while ensuring a high level of performance across most data formats, a demand inherited from the relational world.
Apache Iceberg: better consistency for analytical processing
In response to this demand, Snowflake plans to natively support the Apache Iceberg table format, which will soon enter private preview. This open source technology, a competitor to Delta Lake (developed by Databricks, and a table format Snowflake already supports on its external tables), should facilitate analytical processing on large volumes of data, at the petabyte scale.
First, Apache Iceberg acts as a guarantor of data consistency. It allows multiple applications to rely on the same data while tracking changes to the files in a table, thus avoiding the corruption problems that plague many data stores. Designed as an alternative to Apache Hive, the open source project should provide better performance, schema evolution, point-in-time data retrieval, and ACID transactions on tables stored in customers’ Snowflake buckets.
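The consistency and point-in-time capabilities described above rest on immutable table snapshots: each commit produces a new snapshot, and readers query a specific one, so concurrent writers never corrupt what a reader sees. The toy Python model below illustrates that idea only; it is not the real Apache Iceberg API, and the class and method names are invented for illustration.

```python
# Toy model of snapshot-based table metadata, in the spirit of Apache Iceberg.
# Each commit creates a new immutable snapshot listing the table's data files;
# readers scan a specific snapshot, which enables point-in-time ("time travel") reads.
class ToySnapshotTable:
    def __init__(self):
        self._snapshots = [()]           # snapshot 0: empty table, stored as immutable tuples

    def commit_append(self, *files):
        """Create a new snapshot that adds data files to the latest one."""
        latest = self._snapshots[-1]
        self._snapshots.append(latest + tuple(files))
        return len(self._snapshots) - 1  # id of the new snapshot

    def scan(self, snapshot_id=None):
        """Read the table's file list as of a given snapshot (default: latest)."""
        if snapshot_id is None:
            snapshot_id = len(self._snapshots) - 1
        return list(self._snapshots[snapshot_id])

table = ToySnapshotTable()
s1 = table.commit_append("a.parquet")
s2 = table.commit_append("b.parquet")
print(table.scan())    # latest view: ['a.parquet', 'b.parquet']
print(table.scan(s1))  # time travel: ['a.parquet']
```

Because old snapshots are never mutated, a reader pinned to `s1` keeps a consistent view even while new commits land, which is the property that makes multi-engine access to the same table safe.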
Snowflake already offers external tables in the Iceberg format to perform transfers, ingest data from cloud systems, or apply in-place processing to data that cannot be moved into the cloud data warehouse. In addition, the vendor announced a private preview of the ability to store external tables on on-premises systems, initially with Dell and Pure Storage.
With native Apache Iceberg support, company spokespeople promise that all of the platform’s features (management, encryption, replication, compression, etc.) will be compatible with this type of table. Here, Snowflake’s engineers chose to combine the Parquet data format with Iceberg’s metadata and catalog. The vendor did not say whether its Iceberg tables will support ORC, Avro, JSON or other formats; for the record, the table format is agnostic to the format of the data it encapsulates. More importantly, Iceberg is compatible with a variety of data processing engines, including Dremio, Trino, Flink and Apache Spark.
Given the capabilities listed, Apache Iceberg is fertile ground for deploying a Data Mesh, a path the vendor intends to explore.
Importantly, Iceberg will make it possible to avoid Snowflake’s proprietary tables. “Some of our customers have told us they want a certain amount of data to be readable in open file formats,” said Christian Kleinerman, Senior Vice President of Product Management at Snowflake. “It gives us a path to interoperability. It’s very important to us,” he added.
Unistore and Hybrid Tables: translytics according to Snowflake
Snowflake now wants to support both analytical and transactional data processing, as MongoDB does, or as Google does with AlloyDB.
To do this, the vendor introduced a private preview of Unistore, a feature based on “Hybrid Tables”, the vehicles for its HTAP (hybrid transactional/analytical processing) capability. Concretely, Unistore relies on a row-oriented engine that hosts transactional processing on Hybrid Tables, while also making it possible to run analytical processing on that transactional data.
More importantly, this engine makes it possible to declare primary and foreign keys, helping prevent duplicate inserts. If the user has enabled the key system, the constraint mechanism raises an error when the data is already present in the data warehouse. In principle, this should, among other things, help rationalize data ingestion and transfers in order to avoid unwanted copies.
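The duplicate-rejection behavior described above can be pictured in a few lines. The sketch below is a plain-Python illustration of what an enforced primary-key constraint does, not Snowflake code; the class name and its API are invented.

```python
# Toy illustration of an enforced primary-key constraint, as described for Hybrid
# Tables: an insert whose key already exists is rejected instead of creating a
# duplicate row, so repeated ingestion of the same record cannot produce copies.
class ToyHybridTable:
    def __init__(self, pk_column):
        self.pk_column = pk_column
        self.rows = {}                   # primary key -> row

    def insert(self, row):
        key = row[self.pk_column]
        if key in self.rows:
            raise ValueError(f"duplicate primary key: {key!r}")
        self.rows[key] = row

orders = ToyHybridTable(pk_column="order_id")
orders.insert({"order_id": 1, "amount": 40})
try:
    orders.insert({"order_id": 1, "amount": 99})  # rejected: key 1 already present
except ValueError as err:
    print(err)   # duplicate primary key: 1
```

The point of enforcement (as opposed to the informational key declarations of classic analytical tables) is exactly this error path: the warehouse refuses the write rather than silently accumulating duplicates.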
“We have several customers, including Novartis, UiPath, IQVIA, Crane and Adobe, who have tested this feature. The feedback has been quite positive,” Christian Kleinerman assured.
Maintained data consistency, a translytical approach: Snowflake seems to hold the cards to offer its customers a true multicloud data lake capable of supporting most workloads.
However, some patience is still required. Unstructured data support has only been generally available since April. Rather than supporting specific data formats or types, the vendor handles files from object stores (Azure Blob Storage, Amazon S3, and Google Cloud Storage) and URLs.
Better data management, tools to control costs
While some customers are waiting for this type of functionality, others are more interested in data management and cost control capabilities. A UI dedicated to data management will “soon” enter private preview, as will a column-level lineage mechanism. As promised by the vendor, a tag-based data masking system will shortly be available in public preview. Data classification, meanwhile, is already generally available.
Regarding the FinOps approach, Snowflake intends to introduce a feature called “Resource Groups”. With it, compute and storage resources can be linked to tables or data objects in order to track their cost.
Some users have been waiting a year for the replication features Snowflake promised. Client Redirect will soon enter general availability. Through a secure connection URL, Client Redirect allows failover to another region of the same cloud, or to another cloud. Ideally, processing is only briefly interrupted during an incident that takes an instance down.
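Conceptually, Client Redirect gives applications a stable connection URL that can be repointed at a replica in another region or cloud. The Python sketch below illustrates that failover logic only; the class, the account identifiers and the `resolve` helper are invented for illustration and are not Snowflake’s API.

```python
# Toy sketch of connection-URL failover: clients connect through a stable alias
# that resolves to the primary deployment, and the alias is repointed to a
# replica (another region or cloud) when the primary goes down.
class ToyRedirectUrl:
    def __init__(self, alias, primary, secondary):
        self.alias = alias
        self.targets = {"primary": primary, "secondary": secondary}
        self.active = "primary"

    def failover(self):
        """Repoint the alias at the other deployment."""
        self.active = "secondary" if self.active == "primary" else "primary"

    def resolve(self):
        """What clients connecting to the stable alias actually reach."""
        return self.targets[self.active]

url = ToyRedirectUrl("acme.example-connection-url",
                     primary="acme-aws-eu-west-1",
                     secondary="acme-azure-west-europe")
print(url.resolve())   # acme-aws-eu-west-1
url.failover()         # incident on the primary: repoint the alias
print(url.resolve())   # acme-azure-west-europe
```

Because applications only ever hold the alias, the switch happens without reconfiguring clients, which is what keeps the interruption brief.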
In the same vein, account replication, soon in public preview, should extend the same mechanism to user accounts. “With this feature, you’re not just copying data, you’re replicating all sorts of metadata about account users, roles, repositories, and all the definitions surrounding the data,” said Christian Kleinerman.
Also in private preview, the vendor introduced pipeline replication. This option will be particularly useful when Snowflake launches its streaming pipelines. Indeed, the vendor announced a private preview of Snowpipe Streaming, a system that should enable data ingestion via micro-batches from serverless environments. Here, Snowflake has reworked its use of Kafka Connect to improve its data ingestion capabilities.
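Micro-batch ingestion of the kind described above can be pictured as a small buffer that flushes to the destination table once it reaches a size threshold, trading a little latency for far fewer write operations. The sketch below is an invented Python illustration of the pattern, not the Snowpipe Streaming API.

```python
# Toy micro-batch ingester: incoming rows accumulate in a buffer and are flushed
# to the destination as one batch when the buffer reaches a threshold.
class ToyMicroBatchIngester:
    def __init__(self, batch_size, sink):
        self.batch_size = batch_size
        self.sink = sink                 # callable receiving a list of rows
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Push whatever is buffered as one batch (no-op when empty)."""
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()

batches = []
ingester = ToyMicroBatchIngester(batch_size=3, sink=batches.append)
for i in range(7):
    ingester.write({"id": i})
ingester.flush()                  # push the incomplete final batch
print([len(b) for b in batches])  # [3, 3, 1]
```

A real streaming ingester would typically also flush on a time interval so that a slow trickle of rows does not sit in the buffer indefinitely; the size threshold alone keeps the sketch short.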
In the same vein, Materialized Views constitute a packaged pipeline system. This should make it possible to prepare materialized results in a declarative manner and perform incremental updates.
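“Incremental updates” here means maintaining the materialized result by folding in only the newly arrived rows, rather than recomputing it from the full base table on every refresh. A minimal Python sketch of that idea follows; the names are invented and this is not Snowflake’s implementation.

```python
# Toy incrementally-maintained materialized aggregate: per-key sums are updated
# from newly arrived rows only, instead of rescanning the whole base table.
class ToyMaterializedSum:
    def __init__(self):
        self.totals = {}          # materialized result: key -> running sum

    def apply_delta(self, new_rows):
        """Fold only the new (key, value) rows into the existing result."""
        for key, value in new_rows:
            self.totals[key] = self.totals.get(key, 0) + value

mv = ToyMaterializedSum()
mv.apply_delta([("eu", 10), ("us", 5)])
mv.apply_delta([("eu", 2)])       # only the delta is processed
print(mv.totals)                  # {'eu': 12, 'us': 5}
```

The incremental approach works cleanly for aggregates like sums and counts where a delta can be folded in directly; the appeal of a packaged, declarative system is that the engine decides when such maintenance is possible and schedules it for you.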