#rdd-api

Django

3 days ago

Snowflake Supports Directory Imports

Easier package imports into Snowflake functions and procedures from stage directories and SnowGit directories streamline development and deployment.

from The Register

Artificial intelligence

Snowflake's ongoing pitch: bring AI to data, not vice versa

Snowflake is enhancing its platform for AI integration through strategic partnerships and acquisitions, focusing on customer ROI and data management efficiency.

Timesliced reservoir sampling: a new(?) algorithm for profilers

Random sampling from an unknown-length event stream can effectively identify relevant information without storing all data.

Information security

Databricks pitches Lakewatch as a cheaper SIEM - but is it really?

Translating benefits into buy-in from CIOs and CISOs may be challenging for Databricks despite its intent and acquisitions.

Venture

Databricks secures $1.8 billion in funding

from New Relic

Business intelligence

Optimize Databricks: Full Visibility with New Relic

Scala

I Thought Scala Was Vibe Coding

Scala emphasizes immutability, expression-oriented programming, powerful pattern matching, and Option-based null safety for more concise, safer, and more composable JVM code.

Scala

6 days ago

Data Extraction and Classification Using Structural Pattern Matching in Scala

Scala pattern matching enhances code readability and extensibility in real-world data engineering use cases.

Inside Netflix's Graph Abstraction: Handling 650TB of Graph Data in Milliseconds Globally

Netflix engineers developed Graph Abstraction to manage large-scale graph data in real time, enabling fast queries and supporting various internal services.

Science

from Nature

Drowning in data sets? Here's how to cut them down to size

The Square Kilometre Array Observatory will generate massive data, but storage and retention pose significant challenges for researchers.

Java

Spark Internals: Understanding Tungsten (Part 1)

Apache Spark revolutionized big data processing but faces challenges due to JVM memory management and garbage collection issues.

Java

Spark Internals: Understanding Tungsten (Part 2)

Catalyst Optimizer and Tungsten work together in Apache Spark to optimize data execution and manage raw binary data.

Migrating to the Lakehouse Without the Big Bang: An Incremental Approach

Query federation enables safe, incremental lakehouse migration by allowing simultaneous queries across legacy warehouses and new lakehouse systems without risky big bang cutover approaches.

Information security

Databricks launches Lakewatch: agentic SIEM on the Lakehouse

Lakewatch is an open SIEM platform that consolidates security, IT, and business data, enabling rapid threat detection and response using AI agents.

Uber Launches IngestionNext: Streaming-First Data Lake Cuts Latency and Compute by 25%

Uber's IngestionNext platform shifts to a streaming-first system, reducing data ingestion latency from hours to minutes for analytics and machine learning.

Snowflake's new 'autonomous' AI layer aims to do the work, not just answer questions

Project SnowWork is Snowflake's autonomous AI layer that automates data analysis tasks like forecasting, churn analysis, and report generation without requiring data team intervention.

Migrating from Apache Airflow v2 to v3

Airflow 3 represents a clear architectural direction for the project: API-driven execution, better isolation, data-aware scheduling and a platform designed for modern scale. While Airflow 2.x is still widely used, it is clearly moving toward long-term maintenance (end-of-life April 2026) with most innovation and architectural investment happening in the 3.x line.

Software development

Building Consistent Data Foundations at Scale

Building consistent data foundations through intentional architecture, engineering, and governance is essential to prevent fragmentation, support AI adoption, ensure regulatory compliance, and enable reliable organizational decisions at scale.

AWS Expands Aurora DSQL with Playground, New Tool Integrations, and Driver Connectors

Amazon Aurora DSQL introduces usability enhancements, including a browser-based playground and integrations with popular SQL tools for improved developer experience.

Databricks launches Genie Code to automate data science and engineering tasks

Databricks launched Genie Code, an AI agent that automates data science and engineering tasks within its lakehouse platform to accelerate ML workflows and enterprise data operations.

Unified Databricks Repository for Scala and Python Data Pipelines

Databricks repositories require structured setup with Gradle for multi-language support, dependency management, and version control to scale beyond manual notebook maintenance.

#ai-agent-evaluation

Business intelligence

Databricks buys Quotient AI to boost enterprise-grade AI agent performance

Databricks acquired Quotient AI to enable enterprises to deploy AI agents reliably in production with continuous evaluation, monitoring, and performance improvement capabilities.

Artificial intelligence

Databricks acquires Quotient AI in push for agent reliability

Databricks acquired Quotient AI to embed agent evaluation and reinforcement learning capabilities into its platform, addressing the critical challenge of maintaining reliable AI agents in production environments.

#scala-interview-preparation

Update your databases now to avoid data debt

Multiple major open source databases reach end-of-life in 2026, requiring teams to plan upgrades and migrations to avoid security risks and higher costs.

Data science

100 Scala Interview Questions and Answers for Data Engineers

Scala

100 Scala Interview Questions for Senior Developers

Python

from Treehouse Blog

Python for Data: A SQL + Pandas Mini-Project That Actually Prepares You for Real Work

Effective data analysis requires combining SQL and Python skills in integrated projects that mirror real-world workflows, not learning them in isolation.

#mariadb-acquisition

MariaDB taps GridGain to keep pace with AI-driven data demands

MariaDB's acquisition of GridGain aims to create an integrated platform combining relational database reliability with in-memory computing speed to compete with hyperscaler offerings.

DevOps

MariaDB acquires GridGain for agentic AI data

How Datadog Cut the Size of Its Agent Go Binaries by 77%

Datadog reduced its Agent binary from 1.22 GiB by auditing imports, using build tags, isolating optional code, and eliminating reflection pitfalls to remove unnecessary dependencies and compiler bloat.

Elastic Releases Version 9.3.0 With Enhanced AI Tools and OTel Support

Elastic 9.3.0 introduces AI workflow automation, 12x faster vector indexing via NVIDIA GPU acceleration, and OpenTelemetry integration for vendor-neutral observability across hybrid cloud environments.

Dataiku introduces platform for scalable enterprise AI

Dataiku launches Platform for AI Success with three new products designed to move AI initiatives from pilots to measurable business outcomes through unified orchestration across cloud providers.

Running Ray at Scale on AKS

Microsoft and Anyscale provide guidance for running managed Ray service on Azure Kubernetes Service, addressing GPU capacity limits, ML storage challenges, and credential expiry issues through multi-cluster, multi-region deployment strategies.

from ComputerWeekly.com

4 weeks ago

Edge AI: What's working and what isn't | Computer Weekly

Edge AI deployment success depends on identifying efficient, narrow use cases with manageable risks rather than pursuing sophisticated, large-scale models across all applications.

Why AI requires rethinking the storage-compute divide

AI workloads require continuous processing of unstructured multimodal data, causing redundant data movement and transformation that wastes infrastructure costs and data scientist time.

Netflix Automates RDS PostgreSQL to Aurora PostgreSQL Migration Across 400 Production Clusters

Netflix automated RDS to Aurora PostgreSQL migrations across 400 production clusters through infrastructure-level orchestration, eliminating manual intervention while maintaining data integrity and CDC pipeline correctness.

Databricks Introduces Lakebase, a PostgreSQL Database for AI Workloads

Databricks Lakebase is a serverless PostgreSQL OLTP database that separates compute from storage and unifies transactional and analytical capabilities.

Startup companies

Etleap Launches Iceberg Pipeline Platform to Simplify Enterprise Adoption of Apache Iceberg

Managed Iceberg pipeline platform unifies ingestion, transformation, orchestration, and table operations inside customers' VPCs, enabling enterprise Iceberg adoption without building custom stacks.

Web development

3 months ago

DuckDB's WebAssembly Client Allows Querying Iceberg Datasets in the Browser

DuckDB-Wasm enables browser-based, serverless end-to-end query, read, and write access to Iceberg REST catalogs and object storage without infrastructure setup.

#streamlit

Python

Integrating Streamlit with Snowflake for Live Cloud Data Apps (Part 1) - PyImageSearch

Python

Integrating Streamlit with Snowflake for Live Cloud Data Apps (Part 2) - PyImageSearch

350PB, Millions of Events, One System: Inside Uber's Cross-Region Data Lake and Disaster Recovery

Uber has built HiveSync, a sharded batch replication system that keeps Hive and HDFS data synchronized across multiple regions, handling millions of Hive events daily. HiveSync ensures cross-region data consistency, enables Uber's disaster recovery strategy, and eliminates inefficiency caused by the secondary region sitting idle, which previously incurred hardware costs equal to the primary, while still maintaining high availability. Built initially on the open-source Airbnb ReAir project, HiveSync has been extended with sharding, DAG-based orchestration, and a separation of control and data planes.

Tech industry

The Complete Guide to Optimizing Apache Spark Jobs: From Basics to Production-Ready Performance

Optimize Spark jobs by using lazy evaluation awareness, early filter and column pruning, partition pruning, and appropriate join strategies to minimize shuffles and I/O.

#spark

Software development

How I Fixed a Critical Spark Production Performance Issue (and Cut Runtime by 70%)

Data science

How I Fixed a Critical Spark Production Performance Issue (and Cut Runtime by 70%)

Sumo Logic launches data pipeline apps for Snowflake and Databricks

Snowflake offers a fully managed data platform, but Sumo Logic users often lack insight into performance, login activity, and operational health. The Sumo Logic Snowflake Logs App analyzes login and access activity to identify anomalies or suspicious behavior. It also optimizes data pipelines with insights into long-running or failing queries. Teams can centralize log data to facilitate correlation across applications, cloud services, and data platforms.

Information security

Autonomous Big Data Optimization: Multi-Agent Reinforcement Learning to Achieve Self-Tuning Apache Spark

A Q-learning agent autonomously learns and generalizes optimal Spark configurations by discretizing dataset features and combining with Adaptive Query Execution for superior performance.

Are You Missing a Data Frame? The Power of Data Frames in Java

DataFrames and data-oriented programming promote modeling immutable data separately from behavior, making Java suitable for DataFrame-style data manipulation comparable to Python.

Why your next microservices should be streaming SQL-driven

Streaming SQL with UDFs, materialized results, and ML/AI integrations enables continuous, stateful processing of event streams for microservices.

#instructed-retriever

Artificial intelligence

Databricks says its Instructed Retriever offers better AI answers than RAG in the enterprise

Artificial intelligence

Databricks says its Instruction Retrieval offers better AI answers than RAG in the enterprise

ClickHouse, the open-source challenger to Snowflake and Databricks

ClickHouse is a high-performance columnar OLAP database rapidly adopted by AI and enterprise users, now valued at $15B and acquiring Langfuse.

from DevOps.com

Why Data Contracts Need Apache Kafka and Apache Flink - DevOps.com

Data contracts formalize schemas, types, and quality constraints through early producer-consumer collaboration to prevent pipeline failures and reduce operational downtime.

Databricks makes serverless Postgres service Lakebase available

Databricks today announced the general availability of Lakebase on AWS, a new database architecture that separates compute and storage. The managed serverless Postgres service is designed to help organizations build faster without worrying about infrastructure management. When databases link compute and storage, every query must use the same CPU and memory resources. This can cause a single heavy query to affect all other operations. By separating compute and storage, resources automatically scale with the actual load.

Software development

Starburst: Chewing through data access is key to AI adoption

AI adoption is bottlenecked by lack of access to contextual, current, and governed data; without that, AI cannot reliably increase productivity.

Cloudflare Introduces Aggregations in R2 SQL for Data Analytics

R2 SQL now supports SUM, COUNT, AVG, MIN, and MAX, as well as GROUP BY and HAVING clauses. These aggregation functions let developers run SQL analytics directly on data stored in R2 via the R2 Data Catalog, enabling them to quickly summarize data, spot trends, generate reports, and identify unusual patterns in logs. In addition to aggregations, the update introduces schema discovery commands, including SHOW TABLES and DESCRIBE.

Software development

from Fortune

Want to get AI agents to work better? Improve how they retrieve data, Databricks says | Fortune

Engineering complete AI-agent workflows and providing access to correct information are essential for moving AI agents beyond pilot phase.

Beyond the Warehouse: Why BigQuery Alone Won't Solve Your Data Problems

Data warehouses like BigQuery perform well initially but become slow, costly, and disorganized at scale, undermining low-latency operational use and innovation.

Databricks adds MemAlign to MLflow to cut cost and latency of LLM evaluation

By replacing repeated fine‑tuning with a dual‑memory system, MemAlign reduces the cost and instability of training LLM judges, offering faster adaptation to new domains and changing business policies. Databricks' Mosaic AI Research team has added a new framework, MemAlign, to MLflow, its managed machine learning and generative AI lifecycle development service. MemAlign is designed to help enterprises lower the cost and latency of training LLM-based judges, in turn making AI evaluation scalable and trustworthy enough for production deployments.

Artificial intelligence

from Computerworld

Great R packages for data import, wrangling, and visualization

A set of R packages (dplyr, purrr, readr/vroom, datapasta, Hmisc) streamline data wrangling, importing, and analysis with faster, standardized, and reproducible tools.

The Complete Database Scaling Playbook: From 1 to 10,000 Queries Per Second

Database scaling to 10,000 QPS requires staged architectural strategies timed to traffic thresholds to avoid outages or unnecessary cost.

AI is changing the way we think about databases

Developers have spent the past decade trying to forget databases exist. Not literally, of course. We still store petabytes. But for the average developer, the database became an implementation detail; an essential but staid utility layer we worked hard not to think about. We abstracted it behind object-relational mappers (ORM). We wrapped it in APIs. We stuffed semi-structured objects into columns and told ourselves it was flexible.

Software development

3 months ago

Migrating from Historical Batch Processing to Incremental CDC Using Apache Iceberg (Glue 4...

Use Apache Iceberg Copy-on-Write tables in AWS Glue 4 to migrate from full historical batch reprocessing to incremental CDC, reducing redundant computation, I/O, and costs.

Databricks shows how AI strengthens the SaaS model

The rise of generative AI is often seen as an existential threat to the SaaS model. Interfaces would disappear, software would fade away, and existing players would become irrelevant. However, new figures from Databricks paint a different picture. Rather than undermining SaaS, AI appears to be increasing its use. This week, Databricks reported a revenue run rate of $5.4 billion, a 65 percent year-on-year increase. More than a quarter of that now comes from AI-related products.

Artificial intelligence

When Kafka Lag Lies: A Production Debugging Story

Uncommitted Kafka offsets can cause persistent consumer-group lag even when ingestion is low, databases are idle, and no errors are observed.

Snowflake debuts Cortex Code, an AI agent that understands enterprise data context

Cortex Code enables developers to use natural language to build, optimize, and deploy governed, production-ready data pipelines, analytics, ML workloads, and AI agents.

from thenewstack.io

Why Most APIs Fail in AI Systems and How To Fix It

Over the past few years, I've reviewed thousands of APIs across startups, enterprises and global platforms. Almost all shipped OpenAPI documents. On paper, they should be well-defined and interoperable. In practice, most fail when consumed predictably by AI systems. They were designed for human readers, not machines that need to reason, plan and safely execute actions. When APIs are ambiguous, inconsistent or structurally unreliable, AI systems struggle or fail outright.

Software development

from The Register

Nvidia says DGX Spark is now 2.5x faster than at launch

Nvidia's DGX Spark and GB10 systems gain significant software-driven performance improvements and broader software integrations, boosting prefill compute performance for genAI workflows.

from CIO

5 perspectives on modern data analytics

Data/business analytics is the top IT investment priority, yet analytics projects often fail due to poor data, vague objectives, and one-size-fits-all solutions.

AWS Adds Intelligent-Tiering and Replication for S3 Tables

S3 Tables now support Intelligent-Tiering automatic cost optimization and cross-region/account Apache Iceberg table replication without manual synchronization.

Edge AI: The future of AI inference is smarter local compute

Edge AI shifts computation from cloud to devices, enabling low-latency, cost-efficient, and privacy-preserving AI inference while facing performance and ecosystem challenges.

OpenAI Scales Single Primary PostgreSQL to Millions of Queries per Second for ChatGPT

OpenAI scaled a single-primary PostgreSQL to millions of queries per second by optimizing instance size, query patterns, read replicas, and offloading write-heavy workloads.

#ai

Artificial intelligence

With AI, the database matters again

Artificial intelligence

AI makes the database matter again

Building Embedding Models for Large-Scale Real-World Applications

What happens under the hood? How is the search engine able to take that simple query and look through the billions, even trillions, of images that are available online? How is it able to find this one or similar photos from all that? Usually, there is an embedding model doing this work behind the scenes.

Artificial intelligence

Teradata unveils enterprise AgentStack to push AI agents into production

Teradata positions Enterprise AgentStack as a vendor-agnostic execution layer across hybrid environments, contrasting platform-tied AI approaches from Snowflake and Databricks.

from Cointelegraph

What Role Is Left for Decentralized GPU Networks in AI?

What we are beginning to see is that many open-source and other models are becoming compact enough and sufficiently optimized to run very efficiently on consumer GPUs,

Artificial intelligence