#data-preparation

#snowflake
Django
fromMedium
2 days ago

Snowflake Supports Directory Imports

Directory imports let Snowflake functions and procedures pull packages from stage directories and SnowGit directories, streamlining development and deployment.
Artificial intelligence
fromTheregister
1 week ago

Snowflake's ongoing pitch: bring AI to data, not vice versa

Snowflake is enhancing its platform for AI integration through strategic partnerships and acquisitions, focusing on customer ROI and data management efficiency.
#observability
DevOps
fromTechzine Global
2 days ago

Observability warehouses, the next structural evolution for telemetry

Observability is essential for real-time insights in cloud systems, helping to reduce downtime and improve performance.
Business intelligence
fromeLearning Industry
3 days ago

How Many AI Tools Are There? A Data-Backed Look At The Expanding AI Landscape

The AI tools ecosystem is rapidly expanding, with thousands of tools available across various categories, creating both opportunities and complexities for businesses.
Scala
fromMedium
5 days ago

Data Extraction and Classification Using Structural Pattern Matching in Scala

Scala pattern matching enhances code readability and extensibility in real-world data engineering use cases.
Marketing tech
fromEMARKETER
4 days ago

Brands want personalization at scale, but their data stack keeps getting in the way

Limited platform integration is the top barrier to personalization for 42% of brand marketers and 47% of agency marketers in North America.
#ai
Data science
fromTheregister
1 week ago

Datadog bets DIY AI will mean it dodges the SaaSpocalypse

Datadog is releasing an AI model to enhance its observability tools and mitigate risks from customers building their own solutions.
#databricks
Information security
fromInfoWorld
1 week ago

Databricks pitches Lakewatch as a cheaper SIEM - but is it really?

Translating benefits into buy-in from CIOs and CISOs may be challenging for Databricks despite its intent and acquisitions.
Science
fromNature
1 week ago

Drowning in data sets? Here's how to cut them down to size

The Square Kilometre Array Observatory will generate massive data, but storage and retention pose significant challenges for researchers.
fromInfoWorld
4 days ago

How Apache Kafka flexed to support queues

Apache Kafka has cemented itself as the de facto platform for event streaming, often referred to as the 'universal data substrate' due to its extensive ecosystem that enables connectivity and processing capabilities.
Scala
Java
fromMedium
2 weeks ago

Spark Internals: Understanding Tungsten (Part 1)

Apache Spark revolutionized big data processing but faces challenges due to JVM memory management and garbage collection issues.
Information security
fromTechzine Global
1 week ago

Databricks launches Lakewatch: agentic SIEM on the Lakehouse

Lakewatch is an open SIEM platform that consolidates security, IT, and business data, enabling rapid threat detection and response using AI agents.
Data science
fromNature
1 week ago

How I squeeze fresh science from public data

Utilizing existing data can lead to significant discoveries and collaborations in research.
Python
fromRealpython
2 weeks ago

Spyder: Your IDE for Data Science Development in Python - Real Python

Spyder is an open-source Python IDE optimized for data science, offering powerful plotting, profiling capabilities, and integration with the data science ecosystem.
Business intelligence
fromInfoWorld
2 weeks ago

Snowflake's new 'autonomous' AI layer aims to do the work, not just answer questions

Project SnowWork is Snowflake's autonomous AI layer that automates data analysis tasks like forecasting, churn analysis, and report generation without requiring data team intervention.
DevOps
fromInfoWorld
2 weeks ago

Update your databases now to avoid data debt

Multiple major open source databases reach end-of-life in 2026, requiring teams to plan upgrades and migrations to avoid security risks and higher costs.
fromInfoWorld
2 weeks ago

Migrating from Apache Airflow v2 to v3

Airflow 3 represents a clear architectural direction for the project: API-driven execution, better isolation, data-aware scheduling and a platform designed for modern scale. While Airflow 2.x is still widely used, it is clearly moving toward long-term maintenance (end-of-life April 2026) with most innovation and architectural investment happening in the 3.x line.
Software development
Data science
fromMedium
2 weeks ago

Building Consistent Data Foundations at Scale

Building consistent data foundations through intentional architecture, engineering, and governance is essential to prevent fragmentation, support AI adoption, ensure regulatory compliance, and enable reliable organizational decisions at scale.
fromMedium
1 month ago

Real-Time Data Validation in Healthcare Streaming: Building Custom Schema Registry Patterns with...

In a single streaming pipeline, you might be processing HL7 FHIR messages with frequent specification updates, claims data following various payer-specific formats, provider directory information with inconsistent taxonomies, and patient demographics with privacy redaction requirements. Our member eligibility stream processes roughly 50,000 records per minute during peak enrollment periods.
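The per-format validation the article describes can be sketched as a registry that maps each message format to its field checks. This is a toy illustration under assumed field names, not the article's actual schema registry:

```python
# Minimal schema-registry-style validator: each message format registers
# required fields and per-field checks (names are illustrative).

from typing import Callable

REGISTRY: dict[str, dict[str, Callable[[object], bool]]] = {
    "eligibility": {
        "member_id": lambda v: isinstance(v, str) and len(v) > 0,
        "plan_code": lambda v: isinstance(v, str),
    },
}

def validate(fmt: str, record: dict) -> list[str]:
    """Return a list of validation errors (empty means the record is valid)."""
    schema = REGISTRY.get(fmt)
    if schema is None:
        return [f"unknown format: {fmt}"]
    errors = []
    for field, check in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not check(record[field]):
            errors.append(f"invalid value for: {field}")
    return errors

ok = validate("eligibility", {"member_id": "M123", "plan_code": "PPO"})
bad = validate("eligibility", {"member_id": ""})
```

In a real streaming pipeline the registry would be versioned and fetched at runtime, so specification updates do not require redeploying consumers.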
Healthcare
Artificial intelligence
fromInfoWorld
3 weeks ago

Databricks launches Genie Code to automate data science and engineering tasks

Databricks launched Genie Code, an AI agent that automates data science and engineering tasks within its lakehouse platform to accelerate ML workflows and enterprise data operations.
Data science
fromMedium
3 weeks ago

Migrating to the Lakehouse Without the Big Bang: An Incremental Approach

Query federation enables safe, incremental lakehouse migration by allowing simultaneous queries across legacy warehouses and new lakehouse systems without risky big bang cutover approaches.
Django
fromRealpython
3 weeks ago

Introduction to Python SQL Libraries Quiz - Real Python

A 9-question interactive quiz assesses proficiency in Python SQL libraries for database connectivity, query execution, and cross-database scripting with SQLite, MySQL, and PostgreSQL.
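The libraries the quiz covers share Python's DB-API 2.0 shape. A minimal sketch with the standard library's sqlite3 module; the same connect/execute/fetch pattern carries over to MySQL and PostgreSQL drivers:

```python
# The connect / execute / fetch pattern common to Python SQL libraries,
# shown with the built-in sqlite3 module and an in-memory database.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
# Parameterized inserts (? placeholders) avoid SQL injection.
conn.executemany("INSERT INTO users (name) VALUES (?)", [("Ada",), ("Grace",)])
conn.commit()

rows = conn.execute("SELECT name FROM users ORDER BY name").fetchall()
conn.close()
```

Switching to MySQL or PostgreSQL mostly means changing the `connect()` call and the placeholder style (`%s` instead of `?`).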
Software development
fromMedium
1 month ago

Unified Databricks Repository for Scala and Python Data Pipelines

Databricks repositories require structured setup with Gradle for multi-language support, dependency management, and version control to scale beyond manual notebook maintenance.
DevOps
fromEntrepreneur
3 weeks ago

How AI Is Revolutionizing Disaster Recovery

AI can transform static disaster recovery runbooks into continuously validated, automatically updated procedures that keep pace with evolving infrastructure and prevent costly recovery delays.
Business intelligence
fromTechzine Global
3 weeks ago

Dataiku introduces platform for scalable enterprise AI

Dataiku launches Platform for AI Success with three new products designed to move AI initiatives from pilots to measurable business outcomes through unified orchestration across cloud providers.
#scala-interview-preparation
Python
fromTreehouse Blog
1 month ago

Python for Data: A SQL + Pandas Mini-Project That Actually Prepares You for Real Work

Effective data analysis requires combining SQL and Python skills in integrated projects that mirror real-world workflows, not learning them in isolation.
DevOps
fromTechzine Global
3 weeks ago

MariaDB acquires GridGain for agentic AI data

MariaDB acquires GridGain Systems to combine relational database technology with in-memory computing, enabling sub-millisecond performance for agentic AI applications.
Django
fromRealpython
1 month ago

Automate Python Data Analysis With YData Profiling Quiz - Real Python

An interactive 8-question quiz assesses proficiency in YData Profiling for automating Python data analysis tasks including report generation, dataset comparison, and time series preparation.
Artificial intelligence
fromInfoWorld
1 month ago

Why AI requires rethinking the storage-compute divide

AI workloads require continuous processing of unstructured multimodal data, causing redundant data movement and transformation that wastes infrastructure costs and data scientist time.
Python
fromRealpython
1 month ago

Automate Python Data Analysis With YData Profiling - Real Python

YData Profiling generates interactive exploratory data analysis reports with summary statistics, visualizations, and data quality warnings from pandas DataFrames in just a few lines of code.
Data science
fromInfoWorld
4 weeks ago

The revenge of SQL: How a 50-year-old language reinvents itself

SQL has experienced a major comeback driven by SQLite in browsers, improved language tools, and PostgreSQL's jsonb type, making it both traditional and exciting for modern development.
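The jsonb point can be illustrated with SQLite's built-in JSON functions, assuming a build with JSON support (standard in recent SQLite releases); PostgreSQL's jsonb offers the same idea via its `->` and `->>` operators:

```python
# Querying semi-structured JSON with plain SQL, using SQLite's JSON
# functions on an in-memory database (data is made up for illustration).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (doc TEXT)")
conn.execute("INSERT INTO events VALUES ('{\"user\": \"ada\", \"clicks\": 3}')")
conn.execute("INSERT INTO events VALUES ('{\"user\": \"bob\", \"clicks\": 7}')")

# Filter and project on fields inside the JSON document, in SQL.
row = conn.execute(
    "SELECT json_extract(doc, '$.user') FROM events "
    "WHERE json_extract(doc, '$.clicks') > 5"
).fetchone()
conn.close()
```

This is the "traditional and exciting" combination the article points at: relational SQL over semi-structured documents, no separate document store required.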
fromFast Company
1 month ago

Beware of data hubris

Organizations are drowning in dashboards, KPIs, performance metrics, behavioral traces, biometric indicators, predictive scores, engagement rates, and AI-generated forecasts. We have more data than we know what to do with. We pretend that the mere presence of data guarantees clarity. It does not. That's data hubris—the arrogant belief that because something can be measured, it can be mastered.
Business intelligence
UX design
fromscikit-learn Blog
2 months ago

Enhancing user experience through interactive inspection

Scikit-learn added interactive HTML model inspections, including parameter tables, funded by a Wellcome/CZI EOSS grant to improve model inspection and UX.
Startup companies
fromInfoQ
1 month ago

Etleap Launches Iceberg Pipeline Platform to Simplify Enterprise Adoption of Apache Iceberg

Managed Iceberg pipeline platform unifies ingestion, transformation, orchestration, and table operations inside customers' VPCs, enabling enterprise Iceberg adoption without building custom stacks.
Miscellaneous
fromTechzine Global
2 months ago

Klarrio uses open source expertise to build foundational data platforms

Klarrio builds compliant, scalable open-source data platforms and platform-engineering foundations, integrating and securing underlying infrastructure so customers can focus on analytics and data science.
Tech industry
fromTheregister
1 month ago

Snowflake plugs PostgreSQL into its AI Data Cloud

Snowflake now offers a native PostgreSQL DBaaS in its AI Data Cloud to run transactional workloads alongside analytics and AI under unified governance.
fromThe Drum
2 months ago

Deeper data delivers more inspired partnership decisions

Imagine you're selecting an influencer to work with on your new campaign. You've narrowed it down to two, both in the right area, both creating the right sort of content. One has 24.6 million subscribers, the other 1.4 million. Which do you choose? Now imagine you could find out the first had 8.7 million unique viewers last month, while the second had 9.9 million. Do you want to change your mind?
Marketing
Web development
fromInfoQ
2 months ago

DuckDB's WebAssembly Client Allows Querying Iceberg Datasets in the Browser

DuckDB-Wasm enables browser-based, serverless end-to-end query, read, and write access to Iceberg REST catalogs and object storage without infrastructure setup.
fromTechzine Global
2 months ago

Sumo Logic launches data pipeline apps for Snowflake and Databricks

Snowflake offers a fully managed data platform, but Sumo Logic users often lack insight into performance, login activity, and operational health. The Sumo Logic Snowflake Logs App analyzes login and access activity to identify anomalies or suspicious behavior. It also optimizes data pipelines with insights into long-running or failing queries. Teams can centralize log data to facilitate correlation across applications, cloud services, and data platforms.
Information security
Data science
fromInfoQ
1 month ago

Databricks Introduces Lakebase, a PostgreSQL Database for AI Workloads

Databricks Lakebase is a serverless PostgreSQL OLTP database that separates compute from storage and unifies transactional and analytical capabilities.
Tech industry
fromComputerworld
2 months ago

New Tableau AI features and Slack integration aim for data accessibility

Tableau added AI-powered personalization, automation, natural-language data stories, data mapping, and Slack integration to make data more accessible and actionable for business users.
fromInfoWorld
1 month ago

AI-augmented data quality engineering

SHAP quantifies each feature's contribution to a model prediction, while LIME builds simple local models around a prediction to show how small changes influence outcomes, answering questions like "Would correcting age change the anomaly score?" or "Would adjusting the ZIP code affect classification?" Explainability makes AI-based data remediation acceptable in regulated industries.
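LIME's core move, perturb one input and observe how the prediction shifts, can be sketched with a toy scorer. The model and thresholds below are invented for illustration, not a real anomaly detector:

```python
# A toy illustration of LIME-style local perturbation: change one feature
# and measure the effect on the model's score (the scorer is made up).

def anomaly_score(record: dict) -> float:
    # Pretend model: implausible ages and unknown ZIP codes raise the score.
    score = 0.0
    if record["age"] > 120 or record["age"] < 0:
        score += 0.8
    if record["zip"] not in {"30301", "94103"}:
        score += 0.2
    return score

record = {"age": 150, "zip": "30301"}
base = anomaly_score(record)

# "Would correcting age change the anomaly score?"
corrected = dict(record, age=45)
delta = base - anomaly_score(corrected)
```

A large `delta` attributes the anomaly to the age field, which is the kind of local, per-record explanation that makes automated remediation auditable.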
Artificial intelligence
#streamlit
Information security
fromSecuritymagazine
1 month ago

Product Spotlight on Analytics

Taelor Sutherland is Associate Editor at Security magazine covering enterprise security, coordinating digital content, and holding a BA in English Literature from Agnes Scott College.
fromInfoWorld
2 months ago

AI is changing the way we think about databases

Developers have spent the past decade trying to forget databases exist. Not literally, of course. We still store petabytes. But for the average developer, the database became an implementation detail; an essential but staid utility layer we worked hard not to think about. We abstracted it behind object-relational mappers (ORM). We wrapped it in APIs. We stuffed semi-structured objects into columns and told ourselves it was flexible.
Software development
Business intelligence
fromTechzine Global
2 months ago

ClickHouse, the open-source challenger to Snowflake and Databricks

ClickHouse is a high-performance columnar OLAP database rapidly adopted by AI and enterprise users, now valued at $15B and acquiring Langfuse.
fromInfoWorld
2 months ago

How to use Pandas for data analysis in Python

When it comes to working with data in a tabular form, most people reach for a spreadsheet. That's not a bad choice: Microsoft Excel and similar programs are familiar and loaded with functionality for massaging tables of data. But what if you want more control, precision, and power than Excel alone delivers? In that case, the open source Pandas library for Python might be what you are looking for.
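The kind of spreadsheet work the article describes takes a few lines of pandas. A sketch on a made-up table, showing column math and the groupby equivalent of a pivot table:

```python
# Spreadsheet-style analysis in pandas: derived columns and aggregation
# over a small made-up sales table.

import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "units": [10, 3, 7, 12],
    "price": [2.0, 5.0, 2.0, 5.0],
})
# A derived column, computed row-wise like a spreadsheet formula column.
df["revenue"] = df["units"] * df["price"]

# Total revenue per region: the pandas equivalent of a pivot table.
totals = df.groupby("region")["revenue"].sum().to_dict()
```

The control and precision come from everything being ordinary Python: the result is a plain data structure you can test, script, and version.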
Python
fromTreehouse Blog
1 month ago

Portfolio Projects for Entry-Level Data Roles

Most beginner data portfolios look similar. They include a few cleaned datasets, some charts or dashboards, and a notebook with code and commentary. Again, nothing here is wrong. But hiring teams don't review portfolios to check whether you can follow instructions. They review them to see whether you can think like a data analyst. When projects feel generic, reviewers are left guessing.
Data science
Artificial intelligence
fromInfoQ
2 months ago

Autonomous Big Data Optimization: Multi-Agent Reinforcement Learning to Achieve Self-Tuning Apache Spark

A Q-learning agent autonomously learns and generalizes optimal Spark configurations by discretizing dataset features and combining with Adaptive Query Execution for superior performance.
Software development
fromInfoQ
1 month ago

Are You Missing a Data Frame? The Power of Data Frames in Java

DataFrames and data-oriented programming promote modeling immutable data separately from behavior, making Java suitable for DataFrame-style data manipulation comparable to Python.
Business intelligence
fromInfoWorld
2 months ago

Google tests BigQuery feature to generate SQL queries from English

Google allows natural language expressions inside SQL comments to speed translation of intent into executable queries, reducing query-writing time and easing analytics workflows.
fromMedium
2 months ago

How I Fixed a Critical Spark Production Performance Issue (and Cut Runtime by 70%)

"The job didn't fail. It just... never finished." That was the worst part. No errors.No stack traces.Just a Spark job running forever in production - blocking downstream pipelines, delaying reports, and waking up-on-call engineers at 2 AM. This is the story of how I diagnosed a real Spark performance issue in production and fixed it drastically, not by adding more machines - but by understanding Spark properly.
Data science
fromMedium
2 months ago

The Complete Guide to Optimizing Apache Spark Jobs: From Basics to Production-Ready Performance

Optimize Spark jobs by using lazy evaluation awareness, early filter and column pruning, partition pruning, and appropriate join strategies to minimize shuffles and I/O.
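Two of those ideas, lazy evaluation and filtering early, can be illustrated without a Spark cluster using plain Python generators. A sketch of the principle, not Spark itself:

```python
# Spark's lazy evaluation and "filter early" advice, illustrated with
# generators: nothing runs until a terminal operation, and filtering
# before the expensive step avoids wasted work on discarded rows.

calls = {"expensive": 0}

def expensive_transform(x: int) -> int:
    calls["expensive"] += 1
    return x * x

rows = range(10)

# Filter first, then transform: the expensive step only touches survivors.
pipeline = (expensive_transform(x) for x in rows if x % 2 == 0)

# No work has happened yet (lazy, like Spark transformations).
assert calls["expensive"] == 0

result = list(pipeline)  # the "action" that triggers execution
```

In Spark the same ordering matters at cluster scale: pushing filters and column pruning ahead of wide transformations shrinks the data that shuffles move.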
fromMedium
1 month ago

Why "Data Scientist" is Becoming "AI Engineer" and What That Actually Means

The title "data scientist" is quietly disappearing from job postings, internal org charts, and LinkedIn headlines. In its place, roles like "AI engineer," "applied AI engineer," and "machine learning engineer" are becoming the norm. This Data Scientist vs AI Engineer shift raises an important question for practitioners and leaders alike: what actually changes when a data scientist becomes an AI engineer, and what stays the same? More importantly, what skills matter if you want to make this transition intentionally rather than by accident?
Artificial intelligence
#python
fromTechzine Global
1 month ago

Databricks makes serverless Postgres service Lakebase available

Databricks today announced the general availability of Lakebase on AWS, a new database architecture that separates compute and storage. The managed serverless Postgres service is designed to help organizations build faster without worrying about infrastructure management. When databases link compute and storage, every query must use the same CPU and memory resources. This can cause a single heavy query to affect all other operations. By separating compute and storage, resources automatically scale with the actual load.
Software development
Artificial intelligence
fromMedium
2 months ago

Extracting AI-Ready Data From Organizational Documents

Poor document extraction corrupts retrieval; preserving document structure at ingestion produces reliable embeddings and trustworthy RAG outputs.
Data science
fromComputerworld
2 months ago

Great R packages for data import, wrangling, and visualization

A set of R packages (dplyr, purrr, readr/vroom, datapasta, Hmisc) streamline data wrangling, importing, and analysis with faster, standardized, and reproducible tools.
#instructed-retriever
fromInfoWorld
2 months ago
Artificial intelligence

Databricks says its Instructed Retriever offers better AI answers than RAG in the enterprise

Data science
fromCIO
2 months ago

5 perspectives on modern data analytics

Data/business analytics is the top IT investment priority, yet analytics projects often fail due to poor data, vague objectives, and one-size-fits-all solutions.
Artificial intelligence
fromFortune
2 months ago

Want to get AI agents to work better? Improve how they retrieve data, Databricks says | Fortune

Engineering complete AI-agent workflows and providing access to correct information are essential for moving AI agents beyond pilot phase.
Data science
fromInfoQ
1 month ago

Beyond the Warehouse: Why BigQuery Alone Won't Solve Your Data Problems

Data warehouses like BigQuery perform well initially but become slow, costly, and disorganized at scale, undermining low-latency operational use and innovation.
fromTechzine Global
1 month ago

Databricks shows how AI strengthens the SaaS model

The rise of generative AI is often seen as an existential threat to the SaaS model. Interfaces would disappear, software would fade away, and existing players would become irrelevant. However, new figures from Databricks paint a different picture. Rather than undermining SaaS, AI appears to be increasing its use. This week, Databricks reported a revenue run rate of $5.4 billion, a 65 percent year-on-year increase. More than a quarter of that now comes from AI-related products.
Artificial intelligence
Data science
fromInfoWorld
1 month ago

Snowflake debuts Cortex Code, an AI agent that understands enterprise data context

Cortex Code enables developers to use natural language to build, optimize, and deploy governed, production-ready data pipelines, analytics, ML workloads, and AI agents.
Artificial intelligence
fromTechzine Global
1 month ago

Snowflake launches Cortex Code agent for understanding data context

Cortex Code is an AI agent that converts complex data engineering, ML, and analytics tasks into natural-language workflows integrated into Snowflake and developer tools.
Artificial intelligence
fromInfoWorld
2 months ago

Teradata unveils enterprise AgentStack to push AI agents into production

Teradata positions Enterprise AgentStack as a vendor-agnostic execution layer across hybrid environments, contrasting platform-tied AI approaches from Snowflake and Databricks.
Data science
fromComputerworld
2 months ago

Tableau re-engineers dashboards, adds new analytics tools for business analysts

Tableau 2022.3 adds Data Guide and Table Extension, dynamic dashboards, event auditing, and performance/cost optimization to simplify self-service analytics for business users.
Artificial intelligence
fromInfoQ
2 months ago

Google BigQuery Adds SQL-Native Managed Inference for Hugging Face Models

BigQuery lets data teams deploy and run Hugging Face or Vertex AI open models with plain SQL, auto-provisioning compute and managing endpoints.
Data science
fromMedium
2 months ago

Migrating from Historical Batch Processing to Incremental CDC Using Apache Iceberg (Glue 4...

Use Apache Iceberg Copy-on-Write tables in AWS Glue 4 to migrate from full historical batch reprocessing to incremental CDC, reducing redundant computation, I/O, and costs.
fromInfoWorld
1 month ago

Databricks adds MemAlign to MLflow to cut cost and latency of LLM evaluation

By replacing repeated fine‑tuning with a dual‑memory system, MemAlign reduces the cost and instability of training LLM judges, offering faster adaptation to new domains and changing business policies. Databricks' Mosaic AI Research team has added a new framework, MemAlign, to MLflow, its managed machine learning and generative AI lifecycle development service. MemAlign is designed to help enterprises lower the cost and latency of training LLM-based judges, in turn making AI evaluation scalable and trustworthy enough for production deployments.
Artificial intelligence
fromInfoQ
2 months ago

Why Most Machine Learning Projects Fail to Reach Production

Most ML projects fail to reach production because of problem choice, data/labeling issues, model-to-product gaps, offline-online mismatches, and non-technical blockers.
fromInfoQ
1 month ago

Building Embedding Models for Large-Scale Real-World Applications

What happens under the hood? How is the search engine able to take that simple query and look for images among the billions, even trillions, of images available online? How is it able to find this one, or similar photos, from all of that? Usually, there is an embedding model doing this work under the hood.
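The core of that lookup is vector similarity: encode query and items as vectors, then rank by cosine similarity. A sketch with made-up vectors; real systems use a learned embedding model and an approximate nearest-neighbor index:

```python
# Embedding-based search in miniature: rank items by cosine similarity
# to a query vector (vectors here are invented for illustration).

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

index = {
    "sunset_beach.jpg": [0.9, 0.1, 0.0],
    "city_night.jpg": [0.1, 0.9, 0.2],
    "beach_dunes.jpg": [0.8, 0.2, 0.1],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "photo of a beach"

best = max(index, key=lambda name: cosine(index[name], query))
```

At billions of items, exhaustive scoring like this is replaced by approximate indexes (HNSW, IVF), but the ranking criterion is the same.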
Artificial intelligence