#data-preparation

#snowflake
Django
fromMedium
2 days ago

Snowflake Supports Directory Imports

Directory imports let Snowflake functions and procedures pull packages from stage directories and SnowGit directories, streamlining development and deployment.
Artificial intelligence
fromTheregister
1 week ago

Snowflake's ongoing pitch: bring AI to data, not vice versa

Snowflake is enhancing its platform for AI integration through strategic partnerships and acquisitions, focusing on customer ROI and data management efficiency.
#observability
DevOps
fromTechzine Global
2 days ago

Observability warehouses, the next structural evolution for telemetry

Observability is essential for real-time insights in cloud systems, helping to reduce downtime and improve performance.
Business intelligence
fromeLearning Industry
3 days ago

How Many AI Tools Are There? A Data-Backed Look At The Expanding AI Landscape

The AI tools ecosystem is rapidly expanding, with thousands of tools available across various categories, creating both opportunities and complexities for businesses.
Scala
fromMedium
5 days ago

Data Extraction and Classification Using Structural Pattern Matching in Scala

Scala pattern matching enhances code readability and extensibility in real-world data engineering use cases.
Marketing tech
fromEMARKETER
4 days ago

Brands want personalization at scale, but their data stack keeps getting in the way

Limited platform integration is the top barrier to personalization for 42% of brand marketers and 47% of agency marketers in North America.
#ai
Data science
fromTheregister
1 week ago

Datadog bets DIY AI will mean it dodges the SaaSpocalypse

Datadog is releasing an AI model to enhance its observability tools and mitigate risks from customers building their own solutions.
#databricks
Information security
fromInfoWorld
1 week ago

Databricks pitches Lakewatch as a cheaper SIEM - but is it really?

Translating benefits into buy-in from CIOs and CISOs may be challenging for Databricks despite its intent and acquisitions.
Science
fromNature
1 week ago

Drowning in data sets? Here's how to cut them down to size

The Square Kilometre Array Observatory will generate massive data, but storage and retention pose significant challenges for researchers.
fromInfoWorld
4 days ago

How Apache Kafka flexed to support queues

Apache Kafka has cemented itself as the de facto platform for event streaming, often referred to as the 'universal data substrate' due to its extensive ecosystem that enables connectivity and processing capabilities.
Scala
Java
fromMedium
2 weeks ago

Spark Internals: Understanding Tungsten (Part 1)

Apache Spark revolutionized big data processing but faces challenges due to JVM memory management and garbage collection issues.
Information security
fromTechzine Global
1 week ago

Databricks launches Lakewatch: agentic SIEM on the Lakehouse

Lakewatch is an open SIEM platform that consolidates security, IT, and business data, enabling rapid threat detection and response using AI agents.
Data science
fromNature
1 week ago

How I squeeze fresh science from public data

Utilizing existing data can lead to significant discoveries and collaborations in research.
Python
fromRealpython
2 weeks ago

Spyder: Your IDE for Data Science Development in Python - Real Python

Spyder is an open-source Python IDE optimized for data science, offering powerful plotting, profiling capabilities, and integration with the data science ecosystem.
Business intelligence
fromInfoWorld
2 weeks ago

Snowflake's new 'autonomous' AI layer aims to do the work, not just answer questions

Project SnowWork is Snowflake's autonomous AI layer that automates data analysis tasks like forecasting, churn analysis, and report generation without requiring data team intervention.
DevOps
fromInfoWorld
2 weeks ago

Update your databases now to avoid data debt

Multiple major open source databases reach end-of-life in 2026, requiring teams to plan upgrades and migrations to avoid security risks and higher costs.
fromInfoWorld
2 weeks ago

Migrating from Apache Airflow v2 to v3

Airflow 3 represents a clear architectural direction for the project: API-driven execution, better isolation, data-aware scheduling and a platform designed for modern scale. While Airflow 2.x is still widely used, it is clearly moving toward long-term maintenance (end-of-life April 2026) with most innovation and architectural investment happening in the 3.x line.
Software development
Data science
fromMedium
2 weeks ago

Building Consistent Data Foundations at Scale

Building consistent data foundations through intentional architecture, engineering, and governance is essential to prevent fragmentation, support AI adoption, ensure regulatory compliance, and enable reliable organizational decisions at scale.
fromMedium
1 month ago

Real-Time Data Validation in Healthcare Streaming: Building Custom Schema Registry Patterns with...

In a single streaming pipeline, you might be processing HL7 FHIR messages with frequent specification updates, claims data following various payer-specific formats, provider directory information with inconsistent taxonomies, and patient demographics with privacy redaction requirements. Our member eligibility stream processes roughly 50,000 records per minute during peak enrollment periods.
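The per-format validation the article describes can be sketched as a registry that maps each message format to its field checks. This is a toy illustration under assumed field names, not the article's actual schema registry:

```python
# Minimal schema-registry-style validator: each message format registers
# required fields and per-field checks (names are illustrative).

from typing import Callable

REGISTRY: dict[str, dict[str, Callable[[object], bool]]] = {
    "eligibility": {
        "member_id": lambda v: isinstance(v, str) and len(v) > 0,
        "plan_code": lambda v: isinstance(v, str),
    },
}

def validate(fmt: str, record: dict) -> list[str]:
    """Return a list of validation errors (empty means the record is valid)."""
    schema = REGISTRY.get(fmt)
    if schema is None:
        return [f"unknown format: {fmt}"]
    errors = []
    for field, check in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not check(record[field]):
            errors.append(f"invalid value for: {field}")
    return errors

ok = validate("eligibility", {"member_id": "M123", "plan_code": "PPO"})
bad = validate("eligibility", {"member_id": ""})
```

In a real streaming pipeline the registry would be versioned and fetched at runtime, so specification updates do not require redeploying consumers.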
Healthcare
Artificial intelligence
fromInfoWorld
3 weeks ago

Databricks launches Genie Code to automate data science and engineering tasks

Databricks launched Genie Code, an AI agent that automates data science and engineering tasks within its lakehouse platform to accelerate ML workflows and enterprise data operations.
Data science
fromMedium
3 weeks ago

Migrating to the Lakehouse Without the Big Bang: An Incremental Approach

Query federation enables safe, incremental lakehouse migration by allowing simultaneous queries across legacy warehouses and new lakehouse systems without risky big bang cutover approaches.
Django
fromRealpython
3 weeks ago

Introduction to Python SQL Libraries Quiz - Real Python

A 9-question interactive quiz assesses proficiency in Python SQL libraries for database connectivity, query execution, and cross-database scripting with SQLite, MySQL, and PostgreSQL.
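The libraries the quiz covers share Python's DB-API 2.0 shape. A minimal sketch with the standard library's sqlite3 module; the same connect/execute/fetch pattern carries over to MySQL and PostgreSQL drivers:

```python
# The connect / execute / fetch pattern common to Python SQL libraries,
# shown with the built-in sqlite3 module and an in-memory database.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
# Parameterized inserts (? placeholders) avoid SQL injection.
conn.executemany("INSERT INTO users (name) VALUES (?)", [("Ada",), ("Grace",)])
conn.commit()

rows = conn.execute("SELECT name FROM users ORDER BY name").fetchall()
conn.close()
```

Switching to MySQL or PostgreSQL mostly means changing the `connect()` call and the placeholder style (`%s` instead of `?`).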
Software development
fromMedium
1 month ago

Unified Databricks Repository for Scala and Python Data Pipelines

Databricks repositories require structured setup with Gradle for multi-language support, dependency management, and version control to scale beyond manual notebook maintenance.
DevOps
fromEntrepreneur
3 weeks ago

How AI Is Revolutionizing Disaster Recovery

AI can transform static disaster recovery runbooks into continuously validated, automatically updated procedures that keep pace with evolving infrastructure and prevent costly recovery delays.
Business intelligence
fromTechzine Global
3 weeks ago

Dataiku introduces platform for scalable enterprise AI

Dataiku launches Platform for AI Success with three new products designed to move AI initiatives from pilots to measurable business outcomes through unified orchestration across cloud providers.
#scala-interview-preparation
Python
fromTreehouse Blog
1 month ago

Python for Data: A SQL + Pandas Mini-Project That Actually Prepares You for Real Work

Effective data analysis requires combining SQL and Python skills in integrated projects that mirror real-world workflows, not learning them in isolation.
DevOps
fromTechzine Global
3 weeks ago

MariaDB acquires GridGain for agentic AI data

MariaDB acquires GridGain Systems to combine relational database technology with in-memory computing, enabling sub-millisecond performance for agentic AI applications.
Django
fromRealpython
1 month ago

Automate Python Data Analysis With YData Profiling Quiz - Real Python

An interactive 8-question quiz assesses proficiency in YData Profiling for automating Python data analysis tasks including report generation, dataset comparison, and time series preparation.
Artificial intelligence
fromInfoWorld
1 month ago

Why AI requires rethinking the storage-compute divide

AI workloads require continuous processing of unstructured multimodal data, causing redundant data movement and transformation that wastes infrastructure costs and data scientist time.
Python
fromRealpython
1 month ago

Automate Python Data Analysis With YData Profiling - Real Python

YData Profiling generates interactive exploratory data analysis reports with summary statistics, visualizations, and data quality warnings from pandas DataFrames in just a few lines of code.
Data science
fromInfoWorld
4 weeks ago

The revenge of SQL: How a 50-year-old language reinvents itself

SQL has experienced a major comeback driven by SQLite in browsers, improved language tools, and PostgreSQL's jsonb type, making it both traditional and exciting for modern development.
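The jsonb point can be illustrated with SQLite's built-in JSON functions, assuming a build with JSON support (standard in recent SQLite releases); PostgreSQL's jsonb offers the same idea via its `->` and `->>` operators:

```python
# Querying semi-structured JSON with plain SQL, using SQLite's JSON
# functions on an in-memory database (data is made up for illustration).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (doc TEXT)")
conn.execute("INSERT INTO events VALUES ('{\"user\": \"ada\", \"clicks\": 3}')")
conn.execute("INSERT INTO events VALUES ('{\"user\": \"bob\", \"clicks\": 7}')")

# Filter and project on fields inside the JSON document, in SQL.
row = conn.execute(
    "SELECT json_extract(doc, '$.user') FROM events "
    "WHERE json_extract(doc, '$.clicks') > 5"
).fetchone()
conn.close()
```

This is the "traditional and exciting" combination the article points at: relational SQL over semi-structured documents, no separate document store required.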
fromFast Company
1 month ago

Beware of data hubris

Organizations are drowning in dashboards, KPIs, performance metrics, behavioral traces, biometric indicators, predictive scores, engagement rates, and AI-generated forecasts. We have more data than we know what to do with. We pretend that the mere presence of data guarantees clarity. It does not. That's data hubris—the arrogant belief that because something can be measured, it can be mastered.
Business intelligence
UX design
fromscikit-learn Blog
2 months ago

Enhancing user experience through interactive inspection

Scikit-learn added interactive HTML model inspections, including parameter tables, funded by a Wellcome/CZI EOSS grant to improve model inspection and UX.
Startup companies
fromInfoQ
1 month ago

Etleap Launches Iceberg Pipeline Platform to Simplify Enterprise Adoption of Apache Iceberg

Managed Iceberg pipeline platform unifies ingestion, transformation, orchestration, and table operations inside customers' VPCs, enabling enterprise Iceberg adoption without building custom stacks.
Miscellaneous
fromTechzine Global
2 months ago

Klarrio uses open source expertise to build foundational data platforms

Klarrio builds compliant, scalable open-source data platforms and platform-engineering foundations, integrating and securing underlying infrastructure so customers can focus on analytics and data science.
Tech industry
fromTheregister
1 month ago

Snowflake plugs PostgreSQL into its AI Data Cloud

Snowflake now offers a native PostgreSQL DBaaS in its AI Data Cloud to run transactional workloads alongside analytics and AI under unified governance.
fromThe Drum
2 months ago

Deeper data delivers more inspired partnership decisions

Imagine you're selecting an influencer to work with on your new campaign. You've narrowed it down to two, both in the right area, both creating the right sort of content. One has 24.6 million subscribers, the other 1.4 million. Which do you choose? Now imagine you could find out the first had 8.7 million unique viewers last month, while the second had 9.9 million. Do you want to change your mind?
Marketing
Web development
fromInfoQ
2 months ago

DuckDB's WebAssembly Client Allows Querying Iceberg Datasets in the Browser

DuckDB-Wasm enables browser-based, serverless end-to-end query, read, and write access to Iceberg REST catalogs and object storage without infrastructure setup.
fromTechzine Global
2 months ago

Sumo Logic launches data pipeline apps for Snowflake and Databricks

Snowflake offers a fully managed data platform, but Sumo Logic users often lack insight into performance, login activity, and operational health. The Sumo Logic Snowflake Logs App analyzes login and access activity to identify anomalies or suspicious behavior. It also optimizes data pipelines with insights into long-running or failing queries. Teams can centralize log data to facilitate correlation across applications, cloud services, and data platforms.
Information security
Data science
fromInfoQ
1 month ago

Databricks Introduces Lakebase, a PostgreSQL Database for AI Workloads

Databricks Lakebase is a serverless PostgreSQL OLTP database that separates compute from storage and unifies transactional and analytical capabilities.
Tech industry
fromComputerworld
2 months ago

New Tableau AI features and Slack integration aim for data accessibility

Tableau added AI-powered personalization, automation, natural-language data stories, data mapping, and Slack integration to make data more accessible and actionable for business users.
fromInfoWorld
1 month ago

AI-augmented data quality engineering

SHAP quantifies each feature's contribution to a model prediction, while LIME builds simple local models around a prediction to show how small changes influence outcomes, answering questions like "Would correcting age change the anomaly score?" or "Would adjusting the ZIP code affect classification?" Explainability makes AI-based data remediation acceptable in regulated industries.
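LIME's core move, perturb one input and observe how the prediction shifts, can be sketched with a toy scorer. The model and thresholds below are invented for illustration, not a real anomaly detector:

```python
# A toy illustration of LIME-style local perturbation: change one feature
# and measure the effect on the model's score (the scorer is made up).

def anomaly_score(record: dict) -> float:
    # Pretend model: implausible ages and unknown ZIP codes raise the score.
    score = 0.0
    if record["age"] > 120 or record["age"] < 0:
        score += 0.8
    if record["zip"] not in {"30301", "94103"}:
        score += 0.2
    return score

record = {"age": 150, "zip": "30301"}
base = anomaly_score(record)

# "Would correcting age change the anomaly score?"
corrected = dict(record, age=45)
delta = base - anomaly_score(corrected)
```

A large `delta` attributes the anomaly to the age field, which is the kind of local, per-record explanation that makes automated remediation auditable.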
Artificial intelligence
#streamlit
Information security
fromSecuritymagazine
1 month ago

Product Spotlight on Analytics

Taelor Sutherland is Associate Editor at Security magazine covering enterprise security, coordinating digital content, and holding a BA in English Literature from Agnes Scott College.
fromInfoWorld
2 months ago

AI is changing the way we think about databases

Developers have spent the past decade trying to forget databases exist. Not literally, of course. We still store petabytes. But for the average developer, the database became an implementation detail; an essential but staid utility layer we worked hard not to think about. We abstracted it behind object-relational mappers (ORM). We wrapped it in APIs. We stuffed semi-structured objects into columns and told ourselves it was flexible.
Software development
Business intelligence
fromTechzine Global
2 months ago

ClickHouse, the open-source challenger to Snowflake and Databricks

ClickHouse is a high-performance columnar OLAP database rapidly adopted by AI and enterprise users, now valued at $15B and acquiring Langfuse.
fromInfoWorld
2 months ago

How to use Pandas for data analysis in Python

When it comes to working with data in a tabular form, most people reach for a spreadsheet. That's not a bad choice: Microsoft Excel and similar programs are familiar and loaded with functionality for massaging tables of data. But what if you want more control, precision, and power than Excel alone delivers? In that case, the open source Pandas library for Python might be what you are looking for.
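The kind of spreadsheet work the article describes takes a few lines of pandas. A sketch on a made-up table, showing column math and the groupby equivalent of a pivot table:

```python
# Spreadsheet-style analysis in pandas: derived columns and aggregation
# over a small made-up sales table.

import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "units": [10, 3, 7, 12],
    "price": [2.0, 5.0, 2.0, 5.0],
})
# A derived column, computed row-wise like a spreadsheet formula column.
df["revenue"] = df["units"] * df["price"]

# Total revenue per region: the pandas equivalent of a pivot table.
totals = df.groupby("region")["revenue"].sum().to_dict()
```

The control and precision come from everything being ordinary Python: the result is a plain data structure you can test, script, and version.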
Python
fromTreehouse Blog
1 month ago

Portfolio Projects for Entry-Level Data Roles

Most beginner data portfolios look similar. They include a few cleaned datasets, some charts or dashboards, and a notebook with code and commentary. Again, nothing here is wrong. But hiring teams don't review portfolios to check whether you can follow instructions. They review them to see whether you can think like a data analyst. When projects feel generic, reviewers are left guessing.
Data science
Artificial intelligence
fromInfoQ
2 months ago

Autonomous Big Data Optimization: Multi-Agent Reinforcement Learning to Achieve Self-Tuning Apache Spark

A Q-learning agent autonomously learns and generalizes optimal Spark configurations by discretizing dataset features and combining with Adaptive Query Execution for superior performance.
Software development
fromInfoQ
1 month ago

Are You Missing a Data Frame? The Power of Data Frames in Java

DataFrames and data-oriented programming promote modeling immutable data separately from behavior, making Java suitable for DataFrame-style data manipulation comparable to Python.
Business intelligence
fromInfoWorld
2 months ago

Google tests BigQuery feature to generate SQL queries from English

Google allows natural language expressions inside SQL comments to speed translation of intent into executable queries, reducing query-writing time and easing analytics workflows.
fromMedium
2 months ago

How I Fixed a Critical Spark Production Performance Issue (and Cut Runtime by 70%)

"The job didn't fail. It just... never finished." That was the worst part. No errors.No stack traces.Just a Spark job running forever in production - blocking downstream pipelines, delaying reports, and waking up-on-call engineers at 2 AM. This is the story of how I diagnosed a real Spark performance issue in production and fixed it drastically, not by adding more machines - but by understanding Spark properly.
Data science
fromMedium
2 months ago

The Complete Guide to Optimizing Apache Spark Jobs: From Basics to Production-Ready Performance

Optimize Spark jobs by using lazy evaluation awareness, early filter and column pruning, partition pruning, and appropriate join strategies to minimize shuffles and I/O.
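Two of those ideas, lazy evaluation and filtering early, can be illustrated without a Spark cluster using plain Python generators. A sketch of the principle, not Spark itself:

```python
# Spark's lazy evaluation and "filter early" advice, illustrated with
# generators: nothing runs until a terminal operation, and filtering
# before the expensive step avoids wasted work on discarded rows.

calls = {"expensive": 0}

def expensive_transform(x: int) -> int:
    calls["expensive"] += 1
    return x * x

rows = range(10)

# Filter first, then transform: the expensive step only touches survivors.
pipeline = (expensive_transform(x) for x in rows if x % 2 == 0)

# No work has happened yet (lazy, like Spark transformations).
assert calls["expensive"] == 0

result = list(pipeline)  # the "action" that triggers execution
```

In Spark the same ordering matters at cluster scale: pushing filters and column pruning ahead of wide transformations shrinks the data that shuffles move.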
fromMedium
1 month ago

Why "Data Scientist" is Becoming "AI Engineer" and What That Actually Means

The title "data scientist" is quietly disappearing from job postings, internal org charts, and LinkedIn headlines. In its place, roles like "AI engineer," "applied AI engineer," and "machine learning engineer" are becoming the norm. This Data Scientist vs AI Engineer shift raises an important question for practitioners and leaders alike: what actually changes when a data scientist becomes an AI engineer, and what stays the same? More importantly, what skills matter if you want to make this transition intentionally rather than by accident?
Artificial intelligence
#python
fromTechzine Global
1 month ago

Databricks makes serverless Postgres service Lakebase available

Databricks today announced the general availability of Lakebase on AWS, a new database architecture that separates compute and storage. The managed serverless Postgres service is designed to help organizations build faster without worrying about infrastructure management. When databases link compute and storage, every query must use the same CPU and memory resources. This can cause a single heavy query to affect all other operations. By separating compute and storage, resources automatically scale with the actual load.
Software development
Artificial intelligence
fromMedium
2 months ago

Extracting AI-Ready Data From Organizational Documents

Poor document extraction corrupts retrieval; preserving document structure at ingestion produces reliable embeddings and trustworthy RAG outputs.
Data science
fromComputerworld
2 months ago

Great R packages for data import, wrangling, and visualization

A set of R packages (dplyr, purrr, readr/vroom, datapasta, Hmisc) streamline data wrangling, importing, and analysis with faster, standardized, and reproducible tools.
#instructed-retriever
fromInfoWorld
2 months ago
Artificial intelligence

Databricks says its Instructed Retriever offers better AI answers than RAG in the enterprise

Data science
fromCIO
2 months ago

5 perspectives on modern data analytics

Data/business analytics is the top IT investment priority, yet analytics projects often fail due to poor data, vague objectives, and one-size-fits-all solutions.
Artificial intelligence
fromFortune
2 months ago

Want to get AI agents to work better? Improve how they retrieve data, Databricks says | Fortune

Engineering complete AI-agent workflows and providing access to correct information are essential for moving AI agents beyond pilot phase.
Data science
fromInfoQ
1 month ago

Beyond the Warehouse: Why BigQuery Alone Won't Solve Your Data Problems

Data warehouses like BigQuery perform well initially but become slow, costly, and disorganized at scale, undermining low-latency operational use and innovation.
fromTechzine Global
1 month ago

Databricks shows how AI strengthens the SaaS model

The rise of generative AI is often seen as an existential threat to the SaaS model. Interfaces would disappear, software would fade away, and existing players would become irrelevant. However, new figures from Databricks paint a different picture. Rather than undermining SaaS, AI appears to be increasing its use. This week, Databricks reported a revenue run rate of $5.4 billion, a 65 percent year-on-year increase. More than a quarter of that now comes from AI-related products.
Artificial intelligence
Data science
fromInfoWorld
1 month ago

Snowflake debuts Cortex Code, an AI agent that understands enterprise data context

Cortex Code enables developers to use natural language to build, optimize, and deploy governed, production-ready data pipelines, analytics, ML workloads, and AI agents.
Artificial intelligence
fromTechzine Global
1 month ago

Snowflake launches Cortex Code agent for understanding data context

Cortex Code is an AI agent that converts complex data engineering, ML, and analytics tasks into natural-language workflows integrated into Snowflake and developer tools.
Artificial intelligence
fromInfoWorld
2 months ago

Teradata unveils enterprise AgentStack to push AI agents into production

Teradata positions Enterprise AgentStack as a vendor-agnostic execution layer across hybrid environments, contrasting platform-tied AI approaches from Snowflake and Databricks.
Data science
fromComputerworld
2 months ago

Tableau re-engineers dashboards, adds new analytics tools for business analysts

Tableau 2022.3 adds Data Guide and Table Extension, dynamic dashboards, event auditing, and performance/cost optimization to simplify self-service analytics for business users.
Artificial intelligence
fromInfoQ
2 months ago

Google BigQuery Adds SQL-Native Managed Inference for Hugging Face Models

BigQuery lets data teams deploy and run Hugging Face or Vertex AI open models with plain SQL, auto-provisioning compute and managing endpoints.
Data science
fromMedium
2 months ago

Migrating from Historical Batch Processing to Incremental CDC Using Apache Iceberg (Glue 4...

Use Apache Iceberg Copy-on-Write tables in AWS Glue 4 to migrate from full historical batch reprocessing to incremental CDC, reducing redundant computation, I/O, and costs.
fromInfoWorld
1 month ago

Databricks adds MemAlign to MLflow to cut cost and latency of LLM evaluation

By replacing repeated fine‑tuning with a dual‑memory system, MemAlign reduces the cost and instability of training LLM judges, offering faster adaptation to new domains and changing business policies. Databricks' Mosaic AI Research team has added a new framework, MemAlign, to MLflow, its managed machine learning and generative AI lifecycle development service. MemAlign is designed to help enterprises lower the cost and latency of training LLM-based judges, in turn making AI evaluation scalable and trustworthy enough for production deployments.
Artificial intelligence
fromInfoQ
2 months ago

Why Most Machine Learning Projects Fail to Reach Production

Most ML projects fail to reach production because of problem choice, data/labeling issues, model-to-product gaps, offline-online mismatches, and non-technical blockers.
fromInfoQ
1 month ago

Building Embedding Models for Large-Scale Real-World Applications

What happens under the hood? How is the search engine able to take that simple query and look for images among the billions, even trillions, of images available online? How is it able to find this one, or similar photos, from all of that? Usually, there is an embedding model doing this work under the hood.
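The core of that lookup is vector similarity: encode query and items as vectors, then rank by cosine similarity. A sketch with made-up vectors; real systems use a learned embedding model and an approximate nearest-neighbor index:

```python
# Embedding-based search in miniature: rank items by cosine similarity
# to a query vector (vectors here are invented for illustration).

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

index = {
    "sunset_beach.jpg": [0.9, 0.1, 0.0],
    "city_night.jpg": [0.1, 0.9, 0.2],
    "beach_dunes.jpg": [0.8, 0.2, 0.1],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "photo of a beach"

best = max(index, key=lambda name: cosine(index[name], query))
```

At billions of items, exhaustive scoring like this is replaced by approximate indexes (HNSW, IVF), but the ranking criterion is the same.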
Artificial intelligence