#reliability-engineering

DevOps
from InfoQ
4 days ago

Failure As a Means to Build Resilient Software Systems: A Conversation with Lorin Hochstein

Studying software failures can improve software architecture and reliability engineering practices.
DevOps
from InfoQ
3 weeks ago

Change as Metrics: Measuring System Reliability Through Change Delivery Signals

System changes cause 60-80% of production incidents, which makes change-related metrics first-class reliability signals, in line with DORA framework principles.
from InfoQ
3 months ago

QConAI NY 2025 - Designing AI Platforms for Reliability: Tools for Certainty, Agents for Discovery

Erickson's central message was that reliability comes from combining probabilistic components with deterministic boundaries. He argued that agentic AI becomes more interesting when it is treated as a layer over real operational systems rather than a replacement for them. The model can interpret questions, retrieve evidence, classify situations, and suggest actions. Deterministic systems execute the actions, enforce the constraints, and provide the telemetry that allows the whole loop to be evaluated.
Artificial intelligence
Software development
from InfoQ
3 months ago

AWS Debuts "DevOps Agent" to Automate Incident Response and Improve System Reliability

AWS DevOps Agent is an autonomous, always-on on-call engineer that integrates with observability, deployment, and ticketing tools to automate incident response and improve reliability.
Software development
from InfoQ
5 months ago

From Grassroots to Enterprise: Vanguard's Journey in SRE Transformation

Vanguard built an enterprise SRE program from minimal resources into an organization-wide job family, emphasizing performance, resilience, coaching, and technical solutions.