In complex data platforms powered by AWS Glue, AppFlow, Airflow, and Step Functions, operational issues are inevitable. This talk shares how we built a self-healing system that automatically detects issues, tags them with context, uses Bedrock-based LLMs to suggest resolutions, and keeps stakeholders informed via GitLab and Slack. The result: a 40% drop in incident volume and significantly faster resolution times.
Key takeaways: • Using automation + AI for smarter incident workflows • Practical use of Bedrock in a data engineering context • Bridging operations and business with structured communication