Rajesh Kumar
(Senior DevOps Manager & Principal Architect)
Rajesh Kumar is an award-winning academician and consultant trainer with 15+ years of experience in skill management and more than a decade of experience training large, diverse groups across multiple industry sectors.
Reliability is the probability that a system will produce correct outputs up to some given time t.
Source: E.J. McCluskey & S. Mitra (2004), "Fault Tolerance," in Computer Science Handbook, 2nd ed., ed. A.B. Tucker, CRC Press.
Highly reliable data pipelines
- Architecture
- Monitoring
- Failure handling
Historical metric queries
Our big data platform architecture
Pick the best hardware for each job
Scale up/down clusters
- If we are behind.
- Scale as we grow.
- No more waiting on loaded clusters (see the scaling sketch below).
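A minimal sketch of what the scale-up/scale-down knob could look like on EMR with boto3, resizing the TASK instance group when the pipeline falls behind; the cluster ID, region, and target count are placeholders, not Datadog's actual tooling.

```python
# Hedged sketch: resize an EMR TASK instance group when the pipeline is behind.
# Cluster ID, region and target count are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

def scale_task_group(cluster_id: str, target_count: int) -> None:
    """Resize the TASK instance group of the cluster to target_count nodes."""
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{
            "InstanceGroupId": task_group["Id"],
            "InstanceCount": target_count,
        }],
    )

# Grow while catching up, shrink again once the backlog is cleared, e.g.:
# scale_task_group("j-XXXXXXXXXXXXX", 50)
```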
Safer upgrades of EMR/Hadoop/Spark
How can we build highly reliable data pipelines with instances killed randomly all the time?
No long-running jobs
- The longer the job, the more work you lose on average.
- The longer the job, the longer it takes to recover.
Break down jobs into smaller pieces
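A minimal sketch of the chunking idea, assuming a day of data can be processed as independent time slices: each slice is its own short job, so a lost instance costs at most one slice and recovery only reruns that slice. The process_chunk stub and the chunk size are illustrative.

```python
# Hedged sketch: run one day's pipeline as independent short chunks instead of
# one long job. The process_chunk body is a stub for the real Spark/Hadoop job.
from datetime import datetime, timedelta

def process_chunk(start: datetime, end: datetime) -> None:
    # Placeholder: read input for [start, end), compute, write to a chunk-specific path.
    print(f"processing {start:%Y-%m-%dT%H} -> {end:%Y-%m-%dT%H}")

def run_day_in_chunks(day: datetime, chunk_hours: int = 3) -> None:
    """Process a full day as independent chunks no longer than chunk_hours each."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    for offset in range(0, 24, chunk_hours):
        chunk_start = start + timedelta(hours=offset)
        process_chunk(chunk_start, chunk_start + timedelta(hours=chunk_hours))
        # A failure here only loses this chunk; earlier chunks are already safe.

run_day_in_chunks(datetime(2016, 1, 1))
```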
Lessons
- Many clusters for better isolation.
- Break down jobs into pieces (no longer than ~3 hours).
- Trade-off between performance and fault tolerance.
Highly reliable data pipelines
- Architecture
- Monitoring
- Failure handling
Reliability is the probability that a system will produce correct outputs up to some given time t.
Monitoring data pipelines
1. Is the data pipeline going to finish before the deadline?
We actively monitor three types of metrics:
- Data lag metrics (see the sketch below).
- Cluster health metrics.
- Job health metrics.
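One way a data lag metric could be tracked, assuming DogStatsD as the transport: report how far behind real time the newest fully processed data is, then alert when the lag gets close to the deadline. The metric name, tag, and watermark source are assumptions for illustration.

```python
# Hedged sketch: report a "data lag" gauge (seconds between now and the newest
# fully processed data). Metric name, tag and watermark source are illustrative.
import time
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

def report_data_lag(pipeline: str, last_processed_ts: float) -> None:
    """Emit how far behind real time the pipeline currently is."""
    lag_seconds = time.time() - last_processed_ts
    statsd.gauge("pipeline.data_lag_seconds", lag_seconds,
                 tags=[f"pipeline:{pipeline}"])

# Example: the rollups pipeline has processed everything up to 10 minutes ago.
report_data_lag("rollups", time.time() - 600)
```

A monitor on a gauge like this answers the "will we finish before the deadline?" question directly and gives an actionable alert.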
Monitoring data pipelines
2. Is the data produced correct?
- Add custom counters throughout the pipelines (see the sketch after this list).
- Count records.
- Count duplicates.
- Count records that can’t join.
- Ad hoc checks on the output data.
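A hedged sketch of the custom-counter idea on Spark, using accumulators to count records as the job runs and a distinct count to estimate duplicates; the counter names and toy input are made up, and the real pipelines would push these counts to the monitoring system.

```python
# Hedged sketch: custom counters inside a Spark job -- total records plus a
# duplicate estimate -- for sanity-checking the output. Toy input, made-up names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-counters").getOrCreate()
sc = spark.sparkContext

# Stand-in for the real input records read from S3/HDFS.
records = sc.parallelize([("a", 1), ("a", 1), ("b", 2), ("c", 3)])

total = sc.accumulator(0)
records.foreach(lambda _: total.add(1))      # counted while the job runs

distinct_count = records.distinct().count()  # duplicates = total - distinct
duplicates = total.value - distinct_count

print(f"records={total.value} distinct={distinct_count} duplicates={duplicates}")
# These counts would be emitted as metrics and alerted on if they drift.
```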
Lessons
- Monitoring = will we finish before t? + is the data correct?
- Measure, measure and measure!
- Alert on meaningful and actionable metrics.
Highly reliable data pipelines
- Architecture
- Monitoring
- Failure handling
Data pipelines will break
1. Recover fast
We want to fix the issues ASAP.
2. Degrade gracefully
We want to limit the customer-facing impact.
Recover fast
- No long-running jobs.
- Switch from spot to on-demand clusters (see the sketch below).
- Increase cluster size.
- Easy ways to rerun jobs (not always trivial!).
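A hedged sketch of the spot-to-on-demand knob, assuming EMR and boto3: attach an ON_DEMAND task group so the cluster no longer depends on reclaimable spot capacity while it recovers. Cluster ID, instance type, and count are placeholders.

```python
# Hedged sketch: add on-demand task nodes to an EMR cluster when spot capacity
# keeps getting reclaimed. IDs, instance type and count are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

def add_on_demand_capacity(cluster_id: str, instance_type: str, count: int) -> None:
    """Attach an ON_DEMAND task group so recovery does not depend on spot."""
    emr.add_instance_groups(
        JobFlowId=cluster_id,
        InstanceGroups=[{
            "Name": "recovery-on-demand-task",
            "Market": "ON_DEMAND",
            "InstanceRole": "TASK",
            "InstanceType": instance_type,
            "InstanceCount": count,
        }],
    )

# add_on_demand_capacity("j-XXXXXXXXXXXXX", "m4.2xlarge", 20)
```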
Example: rerun the rollups pipeline
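A hedged sketch of what an easy rerun could look like for a rollups-style pipeline: re-submit the jobs for a time window, chunk by chunk, assuming each chunk writes to an idempotent (overwritable) output location so reruns are safe. The function, flags, and time format are hypothetical.

```python
# Hedged sketch: rerun a rollups-style pipeline over a time window, one chunk at
# a time. Assumes chunk outputs are idempotent so a rerun simply overwrites them.
import argparse
from datetime import datetime, timedelta

def rerun_rollups(start: datetime, end: datetime, chunk_hours: int = 1) -> None:
    """Re-submit one rollup job per chunk in [start, end)."""
    t = start
    while t < end:
        chunk_end = min(t + timedelta(hours=chunk_hours), end)
        # Placeholder for the real submission (spark-submit, workflow scheduler, ...).
        print(f"resubmitting rollups for {t:%Y-%m-%dT%H} -> {chunk_end:%Y-%m-%dT%H}")
        t = chunk_end

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Rerun the rollups pipeline")
    parser.add_argument("--start", required=True, help="e.g. 2016-01-01T00")
    parser.add_argument("--end", required=True, help="e.g. 2016-01-01T06")
    args = parser.parse_args()
    fmt = "%Y-%m-%dT%H"
    rerun_rollups(datetime.strptime(args.start, fmt), datetime.strptime(args.end, fmt))
```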
Lessons
- Think about potential issues ahead of time.
- Have knobs ready to recover fast.
- Have knobs ready to limit the customer-facing impact.
Conclusion
Building highly reliable data pipelines
- Know your time constraints.
- Break down jobs into small survivable pieces.
- Monitor cluster metrics, job metrics and data lags.
- Think about failures ahead of time and get prepared.
Thanks!
We’re hiring!
qf@datadoghq.com
https://jobs.datadoghq.com
Any Questions?
Thank You!
DevOpsSchool — Let's Learn, Share & Practice DevOps
www.devopsschool.com
Connect with us: contact@devopsschool.com | +91 700 483 5930