Rajesh Kumar
(Senior DevOps Manager & Principal Architect)
Rajesh Kumar is an award-winning academician and consultant trainer with 15+ years of experience in skill management and more than a decade of experience training large, diverse groups across multiple industry sectors.
Reliability is the probability that a system will produce correct outputs up to some given time t.
Source: E.J. McCluskey & S. Mitra (2004), "Fault Tolerance," in Computer Science Handbook, 2nd ed., ed. A.B. Tucker, CRC Press.
Highly reliable data pipelines
- Architecture
- Monitoring
- Failure handling
Historical metric queries
Our big data platform architecture
Pick the best hardware for each job
Scale up/down clusters
- If we are behind.
- Scale as we grow.
- No more waiting on loaded clusters (see the scaling sketch below).
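A minimal sketch of what the scale-up/scale-down knob could look like on EMR with boto3, resizing the TASK instance group when the pipeline falls behind; the cluster ID, region, and target count are placeholders, not Datadog's actual tooling.

```python
# Hedged sketch: resize an EMR TASK instance group when the pipeline is behind.
# Cluster ID, region and target count are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

def scale_task_group(cluster_id: str, target_count: int) -> None:
    """Resize the TASK instance group of the cluster to target_count nodes."""
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{
            "InstanceGroupId": task_group["Id"],
            "InstanceCount": target_count,
        }],
    )

# Grow while catching up, shrink again once the backlog is cleared, e.g.:
# scale_task_group("j-XXXXXXXXXXXXX", 50)
```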
Safer upgrades of EMR/Hadoop/Spark
How can we build highly reliable data pipelines with instances killed randomly all the time?
No long-running jobs
- The longer the job, the more work you lose on average.
- The longer the job, the longer it takes to recover.
Break down jobs into smaller pieces
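A minimal sketch of the chunking idea, assuming a day of data can be processed as independent time slices: each slice is its own short job, so a lost instance costs at most one slice and recovery only reruns that slice. The process_chunk stub and the chunk size are illustrative.

```python
# Hedged sketch: run one day's pipeline as independent short chunks instead of
# one long job. The process_chunk body is a stub for the real Spark/Hadoop job.
from datetime import datetime, timedelta

def process_chunk(start: datetime, end: datetime) -> None:
    # Placeholder: read input for [start, end), compute, write to a chunk-specific path.
    print(f"processing {start:%Y-%m-%dT%H} -> {end:%Y-%m-%dT%H}")

def run_day_in_chunks(day: datetime, chunk_hours: int = 3) -> None:
    """Process a full day as independent chunks no longer than chunk_hours each."""
    start = day.replace(hour=0, minute=0, second=0, microsecond=0)
    for offset in range(0, 24, chunk_hours):
        chunk_start = start + timedelta(hours=offset)
        process_chunk(chunk_start, chunk_start + timedelta(hours=chunk_hours))
        # A failure here only loses this chunk; earlier chunks are already safe.

run_day_in_chunks(datetime(2016, 1, 1))
```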
Lessons
- Many clusters for better isolation.
- Break down jobs into pieces (no longer than ~3 hours).
- Trade-off between performance and fault tolerance.
Highly reliable data pipelines
- Architecture
- Monitoring
- Failure handling
Reliability is the probability that a system will produce correct outputs up to some given time t.
Monitoring data pipelines
1. Is the data pipeline going to finish before the deadline?
We actively monitor three types of metrics:
- Data lag metrics (see the sketch below).
- Cluster health metrics.
- Job health metrics.
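One way a data lag metric could be tracked, assuming DogStatsD as the transport: report how far behind real time the newest fully processed data is, then alert when the lag gets close to the deadline. The metric name, tag, and watermark source are assumptions for illustration.

```python
# Hedged sketch: report a "data lag" gauge (seconds between now and the newest
# fully processed data). Metric name, tag and watermark source are illustrative.
import time
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

def report_data_lag(pipeline: str, last_processed_ts: float) -> None:
    """Emit how far behind real time the pipeline currently is."""
    lag_seconds = time.time() - last_processed_ts
    statsd.gauge("pipeline.data_lag_seconds", lag_seconds,
                 tags=[f"pipeline:{pipeline}"])

# Example: the rollups pipeline has processed everything up to 10 minutes ago.
report_data_lag("rollups", time.time() - 600)
```

A monitor on a gauge like this answers the "will we finish before the deadline?" question directly and gives an actionable alert.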
Monitoring data pipelines
2. Is the data produced correct?
- Add custom counters throughout the pipelines (see the sketch after this list).
- Count records.
- Count duplicates.
- Count records that can’t join.
- Ad hoc checks on the output data.
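A hedged sketch of the custom-counter idea on Spark, using accumulators to count records as the job runs and a distinct count to estimate duplicates; the counter names and toy input are made up, and the real pipelines would push these counts to the monitoring system.

```python
# Hedged sketch: custom counters inside a Spark job -- total records plus a
# duplicate estimate -- for sanity-checking the output. Toy input, made-up names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-counters").getOrCreate()
sc = spark.sparkContext

# Stand-in for the real input records read from S3/HDFS.
records = sc.parallelize([("a", 1), ("a", 1), ("b", 2), ("c", 3)])

total = sc.accumulator(0)
records.foreach(lambda _: total.add(1))      # counted while the job runs

distinct_count = records.distinct().count()  # duplicates = total - distinct
duplicates = total.value - distinct_count

print(f"records={total.value} distinct={distinct_count} duplicates={duplicates}")
# These counts would be emitted as metrics and alerted on if they drift.
```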
Lessons
- Monitoring = will we finish before t? + is the data correct?
- Measure, measure and measure!
- Alert on meaningful and actionable metrics.
Highly reliable data pipelines
- Architecture
- Monitoring
- Failure handling
Data pipelines will break
1. Recover fast
We want to fix the issues ASAP.
2. Degrade gracefully
We want to limit the customer-facing impact.
Recover fast
- No long-running jobs.
- Switch from spot to on-demand clusters (see the sketch below).
- Increase cluster size.
- Easy ways to rerun jobs (not always trivial!).
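A hedged sketch of the spot-to-on-demand knob, assuming EMR and boto3: attach an ON_DEMAND task group so the cluster no longer depends on reclaimable spot capacity while it recovers. Cluster ID, instance type, and count are placeholders.

```python
# Hedged sketch: add on-demand task nodes to an EMR cluster when spot capacity
# keeps getting reclaimed. IDs, instance type and count are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

def add_on_demand_capacity(cluster_id: str, instance_type: str, count: int) -> None:
    """Attach an ON_DEMAND task group so recovery does not depend on spot."""
    emr.add_instance_groups(
        JobFlowId=cluster_id,
        InstanceGroups=[{
            "Name": "recovery-on-demand-task",
            "Market": "ON_DEMAND",
            "InstanceRole": "TASK",
            "InstanceType": instance_type,
            "InstanceCount": count,
        }],
    )

# add_on_demand_capacity("j-XXXXXXXXXXXXX", "m4.2xlarge", 20)
```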
Example: rerun the rollups pipeline
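A hedged sketch of what an easy rerun could look like for a rollups-style pipeline: re-submit the jobs for a time window, chunk by chunk, assuming each chunk writes to an idempotent (overwritable) output location so reruns are safe. The function, flags, and time format are hypothetical.

```python
# Hedged sketch: rerun a rollups-style pipeline over a time window, one chunk at
# a time. Assumes chunk outputs are idempotent so a rerun simply overwrites them.
import argparse
from datetime import datetime, timedelta

def rerun_rollups(start: datetime, end: datetime, chunk_hours: int = 1) -> None:
    """Re-submit one rollup job per chunk in [start, end)."""
    t = start
    while t < end:
        chunk_end = min(t + timedelta(hours=chunk_hours), end)
        # Placeholder for the real submission (spark-submit, workflow scheduler, ...).
        print(f"resubmitting rollups for {t:%Y-%m-%dT%H} -> {chunk_end:%Y-%m-%dT%H}")
        t = chunk_end

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Rerun the rollups pipeline")
    parser.add_argument("--start", required=True, help="e.g. 2016-01-01T00")
    parser.add_argument("--end", required=True, help="e.g. 2016-01-01T06")
    args = parser.parse_args()
    fmt = "%Y-%m-%dT%H"
    rerun_rollups(datetime.strptime(args.start, fmt), datetime.strptime(args.end, fmt))
```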
Lessons
- Think about potential issues ahead of time.
- Have knobs ready to recover fast.
- Have knobs ready to limit the customer-facing impact.
Conclusion
Building highly reliable data pipelines
- Know your time constraints.
- Break down jobs into small survivable pieces.
- Monitor cluster metrics, job metrics and data lags.
- Think about failures ahead of time and get prepared.
Thanks!
We’re hiring!
qf@datadoghq.com
https://jobs.datadoghq.com
Any Questions?
Thank You!
DevOpsSchool — Let's Learn, Share & Practice DevOps
www.devopsschool.com
Connect with us: contact@devopsschool.com | +91 700 483 5930