Step 1 – Spinning Up Our Legacy E-Commerce Shop
Our legacy monolith shop uses Ruby on Rails and Spree. We’ve started to build out a first set of microservices, and these have been added to an initial set of containers. We use docker-compose to bring everything up and running. There’s a prebuilt Rails Docker container image, along with the new Python / Flask microservices that handle the coupon codes and ads displayed in the store.
In this workshop, we’re going to spin up and instrument our application to see where things are broken, and next, find a few bottlenecks.
We’ll focus on thinking through what observability might make sense in a real application, and see how setting up observability works in practice.
In this scenario, our application has already been cloned from GitHub. If we change into its directory, we should be able to start the code with the following:
$ cd /ecommerce-observability
$ POSTGRES_USER=postgres POSTGRES_PASSWORD=postgres docker-compose up
Once our images are pulled, we should be able to jump in and view the application within Katacoda:
https://2e652dae321844ddb7f22fd05609a510-167772165-3000-ollie02.environments.katacoda.com/
Try browsing around, and notice the homepage takes an especially long time to load.
The first thing we’ll do is see where that slow load time may be coming from.
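If you’d like a rough baseline number before we add any instrumentation, curl can time that homepage request from the terminal (assuming the frontend is exposed locally on port 3000, as in the Katacoda URL above):
$ curl -s -o /dev/null -w "homepage load: %{time_total}s\n" http://localhost:3000/
Keep that number in mind; we’ll want to compare against it once we’ve found and fixed the bottlenecks.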
Step 2 – How to Grok an Application with Datadog
When working with new code, it can be daunting to understand a system and how everything interacts. Our Datadog instrumentation gives us immediate insight into what’s going on with the code.
Let’s add the Datadog Agent to our docker-compose.yml and begin instrumenting our application:
  agent:
    image: "datadog/agent:6.13.0"
    environment:
      - DD_API_KEY
      - DD_APM_ENABLED=true
      - DD_LOGS_ENABLED=true
      - DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL=true
      - DD_PROCESS_AGENT_ENABLED=true
      - DD_TAGS='env:ruby-shop'
    ports:
      - "8126:8126"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    labels:
      com.datadoghq.ad.logs: '[{"source": "datadog-agent", "service": "agent"}]'
With this, we’ve added volumes so the Agent can see resource usage on the host, along with the Docker socket so it can discover the containers running there. We’ve also added a DD_API_KEY, and enabled logs and the process Agent. Finally, we’ve opened port 8126, where traces are shipped for collection by the Agent.
We can now rerun our application, this time with our DD_API_KEY set, using the following commands:
$ export DD_API_KEY=<YOUR_API_KEY>
$ POSTGRES_USER=postgres POSTGRES_PASSWORD=postgres docker-compose up
And with that, we should start to see info coming into Datadog.
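If nothing shows up after a minute or so, a quick sanity check is to tail the Agent container’s logs and confirm it started cleanly and is listening for traces on port 8126:
$ docker-compose logs -f agent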
Step 3 – APM Automatic Instrumentation with Rails
Our code has already been set up with instrumentation from Datadog.
Depending on the language your application runs in, you may have a different process for instrumenting your code. It’s best to look at the documentation for your specific language.
In our case, our applications run on Ruby on Rails and Python’s Flask.
We’ll instrument each language differently.
Installing the APM Language Library
For Ruby on Rails, we first need to add the ddtrace gem to our Gemfile. Take a look at store-frontend/Gemfile in the Katacoda file explorer and notice we’ve added the gem so we can start shipping traces. Because we also plan on consuming logs from Rails and correlating them with traces, we’ve added logging-rails and lograge as well. Both of these are covered in the Ruby trace / logs correlation section of the documentation.
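For reference, the relevant Gemfile entries look roughly like this (a sketch; the exact versions and options are in the repo’s store-frontend/Gemfile):
# store-frontend/Gemfile (excerpt)
gem 'ddtrace'        # Datadog tracing library
gem 'logging-rails'  # structured logging for Rails
gem 'lograge'        # single-line request logs we can enrich with trace IDs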
Once these are both added to our application’s dependencies, we then add a datadog.rb to the list of initializers. You’ll find the file in store-frontend/config/initializers/.
There, we control a few settings:
Datadog.configure do |c|
  # This will activate auto-instrumentation for Rails
  c.use :rails, {'analytics_enabled': true, 'service_name': 'store-frontend'}
  # Make sure requests are also instrumented
  c.use :http, {'analytics_enabled': true, 'service_name': 'store-frontend'}
  c.tracer hostname: 'agent'
end
We set analytics_enabled to true for both our Rails auto-instrumentation and the http instrumentation. This allows us to use Trace Search and Analytics from within Datadog.
We then set a hostname for all our traces to be sent to. Because we configured the Datadog Agent to listen on port 8126, we point this at the agent service name, which resolves inside our docker-compose network.
Finally, we set an environment for our traces (env:ruby-shop, supplied here through the Agent’s DD_TAGS setting). This allows us to separate different environments, for example staging and production.
With this, our Ruby application is instrumented. We’re also able to continue traces downstream into our other services, using distributed tracing.
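As a rough illustration of what that looks like in practice: when the frontend calls a downstream service over Net::HTTP, the :http integration we enabled above can inject trace headers so the Python services’ spans join the same trace. The hostname and path below are placeholders, not taken from the repo:
# hypothetical controller code calling the ads service
require 'net/http'
require 'json'

uri = URI('http://ads-service:5002/banners')   # placeholder host and path
response = Net::HTTP.get_response(uri)         # traced by the :http integration
@ads = JSON.parse(response.body) if response.is_a?(Net::HTTPSuccess)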
Shipping Logs Correlated with Traces
To ship logs to Datadog, we’ve got to ensure they’re converted to JSON format. This allows for filtering by specific parameters within Datadog.
Within our config/development.rb, we see the specific code to ship our logs along with their correlated traces:
config.lograge.custom_options = lambda do |event|
  # Retrieves trace information for current thread
  correlation = Datadog.tracer.active_correlation
  {
    # Adds IDs as tags to log output
    :dd => {
      :trace_id => correlation.trace_id,
      :span_id => correlation.span_id
    },
    :ddsource => ["ruby"],
    :params => event.payload[:params].reject { |k| %w(controller action).include? k }
  }
end
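Note that this block assumes lograge itself is already enabled and emitting JSON in the same file. If you’re adapting this to your own app, that’s typically a couple of extra lines like the following (a sketch of standard lograge settings, not copied from the repo):
# config/development.rb (excerpt)
config.lograge.enabled = true
config.lograge.formatter = Lograge::Formatters::Json.new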
Next, let’s look at how a Python application is instrumented.
Step 4 – APM Automatic Instrumentation with Python
Now that we’ve set up our Ruby on Rails application, we can instrument our downstream Python services.
Looking at the documentation for the Python tracer, we have a utility called ddtrace-run. Wrapping our Python executable in ddtrace-run lets us spin up a running instance of our application fully instrumented with the tracer. For supported frameworks like Flask, ddtrace-run dramatically simplifies the process of instrumentation.
Instrumenting the Advertisements Service
In our docker-compose.yml, there’s a command to bring up our Flask server. If we look, we’ll see it’s:
flask run --port=5002 --host=0.0.0.0
Once we’ve installed the Python ddtrace package by adding it to our requirements.txt (it should already be there), we edit this command by putting ddtrace-run in front:
ddtrace-run flask run --port=5002 --host=0.0.0.0
With this, we’re ready to configure our application’s instrumentation.
Automatic instrumentation is done via environment variables in our docker-compose.yml:
- DATADOG_SERVICE_NAME=advertisements-service
- DATADOG_TRACE_AGENT_HOSTNAME=agent
- DD_LOGS_INJECTION=true
- DD_ANALYTICS_ENABLED=true
With this, we’ve instrumented all of our services and connected them to APM.
The last thing we need to add is a label to our container, so our logs are shipped with the label of the service, and with the proper language processor:
labels:
  com.datadoghq.ad.logs: '[{"source": "python", "service": "ads-service"}]'
We can repeat the process and fill out the settings for the discounts-service:
  discounts:
    environment:
      - FLASK_APP=discounts.py
      - FLASK_DEBUG=1
      - POSTGRES_PASSWORD
      - POSTGRES_USER
      - DATADOG_SERVICE_NAME=discounts-service
      - DATADOG_TRACE_AGENT_HOSTNAME=agent
      - DD_LOGS_INJECTION=true
      - DD_ANALYTICS_ENABLED=true
    image: "burningion/ecommerce-spree-discounts:latest"
    command: ddtrace-run flask run --port=5001 --host=0.0.0.0
    ports:
      - "5001:5001"
    volumes:
      - "./discounts-service:/app"
    depends_on:
      - agent
      - db
    labels:
      com.datadoghq.ad.logs: '[{"source": "python", "service": "discounts-service"}]'
Next, let’s take a closer look at why and where our application may be failing.
Step 5 – Debugging Our Application with APM
Now that we’ve instrumented all of our code, let’s spin up some traffic so we can get a better look at what may be happening.
Spinning up Traffic for Our Site
In our /ecommerce-observability folder, we’ve got a copy of GoReplay, along with a capture of traffic taken with it. Let’s spin up an infinite loop of that traffic:
$ ./gor --input-file-loop --input-file requests_0.gor --output-http "http://localhost:3000"
Once we spin up that traffic, our new observability lets us take a look at the issues we’ve come across since the new team rolled out their first microservice, the advertisements-service.
Before we began instrumenting with Datadog, there had been reports that the new advertisements-service broke the website. With the new deployment on staging, the frontend team has blamed the ads-service team, and the advertisements-service team has blamed the ops team.
Now that we’ve got Datadog and APM instrumented in our code, let’s see what’s really been breaking our application.
Debugging an Application with Datadog
The first place we can check is the Service Map, to get an idea for our current infrastructure and microservice dependencies.
In doing so, we can tell that we’ve got two microservices that our frontend calls: a discounts-service, along with an advertisements-service.
If we click in to view our Service Overview in Datadog, we can see that our API itself isn’t throwing any errors. The errors must be happening on the frontend.
So let’s take a look at the frontend service, and see if we can find the spot where things are breaking.
If we look into the service, we can see that it’s been laid out by views. There’s at least one view that seems to only give errors. Let’s click into that view and see if a trace from that view can tell us what’s going on.
It seems the problem happens in a template. Let’s get rid of that part of the template so we can get the site back up and running while figuring out what happened.
Open store-frontend/app/views/spree/layouts/spree_application.html.erb and delete the line under <div class="container">. It should begin with a <br /> and end with a </center>.
The banner ads were meant to be placed in store-frontend/app/views/spree/products/show.html.erb and store-frontend/app/views/spree/home/index.html.erb.
For index.html.erb, under <div data-hook="homepage_products">, add the code:
<br /><center><a href="<%= @ads['url'] %>"><img src="data:image/png;base64,<%= @ads['base64'] %>" /></a></center>
And for show.html.erb, at the very bottom, add:
<br /><center><a href="<%= @ads['url'] %>"><img src="data:image/png;base64,<%= @ads['base64'] %>" /></a></center><br />
With that, our project should be up and running. Let’s see if there’s anything else going on.
Step 6 – Spotting and Resolving Bottlenecks with the Service List
With the Service List, we can see at a quick glance which endpoints are running slower than the rest.
If we look at the Frontend Service, we can see there are two endpoints in particular that are substantially slower than the rest.
Both the HomeController#index and the ProductController#show endpoints are showing much longer latencies. If we click in and view a trace, we’ll see that we’ve got downstream microservices taking up a substantial portion of our load time.
Use the span list to see where the time is going, and then take a look at each of the downstream services to find where things may be going wrong.
It seems two microservices in particular are being called for the homepage. If we look into our docker-compose.yml, we can see both the advertisements-service and discounts-service; each is taking over 2.5 seconds per request. Let’s look at their specific URLs to see if there isn’t something amiss.
Looking at the code, it appears we’ve accidentally left in a line from testing what happens when latency goes up.
Try spotting that line and removing it to see if you can bring the application’s latency back down.
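The line to look for is an artificial delay in each service’s request handler. Purely as a hypothetical example of the pattern (this is not the repo’s actual code), it might look something like:
# somewhere in the Flask route handler (hypothetical)
import time
time.sleep(2.5)  # artificial latency left over from testing -- delete this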
What sort of improvement in page load time did it give you? Can you graph the difference over time?
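To put a number on the improvement, you can re-run the curl timing from Step 1 before and after the change, or loop it and compare the averages; the latency graphs for HomeController#index in Datadog APM will show the same drop over time.
$ for i in $(seq 1 10); do curl -s -o /dev/null -w "%{time_total}s\n" http://localhost:3000/; done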
Reference
- https://www.katacoda.com/burningion/scenarios/ecommerce-observability