Scaling Observability: Designing a High-Volume Telemetry Pipeline - Part 4


In Part 3, we explored building scalable telemetry pipelines with agents, batching, Kafka buffering, and backpressure control for resilient observability. Now let’s bring the series home: in this final part we address how to make the entire pipeline horizontally scalable and highly available, explore cost management and retention policies, and examine lessons learned from production deployments.


Horizontal Scaling and High Availability

We’ve implicitly discussed scaling by adding more instances in various parts of the pipeline. Let’s summarize how to achieve horizontal scaling and ensure high availability in each layer:

  • Collectors/Agents: These are usually per node or per service already (scaled with the cluster). For centralized collectors (like a cluster of OTel collectors or Jaeger collectors), you can run them behind a load balancer or use a service discovery mechanism. They are stateless (unless doing tail sampling, which keeps per-trace state in memory – but that’s handled by partitioning). Adding more collector instances increases aggregate throughput. Ensure a load-balancing strategy that keeps related data together where needed (e.g., hash by trace ID as mentioned earlier). For availability, run collectors in multiple zones or use a DaemonSet per node (so if one node dies, it only affects that node’s agent, not others).
  • Kafka/Queue: When creating a Kafka-based pipeline, choose the number of partitions so that each partition’s throughput is within the capacity of a single consumer. Kafka can scale to hundreds or thousands of partitions if needed, though extremely high partition counts add overhead. Kafka offers built-in replication for availability – each partition is replicated to multiple brokers. For high availability, use a replication factor of 3 (or at least 2) so that a broker failure doesn’t lose data, and distribute replicas across availability zones. Consumers and producers will automatically handle broker failovers. You can also rebalance partitions when a broker goes down or new brokers are added (using Kafka’s partition reassignments), keeping load evenly spread. (A minimal topic-sizing sketch appears after this list.)
  • Metrics Backend: Systems like Mimir or Cortex scale by running multiple instances of each component. If ingest throughput grows, add more ingesters (and possibly increase their memory or WAL persistence as needed). Use consistent hashing or shuffle sharding so that new ingesters pick up a share of series from distributors. If query load grows, add more querier and query-frontend instances – these are stateless and easy to scale. High availability is achieved via replication (Cortex/Mimir replicating samples to multiple ingesters), plus having multiple replicas of each service behind a load balancer. Also, run these services across multiple nodes/VMs or Kubernetes pods, so a single machine failure only drops a fraction of capacity. Multi-tenancy support in these systems means one tenant’s heavy usage shouldn’t affect another’s, especially if you configure per-tenant limits and perhaps separate critical metrics into their own tenant.
  • Trace Backend: If using Jaeger with Kafka + Elasticsearch, you scale the Jaeger collectors by adding more instances (they are stateless, reading from Kafka). Scale the Elasticsearch cluster by adding data nodes and shards if write volume increases; also run multiple query-serving instances (Jaeger query services and Elasticsearch coordinating nodes) for front-end throughput. For Grafana Tempo, you scale similarly to Loki/Cortex: multiple ingesters to handle trace ingestion (each writes to object storage) and multiple queriers for reading. Tempo is designed to be highly scalable and cost-efficient by forgoing heavy indexing. Ensure your object storage (e.g. S3) has sufficient throughput (it usually auto-scales). Use overlapping deployments if needed for migrations – e.g. spin up a new set of collectors while the old set is still running, then cut over (to avoid downtime).
  • Log Backend: Elasticsearch clusters require careful scaling – monitor indexing throughput and query latency. If indexing is the bottleneck, consider increasing the number of primary shards or the number of ingest nodes (which do parsing). Too many shards can hurt query performance, so strike a balance. Using Index Lifecycle Management, you can reduce the replica count for older indices to save resources (since they’re queried less). For Loki, scaling means adding more ingesters (which handle incoming log streams and write to storage) and queriers (which fetch and filter log chunks). Loki can also run in a “simple scalable” mode that separates write and read components so you can scale them independently. Both Elasticsearch and Loki benefit from sharding by tenant or source at extremely high volume – for example, run two separate Loki clusters for two distinct environments to halve the load on each (at the cost of having two places to query). High availability: run multiple instances of each component (Elasticsearch master nodes, ingest nodes, query nodes; Loki distributors, ingesters, etc.) so that one instance going down doesn’t stop ingestion or querying.
  • Stream Processing: Flink and Spark are inherently distributed; you scale by increasing parallelism (tasks). Ensure your job can partition the data effectively (e.g. by key) to utilize parallel consumers. Flink jobs can be rescaled by taking a savepoint (snapshot state) and redeploying with more task slots. Always configure periodic checkpoints to durable storage so that if a job fails or needs to move, it can resume without data loss. For HA, deploy Flink with a standby JobManager or use Kubernetes operators that restart jobs on failure. Because Flink maintains state, it’s crucial to plan for fault recovery – e.g., Flink’s checkpointing provides exactly-once guarantees, meaning even if a node fails, on recovery the job will not skip or double-process events (as long as it can restore from the last checkpoint). This ensures your streaming analytics remain correct under failures.
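
To make the partition and replication guidance above concrete, here is a minimal sketch using the confluent-kafka Python client. The broker addresses, topic name, and counts are placeholders, not recommendations; size partitions from your own measured per-consumer throughput.

```python
# Minimal sketch: create a telemetry topic sized for horizontal scaling.
# Broker addresses, topic name, and counts are illustrative placeholders.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092"})

# Roughly one partition per planned consumer instance; replication factor 3 so
# a single broker failure does not lose buffered telemetry.
topic = NewTopic("otel-traces", num_partitions=24, replication_factor=3)

for name, future in admin.create_topics([topic]).items():
    try:
        future.result()  # raises if the brokers rejected the request
        print(f"created topic {name}")
    except Exception as exc:
        print(f"topic {name} not created: {exc}")
```

Producers would then key messages by trace ID (or tenant) so that related data lands in the same partition, as discussed earlier.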

A horizontally scalable design also means no single point of failure (SPOF). Avoid having just one instance of any critical component. For example, a single log forwarder for an entire cluster is a SPOF – instead, run one per node or multiple in a load-balanced set. Use cluster modes for databases and message brokers. For each layer, consider how it can recover from node failures: if an OTel collector pod dies, do agents buffer and retry to another? If a Kafka broker dies, do producers time out and retry correctly to other brokers? Testing these failure scenarios is part of the engineering process.

Statelessness vs. Stateful Partitioning

Wherever possible, keep components stateless. Stateless components (like distributors, collectors, query frontends) can scale freely and recover easily by just restarting (no data to reload). When state is needed (like ingesters caching recent metrics, tail-sampling buffers, or stream job state), design it so that the state is partitioned and/or backed by durable storage. Partitioning ensures that adding or removing nodes only affects a subset of data. For example, if you have 10 ingesters each holding some subset of metric series, losing one means those series fail over to others, which the system can handle if the data was replicated or can be recovered from the WAL. Systems like Cortex use a gossip/consistent-hash ring to manage ingester membership. On failure, series assigned to the lost ingester get taken over by the remaining ones (with perhaps a brief dip in availability for those series until they catch up from the WAL or from replicas). Also configure features like Mimir’s HA tracking (if you have multiple Prometheus servers feeding it the same data) and Loki’s boltdb-shipper/TSDB index so that new ingesters can quickly load index info from shared storage when they join.
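
As a toy illustration of that partitioning idea (not how Cortex/Mimir actually implement their ring), here is a minimal consistent-hash ring in Python: keys such as series labels or trace IDs map to an owner node, and adding or removing an ingester only remaps a fraction of the keys.

```python
# Toy consistent-hash ring; illustrative only, not a production implementation.
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node gets several virtual points on the ring for smoother balancing.
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [h for h, _ in self._ring]

    def owner(self, key: str) -> str:
        # Walk clockwise to the first virtual point at or after the key's hash.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["ingester-1", "ingester-2", "ingester-3"])
print(ring.owner('http_requests_total{job="api",instance="10.0.0.7"}'))
```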

Geo-Distribution

Some large organizations deploy a pipeline per region for performance and then merge results at the query layer (federation). This avoids shipping all telemetry across the globe, but it introduces the complexity of multi-cluster management. If your use case requires global aggregation (e.g. traces spanning regions or logs from multi-region services), you’ll need a pipeline that can either handle cross-region data or replicate data to a central location. Kafka MirrorMaker or similar tools can sync topics across regions for unified processing in one region, at the expense of cross-region bandwidth.

To summarize, design your pipeline so that every component can scale out and has a failover plan.

Embrace loose coupling – components communicate via network and durable storage rather than in-memory assumptions. This way, adding resources or restarting services becomes routine. A properly scaled and load-balanced pipeline will continue running even when individual nodes crash, and will smoothly handle increasing load by consuming more resources rather than collapsing. High-volume observability pipelines at companies like Uber, Netflix, etc. are built on these principles to achieve near-linear scaling with system growth.


Cost Management and Data Retention

While not purely a technical scaling concern, cost control is an essential aspect of running a high-volume telemetry pipeline. Unbounded data growth can lead to runaway expenses (in cloud bills for storage/CPU or on-prem hardware). Here we discuss strategies to manage costs, which often coincide with improving scalability:

  • Retention Policies: Not all telemetry needs to be kept forever. Implement sensible retention schedules: e.g., keep full-resolution data for a short period (say 7-30 days) and then drop or archive it. For metrics, you might retain 15-second data for 2 weeks, then only 5-minute averages for 3 months. Logs might be kept in hot storage for a week, then archived to cheap storage for a few months, then deleted. Traces might only be kept for a few days unless they are for important transactions. By limiting retention, you cap the storage footprint, which directly bounds cost. Ensure compliance requirements are met though – if certain logs must be retained for a year, you’ll need an archive (but possibly not in the high-cost system). Many systems support automated deletion or rollover based on age (e.g., Elasticsearch ILM, Loki retention configs, etc.).
  • Sampling & Filtering: The most effective cost lever is to simply ingest less data. As discussed in previous parts, trace sampling aggressively reduces trace data volume (keeping perhaps <1% of all requests). This cuts storage, network, and processing costs dramatically while usually preserving the most useful information. Log filtering is equally important: do you really need every debug log from every instance? Likely not. Drop logs below a certain level in production, or use dynamic log levels to increase verbosity only when needed. Some teams filter out health-check logs or other chatty but low-value messages early in the pipeline. It’s common to discard known benign errors or redundant info. Even a simple filter like “drop all logs containing ‘database heartbeat’” (if that’s a noisy periodic message) can save gigabytes daily. Another approach is log sampling – if an error message repeats 1000 times, perhaps only store 100 of them and count the rest. We saw earlier an example of tail sampling for traces to keep all errors; similarly for logs, you might set rules to always keep certain log types (errors, warnings) but sample down the info/debug messages. Edge processing solutions can identify repetitive log patterns and forward only unique patterns plus counts, reducing volume significantly. (A filtering-and-sampling sketch appears after this list.)
  • High-Cardinality Data Guardrails: Ensure metrics with unbounded cardinality are controlled. Some systems let you enforce limits like “max 1000 time-series per metric name” – beyond that, new series are dropped. This prevents a runaway metric from OOMing the database (and incurring cost for all that memory/storage). Also watch out for label combinations that multiply out (like combining userID with URL with error code, etc.). You might need to restrict such combinations or split them into separate metrics. High-cardinality metrics or tags often correlate with high cost because they create more database entries and more index data. So addressing cardinality is both a performance and a cost optimization. (A simple guardrail sketch appears after this list.)
  • Tiered Storage (Hot vs Cold): Use expensive storage for recent, frequently-accessed data, and cheaper storage for older or infrequently accessed data. For logs, this could mean using SSD-based Elasticsearch for 7 days, and then shifting to an S3-based data lake or a compressed store for anything older. Some companies use services like AWS S3 Glacier for archival – super cheap to store, though slow to retrieve. If you implement this, also implement a process to retrieve data when needed (perhaps an automated job that can load archived logs into a cluster for analysis if an incident from last quarter needs investigation). For metrics, systems like Thanos or Mimir inherently do tiering by using object storage for older data. For traces, you might dump old traces to files in cloud storage after a few days and remove them from the fast database. This drastically cuts the cost of keeping historical data.
  • Compression and Efficient Encoding: Make sure data is compressed at rest and in transit. Text logs compress 5-10x typically. Metrics use efficient encoders by default (Prometheus TSDB does delta-of-delta encoding and bit packing). Trace data (often JSON) can be heavy; using Protobuf (OTLP) and compressing (gRPC can enable gzip) saves bandwidth. Also, consider not storing duplicate fields repeatedly. For example, if every log line includes the same static metadata, Loki’s approach of labeling captures that once and not on every line saves space. In Elasticsearch, using mappings to not index certain large text fields (store only) can save index size if you rarely search by them. Also, prefer structured logs to raw text parsing – structured data can be stored more compactly (like numeric fields vs strings) and filtered without full text indexing.
  • Optimize Query Costs: If you pay for query execution (like in a cloud logging service or Athena), reduce expensive queries. This is more on the usage side, but e.g. avoid running regex searches over all logs needlessly. Provide saved searches or aggregated metrics to answer common questions instead of ad-hoc expensive queries each time. Some cloud services charge by data scanned – partitioning data well (by time, by service) can limit scan sizes and thus cost.
  • Choosing Open Source vs Managed: Many organizations move to open-source self-hosted solutions (Prometheus, Loki, etc.) because sending all data to a SaaS vendor is cost-prohibitive at scale (vendors often charge per GB ingested, which at TB/day is huge). Self-hosting shifts cost to your cloud or hardware spend, which can be cheaper if optimized. On the other hand, managed services might have more efficient backends or easier scaling (but you pay a premium). A hybrid approach is possible: for critical data, use a paid service with strong analytics; for raw bulk data, use open source internally. For example, you might send high-value business metrics to a SaaS analytics tool, but keep all system metrics in your own cluster.
  • Showback/Chargeback: Within a large enterprise, visibility into who is generating the most telemetry can help control costs. If Team A sees that they produce 10x more logs than anyone else, they might investigate why (maybe a verbose setting left on) or chip in more for the budget. Even if you don’t actually charge teams, showing them their usage encourages optimization. It turns cost into a shared responsibility.
  • Automated Cleanup: Implement auto-deletion for obsolete telemetry data. For instance, if an application is decommissioned, make sure its old metrics and logs are aged out and not kept indefinitely taking up space. The pipeline can include retention enforcement that aggressively drops data from sources that haven’t been seen in a while.
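
To illustrate the filtering and sampling ideas above, here is a small Python sketch of a pipeline-side filter. The record shape, drop list, and sample rate are hypothetical; in practice you would express the same rules in your collector or forwarder configuration (OTel Collector processors, Fluent Bit filters, and so on).

```python
# Hypothetical pipeline-side filter: drop low-value logs early and sample the
# rest, while counting what was suppressed so the loss is still visible.
import random
from collections import Counter

DROP_SUBSTRINGS = ("database heartbeat", "GET /healthz")   # known noise
KEEP_LEVELS = {"ERROR", "WARN"}                             # always keep
INFO_SAMPLE_RATE = 0.10                                     # keep ~10% of INFO/DEBUG
suppressed = Counter()

def should_forward(record: dict) -> bool:
    message = record.get("message", "")
    if any(noise in message for noise in DROP_SUBSTRINGS):
        suppressed["dropped_noise"] += 1
        return False
    if record.get("level") in KEEP_LEVELS:
        return True
    if random.random() < INFO_SAMPLE_RATE:
        return True
    suppressed["sampled_out"] += 1
    return False

print(should_forward({"level": "INFO", "message": "database heartbeat ok"}))  # False
print(dict(suppressed))  # periodically emit these counters as metrics
```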
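
Similarly, a sketch of the cardinality guardrail: a hypothetical ingest-side check that refuses new label combinations once a metric exceeds a configured series budget (backends such as Cortex/Mimir expose this kind of control as per-tenant limits).

```python
# Hypothetical ingest-side guardrail: cap distinct series per metric name.
from collections import defaultdict

MAX_SERIES_PER_METRIC = 1000
_seen = defaultdict(set)  # metric name -> set of observed label tuples

def accept_sample(metric: str, labels: dict) -> bool:
    series_key = tuple(sorted(labels.items()))
    known = _seen[metric]
    if series_key in known or len(known) < MAX_SERIES_PER_METRIC:
        known.add(series_key)
        return True
    return False  # over budget: drop the new series (and count the drop)

print(accept_sample("http_requests_total", {"code": "200", "path": "/api"}))  # True
```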

Ultimately, cost management is about smart data management – collecting what’s useful, at the resolution that’s needed, and for only as long as it provides value.

Every piece of data has a “diminishing value over time”: recent data is extremely valuable for debugging active issues, but months-old raw telemetry is rarely used except for compliance or audits. Tailor your pipeline to that reality by summarizing or purging old data.


Trade-Offs, Anti-Patterns, and Lessons Learned

Designing a high-volume telemetry pipeline involves balancing trade-offs. Here are some key trade-offs and pitfalls (“anti-patterns”) to be aware of, along with lessons learned from real-world deployments:

Trade-Offs & Anti-Patterns

  • Completeness vs. Cost/Performance: An overarching trade-off is how much data to collect and retain versus the cost and performance implications. Collecting 100% of data (no sampling, no filtering) maximizes visibility but is usually unsustainable at scale. Overly aggressive filtering, on the other hand, might leave gaps just when you need detail. The sweet spot is to capture all high-value signals (errors, anomalies, key transactions) but aggressively downsample repetitive, low-value data. Many teams iterate on this. For example, one might start tracing at 10% sampling, realize the system can handle more, and increase it to 20%, or discover that certain log categories are rarely used and safely drop them. It’s an ongoing calibration of signal vs. noise. A good practice is to review incidents: if an incident was hard to debug due to missing data, you may decide to collect more of that type; if you never used a type of data, maybe collect less of it.
  • Unified vs. Specialized Tools: A unified pipeline (one agent, one format) simplifies deployment and correlation, but sometimes specialized tools are better for each domain (metrics, logs, traces). One anti-pattern is trying to force one storage engine to do everything. For instance, storing high-cardinality time-series in Elasticsearch (which is not optimized for that query pattern) would both be slow and costly – a specialized TSDB is more appropriate. Conversely, using a TSDB for logs (treating logs as metrics) loses the richness of text search. The lesson is use the right tool for each job, but integrate them. OpenTelemetry helps unify data collection without forcing a single backend. Modern observability stacks embrace polyglot persistence while providing a cohesive user experience on top.
  • Overlooking Bottlenecks: A pipeline is a chain; the slowest link dictates throughput. A common anti-pattern is scaling the obvious parts (like adding Kafka brokers) but missing a hidden bottleneck. Examples: a single-threaded parser that can’t handle more than X events/sec, or a database lock contention issue limiting write rate. Regularly perform end-to-end load testing or at least capacity analysis for each component. Tracing can even be applied to the pipeline itself to see where time is spent. If one component consistently has high CPU or queue backlogs, that’s your bottleneck. Also be aware of head-of-line blocking – e.g., if one very large log batch slows down a consumer thread, it might delay others (depending on the design). Solutions like parallel consumer threads or partitioning can alleviate this.
  • Too Many Unique Dimensions: High cardinality was mentioned, but it’s worth re-emphasizing as an anti-pattern. Metrics with exploding dimensions, logs with unbounded unique values, and traces with too granular labels (like span tags that are unique per request) will strain any system. Limit uniqueness in what you index and aggregate on.
  • Volume of Logs vs. Usefulness: Often, log volume grows out of habit (developers logging everything) rather than real need. An anti-pattern is treating log management as a “write-only” archive of absolutely everything. If no one ever reads 90% of those logs, you’re wasting resources. Observability is about making informed decisions on what to log. One lesson learned in many orgs: implement logging guidelines – e.g., logs should be actionable or informative, not debug spam; use metrics for counts rather than log each occurrence, etc. Some companies introduce a review step for adding a very high-volume log in code (because it has infrastructure cost). Culturally, shift to “log smart, not log more.”
  • Not Monitoring the Pipeline Itself: Failing to apply observability to your observability pipeline is ironically common. The pipeline should emit its own metrics: queue lengths, processing latencies, error counts, dropped-event counts, CPU/memory usage, etc. The anti-pattern is treating it as a black box until something breaks. Instead, instrument the collectors, brokers, and storage with metrics (most have them – e.g., Kafka exposes metrics for bytes in/out, consumer lag, etc., Elasticsearch has node stats, and the OTel Collector exports self-metrics). Build dashboards like “Telemetry Pipeline Health” showing ingestion rates vs. consumption, any backlogs, error rates in exporters, etc. Also set up alerts – e.g., alert if log queue lag exceeds 10^6 messages or if the trace drop count suddenly spikes. This meta-monitoring will catch issues early (maybe before users notice missing data 😉). As a lesson: treat the pipeline as you would a critical production service, with SLOs (e.g., 99.9% of data processed within 1 minute). (A self-metrics sketch appears after this list.)
  • No Backpressure Strategy: Another anti-pattern is not designing for overflow, leading to crashes or out-of-memory conditions. We discussed ways to handle this (queues, drop policies). Systems that lack these will eventually suffer when an upstream flood occurs. For instance, a naive log aggregator that tries to buffer everything in RAM when the database is slow will eventually OOM and lose everything. It’s better to drop some data gracefully (perhaps with an error log saying it did so) than to crash. Make sure each component in the pipeline has a defined limit and a defined behavior once that limit is exceeded. Many modern components have such controls (for example, OTel Collector’s memory_limiter, Fluent Bit’s storage.backlog.mem_limit, etc.) – use them. (A bounded-buffer sketch appears after this list.)
  • Reinventing the Wheel Unnecessarily: It can be tempting for very advanced teams to build custom pipeline components for absolute performance (and some have, like custom messaging systems or in-house time-series databases). But for most, this is an anti-pattern as it diverts engineering effort and may lack the robustness of community-maintained projects. You likely don’t need to write your own Kafka from scratch, or your own metrics database from scratch, unless you have truly unique requirements that existing solutions cannot meet. Leverage open source – they are battle-tested at high scale. Only go custom if you’ve clearly identified a gap that no existing tool fills and if you have the capacity to maintain it long-term.
  • Ignoring Governance and Access Control: At enterprise scale, you must also consider who can access the data and how to prevent abuse. An anti-pattern is an open pipeline where anyone can query anything without quotas – one wild regex search could destabilize a cluster. Implement rate limits on queries, and consider multi-tenancy features of tools to isolate heavy users. Also, secure the data: logs often contain sensitive info, so your pipeline should have encryption in transit and at rest, and proper authentication/authorization for querying (especially if multi-tenant). While this is more about security/compliance, it’s a lesson learned that at scale, you need to manage usage to keep the pipeline reliable for everyone.
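
For the self-monitoring point, here is a minimal sketch using the prometheus_client library. The metric names and port are made up for illustration; off-the-shelf components (Kafka, the OTel Collector, Loki, and so on) already expose equivalents that you should scrape.

```python
# Minimal self-instrumentation sketch for a custom pipeline stage.
# Metric names and the port are illustrative, not a standard.
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

RECEIVED = Counter("pipeline_events_received_total", "Events received")
DROPPED = Counter("pipeline_events_dropped_total", "Events dropped", ["reason"])  # inc() wherever data is discarded
QUEUE_DEPTH = Gauge("pipeline_queue_depth", "Events waiting to be exported")
EXPORT_LATENCY = Histogram("pipeline_export_seconds", "Time to export a batch")

def export_batch(batch, queue):
    RECEIVED.inc(len(batch))
    QUEUE_DEPTH.set(len(queue))
    with EXPORT_LATENCY.time():
        time.sleep(0.01)  # stand-in for the real export call

if __name__ == "__main__":
    start_http_server(9464)  # expose /metrics for Prometheus to scrape
    export_batch(batch=[{"msg": "hello"}], queue=[])
```

Dashboards and alerts (queue lag, drop rate, export latency) then come from whatever already scrapes your Prometheus-compatible metrics.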
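
And for the backpressure point, a sketch of the "bounded buffer, drop instead of crash" behavior: a hypothetical in-process buffer that rejects (and counts) new events once full rather than growing until it OOMs. The real knobs live in each component's configuration (for example the OTel Collector's memory_limiter), but the principle is the same.

```python
# Hypothetical bounded buffer: drop (and count) rather than grow without limit.
from collections import deque

class BoundedBuffer:
    def __init__(self, max_events: int = 100_000):
        self._events = deque()
        self._max = max_events
        self.dropped = 0

    def offer(self, event) -> bool:
        if len(self._events) >= self._max:
            self.dropped += 1   # expose this as a metric in practice
            return False        # caller can retry later, spill to disk, or give up
        self._events.append(event)
        return True

    def drain(self, n: int):
        for _ in range(min(n, len(self._events))):
            yield self._events.popleft()

buf = BoundedBuffer(max_events=2)
print([buf.offer(e) for e in ("a", "b", "c")], buf.dropped)  # [True, True, False] 1
```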

Lessons Learned

Summarizing a few key lessons from teams operating large telemetry systems:

  • Start Simple, Then Optimize: It’s okay to begin with a basic pipeline (maybe one cluster for Elastic, one Prom for metrics) when volumes are low. But design with an eye for future distributed architecture. As volume grows, incrementally introduce the scalable components and test them. This avoids premature complexity but also ensures you’re not stuck when scale hits.
  • Invest in Architecture over Hardware: Throwing bigger machines at the problem (vertical scaling📈) hits limits fast (and cost skyrockets💸). It’s better to architect for horizontal scaling earlier than to buy a huge supercomputer to run one monolithic database. The latter will become a single point of failure and a pain to replace. Clusters of smaller commodity instances are usually more cost-effective and resilient.
  • Communication Between Teams: In large orgs, the producers of telemetry (application teams) and the consumers (observability team) might be different. Communicate about the impact of telemetry – e.g., if an app team suddenly enables very verbose logging, make sure they know the cost or pipeline impact. Having a feedback loop (“we saw your service’s log volume grew 5x last week, is that expected?”) can catch accidental issues. Also, share best practices with developers on how to instrument efficiently (use async spans, don’t log in tight loops, etc.).
  • Regular Capacity Planning: Just as you plan capacity for your main product, do so for the telemetry pipeline. Monitor growth trends (metrics count, log volume per day) and project when you’ll need more resources. If your pipeline becomes a bottleneck, it can hinder troubleshooting precisely when your system is under duress. Treating it with the same rigor as production capacity avoids unpleasant surprises.
  • Keep Learning: The observability landscape evolves rapidly. New compression algorithms, new database technologies, and better strategies (like exemplars linking metrics to traces, or dynamic sampling) emerge. For example, if your current trace sampling misses some issues, you might adopt techniques like adaptive sampling (increasing the sampling rate when errors rise). Or if storage cost is an issue, you might evaluate a newer log store that promises lower cost. Stay informed via community forums, conference talks, and case studies from other companies. Adopting improvements can yield big savings or new capabilities.

In essence, building a high-volume telemetry pipeline is a continual process of refinement. Start with a robust foundation, observe how it behaves, and iteratively improve reliability, efficiency, and usefulness. Avoiding the common pitfalls and learning from others’ experiences will accelerate this journey.


Scaling an observability pipeline to handle millions of events per second is a challenging endeavor, but by applying the right architectures and practices, it can be achieved in a sustainable way. We’ve explored how to handle metrics, traces, and logs at scale – each with its own tailored storage and processing design – and how to tie them together into a cohesive system. The themes of horizontal scalability, decoupling through queues, smart filtering, and careful resource management come up again and again.

A well-designed pipeline will ensure that no matter how much your application traffic grows or how complex your microservices become, you will maintain visibility into the system’s behavior. This means faster debugging, more reliable alerting, and ultimately higher confidence in managing large-scale systems. Just as importantly, we emphasized cost-awareness: it’s easy to ingest everything and drown in data (and bills), but far more effective to be selective and strategic in what you observe.

By leveraging modern open-source tools and cloud-native patterns, you can build on the lessons of industry leaders. You don’t need to reinvent the wheel; you can assemble a combination of these projects and scale them out on Kubernetes or your infrastructure of choice.

In the end, observability is an enabler for reliability. A scalable telemetry pipeline is not just an IT cost center – it’s an investment in uptime and agility. It allows engineers to quickly pinpoint issues in distributed environments and understand system behavior even under massive load. The strategies outlined in this series of posts serve as a blueprint to design and evolve your own high-volume telemetry pipeline. Take these lessons, adapt them to your context, and you’ll be well on your way to observability at scale: a pipeline that grows with your systems and continuously delivers the insights you need, when you need them. Happy Observing!