Royal Stock Exchange (RSE) - How to Scale Our System to Millions? - Part 4

Scaling systems to accommodate large amounts of data and millions of concurrent users is a complex and challenging task.

It requires a comprehensive set of skills, techniques, technologies, and tools to ensure success.

In this post, we will delve into RSE's strategy for scaling a stock exchange application to meet the demands of millions of concurrent users while efficiently managing a high volume of daily records. Our goal is to provide insights into the key components and approaches that will enable us to achieve this ambitious scale. The main building blocks we'll cover are:

  • JSON objects in DB Table Columns
  • Background workers
  • Redis
  • Elasticsearch (planned if needed)

Intro

By adopting a multi-faceted approach, we aim to address the various aspects involved in scaling a system effectively. This includes database design and optimization, efficient data aggregation through background workers, intelligent caching using Redis, and leveraging the power of Elasticsearch for advanced querying and analysis.

Our plan is centered around the idea of leveraging SQL and JSON for efficient data storage, ensuring flexibility and quick development. We will utilize background workers to aggregate and update data during non-peak hours, reducing the need for on-demand processing. By employing Redis as a caching layer, we can serve commonly requested data from memory, significantly reducing the load on the database and improving response times.

Additionally, we recognize the importance of Elasticsearch for complex search queries and aggregations. Its powerful indexing capabilities and efficient querying mechanisms will enable us to perform sophisticated analysis on the vast amounts of data generated by the stock exchange. This will empower our users with real-time insights and ensure efficient scalability for our application.

It's important to emphasize that achieving scalability for millions of concurrent users requires a holistic approach. Alongside the technical components, we will also consider hardware infrastructure, application architecture, and performance monitoring. By continuously optimizing and fine-tuning our system, we can address potential bottlenecks and ensure smooth operations at scale.

Through this post, we aim to provide valuable insights into our strategy for scaling to millions of concurrent users, taking into account the unique requirements and challenges of a stock exchange application. By combining the right mix of skills, techniques, technologies, and tools, we are confident in our ability to build a robust and scalable system capable of serving millions of users efficiently.

DB Tables with JSON

We'll store JSON in our DB columns to simplify modeling our data. We can avoid overly complicated queries and relationships between tables (at least initially) with this approach.

{
  "sym": "TSLA",
  "period": "live",
  "name": "Tesla",
  "o": 200.0,
  "l": 200.0,
  "h": 220.0,
  "c": 220.0,
  "vwa": 0.0,
  "series": [
    {
      "o": 200.0,
      "l": 200.0,
      "h": 220.0,
      "c": 220.0,
      "vwa": 0.0,
      "time": ""
    }
    // ... more objects
  ]
}

This is an example of the JSON column for a single record in our DB; it stores time series data from the current time back to the beginning of the trading day.
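To make this concrete, here is a minimal sketch (in Python, with a hypothetical append_bar helper) of how application code might update such a column when a new 5-minute bar arrives; reading and writing the column itself is left to whatever DB driver the service already uses.

import json

def append_bar(column_json: str, bar: dict) -> str:
    """Hypothetical helper: append a new 5-minute bar to the JSON stored in
    the column and roll it up into the day-level open/high/low/close fields."""
    doc = json.loads(column_json)
    doc["series"].append(bar)

    # Day-level open stays at the first bar; high/low/close roll forward.
    doc["h"] = max(doc["h"], bar["h"])
    doc["l"] = min(doc["l"], bar["l"])
    doc["c"] = bar["c"]

    # Hand the serialized document back to whatever code persists the row.
    return json.dumps(doc)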

Storage Calculation / Memory Usage

If we store one series entry like those in the example above every 5 minutes during trading hours for a year, and apply that to all 6,100 companies on the NYSE and NASDAQ, the calculation works out as follows:

  • Size of one object: approximately 60 bytes.
  • Objects per day: 6.5 trading hours × 12 five-minute intervals per hour = 78 objects.
  • Objects per year for one company: 78 objects × 252 trading days = 19,656 objects.
  • Total objects per year for all companies: 19,656 objects × 6,100 companies = 119,901,600 objects.
  • Storage per company per year: 60 bytes × 19,656 objects = 1,179,360 bytes, or approximately 1.12 MB.
  • Total storage per year for all companies: 1,179,360 bytes × 6,100 companies ≈ 7.19 billion bytes, or approximately 6.7 GB.

Therefore, if we store one object every 5 minutes during trading hours for a year for all 6,100 companies on the NYSE and NASDAQ, the estimated storage is roughly 6.7 GB.
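The same arithmetic as a small Python snippet, so the assumptions (bar size, interval length, listing count) are easy to tweak:

BYTES_PER_OBJECT = 60        # rough size of one 5-minute bar
INTERVALS_PER_HOUR = 12      # one bar every 5 minutes
TRADING_HOURS = 6.5
TRADING_DAYS = 252
COMPANIES = 6_100            # approximate NYSE + NASDAQ listings

objects_per_year = TRADING_HOURS * INTERVALS_PER_HOUR * TRADING_DAYS  # 19,656 per company
total_objects = objects_per_year * COMPANIES                          # ~119.9 million
total_bytes = total_objects * BYTES_PER_OBJECT

print(f"objects per year: {total_objects:,.0f}")
print(f"storage per year: {total_bytes / 1024**3:.2f} GiB")           # ~6.7 GiB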

A personal laptop could easily hold a few years' worth of data.

A dedicated bare metal machine would last much longer.

Potential Pitfalls

Here are some of the challenges that might come up with this approach.

Storing data as JSON objects in a column of a table in SQL Server can offer flexibility and convenience in certain scenarios. However, there are some considerations and potential challenges to be aware of:

  • Limited querying capabilities: Storing data as JSON objects can make it more challenging to perform complex queries and aggregations compared to a traditional relational database structure. SQL Server provides some JSON functions and operators for querying JSON data, but the querying capabilities may be more limited compared to a structured relational model (see the query sketch after this list).

  • Lack of data integrity and validation: Storing data as JSON objects does not enforce the same level of data integrity and validation as a structured relational model. In a relational database, you can define constraints, relationships, and data types to ensure data consistency. With JSON storage, you'll need to handle data validation and integrity checks manually within your application.

  • Increased storage requirements: Storing data as JSON objects can potentially result in increased storage requirements compared to a structured relational model. JSON objects typically have additional overhead due to the JSON formatting and the flexibility of storing variable and nested structures.

  • Performance considerations: Querying and indexing JSON data may have performance implications depending on the size and complexity of the JSON objects. Large and deeply nested JSON structures could impact query performance and indexing effectiveness.

  • Limited cross-platform compatibility: While storing JSON data in SQL Server can be convenient, it may limit the ability to easily migrate the data to other database systems that do not support JSON natively. Consider the portability and compatibility requirements of your application.

It's important to carefully evaluate your specific use case, performance requirements, data access patterns, and long-term scalability needs when deciding to store data as JSON objects in SQL Server. This approach may be suitable for certain scenarios, but it's essential to be aware of the potential challenges and trade-offs mentioned above.
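To ground the first bullet above, here is roughly what reaching inside the JSON column looks like with SQL Server's built-in JSON functions (JSON_VALUE and OPENJSON), issued from Python; the quotes table, data column, and connection string are placeholders rather than the actual RSE schema. Simple lookups are fine, but aggregations across the nested series get verbose quickly.

import pyodbc  # or whatever driver the service already uses

# Pull the day-level close plus every 5-minute bar out of the JSON column.
# Table and column names are illustrative, not the actual RSE schema.
SQL = """
SELECT JSON_VALUE(q.data, '$.sym') AS sym,
       JSON_VALUE(q.data, '$.c')   AS day_close,
       bar.o, bar.c, bar.[time]
FROM quotes AS q
CROSS APPLY OPENJSON(q.data, '$.series')
     WITH (o FLOAT '$.o', c FLOAT '$.c', [time] NVARCHAR(30) '$.time') AS bar
WHERE JSON_VALUE(q.data, '$.sym') = ?;
"""

conn = pyodbc.connect("DSN=rse")  # placeholder connection string
rows = conn.cursor().execute(SQL, "TSLA").fetchall()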

In response to the first challenge (limited querying capabilities), I plan to run background workers regularly to aggregate data during non-peak hours so that we don't need to perform those aggregations on the fly when users request certain periods of time series. Does this sound like a good idea?

Background Workers

Running background workers to pre-aggregate data during non-peak hours can be a good strategy to improve performance and responsiveness when users request certain periods of time series. By pre-computing and storing aggregated data, you can reduce the complexity and processing time required to generate the requested results on the fly.

This approach has several advantages:

  • Improved query performance: Pre-aggregating data allows you to store summarized or aggregated information that can be quickly retrieved when users request specific time series data. This can significantly reduce the query execution time and improve the overall system performance.

  • Reduced computational load: By offloading the aggregation process to background workers during non-peak hours, you can distribute the computational load more evenly and ensure that users' requests are processed efficiently during peak usage times.

  • Enhanced user experience: With pre-aggregated data, users can obtain the requested time series information more quickly, leading to a better user experience. Users won't need to wait for on-the-fly computations, resulting in faster response times.

  • Scalability: Pre-aggregation can help improve scalability by reducing the computational overhead and database load during peak usage periods. It allows the system to handle more concurrent user requests without experiencing performance degradation.

However, it's important to consider a few factors:

  • Frequency of pre-aggregation: Determine an appropriate frequency for running the background workers to pre-aggregate data. Consider the trade-off between the freshness of the data and the system's processing capacity during non-peak hours.

  • Data freshness: Pre-aggregated data may not reflect real-time updates, as it is computed and stored in advance. Ensure that this trade-off aligns with your application's requirements and user expectations.

  • Storage considerations: Pre-aggregated data will require additional storage space. Plan and allocate sufficient storage capacity to accommodate the pre-aggregated data based on your estimated growth and retention requirements.

Overall, employing background workers to pre-aggregate data during non-peak hours is a sound approach to improve performance and provide a better user experience. Just ensure that the pre-aggregation frequency, data freshness, and storage capacity are carefully considered to strike the right balance for your specific application.
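As a rough sketch of what such a worker might look like (the data-access hooks and the daily summary shape are assumptions, not the actual RSE schema), the job below collapses a day's 5-minute bars into a single daily bar; a scheduler such as cron, or whatever job framework the app already uses, would run it after market close.

from datetime import date

def aggregate_day(series: list[dict]) -> dict:
    """Collapse one day's 5-minute bars (the "series" array from the JSON
    column) into a single daily summary bar."""
    if not series:
        raise ValueError("no bars to aggregate")
    return {
        "o": series[0]["o"],                   # day opens at the first bar
        "h": max(bar["h"] for bar in series),  # day high
        "l": min(bar["l"] for bar in series),  # day low
        "c": series[-1]["c"],                  # day closes at the last bar
        "time": str(date.today()),
    }

def run_nightly_job(fetch_live_rows, save_daily_row):
    """Worker body: fetch_live_rows and save_daily_row stand in for the
    project's own data-access code."""
    for sym, series in fetch_live_rows():
        save_daily_row(sym, aggregate_day(series))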

Additional Technique - Redis

We'll use a Redis DB to store commonly requested time intervals for assets in memory, so they can be served without hitting the primary DB.

Using a Redis database to store commonly requested intervals of time series data can be an effective strategy to improve performance and reduce the load on your primary database. Here are some thoughts on implementing this approach:

  • Caching frequently accessed data: By caching commonly requested intervals of time series data in Redis, you can serve the data directly from memory without hitting the primary database. This can significantly reduce the response time and improve the overall system performance.

  • Faster retrieval: Redis is an in-memory data store that offers high-speed read and write operations. Storing the frequently accessed data in Redis allows for quick retrieval, as it avoids the latency associated with disk-based database systems.

  • Reduced database load: By serving cached data from Redis, you can reduce the load on your primary database. This can free up resources and improve the scalability of your system, allowing it to handle more concurrent requests.

  • Flexible expiration and eviction policies: Redis provides flexible expiration and eviction policies that allow you to control the lifespan of cached data. You can set an appropriate expiration time based on the volatility and freshness requirements of the time series data.

  • Monitoring and maintenance: It's important to monitor the Redis cache to ensure its performance and availability. Additionally, consider implementing strategies like cache warming (preloading frequently accessed data into Redis) to ensure the cache remains warm and provides optimal performance.

  • Data consistency: When using a caching layer like Redis, it's essential to consider data consistency. Changes made to the primary database should also update or invalidate the corresponding cached data in Redis to ensure data integrity and consistency.

  • Cache invalidation: Implement a mechanism to invalidate or update the cached data when it becomes stale or outdated. This can be based on a time-based expiration or triggered by updates to the primary database.

  • Cost considerations: Remember that Redis stores data in memory, which can be more expensive compared to traditional disk-based storage. Consider the size of the data you plan to cache and evaluate the cost implications.

Using Redis as a caching layer for commonly requested time series data is a widely adopted approach and can bring significant performance benefits to your application. However, ensure you carefully design your caching strategy, consider data consistency, monitor the cache, and plan for appropriate cache expiration and eviction policies to maintain data integrity and achieve the desired performance improvements.
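For example, a cache-aside lookup with the redis-py client might look roughly like this; the key scheme, TTL, and the load_from_db callback are illustrative placeholders:

import json
import redis  # redis-py; assumes a Redis server is reachable

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 300  # keep cached intervals for 5 minutes (placeholder TTL)

def get_series(sym: str, period: str, load_from_db) -> dict:
    """Serve a commonly requested interval from Redis, falling back to
    the primary DB on a cache miss."""
    key = f"series:{sym}:{period}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)        # served from memory
    doc = load_from_db(sym, period)      # hit the primary DB
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(doc))
    return doc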

One more technique: Elasticsearch

Elasticsearch is a highly scalable and distributed search and analytics engine built on top of the Apache Lucene library. It is designed to handle large volumes of data and provide fast and flexible search capabilities. Here are some of the key benefits of Elasticsearch:

  • Full-Text Search: Elasticsearch excels at full-text search, enabling you to perform complex search queries across large amounts of structured and unstructured data. It uses an inverted index to tokenize, index, and search text efficiently.

  • Distributed and Scalable: Elasticsearch is built to be distributed from the ground up. It can easily scale horizontally by adding more nodes to handle increased data volume and search traffic. It automatically distributes data across the nodes for high availability and fault tolerance.

  • Real-Time Search and Analytics: Elasticsearch provides real-time search and analytics capabilities, allowing you to index and search data as it is ingested. It supports near real-time updates, making it suitable for applications that require up-to-date insights and analytics.

  • Document-Oriented: Elasticsearch is a document-oriented database, where data is organized and queried as JSON documents. It offers powerful CRUD operations for indexing, updating, and deleting documents, as well as retrieving search results.

  • Rich Query Language: Elasticsearch offers a comprehensive Query DSL (Domain-Specific Language) that allows you to construct complex queries to filter, aggregate, and analyze your data. It supports various query types, including term queries, range queries, match queries, and more.

  • Schemaless and Flexible: Elasticsearch is schemaless, meaning you don't need to define a fixed schema upfront. It dynamically maps data based on the structure of the JSON documents. This flexibility enables easy data ingestion and adaptability to changing data formats.

  • Aggregations and Analytics: Elasticsearch provides powerful aggregations that enable you to perform analytics and generate meaningful insights from your data. You can perform metrics aggregations, bucket aggregations, and pipeline aggregations to derive statistics, perform data grouping, and calculate complex aggregations.

Challenges of Elasticsearch:

  • Complexity: Elasticsearch has a steep learning curve, especially when it comes to advanced features and complex use cases. Configuring and optimizing Elasticsearch requires a good understanding of its internals and best practices.

  • Resource Intensive: Elasticsearch can be resource-intensive, particularly in terms of memory and storage requirements. Large-scale deployments may require careful capacity planning and optimization to ensure optimal performance.

  • Data Sharding and Distribution: Properly sharding and distributing data across nodes in an Elasticsearch cluster can be challenging. It requires careful consideration of data volume, indexing throughput, and query patterns to ensure balanced and efficient data distribution.

  • Data Modeling and Denormalization: Elasticsearch is not a relational database, and data modeling requires denormalization to optimize search performance. Designing an efficient data model and managing data relationships can be challenging, especially for complex data structures.

  • Data Consistency: Elasticsearch prioritizes availability and performance over strict consistency. In distributed setups, there might be eventual consistency among replicas during updates, which can impact applications that require strong data consistency guarantees.

  • Monitoring and Operations: Managing and monitoring an Elasticsearch cluster involves monitoring node health, indexing and search performance, managing cluster topology, and performing maintenance tasks. It requires dedicated monitoring tools and operational expertise.

Overall, Elasticsearch provides powerful search and analytics capabilities for handling large-scale data sets. It excels at real-time search and enables efficient querying and analysis of structured and unstructured data. However, it requires careful planning, configuration, and monitoring to ensure optimal performance and scalability.
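To give a flavor of the Query DSL and aggregations described above, here is a hedged sketch using the official Elasticsearch Python client (8.x assumed); the bars-5m index, its field names, and the document shape are assumptions for illustration. It filters bars for one symbol and rolls them into daily high/low buckets.

from elasticsearch import Elasticsearch  # official Python client

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="bars-5m",                      # hypothetical index: one doc per 5-minute bar
    query={"term": {"sym": "TSLA"}},      # assumes "sym" is mapped as a keyword
    aggs={
        "daily": {
            "date_histogram": {"field": "time", "calendar_interval": "day"},
            "aggs": {
                "day_high": {"max": {"field": "h"}},
                "day_low": {"min": {"field": "l"}},
            },
        }
    },
    size=0,                               # aggregations only, no raw hits
)
print(resp["aggregations"]["daily"]["buckets"])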

Final Thoughts

The scalability of a system depends on various factors, including the hardware infrastructure, software architecture, database design, and the efficiency of the implemented solutions. While the described approach using SQL, JSON, background workers, Redis, and Elasticsearch can provide a solid foundation for efficient storage, caching, and complex querying, scaling to millions of concurrent users requires careful consideration of additional aspects. Here are a few key points to consider:

  • Hardware Infrastructure: The scalability of your system relies on the capabilities of your hardware infrastructure, including server capacity, network bandwidth, and storage performance. Ensuring sufficient resources and scaling horizontally by adding more servers can help handle increased user load.

  • Application Architecture: The overall architecture of your application should be designed to support horizontal scalability and distributed processing. This includes load balancing, clustering, and proper handling of concurrent requests to ensure efficient utilization of resources.

  • Database Optimization: While storing JSON objects in a SQL database can provide flexibility, it's important to optimize the database schema, indexes, and queries to ensure efficient storage and retrieval. Proper indexing, query optimization, and caching strategies can significantly improve performance.

  • Background Processing and Caching: Using background workers to aggregate and update data, and caching commonly requested intervals in Redis, can help reduce the load on the database and improve response times. However, careful consideration should be given to the cache eviction policies, cache size, and data synchronization between the cache and the database.

  • Elasticsearch Indexing and Query Optimization: Leveraging Elasticsearch for complex queries and aggregations can offload processing from the database and provide fast search capabilities. Proper index design, query optimization, and tuning Elasticsearch settings can improve performance and scalability.

  • Distributed Systems and Scaling Patterns: As the user load increases, you may need to adopt distributed systems and scaling patterns such as sharding, partitioning, and data replication. These techniques allow you to distribute the load across multiple servers and handle increased concurrency effectively.

  • Monitoring and Performance Tuning: Continuous monitoring of the system's performance, resource utilization, and user behavior is crucial for identifying bottlenecks and optimizing the system. Performance tuning, optimizing queries, and adjusting system configurations based on real-time insights can help ensure scalability.

It's important to note that scaling to millions of concurrent users is a complex undertaking and may require further optimizations and architectural considerations specific to your application's requirements and usage patterns. Conducting load testing, performance profiling, and periodically reviewing and optimizing the system architecture will be necessary to ensure smooth scalability as the user base grows.

In conclusion

At the end of the day, we won't know until we have to do it. No system was built at scale from day one; they were all built through iterations over time.

Although we'll never know until we truly have to do it, I believe we've got a solid plan for how to scale RSE for the foreseeable future, which is all I can hope for.