What Is a Single Point of Failure? How Can You Identify and Avoid One?
A Single Point of Failure (SPOF) refers to a critical component within a system whose failure can result in system-wide outages, leading to downtime, potential data loss, and a negative user experience.
Let’s look at it in simple terms.
Suppose an important room in your house is lit by a single bulb, and you have no spare. If that bulb burns out, the room goes dark. A single point of failure in a computer system is like that bulb: if that one part fails, the whole system goes down with it. For instance, if a website runs on only one server and that server fails, the whole site becomes unavailable. Just as we would switch on another lamp when the first bulb goes out, we build backups and alternatives into a system so that if one component fails, the others can keep doing their job.
Let’s examine a system and identify its various potential Single Points of Failure (SPOFs):
This system consists of one load balancer, two application servers, one database, and one cache server.
System Overview:
Clients send their requests to the load balancer, which distributes incoming traffic between the two application servers. The application servers look for data in the cache first; if the data is not found there, they query the database.
Potential Single Points of Failure (SPOFs):
- Load Balancer:
In the current architecture, the load balancer is a single point of failure. It is the only instance handling traffic, so if it fails, the entire service becomes unreachable. To avoid this, deploy redundant load balancers in an active-passive or active-active configuration so that another load balancer can take over when the primary instance is down.
- Database:
Since all data is served from a single database instance, none of the system’s data is available when that instance fails, which brings the system down and may cause data loss. This makes the database a serious SPOF for the entire system. To address this, replicate data across multiple nodes using synchronous or asynchronous replication. Database clustering, geographic distribution, and automatic failover also improve availability and reduce the risk of a total outage.
- Cache Server:
The cache server is not a hard SPOF, because the rest of the system keeps working if it fails. Without it, however, all read operations hit the database. This puts heavier pressure on the database, which leads to poor performance and slow response times. The effect can be avoided, or at least reduced, by setting up a distributed caching layer or multiple cache nodes so that the cache stays available.
- Application Servers:
The application servers are not SPOFs in this architecture, as there are two of them. If one application server goes down, the other can still serve the traffic coming from the load balancer. This holds only if the load balancer detects the failed server and stops routing traffic to it.
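As a first pass, the walkthrough above can be approximated by counting instances per component. The sketch below is only illustrative: the `inventory` dictionary and the function name are assumptions made for this example, not part of any real tool.

```python
# Hypothetical inventory of the example system: component -> number of instances.
inventory = {
    "load_balancer": 1,
    "app_server": 2,
    "database": 1,
    "cache": 1,
}


def single_instance_components(inventory):
    """Return the components that run as exactly one instance, sorted by name."""
    return sorted(name for name, count in inventory.items() if count == 1)


spof_candidates = single_instance_components(inventory)
print(spof_candidates)
```

A count like this only produces candidates. It flags the cache as well, even though losing the cache degrades performance rather than stopping the system, so each candidate still needs the kind of judgment applied in the list above.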
How to avoid Single Points of Failure (SPOF)
1- Redundancy
Redundancy is one of the most effective approaches for eliminating SPOFs. It means running two or more instances of a component so that the system keeps working when one of them fails.
There are two types of redundant components: active and passive. Active components are always running; they share the load and make the system more robust. Passive components, on the other hand, sit idle on standby and take over as soon as the primary component fails.
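The difference between the two modes can be sketched in a few lines. This is a simplified illustration under stated assumptions: `Node`, `pick_active_passive`, and `pick_active_active` are hypothetical names, and health is a plain flag rather than the result of a real health check.

```python
import itertools


class Node:
    """Hypothetical component instance with a manually toggled health flag."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy


def pick_active_passive(primary, standby):
    """Active-passive: the standby serves only when the primary is down."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("no healthy node available")


def pick_active_active(nodes, counter):
    """Active-active: rotate requests across all healthy nodes."""
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy node available")
    return healthy[next(counter) % len(healthy)]


a, b = Node("a"), Node("b")
counter = itertools.count()
served = [pick_active_active([a, b], counter).name for _ in range(4)]

a.healthy = False  # simulate a failure of node "a"
failover = pick_active_passive(a, b).name
```

In the active-active case both nodes take turns serving requests, while in the active-passive case node `b` is used only after node `a` is marked unhealthy.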
2- Map Out the System Architecture
The first step in identifying SPOFs is to draw an accurate architecture map of the system. This means documenting the overall system along with its significant modules, data flows, and relationships.
Documenting all components: List every key element by name, including application servers, databases, load balancers, caches, storage systems, and networks. Give each part a name that clearly shows the role it plays in the overall system.
Illustrating data and communication flows: Describe how data moves between components, for example to a database, to a downstream API, or between services. This shows which parts of the system depend on other parts and which connections are required for the system to operate.
Identifying external dependencies: Identify every external system that your system depends on (e.g., cloud services, APIs, or payment services). Relying on an external service is just as risky if that service is itself a SPOF with no failover.
Visualizing the network layer: Map the network design as well, including routers, switches, firewalls, and VPNs. This helps uncover network-related SPOFs such as a single link between data centres or dependence on a single ISP.
3- Analyze Each Component
After the system architecture is mapped, the next step is to examine each component more closely to evaluate its risk and its potential to be a SPOF. This involves a critical assessment of every element, service, and process in the system.
Assess the criticality of each component: Determine what role each component plays in the functioning of the system.
Ask questions like:
How critical is this component to the overall functioning of the system?
Should this component fail, the …
If this component failed, would it cause a partial outage or an outage of the entire system?
Identify dependencies and interconnections: It’s important to understand how components rely on each other. Dependencies can chain, so that the failure of one component causes others to fail in turn. Check whether any highly available critical service (a database, message broker, or application server) relies on some other single instance whose failure would bring down the whole system.
Evaluate redundancy and failover mechanisms: For each component, determine whether redundancy or failover exists. Critical components without a backup are the most likely SPOFs. For instance:
Is the backup active-passive or active-active?
Is there an automated failover process, and how quickly does it work?
Consider performance bottlenecks: Even if a component is not strictly a SPOF, it can slow down the whole system under high load. Components that are easily overloaded, such as databases or application servers, can cause a system failure if the architecture cannot scale.
Examine maintenance and recovery strategies: Describe how each component is maintained and whether it has an adequate recovery mechanism. If a component cannot recover from failure automatically and needs manual intervention, treat it as a likely SPOF.
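One way to make the dependency analysis concrete is to model the request flow as a graph, remove one component at a time, and check whether a request can still get from the client to the database. The sketch below does this for the example system from earlier in the article. The node names (`app1`, `app2`, and so on) and the choice of "client can reach the database" as the success criterion are assumptions made for illustration.

```python
from collections import deque

# Hypothetical request-flow graph for the example system (edges point downstream).
graph = {
    "client": ["lb"],
    "lb": ["app1", "app2"],
    "app1": ["cache", "db"],
    "app2": ["cache", "db"],
    "cache": [],
    "db": [],
}


def reachable(graph, start, target, removed):
    """Breadth-first search from start to target, skipping the removed node."""
    if removed in (start, target):
        return False
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph[node]:
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def hard_spofs(graph, start="client", target="db"):
    """Components whose removal leaves no path from start to target."""
    return sorted(
        n for n in graph
        if n != start and not reachable(graph, start, target, n)
    )


result = hard_spofs(graph)
print(result)
```

Removing either application server or the cache still leaves a path to the database, so only the load balancer and the database are reported. This kind of check finds hard SPOFs only; performance bottlenecks such as a missing cache need the separate assessment described above.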
4- Data Replication
Replication means copying data to more than one site so that if one site fails, the data can still be accessed from another, which minimizes downtime.
Synchronous Replication: Data is written to all locations in real time. A write is acknowledged as complete only after the new data has been written to both the primary and the replica sites, so the data is identical on all nodes. This method ensures a high degree of consistency, but it adds latency to every transaction because the primary must wait for the replicas to confirm each write.
Asynchronous Replication: Data is replicated with a slight delay, and the primary site does not wait for the replica to acknowledge write operations. This has the advantage of lower latency and better overall database performance, but recent writes can be lost if the primary fails before they have been copied to the secondary site.
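The trade-off between the two modes can be seen in a toy in-memory model. This is a sketch, not how any particular database implements replication: `Primary`, `Replica`, and `flush` are hypothetical names, and the "network" is just a method call.

```python
class Replica:
    """Hypothetical replica that stores applied writes in a dict."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    def __init__(self, replicas, synchronous):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []  # writes not yet shipped to replicas (async mode only)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Acknowledge only after every replica has applied the write.
            for replica in self.replicas:
                replica.apply(key, value)
        else:
            # Acknowledge immediately; replicas catch up later.
            self.pending.append((key, value))
        return "ack"

    def flush(self):
        """Ship queued writes to the replicas (the asynchronous catch-up step)."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.apply(key, value)
        self.pending.clear()


sync_replica = Replica()
Primary([sync_replica], synchronous=True).write("order", 1)

async_replica = Replica()
async_primary = Primary([async_replica], synchronous=False)
async_primary.write("order", 1)
lag_before_flush = "order" not in async_replica.data
async_primary.flush()
```

In the asynchronous case, the write is acknowledged while the replica is still missing it. If the primary were lost before `flush()` ran, that write would be gone, which is the data-loss window described above.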
Examples…
Here are three basic examples, each told as a short story, of technical scenarios involving a Single Point of Failure (SPOF):
1- Problems Associated with a Database as a Single Point of Failure (SPOF)
2- Load Balancer as a Single Point of Failure
3- Caching System as a Single Point of Failure
1- Problems Associated with a Database as a Single Point of Failure (SPOF)
Story: The Farmer and His Granary
Oliver was a farmer who harvested tons of grain every year. He had only one big granary, and it held all the grain he had. For years all was well: the granary stored his harvest and life was good. Then, one unfortunate evening, a severe storm hit the village, and the granary was struck by lightning and burned down. All the grain was destroyed. The following morning, Oliver stood before the remains of his granary; a full year’s produce was gone. He had no other granary and no emergency reserve. He and his village, who lived on his grain, would go hungry that year.
Solution:
After the disaster, Oliver had learned his lesson. Instead of one big granary, he built several smaller ones around his farm. That way, if one caught fire again, he could rely on the others, protecting both his harvest and the entire village. He also hired a couple of assistants to keep an eye on each granary and report any problems so that he could step in.
Modern Lesson:
Like Oliver, businesses should not keep all their data in a single place. By implementing database replication or clustering, we spread the risk across several servers, so that if one of them fails, everything else keeps working.
Technical Solution:
- Database Replication: Use a primary-replica (also called master-slave) setup. All write queries go to the primary database, while replicas are kept in sync and serve read queries. If the primary server fails, a replica can be promoted to take its place.
Automatic failover can also be achieved with MySQL Group Replication, PostgreSQL replication with failover, or managed services like AWS RDS Multi-AZ.
- Database Clustering: Deploy a distributed database cluster using Cassandra, MongoDB with replica sets, or CockroachDB. These are built for horizontal scaling and fault tolerance across multiple nodes and data centers.
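The promotion step in a primary-replica setup can be sketched with a small in-memory model. It is an illustration only: `DatabaseNode` and `ReplicatedDatabase` are hypothetical classes, replication is modeled as a direct copy, and real systems such as the ones named above handle consensus, replication lag, and split-brain scenarios that this sketch ignores.

```python
class DatabaseNode:
    """Hypothetical database node with a manually toggled health flag."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.rows = {}


class ReplicatedDatabase:
    """Writes go to the primary and are copied to replicas; reads prefer replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def failover(self):
        """Promote the first healthy replica and drop the failed primary."""
        candidate = next((r for r in self.replicas if r.healthy), None)
        if candidate is None:
            raise RuntimeError("no healthy replica to promote")
        self.replicas.remove(candidate)
        self.primary = candidate

    def write(self, key, value):
        if not self.primary.healthy:
            self.failover()
        for node in [self.primary, *self.replicas]:
            if node.healthy:
                node.rows[key] = value

    def read(self, key):
        for node in [*self.replicas, self.primary]:
            if node.healthy:
                return node.rows.get(key)
        raise RuntimeError("no healthy database node")


db = ReplicatedDatabase(DatabaseNode("primary"), [DatabaseNode("replica-1")])
db.write("user:1", "Ada")
db.primary.healthy = False   # simulate a primary failure
db.write("user:2", "Grace")  # triggers promotion of replica-1
new_primary = db.primary.name
value = db.read("user:1")
```

Because the first write was copied to the replica before the failure, the data written earlier is still readable after the replica is promoted.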
2- Load Balancer as a Single Point of Failure
Story: The Toll Bridge in the Busy City
There was once a busy city with a single toll bridge, the only way for vehicles to cross from one side to the other. Workers crossed this bridge every day from the eastern side of the city to get to work on the opposite side. This worked for a long time, until one fateful day the toll bridge collapsed from age and heavy use. There was no longer any way to cross to the other side. The city ground to a halt. Employees could not get to their workplaces, goods could not get to the markets, and the economy stalled.
Solution:
Soon after the collapse, the mayor regretted that the city had only one bridge. She planned several bridges with several approach roads so that if one bridge failed again, people could still cross using the others. She also installed traffic lights and road signs to direct traffic across all the bridges, keeping it flowing even during rush hour.
Modern Lesson:
The load balancer in a digital system is like that toll bridge: it controls how traffic reaches the servers. If the load balancer fails, the servers become unreachable, just as the workers could not cross the city. With a backup load balancer and DNS failover in place, traffic is automatically rerouted when one load balancer goes down, so service is not interrupted.
Technical Solution:
- Redundant Load Balancers: Use multiple load balancers in active-passive or active-active mode. In active-passive mode, the second load balancer stays on standby and takes over if the active one goes down. In active-active mode, both load balancers handle traffic at the same time, and traffic is distributed between them.
AWS ELB and GCP Load Balancing provide fault tolerance and redundancy across multiple availability zones.
- DNS-Based Load Balancing: Use DNS failover with Amazon Route 53 or Cloudflare DNS, so that DNS automatically redirects traffic to a healthy load balancer when one has failed. This keeps any single load balancer from acting as a SPOF.
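Health-check-driven failover of this kind usually waits for several consecutive failed probes before switching, so that one dropped probe does not trigger a failover. The sketch below models that idea in plain Python. It does not call Route 53, Cloudflare, or any real DNS API; `HealthTracker`, the threshold of 3, and the `example.com` hostnames are all assumptions made for illustration.

```python
class HealthTracker:
    """Marks an endpoint unhealthy after `threshold` consecutive failed probes."""

    def __init__(self, endpoints, threshold=3):
        self.endpoints = list(endpoints)  # listed in priority order
        self.threshold = threshold
        self.failures = {e: 0 for e in self.endpoints}

    def record_probe(self, endpoint, ok):
        # A successful probe resets the counter; a failed one increments it.
        self.failures[endpoint] = 0 if ok else self.failures[endpoint] + 1

    def is_healthy(self, endpoint):
        return self.failures[endpoint] < self.threshold

    def resolve(self):
        """Return the first healthy endpoint in priority order."""
        for endpoint in self.endpoints:
            if self.is_healthy(endpoint):
                return endpoint
        raise RuntimeError("no healthy endpoint")


PRIMARY = "lb-primary.example.com"   # hypothetical hostnames
STANDBY = "lb-standby.example.com"

tracker = HealthTracker([PRIMARY, STANDBY], threshold=3)
before = tracker.resolve()
for _ in range(3):
    tracker.record_probe(PRIMARY, ok=False)
after = tracker.resolve()
tracker.record_probe(PRIMARY, ok=True)
recovered = tracker.resolve()
```

A higher threshold makes failover less jumpy but slower, which is one answer to the earlier question of how quickly an automated failover works.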
3- Caching System as a Single Point of Failure
Story: The Bakery’s Secret Recipe Box
Clara was a well-known baker whose shop was famous for its pies. Whenever she perfected a new secret recipe, she put it in a wooden recipe box hidden under the counter. She knew most of her recipes by heart, but for the pies, especially the more complicated ones, she had to refer to the box. One day a customer carelessly knocked over a candle, and the recipe box caught fire. Without her recipes, Clara could no longer bake her famous pies. The bakery could not make pies for days, and she risked losing all her customers.
Solution:
After the fire, Clara wisely wrote down her recipes again and made copies, keeping them in three different places: one in the kitchen, a second at home, and a third in a locked strongbox in the basement. Now, if one set of recipes was ever lost, she would still have her secret pie recipes at hand. She also decided to memorize the most requested pie recipes so that she would not need the recipe box for them at all.
Modern Lesson:
The cache in a system is like Clara’s recipe box: it holds frequently needed data so the system can respond faster. If the cache fails, everything slows down, much as Clara could not bake pies without her recipes. With cache replication (Redis Sentinel, for instance) or clustering, plus the main database as a fallback, core operations continue uninterrupted.
Technical Solution:
- Cache Replication and Clustering: For Redis, set up Redis Sentinel or Redis Cluster. Redis Sentinel monitors a master and its replicas and promotes a replica to master if the original master fails. Redis Cluster partitions the data and distributes it across several nodes for availability.
For Memcached, use client-side key distribution (for example, consistent hashing) so that data is spread over several Memcached instances. If one instance goes down, the client can use the other servers or fall back to the database.
- Graceful Fallback to Database: Make sure that when the cache fails, the application fails gracefully and falls back to the database. A cache miss or cache outage should cost only performance; it should not break the functionality the application is meant to provide.
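The graceful fallback can be sketched as a cache-aside read that treats a cache outage like a miss. This is a minimal illustration: `FlakyCache`, `CacheUnavailable`, and `get_with_fallback` are hypothetical names, the database is a plain dict, and a real Redis or Memcached client would raise its own exception types.

```python
class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the cache cannot be reached."""


class FlakyCache:
    """In-memory stand-in for a cache server that can be switched off."""

    def __init__(self):
        self.store = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise CacheUnavailable(key)
        return self.store.get(key)

    def set(self, key, value):
        if not self.available:
            raise CacheUnavailable(key)
        self.store[key] = value


def get_with_fallback(key, cache, database):
    """Cache-aside read: try the cache, fall back to the database on a miss or outage."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheUnavailable:
        return database[key]  # cache is down: serve from the database
    value = database[key]
    try:
        cache.set(key, value)  # repopulate the cache after a miss
    except CacheUnavailable:
        pass  # a failed cache write must not fail the request
    return value


database = {"pie:apple": "recipe"}
cache = FlakyCache()
first = get_with_fallback("pie:apple", cache, database)  # miss, then cached
cache.available = False                                  # simulate a cache outage
during_outage = get_with_fallback("pie:apple", cache, database)
```

During the outage the request is slower because it goes to the database, but it still succeeds, which is the behavior the bullet above asks for.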
— — — — — — — — — — — — — — — — — — — — -
And…