This is my personal summary of the book Software Architecture: The Hard Parts: Modern Trade-Off Analyses for Distributed Architectures. It’s important to note that this is only written to jog my memory - it is almost impossible to capture this book as a summary as there is so much in it. I highly recommend buying this book for any software architect working in complex, large systems. It’s not an easy read, but it covers many of the problems and trade-offs moving from a monolith to a microservice or service based architecture.
Chapter 1 - What Happens When There Are No “Best Practices”?
Architectural Decision Records (ADR)
Each ADR has a title (a short noun phrase containing the architecture decision) and is made up of:
- Context - short one- or two-sentence description of the problem, and possible solutions.
- Decision - state the architecture decision with detailed justification of the decision
- Consequences - any consequences after the decision is applied, and trade-offs considered
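To make the ADR sections above concrete, here is a minimal sketch of a record held as data and rendered as text. The class name `ADR`, the `render` method and the example decision are all my own assumptions for illustration, not a format prescribed by the book.

```python
from dataclasses import dataclass


@dataclass
class ADR:
    """One architecture decision record with the sections listed above."""
    title: str         # short noun phrase naming the decision
    context: str       # the problem and the options considered
    decision: str      # the decision and its justification
    consequences: str  # what follows from the decision, including trade-offs

    def render(self) -> str:
        """Return the record as plain text, one labelled section per block."""
        sections = [
            ("ADR", self.title),
            ("Context", self.context),
            ("Decision", self.decision),
            ("Consequences", self.consequences),
        ]
        return "\n\n".join(f"{label}: {body}" for label, body in sections)


# Hypothetical record, invented purely for illustration.
adr = ADR(
    title="Asynchronous messaging between order and payment services",
    context="Order must notify payment; the options are a REST call or a queue.",
    decision="Use a queue, so order stays responsive when payment is down.",
    consequences="Better fault tolerance, but error handling becomes more complex.",
)
print(adr.render())
```

Keeping records as structured data like this would also make it easy to store them next to the code they describe.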
Architecture Fitness Functions
- Any mechanism that performs an objective integrity assessment of some architecture characteristic or combination of architecture characteristics
- Automation and feedback are central to teams that want to move quickly without fear of breaking things
- Fitness functions validate architecture characteristics, as opposed to unit tests which validate domain criteria. An architect can decide whether a fitness function or unit test is needed by asking the question: “Is any domain knowledge required to execute this test?” If the answer is “yes,” then a unit/function/user acceptance test is appropriate; if “no,” then a fitness function is needed.
- Examples: tests to ensure cyclomatic complexity rules are not breached; the DI container can be verified; naming conventions are adhered to
- Build these into your build pipeline - nice opportunity for architects to be involved in the code
- Be careful to not take this too far - don’t want to overburden developers with too many hoops and slow the team down unnecessarily
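As a rough sketch of what a fitness function in a build pipeline could look like, the check below fails when components depend on each other in a cycle. No domain knowledge is needed to run it, which is what makes it a fitness function rather than a unit test. The function name `find_cycle` and the component names are assumptions of mine; a real pipeline would extract the dependency map from the code base rather than hard-code it.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of component names, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {name: WHITE for name in deps}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for target in deps.get(node, []):
            if state.get(target, WHITE) == GREY:
                # target is already on the current path, so we closed a loop
                return path[path.index(target):] + [target]
            if state.get(target, WHITE) == WHITE:
                found = visit(target)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for name in deps:
        if state[name] == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Hypothetical component dependency maps, hard-coded for illustration.
clean = {"billing": ["shared"], "reporting": ["shared"], "shared": []}
tangled = {"billing": ["reporting"], "reporting": ["billing"]}

for label, deps in (("clean", clean), ("tangled", tangled)):
    cycle = find_cycle(deps)
    if cycle:
        print(f"{label}: fitness function failed: " + " -> ".join(cycle))
    else:
        print(f"{label}: fitness function passed")
```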
Chapter 2 - Discerning Coupling in Software Architecture
- Coupling - Two parts of a software system are coupled if a change in one might cause a change in the other
- Choosing to decouple components adds complexity and, taken too far, can overcomplicate your system and render it unworkable. Architects need to weigh up the trade-offs of decoupling and apply it only where the pros outweigh the cons
- Advice for modern trade-off analysis in software architecture:
  - Find what parts are entangled together.
  - Analyze how they are coupled to one another.
  - Assess trade-offs by determining the impact of change on interdependent systems.
- Static vs dynamic coupling - static coupling describes how services are wired together (operational dependencies e.g. database), whereas dynamic coupling describes how services call one another at runtime (communication dependencies e.g. calling another service)
- The architecture quantum (an independently deployable unit together with everything it is statically coupled to) has a major impact on the scalability and elasticity of your systems. It’s all very well having loosely coupled services that operate independently, but if they all rely on a single database and the database locks up, all services are affected
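A simplified way to picture the shared database problem is to group services by the static dependencies they share: every service in a group is affected when the shared dependency fails. The sketch below only looks at static coupling, so it is a rough approximation of the book's quantum idea rather than its full definition. The function name and the service and database names are assumptions for illustration.

```python
from typing import Dict, List, Set


def group_by_shared_dependencies(static_deps: Dict[str, Set[str]]) -> List[Set[str]]:
    """Group services that share a static dependency, directly or through another service."""
    groups = []  # each entry: (set of services, set of their dependencies)
    for service, deps in static_deps.items():
        merged_services, merged_deps = {service}, set(deps)
        remaining = []
        for services, group_deps in groups:
            if group_deps & merged_deps:
                # shares at least one dependency, so it joins the same group
                merged_services |= services
                merged_deps |= group_deps
            else:
                remaining.append((services, group_deps))
        groups = remaining + [(merged_services, merged_deps)]
    return [services for services, _ in groups]


# Hypothetical wiring: two services share one database, one has its own.
wiring = {
    "orders": {"main_db"},
    "payments": {"main_db"},
    "notifications": {"notify_db"},
}
for group in group_by_shared_dependencies(wiring):
    print(sorted(group))
```

Here `orders` and `payments` end up in one group, so a lock-up of `main_db` takes both down, while `notifications` is unaffected.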
Chapter 3 - Architectural Modularity
Describes why you should break up a monolithic application.
Architecture should not be over-engineered - complex distributed systems come with a trade-off in both the skills required and the processes/infrastructure needed to handle them. Architects shouldn’t break a system into smaller parts unless clear business drivers exist. The business drivers for breaking applications into smaller parts are competitive advantage and agility (speed to market). As applications grow in size, the modularity drivers present a strong case for splitting an application into smaller, independently deployable parts e.g. a microservice architecture.
Competitive advantage requires availability and scalability, agility requires testability, deployability and maintainability.
- Maintainability - monolithic applications are harder to maintain as changes affect all levels of the application, which may require input from multiple teams. In a microservices architecture only the specific services affected need to change.
- Testability - more difficult in a monolithic architecture and results in longer build times as all tests need to run with every change. Microservices mean smaller testable units and fewer tests that need to run when changes are made.
- Deployability - microservices allow for the deployment of a single piece of the application only. Deployments are lower risk as only one component is changing, not the whole application.
- Scalability is easier to achieve with microservices as effort can be put into the areas that are under load, rather than changes and deployments to the whole app.
- Availability - a single component crashing can leave the rest of the application still available.
All of these benefits of a microservices architecture diminish as communication between the services increases, which in turn raises a strong case for asynchronous communication which decreases runtime dependencies.
Chapter 4 - Architectural Decomposition
Describes how you should break up a monolithic application.
Component based decomposition
- Break the application up into multiple components
- When breaking monolithic applications into distributed architectures, build services from components, not individual classes.
- Usually easiest to move to a service based architecture first as a stepping stone to a microservices architecture - this allows you to keep the existing database and focus on the domain and functional partitioning prior to tackling database decomposition
Tactical forking
- Useful when the application is a big ball of mud and doesn’t have many obvious components
- Clone a full copy of the application, and for example, deploy to different areas of the business, in each version deleting what is no longer used
- Teams can start on this immediately with very little risk or analysis
- Still requires concerted effort as the code is in no better state
Chapter 5 - Component-Based Decomposition Patterns
Steps moving from monolithic to microservices application:
- Identify and size components
  - Identify and catalog the architectural components (logical building blocks) of the application and then properly size the components (e.g. number of statements). Components should be of a similar size relative to others and this can be reflected using statement count, file count, and percentage of total statements across the code base. If a component is too big, try to find subdomains to split it into smaller components - but don’t contrive these for the sake of it.
- Gather common domain components
  - Consolidating common domain functionality helps eliminate duplicate services when breaking apart a monolithic system. Often there are only very subtle differences among common domain functionality that is duplicated throughout the application, and these differences can be easily resolved within a single common service (or shared library).
  - For example, common use of SMTPConnection hints at copied logic to send email, and would benefit from a common service
- Flatten components
  - Don’t have source code at different namespace levels as you start to lose sight of what the components are. Rather move all code into leaf nodes of the same level so components are easily identifiable e.g. myapp.reporting.mis, myapp.reporting.financial, myapp.reporting.shared. Don’t put shared code into a root namespace as it becomes unclear what those classes are for.
- Determine component dependencies
  - Build a dependency diagram of the monolith to determine coupling between components. Minimal dependencies mean the monolith is a good candidate for refactoring into separate services with minimal effort. High levels of dependencies may make refactoring too difficult and a rewrite should be considered instead.
- Create component domains
  - Logically group components together - multiple components may fall into a single service e.g. a billing service may have both a payment component and a history component.
- Create domain services
  - Individually deployable pieces of software that encapsulate a specific domain function. Generally each of these will be coupled with a monolithic database, but this serves as a great starting point for seeing whether services can be broken down further
- Create microservices
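The "identify and size components" step lends itself to a small script. The sketch below reports each component's share of total statements and flags any component that sits well above the mean. The 1.5 standard deviation threshold, the function name and the statement counts are all assumptions of mine for illustration, not figures from the book.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def size_report(statements: Dict[str, int],
                max_deviations: float = 1.5) -> List[Tuple[str, float, bool]]:
    """Return (component, percent of total statements, too_big) per component."""
    counts = list(statements.values())
    total = sum(counts)
    average = mean(counts)
    spread = pstdev(counts)
    report = []
    for name, count in sorted(statements.items()):
        percent = round(100 * count / total, 1)
        # flag components far above the mean size as candidates for splitting
        too_big = spread > 0 and (count - average) / spread > max_deviations
        report.append((name, percent, too_big))
    return report


# Hypothetical statement counts per leaf namespace.
statement_counts = {
    "myapp.billing.payment": 400,
    "myapp.billing.history": 450,
    "myapp.reporting.financial": 500,
    "myapp.reporting.mis": 420,
    "myapp.ticket": 2800,
}
for name, percent, too_big in size_report(statement_counts):
    marker = "  <- consider splitting" if too_big else ""
    print(f"{name}: {percent}%{marker}")
```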
Chapter 6 - Pulling Apart Operational Data
When to consider breaking apart your data
- Change control - when you have a large number of services, co-ordinating database schema changes becomes onerous, especially if services are managed by different teams.
- Connection management - each service will generally have its own connection pool, which will eventually result in too many connections to a shared database.
- Scalability - more databases result in fewer connection waits as services grow, and spread the I/O load that would otherwise fall on a single database.
- Fault tolerance - remove single points of failure.
- Architectural quantum - splitting the databases increases the quanta count, allowing for the easier deployment of separate services as dependencies are removed.
- Database type optimization - use the right tool for the job.
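The connection management point is simple arithmetic: each service instance holds its own pool, so connections to a shared database grow with instances times pool size. A quick sketch, where the service names, instance counts, pool sizes and the database limit are all made-up numbers for illustration:

```python
from typing import Dict, Tuple


def total_connections(services: Dict[str, Tuple[int, int]]) -> int:
    """services maps a name to (instance count, pool size per instance)."""
    return sum(instances * pool for instances, pool in services.values())


# Hypothetical deployment against one shared database.
shared = {"orders": (4, 10), "payments": (2, 10), "reporting": (6, 20)}
limit = 150  # assumed connection limit of the database

used = total_connections(shared)
print(f"shared database: {used} of {limit} connections")

# Moving reporting to its own database removes its pools from the shared one.
after_split = {name: cfg for name, cfg in shared.items() if name != "reporting"}
print(f"after split: {total_connections(after_split)} of {limit} connections")
```

With these numbers the shared database needs 180 connections and is over its limit, while splitting out reporting brings it down to 60.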
When to consider keeping data together?
- Data integrity - foreign keys, triggers, etc
- Transactions - using different databases can mean losing the concept of a single unit of work
Trade-offs! You need to decide what is more important, and then protect against the lost benefit.
Decomposing Monolithic Data
- Analyze data and create data domains
- Assign tables to data domains
- Separate database connections to different domains - when data is needed from other domains it should be accessed via services rather than accessing those databases directly
- Move schemas to separate databases or database servers - this can be done using backup/restore or with replication. Merge replication would be the safest and easiest method, but does require the most setup
- Switch over to independent servers/databases
Chapter 7 - Service Granularity
Granularity Disintegrators - reasons to break apart a service
- Service scope and function - single responsibility principle, although this is highly subjective e.g. is notifying a customer a single responsibility or is notifying a customer via Email specifically a single responsibility?
- Code volatility - rate at which code changes. Areas which change frequently justify being moved out into another service - less testing and deployment impact on the areas that do not change frequently.
- Scalability and throughput - high variances in scalability and throughput may justify splitting up a service.
- Fault tolerance - where components have differing levels of fault tolerance, components that crash regularly may take other components out of action. You may also want to employ more fault tolerance measures for critical services.
- Security - certain components may need to be more secure than others.
- Extensibility - the ability to add functionality as requirements grow. For example, with payment types it may be worthwhile separating these into different services from the outset, so that as you add payment types you don’t need to retest and redeploy all the other payment types.
Granularity Integrators - reasons to keep components in a single service
- Database Transactions - a requirement for ACID transactions is a good argument for not separating components into different services.
- Workflow and Choreography - increasing the number of services increases the amount of communication required between these services…and a central key service going down may impact all other services. It is important to consider the communication dependencies with respect to your overall fault tolerance.
- Shared code - sometimes a high level of common code means it may make more sense to keep components in the same service.
- Data relationships - splitting components that own/access the same data may negatively impact complexity, performance, fault tolerance and scalability.
The choice to integrate or disintegrate must be driven by solid analysis, and the trade-offs weighed up in the light of the business need.
Chapter 8 - Code Reuse
Reuse of code in monolithic applications is common and encouraged, whereas sharing code in microservices creates dependencies and hence coupling, so reuse is generally discouraged. Patterns for reuse:
- Literally copy and paste
  - Rarely used, but is useful for static code that has a low chance of bugs and will rarely change
- Common class or jar file
  - Dependency management is easy, but making changes means updating every app/service that uses it
  - Versioning is essential - allows you to upgrade only the apps that require the new version
  - Performance and scalability are not affected as the code is compiled in
  - Best suited to low change areas of code
- Common service which is separately deployed
  - Easier deployment as only one instance
  - Riskier deployment as only one instance which is used by everything
  - Introduces network latency to service calls…impact on performance and scalability
  - Versioned APIs can reduce deployment risk, but each dependent application must then be changed and redeployed to use a new version
  - May need to scale with other services as it is not deployed as a component of them
  - Fault tolerance is an issue as you introduce another point of failure - this can be mitigated with multiple instances but that introduces complexity
  - Best used in environments with multiple languages and technologies and when shared functionality changes often
- Sidecars and service mesh
  - Allows a clean way to spread cross-cutting concerns across a distributed architecture
  - Implement sidecars in a service mesh that allows you to share common operational concerns (not domain concerns)
  - Usually used for common operational requirements e.g. logging or monitoring
  - Allows for consistent implementation and infrastructure coordination
  - Sidecars can become large and complex
  - It is very important that the service mesh is only used for operational concerns, otherwise you reintroduce the coupling problems you had before the move to microservices! Shared infrastructure items only!
- As developers we have been taught to encourage reuse of code
  - However, this is not optimal in fast changing areas of code - changes in shared code require regression testing in every place that uses it, and changes become difficult to manage
  - Shared code works very well in areas of slow change e.g. an operating system or infrastructure points - it does not necessarily work well on large systems where change occurs frequently e.g. domain code
Chapter 9 - Data Ownership and Distributed Transactions
- Usually the service that writes to a table owns the table…but this gets complex when multiple services need to write to the same table
- Single ownership scenario - only one service writes to a table
- Common ownership scenario - most or all services write to a common table e.g. audit table - often solved by adding a new service to deal with the common area
- Joint ownership scenario - more than one service writes to a single table. Can be solved by:
  - Splitting the table
  - Consolidating services
  - Shared data domain e.g. move tables into a single schema and these are shared by services in that domain - not always encouraged but can be necessary to preserve data integrity, fault tolerance, performance, etc. Does result in coupling and more regression testing needed for changes to this area of the system
  - Delegation - make only one service the owner of the table and all other services communicate with that service for updates. Does result in service coupling and difficulty with transactions
- Transactions within a single database are usually ACID and will roll back all changes on failure anywhere in the chain of modifications
- Distributed transactions do not support ACID - instead they support BASE
  - BA - Basic availability - all services must be available for the transaction
  - S - Soft state - describes the situation where a transaction is in progress - you cannot determine which service has completed or how far along it is in doing its individual work
  - E - Eventual consistency
Eventual Consistency Patterns
- Background Synchronisation Pattern
  - Background task that runs periodically to keep data in sync
  - Slowest in terms of eventual consistency
  - Good for overall responsiveness as the user doesn’t need to wait for the entire business transaction to complete
  - Tightly coupled to the tables owned by the services, so must be changed and tested with the services
  - May require duplicated business logic in both the services and the background sync job
- Orchestrated Request-Based Pattern
  - Attempts to process the entire distributed transaction during the business request
  - Could be a separate service, or designate one of the existing services to be the primary service in the transaction and co-ordinate all the necessary steps
  - Lends itself to tight coupling and dependencies
  - Can overload one of the services with responsibilities that aren’t really the function of the affected service
  - Not the best for performance - as the orchestrator needs to co-ordinate in real-time across the services - can add network hops and latency
  - Error handling becomes complex - what if one of the services fails but the others have completed?
- Event-Based Pattern
  - Events used in conjunction with an asynchronous publish/subscribe model
  - Services highly decoupled
  - Good performance
  - Error handling remains an issue - what if one of the subscribers fails to process the message?
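To illustrate the event-based pattern and its error handling question, here is a small in-process sketch. The `EventBus` class stands in for a real message broker and runs synchronously, which is an assumption made to keep the example runnable; the service names and the dead-letter list are also hypothetical. One possible answer to "what if a subscriber fails?" is to park the failed event so it can be retried or repaired later.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Handler = Callable[[dict], None]


class EventBus:
    """In-process stand-in for a publish/subscribe broker (illustration only)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)
        self.dead_letters: List[Tuple[str, dict, str]] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subscribers[topic]:
            try:
                handler(event)
            except Exception as error:
                # A failing subscriber does not block the others; the event
                # is parked so it can be replayed once the problem is fixed.
                self.dead_letters.append((handler.__name__, event, str(error)))


contracts = {"c1": "active"}  # hypothetical data owned by the contract service


def contract_service(event: dict) -> None:
    contracts.pop(event["customer_id"], None)


def billing_service(event: dict) -> None:
    raise RuntimeError("billing database unavailable")


bus = EventBus()
bus.subscribe("customer_removed", contract_service)
bus.subscribe("customer_removed", billing_service)
bus.publish("customer_removed", {"customer_id": "c1"})

print("contracts:", contracts)
print("dead letters:", bus.dead_letters)
```

Until the parked event is replayed, the contract data and the billing data disagree, which is exactly the eventual consistency window this pattern accepts.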
Chapter 10 - Distributed Data Access
In a monolithic application, data can be accessed from anywhere in the system using SQL as there is a shared database. However, accessing data that an application/service no longer owns is hard, as you introduce network calls and complexity trying to retrieve data that is no longer part of the service needing the data.
Interservice Communication Pattern
- Simplest solution - use remote access protocol to request data from other services
- Slow performance due to network latency
- Increased activity as you can’t “JOIN” for data
- Service coupling - you need another service to be alive for your service to function
Column Schema Replication Pattern
- Replicate data between databases or schemas
- Usually implemented asynchronously to improve performance, so there may be inconsistency
- Need to be careful about what is done with replicated data - other services should not be able to update replicated data
- Good performance boost as services can do JOINs to get associated data
- No service dependency or fault tolerance issues
- Can be managed through pub/sub event streaming, but this does add a lot of complexity
Replicated Caching Pattern
- Replicated in-memory caching so each service has the data it needs
- Each service has its own cached data (replicated), or a central distributed cache can be used - although the latter does introduce dependency and fault tolerance issues, as well as network latency
- Replicated caches are great for individual service performance, fault tolerance and scalability
- Complex from an infrastructure perspective and each service has a dependency on its own cache
- Data volumes may be an issue - there is only so much that can be cached in memory, particularly if you have a lot of services
- High rate of change makes it difficult to keep the replicas in sync
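A toy version of the replicated caching idea, under the assumption that only the owning service writes and every other service gets a read-only copy that is kept in sync. Real caching products do the replication over the network and asynchronously; this in-process `ReplicatedCache` class is hypothetical and only shows the ownership rule.

```python
from types import MappingProxyType
from typing import Dict, List, Mapping


class ReplicatedCache:
    """Owner-writes, everyone-reads cache with in-process replicas."""

    def __init__(self) -> None:
        self._owner_copy: Dict[str, str] = {}
        self._replicas: List[Dict[str, str]] = []

    def add_replica(self) -> Mapping[str, str]:
        """Create a replica and hand back a read-only view of it."""
        replica = dict(self._owner_copy)
        self._replicas.append(replica)
        return MappingProxyType(replica)

    def put(self, key: str, value: str) -> None:
        """Called by the owning service only; the change is pushed to all replicas."""
        self._owner_copy[key] = value
        for replica in self._replicas:
            replica[key] = value


# Hypothetical usage: a profile service owns the data, a wishlist service reads it.
profile_cache = ReplicatedCache()
wishlist_view = profile_cache.add_replica()
profile_cache.put("c1", "Ada")
print(wishlist_view["c1"])

try:
    wishlist_view["c1"] = "Bob"  # non-owners cannot update replicated data
except TypeError as error:
    print("write rejected:", error)
```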
Data Domain Pattern
- Data in a shared schema, that more than one service has access to
- Services remain decoupled from each other - only the shared tables create a dependency
- Data consistency and integrity remains high
- Multiple services will need to change if the structure of this data changes
- Security is compromised as multiple services have access to the same data
Chapter 11 - Managing Distributed Workflows
Orchestration Communication Style
- Synchronizes disparate services
- In microservices architecture, there is usually one orchestrator per workflow (not one global orchestrator for all workflows)
- One area where state and behavior is managed
- Error handling handled centrally
- Can add retry capability at the orchestrator level
- Communication all through single mediator which can create a bottleneck
- Single point of failure
- Doesn’t scale well as it needs to monitor multiple co-ordination points
Choreography Communication Style
- Each service executes steps in the “dance” in predefined order
- Simpler to understand for the “happy” path, but gets complicated when errors occur
- One problem is no obvious owner for workflow state.
- Responsiveness and scalability - less co-ordination required
- Fault tolerance easier to handle
- Less service coupling
- Error management difficult
- State management difficult
- Error handling more complex as each service must have more workflow knowledge
- Recoverability more difficult as no central service to do retries and co-ordinate state
Summary: Choreography better for high throughput scenarios with simpler error handling requirements. Orchestration best suited for complex workflows that include boundary and error conditions.
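A minimal sketch of the orchestration style, showing why central state, error handling and retries are easy there: one function owns the workflow and calls each service in turn. The step functions, the retry count and the state keys are assumptions for illustration, and the services are plain functions rather than network calls.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[dict], None]]


def run_workflow(steps: List[Step], state: dict, max_attempts: int = 3) -> dict:
    """Call each step in order, retrying failures, and track progress centrally."""
    state["completed"] = []
    for name, call in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                call(state)
                state["completed"].append(name)
                break
            except Exception as error:
                if attempt == max_attempts:
                    # out of retries: record where the workflow stopped
                    state["failed"] = (name, str(error))
                    return state
    return state


attempts = {"payment": 0}


def place_order(state: dict) -> None:
    state["order_id"] = 42


def take_payment(state: dict) -> None:
    attempts["payment"] += 1
    if attempts["payment"] < 2:  # simulate one transient failure
        raise ConnectionError("payment service timed out")
    state["paid"] = True


def ship_order(state: dict) -> None:
    state["shipped"] = True


result = run_workflow(
    [("place_order", place_order), ("take_payment", take_payment), ("ship_order", ship_order)],
    {},
)
print(result)
```

In a choreographed version there would be no `run_workflow`; each service would have to know what comes next and what to do on failure, which is where the extra error handling complexity comes from.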
Chapter 12 - Saga Patterns
Saga patterns describe the movement of data between services, where a sequence of local transactions each publish an event, which triggers the next update in the sequence. The saga issues compensating updates if anything fails along the way. Various types of pattern exist, but each needs to choose between:
- Synchronous vs asynchronous communication
- Atomic vs eventual consistency
- Orchestrated vs choreographed co-ordination
Any combination of these is valid in specific circumstances, but all come with their trade-offs.
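The compensating updates are the part of a saga I find easiest to understand in code. Below is a sketch of one combination only (synchronous, orchestrated, eventually consistent); the step names, the `run_saga` function and the data are hypothetical, and each "local transaction" is a plain function rather than a real service call.

```python
from typing import Callable, List, Tuple

Action = Callable[[dict], None]
SagaStep = Tuple[str, Action, Action]  # name, local transaction, compensating update


def run_saga(steps: List[SagaStep], data: dict) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    done: List[Tuple[str, Action]] = []
    for name, action, compensate in steps:
        try:
            action(data)
            log.append(f"{name}: done")
            done.append((name, compensate))
        except Exception as error:
            log.append(f"{name}: failed ({error})")
            for done_name, undo in reversed(done):
                undo(data)
                log.append(f"{done_name}: compensated")
            break
    return log


def reserve_stock(data: dict) -> None:
    data["stock"] -= 1


def release_stock(data: dict) -> None:
    data["stock"] += 1


def charge_card(data: dict) -> None:
    raise RuntimeError("card declined")


def refund_card(data: dict) -> None:
    data["charged"] = False


order = {"stock": 5, "charged": False}
saga_log = run_saga(
    [("reserve_stock", reserve_stock, release_stock),
     ("charge_card", charge_card, refund_card)],
    order,
)
for line in saga_log:
    print(line)
```

Note that between the failure and the compensation the stock count is briefly wrong, which is the price of giving up atomic consistency.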
Chapter 13 - Contracts
- Strict vs loose contracts - each has its trade-offs; strict contracts mean higher data integrity, easier versioning, and easier verification at build time; loose contracts mean loose coupling and fewer problems with backward compatibility
- Contracts between services should be on a “need to know” basis - don’t add properties in case you may need them in the future - you are just adding needless fragility
- Be wary of stamp coupling - where a large document is passed between services and each service only reads/writes a small portion of the document. This can introduce unnecessary dependencies and each service has to handle data (over the network or from the DB) that it does not require - this can add up in high volume systems
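A rough sketch of the difference between the two contract styles, and of the "need to know" idea: the loose consumer picks out only the fields it needs and ignores everything else, while the strict check rejects any payload that does not match the agreed shape exactly. Function names, field names and payloads are assumptions for illustration.

```python
from typing import Any, Dict, Iterable


def pick_needed(payload: Dict[str, Any], needed: Iterable[str]) -> Dict[str, Any]:
    """Loose contract: keep only the fields this consumer needs, ignore the rest."""
    needed = list(needed)
    missing = [name for name in needed if name not in payload]
    if missing:
        raise KeyError(f"missing required fields: {missing}")
    return {name: payload[name] for name in needed}


def check_strict(payload: Dict[str, Any], schema: Dict[str, type]) -> None:
    """Strict contract: exactly the declared fields, each of the declared type."""
    if set(payload) != set(schema):
        raise ValueError(f"fields differ from contract: {sorted(set(payload) ^ set(schema))}")
    for name, expected in schema.items():
        if not isinstance(payload[name], expected):
            raise TypeError(f"{name} should be {expected.__name__}")


# Hypothetical payload after the producer added a new field.
profile_v2 = {"customer_id": "c1", "name": "Ada", "nickname": "ada99"}

# The loose consumer keeps working and only handles what it needs.
print(pick_needed(profile_v2, ["customer_id", "name"]))

# The strict consumer notices the change straight away.
try:
    check_strict(profile_v2, {"customer_id": str, "name": str})
except ValueError as error:
    print("strict check failed:", error)
```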
Chapter 14 - Data Analytics
Data Warehouse pattern
- Data extracted from many sources and transformed into a single schema
- Aggregation and analysis done on the warehouse, removing the load from the transactional system
- BI Reports and dashboards
- Usually with a SQL Interface
- Very brittle - dependent on the source schema
- Domain knowledge - architects, developers, DBAs and data scientists must all co-ordinate changes and understand the evolution of those changes
- Adds complexity to the system
- Often don’t produce value relative to the amount of work used to build/maintain
- Bottlenecks occur trying to pull data from various sources
- Operational and analytical data are often not the same
Data Lake pattern
- Inverse of the data warehouse pattern
- Meant to be loose structure - a variety of data sources can be held in the lake
- Data extracted from many sources
- Still takes a centralized view of data
- Difficulties in discovery of data and relationships
- Challenges with regards to personal information
- Requires ad hoc transformation
An alternative (modern) approach
Data Mesh pattern
- Each service has its own data product quantum (DPQ)
- The DPQ is responsible for modelling and transforming the data for the service - either synchronously or asynchronously
- An analytics quantum can then co-ordinate reporting across multiple DPQs
- Well suited to microservices
- Follows modern architecture principles and engineering practices
- Decouples operational and analytical data
- Allows loosely coupled evolution of analytical capabilities
- Requires contract coordination with DPQ
- Requires asynchronous communication and eventual consistency
Chapter 15 - Build Your Own Trade-Off Analysis
- Create coupling diagrams when breaking apart monolithic applications
- Analyse coupling points and build a matrix of possibilities
- Assess trade-offs of the possibility matrix to help choose the correct path
- Ensure you are comparing apples with apples
- Ensure you have context - other factors may influence your decision e.g. regulations, budget constraints, etc
- Model relevant domain cases in multiple ways and look at the trade-offs of each
- Try not to be swayed by products or evangelism - objectively choose the best approach regardless of “latest and greatest” claims - everything has trade-offs, make sure these are considered!