Hi, I'm trying to get a better understanding of how distributed search works. We are on Pega 7.2.1.
We have 4 nodes in the environment. Three of these nodes have been defined as Search Indexing nodes and work as expected.
I can bring down each of the search nodes in turn, and as long as one node that is defined as a search node is up, search continues to work.
When I am left with only the node that is not defined as a search node, search fails as expected.
We then start bringing the nodes back, so that one node without a search index and one node with a search index are up.
At this point search still does not work in the portal.
We then bring a second search node back, so 3 of our 4 nodes are up, 2 of them with a search index defined.
At this point search works.
Why can we bring nodes down and have search working with only one search indexing node available, but when we start bringing nodes back, search does not work until two search indexing nodes are available?
In the scenario where you have a node which is not an index node and you bring up one index node, write operations would fail because the quorum is not met. By default, Elasticsearch requires a quorum of shard copies, int((primary + replicas) / 2) + 1, to be active for writes to succeed. In your scenario that is 2, i.e. at least 2 index nodes should be active for writes to succeed.
However, this quorum requirement does not apply to search requests.
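To make the quorum arithmetic concrete, here is a minimal sketch in Python. It assumes the default quorum rule of older Elasticsearch releases and assumes that 3 index nodes means 1 primary plus 2 replicas, as in the scenario above. The function names are hypothetical and are not part of Pega or Elasticsearch, and the sketch does not model any special-casing for small replica counts.

```python
def write_quorum(number_of_replicas):
    """Shard copies that must be active for a write, assuming the rule
    int((primary + replicas) / 2) + 1 with exactly one primary."""
    total_copies = 1 + number_of_replicas  # 1 primary + its replicas
    return total_copies // 2 + 1


def can_write(active_copies, number_of_replicas):
    """True if enough shard copies are active to satisfy the quorum."""
    return active_copies >= write_quorum(number_of_replicas)


# Assumption: 3 index nodes -> 1 primary + 2 replicas.
print(write_quorum(2))   # quorum for 3 copies
print(can_write(1, 2))   # only one index node up
print(can_write(2, 2))   # two index nodes up
```

With one index node up the write check fails, and with two it passes, which matches the "at least 2 nodes" figure above. Searches are not subject to this check.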
Do you notice any exceptions in the log when you perform a search operation?
If the search is run on the node that has the indices on it, does it return results?
I have tested this locally, and below are my observations for 3 index nodes and 1 non-index node:
1. 3 index nodes down, 1 non-index node alive (no index nodes alive) -> Search doesn't work (expected)
2. 2 index nodes and 1 non-index node down (1 index node is alive) -> Search works fine
3. 3 index nodes down, 1 non-index node alive, then 1 index node is brought up -> Search doesn't work
4. All 4 nodes down, then 1 index node is brought up, followed by the non-index node (in this sequence) -> Search doesn't work
The issue occurs because the number of replicas is reconfigured each time an index node is added, but when nodes are brought down, the replica count is not updated.
During Elasticsearch node start-up, it tries to make sure all shards (including replica shards) are allocated. In scenarios 3 and 4, the cluster state remains RED because even the primary shard cannot be allocated. All search requests fail in this state.
When an additional index node is added, it is able to assign the primary as well as 1 replica shard, and the cluster moves to the YELLOW state.
However, in scenario 2, which is similar to scenarios 3 and 4 except that the index node has not been restarted, the cluster state remains YELLOW because the primary shard is still in an allocated state.
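The RED/YELLOW distinction above can be illustrated with a small Python sketch. It simply classifies the health of a single shard group from its allocation state: RED when the primary is unassigned, YELLOW when the primary is assigned but some replicas are not, GREEN when everything is assigned. The function name and the replica count of 2 are assumptions for illustration (3 index nodes taken as 1 primary plus 2 replicas); the sketch does not query a real cluster or explain why a shard is or is not allocated.

```python
def cluster_health(primary_assigned, replicas_assigned, replicas_configured):
    """Classify health for one shard group from its allocation state."""
    if not primary_assigned:
        return "RED"      # primary missing: searches fail
    if replicas_assigned < replicas_configured:
        return "YELLOW"   # primary serves searches, some replicas missing
    return "GREEN"        # primary and all replicas allocated


REPLICAS = 2  # assumption: replica count left at 2 after nodes go down

# Scenario 2: surviving index node still holds the allocated primary.
print("scenario 2:", cluster_health(True, 0, REPLICAS))
# Scenarios 3 and 4: restarted index node cannot allocate the primary.
print("scenarios 3/4:", cluster_health(False, 0, REPLICAS))
# A second index node comes up: primary plus 1 replica get assigned.
print("second index node up:", cluster_health(True, 1, REPLICAS))
```

Under these assumptions, search works in any state other than RED, which lines up with the observations listed for each scenario.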