Search and indexing are vast topics, but if you are familiar with database technology, I will try to draw an analogy between the two.
When you write an SQL query to retrieve data from a database, it works well while the number of records is small, but performance drops as the number of records increases. The simplest suggestion for overcoming this is to create an index on the column used in the WHERE clause of the SQL statement. Indexes consume additional space in the database, but they speed up retrieval, so your queries run faster.
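To make the analogy concrete, here is a small sketch using Python's built-in sqlite3 module. The table name, column names, and data are made up for illustration and have nothing to do with the Pega schema. It asks SQLite for its query plan before and after an index is created on the column used in the WHERE clause.

```python
import sqlite3

# In-memory database with a hypothetical table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE work_items (id INTEGER PRIMARY KEY, status TEXT, label TEXT)"
)
conn.executemany(
    "INSERT INTO work_items (status, label) VALUES (?, ?)",
    [("Open" if i % 10 == 0 else "Resolved", f"item {i}") for i in range(1000)],
)


def query_plan(sql):
    """Return the plan steps SQLite chooses for a statement."""
    # The fourth column of EXPLAIN QUERY PLAN output is the readable detail.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


sql = "SELECT id FROM work_items WHERE status = 'Open'"

# Without an index, SQLite has to scan every row of the table.
plan_before = query_plan(sql)

# Index the column that appears in the WHERE clause.
conn.execute("CREATE INDEX idx_work_items_status ON work_items (status)")

# With the index in place, the plan switches to an index search.
plan_after = query_plan(sql)

print(plan_before)
print(plan_after)
```

The exact wording of the plan differs between SQLite versions, but the first plan should mention a scan and the second should mention the index name.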
So if databases already have this feature, why do I need full text search?
Lucene is one of the libraries that provide full-text search. In the Pega platform we don't expose each and every property as a column in the database, so we can't write SQL statements that perform well when they have to refer to values held in the storage stream. Also, because the data in the storage stream is hierarchical, it is not easy for an RDBMS to retrieve it efficiently using SQL. Full-text search engines address this by building inverted indices, which map each term to the documents that contain it. You can read more about inverted indices and full-text search at the Lucene website: http://lucene.apache.org
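As a rough illustration of the idea, the sketch below builds a tiny inverted index in plain Python. It is not how Lucene is implemented, since Lucene adds analyzers, scoring, on-disk segments, and much more. The document IDs and text are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents keyed by a made-up ID.
documents = {
    "W-1": "Customer reported a billing error on invoice",
    "W-2": "Invoice approved by manager",
    "W-3": "Customer requested address change",
}
index = build_inverted_index(documents)

print(search(index, "invoice"))
print(search(index, "customer invoice"))
```

A lookup touches only the entries for the query terms instead of scanning every document, which is why this structure suits "find the text anywhere" searches.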
So how does Pega use Lucene?
The Pega platform takes the data stored in the stream and indexes its content, so that the search control can retrieve any instance in which the search string was found anywhere in the document. The search landing page shows the details of the indices we have, and you can re-index from there. The full-text search index is maintained on the file system. Because the file system is specific to each node, only one node maintains the index. With Elasticsearch, in Pega 7.1.7 onwards, we can provide failover as well.
Why do we need the pr_sys_workindexer_queue?
The data in the database is not static. It keeps changing as instances are created, updated, and deleted, which means the Lucene index files must also be kept up to date with these changes. So whenever an instance changes in the Pega platform, we record its pzInsKey in the pr_sys_workindexer_queue table. Subsequently, the SystemWorkIndexer agent picks up the entry, gets the latest changes, and modifies the index files.
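The queue-then-index pattern described above can be sketched as follows. This is only a toy model. The dictionaries and the function names (save, delete, process_queue) are hypothetical stand-ins for the database, the pr_sys_workindexer_queue table, and the agent, and they do not reflect how the platform implements them.

```python
from collections import deque

database = {}          # instance key -> current text (stand-in for the DB)
index_queue = deque()  # stand-in for the pr_sys_workindexer_queue table
index = {}             # term -> set of instance keys (the search index)
indexed_terms = {}     # instance key -> terms currently indexed for it


def save(key, text):
    """Create or update an instance and note its key for re-indexing."""
    database[key] = text
    index_queue.append(key)


def delete(key):
    """Delete an instance and note its key so the index can drop it."""
    database.pop(key, None)
    index_queue.append(key)


def process_queue():
    """Stand-in for the indexing agent: apply queued changes to the index."""
    while index_queue:
        key = index_queue.popleft()
        # Remove whatever was indexed for this key before.
        for term in indexed_terms.pop(key, set()):
            index[term].discard(key)
            if not index[term]:
                del index[term]
        # Re-read the latest state; a missing record means it was deleted.
        text = database.get(key)
        if text is None:
            continue
        terms = set(text.lower().split())
        for term in terms:
            index.setdefault(term, set()).add(key)
        indexed_terms[key] = terms


save("W-1", "address change requested")
process_queue()
save("W-1", "billing error reported")  # update the same instance
process_queue()
print(sorted(index))
```

Only the key is queued, not the changed data, so the indexer always reads the latest state of the instance when it gets to the entry.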
What does search do?
When you search for a specific text, the Lucene index is consulted and the records that contain this text are returned. Since the index is hosted on one node, we use SOAP to connect to the search node if the node initiating the search is not the search node. This is internal to the Pega platform, and a Pega developer using the platform to build an application need not worry about it.
Would it retrieve the details which are present in the BLOB as well?
As I mentioned above, we create an index on the file system for the data in the BLOB. Search looks in this index to see whether the search text is present in any record. It doesn't query the database, so it doesn't check the BLOB in the literal sense, but it can still tell whether the search text is present in a record, even when the property is stored only in the BLOB in the DB. That said, search results cannot return the stored (exact) value of an arbitrary property. We put only a few top-level properties into the index with their exact values, so that results can be displayed when someone does a search. Any other property can be read by opening the instance from the DB, using the properties returned.
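The difference between "indexed" and "stored" values can be sketched like this. It is a simplified model, not the platform's implementation. The record structure, the Label property, and the choice of which properties are stored are assumptions made for the example; only pzInsKey is taken from the discussion above.

```python
# Properties whose exact values are kept in the index (hypothetical choice).
STORED_FIELDS = ("pzInsKey", "Label")

# Stand-in for the database, with nested data as it might sit in the BLOB.
database = {
    "WORK W-1": {
        "pzInsKey": "WORK W-1",
        "Label": "Billing dispute",
        "Customer": {"Name": "Ada", "City": "Lisbon"},
    },
    "WORK W-2": {
        "pzInsKey": "WORK W-2",
        "Label": "Address change",
        "Customer": {"Name": "Grace", "City": "Porto"},
    },
}


def iter_values(value):
    """Yield every leaf value of a nested dict/list structure as text."""
    if isinstance(value, dict):
        for child in value.values():
            yield from iter_values(child)
    elif isinstance(value, list):
        for child in value:
            yield from iter_values(child)
    else:
        yield str(value)


postings = {}  # term -> keys of records containing it (every property)
stored = {}    # key -> exact values of the few stored properties

for key, record in database.items():
    for text in iter_values(record):
        for term in text.lower().split():
            postings.setdefault(term, set()).add(key)
    stored[key] = {field: record[field] for field in STORED_FIELDS}


def search(term):
    """Return only the stored properties of each matching record."""
    return [stored[key] for key in sorted(postings.get(term.lower(), ()))]


hits = search("lisbon")                     # matched via a nested property
full_record = database[hits[0]["pzInsKey"]]  # open the rest from the DB
print(hits)
print(full_record["Customer"]["City"])
```

The search matches on a value that lives deep inside the record, yet the hit carries only the stored properties. Everything else comes from opening the instance with the returned key.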