Developer's Notebook: Notes on Apache SOLR

Solr is very easy to get started with. Google it. Download it. That's it.

Starting SOLR

$ ./solr start
Waiting to see Solr listening on port 8983 [\]
Started Solr server on port 8983 (pid=31965). Happy searching!

SOLR Client

Solr embeds a Jetty server that serves its admin interface.
It’s available by default at port 8983.
http://localhost:8983/solr/#/~cores/collection1

Since the release of SolrCloud, the latest versions of Solr ship with SolrCloud built in.

SOLR Architecture

What is an architecture without an obscure image!!!

Ya...I'm not gonna explain that.

Note that Solr comes with a default instance named ‘example’ and a core named ‘collection1’, but that core doesn’t contain any documents.

Solr's basic unit of information is a document, which is a set of data that describes something.
In the Solr universe, documents are composed of fields, which are more specific pieces of information.
Fields can contain different kinds of data. You can tell Solr about the kind of data a field contains by specifying its field type.

Field analysis (processing or digestion) tells Solr what to do with incoming data when building an index.

A field type definition can include four types of information:

The name of the field type (mandatory)
An implementation class name (mandatory)
If the field type is TextField (custom type), a description of the field analysis for the field type
Field type properties; depending on the implementation class, some properties may be mandatory
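
To make those four pieces concrete, here is a rough Python sketch that assembles a fieldType element like the ones in schema.xml. The type name text_example and the helper build_field_type are made up for illustration. The class names and positionIncrementGap are real Solr names, but treat the output as a sketch rather than a drop-in schema entry.

```python
import xml.etree.ElementTree as ET


def build_field_type(name, class_name, tokenizer=None, filters=(), **properties):
    """Build a <fieldType> element from the four kinds of information."""
    # 1 and 2: the name and the implementation class are mandatory.
    # 4: extra properties (for example positionIncrementGap) become attributes.
    field_type = ET.Element("fieldType", {"name": name, "class": class_name, **properties})
    # 3: an analysis description only makes sense for TextField types.
    if tokenizer is not None:
        analyzer = ET.SubElement(field_type, "analyzer")
        ET.SubElement(analyzer, "tokenizer", {"class": tokenizer})
        for filter_class in filters:
            ET.SubElement(analyzer, "filter", {"class": filter_class})
    return field_type


# "text_example" is a hypothetical type name used only for this sketch.
text_example = build_field_type(
    "text_example",
    "solr.TextField",
    tokenizer="solr.StandardTokenizerFactory",
    filters=["solr.LowerCaseFilterFactory"],
    positionIncrementGap="100",
)
print(ET.tostring(text_example, encoding="unicode"))
```

A non-text type would simply skip the tokenizer and filters arguments, leaving only the name, the class, and any properties.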

In Solr, the term core is used to refer to a single index and associated transaction log and configuration files (including schema.xml and solrconfig.xml, among others).
Cores are a tool primarily used to have different schemas in a single Solr instance.
Cores might or might not run on the same instance/machine.

With SolrCloud, a single index can span multiple Solr cores.
We call all of these SolrCores that make up one logical index a collection. A collection makes up the indexed and returnable data of a Solr search repository.

Usually,

collection1 = core1

Collections can be divided into slices.

Each slice can exist in multiple copies; these copies of the same slice are called shards. One of the shards within a slice is the leader, designated by a leader-election process. Each shard is a physical index, so one shard corresponds to one core. Newer Solr documentation tends to call a slice a shard and each copy a replica, which is also how the word shard is used in the rest of these notes.
It is important to understand the distinction between a core and a collection. In classic single-node Solr, a core is basically equivalent to a collection in that it presents one logical index. In SolrCloud, the cores on multiple nodes form a collection. This is still just one logical index, but multiple cores host different shards of the full collection. So a core encapsulates a single physical index on an instance. A collection is a combination of all of the cores that together provide a logical index that is distributed across many nodes.

When should you shard?

If searches are taking too long or the index is approaching the physical limitations of its machine, you should consider distributing the index across two or more Solr servers.
Each shard then runs on a separate machine. Solr then partitions searches into sub-searches, which run on the individual shards, reporting results collectively.
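
Here is a toy sketch of that "sub-searches, then collective results" idea. The shard contents, the scores, and the function names are all invented for illustration; real Solr does a lot more than this.

```python
import heapq

# Two toy shards; each maps a document id to a made-up relevance score.
SHARDS = [
    {"doc1": 0.9, "doc4": 0.3},
    {"doc2": 0.7, "doc3": 0.5},
]


def search_shard(shard, rows):
    """Sub-search: return the top `rows` (score, doc_id) pairs of one shard."""
    return heapq.nlargest(rows, ((score, doc_id) for doc_id, score in shard.items()))


def distributed_search(shards, rows):
    """Run the sub-search on every shard, then merge into one ranked list."""
    partial_results = [hit for shard in shards for hit in search_shard(shard, rows)]
    return [doc_id for score, doc_id in heapq.nlargest(rows, partial_results)]


top_three = distributed_search(SHARDS, rows=3)
print(top_three)
```

Each shard only has to return its own best hits, so the merge step works on a small list no matter how big the shards are.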

collection1 = core1 (shard 1), core2 (shard 2)

A hash of the document's unique ID is usually used to determine which core (shard) an incoming document goes to.
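
A minimal sketch of hash-based placement, assuming a plain "hash modulo number of shards" rule. This is not Solr's actual router: as far as I understand, SolrCloud hashes the document's unique key and gives each shard a range of hash values. The crc32 hash, pick_shard, and the document ids below are stand-ins I picked for the example.

```python
import zlib


def pick_shard(doc_id, num_shards):
    """Map a document id to a shard number in the range 0..num_shards-1."""
    # crc32 is a stand-in hash for this sketch, not the hash Solr itself uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


doc_ids = ["book-1", "book-2", "book-3", "book-4"]
placement = {doc_id: pick_shard(doc_id, num_shards=2) for doc_id in doc_ids}
print(placement)
```

The point is that the same id always lands on the same shard, so an update to a document reaches the core that already holds it.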

When should you also replicate the shards?

You have a large search volume which one machine cannot handle, so you need to distribute searches across multiple read-only copies of the index.
There is a high volume/high rate of indexing which consumes machine resources and reduces search performance on the indexing machine, so you need to separate indexing and searching.
You want to make a backup of the index.
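
For the first reason, spreading searches over read-only copies, a round-robin rotation is about the simplest thing that could work. The machine names and make_balancer are hypothetical; this is just a sketch of the idea, not how Solr picks a replica.

```python
import itertools

# Hypothetical copies of the same shard, living on different machines.
REPLICAS = ["machine-a", "machine-b", "machine-c"]


def make_balancer(replicas):
    """Return a function that hands out replicas in round-robin order."""
    rotation = itertools.cycle(replicas)
    return lambda: next(rotation)


next_replica = make_balancer(REPLICAS)
served_by = [next_replica() for _ in range(5)]
print(served_by)
```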

Now,

collection1 =
machine1 = shard 1 and a copy of shard 2
machine2 = shard 2 and a copy of shard 1
(each copy is still its own core)

Leader = a node that can accept writes without consulting another node.
In SolrCloud, every shard has its own leader (which challenges optimistic locking and consistency).

More information:
http://wiki.apache.org/solr/SolrCloud#Glossary
http://www.youtube.com/watch?v=eVK0wLkLw9w&index=7&list=PLsj1Ri57ZE94lISrJuy7W8COc2RNFC1Fl
http://lucene.472066.n3.nabble.com/solr-cloud-concepts-td3726292.html
http://lucene.472066.n3.nabble.com/Terminology-question-Core-vs-Collection-vs-td4030232.html

solr.xml and solrconfig.xml

The crucial parts of the Solr home directory are shown here:

solr-home-directory/
  solr.xml
  conf/
    solrconfig.xml
    schema.xml
  data/

You supply solr.xml, solrconfig.xml, and schema.xml to tell Solr how to behave.
By default, Solr stores its index inside the data folder.

solr.xml specifies configuration options for your Solr cores, and also allows you to configure multiple cores (in newer versions, cores are discovered automatically). The port configuration for the Jetty server running within Solr is defined here as well.

solrconfig.xml controls high-level behavior. You can, for example, specify an alternate location for the data directory.
schema.xml describes the documents you will ask Solr to index. Inside schema.xml, you define a document as a collection of fields. You get to define both the field types and the fields themselves.

Field Types, Fields, and Copy Fields

The big picture of schema.xml really breaks down to three major areas.

First you define your field types to dictate which Solr java class the field type utilizes and how fields of this type will be analysed. This is where most of the "magic" is outlined in your schema.

Then you define your actual fields. A field has a name attribute which is what you will use when importing and querying your data, and it points back to a designated field type. This is where you tell Solr what data is indexed, and what data is stored.

Optionally, you may choose to have some copy fields, which intercept incoming data going to one field and fork a clone of it off to another field that is free to be a different field type.
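
The copy field idea is easy to mimic in a few lines of Python. The field names, the rules in COPY_FIELDS, and apply_copy_fields are all hypothetical; Solr does this for you at index time based on the copyField entries in schema.xml.

```python
# Hypothetical copy rules: (source field, destination field).
COPY_FIELDS = [("title", "text"), ("author", "text")]


def apply_copy_fields(document, copy_fields):
    """Return a new document with source values cloned into destinations."""
    result = {name: list(values) for name, values in document.items()}
    for source, dest in copy_fields:
        if source in document:
            # The source field keeps its values; the destination gets a clone.
            result.setdefault(dest, []).extend(document[source])
    return result


incoming = {"title": ["A Book Title"], "author": ["First Author", "Second Author"]}
indexed_doc = apply_copy_fields(incoming, COPY_FIELDS)
print(indexed_doc)
```

Note that the destination ends up multi-valued here, which is why a catch-all destination field usually has to allow multiple values.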

Solr takes a dual path with data, keeping what is indexed completely separate from what is stored.
There are four fundamental choices to consider for each field:

indexed="true" stored="true"
Use this for information you want to search on and also display in search results - for example, book title or author.

indexed="false" stored="true"
Use this for fields that you want displayed with search results but that don't need to be searchable - for example, destination URL, file system path, time stamp, or icon image.

indexed="true" stored="false"
Use this for fields you want to search on but don't need to get their values in search results. Here are some of the common reasons you would want this:

Large fields and a database: Storing a field makes your index larger, so set stored to false when possible, especially for big fields.
Ordering results: Say you define <field name="bookName" type="text" indexed="true" stored="true"/>, which is tokenized and used for searching. If you want to sort results based on book name, you could copy the field into a separate nonretrievable, nontokenized field that is used just for sorting: <field name="bookSort" type="string" indexed="true" stored="false"/> together with <copyField source="bookName" dest="bookSort"/>.
Easier searching: If you define a catch-all field that is indexed but not stored, you can copy all of the other text fields into it. Since Solr looks in a default field when given a text query without field names, you can support this type of general phrase query by making the catch-all the default field.

indexed="false" stored="false"
Use this when you want to ignore fields.
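
Here is a tiny in-memory model of the four combinations, just to show the "dual path" in action. The field names, FIELDS table, and both helper functions are my own inventions; a real Solr index is nothing like two Python dictionaries, but the split between what is searchable and what is returned works the same way.

```python
from collections import defaultdict

# Hypothetical field definitions: name -> (indexed, stored).
FIELDS = {
    "title": (True, True),     # searchable and displayed
    "url": (False, True),      # displayed only
    "body": (True, False),     # searchable only
    "debug": (False, False),   # ignored
}

inverted_index = defaultdict(set)   # (field, term) -> ids of matching documents
stored_values = {}                  # doc id -> fields returned with results


def add_document(doc_id, document):
    """Send each field down the indexed path, the stored path, both, or neither."""
    stored_values[doc_id] = {}
    for field, value in document.items():
        indexed, stored = FIELDS[field]
        if indexed:
            for term in value.lower().split():
                inverted_index[(field, term)].add(doc_id)
        if stored:
            stored_values[doc_id][field] = value


def search(field, term):
    """Return the stored fields of every document matching field:term."""
    return [stored_values[doc_id] for doc_id in sorted(inverted_index.get((field, term), ()))]


add_document("1", {"title": "Solr notes", "url": "http://example.com/1",
                   "body": "sharding and replication", "debug": "x"})
body_hits = search("body", "sharding")
url_hits = search("url", "http://example.com/1")
print(body_hits)
print(url_hits)
```

Searching on body finds the document but the body text itself never comes back, while searching on url finds nothing even though the URL is returned with every hit.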

Other fields that are used sparingly are detailed here.

Analysis

When a document is added/updated, its fields are analyzed and tokenized, and those tokens are stored in Solr’s index. When a query is sent, the query is again analyzed and tokenized, and then matched against the tokens in the index. This critical function of tokenization is handled by Tokenizer components.

In addition to tokenizers, there are TokenFilter components, whose job is to modify the token stream.

There are also CharFilter components, whose job is to modify individual characters. For example, HTML text can be filtered to modify HTML entities like &amp; to a regular &.
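
The three kinds of components chain together in a fixed order: character filters first, then the tokenizer, then token filters. A rough Python imitation is below. The function names and the regular expression are my own simplification; a real Solr tokenizer would treat punctuation differently.

```python
import html
import re


def char_filter(text):
    """CharFilter step: rewrite characters, here decoding HTML entities."""
    return html.unescape(text)


def tokenizer(text):
    """Tokenizer step: split the character stream into tokens."""
    return re.findall(r"\w+|&", text)


def lowercase_filter(tokens):
    """TokenFilter step: modify the token stream, here by lowercasing."""
    return [token.lower() for token in tokens]


def analyze(text):
    return lowercase_filter(tokenizer(char_filter(text)))


tokens = analyze("Tom &amp; Jerry")
print(tokens)
```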

Analyzer

An analyzer examines the text of fields and generates a token stream.

Tokenizer

An analyzer is aware of the field it is configured for, but a tokenizer is not. Tokenizers read from a character stream (a Reader) and produce a sequence of Token objects (a TokenStream).

In the default definition of text_general, the tokenizer used for both indexing and querying is solr.StandardTokenizerFactory.
The class named in the tokenizer element is not the actual tokenizer, but rather a class that implements the org.apache.solr.analysis.TokenizerFactory interface.
This factory class will be called upon to create new tokenizer instances as needed. Objects created by the factory must derive from org.apache.lucene.analysis.TokenStream, which indicates that they produce sequences of tokens.
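
The factory-versus-tokenizer split is easier to see in code. This is a Python imitation of the pattern only: both classes are stand-ins I wrote for this note, and the real Solr and Lucene classes are Java with different signatures.

```python
import io


class WhitespaceTokenizer:
    """A tokenizer instance that reads one character stream."""

    def __init__(self, reader):
        self.reader = reader

    def __iter__(self):
        # Produce the token stream lazily, one token at a time.
        yield from self.reader.read().split()


class WhitespaceTokenizerFactory:
    """The class named in the configuration; it hands out fresh tokenizers."""

    def create(self, reader):
        return WhitespaceTokenizer(reader)


factory = WhitespaceTokenizerFactory()
first = list(factory.create(io.StringIO("hello solr")))
second = list(factory.create(io.StringIO("another field value")))
print(first, second)
```

One factory is configured once, and every incoming field value gets its own tokenizer instance from it.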

Tokenizer types: Tokenizers

Filters

Like tokenizers, filters consume input and produce a stream of tokens. Filters also derive from org.apache.lucene.analysis.TokenStream. Unlike tokenizers, a filter's input is another TokenStream.

Filter types: Filters

It’s important to use the same or similar analyzers that process text in a compatible manner at index and query time. For example, if an indexing analyzer lowercases words, then the query analyzer should do the same to enable finding the indexed words. The default definition of text_general uses an extra filter for querying, SynonymFilterFactory.
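
A small sketch of why the two sides have to agree. Both analyzers lowercase, and only the query side expands synonyms, loosely mirroring the text_general setup. The synonym table and all function names are made up for the example.

```python
# Hypothetical synonym table, applied only at query time in this sketch.
SYNONYMS = {"tv": ["television"]}


def index_analyzer(text):
    """Index-time analysis: split on whitespace and lowercase."""
    return [token.lower() for token in text.split()]


def query_analyzer(text):
    """Query-time analysis: same steps, plus synonym expansion."""
    tokens = []
    for token in text.split():
        token = token.lower()
        tokens.append(token)
        tokens.extend(SYNONYMS.get(token, []))
    return tokens


def matches(document_text, query_text):
    """True if any analyzed query token equals an indexed token."""
    indexed_tokens = set(index_analyzer(document_text))
    return any(token in indexed_tokens for token in query_analyzer(query_text))


print(matches("Smart Television", "TV"))
print(matches("Smart Television", "radio"))
```

If the query analyzer skipped the lowercasing step, the query "TV" would expand to nothing and never match the lowercased token in the index.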

Useful links:
http://heliosearch.org/solr-4-8-features/
