Search your stuff with Solr
This five-minute quick search was done on a comma-separated data file, to help me sift through the data in seconds. This article covers a very basic setup, loading, and querying of data in Solr.
Why Solr?
Solr is a text-based search engine that you can use to index anything. It is built on the Apache Lucene project and is open source. The simple plug-and-play package runs on any operating system, which takes away the hassle of a formal installation process.
Terms you MUST know
There are a lot of buzzwords in the Solr world. Below are the only ones we care about here.
- Core — A core is the object that holds your data in Solr. Each core is its own collection of data.
- Shards — The number of pieces your data is split into so it can be stored across different nodes of a Solr cluster. This example is based on a single-node cluster.
- Replication Factor — The number of copies of the data kept across the cluster.
- Schema — Each field in the data has a schema entry containing the field's name, its type, and whether it is stored or indexed. The schema has other attributes too, but we don't care about those for this simple PoC.
Setting up Solr
The entire installation process was just three steps.
1. Download the latest version of Solr. You can find the different Solr versions on the Apache Solr downloads page.
2. Unzip it to a suitable location on your machine. Open the directory and view the contents.
3. Start the Solr engine, as shown below.
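For reference, starting Solr is a single command, run from the root of the extracted directory:

# Run from the root of the extracted Solr directory.
# Solr starts in standalone mode and listens on port 8983 by default.
bin/solr start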
You can access the Solr UI in any web browser on your machine by going to:
http://localhost:8983/
Searching in Solr
Step 1: Get yourself some data!
Obviously, without data this entire process is a moot point. I decided to get some data from Kaggle and picked the Philadelphia crime dataset because of its size.
Step 2: Creating the core
The first step is to create the core that is going to house the data we will load.
We use the create command and pass the name of the core, the number of shards, and the replication factor for the data we are going to put in the core.
bin/solr create -c testcore -shards 1 -replicationFactor 3
Solr prints a confirmation in the console once the core is created.
You can also view the new core in the Solr UI.
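If you prefer the command line, the CoreAdmin API can confirm the core exists; this is just a quick status check against the core we created above:

# Query the CoreAdmin API for the status of testcore.
curl "http://localhost:8983/solr/admin/cores?action=STATUS&core=testcore"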
Step 3: Create a schema
Once the core is created, we want to add the schema of the data to be ingested so that Solr knows what to expect. In my case, I am ingesting a CSV file with a known schema.
Note: A schema does not have to be added to the core explicitly. We can also set up dynamic fields based on suffixes of column names, which can help speed up the process.
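As a sketch of that dynamic-field approach, the Schema API also accepts an add-dynamic-field command; the *_t suffix pattern below is an assumed example that maps any column name ending in _t to text_general:

# Hypothetical dynamic field: any field name ending in _t becomes text_general.
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-dynamic-field":{
    "name":"*_t",
    "type":"text_general",
    "stored":true,
    "indexed":true }
}' http://localhost:8983/solr/testcore/schema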
For ease, I make all the columns text type. The only thing that needs attention is categorizing which columns need to be stored in the core and which need to be indexed by it.
Typically you set "indexed=true" only if it's a field you will be querying on. You set "stored=true" if you want the data element stored in the core. This lets you choose whether or not to store a given column from the ingested data file.
This snippet can be executed from the command line once for each column in the dataset. It is a simple HTTP POST to the Solr Schema API.
Command to add a field that is stored and indexed:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":{
    "name":"Text_General_Code",
    "type":"text_general",
    "stored":true,
    "indexed":true }
}' http://localhost:8983/solr/testcore/schema
Command to add a field that is only stored but not indexed:
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field":{
    "name":"Dc_Key",
    "type":"text_general",
    "stored":true,
    "indexed":false }
}' http://localhost:8983/solr/testcore/schema
Repeat the same steps to add the entire schema.
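Rather than typing one curl command per column, a small shell loop can post them all; the field names below are an assumed subset of the crime dataset's columns, so adjust the list to match your file:

# Assumed subset of columns from crime.csv; edit the list to match your data.
for field in Dc_Dist Dispatch_Date Dispatch_Time Location_Block UCR_General; do
  curl -X POST -H 'Content-type:application/json' --data-binary '{
    "add-field":{
      "name":"'"$field"'",
      "type":"text_general",
      "stored":true,
      "indexed":true }
  }' http://localhost:8983/solr/testcore/schema
done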
After executing each curl command, Solr returns a response object, which means your schema change was accepted.
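A successful call typically comes back with a status of 0 in the response header, along the lines of:

{
  "responseHeader":{
    "status":0,
    "QTime":85}}

(The QTime value here is just illustrative; it is the processing time in milliseconds.)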
Step 4: Add data to the core
The Solr package provides a jar that you can execute to add data to the core. This post.jar can be found under the example/exampledocs folder of the Solr installation. Open the directory that contains post.jar. The jar basically performs an HTTP POST to send the data from your local machine to the Solr URL.
The command to add data to the core using the post.jar is:
java -Dtype=text/csv -Dc=testcore -jar post.jar /Users/smurug29/Downloads/crime.csv
In the above command we pass the type of file to be ingested, the core we want to ingest the data into, and the location of the data file. The command might take a couple of minutes to complete, depending on the size of the file.
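If you would rather skip the jar entirely, the same CSV can (as a rough equivalent) be posted straight to the core's update handler with curl; the commit=true parameter makes the documents searchable immediately:

# Post the CSV directly to the update handler of testcore.
curl "http://localhost:8983/solr/testcore/update?commit=true" -H 'Content-type:application/csv' --data-binary @/Users/smurug29/Downloads/crime.csv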
Step 5: Query your data
The Solr web UI has query functionality that can be used for ad hoc querying. Solr also provides APIs for different programming languages that can be embedded into enterprise applications for more robust querying.
In this article, we will look at the UI based querying approach.
- Open the Solr UI.
- Select the core you want to query on the left pane and click the Query option underneath.
- Enter the filtering conditions in the fq box, click "Execute Query", and voilà!
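For the API route mentioned above, the equivalent query over HTTP looks like this; Text_General_Code is the field we indexed earlier, and the value Thefts is an assumed example from the dataset:

# q=*:* matches all documents; fq filters them down to the condition.
curl "http://localhost:8983/solr/testcore/select?q=*:*&fq=Text_General_Code:Thefts&rows=10"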
Happy Searching!!