Search made easy for (web) developers

Alexander Reelsen
alexander@reelsen.net
@spinscale

Agenda

  • What is so important about search?
  • Scalability, Sharding & Replication
  • Configuration, Mapping & Analyzers
  • Querying, Faceting, Percolation
  • Modules, Plugins, Rivers & Tools
  • Production setup & living in the trenches

About me - Alexander Reelsen

  • Studied information systems
  • 10 years of Linux system engineering, converted to software engineering
  • Web framework enthusiast, fed up with complex Java environments for simple web apps
  • Other interests: Scaling web architectures, Web 2.0 (NoSQL, search)
  • Author of the Play Framework Cookbook
  • Working at Lusini GmbH, building a B2B e-commerce platform
  • Streetball/Basketball

What is so important about search?

Search is more than text search

Search must search for ids

Search must search for colors

Search must search for brands

Search must advise

Search must be intelligent

Search must aggregate

Why your own search engine?

  • Because you can - telling your CTO this doesn't work.
  • Your data, your search - no one spying...
  • Customize your search - Rank your own style
  • Customize your data - Extend your search?!
  • Best support and in-sourced know-how - Lower TCO
  • No blackbox - Lower TCO

elasticsearch in ten seconds

  • Java, based on Apache Lucene
  • Scales out, replicates, shards, fail-over
  • Schema-free
  • Document-based
  • Every interaction can be done via HTTP & JSON
  • References: Mozilla, StumbleUpon, Sony, Infochimps, Assistly, Klout

Standing on the shoulders of giants

  • Lucene, JBoss Netty, Jackson, log4j
  • Google Guice, Google Guava, MVEL, Groovy
  • Jodatime, JLine, snakeyaml
  • hamcrest, testng
  • sigar via JNA

Elasticsearch architecture

Single node setup

Replication

Sharding

Replication & sharding
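
The sharding and replication slides can be sketched in a few lines of plain Java. This is only an illustration of the idea: Elasticsearch uses its own hash function and allocation logic internally, and the class name, `shardFor`, `nodesFor` and the round-robin placement rule are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not the hash function or allocator Elasticsearch really uses.
final class ShardRoutingSketch {

    // Map a document id to a primary shard: hash the id, then take it modulo the shard count.
    static int shardFor(String documentId, int numberOfShards) {
        return Math.floorMod(documentId.hashCode(), numberOfShards);
    }

    // Hypothetical placement rule: copy k of shard s lives on node (s + k) % nodeCount,
    // so a primary (copy 0) and its replicas never share a node.
    static List<Integer> nodesFor(int shard, int replicas, int nodeCount) {
        List<Integer> nodes = new ArrayList<>();
        for (int copy = 0; copy <= replicas && copy < nodeCount; copy++) {
            nodes.add((shard + copy) % nodeCount);
        }
        return nodes;
    }

    public static void main(String[] args) {
        int shards = 3;
        for (String id : new String[] {"1", "2", "42"}) {
            int shard = shardFor(id, shards);
            System.out.println("doc " + id + " -> shard " + shard
                + " on nodes " + nodesFor(shard, 1, 3));
        }
    }
}
```

Because the shard is derived from the id and the shard count, changing the shard count later would send existing ids to different shards. That is why the number of shards is chosen when the index is created, while the number of replicas can be changed afterwards.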

Installation - takes two minutes

Configuration

  • config/elasticsearch.yml or config/elasticsearch.json
  • Application-wide settings (zen discovery, available analyzers)
  • index default configurations (number of shards)
  • Separate logging file: config/logging.yml (simplified log4j)

Configuration


discovery:
  zen:
    multicast.enabled: false

http:
  max_content_length: 100000

index:
  number_of_shards: 1

  analysis:
    analyzer:
      default:
        type: standard

      lowercase_analyzer:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase]

Data representation in JSON


{
	"id": "1",
	"name" : "MacBook Air",
	"price": 1099,
	"descr" : "Some lengthy never-read description", 
	"attributes" : {
		"color" : "silver",
		"display" : 13.3,
		"ram" : 4
	}
}

Index your product

curl -X PUT localhost:9200/products/product/1 -d '{
	"id": "1",
	"name" : "MacBook Air",
	"price": 1099,
	"descr" : "Some lengthy never-read description", 
	"attributes" : {
		"color" : "silver",
		"display" : 13.3,
		"ram" : 4
	}
}'
http://localhost:9200/products/product/1

JSON as query language

http://host:9200/products/product/_search
{ "query" : { "term" : { "name": "MacBook Air" }}}
{ "query" : { "prefix" : { "name": "Mac" }}}
{ "query" : { "range" : { "price" : { "from" : 1000, "to": 2000 } } } }
{ "from": 0, "size": 10, "query" : { "term" : { "name": "MacBook Air" }}}
{ "sort" : { "name" :  { "order": "asc" } }, "query" : { "term" : { "name": "MacBook Air" }}}
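
What the queries above select can be modelled in memory. The following is a rough sketch, not the Elasticsearch API: the `Product` class, the helper names and the sample data are made up for illustration, and the `name` field is treated as not analyzed, so `term` is a plain exact match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Illustrative in-memory model; not the Elasticsearch API.
final class QuerySketch {

    static final class Product {
        final String name;
        final int price;

        Product(String name, int price) {
            this.name = name;
            this.price = price;
        }
    }

    // term: exact match against the stored value (field treated as not analyzed here)
    static Predicate<Product> term(String value) {
        return p -> p.name.equals(value);
    }

    // prefix: the stored value starts with the given characters
    static Predicate<Product> prefix(String value) {
        return p -> p.name.startsWith(value);
    }

    // range: both bounds are inclusive in this sketch
    static Predicate<Product> range(int from, int to) {
        return p -> p.price >= from && p.price <= to;
    }

    // from/size: skip the first `from` hits, then return at most `size` hits
    static List<Product> search(List<Product> docs, Predicate<Product> query, int from, int size) {
        List<Product> hits = new ArrayList<>();
        for (Product p : docs) {
            if (query.test(p)) {
                hits.add(p);
            }
        }
        int start = Math.min(from, hits.size());
        int end = Math.min(start + size, hits.size());
        return new ArrayList<>(hits.subList(start, end));
    }

    // Hypothetical sample data
    static List<Product> sampleProducts() {
        return Arrays.asList(
            new Product("MacBook Air", 1099),
            new Product("MacBook Pro", 1999),
            new Product("Mac mini", 599));
    }

    public static void main(String[] args) {
        List<Product> docs = sampleProducts();
        System.out.println(search(docs, term("MacBook Air"), 0, 10).size());
        System.out.println(search(docs, prefix("Mac"), 0, 10).size());
        System.out.println(search(docs, range(1000, 2000), 0, 10).size());
    }
}
```

On an analyzed field the picture differs: the index holds the tokens produced by the analyzer, and a term query is compared against those tokens rather than against the original string.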

JSON as query language

http://host:9200/products/product/_search

{ "took":206,"timed_out":false,
"_shards":{"total":1,"successful":1,"failed":0},
"hits":{ "total":1,"max_score":2.098612,
  "hits":[ {
    "_index":"products1","_type":"product","_id":"1",
    "_score":2.098612, "_source" : {
      "id": "1",
      "name" : "MacBook Air",
      "price": 1099,
      "descr" : "Some lengthy never-read description", 
      "attributes" : {
        "color" : "silver",
        "display" : 13.3,
        "ram" : 4
      }
}}]}}

Configuration - Mapping

  • On indexing, the JSON document is parsed and the data type of each field is detected
  • Mapping fields to datatypes is done automatically on first indexing
  • Can be configured on a per-type basis
  • Strings can have their own analyzer
  • Sample types: float, long, boolean, date (+formatting), object
  • One field can be indexed as several sub-fields, each analyzed differently (lowercase, query)
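
Automatic mapping on first indexing can be pictured roughly like this. It is a minimal sketch under assumptions: the type names returned by `inferType`, the simple `yyyy-MM-dd` date check and the class itself are illustrative and are not Elasticsearch's actual detection code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a toy version of dynamic type detection.
final class DynamicMappingSketch {

    // Guess a field type from the parsed JSON value.
    static String inferType(Object value) {
        if (value instanceof Boolean) return "boolean";
        if (value instanceof Integer || value instanceof Long) return "long";
        if (value instanceof Float || value instanceof Double) return "double";
        if (value instanceof Map) return "object";
        if (value instanceof String) {
            // Assumption: only plain yyyy-MM-dd strings are treated as dates here
            return ((String) value).matches("\\d{4}-\\d{2}-\\d{2}") ? "date" : "string";
        }
        throw new IllegalArgumentException("unsupported value: " + value);
    }

    // The first document that contains a field decides its type;
    // later documents do not change an existing entry.
    static void applyDocument(Map<String, String> mapping, Map<String, Object> document) {
        for (Map.Entry<String, Object> field : document.entrySet()) {
            mapping.putIfAbsent(field.getKey(), inferType(field.getValue()));
        }
    }

    public static void main(String[] args) {
        Map<String, Object> product = new LinkedHashMap<>();
        product.put("name", "MacBook Air");
        product.put("price", 1099);
        product.put("display", 13.3);

        Map<String, String> mapping = new LinkedHashMap<>();
        applyDocument(mapping, product);
        System.out.println(mapping); // {name=string, price=long, display=double}
    }
}
```

The practical consequence: if the first document happens to carry an unrepresentative value for a field, the detected type sticks. Defining the mapping explicitly up front avoids that surprise.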

Sample mapping


{
    "product": {
        "properties": {
            "ProductId":            { "type": "string", "index": "not_analyzed" },

            "ProductEnabled":       { "type": "boolean" },
            "PiecesIncluded":       { "type": "long" },
            "LastModified":         { "type": "date", "format": "yyyy-MM-dd HH:mm:ss.SSS" },

            "AvailableInventory":   { "type": "float" },
            "Price":                { "type": "float" },

            "LongDescription":      { "type": "string", "include_in_all" : true },
            
            "ProductName" : {
                "type" : "multi_field",
                "include_in_all" : true,
                "fields" : {
                    "ProductName":  { "type": "string", "index": "not_analyzed" },
                    "lowercase":    { "type": "string", "analyzer": "lowercase_analyzer" },
                    "suggest" :     { "type": "string", "analyzer": "suggest_analyzer" }
                }
            }
        }
    }
}    
    

Configuration - Analyzers

  • An analyzer consists of a tokenizer and an arbitrary number of filters
  • Example:
  • suggest_analyzer:
      type: custom
      tokenizer: whitespace
      filter: [standard, lowercase, shingle]
  • Stripping HTML code:
    char_filter: html_strip
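
To see what a chain like suggest_analyzer produces, here is a small stand-alone sketch. It mimics a whitespace tokenizer followed by lowercasing and two-word shingles. The shingle size of 2 and the fact that single words are kept alongside the shingles are assumptions of this sketch, and the `standard` filter is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: imitates whitespace tokenizer -> lowercase -> shingle (size 2).
final class SuggestAnalyzerSketch {

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return terms;
        }
        // Tokenizer and lowercase filter: split on whitespace, lowercase everything
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        // Shingle filter: emit each word plus the pair it forms with the following word
        for (int i = 0; i < tokens.length; i++) {
            terms.add(tokens[i]);
            if (i + 1 < tokens.length) {
                terms.add(tokens[i] + " " + tokens[i + 1]);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("MacBook Air 13"));
        // [macbook, macbook air, air, air 13, 13]
    }
}
```

Since "macbook air" ends up in the index as a single term, a prefix query such as "macbook a" can match it, which is what makes this kind of analyzer useful for suggestions.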

Java API - Creating a client

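import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Connect to a running cluster as a transport client
Settings settings = ImmutableSettings.settingsBuilder()
    .put("cluster.name", clusterName).build();

InetSocketTransportAddress addr =
    new InetSocketTransportAddress(host, port);

Client client = new TransportClient(settings)
    .addTransportAddress(addr);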

Starting an embedded server

// Read the node settings from a local YAML file
File configFile = new File("elasticsearch-local.yml");
String config = FileUtils.readFileToString(configFile);

ImmutableSettings.Builder settingsBuilder = ImmutableSettings.settingsBuilder()
    .loadFromSource(config);

// Start a node inside this JVM and get a client bound to it
Node node = NodeBuilder.nodeBuilder()
    .settings(settingsBuilder).node();

Client client = node.client();

Executing a query

CountRequestBuilder countRequestBuilder =
    new CountRequestBuilder(client)
        .setQuery(QueryBuilders.termQuery("foo", "bar"))
        .setIndices("products")
        .setTypes("product");
        
CountResponse response = 
    countRequestBuilder.execute().actionGet();
long count = response.count();

Search API overview

  • Index, Delete, Delete-By-Query, Get, Multiget, Bulk
  • Search/Count queries (term query, prefix query, id, fuzzy…)
  • Geo-based queries, TTL
  • More like this, Highlighting
  • Faceting, Percolation, Scripting

Search - Faceting

  • Faceting adds aggregated information to a standard search query
  • Term: Group results by a term
  • Range: Group by price or date ranges
  • Histogram: Group results in equally sized buckets, also as date histogram
  • Statistical: Include statistical data like min, max, sum, avg & some more
  • Geo distance: Group results around a coordinate
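
What a facet computes can be shown without a search engine at all. The sketch below reproduces terms, histogram and statistical facets over plain arrays. The class and method names are made up for illustration, and results are sorted by key here, which is not necessarily how Elasticsearch orders them.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: the aggregation behind three facet types, on plain arrays.
final class FacetSketch {

    // Terms facet: count how often each term occurs in the matching documents.
    static Map<String, Integer> termsFacet(String[] values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    // Histogram facet: equally sized buckets, keyed by the lower bound of each bucket.
    static Map<Long, Integer> histogramFacet(double[] values, long interval) {
        Map<Long, Integer> buckets = new TreeMap<>();
        for (double value : values) {
            long key = (long) Math.floor(value / interval) * interval;
            buckets.merge(key, 1, Integer::sum);
        }
        return buckets;
    }

    // Statistical facet: returns {min, max, sum, avg}; expects at least one value.
    static double[] statisticalFacet(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        for (double value : values) {
            min = Math.min(min, value);
            max = Math.max(max, value);
            sum += value;
        }
        return new double[] {min, max, sum, sum / values.length};
    }

    public static void main(String[] args) {
        System.out.println(termsFacet(new String[] {"silver", "black", "silver"}));
        System.out.println(histogramFacet(new double[] {599, 1099, 1999, 1001}, 500));
    }
}
```

In a shop frontend the terms counts typically become the numbers behind each filter link, and the histogram or range buckets feed a price filter.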

Facet query


SearchRequestBuilder searchRequestBuilder = new SearchRequestBuilder(client)
    .setIndices("products")
    .setTypes("product");

searchRequestBuilder.setQuery(QueryBuilders.prefixQuery("ProductName.suggest", "macbook"));

searchRequestBuilder.addFacet(FacetBuilders.termsFacet("categoryFacet").field("CategoryId"));

SearchResponse searchResponse = searchRequestBuilder.execute().actionGet();

TermsFacet facet = searchResponse.getFacets().facet(TermsFacet.class, "categoryFacet");
List<? extends TermsFacet.Entry> entries = facet.entries();
String term = entries.get(0).term();
int count = entries.get(0).count();
        

Search - Scripting

This is where your own integration beats all others

  • Score down all your products without an image
  • Don't include them in your results
  • Score up products by an attribute like its product quality or stock
  • Apply math operations on fields to change score
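
The kind of logic such a scoring script carries might look as follows. This is plain Java rather than a real Elasticsearch script, and the penalty and boost factors as well as the `hasImage` and `stock` inputs are hypothetical values picked for the example.

```java
// Illustrative only: the arithmetic a custom scoring script could apply.
final class ScoreScriptSketch {

    static final double NO_IMAGE_PENALTY = 0.5; // hypothetical factor
    static final double IN_STOCK_BOOST = 1.5;   // hypothetical factor

    // Take the relevance score from the text query and adjust it with business rules.
    static double adjustScore(double score, boolean hasImage, int stock) {
        double adjusted = score;
        if (!hasImage) {
            adjusted *= NO_IMAGE_PENALTY; // score down products without an image
        }
        if (stock > 0) {
            adjusted *= IN_STOCK_BOOST;   // score up products that are in stock
        }
        return adjusted;
    }

    public static void main(String[] args) {
        System.out.println(adjustScore(2.0, true, 10));
        System.out.println(adjustScore(2.0, false, 0));
    }
}
```

Multiplying keeps the text relevance in play: a poor text match with an image still ranks below a strong match without one, unless the factors are made extreme.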

Search API - Percolation

Implement a price agent for free!

curl -X PUT localhost:9200/_percolator/products/pricecheck -d '{
"query" : {
  "bool" : {
    "must" : [
      { "term" : { "name" : "MacBook Air" } },
      { "range" : { "price" : { "from" : 200, "to" : 999 } } }
    ]
  }
}
}'
{"ok":true,"_index":"_percolator","_type":"products","_id":"pricecheck","_version":1}

curl -X PUT 'localhost:9200/products/product/1?percolate=*' -d '{ "price": 1000, "name" : "MacBook Air" }'
{"ok":true,"_index":"products","_type":"product","_id":"1","_version":1,"matches":[ ]}

curl -X PUT 'localhost:9200/products/product/2?percolate=*' -d '{ "price": 999, "name" : "MacBook Air" }'
{"ok":true,"_index":"products","_type":"product","_id":"2","_version":1,"matches":["pricecheck"]}
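
Percolation turns search around: queries are stored first, and each incoming document is checked against them. A minimal in-memory sketch of that idea, mirroring the pricecheck example above, could look like this. The class, its methods and the use of `Predicate` as a stand-in for a stored query are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Illustrative only: stored queries are modelled as named predicates.
final class PercolatorSketch {

    private final Map<String, Predicate<Map<String, Object>>> queries = new LinkedHashMap<>();

    // Store a query under a name, like PUT /_percolator/products/<name>
    void register(String name, Predicate<Map<String, Object>> query) {
        queries.put(name, query);
    }

    // Return the names of all stored queries that match the document.
    List<String> percolate(Map<String, Object> document) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, Object>>> entry : queries.entrySet()) {
            if (entry.getValue().test(document)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    // The pricecheck query: name is "MacBook Air" and 200 <= price <= 999
    static Predicate<Map<String, Object>> priceCheck() {
        return doc -> "MacBook Air".equals(doc.get("name"))
            && doc.get("price") instanceof Number
            && ((Number) doc.get("price")).intValue() >= 200
            && ((Number) doc.get("price")).intValue() <= 999;
    }

    // Hypothetical helper to build a product document
    static Map<String, Object> product(String name, int price) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", name);
        doc.put("price", price);
        return doc;
    }

    public static void main(String[] args) {
        PercolatorSketch percolator = new PercolatorSketch();
        percolator.register("pricecheck", priceCheck());
        System.out.println(percolator.percolate(product("MacBook Air", 1000))); // []
        System.out.println(percolator.percolate(product("MacBook Air", 999)));  // [pricecheck]
    }
}
```

For a price agent, each customer alert would be one stored query, and the list of matches returned on indexing tells the application which customers to notify.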

Indices API

  • Aliases, Analyze
  • Create, Delete, Exists, Open, Close, Optimize, Refresh, Flush, Settings
  • Get, Put, Delete Mapping
  • Get, update settings
  • Snapshot
  • Index templates (mappings + settings)
  • Stats, Status
  • Segments, Clear cache

Cluster API

  • Health, State, Settings
  • Nodes Info, Nodes Stats, Nodes Shutdown

Modules

  • REST, Thrift, Memcached, ZeroMQ
  • JMX
  • Scripting (MVEL, javascript, groovy, python, native)
  • Discovery: EC2, Zen
  • Cluster, Indices, Network, Transport

Plugins

  • Analysis: Smart Chinese, ICU, IK, Mmseg, Hunspell
  • Transport: Memcached, Thrift, ZeroMQ, Servlet
  • Scripting: javascript, groovy, python
  • Site plugins: BigDesk, Elasticsearch Head
  • Misc: Mapper attachments, Hadoop, AWS cloud, Mock Solr, Suggester, PartialUpdate

Rivers

  • Interface to import data into elasticsearch
  • CouchDB, Wikipedia, Twitter, RabbitMQ
  • RSS, MongoDB
  • Hint: When writing your own river, make sure you are implementing streaming

Tools

  • BigDesk, Elasticsearch Head
  • Chef, Puppet
  • RPMs and Debian packages
  • daikon CLI

BigDesk Screenshot

Elasticsearch-head Screenshot

Language support & software

  • Java, Groovy, Python, Perl, Ruby, Erlang, .NET, Clojure
  • Integrations: Grails, Django, Rails, Catalyst, Flume, Terrastore, Hadoop, Symfony2, Drupal, CouchDB, Play framework, Node.js
  • Software: Graylog2
  • Elasticsearch as SaaS: bonsai.io

Running in production

  • 220k products, one index, one shard (due to result grouping)
  • Almost all queries have a big faceting query part (with filters)
  • Don't expose your search engine to the internet!
  • Write your own river
  • Be prepared to upgrade every now and then

Thanks for listening!

Questions?

Slides available at
http://spinscale.github.com/
alexander@reelsen.net
@spinscale

Documentation & Credits