Groonga CloudSearch blog

Using Groonga CloudSearch with aws_cloud_search gem

Introduction

This article describes how to use Groonga CloudSearch with aws_cloud_search.

Groonga CloudSearch is an Amazon CloudSearch compatible open source full text search server. With Groonga CloudSearch, you can try Amazon CloudSearch APIs on your local machine.

aws_cloud_search is a Ruby library (gem) which wraps the Amazon CloudSearch APIs. You can use aws_cloud_search to index your documents and search them. Though aws_cloud_search itself does not support Groonga CloudSearch, a small hack (monkey patching) lets us direct its requests to Groonga CloudSearch instead of Amazon CloudSearch. That means we can use the aws_cloud_search gem with Groonga CloudSearch.

Prerequisites

In this article, we assume that you

Prepare Groonga CloudSearch and example documents

First of all, let’s try searching with Groonga CloudSearch and aws_cloud_search. To keep the explanation simple, this section searches the documents prepared in the tutorial, so you need to finish the tutorial before you proceed. How to index your own documents with aws_cloud_search is described in a later section.

Install aws_cloud_search. We use RubyGems. Run gem install aws_cloud_search in your terminal.

$ gem install aws_cloud_search
Successfully installed aws_cloud_search-0.0.2
1 gem installed
Installing ri documentation for aws_cloud_search-0.0.2...
Installing RDoc documentation for aws_cloud_search-0.0.2...

Prepare a script to direct the requests to Groonga CloudSearch

As the URLs that aws_cloud_search connects to are hard-coded, we need a small patch to modify them.

Save the following code as direct_to_local_gcs.rb.

# A small hack to use Groonga CloudSearch.
# We override these three methods to direct requests to Groonga CloudSearch
# running on localhost:7575.
# We use http://xip.io/, which provides wildcard DNS for any IP address.
module AWSCloudSearch
  def self.search_url(domain, region="us-east-1")
    "http://search-#{domain}.#{region}.127.0.0.1.xip.io:7575"
  end

  def self.document_url(domain, region="us-east-1")
    "http://doc-#{domain}.#{region}.127.0.0.1.xip.io:7575"
  end

  def self.configuration_url
    "http://cloudsearch.us-east-1.127.0.0.1.xip.io:7575"
  end
end

This code overrides the URL-building methods of aws_cloud_search so that its requests go to Groonga CloudSearch, which is running on localhost:7575.
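The mechanism behind the patch is Ruby's open modules: reopening a module and defining a method under an existing name replaces the earlier definition. The following is a minimal sketch of that mechanism which runs without the gem installed. StandInCloudSearch is a hypothetical stand-in for AWSCloudSearch, and the Amazon endpoint format in its first definition is an assumption made for illustration only.

```ruby
require 'uri'

# Hypothetical stand-in for the gem's module, so that this sketch runs
# without aws_cloud_search installed. The endpoint format is assumed.
module StandInCloudSearch
  def self.search_url(domain, region = "us-east-1")
    "http://search-#{domain}.#{region}.cloudsearch.amazonaws.com"
  end
end

# Reopen the module and redefine the method, as the patch does.
# The later definition replaces the earlier one.
module StandInCloudSearch
  def self.search_url(domain, region = "us-east-1")
    "http://search-#{domain}.#{region}.127.0.0.1.xip.io:7575"
  end
end

url = URI.parse(StandInCloudSearch.search_url("example-00000000000000000000000000"))
puts url.host # the domain name and region stay in the host name
puts url.port # 7575, the port of the local server
```

Because the later definition wins, the file containing the patch has to be loaded after aws_cloud_search itself.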

Search the documents

To illustrate how to make search requests to Groonga CloudSearch with aws_cloud_search, we create a small script that searches the example domain on localhost:7575, which was created in the tutorial.

Save the following code as search.rb.

#!/usr/bin/env ruby

require 'aws_cloud_search'
require './direct_to_local_gcs' # direct requests to localhost:7575

# Instantiate a CloudSearch object that corresponds to the example domain.
domain_name = 'example-00000000000000000000000000'
cloud_search = AWSCloudSearch::CloudSearch.new(domain_name)

# Take a query string from the command line argument.
query = ARGV.join(' ')

# Create a search request object for the query.
search_request = AWSCloudSearch::SearchRequest.new
search_request.q = query

# Issue the request.
search_response = cloud_search.search(search_request)

# Show the results.
puts "#{search_response.found} documents are found for the query '#{query}':"

search_response.hits.each do |hit|
  p hit
end

You can execute a search with the script by running ruby search.rb [query]. Don’t forget to start the gcs server on localhost:7575 beforehand (see the tutorial for details).

The output should look like the following:

$ ruby search.rb tokyo
3 documents are found for the query 'tokyo':
{"id"=>"id1", "data"=>{"_id"=>[1], "_key"=>["id1"], "address"=>["Shibuya, Tokyo, Japan"], "email_address"=>["info@razil.jp"], "name"=>["Brazil"]}}
{"id"=>"id3", "data"=>{"_id"=>[3], "_key"=>["id3"], "address"=>["Hongo, Tokyo, Japan"], "email_address"=>["info@clear-code.com"], "name"=>["ClearCode Inc."]}}
{"id"=>"id9", "data"=>{"_id"=>[9], "_key"=>["id9"], "address"=>["Tokyo, Japan"], "email_address"=>[""], "name"=>["Umbrella Corporation"]}}

It works. You can modify this script to fit your needs.
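To see roughly what such a search looks like at the HTTP level, the sketch below builds a search URL by hand. It is an illustration under assumptions, not a description of what the gem does internally: the /2011-02-01/search path is taken from the Amazon CloudSearch 2011-02-01 search API, and build_search_uri is a hypothetical helper that is not part of aws_cloud_search.

```ruby
require 'uri'

# Hypothetical helper: build the URL of a search request against the
# local Groonga CloudSearch. The "/2011-02-01/search" path is assumed
# from the Amazon CloudSearch 2011-02-01 API.
def build_search_uri(domain, query, region = "us-east-1")
  uri = URI.parse("http://search-#{domain}.#{region}.127.0.0.1.xip.io:7575/2011-02-01/search")
  # encode_www_form escapes the query string (a space becomes "+").
  uri.query = URI.encode_www_form(q: query)
  uri
end

puts build_search_uri('example-00000000000000000000000000', 'tokyo')
puts build_search_uri('example-00000000000000000000000000', 'tokyo japan')
```

A URL built this way can be pasted into curl or a browser to compare the raw response with what search.rb prints.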

Index your documents

This section describes how to index your documents with aws_cloud_search. As an example, we create a simple command-line tool that indexes one entry given as command line arguments.

Save the following code as index.rb.

#!/usr/bin/env ruby

require 'aws_cloud_search'
require './direct_to_local_gcs' # direct requests to localhost:7575

# Instantiate a CloudSearch object that corresponds to the example domain.
domain_name = 'example-00000000000000000000000000'
cloud_search = AWSCloudSearch::CloudSearch.new(domain_name)

# Take the data from command line arguments.
id, name, address, email_address = ARGV

# Create a document to be indexed.
document = AWSCloudSearch::Document.new

document.id = id
document.add_field :name, name
document.add_field :address, address
document.add_field :email_address, email_address

# Create a batch to index the document.
batch = AWSCloudSearch::DocumentBatch.new
batch.add_document document

# Issue the request.
response = cloud_search.documents_batch(batch)

# Show the response.
p response

The script index.rb takes four arguments: id, name, address and email_address. Let us try to index a document.

$ ruby index.rb id11 "Snowy Corporation" "Tokyo, Japan" snowy@example.com
{"status"=>"success", "adds"=>1, "deletes"=>0}

The document is successfully indexed.
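For reference, a document batch in the Amazon CloudSearch 2011-02-01 document API is a JSON array of "add" and "delete" operations (the SDF format). The sketch below builds such a batch for the document indexed above using only the standard library. It is a sketch under assumptions: sdf_add is a hypothetical helper, the version and lang values are placeholders, and the exact payload that aws_cloud_search generates may differ.

```ruby
require 'json'

# Hypothetical helper: build one "add" operation in the shape of a
# 2011-02-01 SDF batch entry. The default version and lang values are
# placeholders chosen for illustration.
def sdf_add(id, fields, version = 1, lang = "en")
  {
    "type"    => "add",
    "id"      => id,
    "version" => version,
    "lang"    => lang,
    "fields"  => fields
  }
end

# A batch is simply an array of operations.
batch = [
  sdf_add("id11", {
    "name"          => "Snowy Corporation",
    "address"       => "Tokyo, Japan",
    "email_address" => "snowy@example.com"
  })
]

puts JSON.pretty_generate(batch)
```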

Search for tokyo again to check that the new document is searchable.

$ ruby search.rb tokyo
4 documents are found for the query 'tokyo':
{"id"=>"id1", "data"=>{"_id"=>[1], "_key"=>["id1"], "address"=>["Shibuya, Tokyo, Japan"], "email_address"=>["info@razil.jp"], "name"=>["Brazil"]}}
{"id"=>"id3", "data"=>{"_id"=>[3], "_key"=>["id3"], "address"=>["Hongo, Tokyo, Japan"], "email_address"=>["info@clear-code.com"], "name"=>["ClearCode Inc."]}}
{"id"=>"id9", "data"=>{"_id"=>[9], "_key"=>["id9"], "address"=>["Tokyo, Japan"], "email_address"=>[""], "name"=>["Umbrella Corporation"]}}
{"id"=>"id11", "data"=>{"_id"=>[20], "_key"=>["id11"], "address"=>["Tokyo, Japan"], "email_address"=>["snowy@example.com"], "name"=>["Snowy Corporation"]}}

The number of hits has increased to 4 (formerly it was 3), as it now includes the new document. The last document is the one we indexed with the index.rb script.
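Each hit is printed as a hash whose "data" values are arrays. If you prefer a more compact listing, something like the following sketch could replace the p hit line in search.rb. It assumes that hits behave like the plain hashes shown in the output above; summarize_hit is a hypothetical helper, and the sample data is copied from that output so the sketch runs on its own.

```ruby
# Two sample hits copied from the search output above.
hits = [
  {"id"=>"id1", "data"=>{"_id"=>[1], "_key"=>["id1"], "address"=>["Shibuya, Tokyo, Japan"], "email_address"=>["info@razil.jp"], "name"=>["Brazil"]}},
  {"id"=>"id11", "data"=>{"_id"=>[20], "_key"=>["id11"], "address"=>["Tokyo, Japan"], "email_address"=>["snowy@example.com"], "name"=>["Snowy Corporation"]}}
]

# Hypothetical helper: turn one hit into a single readable line.
# Field values arrive as arrays, so we take the first element of each.
def summarize_hit(hit)
  data = hit.fetch("data", {})
  name = Array(data["name"]).first
  address = Array(data["address"]).first
  "#{hit["id"]}: #{name} (#{address})"
end

hits.each { |hit| puts summarize_hit(hit) }
# Prints:
#   id1: Brazil (Shibuya, Tokyo, Japan)
#   id11: Snowy Corporation (Tokyo, Japan)
```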

Remove the documents

Removing a document is done in much the same way as indexing. Save the following code as delete.rb.

#!/usr/bin/env ruby

require 'aws_cloud_search'
require './direct_to_local_gcs' # direct requests to localhost:7575

# Instantiate a CloudSearch object that corresponds to the example domain.
domain_name = 'example-00000000000000000000000000'
cloud_search = AWSCloudSearch::CloudSearch.new(domain_name)

# Take the document id to be deleted from the command line argument.
id = ARGV.shift

# Create a document to be deleted.
document = AWSCloudSearch::Document.new
document.id = id

# Create a batch to remove the document.
batch = AWSCloudSearch::DocumentBatch.new
batch.delete_document document

# Issue the request.
response = cloud_search.documents_batch(batch)

# Show the response
p response

To delete the document with id = id11 (the document added in the previous section), run ruby delete.rb id11.

$ ruby delete.rb id11
{"status"=>"success", "adds"=>0, "deletes"=>1}

The removed entry, Snowy Corporation, no longer appears in the search results.

$ ruby search.rb tokyo
3 documents are found for the query 'tokyo':
{"id"=>"id1", "data"=>{"_id"=>[1], "_key"=>["id1"], "address"=>["Shibuya, Tokyo, Japan"], "email_address"=>["info@razil.jp"], "name"=>["Brazil"]}}
{"id"=>"id3", "data"=>{"_id"=>[3], "_key"=>["id3"], "address"=>["Hongo, Tokyo, Japan"], "email_address"=>["info@clear-code.com"], "name"=>["ClearCode Inc."]}}
{"id"=>"id9", "data"=>{"_id"=>[9], "_key"=>["id9"], "address"=>["Tokyo, Japan"], "email_address"=>[""], "name"=>["Umbrella Corporation"]}}
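Both index.rb and delete.rb simply print the batch response. In a script that runs unattended you may want to check it instead. The sketch below shows one way to do so, assuming the response behaves like the hashes printed above; batch_succeeded? is a hypothetical helper, not part of aws_cloud_search.

```ruby
# Hypothetical helper: check that a batch response reports success and
# the expected number of added and deleted documents.
def batch_succeeded?(response, expected_adds, expected_deletes)
  response["status"] == "success" &&
    response["adds"] == expected_adds &&
    response["deletes"] == expected_deletes
end

# Responses copied from the output of index.rb and delete.rb above.
index_response  = {"status"=>"success", "adds"=>1, "deletes"=>0}
delete_response = {"status"=>"success", "adds"=>0, "deletes"=>1}

puts batch_succeeded?(index_response, 1, 0)  # true
puts batch_succeeded?(delete_response, 0, 1) # true
puts batch_succeeded?(delete_response, 1, 0) # false: counts do not match
```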

Conclusions

We have introduced the aws_cloud_search gem, which wraps Amazon CloudSearch. With a small modification (monkey patching), the gem can be used with Groonga CloudSearch. We have described the patch and shown how to write scripts that search, index and remove documents with aws_cloud_search and Groonga CloudSearch.