Category Archives: latest version

Introduction

Category : latest version

The goal of ElasticWarehouse is to organize your files, make them searchable and take care of fault tolerance. With ElasticWarehouse you can store terabytes of data in a data cloud. In this guide you will learn how to install and configure an ElasticWarehouse cluster, how to import your files into the cluster and how to access them using the simple or the advanced API. For advanced usage it is good to understand how ElasticSearch and Lucene work, because ElasticWarehouse is built on top of them.

ElasticWarehouse is an open-source project and has nothing in common with Elastic.co, except that ElasticWarehouse is built on top of ElasticSearch.


Installation in 3 steps

  1. Download latest standalone ElasticWarehouse package
  2. Extract the official ElasticWarehouse distribution (zip or tar.gz) to /opt/elasticwarehouse (C:\opt\elasticwarehouse on Windows) or to a different location of your choice.
    cd /opt/elasticwarehouse
    tar -zxf elasticwarehouse-latest.tar.gz
    
  3. Launch elasticwarehouse.sh (ElasticWarehouse.bat on Windows)

Once launched, ElasticWarehouse creates a node client and either creates a new cluster or connects to an existing one using multicast discovery.

Here is an output from successful run:

What’s next?

Check ElasticWarehouse status:

curl -X GET http://localhost:10200/

Start more servers …


Post installation

Your single-node cluster should now be up and running. To check it, execute the following command:

curl http://localhost:10200

In response you should get a JSON status document.

When you run ElasticWarehouse in embedded mode, you can also check the ElasticSearch status:

curl http://localhost:9200

In response you should get the typical ElasticSearch status document.
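
Both status checks can also be scripted. The following is a minimal Python sketch, assuming the default ports 10200 and 9200 used in this guide. The helper names status_url and fetch_status are illustrative and not part of ElasticWarehouse, and nothing is assumed about the fields of the returned JSON.

```python
import json
import urllib.request


def status_url(host="localhost", port=10200):
    """Build the status URL; 10200 is the ElasticWarehouse API port used in this guide."""
    return f"http://{host}:{port}/"


def fetch_status(host="localhost", port=10200, timeout=5.0):
    """GET the status document and decode it as JSON.

    Raises OSError (e.g. connection refused) when nothing listens on the port.
    """
    with urllib.request.urlopen(status_url(host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling fetch_status() queries the ElasticWarehouse API, while fetch_status(port=9200) queries ElasticSearch directly, which is expected to answer only in embedded mode.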

Running as a daemon (*nix OS)

ElasticWarehouse is distributed with scripts that set it up as a daemon. To set up the daemon, run the install script; the daemon will run as the user passed as the parameter:

sudo ./elasticwarehousedaemon_install.sh username

To manage the daemon state:

sudo /etc/init.d/elasticwarehousedaemon start
sudo /etc/init.d/elasticwarehousedaemon stop

or

sudo service elasticwarehousedaemon start
sudo service elasticwarehousedaemon stop

Running as a service (Windows)

ElasticWarehouse is distributed with a set of batch files and nssm.exe (a service manager wrapper). To install ElasticWarehouse as a service on Windows, run:

cd c:\opt\elasticwarehouse\bin
adminrun.bat "elasticwarehouse service - install.bat"

This command installs the ElasticWarehouse service and starts it. Log files are placed in the c:\opt\elasticwarehouse\logs\ folder.

To manage the service state you can use the Windows Service Manager:

ewservice

or command line tools:

cd c:\opt\elasticwarehouse\bin
adminrun.bat "elasticwarehouse service - uninstall.bat"
adminrun.bat "elasticwarehouse service - start.bat"
adminrun.bat "elasticwarehouse service - stop.bat"
adminrun.bat "elasticwarehouse service - restart.bat"

Note that adminrun.bat is not needed when you run the batch scripts from a console with administrative rights.


Configuration

ElasticWarehouse is distributed with a default configuration optimized for the most common setups. Building an ElasticSearch cluster can be a very complex project, so we refer you to the https://www.elastic.co/ website for more information about it. Here we focus on basic cluster configuration only.

Main configuration

ElasticWarehouse configuration files are in the config folder:

ls -l /opt/elasticwarehouse/config/

elasticsearch.yml
elasticwarehouse.yml

elasticsearch.yml is the ElasticSearch configuration file. It is used when ElasticWarehouse starts in embedded mode (the default mode). In this mode ElasticWarehouse creates an ElasticSearch data node and tries to connect to an existing cluster (defined in cluster.name) using multicast discovery.
To change the node configuration, edit elasticsearch.yml and restart the node. More information about this configuration file can be found in the ElasticSearch documentation.

elasticwarehouse.yml is the main ElasticWarehouse configuration file. See the table below for more details:

| Group | Key | Type | Default value | Description |
|---|---|---|---|---|
| Mode definition | mode.embedded | boolean | true | Defines the ElasticWarehouse instance work mode: embedded (true) or remote (false). |
| Remote mode specific | elasticsearch.cluster | string | elasticwarehouse | Name of the cluster to connect to when the instance works in remote mode (mode.embedded is false). |
| Remote mode specific | elasticsearch.hosts | string | n/a | Hosts to connect to, in the form host1,host2:port |
| Embedded mode specific | grafana.port | int | 10500 | Port the Grafana interface listens on. If binding fails, ElasticWarehouse tries the next available port, e.g. 10501, 10502, 10503, etc. |
| ElasticSearch index definitions | elasticsearch.template.storage.name | string | elasticwarehousestorage | Should be the same as elasticsearch.index.storage.name |
| ElasticSearch index definitions | elasticsearch.template.tasks.name | string | elasticwarehousetasks | Should be the same as elasticsearch.index.tasks.name |
| ElasticSearch index definitions | elasticsearch.index.storage.name | string | elasticwarehousestorage | Name of the index that stores files. |
| ElasticSearch index definitions | elasticsearch.index.storage.type | string | files | Type inside the index that stores files. You can access files manually via the ElasticSearch REST API, like: http://<host>:<port>/index/type/_search |
| ElasticSearch index definitions | elasticsearch.index.storage.childtype | string | childfiles | Each file uploaded to the ElasticWarehouse cluster is parsed to get as much information about it as possible (e.g. EXIF data for images, text content for PDF files). Some files, like PDF or Word documents, may contain embedded files (images, attachments or OLE objects). ElasticWarehouse extracts all such embedded files and stores them in a separate child type (one file stored in "type" may have many references to "childfiles"). This allows ElasticWarehouse to search in a more advanced way. |
| ElasticSearch index definitions | elasticsearch.index.tasks.name | string | elasticwarehousetasks | Each operation, such as folder creation, file scan or upload, is asynchronous and logged as a task. This attribute defines the name of the index that keeps the whole task history (see the _ewtask rest point for more details). |
| ElasticSearch index definitions | elasticsearch.index.tasks.type | string | tasks | Data is stored inside a type, not directly inside the index. You can access tasks manually via the ElasticSearch REST API, like: http://<host>:<port>/index/type/_search |
| Global settings | elasticwarehouse.api.port | int | 10200 | Port the API listens on. If binding fails, ElasticWarehouse tries the next available port, e.g. 10201, 10202, 10203, etc. |
| Global settings | log.level | string | DEBUG | Log level. To limit the log file size use INFO, WARN or ERROR. |
| Global settings | path.tmp | string | /tmp | Temp folder location. |
| Global settings | exclude.files | string | avi mp4 mkv | List of file extensions to be excluded and rejected by the cluster. |
| Global settings | thumb.size | int | 360 | ElasticWarehouse generates a thumbnail for every image uploaded to the cluster. Available sizes: 90, 180, 360, 720. |
| Global settings | tasks.max.number | int | 2 | Maximum number of asynchronous tasks executed at once (scan is an example of an asynchronous task; see _ewtask for more details). |
| Global settings | rrd.db.path | string | data folder | ElasticWarehouse logs performance counters for monitoring purposes. By default EW creates all RRD databases in the same folder where ElasticSearch creates its Lucene indices. |
| Global settings | rrd.hostname | string | localhost name | Set this attribute explicitly when you run several ElasticWarehouse instances (nodes) on the same machine. If not set, the hostname is used. |
| Global settings | rrd.enabled | boolean | true | Set to false to disable the performance counters collector. |
| Global settings | store.content | boolean | true | When store.content=true, ElasticWarehouse behaves as a data cloud (it stores extracted file meta information and file content inside the index). When store.content=false, ElasticWarehouse behaves as a data indexer only: it doesn't store binary file content, only the path to the original file. When you set it to "false" you must configure store.folder. |
| Global settings | store.folder | string | /opt/upload | When you upload a file to ElasticWarehouse via _ewupload and store.content=false, the file content is saved to this folder. |
| Global settings | store.movescanned | boolean | false | When you use the "scan" task to import files into the ElasticWarehouse cluster, you can choose whether to make a copy of the original file or not. The copy is placed in the location defined in store.folder. |
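
The port fallback described for grafana.port and elasticwarehouse.api.port can be illustrated with a short Python sketch. This is not ElasticWarehouse's actual implementation; the function name first_free_port and the attempts limit are assumptions made for illustration only.

```python
import socket


def first_free_port(start, attempts=10, host="127.0.0.1"):
    """Return the first port in [start, start + attempts) that can be bound."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is taken (or not permitted); try the next one
            return port  # the probe socket is closed when the with-block exits
    raise RuntimeError(f"no free port in {start}..{start + attempts - 1}")
```

For example, first_free_port(10200) returns 10200 when that port is free, and the next bindable port (10201, 10202, ...) when it is already taken.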

The configuration file is loaded when ElasticWarehouse starts, so you must restart your ElasticWarehouse instance after each configuration change.

Note that some configuration changes, such as thumb.size, store.content, store.folder, store.movescanned or rrd.db.path, may require additional manual maintenance work, so change them carefully.

For changes to thumb.size there is a dedicated task, “rethumb”. This task recreates all thumbnails according to the currently loaded settings.

Logging configuration

By default, logs are stored in the logs folder (e.g. c:\opt\elasticwarehouse\logs or /opt/elasticwarehouse/logs). The log folder and log format can be changed by editing the log4j.properties file stored in the working folder of the ElasticWarehouse process, e.g. c:\opt\elasticwarehouse\bin\log4j.properties or /opt/elasticwarehouse/bin/log4j.properties.


Alternative installations

If you already have an ElasticSearch cluster and you don’t want to add more nodes to it, you may use one of the alternative configurations (remote or plugin).

Remote configuration

You may start a single ElasticWarehouse instance as an “indexing gateway”. In this configuration your ElasticWarehouse instance is a client of the existing ElasticSearch cluster and all write operations are executed by this instance. Searching and fetching through such a gateway is possible but not efficient, which is why search queries should be executed directly on the ElasticSearch cluster.

Figure 1. ElasticWarehouse as a writing gateway to existing ElasticSearch cluster

How to configure

  1. Edit elasticwarehouse.yml
  2. Set:
    1. mode.embedded: false
    2. elasticsearch.hosts: "host1,host2:port"
  3. Launch the ElasticWarehouse instance. The instance will use a transport client to connect to the hosts "host1,host2:port".
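
To make the host1,host2:port format concrete, here is a small Python sketch that splits such a value into (host, port) pairs. It only illustrates the format and is not ElasticWarehouse's own parser; the function name parse_hosts and the fallback port 9300 (the usual ElasticSearch transport port) are assumptions.

```python
def parse_hosts(spec, default_port=9300):
    """Split 'host1,host2:port' into a list of (host, port) pairs.

    default_port=9300 is an assumption: it is the usual ElasticSearch
    transport port, used here for entries that do not name a port.
    """
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty entries such as a trailing comma
        host, sep, port = item.partition(":")
        pairs.append((host, int(port) if sep else default_port))
    return pairs
```

With this sketch, parse_hosts("host1,host2:9301") yields one pair for host1 on the fallback port and one for host2 on port 9301.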

Plugin configuration

ElasticWarehouse is also distributed as an ElasticSearch plugin. You may install the ElasticWarehouse plugin on your existing ElasticSearch cluster and enjoy the core ElasticWarehouse features (see the comparison matrix).

Figure 2. ElasticWarehouse as a plugin on existing ElasticSearch cluster

How to install plugin (ElasticSearch 1.x and ElasticSearch 2.x)

  1. Choose the latest version of the ElasticWarehouse plugin (e.g. 1.2.2)
  2. Determine the ElasticSearch version you currently use. When you use:
    1. ElasticSearch 1.2.1, then download: elasticsearch-elasticwarehouseplugin-1.2.2-1.2.1.zip,
    2. ElasticSearch 1.7.0, then download: elasticsearch-elasticwarehouseplugin-1.2.2-1.7.0.zip,
    3. ElasticSearch 2.1.0, then download: elasticsearch-elasticwarehouseplugin-1.2.2-2.1.0.zip,
    4. and so on…
  3. Choose the plugin distribution with or without dependencies. The distribution with dependencies is a big package, but it includes all jars, in the proper versions, that the ElasticWarehouse plugin needs to work. The distribution without dependencies is a small package, but it requires all jar dependencies to be available on your class path.
  4. Go to ElasticSearch’s bin folder, and type:

Install version without dependencies (ElasticSearch 1.x):

plugin -install elasticwarehouseplugin -u http://elasticwarehouse.effisoft.eu/download.php?fname=elasticsearch-elasticwarehouseplugin-1.2.2-1.7.0.zip

-> Installing elasticwarehouseplugin...
Trying http://elasticwarehouse.effisoft.eu/download.php?fname=elasticsearch-elasticwarehouseplugin-1.2.2-1.2.1.zip...
Downloading ...........................................................................................DONE
Installed elasticwarehouseplugin into c:\opt\elasticsearch\plugins\elasticwarehouseplugin

Install version with dependencies (ElasticSearch 1.x):

plugin -install elasticwarehouseplugin -u http://elasticwarehouse.effisoft.eu/download.php?fname=elasticsearch-elasticwarehouseplugin-1.2.2-1.7.0-with-dependencies.zip

-> Installing elasticwarehouseplugin...
Trying http://elasticwarehouse.effisoft.eu/download.php?fname=elasticsearch-elasticwarehouseplugin-1.2.2-1.7.0-with-dependencies.zip...
Downloading ...................................................................................................DONE
Installed elasticwarehouseplugin into c:\opt\elasticsearch\plugins\elasticwarehouseplugin

Install version with dependencies (ElasticSearch 2.x):

plugin install http://elasticwarehouse.effisoft.eu/download.php?fname=elasticsearch-elasticwarehouseplugin-1.2.2-2.1.0-with-dependencies.zip

-> Installing from http://elasticwarehouse.effisoft.eu/download.php?fname=elasticsearch-elasticwarehouseplugin-1.2.2-2.1.0-with-dependencies.zip...
Plugins directory [/home/user/workspace/elasticsearch-2.1.0/plugins] does not exist. Creating...
Trying http://elasticwarehouse.effisoft.eu/download.php?fname=elasticsearch-elasticwarehouseplugin-1.2.2-2.1.0-with-dependencies.zip ...
Downloading ....................................................................................................DONE
Verifying http://elasticwarehouse.effisoft.eu/download.php?fname=elasticsearch-elasticwarehouseplugin-1.2.2-2.1.0-with-dependencies.zip checksums if available ...
NOTE: Unable to verify checksum for downloaded plugin (unable to find .sha1 or .md5 file to verify)
Installed elasticwarehouseplugin into /home/user/workspace/elasticsearch-2.1.0/plugins/elasticwarehouseplugin
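
The package names and install commands above follow a regular pattern, which the following Python sketch reproduces. The helper names are illustrative, the download base URL is copied from the commands above, and only the ElasticSearch 1.x and 2.x command forms shown in this guide are covered.

```python
DOWNLOAD_BASE = "http://elasticwarehouse.effisoft.eu/download.php?fname="


def plugin_package(plugin_version, es_version, with_dependencies=False):
    """Build the zip name: plugin version first, then the ElasticSearch version."""
    suffix = "-with-dependencies" if with_dependencies else ""
    return (
        f"elasticsearch-elasticwarehouseplugin-{plugin_version}-{es_version}{suffix}.zip"
    )


def install_command(plugin_version, es_version, with_dependencies=False):
    """Return the install command line for ElasticSearch 1.x or 2.x."""
    url = DOWNLOAD_BASE + plugin_package(plugin_version, es_version, with_dependencies)
    major = int(es_version.split(".")[0])
    if major == 1:
        return f"plugin -install elasticwarehouseplugin -u {url}"
    if major == 2:
        return f"plugin install {url}"
    raise ValueError("only ElasticSearch 1.x and 2.x are covered by this guide")


if __name__ == "__main__":
    print(install_command("1.2.2", "2.1.0", with_dependencies=True))
```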

Please contact us if you have any questions, or check the “elasticwarehouse plugin installation known issues” page.


API Introduction

ElasticWarehouse defines its own API, but advanced users may want to use the ElasticSearch API as well. We strongly recommend using the ElasticWarehouse API for all write operations, to make sure the generated JSON documents are in the expected format. We have no objections to using the ElasticSearch API for search purposes, especially since the transport client is more efficient than the HTTP API. Each ElasticWarehouse API request can be transformed and debugged, which may help with building your own search requests. If you need additional search criteria, let us know about them or share them on GitHub.

| API type | Port | Description |
|---|---|---|
| ElasticWarehouse REST API | 10200 | Simple API that supports the most common file operations. Each request sent to port 10200 is transformed into an ElasticSearch request and then executed on the ElasticSearch cluster. |
| ElasticWarehouse Grafana | 10500 | Grafana interface that visualizes performance counters collected by ElasticWarehouse. The Grafana interface is enabled in ElasticWarehouse embedded mode only. |
| ElasticSearch REST API | 9200 | See documentation on Elastic.co |
| ElasticSearch Transport client | 9300 | See documentation on Elastic.co |
| ElasticSearch Node client | n/a | See documentation on Elastic.co |

ElasticWarehouse REST API

ElasticWarehouse REST API defines 9 rest points:

  • _ewupload
  • _ewinfo
  • _ewget
  • _ewbrowse
  • _ewtask
  • _ewsearchall
  • _ewsearch
  • _ewsummary
  • _ewgraphite (not available in plugin version)

Each rest point accepts different attributes, sent by GET or POST, from which it builds an ElasticSearch-compatible request. You can view the ElasticSearch request by adding showrequest=true to the URL.

API ports

Depending on the chosen version, you can access the rest points listed above via port 9200 (the default ElasticSearch HTTP module port) or 10200 (the default ElasticWarehouse API port).

When running the ElasticWarehouse plugin inside an existing ElasticSearch instance:

curl -XGET "http://hostname:9200/_ewbrowse"

When running ElasticWarehouse instance (embedded or remote modes):

curl -XGET "http://hostname:10200/_ewbrowse"

All examples in this guide use port 10200. Change it to 9200 if you chose the ElasticWarehouse plugin version.
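
The port choice can be captured in a small helper. This is a minimal Python sketch, assuming the default ports described above; the names DEFAULT_PORTS and rest_point_url are illustrative and not part of ElasticWarehouse.

```python
from urllib.parse import urlencode

# Default ports from this guide: the plugin shares ElasticSearch's HTTP port,
# standalone instances (embedded or remote mode) use the ElasticWarehouse API port.
DEFAULT_PORTS = {"plugin": 9200, "embedded": 10200, "remote": 10200}


def rest_point_url(host, rest_point, deployment="embedded", **params):
    """Build the URL of a rest point for the given deployment type."""
    url = f"http://{host}:{DEFAULT_PORTS[deployment]}/{rest_point}"
    if params:
        url += "?" + urlencode(params)
    return url


if __name__ == "__main__":
    print(rest_point_url("hostname", "_ewbrowse", "plugin"))
    print(rest_point_url("hostname", "_ewbrowse", showrequest="true"))
```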


_ewupload

This rest point is used to upload a single file to the cluster. Before you start, make sure the cluster is up and running by sending a simple curl command:

curl http://localhost:10200

The returned JSON response shows the cluster status (see Post installation).

To upload a single JPEG file to the cluster:

curl -XPOST "http://localhost:10200/_ewupload?folder=/files/mypictures/&filename=myimage.jpg" --data-binary @myimage.jpg

What happened?
The /files/mypictures/ folder has been created on the ElasticWarehouse cluster and myimage.jpg was uploaded into it. The myimage.jpg file is now accessible by its id, aYLro1V_TzO0tfLNmbp4gA in this example. To navigate through the ElasticWarehouse cluster you can use the ewshell bash script (see the community page) distributed with the ElasticWarehouse package.

ewshell.sh -c browse /files/mypictures

To upload an update of an existing file, pass its id:

curl -XPOST "http://localhost:10200/_ewupload?folder=/files/mypictures/&filename=myimage.jpg&id=aYLro1V_TzO0tfLNmbp4gA" --data-binary @myimage.jpg

When you browse the cluster, you will see that MODIFYDATE and MODIFYTIME are no longer empty after the update.

ewshell.sh -c browse /files/mypictures

To manage uploaded files, use tasks (see the _ewtask rest point).

| Parameter | Requirement | Type | Description |
|---|---|---|---|
| folder | mandatory | string | Name of the folder the file is uploaded to. If the folder doesn’t exist, it is created. |
| filename | mandatory | string | Target filename. The file you upload may have a different name on the ElasticWarehouse cluster than on the local file system. All file actions are performed via the file id, so the cluster even allows many files with the same name in the same folder. To distinguish them, use the file id. |
| id | optional | string | Id of the file to be updated. If no id is passed in the request, the file is uploaded as a new one. |
| showrequest | optional | boolean | When set to “true”, the ElasticWarehouse node prints the converted JSON request on standard output. The converted JSON request can be executed directly on the ElasticSearch cluster. This is useful when you plan to build your own ElasticWarehouse client connected directly to the ElasticSearch cluster. |
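
The upload request can also be issued from code. The following is a minimal Python sketch of the same POST that the curl examples send, assuming the default host and port from this guide. The helper names upload_url and upload_file are illustrative, and the response is returned as raw text because its format is not specified here.

```python
import urllib.request
from urllib.parse import urlencode


def upload_url(folder, filename, file_id=None, host="localhost", port=10200):
    """Build the _ewupload URL; pass file_id to update an existing file."""
    params = {"folder": folder, "filename": filename}
    if file_id is not None:
        params["id"] = file_id
    # safe="/" keeps folder paths readable, as in the curl examples
    return f"http://{host}:{port}/_ewupload?" + urlencode(params, safe="/")


def upload_file(local_path, folder, filename, file_id=None, **where):
    """POST the raw bytes of local_path and return the response body as text."""
    with open(local_path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        upload_url(folder, filename, file_id, **where), data=body, method="POST"
    )
    with urllib.request.urlopen(request) as resp:
        return resp.read().decode("utf-8")
```

For example, upload_file("myimage.jpg", "/files/mypictures/", "myimage.jpg") mirrors the first curl command, and adding file_id="aYLro1V_TzO0tfLNmbp4gA" mirrors the update request.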

_ewinfo

In most cases you only get a file id. _ewinfo returns more information about the file with the provided id. When the id is not known, you can identify the file by providing its folder and filename.

| Parameter | From version | Requirement | Type | Description |
|---|---|---|---|---|
| id | all | optional | string | Id of the file whose detailed information is requested. |
| showrequest | all | optional | boolean | When set to “true”, the ElasticWarehouse node prints the converted JSON request on standard output. The converted JSON request can be executed directly on the ElasticSearch cluster. This is useful when you plan to build your own ElasticWarehouse client connected directly to the ElasticSearch cluster. |
| folder | >=1.2.1 | optional | string | Use the folder/filename pair to get file information when the id is not known. |
| filename | >=1.2.1 | optional | string | Use the folder/filename pair to get file information when the id is not known. |
| set | >=1.2.1 | optional | boolean | Default value is FALSE. Set to TRUE to modify a particular file attribute. |
| attribute | >=1.2.1 | optional | string | Used when set=true. Name of the attribute to be modified. |
| value | >=1.2.1 | optional | string | Used when set=true. New value of the attribute. |
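
The parameter combinations above can be sketched as URL builders in Python. The helper names are illustrative, the default host and port follow the rest of this guide, and combining set/attribute/value with id in one request is an assumption based on the table rather than a documented example.

```python
from urllib.parse import urlencode


def info_url(file_id=None, folder=None, filename=None, host="localhost", port=10200):
    """Build an _ewinfo URL from an id, or from folder + filename (>=1.2.1)."""
    if file_id is not None:
        params = {"id": file_id}
    elif folder is not None and filename is not None:
        params = {"folder": folder, "filename": filename}
    else:
        raise ValueError("pass either file_id or both folder and filename")
    return f"http://{host}:{port}/_ewinfo?" + urlencode(params, safe="/")


def set_attribute_url(file_id, attribute, value, host="localhost", port=10200):
    """Build an _ewinfo URL that modifies one attribute (assumed usage, >=1.2.1)."""
    params = {"id": file_id, "set": "true", "attribute": attribute, "value": value}
    return f"http://{host}:{port}/_ewinfo?" + urlencode(params, safe="/")


if __name__ == "__main__":
    print(info_url("aYLro1V_TzO0tfLNmbp4gA"))
    print(info_url(folder="/files/mypictures/", filename="myimage.jpg"))
```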

You can use the ewshell wrapper to get file information:

ewshell.sh -c info 6pnHlt3XRRCr-PT6-mg8RA --showmeta 1

Or query the ElasticWarehouse REST API directly:

curl -XGET "http://localhost:10200/_ewinfo?id=aYLro1V_TzO0tfLNmbp4gA"

Sample output:

Sample requests for a Microsoft Word file and a JPEG image stored in the ElasticWarehouse cluster:

ewshell.sh -c info 9IDaZMEPQEWMTdHwZ8rqMw --showmeta 1
curl -XGET "http://localhost:10200/_ewinfo?id=9IDaZMEPQEWMTdHwZ8rqMw"

ewshell.sh -c info THQBSsj8Tk-Qy-qL6CTSHA --showmeta 1
curl -XGET "http://localhost:10200/_ewinfo?id=THQBSsj8Tk-Qy-qL6CTSHA"