Category Archives: Uncategorized


ElasticWarehouse version 1.2.3

Category : Uncategorized

Release note

  • version 1.2.3
    • bug fixes,
    • starting from this version, we support ES 2.x,
    • for ES 2.x version:
      • Tika upgraded to 1.11 (version for ES 1.x still uses Tika 1.7),
      • Kopf and Head plugins upgraded to latest versions,
      • builds done on Java 1.8

ElasticWarehouse standalone packages (to work in embedded and remote modes)

ElasticWarehouse plugin packages (to be hosted as ElasticSearch plugin)



ElasticWarehouse version 1.2.2


Release note

  • version 1.2.2
    • bug fixes

ElasticWarehouse standalone packages (to work in embedded and remote modes)

ElasticWarehouse plugin packages (to be hosted as ElasticSearch plugin)

 



ElasticWarehouse Plugin installation – known issues


The most common issues are related to jar dependencies. Because ES 2.x includes the JarHell checker, errors can appear during plugin installation as well as at runtime. Below we collect the most common issues and their solutions.

Runtime exceptions, java.lang.ExceptionInInitializerError or java.lang.ClassNotFoundException when uploading specific file formats to ElasticWarehouse cluster

ElasticWarehouse uses Tika to parse file contents and file metadata. Tika has many dependencies, and some of them must be available on the classpath to work correctly. The ElasticWarehouse package contains all required dependencies in the correct versions, but sometimes you need to add them to the classpath explicitly. To do so, open the ElasticSearch startup script:

vim /bin/elasticsearch.in.sh

Then edit the ES_CLASSPATH variable by adding the plugin folder (the last two entries in the example below, both under $ES_HOME/plugins/elasticwarehouseplugin). Remember to use the correct plugin version (this example uses 1.2.2-2.1.0):

ES_CLASSPATH="$ES_HOME/lib/elasticsearch-2.1.0.jar:$ES_HOME/lib/*:$ES_HOME/plugins/elasticwarehouseplugin/*:$ES_HOME/plugins/elasticwarehouseplugin/elasticwarehouseplugin-1.2.2-2.1.0-jar-with-dependencies.jar"
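To keep the plugin version in one place, the same value can be assembled from variables. This is only a sketch: the names EW_PLUGIN_VERSION and EW_PLUGIN_DIR are introduced here for illustration, and the fallback ES_HOME path is an assumption.

```shell
# Illustrative fragment for elasticsearch.in.sh.
# EW_PLUGIN_VERSION and EW_PLUGIN_DIR are hypothetical helper variables.
ES_HOME="${ES_HOME:-/opt/elasticsearch-2.1.0}"   # assumed install location
EW_PLUGIN_VERSION="1.2.2-2.1.0"                  # must match the installed plugin
EW_PLUGIN_DIR="$ES_HOME/plugins/elasticwarehouseplugin"

# The quotes keep the "*" wildcards literal, as in the original line.
ES_CLASSPATH="$ES_HOME/lib/elasticsearch-2.1.0.jar:$ES_HOME/lib/*"
ES_CLASSPATH="$ES_CLASSPATH:$EW_PLUGIN_DIR/*"
ES_CLASSPATH="$ES_CLASSPATH:$EW_PLUGIN_DIR/elasticwarehouseplugin-${EW_PLUGIN_VERSION}-jar-with-dependencies.jar"

echo "$ES_CLASSPATH"
```

After a plugin upgrade only EW_PLUGIN_VERSION needs to change.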

The issue mostly occurs for:

  • *.atom (java.lang.NoClassDefFoundError: org/jdom/input/JDOMParseException)
  • *.xls, *.xlsx, *.ppt, *.pptx (java.lang.ClassNotFoundException: org.apache.poi.poifs.crypt.cryptoapi.CryptoAPIEncryptionInfoBuilder)

Installation error, java.lang.IllegalStateException

ElasticSearch 2.x uses the JarHell class to check dependencies. When the same class is provided by more than one jar, it prints something like the output below and aborts the installation with an error code:

Exception in thread "main" java.lang.IllegalStateException: failed to load bundle [file:/opt/elasticwarehouseplugin-1.2.2-2.1.0-jar-with-dependencies.jar] due to jar hell
Likely root cause: java.lang.IllegalStateException: jar hell!
class: org.apache.poi.EmptyFileException
jar1: /home/user/workspace/elasticsearch-2.1.0/lib/poi-3.13.jar
jar2: /home/user/workspace/elasticsearch-2.1.0/plugins/elasticwarehouseplugin/elasticwarehouseplugin-1.2.2-2.1.0-jar-with-dependencies.jar
at org.elasticsearch.bootstrap.JarHell.checkClass(JarHell.java:280)
at org.elasticsearch.bootstrap.JarHell.checkJarHell(JarHell.java:186)
at org.elasticsearch.plugins.PluginsService.loadBundles(PluginsService.java:336)
at org.elasticsearch.plugins.PluginsService.<init>(PluginsService.java:109)
at org.elasticsearch.node.Node.<init>(Node.java:148)
at org.elasticsearch.node.Node.<init>(Node.java:129)
at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:145)
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:178)
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:285)
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:35)
Refer to the log for complete error details

In this situation the best approach is to install the ElasticWarehouse plugin package without bundled dependencies and copy all dependency *.jar files manually to the “<elastic_search>/lib/” folder.

./bin/plugin install http://elasticwarehouse.effisoft.eu/elasticwarehouse/elasticsearch-elasticwarehouseplugin-1.2.2-2.1.0.zip

The list of dependencies can be taken from the pom.xml file.

java.security.AccessControlException

Exception in thread "Thread-11" java.security.AccessControlException: access denied ("java.io.FilePermission" "/home/user/myfiles" "read")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
at java.security.AccessController.checkPermission(AccessController.java:884)
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549)
at java.lang.SecurityManager.checkRead(SecurityManager.java:888)
at java.io.File.list(File.java:1117)
at java.io.File.listFiles(File.java:1207)
at org.elasticwarehouse.core.parsers.FileTools.scanFolder(FileTools.java:82)
at org.elasticwarehouse.core.parsers.FileTools.scanFolder(FileTools.java:73)
at org.elasticwarehouse.tasks.ElasticWarehouseTaskScan.scanFolder(ElasticWarehouseTaskScan.java:155)
at org.elasticwarehouse.tasks.ElasticWarehouseTaskScan.access$200(ElasticWarehouseTaskScan.java:45)
at org.elasticwarehouse.tasks.ElasticWarehouseTaskScan$1.run(ElasticWarehouseTaskScan.java:111)

Solution 1:
Check that the user running the process has read access to the given location.

Solution 2:
Edit <jre location>/lib/security/java.policy to allow the application to access a folder outside its deployment directory by adding the line:

permission java.io.FilePermission "/home/user/myfiles/-", "read";

Here /- means all files and sub-folders inside this folder. While investigating the issue, you may also consider temporarily granting all permissions:

grant {
    permission java.security.AllPermission;
};
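For the narrower rule from Solution 2, the permission line has to sit inside a grant entry. The sketch below writes such an entry to a scratch file so it can be reviewed before being merged into java.policy. The variable names are illustrative. The folder itself is listed as well as its contents because, as far as the FilePermission documentation describes it, "/-" covers what is inside the folder rather than the folder itself, and the exception above was raised for the folder path.

```shell
# Write a narrowly scoped grant entry to a scratch file for review.
# SCAN_DIR and POLICY_SNIPPET are illustrative names, not ElasticWarehouse settings.
SCAN_DIR="/home/user/myfiles"
POLICY_SNIPPET="$(mktemp)"

cat > "$POLICY_SNIPPET" <<EOF
grant {
    // the folder itself (needed to list it)
    permission java.io.FilePermission "$SCAN_DIR", "read";
    // every file and sub-folder below it
    permission java.io.FilePermission "$SCAN_DIR/-", "read";
};
EOF

cat "$POLICY_SNIPPET"
```

Once the scan works, the AllPermission entry used for investigation can be removed again.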

Deploy an ElasticWarehouse instance to act as a master in your cluster

Sometimes the easiest way is to add an ElasticWarehouse node to your ElasticSearch cluster instead of using the plugin. Such a node (configured with node.master=true and node.data=false in elasticsearch.yml) does not store any data; it joins your ElasticSearch cluster and serves as the ElasticWarehouse API node.
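The node-role settings mentioned above could look as follows in elasticsearch.yml. This is a sketch that writes them to a scratch file for review; the cluster name my-es-cluster is a placeholder and has to be replaced with the name of the cluster the node should join.

```shell
# Write the node-role settings to a scratch file before merging them
# into config/elasticsearch.yml. CLUSTER_NAME is a placeholder value.
CLUSTER_NAME="my-es-cluster"
NODE_SETTINGS="$(mktemp)"

cat > "$NODE_SETTINGS" <<EOF
# must match the name of the existing cluster
cluster.name: $CLUSTER_NAME
# master-eligible, but holds no index data
node.master: true
node.data: false
EOF

cat "$NODE_SETTINGS"
```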



Configuration

ElasticWarehouse is distributed with a default configuration suited to the most common setups. Building an ElasticSearch cluster can be a very complex project, so we refer you to the https://www.elastic.co/ website for more information. Here we focus on basic cluster configuration only.

Main configuration

ElasticWarehouse configuration files are in the config folder:

ls -l /opt/elasticwarehouse/config/

elasticsearch.yml
elasticwarehouse.yml

elasticsearch.yml is the ElasticSearch configuration file. It is used when ElasticWarehouse starts in embedded mode (the default). In this mode ElasticWarehouse creates an ElasticSearch data node and tries to connect to an existing cluster (defined in cluster.name) using multicast discovery.
To change the node configuration, edit elasticsearch.yml and restart the node. More information about this configuration file can be found in the ElasticSearch documentation.

elasticwarehouse.yml is the main ElasticWarehouse configuration file. See the table below for details:

| Group | Key | Type | Default value | Description |
|---|---|---|---|---|
| Mode definition | mode.embedded | boolean | true | Defines the ElasticWarehouse instance work mode (one of: embedded or remote). |
| Remote mode specific | elasticsearch.cluster | string | elasticwarehouse | Name of the cluster to connect to when the instance works in remote mode (when mode.embedded is false). |
| | elasticsearch.hosts | string | n/a | List of hosts in the form host1,host2:port |
| Embedded mode specific | grafana.port | int | 10500 | Port Grafana listens on. If binding fails, ElasticWarehouse tries the next available port, i.e. 10501, 10502, 10503, etc. |
| ElasticSearch index definitions | elasticsearch.template.storage.name | string | elasticwarehousestorage | Should be the same as elasticsearch.index.storage.name |
| | elasticsearch.template.tasks.name | string | elasticwarehousetasks | Should be the same as elasticsearch.index.tasks.name |
| | elasticsearch.index.storage.name | string | elasticwarehousestorage | Name of the index that stores files. |
| | elasticsearch.index.storage.type | string | files | Type inside the index used to store files. You can access files manually via the ElasticSearch REST API, e.g. http://<host>:<port>/index/type/_search |
| | elasticsearch.index.storage.childtype | string | childfiles | Each file uploaded to the ElasticWarehouse cluster is parsed to get as much information about it as possible (e.g. EXIF data for images, text content for PDF files). Some files, such as PDF or Word documents, may contain embedded files (images, attachments or OLE objects). ElasticWarehouse extracts all such embedded files and stores them in a separate child type (one file stored in "type" may have many references to "childfiles"). This allows ElasticWarehouse to search in a more advanced way. |
| | elasticsearch.index.tasks.name | string | elasticwarehousetasks | Each operation, such as folder creation, file scan or upload, is asynchronous and logged as a task. This attribute defines the name of the index that keeps the full task history (see the _ewtask REST endpoint for more details). |
| | elasticsearch.index.tasks.type | string | tasks | Data is stored inside a type, not directly inside the index. You can access tasks manually via the ElasticSearch REST API, e.g. http://<host>:<port>/index/type/_search |
| Global settings | elasticwarehouse.api.port | int | 10200 | Port the API listens on. If binding fails, ElasticWarehouse tries the next available port, i.e. 10201, 10202, 10203, etc. |
| | log.level | string | DEBUG | Log level. To limit log file size use INFO, WARN or ERROR. |
| | path.tmp | string | /tmp | Temp folder location. |
| | exclude.files | string | avi mp4 mkv | List of file extensions to be excluded and rejected by the cluster. |
| | thumb.size | int | 360 | ElasticWarehouse generates a thumbnail for every image uploaded to the cluster. Available sizes: 90, 180, 360, 720. |
| | tasks.max.number | int | 2 | Maximum number of asynchronous tasks to be executed (scan is one example of an asynchronous task; see _ewtask for more details). |
| | rrd.db.path | string | data folder | ElasticWarehouse logs performance counters for monitoring purposes. By default EW creates all RRD databases in the same folder where ElasticSearch creates its Lucene indices. |
| | rrd.hostname | string | localhost name | Set this attribute explicitly when you run several ElasticWarehouse instances (nodes) on the same machine. If not set, the hostname is used. |
| | rrd.enabled | boolean | true | Set to false to disable the performance counters collector. |
| | store.content | boolean | true | When store.content=true, ElasticWarehouse behaves as a data cloud: it stores the extracted file meta information and the file content inside the index. When store.content=false, ElasticWarehouse behaves as a data indexer only: it does not store the binary file content, only the path to the original file. When you set it to false you must configure store.folder. |
| | store.folder | string | /opt/upload | When you upload a file to ElasticWarehouse via _ewupload and store.content=false, the file content is saved to this folder. |
| | store.movescanned | boolean | false | When you use the "scan" task to import files into the ElasticWarehouse cluster, you can choose whether to make a copy of the original file or not. The copy is written to the location defined in store.folder. |
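With the default index and type names listed above, the generic http://<host>:<port>/index/type/_search pattern can be filled in as sketched below. The helper name ew_search_url is made up for this example, and localhost:9200 is an assumed ElasticSearch HTTP address; substitute the address of your own node.

```shell
# Build the _search URL for a given host, port, index and type.
# ew_search_url is a hypothetical helper, not part of ElasticWarehouse.
ew_search_url() {
    printf 'http://%s:%s/%s/%s/_search\n' "$1" "$2" "$3" "$4"
}

# Default storage index and type (uploaded files)
ew_search_url localhost 9200 elasticwarehousestorage files
# Default tasks index and type (task history)
ew_search_url localhost 9200 elasticwarehousetasks tasks

# To query a live node, pass the URL to curl, for example:
#   curl -s "$(ew_search_url localhost 9200 elasticwarehousestorage files)"
```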

The configuration file is loaded when ElasticWarehouse starts, so after each configuration change you must restart your ElasticWarehouse instance.
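As an example of such a change, switching an instance to remote mode touches three keys. The sketch below assumes that elasticwarehouse.yml uses the same flat "key: value" layout as elasticsearch.yml. The host names es-node1 and es-node2 and the port 9300 are placeholders, not values from the distribution.

```shell
# Sketch of remote-mode settings written to a scratch file for review.
# The file layout, host names and port are assumptions (see the text above).
EW_SETTINGS="$(mktemp)"

cat > "$EW_SETTINGS" <<'EOF'
mode.embedded: false
elasticsearch.cluster: elasticwarehouse
elasticsearch.hosts: es-node1,es-node2:9300
EOF

cat "$EW_SETTINGS"
```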

Note that some configuration changes (thumb.size, store.content, store.folder, store.movescanned, rrd.db.path, etc.) may require additional manual maintenance work, so make them with care.

For changes to thumb.size we prepared a dedicated task, “rethumb”, which recreates all thumbnails according to the currently loaded settings.

Logging configuration

By default, logs are stored in the logs folder (e.g. c:\opt\elasticwarehouse\logs or /opt/elasticwarehouse/logs). The log folder and log format can be changed by editing the log4j.properties file stored in the working folder of the ElasticWarehouse process, e.g. c:\opt\elasticwarehouse\bin\log4j.properties or /opt/elasticwarehouse/bin/log4j.properties.
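A minimal sketch of such a change, assuming the file follows the usual log4j 1.x properties format that its name suggests. The appender name "file", the log file name elasticwarehouse.log, the rotation limits and the pattern are illustrative choices, not the shipped defaults.

```shell
# Write a sample log4j 1.x style configuration to a scratch file for review.
# LOG_DIR and all values below are illustrative, not the distribution defaults.
LOG_DIR="/opt/elasticwarehouse/logs"
LOG4J_SAMPLE="$(mktemp)"

cat > "$LOG4J_SAMPLE" <<EOF
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=$LOG_DIR/elasticwarehouse.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.MaxBackupIndex=5
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %-5p [%c] %m%n
EOF

cat "$LOG4J_SAMPLE"
```

Changing the File entry moves the log folder, and changing ConversionPattern alters the log format.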