Documentation: HowTo
This is a quick guide to building a metadata portal:
Prerequisites
To install panFMP, the following is required:
- Java 1.5 or newer
- Internet connection
- Enough disk space to hold search indexes
For a search interface (not included; it must be written using the supplied API) you need one of the following:
- Java 1.5 or newer and a web application container (e.g. Jetty, Tomcat, Sun Application/Web Server) for your servlets/JSPs [see the notes below, because classpath configuration may be difficult].
- A scripting language with web services support (e.g. PHP >= 5.2; older versions have bugs that prevent correct use of web services). This is needed if you do not plan to use the native Java API and want to write your user interface in a scripting language. To use it, you must install the supplied AXIS web application in a web container (see above). To simplify this, panFMP contains a bundled, preconfigured Jetty web server that listens on localhost, port 8801.
Installation
Extract the binary package. The installation directory contains the following subdirectories:
- libs: contains all needed JAR files for building the classpath.
- repository: this directory stores the harvested metadata and contains an example config file.
- scripts: ready-to-use example shell scripts to start harvesting, run the Jetty web server, and maintain the indexes.
- axis-webapp: this web application is configured for AXIS web services. Please note that panFMP does not work if you simply put its JAR files into a webapp. The JAR files must be in the classpath of the servlet container itself [see below]! This web application runs on localhost, port 8801, when the bundled Jetty web server is started.
- javadocs: Documentation of the Java classes.
- examples: Real-life examples as basis for your own portal developments (written in PHP and Java)
Configuration
Creating a configuration file is the essential step to harvest the first metadata records. The subdirectory "repository" contains an example config file "config.xml". The example is configured for the DIF metadata format and harvests 3 data providers (PANGAEA, IFREMER, COPEPOD).
To create a new file, it is recommended to use the existing config.xml as a basis. Experience with and detailed knowledge of standard XML features like namespaces, namespace prefixes, QNames, and XPath is highly recommended! panFMP makes heavy use of namespaces in its config file to address features of the XSLT language and parts of the harvested metadata (which normally also have their own namespace). Familiarity with XSLT is also highly recommended, because you will recognize several of its components inside the configuration.
The configuration file starts with the definition of the metadata format inside <cfg:metadata/>. An XML schema may be assigned inside <cfg:schema/>; it is used to validate the metadata records coming from the harvester before they are analyzed. It is best to declare the namespace prefixes needed for accessing parts of the metadata schema directly on the <cfg:metadata/> element (as in the example for DIF), just as it is done in XSLT.
Index fields (comparable to columns in a relational database) are listed inside <cfg:fields/>. In the simple case, a field consists of an XPath expression (inside <cfg:field/>) that returns the field contents from the harvested metadata. Some example fields are declared in the config file. If you need a more complicated "algorithm" to assign values to fields, you may use <cfg:field-template/>, which is an extension of the simple XPath notation. Each <cfg:field-template/> may contain a standard XSLT template (see the XSLT documentation). For this to work, do not forget to declare the XSL namespace and bind it to a prefix, e.g. "xsl:"! The result of executing the template is written into the index field.
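To make the role of namespace prefixes in such field expressions concrete, here is a minimal Java sketch that evaluates an XPath expression with the standard JAXP API. It does not use panFMP itself; the class name, the sample record, and the namespace URI are assumptions made up for illustration. Note that the prefix used in the expression ("md") differs from the one used in the document ("m"): only the namespace URI has to match.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathFieldDemo {
    // Hypothetical namespace URI and record, for illustration only.
    static final String NS = "http://example.org/metadata";
    static final String SAMPLE =
        "<m:record xmlns:m=\"" + NS + "\">"
        + "<m:title>Sample data set</m:title>"
        + "<m:keyword>ocean</m:keyword><m:keyword>plankton</m:keyword>"
        + "</m:record>";

    // Evaluates an XPath 1.0 expression against the XML text and returns
    // the string value of the result (empty string if nothing matches).
    static String extractField(String xml, String expression) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // without this, prefixed names never match
        Document doc = dbf.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the prefix "md" (used in expressions) to the namespace URI.
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "md".equals(prefix) ? NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return NS.equals(uri) ? "md" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                return NS.equals(uri)
                    ? Collections.singletonList("md").iterator()
                    : Collections.<String>emptyList().iterator();
            }
        });
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractField(SAMPLE, "/md:record/md:title"));
        System.out.println(extractField(SAMPLE, "/md:record/md:keyword[1]"));
    }
}
```

An expression without prefixes, such as "/record/title", matches nothing in this record, because unprefixed names in XPath 1.0 refer to elements in no namespace. This is a common reason for empty index fields.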
To begin with, just ignore <cfg:filters/> and <cfg:variables/>; they are advanced features. Remove them from the sample file. Their meaning will become clear once you have read through the default config file and are familiar with all other concepts.
Each data provider is declared as a separate index inside <cfg:indexes/>. Besides the file system path to the storage location of the index (important: the index location must be different for each index / data provider!), each data provider needs a harvester class that is used to harvest, and the corresponding harvester properties (URLs, sets, and other properties needed for harvesting). These properties are specific to each harvester class. The supported harvesters, their abilities, and their properties are listed in the Java API documentation, and a howto about choosing the right harvester is given here.
Harvesting
Once you have developed your configuration file, you can start harvesting. Go to the "scripts" directory and start the shell script "harvest.sh" / "harvest.cmd" from there. During the harvesting process, log messages are written to the console or to a log file, depending on your configuration in the corresponding LOG4J properties file (also in the "scripts" directory).
After harvesting you will see indexes created in the directories given in the configuration file (below "repository"). You may look into them using the GUI software "Luke" (not bundled with panFMP).
In the future, panFMP will offer more debugging features for the XPath/XSLT support in the field definitions by providing a "fake" index builder that only prints the fields and properties of harvested documents and does not index them. If you want more information during harvesting (useful for finding errors in your XPath expressions), change the log level in the LOG4J properties file to "debug".
Developing Search Interfaces
There are two possible ways to implement a search interface. Please note: panFMP is a programming library with some command line tools on the harvesting side, but it does not include any search interface. All GUIs must be implemented by the metadata portal developer.
- Native Java API: This is the most flexible way. It is described almost completely in the Java API documentation.
- Web Service API: A simplified version of the native API is provided to build search interfaces that are not based on Java (e.g. PHP). The web frontend uses a web services client API (e.g. the SOAP extension of PHP 5.2) to access panFMP. For this to work, the AXIS interface of panFMP must be installed in a servlet container (see next section). To simplify this, panFMP contains a bundled, preconfigured Jetty web server that listens on localhost, port 8801. This search API is not yet fully documented, but the needed functions can be seen in the WSDL file (which can be retrieved as described in the AXIS documentation).
The "examples" directory contains two example implementations, covering both APIs.
Integration of panFMP search interfaces in other infrastructures (Tomcat, PHP,...)
panFMP is a large software package consisting of many third-party components (mostly from the Apache Foundation). To work correctly, panFMP needs versions of these external libraries that meet its minimum requirements in its Java classpath. All required external libraries are bundled in the binary package (in the "libs" directory).
Important packages are: Apache Lucene (v2.9.1 or above), Jakarta Commons Digester (v2.0 or above), XERCES (must support XInclude, tested is v2.9.0 or above), XALAN (must support JAXP XPath 1.0, tested is v2.7.0 or above). There are more packages in the distribution, mostly from Jakarta Commons.
Because these packages are rather new (and these new versions are really needed), it is important that you replace the older versions that ship with your servlet container (e.g. Tomcat, Sun Java System Web/Application Server, Jetty). Jetty is less problematic (as shown by the bundled Jetty instance); Tomcat and the Sun server are considerably more difficult. They often ship with very old versions of XERCES, XALAN and Digester. If you just put the panFMP JAR files into the webapp directory, they are ignored, because the servlet container has already loaded its own versions before panFMP's webapp is started.
To get panFMP working with these containers, there are two possibilities:
- Replace all JAR files found in your servlet container distribution with the newer ones. This is often not possible, because the classes to be replaced are embedded in one "big" JAR file that contains all packages together with the servlet container.
- Put the panFMP classpath in front of the classpath of the servlet container. This is possible by changing the configuration and startup scripts of your servlet container. The servlet container itself will then also use the newer versions, which normally works without problems.
After that you can deploy your web application without any panFMP JAR files (or with only those you did not add to the global classpath). LOG4J files are not needed for a servlet container, because it provides better logging mechanisms of its own. Configure the logging (e.g. the log level) in your servlet container setup.
If you want to use panFMP with scripting languages that have web service support, refer to the previous section. It describes how to install the web service API (which additionally needs Apache AXIS v1.4, also bundled) using the bundled Jetty web server, which is much less painful.
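To check which copy of a library the container actually loaded, a small diagnostic like the following can help. It is only a sketch and not part of panFMP: the class name ClassOriginCheck and its methods are hypothetical, and the probed class names are the usual entry classes of Lucene, Commons Digester, XERCES and XALAN. It uses only standard JDK calls to report the JAR or directory a class was loaded from.

```java
import java.security.CodeSource;

public class ClassOriginCheck {
    // Returns the location (JAR or directory) a class was loaded from.
    // Classes of the JDK runtime itself usually have no code source.
    static String originOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(bootstrap / JDK runtime)";
        }
        return src.getLocation().toString();
    }

    // Looks the class up by name without initializing it, using the
    // class loader that loaded this diagnostic class.
    static String originOfName(String className) {
        try {
            ClassLoader loader = ClassOriginCheck.class.getClassLoader();
            return originOf(Class.forName(className, false, loader));
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.lucene.index.IndexReader",
            "org.apache.commons.digester.Digester",
            "org.apache.xerces.parsers.SAXParser",
            "org.apache.xalan.processor.TransformerFactoryImpl"
        };
        for (String name : names) {
            System.out.println(name + " -> " + originOfName(name));
        }
    }
}
```

Run from inside the web application (for example from a test servlet or JSP), the reported locations should point to the panFMP "libs" JAR files and not to the JARs shipped with the container. Run from the command line, the output reflects the command line classpath instead.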