Class Harvester
- java.lang.Object
-
- de.pangaea.metadataportal.harvester.Harvester
-
- Direct Known Subclasses:
NoOpHarvester, OAIHarvesterBase, Rebuilder, SingleFileEntitiesHarvester
public abstract class Harvester extends Object
Harvester interface to panFMP. This class is the abstract superclass of all harvesters. It also supplies an entry point for the command line interface.
All panFMP harvesters support the following harvester properties:
- harvestMessageStep: after how many documents should the method addDocument(de.pangaea.metadataportal.processor.MetadataDocument) print a status message? (default: 100)
- numThreads: how many threads should process documents (XPath queries and XSL templates)? (default: 1) Raise this value if the indexer waits too often for more documents and you have more than one processor. The optimal value is one lower than the number of processors. If you have very simple metadata documents (simple XML schema) and few fields, lower values may be enough. The optimal value can only be found by testing.
- maxQueue: size of the queue for threads. (default: 100 metadata documents)
- bulkSize: size of bulk requests sent to Elasticsearch. (default: 100 metadata documents)
- concurrentBulkRequests: how many bulk requests can be sent to Elasticsearch in parallel. (default: 1)
- maxBulkMemory: maximum size of the CBOR/JSON source for a bulk request. Once a bulk grows larger than this, it is submitted. Note that a bulk might get significantly larger, because the check is done after the document is added. Must be given with a unit, such as MB for megabytes. (default: 5 MB)
- validate: validate harvested documents against the schema given in the configuration? (default: true, if a schema is given)
- conversionErrorAction: what to do if a conversion error occurs (e.g. number format error)? Can be STOP, IGNOREDOCUMENT, or DELETEDOCUMENT. (default is to stop the conversion)
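The note on maxBulkMemory says the size check happens after a document is added, which is why a bulk can exceed the limit. The following is a minimal sketch of that behaviour, not panFMP's actual code: the class BulkBufferSketch and all of its members are hypothetical stand-ins, and the limits are plain numbers rather than parsed property values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a bulk buffer; not the panFMP implementation.
final class BulkBufferSketch {
    private final int bulkSize;        // plays the role of the bulkSize property
    private final long maxBulkBytes;   // plays the role of maxBulkMemory, in bytes
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;

    // Byte size of every bulk that was "submitted" (recorded instead of sent).
    final List<Long> submittedSizes = new ArrayList<>();

    BulkBufferSketch(int bulkSize, long maxBulkBytes) {
        this.bulkSize = bulkSize;
        this.maxBulkBytes = maxBulkBytes;
    }

    void add(byte[] source) {
        pending.add(source);
        pendingBytes += source.length;
        // The check runs after the document was added,
        // so the submitted bulk may be larger than maxBulkBytes.
        if (pending.size() >= bulkSize || pendingBytes > maxBulkBytes) {
            flush();
        }
    }

    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        submittedSizes.add(pendingBytes);
        pending.clear();
        pendingBytes = 0;
    }
}
```

With a limit of 10 bytes, two documents of 8 bytes each end up in one bulk of 16 bytes: the first add stays below the limit, and the second add crosses it only after the document is already part of the bulk.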
- Author:
- Uwe Schindler
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
- protected Instant fromDateReference: Date from which documents should be harvested (in time reference of the original server).
- protected int harvestCount: Count of harvested documents.
- static String HARVESTER_METADATA_FIELD_LAST_HARVESTED
- protected int harvestMessageStep: Step at which addDocument(de.pangaea.metadataportal.processor.MetadataDocument) prints log messages.
- protected HarvesterConfig iconfig: Harvester configuration.
- protected org.apache.commons.logging.Log log: Logger instance (shared by all subclasses).
- protected DocumentProcessor processor: Instance of DocumentProcessor that converts and updates the Elasticsearch instance in other threads.
-
Constructor Summary
Constructors Constructor Description Harvester(HarvesterConfig iconfig)
Default constructor.
-
Method Summary
Methods (Modifier and Type, Method, Description):
- protected void addDocument(MetadataDocument mdoc): Adds a document to the processor working in the background.
- void close(boolean cleanShutdown): Closes harvester.
- MetadataDocument createMetadataDocumentInstance(): Creates an instance of MetadataDocument and initializes it with the harvester config.
- protected void deleteDocument(String identifier): Queues the given ID for deletion.
- protected void enumerateValidHarvesterPropertyNames(Set<String> props): This method is used by subclasses to enumerate all available harvester properties that are implemented by them.
- void finishReindex(boolean cleanShutdown): Does cleanup work after rebuilding the index by Rebuilder.
- Set<String> getValidHarvesterPropertyNames(): Returns the Set of harvester property names that this harvester supports.
- abstract void harvest(): This method is called by the harvester after it has been opened with open(de.pangaea.metadataportal.processor.ElasticsearchConnection, java.lang.String).
- protected static boolean isAllIndexes(String id)
- boolean isClosed(): Checks if harvester is closed.
- protected boolean isDocumentOutdated(Instant lastModified): Checks whether the supplied datestamp needs harvesting.
- static void main(String[] args): External entry point to the harvester interface.
- void open(ElasticsearchConnection es, String targetIndex): Opens harvester for harvesting documents described by the given HarvesterConfig.
- void prepareReindex(ElasticsearchConnection es, String targetIndex): Prepares harvester for rebuilding the index by Rebuilder.
- static boolean runHarvester(Config conf, String harvesterId): Harvests one (harvesterId='name') or more (harvesterId='*') sources.
- protected static boolean runHarvester(Config conf, String id, Class<? extends Harvester> harvesterClass): Harvests one (harvesterId="name") or more (harvesterId="*"/"all"/null) sources.
- protected void setHarvestingDateReference(Instant harvestingDateReference): Reference date of this harvesting event (in time reference of the original server).
- protected void setValidIdentifiers(Set<String> validIdentifiers): Sets the set of all "seen" valid identifiers.
-
-
-
Field Detail
-
log
protected final org.apache.commons.logging.Log log
Logger instance (shared by all subclasses).
-
processor
protected DocumentProcessor processor
Instance of DocumentProcessor that converts and updates the Elasticsearch instance in other threads.
-
iconfig
protected final HarvesterConfig iconfig
Harvester configuration
-
harvestCount
protected int harvestCount
Count of harvested documents. Incremented by addDocument(de.pangaea.metadataportal.processor.MetadataDocument).
-
harvestMessageStep
protected final int harvestMessageStep
Step at which addDocument(de.pangaea.metadataportal.processor.MetadataDocument) prints log messages. Can be changed by the harvester property harvestMessageStep.
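How harvestCount and harvestMessageStep interact can be illustrated with a small sketch. It assumes a message is emitted whenever the count reaches a multiple of the step; the exact condition and message text in panFMP are not stated here, and HarvestCounterSketch is a hypothetical stand-in that collects messages in a list instead of writing to the log.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; the real class logs through its Log instance.
final class HarvestCounterSketch {
    private final int harvestMessageStep;
    private int harvestCount = 0;

    // Collected status messages (assumed wording, for illustration only).
    final List<String> messages = new ArrayList<>();

    HarvestCounterSketch(int harvestMessageStep) {
        this.harvestMessageStep = harvestMessageStep;
    }

    void addDocument() {
        harvestCount++;
        // Assumption: report on every multiple of the configured step.
        if (harvestCount % harvestMessageStep == 0) {
            messages.add(harvestCount + " documents harvested so far.");
        }
    }

    int getHarvestCount() {
        return harvestCount;
    }
}
```

With the default step of 100, adding 250 documents under this assumption yields two messages, one at 100 and one at 200.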
-
fromDateReference
protected Instant fromDateReference
Date from which documents should be harvested (in time reference of the original server).
-
HARVESTER_METADATA_FIELD_LAST_HARVESTED
public static final String HARVESTER_METADATA_FIELD_LAST_HARVESTED
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
Harvester
public Harvester(HarvesterConfig iconfig)
Default constructor.
-
-
Method Detail
-
main
public static void main(String[] args)
External entry point to the harvester interface. Called from the Java command line with two parameters (config file, harvester name)
-
runHarvester
public static boolean runHarvester(Config conf, String harvesterId)
Harvests one (harvesterId='name') or more (harvesterId='*') sources. The harvester implementation is defined by the given configuration.
-
runHarvester
protected static boolean runHarvester(Config conf, String id, Class<? extends Harvester> harvesterClass)
Harvests one (harvesterId="name") or more (harvesterId="*"/"all"/null) sources. The harvester implementation is defined by the given configuration or, if harvesterClass is not null, by the specified harvester class. This is used by Rebuilder. Public code should use runHarvester(Config,String).
-
isAllIndexes
protected static boolean isAllIndexes(String id)
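No description is given for isAllIndexes. Judging from the runHarvester description, which treats "*", "all" and null as "more than one source", a plausible reading is sketched below. This is an assumption, not the documented behaviour: the class AllIndexesSketch is hypothetical, and whether the real check is case sensitive is unknown.

```java
// Hypothetical sketch of the check; the real method may differ.
final class AllIndexesSketch {
    private AllIndexesSketch() {
    }

    // Assumption: null, "*" and "all" select every configured source.
    static boolean isAllIndexes(String id) {
        return id == null || "*".equals(id) || "all".equals(id);
    }
}
```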
-
open
public void open(ElasticsearchConnection es, String targetIndex) throws Exception
Opens harvester for harvesting documents described by the given HarvesterConfig. Opens processor for usage in the harvest() method.
- Throws:
Exception - if an exception occurs during opening (various types of exceptions can be thrown).
-
prepareReindex
public void prepareReindex(ElasticsearchConnection es, String targetIndex) throws Exception
Prepares harvester for rebuilding the index by Rebuilder. By default this method does nothing, but it can be overridden by subclasses that need to set up additional things.
- Throws:
Exception - if an exception occurs during opening (various types of exceptions can be thrown).
-
finishReindex
public void finishReindex(boolean cleanShutdown) throws Exception
Does cleanup work after rebuilding the index by Rebuilder. By default this method does nothing, but it can be overridden by subclasses that need to shut down additional things.
- Throws:
Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not affect the correct document.
-
isClosed
public boolean isClosed()
Checks if harvester is closed.
-
close
public void close(boolean cleanShutdown) throws Exception
Closes harvester. All resources are freed and the processor is closed.
- Parameters:
cleanShutdown - enables writing of status information to the Elasticsearch instance for the next harvesting. If an error occurred during harvesting, this should not be done.
- Throws:
Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not affect the correct document.
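The cleanShutdown flag suggests a calling pattern in which close(true) is used only after a successful harvest. The sketch below shows one way a caller could arrange that. LifecycleSketch is a hypothetical stand-in interface with simplified signatures (the real open takes an ElasticsearchConnection and a target index), and it is not how runHarvester is known to be implemented.

```java
// Hypothetical stand-in for the open/harvest/close sequence.
interface LifecycleSketch {
    void open() throws Exception;

    void harvest() throws Exception;

    void close(boolean cleanShutdown) throws Exception;

    // Runs the full sequence and reports whether everything succeeded.
    static boolean run(LifecycleSketch harvester) {
        boolean clean = false;
        try {
            harvester.open();
            harvester.harvest();
            clean = true; // only reached if no exception was thrown
        } catch (Exception e) {
            // Harvesting failed: status information must not be written.
        } finally {
            try {
                harvester.close(clean);
            } catch (Exception e) {
                // close may rethrow a deferred background error
                clean = false;
            }
        }
        return clean;
    }
}
```

The close call sits in the finally block so that resources are freed on both paths, while the flag keeps a failed run from recording status for the next harvesting.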
-
createMetadataDocumentInstance
public MetadataDocument createMetadataDocumentInstance()
Creates an instance of MetadataDocument and initializes it with the harvester config. This method should be overridden if a harvester uses another class.
-
addDocument
protected void addDocument(MetadataDocument mdoc) throws Exception
Adds a document to the processor working in the background.
- Throws:
BackgroundFailure - if an error occurred in a background thread. Exceptions can be thrown asynchronously and may not affect the current document. The real exception is thrown again in close(boolean).
Exception
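The deferred-error behaviour described for BackgroundFailure can be sketched with a worker thread that records the first failure, a generic marker exception on later adds, and a rethrow of the original exception on close. Everything here is a hypothetical stand-in (BackgroundProcessorSketch, BackgroundFailureSketch, the convert rule); it illustrates the pattern only and makes no claim about how DocumentProcessor is built.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a processor that works in the background.
final class BackgroundProcessorSketch {
    // Hypothetical marker exception, playing the role of BackgroundFailure.
    static final class BackgroundFailureSketch extends RuntimeException {
        BackgroundFailureSketch() {
            super("a background thread failed");
        }
    }

    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private final AtomicReference<Exception> failure = new AtomicReference<>();

    void addDocument(String doc) {
        // The failure may stem from an earlier document, not this one.
        if (failure.get() != null) {
            throw new BackgroundFailureSketch();
        }
        pool.execute(() -> {
            try {
                convert(doc);
            } catch (Exception e) {
                failure.compareAndSet(null, e); // keep the first error only
            }
        });
    }

    // Assumed conversion rule for the demo: empty documents fail.
    private static void convert(String doc) throws Exception {
        if (doc.isEmpty()) {
            throw new Exception("cannot convert empty document");
        }
    }

    void close() throws Exception {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        Exception e = failure.get();
        if (e != null) {
            throw e; // the real exception surfaces here
        }
    }
}
```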
-
deleteDocument
protected void deleteDocument(String identifier) throws Exception
Queues the given ID for deletion. This delegates to addDocument(de.pangaea.metadataportal.processor.MetadataDocument) with an empty document.
- Throws:
BackgroundFailure - if an error occurred in a background thread. Exceptions can be thrown asynchronously and may not affect the current document. The real exception is thrown again in close(boolean).
Exception
-
isDocumentOutdated
protected boolean isDocumentOutdated(Instant lastModified)
Checks whether the supplied datestamp needs harvesting. This method can be used to find out whether a document needs harvesting.
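The comparison rule is not spelled out here. One plausible reading, given that fromDateReference is the date from which documents are harvested, is sketched below. The handling of null values and of equal timestamps is an assumption, and OutdatedCheckSketch is a hypothetical stand-in rather than the actual implementation.

```java
import java.time.Instant;

// Hypothetical sketch of an incremental-harvest check.
final class OutdatedCheckSketch {
    private final Instant fromDateReference; // may be null on a first harvest

    OutdatedCheckSketch(Instant fromDateReference) {
        this.fromDateReference = fromDateReference;
    }

    boolean isDocumentOutdated(Instant lastModified) {
        // Assumption: without a reference date or a datestamp, harvest anyway.
        if (fromDateReference == null || lastModified == null) {
            return true;
        }
        // Assumption: only documents changed after the reference need harvesting.
        return lastModified.isAfter(fromDateReference);
    }
}
```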
-
setHarvestingDateReference
protected void setHarvestingDateReference(Instant harvestingDateReference)
Reference date of this harvesting event (in time reference of the original server). This date is used on the next harvesting in variable fromDateReference. As long as this is null, the harvester will not write or update the value in Elasticsearch.
-
setValidIdentifiers
protected void setValidIdentifiers(Set<String> validIdentifiers)
Sets the set of all "seen" valid identifiers. Must be set before close(boolean) is called, as the information is passed to the processor before finalizing the index.
-
enumerateValidHarvesterPropertyNames
protected void enumerateValidHarvesterPropertyNames(Set<String> props)
This method is used by subclasses to enumerate all available harvester properties that are implemented by them. Override this method in your own implementation and append all harvester property names to the supplied Set. The public API for client code requesting property names is getValidHarvesterPropertyNames().
- See Also:
getValidHarvesterPropertyNames()
-
getValidHarvesterPropertyNames
public final Set<String> getValidHarvesterPropertyNames()
Returns the Set of harvester property names that this harvester supports. This method is called on Config loading to check if all property names in the config file are correct. You cannot override this method in your own implementation, as this method is responsible for returning an unmodifiable Set. For custom harvesters, append your property names in enumerateValidHarvesterPropertyNames(java.util.Set<java.lang.String>).
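The split between the final getter and the overridable enumerate method can be sketched as follows. BaseHarvesterSketch and MyHarvesterSketch are hypothetical stand-ins, and "myCustomProperty" is an invented property name; only the method names and the unmodifiable result are taken from the descriptions above.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the base class.
class BaseHarvesterSketch {
    // Subclasses override this and append their own names.
    protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
        props.add("harvestMessageStep");
        props.add("numThreads");
        props.add("bulkSize");
    }

    // Final: collects the names and guarantees an unmodifiable result.
    public final Set<String> getValidHarvesterPropertyNames() {
        Set<String> props = new TreeSet<>();
        enumerateValidHarvesterPropertyNames(props);
        return Collections.unmodifiableSet(props);
    }
}

// Hypothetical custom harvester that adds one property of its own.
class MyHarvesterSketch extends BaseHarvesterSketch {
    @Override
    protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
        super.enumerateValidHarvesterPropertyNames(props); // keep inherited names
        props.add("myCustomProperty"); // invented name, for illustration
    }
}
```

Calling super first keeps the inherited property names valid for the subclass. Callers that try to modify the returned Set get an UnsupportedOperationException.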
-
harvest
public abstract void harvest() throws Exception
This method is called by the harvester after it has been opened with open(de.pangaea.metadataportal.processor.ElasticsearchConnection, java.lang.String). Override this method in your harvester class. This method should harvest files from somewhere, generate MetadataDocuments and add them with addDocument(de.pangaea.metadataportal.processor.MetadataDocument).
- Throws:
Exception - of any type.
-
-