Class OAIHarvesterBase
- java.lang.Object
  - de.pangaea.metadataportal.harvester.Harvester
    - de.pangaea.metadataportal.harvester.OAIHarvesterBase
- Direct Known Subclasses:
OAIHarvester, OAIStaticRepositoryHarvester
public abstract class OAIHarvesterBase extends Harvester
Abstract base class for OAI harvesting support in panFMP. Use one of the subclasses for harvesting OAI-PMH or OAI Static Repositories.
This harvester supports the following additional harvester properties:
- setSpec: OAI set to harvest (default: none)
- retryCount: number of retries on HTTP errors (default: 5)
- retryAfterSeconds: time between retries in seconds (default: 60)
- timeoutAfterSeconds: HTTP timeout for harvesting in seconds
- authorizationHeader: optional 'Authorization' HTTP header contents to be sent with the request
- metadataPrefix: OAI metadata prefix to harvest
- identifierPrefix: prepend all identifiers returned by OAI with this string
- ignoreDatestamps: does full harvesting, while ignoring all datestamps. They are saved, but ignored, if invalid.
- deleteMissingDocuments: remove documents after harvesting that were deleted from source (may be a heavy operation). The harvester only does this on full harvesting, not on incremental harvesting. (default: true)
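As an illustration of how these properties and their documented defaults fit together, here is a minimal sketch. The class `OaiSettings` and the plain `Map<String, String>` input are hypothetical; panFMP reads these values through its own `HarvesterConfig`, whose API is not shown on this page. Only the property names and the defaults (retryCount 5, retryAfterSeconds 60, ignoreDatestamps false, deleteMissingDocuments true) are taken from this page.

```java
import java.util.Map;

/** Hypothetical holder for a few of the harvester properties listed above. */
final class OaiSettings {
    final String setSpec;                 // null means: no set restriction
    final int retryCount;
    final int retryAfterSeconds;
    final boolean ignoreDatestamps;
    final boolean deleteMissingDocuments;

    OaiSettings(Map<String, String> props) {
        // A missing key falls back to the default documented on this page.
        setSpec = props.get("setSpec");
        retryCount = Integer.parseInt(props.getOrDefault("retryCount", "5"));
        retryAfterSeconds = Integer.parseInt(props.getOrDefault("retryAfterSeconds", "60"));
        ignoreDatestamps = Boolean.parseBoolean(props.getOrDefault("ignoreDatestamps", "false"));
        deleteMissingDocuments = Boolean.parseBoolean(props.getOrDefault("deleteMissingDocuments", "true"));
    }
}
```

For example, `new OaiSettings(Map.of("retryCount", "2"))` would keep every other value at its default.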
- Author:
- Uwe Schindler
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
protected String authorizationHeader - the authorizationHeader from configuration
static int DEFAULT_RETRY_COUNT
static int DEFAULT_RETRY_TIME
static int DEFAULT_TIMEOUT
protected boolean deleteMissingDocuments - If enabled, on any kind of full harvesting it will track all valid identifiers and delete all documents in the index whose identifiers were not seen.
protected boolean filterIncomingSets - The harvester should filter incoming documents according to its set metadata.
protected HttpClient httpClient - HttpClient to use, configured with correct connect timeout.
protected String identifierPrefix - prepend all identifiers returned by OAI with this string
protected boolean ignoreDatestamps - If enabled, does full harvesting, while ignoring all datestamps (default is false).
protected String metadataPrefix - the used metadata prefix from the configuration
static String OAI_NS
static String OAI_STATICREPOSITORY_NS
protected int retryCount - the retryCount from configuration
protected int retryTime - the retryTime from configuration
protected Set<String> sets - the sets to harvest from the configuration, null to harvest all
protected Duration timeout - the timeout from configuration
static String USER_AGENT
-
Fields inherited from class de.pangaea.metadataportal.harvester.Harvester
fromDateReference, harvestCount, HARVESTER_METADATA_FIELD_LAST_HARVESTED, harvestMessageStep, iconfig, log, processor
-
-
Constructor Summary
Constructors (Constructor, Description):
OAIHarvesterBase(HarvesterConfig iconfig)
-
Method Summary
Methods (Modifier and Type, Method, Description):
void addDocument(MetadataDocument mdoc) - Adds a document to the Harvester.processor working in the background.
protected void cancelMissingDocumentDelete() - Disable the property "deleteMissingDocuments" for this instance.
void close(boolean cleanShutdown) - Closes harvester.
MetadataDocument createMetadataDocumentInstance() - Creates an instance of MetadataDocument and initializes it with the harvester config.
protected boolean doParse(Supplier<ExtendedDigester> digSupplier, String url, AtomicReference<Instant> checkModifiedDate) - Harvests a URL using the supplied digester.
protected void enableMissingDocumentDelete() - Enable unseen document deletes.
protected void enumerateValidHarvesterPropertyNames(Set<String> props) - This method is used by subclasses to enumerate all available harvester properties that are implemented by them.
protected EntityResolver getEntityResolver(EntityResolver parent) - Returns an EntityResolver that resolves all HTTP URLs using getInputSource(java.net.URI, java.util.concurrent.atomic.AtomicReference<java.time.Instant>).
protected InputSource getInputSource(URI url, AtomicReference<Instant> checkModifiedDate) - Returns a SAX InputSource for retrieving stream data of a URL.
protected org.apache.commons.digester.ObjectCreationFactory getMetadataDocumentFactory() - Returns a factory for creating the MetadataDocuments in Digester code (using FactoryCreateRule).
void open(ElasticsearchConnection es, String targetIndex) - Opens harvester for harvesting documents described by the given HarvesterConfig.
protected abstract void recreateDigester() - Recreates all digesters that are used by parsing the OAI XML.
protected void reset() - Resets the internal variables.
-
Methods inherited from class de.pangaea.metadataportal.harvester.Harvester
deleteDocument, finishReindex, getValidHarvesterPropertyNames, harvest, isAllIndexes, isClosed, isDocumentOutdated, main, prepareReindex, runHarvester, runHarvester, setHarvestingDateReference, setValidIdentifiers
-
-
-
-
Field Detail
-
OAI_NS
public static final String OAI_NS
- See Also:
- Constant Field Values
-
OAI_STATICREPOSITORY_NS
public static final String OAI_STATICREPOSITORY_NS
- See Also:
- Constant Field Values
-
DEFAULT_RETRY_TIME
public static final int DEFAULT_RETRY_TIME
- See Also:
- Constant Field Values
-
DEFAULT_RETRY_COUNT
public static final int DEFAULT_RETRY_COUNT
- See Also:
- Constant Field Values
-
DEFAULT_TIMEOUT
public static final int DEFAULT_TIMEOUT
- See Also:
- Constant Field Values
-
USER_AGENT
public static final String USER_AGENT
-
metadataPrefix
protected final String metadataPrefix
the used metadata prefix from the configuration
-
identifierPrefix
protected final String identifierPrefix
prepend all identifiers returned by OAI with this string
-
sets
protected final Set<String> sets
the sets to harvest from the configuration, null to harvest all
-
retryCount
protected final int retryCount
the retryCount from configuration
-
retryTime
protected final int retryTime
the retryTime from configuration
-
timeout
protected final Duration timeout
the timeout from configuration
-
authorizationHeader
protected final String authorizationHeader
the authorizationHeader from configuration
-
ignoreDatestamps
protected final boolean ignoreDatestamps
If enabled, does full harvesting, while ignoring all datestamps (default is false). They are saved, but ignored, if invalid.
-
deleteMissingDocuments
protected final boolean deleteMissingDocuments
If enabled, on any kind of full harvesting it will track all valid identifiers and delete all documents in the index whose identifiers were not seen.
-
httpClient
protected final HttpClient httpClient
HttpClient to use, configured with correct connect timeout.
-
filterIncomingSets
protected boolean filterIncomingSets
The harvester should filter incoming documents according to its set metadata. Should be disabled for OAI-PMH protocol with only one set. Default is true.
-
-
Constructor Detail
-
OAIHarvesterBase
public OAIHarvesterBase(HarvesterConfig iconfig)
-
-
Method Detail
-
open
public void open(ElasticsearchConnection es, String targetIndex) throws Exception
Description copied from class: Harvester
Opens harvester for harvesting documents described by the given HarvesterConfig. Opens Harvester.processor for usage in the Harvester.harvest() method.
-
addDocument
public void addDocument(MetadataDocument mdoc) throws Exception
Description copied from class: Harvester
Adds a document to the Harvester.processor working in the background.
- Overrides:
addDocument in class Harvester
- Throws:
BackgroundFailure - if an error occurred in background thread. Exceptions can be thrown asynchronously and may not affect the current document. The real exception is thrown again in Harvester.close(boolean).
Exception
-
createMetadataDocumentInstance
public MetadataDocument createMetadataDocumentInstance()
Description copied from class: Harvester
Creates an instance of MetadataDocument and initializes it with the harvester config. This method should be overridden if a harvester uses another class.
- Overrides:
createMetadataDocumentInstance in class Harvester
-
getMetadataDocumentFactory
protected org.apache.commons.digester.ObjectCreationFactory getMetadataDocumentFactory()
Returns a factory for creating the MetadataDocuments in Digester code (using FactoryCreateRule).
- See Also:
createMetadataDocumentInstance()
-
recreateDigester
protected abstract void recreateDigester()
Recreates all digesters that are used for parsing the OAI XML. This method is called once initially and again after network errors, before the same document is parsed again. This makes it possible to recover when document parsing fails somewhere in the middle of a document.
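The recovery pattern described here (build a fresh parser, then parse the same document again) can be sketched as follows. This is not panFMP code: `RetrySketch`, `ParseStep` and `parseWithRetry` are hypothetical names, and the assumption that one initial attempt is followed by up to `retryCount` retries is mine.

```java
import java.io.IOException;
import java.util.function.Supplier;

/** Hypothetical sketch of "recreate the parser, then retry" on network errors. */
final class RetrySketch {
    /** Stand-in for a parsing step that may fail in the middle of a document. */
    interface ParseStep<P> {
        void parse(P parser) throws IOException;
    }

    /**
     * Runs the step with a freshly created parser. On IOException a new parser
     * is created and the same document is parsed again, for at most retryCount
     * retries after the first attempt. Returns the number of attempts used.
     */
    static <P> int parseWithRetry(Supplier<P> recreate, ParseStep<P> step,
                                  int retryCount, long retryMillis)
            throws IOException, InterruptedException {
        int attempts = 0;
        while (true) {
            P parser = recreate.get();   // never reuse a parser that failed mid-document
            attempts++;
            try {
                step.parse(parser);
                return attempts;
            } catch (IOException e) {
                if (attempts > retryCount) {
                    throw e;             // retries exhausted, give up
                }
                Thread.sleep(retryMillis);
            }
        }
    }
}
```

Creating the parser inside the loop is the point of the design: a parser that stopped halfway through a document may hold partial state, so it is discarded instead of reset.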
-
doParse
protected boolean doParse(Supplier<ExtendedDigester> digSupplier, String url, AtomicReference<Instant> checkModifiedDate) throws Exception
Harvests a URL using the supplied digester.
- Parameters:
digSupplier - a Supplier that gives access to a (possibly recreated) digester instance.
url - the URL that is parsed by this digester instance.
checkModifiedDate - for static repositories, it is possible to give a reference to an Instant for checking the last modification; in this case false is returned if the URL was not modified. If it was modified, the reference contains a new Instant with the new modification date. Supply null for no checking of last modification; a last modification date is then not returned (as there is no reference).
- Returns:
true if harvested, false if not modified and no harvesting was done.
- Throws:
Exception
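The in/out contract of checkModifiedDate can be illustrated with a small sketch. `ModifiedCheckSketch.shouldHarvest` is a hypothetical helper, and the way it compares dates (harvest only if the source is strictly newer) is an assumption; this page does not say how the real method decides.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical illustration of the checkModifiedDate contract described above. */
final class ModifiedCheckSketch {
    /**
     * @param sourceLastModified modification time reported by the source (assumed known here)
     * @param checkModifiedDate  in/out reference, or null to skip the check
     * @return true if harvesting should proceed, false if the source was not modified
     */
    static boolean shouldHarvest(Instant sourceLastModified,
                                 AtomicReference<Instant> checkModifiedDate) {
        if (checkModifiedDate == null) {
            return true;                            // no check requested, nothing reported back
        }
        Instant known = checkModifiedDate.get();
        if (known != null && !sourceLastModified.isAfter(known)) {
            return false;                           // unchanged since the last harvest
        }
        checkModifiedDate.set(sourceLastModified);  // report the new modification date
        return true;
    }
}
```

A caller would keep the `AtomicReference` across runs and read the updated value after a call that returned true.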
-
getEntityResolver
protected EntityResolver getEntityResolver(EntityResolver parent)
Returns an EntityResolver that resolves all HTTP URLs using getInputSource(java.net.URI, java.util.concurrent.atomic.AtomicReference<java.time.Instant>).
- Parameters:
parent - an EntityResolver that receives all unprocessed requests
- See Also:
getInputSource(java.net.URI, java.util.concurrent.atomic.AtomicReference<java.time.Instant>)
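A delegating resolver of this kind might look like the sketch below. It uses only the standard SAX interfaces; `DelegatingResolverSketch` and `HttpSourceOpener` are hypothetical, with the opener standing in for getInputSource. Matching on the `http://` and `https://` prefixes is an assumption about what "HTTP URLs" covers.

```java
import java.io.IOException;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical sketch of a resolver that handles HTTP(S) and delegates the rest. */
final class DelegatingResolverSketch {
    /** Stand-in for getInputSource(URI, AtomicReference<Instant>). */
    interface HttpSourceOpener {
        InputSource open(String systemId) throws IOException;
    }

    static EntityResolver create(EntityResolver parent, HttpSourceOpener opener) {
        return (publicId, systemId) -> {
            if (systemId != null
                    && (systemId.startsWith("http://") || systemId.startsWith("https://"))) {
                return opener.open(systemId);   // HTTP(S) entities use the harvester's own loader
            }
            // Everything else goes to the parent; null lets the parser use its default handling.
            return parent == null ? null : parent.resolveEntity(publicId, systemId);
        };
    }
}
```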
-
getInputSource
protected InputSource getInputSource(URI url, AtomicReference<Instant> checkModifiedDate) throws IOException
Returns a SAX InputSource for retrieving stream data of a URL. It is optimized for compression of the HTTP(S) protocol and timeout checking.
- Parameters:
url - the URL to open
checkModifiedDate - for static repositories, it is possible to give a reference to an Instant for checking the last modification; in this case null is returned if the URL was not modified. If it was modified, the reference contains a new Instant with the new modification date. Supply null for no checking of last modification; a last modification date is then not returned (as there is no reference).
- Throws:
IOException
- See Also:
getEntityResolver(org.xml.sax.EntityResolver)
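To make the combination of timeout, compression and modification checking concrete, here is a sketch of how such a request could be assembled. It assumes the JDK's `java.net.http` API, which this page does not confirm as the type behind `httpClient`. The class name, the choice of headers and the use of `If-Modified-Since` are illustrative assumptions, not the documented implementation.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: builds (but does not send) a conditional HTTP request. */
final class ConditionalRequestSketch {
    // HTTP-date in IMF-fixdate form, e.g. "Thu, 02 Jan 2020 03:04:05 GMT"
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
            .withZone(ZoneOffset.UTC);

    static HttpRequest build(URI url, Duration timeout, String authorizationHeader,
                             AtomicReference<Instant> checkModifiedDate) {
        HttpRequest.Builder b = HttpRequest.newBuilder(url)
                .timeout(timeout)
                // Ask for a compressed response; java.net.http does not decode it automatically.
                .header("Accept-Encoding", "gzip");
        if (authorizationHeader != null) {
            b.header("Authorization", authorizationHeader);
        }
        if (checkModifiedDate != null && checkModifiedDate.get() != null) {
            // A "304 Not Modified" answer would correspond to the documented null return value.
            b.header("If-Modified-Since", HTTP_DATE.format(checkModifiedDate.get()));
        }
        return b.GET().build();
    }
}
```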
-
reset
protected void reset()
Resets the internal variables.
-
enableMissingDocumentDelete
protected void enableMissingDocumentDelete()
Enable unseen document deletes. This should be enabled by the harvester before calling addDocument(MetadataDocument), so tracking can be enabled.
-
cancelMissingDocumentDelete
protected void cancelMissingDocumentDelete()
Disable the property "deleteMissingDocuments" for this instance. This can be used when the container (like a ZIP file) was not modified and the documents it contains are therefore not enumerated. Call this to prevent deletion of all these documents.
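The interplay of enabling, tracking and cancelling can be sketched with a small set-based tracker. `MissingDocumentTracker` is a hypothetical class; the real harvester delegates this work to its base class and the index, and nothing below reflects its actual data structures.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of tracking seen identifiers for missing-document deletion. */
final class MissingDocumentTracker {
    private Set<String> validIdentifiers;   // null while tracking is disabled

    /** Start tracking; call this before the first document is added. */
    void enable() {
        validIdentifiers = new HashSet<>();
    }

    /** Stop tracking, e.g. when an unmodified container was skipped entirely. */
    void cancel() {
        validIdentifiers = null;
    }

    /** Record an identifier that was seen during this harvest. */
    void seen(String identifier) {
        if (validIdentifiers != null) {
            validIdentifiers.add(identifier);
        }
    }

    /** Identifiers present in the index but not seen; empty if tracking is disabled. */
    Set<String> toDelete(Set<String> indexedIdentifiers) {
        if (validIdentifiers == null) {
            return Set.of();
        }
        Set<String> missing = new HashSet<>(indexedIdentifiers);
        missing.removeAll(validIdentifiers);
        return missing;
    }
}
```

The sketch shows why cancelling matters: if tracking stayed enabled while no documents were enumerated, every indexed identifier would count as missing.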
-
close
public void close(boolean cleanShutdown) throws Exception
Description copied from class: Harvester
Closes harvester. All resources are freed and the Harvester.processor is closed.
- Overrides:
close in class Harvester
- Parameters:
cleanShutdown - enables writing of status information to the Elasticsearch instance for the next harvesting. If an error occurred during harvesting this should not be done.
- Throws:
Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not affect the current document.
-
enumerateValidHarvesterPropertyNames
protected void enumerateValidHarvesterPropertyNames(Set<String> props)
Description copied from class: Harvester
This method is used by subclasses to enumerate all available harvester properties that are implemented by them. Override this method in your own implementation and append all harvester property names to the supplied Set. The public API for client code requesting property names is Harvester.getValidHarvesterPropertyNames().
- Overrides:
enumerateValidHarvesterPropertyNames in class Harvester
- See Also:
Harvester.getValidHarvesterPropertyNames()
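The override pattern described above might look like the following sketch. `BaseSketch` and `OaiSketch` are hypothetical stand-ins for Harvester and a subclass; the body of the public accessor (a `TreeSet` filled by the protected method) is an assumption, since this page documents only the method names.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for the real Harvester base class. */
abstract class BaseSketch {
    protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
        // The real base class would contribute its own property names here.
    }

    /** Public entry point: collects names from the whole class hierarchy. */
    public Set<String> getValidHarvesterPropertyNames() {
        Set<String> props = new TreeSet<>();
        enumerateValidHarvesterPropertyNames(props);
        return props;
    }
}

/** Hypothetical subclass that appends the OAI property names listed on this page. */
class OaiSketch extends BaseSketch {
    @Override
    protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
        super.enumerateValidHarvesterPropertyNames(props);   // keep inherited names
        props.addAll(Arrays.asList(
                "setSpec", "retryCount", "retryAfterSeconds", "timeoutAfterSeconds",
                "authorizationHeader", "metadataPrefix", "identifierPrefix",
                "ignoreDatestamps", "deleteMissingDocuments"));
    }
}
```

Calling `super` first means each level of the hierarchy only has to list the properties it adds itself.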
-
-