Class OAIHarvesterBase
- java.lang.Object
  - de.pangaea.metadataportal.harvester.Harvester
    - de.pangaea.metadataportal.harvester.OAIHarvesterBase

- Direct Known Subclasses:
- OAIHarvester, OAIStaticRepositoryHarvester

public abstract class OAIHarvesterBase extends Harvester

Abstract base class for OAI harvesting support in panFMP. Use one of the subclasses for harvesting OAI-PMH or OAI Static Repositories.

This harvester supports the following additional harvester properties:
- setSpec: OAI set to harvest (default: none)
- retryCount: number of retries on HTTP errors (default: 5)
- retryAfterSeconds: time between retries in seconds (default: 60)
- timeoutAfterSeconds: HTTP timeout for harvesting in seconds
- authorizationHeader: optional contents of the 'Authorization' HTTP header to be sent with each request
- metadataPrefix: OAI metadata prefix to harvest
- identifierPrefix: string to prepend to all identifiers returned by OAI
- ignoreDatestamps: performs a full harvest and ignores all datestamps. They are saved, but ignored if invalid. (default: false)
- deleteMissingDocuments: after harvesting, remove documents that were deleted from the source (may be a heavy operation). The harvester only does this on full harvesting, not on incremental harvesting. (default: true)
 - Author:
- Uwe Schindler
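
The documented defaults can be illustrated with a minimal sketch. It assumes the properties arrive as plain key/value strings; `java.util.Properties` and the class `OaiPropertiesSketch` are hypothetical stand-ins used only for illustration and are not part of panFMP, which passes its configuration through `HarvesterConfig`.

```java
import java.util.Properties;

/** Hypothetical stand-in: reads harvester properties, falling back to the documented defaults. */
final class OaiPropertiesSketch {
    // Defaults as stated in the property list above.
    static final int DEFAULT_RETRY_COUNT = 5;
    static final int DEFAULT_RETRY_AFTER_SECONDS = 60;

    final String setSpec;              // null means "no set restriction"
    final int retryCount;
    final int retryAfterSeconds;
    final boolean deleteMissingDocuments;

    OaiPropertiesSketch(Properties props) {
        setSpec = props.getProperty("setSpec");
        retryCount = intProperty(props, "retryCount", DEFAULT_RETRY_COUNT);
        retryAfterSeconds = intProperty(props, "retryAfterSeconds", DEFAULT_RETRY_AFTER_SECONDS);
        // Documented default for deleteMissingDocuments is true.
        deleteMissingDocuments =
                Boolean.parseBoolean(props.getProperty("deleteMissingDocuments", "true"));
    }

    /** Parses an integer property, or returns the default if the key is absent. */
    private static int intProperty(Properties props, String key, int defaultValue) {
        String raw = props.getProperty(key);
        return (raw == null) ? defaultValue : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("retryCount", "3"); // only this value overrides a default
        OaiPropertiesSketch cfg = new OaiPropertiesSketch(props);
        System.out.println(cfg.retryCount + " retries, " + cfg.retryAfterSeconds + " s apart");
    }
}
```

Properties without a documented default (for example timeoutAfterSeconds) are left out of the sketch on purpose rather than guessed.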
 
Field Summary

Fields (Modifier and Type, Field, Description):
- protected String authorizationHeader: the authorizationHeader from configuration
- static int DEFAULT_RETRY_COUNT
- static int DEFAULT_RETRY_TIME
- static int DEFAULT_TIMEOUT
- protected boolean deleteMissingDocuments: If enabled, on any kind of full harvesting it will track all valid identifiers and delete all documents from the index that were not seen.
- protected boolean filterIncomingSets: The harvester should filter incoming documents according to its set metadata.
- protected HttpClient httpClient: HttpClient to use, configured with correct connect timeout.
- protected String identifierPrefix: prepend all identifiers returned by OAI with this string
- protected boolean ignoreDatestamps: If enabled, does full harvesting, while ignoring all datestamps (default is false).
- protected String metadataPrefix: the used metadata prefix from the configuration
- static String OAI_NS
- static String OAI_STATICREPOSITORY_NS
- protected int retryCount: the retryCount from configuration
- protected int retryTime: the retryTime from configuration
- protected Set<String> sets: the sets to harvest from the configuration, null to harvest all
- protected Duration timeout: the timeout from configuration
- static String USER_AGENT

Fields inherited from class de.pangaea.metadataportal.harvester.Harvester:
- fromDateReference, harvestCount, HARVESTER_METADATA_FIELD_LAST_HARVESTED, harvestMessageStep, iconfig, log, processor
 
Constructor Summary

Constructors (Constructor, Description):
- OAIHarvesterBase(HarvesterConfig iconfig)

Method Summary

Methods (Modifier and Type, Method, Description):
- void addDocument(MetadataDocument mdoc): Adds a document to the Harvester.processor working in the background.
- protected void cancelMissingDocumentDelete(): Disable the property "deleteMissingDocuments" for this instance.
- void close(boolean cleanShutdown): Closes harvester.
- MetadataDocument createMetadataDocumentInstance(): Creates an instance of MetadataDocument and initializes it with the harvester config.
- protected boolean doParse(Supplier<ExtendedDigester> digSupplier, String url, AtomicReference<Instant> checkModifiedDate): Harvests a URL using the supplied digester.
- protected void enableMissingDocumentDelete(): Enable unseen document deletes.
- protected void enumerateValidHarvesterPropertyNames(Set<String> props): This method is used by subclasses to enumerate all available harvester properties that are implemented by them.
- protected EntityResolver getEntityResolver(EntityResolver parent): Returns an EntityResolver that resolves all HTTP URLs using getInputSource(java.net.URI, java.util.concurrent.atomic.AtomicReference<java.time.Instant>).
- protected InputSource getInputSource(URI url, AtomicReference<Instant> checkModifiedDate): Returns a SAX InputSource for retrieving stream data of a URL.
- protected org.apache.commons.digester.ObjectCreationFactory getMetadataDocumentFactory(): Returns a factory for creating the MetadataDocuments in Digester code (using FactoryCreateRule).
- void open(ElasticsearchConnection es, String targetIndex): Opens harvester for harvesting documents described by the given HarvesterConfig.
- protected abstract void recreateDigester(): Recreates all digesters that are used for parsing the OAI XML.
- protected void reset(): Resets the internal variables.

Methods inherited from class de.pangaea.metadataportal.harvester.Harvester:
- deleteDocument, finishReindex, getValidHarvesterPropertyNames, harvest, isAllIndexes, isClosed, isDocumentOutdated, main, prepareReindex, runHarvester, runHarvester, setHarvestingDateReference, setValidIdentifiers
 
Field Detail

OAI_NS
public static final String OAI_NS
- See Also:
- Constant Field Values

OAI_STATICREPOSITORY_NS
public static final String OAI_STATICREPOSITORY_NS
- See Also:
- Constant Field Values

DEFAULT_RETRY_TIME
public static final int DEFAULT_RETRY_TIME
- See Also:
- Constant Field Values

DEFAULT_RETRY_COUNT
public static final int DEFAULT_RETRY_COUNT
- See Also:
- Constant Field Values

DEFAULT_TIMEOUT
public static final int DEFAULT_TIMEOUT
- See Also:
- Constant Field Values

USER_AGENT
public static final String USER_AGENT

metadataPrefix
protected final String metadataPrefix
the used metadata prefix from the configuration

identifierPrefix
protected final String identifierPrefix
prepend all identifiers returned by OAI with this string

sets
protected final Set<String> sets
the sets to harvest from the configuration, null to harvest all

retryCount
protected final int retryCount
the retryCount from configuration

retryTime
protected final int retryTime
the retryTime from configuration

timeout
protected final Duration timeout
the timeout from configuration

authorizationHeader
protected final String authorizationHeader
the authorizationHeader from configuration

ignoreDatestamps
protected final boolean ignoreDatestamps
If enabled, does full harvesting, while ignoring all datestamps (default is false). They are saved, but ignored, if invalid.

deleteMissingDocuments
protected final boolean deleteMissingDocuments
If enabled, on any kind of full harvesting it will track all valid identifiers and delete all documents from the index that were not seen.

httpClient
protected final HttpClient httpClient
HttpClient to use, configured with correct connect timeout.

filterIncomingSets
protected boolean filterIncomingSets
The harvester should filter incoming documents according to its set metadata. Should be disabled for OAI-PMH protocol with only one set. Default is true.

Constructor Detail

OAIHarvesterBase
public OAIHarvesterBase(HarvesterConfig iconfig)

Method Detail

open
public void open(ElasticsearchConnection es, String targetIndex) throws Exception
Description copied from class: Harvester
Opens harvester for harvesting documents described by the given HarvesterConfig. Opens Harvester.processor for usage in the Harvester.harvest() method.

addDocument
public void addDocument(MetadataDocument mdoc) throws Exception
Description copied from class: Harvester
Adds a document to the Harvester.processor working in the background.
- Overrides:
- addDocument in class Harvester
- Throws:
- BackgroundFailure - if an error occurred in the background thread. Exceptions can be thrown asynchronously and may not affect the current document. The real exception is thrown again in Harvester.close(boolean).
- Exception

createMetadataDocumentInstance
public MetadataDocument createMetadataDocumentInstance()
Description copied from class: Harvester
Creates an instance of MetadataDocument and initializes it with the harvester config. This method should be overridden if a harvester uses another class.
- Overrides:
- createMetadataDocumentInstance in class Harvester

getMetadataDocumentFactory
protected org.apache.commons.digester.ObjectCreationFactory getMetadataDocumentFactory()
Returns a factory for creating the MetadataDocuments in Digester code (using FactoryCreateRule).
- See Also:
- createMetadataDocumentInstance()

recreateDigester
protected abstract void recreateDigester()
Recreates all digesters that are used for parsing the OAI XML. This method is called once initially and later, after network errors, before the same document is parsed again. This makes it possible to recover when parsing fails somewhere in the middle of a document.
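
The recovery pattern described for recreateDigester() can be sketched roughly as follows. This is not the panFMP implementation: the `Parser` interface, the class `RetryWithRecreateSketch`, and the exact retry policy (recreate after every failed attempt, give up after `retryCount` retries) are assumptions made for illustration.

```java
import java.io.IOException;
import java.util.function.Supplier;

final class RetryWithRecreateSketch {
    /** Hypothetical stand-in for a stateful parser such as a digester. */
    interface Parser {
        void parse(String url) throws IOException;
    }

    private final Supplier<Parser> factory;
    private Parser parser;
    int recreations = 0; // counts how often a fresh parser was created

    RetryWithRecreateSketch(Supplier<Parser> factory) {
        this.factory = factory;
        recreate(); // initial creation
    }

    private void recreate() {
        parser = factory.get();
        recreations++;
    }

    /** Parses the URL, retrying up to retryCount times with a fresh parser after each failure. */
    void parseWithRetry(String url, int retryCount, long retryMillis)
            throws IOException, InterruptedException {
        for (int attempt = 0; ; attempt++) {
            try {
                parser.parse(url);
                return;
            } catch (IOException e) {
                if (attempt >= retryCount) {
                    throw e; // retries exhausted
                }
                Thread.sleep(retryMillis);
                recreate(); // discard state left over from the half-parsed document
            }
        }
    }
}
```

Recreating the parser before each retry means a failure in the middle of a document cannot leak partially built objects into the next attempt.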

doParse
protected boolean doParse(Supplier<ExtendedDigester> digSupplier, String url, AtomicReference<Instant> checkModifiedDate) throws Exception
Harvests a URL using the supplied digester.
- Parameters:
- digSupplier - a Supplier that gives access to a (possibly recreated) digester instance.
- url - the URL that is parsed by this digester instance.
- checkModifiedDate - for static repositories, it is possible to give a reference to an Instant for checking the last modification; in this case false is returned if the URL was not modified. If it was modified, the reference contains a new Instant with the new modification date. Supply null to skip the last-modification check; a last modification date is then not returned (as there is no reference).
- Returns:
- true if harvested, false if not modified and no harvesting was done.
- Throws:
- Exception
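
The in/out contract of checkModifiedDate can be sketched in isolation. The class `ModifiedCheckSketch`, the method `shouldHarvest`, and the parameter `remoteLastModified` are hypothetical; how panFMP actually obtains and compares the remote modification date is not shown on this page.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

final class ModifiedCheckSketch {
    /**
     * Returns true if the resource should be harvested, false if it is unmodified.
     * remoteLastModified stands for the modification date reported by the server
     * (may be null if unknown); checkModifiedDate may be null to skip the check.
     */
    static boolean shouldHarvest(Instant remoteLastModified,
                                 AtomicReference<Instant> checkModifiedDate) {
        if (checkModifiedDate == null) {
            return true; // no check requested, and no date can be reported back
        }
        Instant known = checkModifiedDate.get();
        if (known != null && remoteLastModified != null
                && !remoteLastModified.isAfter(known)) {
            return false; // not modified since the known date
        }
        if (remoteLastModified != null) {
            checkModifiedDate.set(remoteLastModified); // report the new date to the caller
        }
        return true;
    }

    public static void main(String[] args) {
        AtomicReference<Instant> ref = new AtomicReference<>(Instant.parse("2020-01-01T00:00:00Z"));
        System.out.println(shouldHarvest(Instant.parse("2021-01-01T00:00:00Z"), ref));
        System.out.println(ref.get());
    }
}
```

Using an AtomicReference here is simply a way to pass a mutable holder, so one argument carries both the last known date in and the new date out.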
 

getEntityResolver
protected EntityResolver getEntityResolver(EntityResolver parent)
Returns an EntityResolver that resolves all HTTP URLs using getInputSource(java.net.URI, java.util.concurrent.atomic.AtomicReference<java.time.Instant>).
- Parameters:
- parent - an EntityResolver that receives all unprocessed requests
- See Also:
- getInputSource(java.net.URI, java.util.concurrent.atomic.AtomicReference<java.time.Instant>)

getInputSource
protected InputSource getInputSource(URI url, AtomicReference<Instant> checkModifiedDate) throws IOException
Returns a SAX InputSource for retrieving stream data of a URL. It is optimized for compression of the HTTP(S) protocol and timeout checking.
- Parameters:
- url - the URL to open
- checkModifiedDate - for static repositories, it is possible to give a reference to an Instant for checking the last modification; in this case null is returned if the URL was not modified. If it was modified, the reference contains a new Instant with the new modification date. Supply null to skip the last-modification check; a last modification date is then not returned (as there is no reference).
- Throws:
- IOException
- See Also:
- getEntityResolver(org.xml.sax.EntityResolver)
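
As a rough illustration of what a request with compression, a timeout, and a modification check might look like, the sketch below builds a conditional GET with the JDK's `java.net.http.HttpRequest`. It assumes the JDK HTTP client is the one in use and that the check maps to an `If-Modified-Since` header; neither is confirmed by this page, and the class `ConditionalRequestSketch` and its methods are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class ConditionalRequestSketch {
    // Fixed-length HTTP date format, always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
            .withZone(ZoneOffset.UTC);

    static String httpDate(Instant instant) {
        return HTTP_DATE.format(instant);
    }

    /** Builds a GET request; lastKnown may be null to skip the modification check. */
    static HttpRequest buildRequest(URI url, Duration timeout, Instant lastKnown) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(url)
                .timeout(timeout)                   // fail if no response arrives in time
                .header("Accept-Encoding", "gzip")  // caller must decompress the body itself
                .GET();
        if (lastKnown != null) {
            // Server may answer 304 Not Modified if nothing changed since lastKnown.
            builder.header("If-Modified-Since", httpDate(lastKnown));
        }
        return builder.build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest(
                URI.create("http://example.org/oai"), Duration.ofSeconds(30), Instant.EPOCH);
        System.out.println(request.headers().map());
    }
}
```

A caller following the documented contract would treat a 304 response as "return null" and would otherwise store the response's modification date in the supplied reference.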
 

reset
protected void reset()
Resets the internal variables.

enableMissingDocumentDelete
protected void enableMissingDocumentDelete()
Enable unseen document deletes. This should be enabled by the harvester before calling addDocument(MetadataDocument), so tracking can be enabled.

cancelMissingDocumentDelete
protected void cancelMissingDocumentDelete()
Disable the property "deleteMissingDocuments" for this instance. This can be used when the container (like a ZIP file) was not modified and the documents it contains are therefore not enumerated. Call this to prevent deletion of all these documents.
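
The interplay of enabling, tracking, and cancelling can be sketched with plain sets. The class `MissingDocumentTrackerSketch` and its method names are hypothetical and only mirror the behavior described above; the real harvester works against the Elasticsearch index rather than an in-memory set.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

final class MissingDocumentTrackerSketch {
    private Set<String> seen = null; // null means tracking is switched off

    /** Starts tracking; must happen before the first document is added. */
    void enable() {
        if (seen == null) {
            seen = new HashSet<>();
        }
    }

    /** Stops tracking, so that nothing is deleted afterwards. */
    void cancel() {
        seen = null;
    }

    /** Records the identifier of a harvested document while tracking is on. */
    void addDocument(String identifier) {
        if (seen != null) {
            seen.add(identifier);
        }
    }

    /** Returns indexed identifiers that were not seen; empty if tracking is off. */
    Set<String> identifiersToDelete(Set<String> indexed) {
        Set<String> result = new TreeSet<>();
        if (seen != null) {
            result.addAll(indexed);
            result.removeAll(seen);
        }
        return result;
    }

    public static void main(String[] args) {
        MissingDocumentTrackerSketch tracker = new MissingDocumentTrackerSketch();
        tracker.enable();
        tracker.addDocument("a");
        System.out.println(tracker.identifiersToDelete(Set.of("a", "b")));
    }
}
```

The cancel step matters for the unmodified-container case: without it, an empty set of seen identifiers would mark every indexed document for deletion.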

close
public void close(boolean cleanShutdown) throws Exception
Description copied from class: Harvester
Closes harvester. All resources are freed and the Harvester.processor is closed.
- Overrides:
- close in class Harvester
- Parameters:
- cleanShutdown - enables writing of status information to the Elasticsearch instance for the next harvesting. If an error occurred during harvesting this should not be done.
- Throws:
- Exception - if an exception occurs during closing (various types of exceptions can be thrown). Exceptions can be thrown asynchronously and may not affect the current document.

enumerateValidHarvesterPropertyNames
protected void enumerateValidHarvesterPropertyNames(Set<String> props)
Description copied from class: Harvester
This method is used by subclasses to enumerate all available harvester properties that are implemented by them. Override this method in your own implementation and append all harvester property names to the supplied Set. The public API for client code requesting property names is Harvester.getValidHarvesterPropertyNames().
- Overrides:
- enumerateValidHarvesterPropertyNames in class Harvester
- See Also:
- Harvester.getValidHarvesterPropertyNames()
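
The override-and-append pattern can be sketched with stand-in classes. `BaseHarvester`, `OaiLikeHarvester`, and the property name `exampleBaseProperty` are hypothetical; only the two method names and the OAI property names are taken from this page. Calling `super` first is an assumption about how inherited property names are kept.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

final class PropertyEnumerationSketch {
    /** Hypothetical stand-in for the base harvester class. */
    static class BaseHarvester {
        protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
            props.add("exampleBaseProperty"); // hypothetical name, for illustration only
        }

        /** Public entry point: collects the names contributed by the whole class hierarchy. */
        public final Set<String> getValidHarvesterPropertyNames() {
            Set<String> props = new TreeSet<>();
            enumerateValidHarvesterPropertyNames(props);
            return Collections.unmodifiableSet(props);
        }
    }

    /** Hypothetical subclass that adds the OAI properties documented for this class. */
    static class OaiLikeHarvester extends BaseHarvester {
        @Override
        protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
            super.enumerateValidHarvesterPropertyNames(props); // keep inherited names
            props.addAll(Arrays.asList(
                    "setSpec", "retryCount", "retryAfterSeconds", "timeoutAfterSeconds",
                    "authorizationHeader", "metadataPrefix", "identifierPrefix",
                    "ignoreDatestamps", "deleteMissingDocuments"));
        }
    }

    public static void main(String[] args) {
        System.out.println(new OaiLikeHarvester().getValidHarvesterPropertyNames());
    }
}
```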
 
 