Class WebCrawlingHarvester
- java.lang.Object
  - de.pangaea.metadataportal.harvester.Harvester
    - de.pangaea.metadataportal.harvester.SingleFileEntitiesHarvester
      - de.pangaea.metadataportal.harvester.WebCrawlingHarvester
public class WebCrawlingHarvester extends SingleFileEntitiesHarvester
Harvester for traversing websites and harvesting XML documents. If the baseURL (from config) points to an XML file with the correct MIME type, it is harvested directly. An HTML page is analyzed, and all of its links are followed and checked for XML files with a correct MIME type. This is done recursively, but harvesting never leaves the server or the baseURL directory.

This harvester supports the following additional harvester properties:
- baseUrl: URL to start crawling (should point to an HTML page).
- retryCount: number of retries on HTTP errors (default: 5).
- retryAfterSeconds: time between retries in seconds (default: 60).
- timeoutAfterSeconds: HTTP timeout for harvesting in seconds.
- authorizationHeader: optional contents of the 'Authorization' HTTP header to be sent with each request.
- filenameFilter: regex to match the filename. The regex is applied against the whole filename (as if written ^pattern$). (default: none)
- contentTypes: MIME types of documents to index (may be further limited by filenameFilter). (default: "text/xml,application/xml")
- excludeUrlPattern: regex applied to all URLs encountered during harvesting. URLs that match (partial matches are allowed; use ^ and $ to anchor at the start or end) are excluded and not traversed further. (default: none)
- pauseBetweenRequests: number of milliseconds to wait after each HTTP request, to avoid overloading the harvested server (default: none).
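The difference between the two regex properties is easy to miss: filenameFilter must match the whole filename, while excludeUrlPattern may match anywhere in the URL. The following is a minimal sketch of that distinction using java.util.regex. The class CrawlFilterSketch and its methods are hypothetical and not part of this API; the harvester's actual matching code may differ.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of panFMP): illustrates the two matching modes.
final class CrawlFilterSketch {
    private final Pattern filenameFilter;    // whole-name match; null means "no filter"
    private final Pattern excludeUrlPattern; // partial match; null means "exclude nothing"

    CrawlFilterSketch(String filenameFilter, String excludeUrlPattern) {
        this.filenameFilter = filenameFilter == null ? null : Pattern.compile(filenameFilter);
        this.excludeUrlPattern = excludeUrlPattern == null ? null : Pattern.compile(excludeUrlPattern);
    }

    /** True if the filename matches the filter as a whole (like ^pattern$). */
    boolean acceptsFilename(String name) {
        return filenameFilter == null || filenameFilter.matcher(name).matches();
    }

    /** True if any part of the URL matches the exclusion pattern. */
    boolean isExcluded(String url) {
        return excludeUrlPattern != null && excludeUrlPattern.matcher(url).find();
    }

    public static void main(String[] args) {
        CrawlFilterSketch f = new CrawlFilterSketch(".*\\.xml", "/private/");
        System.out.println(f.acceptsFilename("dataset.xml"));                 // true
        System.out.println(f.acceptsFilename("dataset.xml.bak"));             // false
        System.out.println(f.isExcluded("http://example.org/private/a.xml")); // true
        System.out.println(f.isExcluded("http://example.org/public/a.xml"));  // false
    }
}
```

Matcher.matches() requires the entire input to match, which corresponds to the implicit ^pattern$ behaviour, whereas Matcher.find() succeeds on any substring.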
- Author:
- Uwe Schindler
-
-
Field Summary
Fields
- static int DEFAULT_RETRY_COUNT
- static int DEFAULT_RETRY_TIME
- static int DEFAULT_TIMEOUT
- static Set<String> HTML_CONTENT_TYPES
- static String HTML_SAX_PARSER_CLASS: This is the parser class used to parse HTML documents to collect URLs for crawling.
- static String USER_AGENT
-
Fields inherited from class de.pangaea.metadataportal.harvester.Harvester
fromDateReference, harvestCount, HARVESTER_METADATA_FIELD_LAST_HARVESTED, harvestMessageStep, iconfig, log, processor
-
-
Constructor Summary
Constructors
- WebCrawlingHarvester(HarvesterConfig iconfig)
-
Method Summary
Methods
- protected void enumerateValidHarvesterPropertyNames(Set<String> props): This method is used by subclasses to enumerate all available harvester properties that they implement.
- void harvest(): This method is called by the harvester after it has been opened with Harvester.open(de.pangaea.metadataportal.processor.ElasticsearchConnection, java.lang.String).
-
Methods inherited from class de.pangaea.metadataportal.harvester.SingleFileEntitiesHarvester
addDocument, addDocument, cancelMissingDocumentDelete, close
-
Methods inherited from class de.pangaea.metadataportal.harvester.Harvester
addDocument, createMetadataDocumentInstance, deleteDocument, finishReindex, getValidHarvesterPropertyNames, isAllIndexes, isClosed, isDocumentOutdated, main, open, prepareReindex, runHarvester, runHarvester, setHarvestingDateReference, setValidIdentifiers
-
Field Detail
-
DEFAULT_RETRY_TIME
public static final int DEFAULT_RETRY_TIME
- See Also:
- Constant Field Values
-
DEFAULT_RETRY_COUNT
public static final int DEFAULT_RETRY_COUNT
- See Also:
- Constant Field Values
-
DEFAULT_TIMEOUT
public static final int DEFAULT_TIMEOUT
- See Also:
- Constant Field Values
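The retryCount and retryAfterSeconds properties, whose defaults presumably come from DEFAULT_RETRY_COUNT and DEFAULT_RETRY_TIME, describe a retry-with-pause policy for HTTP errors. A minimal sketch of such a policy follows. The class RetrySketch and the method withRetries are hypothetical, and it is an assumption here that the retry count means retries in addition to the first attempt; the harvester's own loop may count differently.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper (not part of panFMP): retry a request with a fixed pause.
final class RetrySketch {
    /**
     * Runs the request and, on IOException, retries up to retryCount more times,
     * sleeping retryAfterMillis between attempts. Other exceptions are not retried.
     */
    static <T> T withRetries(Callable<T> request, int retryCount, long retryAfterMillis)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return request.call();
            } catch (IOException e) {
                if (attempt >= retryCount) {
                    throw e; // retries exhausted, give up
                }
                Thread.sleep(retryAfterMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated request: fails twice, then succeeds.
        String result = withRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated HTTP error");
            }
            return "ok";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```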
-
HTML_SAX_PARSER_CLASS
public static final String HTML_SAX_PARSER_CLASS
This is the parser class used to parse HTML documents to collect URLs for crawling. If this class is not on your classpath, the harvester will fail on startup in Harvester.open(de.pangaea.metadataportal.processor.ElasticsearchConnection, java.lang.String). If you change the implementation (an HTML parser may be embedded in Xerces in the future), change this constant. Do not forget to revisit the features set for this parser in the parsing method.
- See Also:
- Constant Field Values
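Because a missing parser class only surfaces when the harvester is opened, it can be useful to check the classpath up front. The sketch below shows one way to do that with Class.forName. The class ParserAvailabilitySketch is hypothetical, and the class name to check would be the value of HTML_SAX_PARSER_CLASS, which is not reproduced here.

```java
// Hypothetical helper (not part of panFMP): checks whether a class can be loaded.
final class ParserAvailabilitySketch {
    /** Returns true if the named class is loadable, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ParserAvailabilitySketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the parser class name as an argument; defaults to a class that always exists.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " available: " + isOnClasspath(name));
    }
}
```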
-
USER_AGENT
public static final String USER_AGENT
-
-
Constructor Detail
-
WebCrawlingHarvester
public WebCrawlingHarvester(HarvesterConfig iconfig) throws Exception
- Throws:
Exception
-
-
Method Detail
-
harvest
public void harvest() throws Exception
Description copied from class: Harvester
This method is called by the harvester after it has been opened with Harvester.open(de.pangaea.metadataportal.processor.ElasticsearchConnection, java.lang.String). Override this method in your harvester class. It should harvest files from somewhere, generate MetadataDocuments, and add them with Harvester.addDocument(de.pangaea.metadataportal.processor.MetadataDocument).
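The class description states that crawling never leaves the server or the baseURL directory. One way such a scope check could look is sketched below with java.net.URI. The class CrawlScopeSketch and its methods are hypothetical, and the comparison rules (same scheme, host, and port, plus a path prefix) are an assumption rather than the harvester's documented behaviour.

```java
import java.net.URI;

// Hypothetical helper (not part of panFMP): decides whether a link is in crawl scope.
final class CrawlScopeSketch {
    /** Directory part of the base path, e.g. /data/index.html -> /data/ */
    static String baseDirectory(URI base) {
        String path = base.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        return path.substring(0, path.lastIndexOf('/') + 1);
    }

    /** True if candidate is on the same server and below the base directory. */
    static boolean inScope(URI base, URI candidate) {
        // Assumes base is an absolute hierarchical URI such as http://host/dir/page.html.
        // Note: an explicit default port (":80") is treated as different from no port.
        return base.getScheme().equalsIgnoreCase(candidate.getScheme())
            && base.getHost().equalsIgnoreCase(candidate.getHost())
            && base.getPort() == candidate.getPort()
            && candidate.getPath().startsWith(baseDirectory(base));
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.org/data/index.html");
        // Relative links are resolved against the base before checking.
        System.out.println(inScope(base, base.resolve("sub/a.xml")));      // true
        System.out.println(inScope(base, base.resolve("../other/b.xml"))); // false
        System.out.println(inScope(base, URI.create("http://elsewhere.example/data/a.xml"))); // false
    }
}
```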
-
enumerateValidHarvesterPropertyNames
protected void enumerateValidHarvesterPropertyNames(Set<String> props)
Description copied from class: Harvester
This method is used by subclasses to enumerate all available harvester properties that they implement. Override this method in your own implementation and append all harvester property names to the supplied Set. The public API for client code requesting property names is Harvester.getValidHarvesterPropertyNames().
- Overrides:
enumerateValidHarvesterPropertyNames in class SingleFileEntitiesHarvester
- See Also:
Harvester.getValidHarvesterPropertyNames()
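The override pattern described here (call the superclass, then append your own names) can be sketched with stand-in classes. Everything in the block below is hypothetical: BaseHarvesterSketch merely imitates the shape of Harvester, "inheritedProperty" is a placeholder name, and only the nine property names listed in the class description are taken from this page.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-ins (not the real panFMP classes) showing the override pattern.
final class PropertyEnumerationSketch {
    abstract static class BaseHarvesterSketch {
        /** Subclasses override this and append their own property names. */
        protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
            props.add("inheritedProperty"); // placeholder for properties of the base class
        }

        /** Public entry point for client code requesting the property names. */
        public final Set<String> getValidHarvesterPropertyNames() {
            Set<String> props = new TreeSet<>();
            enumerateValidHarvesterPropertyNames(props);
            return props;
        }
    }

    static class CrawlerSketch extends BaseHarvesterSketch {
        @Override
        protected void enumerateValidHarvesterPropertyNames(Set<String> props) {
            super.enumerateValidHarvesterPropertyNames(props); // keep inherited names
            props.addAll(Arrays.asList(
                "baseUrl", "retryCount", "retryAfterSeconds", "timeoutAfterSeconds",
                "authorizationHeader", "filenameFilter", "contentTypes",
                "excludeUrlPattern", "pauseBetweenRequests"));
        }
    }

    public static void main(String[] args) {
        System.out.println(new CrawlerSketch().getValidHarvesterPropertyNames());
    }
}
```

Calling the superclass method first keeps the inherited property names valid, so a subclass only has to list what it adds.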
-
-