new URLCreator(message_obj)
Represents a URL and its crawl details, and provides parsing functions.
Parameters:
Name | Type | Description |
---|---|---|
message_obj | Message | |
Methods
url(url_input, d, p)
Returns a URL object with the URL's details and helper methods.
Parameters:
Name | Type | Description |
---|---|---|
url_input | String | |
d | String | domain |
p | String | parent |
(private, inner) extractDomain(url)
Extracts the domain from the URL.
Parameters:
Name | Type | Description |
---|---|---|
url | String | |
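The following is a minimal sketch of domain extraction, not the library's own code. The name `extractDomainSketch` and the choice to treat scheme-less input as `http` are assumptions made for illustration; it relies only on the standard WHATWG `URL` class available in Node.js.

```javascript
// Hypothetical helper: illustrates one way to pull the host name out of a URL.
function extractDomainSketch(url) {
  // Assumption: input without a scheme is treated as http.
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(url);
  const withScheme = hasScheme ? url : "http://" + url;
  // The WHATWG URL parser lowercases the host name for http(s) URLs.
  return new URL(withScheme).hostname;
}

console.log(extractDomainSketch("http://www.Example.com/a/b?x=1")); // www.example.com
```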
(private, inner) getFileType(url)
Returns 'file' or 'webpage' based on the URL and the Tika config.
Parameters:
Name | Type | Description |
---|---|---|
url | String | |
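As a rough illustration of the 'file' versus 'webpage' decision, the sketch below classifies a URL by its file extension. The shape of the config (`supportedExtensions`) and the name `getFileTypeSketch` are assumptions; the actual Tika config used by the crawler is not described here.

```javascript
// Assumption: the Tika-related config lists extensions that should be treated as files.
const tikaConfigSketch = { supportedExtensions: ["pdf", "doc", "docx"] };

// Hypothetical helper: returns 'file' when the path ends in a listed extension.
function getFileTypeSketch(url, config) {
  const pathname = new URL(url).pathname;
  const lastSegment = pathname.split("/").pop();
  const dot = lastSegment.lastIndexOf(".");
  const ext = dot === -1 ? "" : lastSegment.slice(dot + 1).toLowerCase();
  return config.supportedExtensions.includes(ext) ? "file" : "webpage";
}

console.log(getFileTypeSketch("http://example.com/docs/report.PDF?x=1", tikaConfigSketch)); // file
console.log(getFileTypeSketch("http://example.com/about", tikaConfigSketch)); // webpage
```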
(private, inner) isAccepted(url, domain) → {boolean}
Returns whether the URL is accepted or rejected, based on the regexes in the config.
Parameters:
Name | Type | Description |
---|---|---|
url | String | |
domain | String | |
Returns:
- Type
- boolean
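A regex-based accept/reject check could look like the sketch below. The config layout (per-domain `accept` and `reject` lists), the rule that reject patterns win, and the name `isAcceptedSketch` are all assumptions for illustration; the real config format is not shown in this reference.

```javascript
// Assumption: the config maps each domain to accept and reject regex lists.
const regexConfigSketch = {
  "example.com": {
    accept: [/^http:\/\/example\.com\/blog\//],
    reject: [/\.(jpg|png|gif)$/i],
  },
};

// Hypothetical helper: reject patterns take precedence over accept patterns.
function isAcceptedSketch(url, domain, config) {
  const rules = config[domain];
  if (!rules) return false; // Assumption: unknown domains are rejected.
  if (rules.reject.some((re) => re.test(url))) return false;
  return rules.accept.some((re) => re.test(url));
}

console.log(isAcceptedSketch("http://example.com/blog/post-1", "example.com", regexConfigSketch)); // true
console.log(isAcceptedSketch("http://example.com/blog/pic.PNG", "example.com", regexConfigSketch)); // false
```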
(private, inner) normalizeDomain(url)
Normalizes the domain of the URL.
Parameters:
Name | Type | Description |
---|---|---|
url | String | |
(private, inner) normalizeProtocol(url)
Normalizes the protocol to http.
Parameters:
Name | Type | Description |
---|---|---|
url | String | |
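One possible reading of "normalizes the protocol to http" is sketched below. The handling of protocol-relative and scheme-less input, and the decision to leave other schemes untouched, are assumptions; `normalizeProtocolSketch` is a hypothetical name.

```javascript
// Hypothetical helper: rewrites http/https (any case) to a lowercase http:// prefix.
function normalizeProtocolSketch(url) {
  if (/^https:\/\//i.test(url)) return "http://" + url.slice("https://".length);
  if (/^http:\/\//i.test(url)) return "http://" + url.slice("http://".length);
  // Assumption: other explicit schemes (ftp://, etc.) are left untouched.
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(url)) return url;
  // Assumption: protocol-relative and scheme-less input default to http.
  if (url.startsWith("//")) return "http:" + url;
  return "http://" + url;
}

console.log(normalizeProtocolSketch("https://example.com/a")); // http://example.com/a
console.log(normalizeProtocolSketch("//example.com/x")); // http://example.com/x
```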
(private, inner) normalizeURL(url)
Normalizes a URL.
Parameters:
Name | Type | Description |
---|---|---|
url | String | |
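The reference does not say which normalization steps are applied, so the sketch below only shows a few common ones: lowercasing the scheme and host, dropping the default port, removing the fragment, and trimming a trailing slash. The choice of steps and the name `normalizeURLSketch` are assumptions.

```javascript
// Hypothetical helper: applies a few common URL normalization steps.
function normalizeURLSketch(url) {
  // Parsing lowercases the scheme and host and drops the default port.
  const parsed = new URL(url);
  // Assumption: fragments are irrelevant to crawling and are removed.
  parsed.hash = "";
  // Assumption: a trailing slash is trimmed, except on the root path.
  if (parsed.pathname.length > 1 && parsed.pathname.endsWith("/")) {
    parsed.pathname = parsed.pathname.slice(0, -1);
  }
  return parsed.href;
}

console.log(normalizeURLSketch("HTTP://Example.COM:80/a/b/#top")); // http://example.com/a/b
```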
(private, inner) nutchStyleURLKey(url)
Returns a Nutch-style URL key.
Parameters:
Name | Type | Description |
---|---|---|
url | String | |
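Nutch-style keys are generally built by reversing the host name so that URLs from the same domain sort next to each other. The exact key layout used by this method is not stated, so the ordering of host, protocol, and port below is an assumption, and `nutchStyleURLKeySketch` is a hypothetical name.

```javascript
// Hypothetical helper: builds a reversed-host key such as "com.foo.bar:http/path".
function nutchStyleURLKeySketch(url) {
  const parsed = new URL(url);
  // Reverse the dot-separated host labels: bar.foo.com -> com.foo.bar
  const reversedHost = parsed.hostname.split(".").reverse().join(".");
  const protocol = parsed.protocol.replace(/:$/, "");
  // Assumption: a non-default port is appended after the protocol.
  const port = parsed.port ? ":" + parsed.port : "";
  return reversedHost + ":" + protocol + port + parsed.pathname + parsed.search;
}

console.log(nutchStyleURLKeySketch("http://bar.foo.com:8983/to/index.html?a=b"));
// com.foo.bar:http:8983/to/index.html?a=b
```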
(private, inner) sortedParams(url)
Sorts the query parameters of the URL.
Parameters:
Name | Type | Description |
---|---|---|
url | String | |
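Sorting query parameters makes URLs that differ only in parameter order compare equal. The sketch below uses the standard `URLSearchParams.sort()` method; whether the original returns the full URL or only the parameters is not stated, so returning the full URL is an assumption, as is the name `sortedParamsSketch`.

```javascript
// Hypothetical helper: returns the URL with its query parameters sorted by name.
function sortedParamsSketch(url) {
  const parsed = new URL(url);
  // sort() orders the pairs by key and updates the URL's query string.
  parsed.searchParams.sort();
  return parsed.href;
}

console.log(sortedParamsSketch("http://example.com/search?b=2&a=1&c=3"));
// http://example.com/search?a=1&b=2&c=3
```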