new ChildManager(message_obj)
Manages the workers owned by the bot, along with
the communication and coordination between them.
Parameters:
Name | Type | Description |
---|---|---|
message_obj | Message | |
- Source:
Members
(private, inner) bloom :BloomFilter
A Bloom filter is used for the URL-seen test, which reduces duplicate-key errors from the database.
n: Number of items in the filter
p: Probability of false positives, float between 0 and 1 or a number indicating 1-in-p
m: Number of bits in the filter
k: Number of hash functions
n = 10,000,000, p = 1.0E-6 (1 in 1,000,000) → m = 287,551,752 (34.28MB), k = 20
http://hur.st/bloomfilter
Type:
- BloomFilter
- Source:
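The example figures above follow from the standard Bloom filter sizing formulas. A minimal sketch, assuming those textbook formulas; `bloomParameters` is a hypothetical helper for illustration and not part of ChildManager:

```javascript
// Standard Bloom filter sizing formulas (not taken from the ChildManager source).
// n: expected number of items, p: target false-positive probability.
function bloomParameters(n, p) {
  // m = ceil(-n * ln(p) / (ln 2)^2): number of bits in the filter
  const m = Math.ceil((-n * Math.log(p)) / (Math.LN2 * Math.LN2));
  // k = round((m / n) * ln 2): number of hash functions
  const k = Math.round((m / n) * Math.LN2);
  // Size in megabytes, counting 1 MB as 1024 * 1024 bytes
  return { m, k, megabytes: m / 8 / (1024 * 1024) };
}

const params = bloomParameters(10000000, 1.0e-6);
console.log(params.m, params.k, params.megabytes.toFixed(2));
// prints: 287551752 20 34.28
```

The resulting m and k correspond to the BLOOM_M and BLOOM_K members, with n playing the role of BLOOM_N.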
(private, inner) BLOOM_K
Bloom filter k value (number of hash functions)
- Source:
(private, inner) bloom_length
Tracks the size of the Bloom filter.
- Source:
(private, inner) BLOOM_M
Bloom filter m value (number of bits in the filter)
- Source:
(private, inner) BLOOM_N
Bloom filter n value (number of items in the filter)
- Source:
(private, inner) prev_domain_grp
Used as a queue so that buckets from different domain groups are fetched from the database for crawling.
- Source:
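One way such a queue could rotate between domain groups is sketched below. This is an assumption about the intent, not the actual implementation: `pickDomainGroup`, `MAX_REMEMBERED`, and the group names are all hypothetical.

```javascript
// Hypothetical sketch: remember recently fetched domain groups in a FIFO queue
// so that the next fetch prefers a group that was not used lately.
const prevDomainGrp = []; // stands in for prev_domain_grp
const MAX_REMEMBERED = 2; // assumed cap, not from the source

function pickDomainGroup(allGroups) {
  // Prefer a group that is not in the recent queue.
  let choice = allGroups.find((g) => !prevDomainGrp.includes(g));
  if (choice === undefined) {
    // Every group was used recently: reuse the oldest one.
    choice = prevDomainGrp.shift();
  }
  prevDomainGrp.push(choice);
  // Drop the oldest entries once the queue exceeds its cap.
  while (prevDomainGrp.length > MAX_REMEMBERED) {
    prevDomainGrp.shift();
  }
  return choice;
}

const groups = ["grp_a", "grp_b", "grp_c"];
const order = [1, 2, 3, 4].map(() => pickDomainGroup(groups));
console.log(order.join(","));
// prints: grp_a,grp_b,grp_c,grp_a
```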
Methods
flushInlinks(fn)
Flushes inlinks into the database during cleanup.
Parameters:
Name | Type | Description |
---|---|---|
fn | function | callback |
- Source:
getActiveChilds()
Returns the number of active children in the manager.
- Source:
isManagerLocked()
Returns the lock state of the starter function.
- Source:
killWorkers(fn)
Kill the workers spawned by the child manager.
Parameters:
Name | Type | Description |
---|---|---|
fn | function | callback |
- Source:
setManagerLocked(state)
Locks or unlocks the starter function, which runs on an interval.
Parameters:
Name | Type | Description |
---|---|---|
state | boolean | true to lock, false to unlock |
- Source:
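A lock like this typically keeps an interval-driven function from starting again while a previous run is still in progress. The sketch below illustrates that pattern under stated assumptions: `starterTick` and its `work` callback are hypothetical stand-ins, and only the names `setManagerLocked` and `isManagerLocked` come from this page.

```javascript
// Hypothetical sketch of a lock guarding an interval-driven function.
let managerLocked = false;

function setManagerLocked(state) {
  managerLocked = state;
}

function isManagerLocked() {
  return managerLocked;
}

// Stand-in for the starter function: skips the tick while the lock is held.
function starterTick(work) {
  if (isManagerLocked()) return false; // previous tick is still running
  setManagerLocked(true);
  work(() => setManagerLocked(false)); // unlock when the async work calls back
  return true;
}

let release;
const first = starterTick((done) => { release = done; }); // work still pending
const second = starterTick(() => {});                      // skipped: locked
release();                                                 // pending work finishes
const third = starterTick((done) => done());               // runs again
console.log(first, second, third);
// prints: true false true
```

In a long-running process the tick would be scheduled with something like `setInterval(() => starterTick(allocate), 1000)`, where `allocate` is again a placeholder.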
(private, inner) childFeedback(data)
Receives messages from all the workers.
Parameters:
Name | Type | Description |
---|---|---|
data | Object | {"bot": "spawn", "insertRssFeed": [link.details.url, feeds]} |
- Source:
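A message of the shape shown above could be routed to handlers by key. The following is a minimal sketch, assuming that `bot` names the sender type and that every other key is a command; the handler table, the `received` array, and the `mystery` key are hypothetical.

```javascript
// Hypothetical handler table; only the message shape comes from the table above.
const received = [];
const handlers = new Map([
  ["insertRssFeed", ([url, feeds]) => received.push({ url, feeds })],
]);

function childFeedback(data) {
  // Assumption: "bot" identifies the sender, every other key is a command.
  const { bot, ...commands } = data;
  const unknown = [];
  for (const [name, payload] of Object.entries(commands)) {
    const handler = handlers.get(name);
    if (handler) handler(payload);
    else unknown.push(name); // collect commands without a handler
  }
  return { bot, unknown };
}

const result = childFeedback({
  bot: "spawn",
  insertRssFeed: ["http://example.com/", ["http://example.com/rss.xml"]],
  mystery: 1,
});
console.log(result, received.length);
```

With Node's `child_process` module, a function like this would usually be attached through `child.on("message", childFeedback)`.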
(private, inner) createChild(bucket_links, hash, refresh_label)
Spawns a new child process for the normal queue.
Parameters:
Name | Type | Description |
---|---|---|
bucket_links | Object | Batch fetched by getNextBatch |
hash | String | Batch hash id |
refresh_label | String | Fetch interval of the batch |
- Source:
(private, inner) createChild_for_failed_queue(bucket_links, hash, refresh_label)
Spawns a new child process for the failed queue.
Parameters:
Name | Type | Description |
---|---|---|
bucket_links | Object | Batch fetched by getNextBatch |
hash | String | Batch hash id |
refresh_label | String | Fetch interval of the batch |
- Source:
(private, inner) elasticIndex(dict)
Indexes a JS object into Elasticsearch,
if Elasticsearch is enabled in the config.
Parameters:
Name | Type | Description |
---|---|---|
dict | JSON | |
- Source:
(private, inner) msg()
Calls the Logger object with the name of the calling function.
- Source:
(private, inner) nextBatch(fn)
Called by starter to fetch the next batch from the database.
Parameters:
Name | Type | Description |
---|---|---|
fn | function | callback |
- Source:
(private, inner) nextFailedBatch()
Fetches a batch from the failed queue.
- Source:
(private, inner) rss_links_updator()
Fetches RSS files and updates links from them.
The RSS file links come from the crawler's rss collection.
This function runs inside a setInterval.
- Source:
(private, inner) starter()
Responsible for allocating vacant children.
This function runs continuously on an interval to check for and
reallocate workers.
- Source:
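The allocation loop can be pictured as filling free worker slots on every tick. The sketch below is a simplified, synchronous illustration under stated assumptions: `MAX_CHILDS`, `fetchBatch`, and `spawnWorker` are hypothetical, and the real manager fetches batches through the callback-based nextBatch(fn).

```javascript
// Hypothetical sketch: fill vacant worker slots on every tick.
const MAX_CHILDS = 3;     // assumed pool size, not from the source
const active = new Set(); // ids of workers that are currently running
let nextId = 1;

function getActiveChilds() {
  return active.size;
}

function starter(fetchBatch, spawnWorker) {
  let started = 0;
  while (getActiveChilds() < MAX_CHILDS) {
    const batch = fetchBatch();
    if (batch === null) break; // nothing left to crawl right now
    const id = nextId++;
    active.add(id);
    // The worker reports completion through the callback, freeing its slot.
    spawnWorker(batch, () => active.delete(id));
    started += 1;
  }
  return started;
}

const pending = ["b1", "b2", "b3", "b4", "b5"];
const finishers = [];
const fetchBatch = () => (pending.length > 0 ? pending.shift() : null);
const spawnWorker = (batch, onExit) => finishers.push(onExit);

const firstTick = starter(fetchBatch, spawnWorker);  // fills all three slots
const secondTick = starter(fetchBatch, spawnWorker); // no vacancy
finishers[0]();                                      // one worker exits
const thirdTick = starter(fetchBatch, spawnWorker);  // refills the vacancy
console.log(firstTick, secondTick, thirdTick);
// prints: 3 0 1
```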
(private, inner) startTika()
Launches a child process for PDF requests.
- Source: