This project was generated by the StormCrawler Maven Archetype as a starting point for building your own crawler.
Have a look at the code and resources and modify them to your heart's content.

# Prerequisites

You need to install Apache Storm. The instructions on [setting up a Storm cluster](https://storm.apache.org/releases/2.6.2/Setting-up-a-Storm-cluster.html) should help. 

You also need to have an instance of URLFrontier running. See [the URLFrontier README](https://github.com/crawler-commons/url-frontier/tree/master/service); the easiest way is to use Docker, like so:

``` sh
docker pull crawlercommons/url-frontier
docker run --rm --name frontier -p 7071:7071 crawlercommons/url-frontier
```

# Compilation

Generate an uberjar with:

``` sh
mvn clean package
```

# URL injection

The next step is to inject URLs into URLFrontier using the [client](https://github.com/crawler-commons/url-frontier/tree/master/client). The client is already a dependency of this project, so all you need to do is run:

``` sh
java -cp target/${artifactId}-${version}.jar crawlercommons.urlfrontier.client.Client PutURLs -f seeds.txt
```

where _seeds.txt_ is a file containing URLs to inject, with one URL per line.
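If you do not have a seed file yet, a minimal sketch for creating one is shown below. The two URLs are placeholders chosen for illustration only; replace them with the sites you actually want to crawl.

``` sh
# Write two placeholder seed URLs, one per line (replace with your own)
printf '%s\n' \
  'https://example.com/' \
  'https://example.org/' > seeds.txt

# Sanity check: count the URLs that will be injected
wc -l < seeds.txt
```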

# Running the crawl

You can now submit the flux topology using the storm command:

``` sh
storm local target/${artifactId}-${version}.jar org.apache.storm.flux.Flux crawler.flux --local-ttl 3600
```

Note that in local mode, Flux uses a default TTL of 20 seconds for the topology. The `--local-ttl 3600` option in the command above keeps the topology running for 1 hour.

Alternatively, you can use `storm jar` to start the topology in distributed mode, where it runs continuously, as intended.
This is the best way to run the topology, as you also benefit from the Storm UI and logging.
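As a sketch, a distributed-mode submission could mirror the local command, with `storm jar` in place of `storm local` and without the `--local-ttl` option. This assumes that the `storm` client on your machine is configured to reach your cluster and that you reuse the same `crawler.flux` file; adjust it to your setup.

``` sh
# Submit the Flux topology to the cluster configured for the storm client
# (assumption: same jar and Flux file as in the local-mode command)
storm jar target/${artifactId}-${version}.jar org.apache.storm.flux.Flux crawler.flux
```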
