I would simplify your solution a little bit. Some classes are unnecessary.
## Suggestion for CrawlerApp
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlerApp {

    private static final int WORKER_LIMIT = 10;
    private static final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    // Number of crawler tasks that have been created but not yet finished.
    private static final AtomicInteger NUMBER_ACTIVE_THREADS = new AtomicInteger(0);
    private static final ExecutorService executor = new ThreadPoolExecutor(WORKER_LIMIT, WORKER_LIMIT, 0L,
            TimeUnit.MILLISECONDS, queue);
    private static Crawler crawler;

    public static void main(String[] args) throws InterruptedException {
        crawler = new Crawler();
        initializeApp();
        startCrawling();
    }

    private static void startCrawling() throws InterruptedException {
        // Poll until no task is active and nothing is queued, then shut down.
        while (NUMBER_ACTIVE_THREADS.get() > 0 || !queue.isEmpty()) {
            Thread.sleep(100);
        }
        executor.shutdown();
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
    }

    private static void initializeApp() {
        Properties config = new Properties();
        try (InputStream in = CrawlerApp.class.getClassLoader().getResourceAsStream("url-horizon.properties")) {
            config.load(in);
            String[] horizon = config.getProperty("urls").split(",");
            for (String link : horizon) {
                URL url = new URL(link);
                // Don't forget to increment/decrement NUMBER_ACTIVE_THREADS in CrawlerTask.
                executor.submit(new CrawlerTask(url, crawler, executor, NUMBER_ACTIVE_THREADS));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
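The comment above is the weak point of this design: the shutdown poll only works if every task updates the counter reliably. Here is a minimal sketch of how CrawlerTask could do that, assuming the constructor signature used in the submit(...) call (the class internals are not shown in your question, so the field names and the crawlTask() call are illustrative):

import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlerTask implements Runnable {

    private final URL url;
    private final Crawler crawler;
    private final ExecutorService executor;
    private final AtomicInteger activeTasks;

    public CrawlerTask(URL url, Crawler crawler, ExecutorService executor, AtomicInteger activeTasks) {
        this.url = url;
        this.crawler = crawler;
        this.executor = executor;
        this.activeTasks = activeTasks;
        // Count the task as soon as it exists, i.e. before it is queued;
        // otherwise the shutdown poll could see count == 0 while work is still pending.
        activeTasks.incrementAndGet();
    }

    @Override
    public void run() {
        try {
            crawlTask(); // the visiting logic, shown below
        } finally {
            activeTasks.decrementAndGet(); // always runs, even if scraping throws
        }
    }
}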
There is a huge contention problem in CrawlerTask: reduce the synchronized area. It is better to visit a page twice than to block all execution.
## Suggestion for CrawlerTask
private void crawlTask() {
    // Unsynchronized check-then-act: two threads may occasionally scrape
    // the same URL, but no thread ever blocks on a shared lock.
    if (crawler.getUrlVisited().contains(url)) {
        return;
    }
    new Scraper().scrape(url);
    crawler.addURLToVisited(url);
}
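If you want to avoid even the occasional duplicate visit without reintroducing a lock, a visited set backed by ConcurrentHashMap helps: Set.add returns false when the element was already present, so the check and the act collapse into one atomic operation. Crawler's internals are not shown in your question, so this is only a sketch of one possible shape:

import java.net.URL;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class Crawler {

    // A concurrent set: thread-safe without any synchronized blocks.
    private final Set<URL> urlVisited = ConcurrentHashMap.newKeySet();

    // Returns true only for the first thread to register the URL,
    // so callers should scrape exactly when this returns true.
    public boolean addURLToVisited(URL url) {
        return urlVisited.add(url);
    }

    public Set<URL> getUrlVisited() {
        return urlVisited;
    }
}

With that in place, crawlTask() would call addURLToVisited(url) first and scrape only when it returns true, which removes both the lock and the duplicate-visit window.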