I would simplify your solution a little bit. Some classes are unnecessary.
## Suggestion for CrawlerApp
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlerApp {

    private static final int WORKER_LIMIT = 10;
    private static final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    // Number of crawler tasks that have been created but not yet finished.
    private static final AtomicInteger NUMBER_ACTIVE_THREADS = new AtomicInteger(0);
    private static final ExecutorService executor = new ThreadPoolExecutor(WORKER_LIMIT, WORKER_LIMIT, 0L,
            TimeUnit.MILLISECONDS, queue);
    private static Crawler crawler;

    public static void main(String[] args) throws InterruptedException {
        crawler = new Crawler();
        initializeApp();
        startCrawling();
    }

    private static void startCrawling() throws InterruptedException {
        // Poll until no task is active and nothing is queued, then shut down.
        while (NUMBER_ACTIVE_THREADS.get() > 0 || !queue.isEmpty()) {
            Thread.sleep(100);
        }
        executor.shutdown();
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
    }

    private static void initializeApp() {
        Properties config = new Properties();
        try (InputStream in = CrawlerApp.class.getClassLoader().getResourceAsStream("url-horizon.properties")) {
            config.load(in);
            String[] horizon = config.getProperty("urls").split(",");
            for (String link : horizon) {
                URL url = new URL(link);
                // Don't forget to increment/decrement NUMBER_ACTIVE_THREADS in CrawlerTask.
                executor.submit(new CrawlerTask(url, crawler, executor, NUMBER_ACTIVE_THREADS));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
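The comment above is the weak point of this design: the shutdown poll only works if every task updates the counter reliably. Here is a minimal sketch of how CrawlerTask could do that, assuming the constructor signature used in the submit(...) call (the class internals are not shown in your question, so the field names and the crawlTask() call are illustrative):

import java.net.URL;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlerTask implements Runnable {

    private final URL url;
    private final Crawler crawler;
    private final ExecutorService executor;
    private final AtomicInteger activeTasks;

    public CrawlerTask(URL url, Crawler crawler, ExecutorService executor, AtomicInteger activeTasks) {
        this.url = url;
        this.crawler = crawler;
        this.executor = executor;
        this.activeTasks = activeTasks;
        // Count the task as soon as it exists, i.e. before it is queued;
        // otherwise the shutdown poll could see count == 0 while work is still pending.
        activeTasks.incrementAndGet();
    }

    @Override
    public void run() {
        try {
            crawlTask(); // the visiting logic, shown below
        } finally {
            activeTasks.decrementAndGet(); // always runs, even if scraping throws
        }
    }
}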
There is a huge contention problem in CrawlerTask: reduce the synchronized area. It is better to visit a page twice than to block all execution.
## Suggestion for CrawlerTask
private void crawlTask() {
    // Unsynchronized check-then-act: two threads may occasionally scrape
    // the same URL, but no thread ever blocks on a shared lock.
    if (crawler.getUrlVisited().contains(url)) {
        return;
    }
    new Scraper().scrape(url);
    crawler.addURLToVisited(url);
}
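If you want to avoid even the occasional duplicate visit without reintroducing a lock, a visited set backed by ConcurrentHashMap helps: Set.add returns false when the element was already present, so the check and the act collapse into one atomic operation. Crawler's internals are not shown in your question, so this is only a sketch of one possible shape:

import java.net.URL;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class Crawler {

    // A concurrent set: thread-safe without any synchronized blocks.
    private final Set<URL> urlVisited = ConcurrentHashMap.newKeySet();

    // Returns true only for the first thread to register the URL,
    // so callers should scrape exactly when this returns true.
    public boolean addURLToVisited(URL url) {
        return urlVisited.add(url);
    }

    public Set<URL> getUrlVisited() {
        return urlVisited;
    }
}

With that in place, crawlTask() would call addURLToVisited(url) first and scrape only when it returns true, which removes both the lock and the duplicate-visit window.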