I asked the same question Web scraper for e-commerce sites yesterday and I'm now posting the revised code here.
I'm building web scraper application which takes name, code and price from few sites. I thought factory pattern would fit in my application. I would like to someone review my code and tell if I'm missing something.
I have class Item
which holds scraped data.
public class Item
{
public string Code { get; set; }
public string Name { get; set; }
public string Price { get; set; }
}
An interfacem which has a method RunScrapingAsync
with a list of item codes as the single parameter, which I need to scrape.
public interface IWebScraper
{
Task<List<Item>> RunScrapingAsync(List<string> itemCodes);
}
Then I have implementations for three scrapers (Amazon, EBay, AliExpress):
public class AmazonWebScraper : IWebScraper
{
private static HttpClient client;
public List<string> ItemCodes { get; set; }
public AmazonWebScraper()
{
client = new HttpClient(new HttpClientHandler() { Proxy = null });
client.BaseAddress = new Uri("https://amazon.com");
}
public async Task<List<Item>> RunScrapingAsync(List<string> itemCodes)
{
ConcurrentBag<Item> itemsConcurrentBag = new ConcurrentBag<Item>();
//for simplicity this logic is not important no need to go in details
return itemsConcurrentBag.ToList();
}
}
public class EBayWebScraper : IWebScraper
{
private static HttpClient client;
public List<string> ItemCodes { get; set; }
public EBayWebScraper()
{
client = new HttpClient(new HttpClientHandler() { Proxy = null });
client.BaseAddress = new Uri("https://ebay.com");
}
public async Task<List<Item>> RunScrapingAsync(List<string> itemCodes)
{
ConcurrentBag<Item> itemsConcurrentBag = new ConcurrentBag<Item>();
//for simplicity this logic is not important no need to go in details
return itemsConcurrentBag.ToList();
}
}
public class AliExpressWebScraper : IWebScraper
{
private static HttpClient client;
public List<string> ItemCodes { get; set; }
public AliExpressWebScraper()
{
client = new HttpClient(new HttpClientHandler() { Proxy = null });
client.BaseAddress = new Uri("https://aliexpress.com");
}
public async Task<List<Item>> RunScrapingAsync(List<string> itemCodes)
{
ConcurrentBag<Item> itemsConcurrentBag = new ConcurrentBag<Item>();
//for simplicity this logic is not important no need to go in details
return itemsConcurrentBag.ToList();
}
}
Here is my factory class WebScraperFactory
:
public enum WebSite
{
Amazon,
EBay,
AliExpress
}
public class WebScraperFactory
{
public IWebScraper Create(WebSite website)
{
switch (website)
{
case WebSite.Amazon:
return new AmazonWebScraper();
case WebSite.EBay:
return new EBayWebScraper();
case WebSite.AliExpress:
return new AliExpressWebScraper();
default:
throw new NotImplementedException($"Not implemented create method in scraper factory for website {webSite}");
}
}
}
The WebScraper
class, which holds all scrapers in a dictionary and uses them in the Execute
method for the provided WebSite
.
public class WebScraper
{
private readonly WebScraperFactory _webScraperFactory;
public WebScraper()
{
_webScraperFactory = new WebScraperFactory();
}
public async Task<List<Item>> Execute(WebSite webSite, List<string> itemCodes) =>
await _webScraperFactory.Create(webSite).RunScrapingAsync(itemCodes);
}
This is a WinForm app, so users have the option to run one or more scrapers (they are not all mandatory to run). So if a user chooses to run Amazon and AliExpress, it will choose two files with codes, adds them to the dictionary and calls the webscraper factory on every chosen website.
Example usage:
var codes = new Dictionary<WebSite, List<string>>
{
{WebSite.Amazon, amazonCodes},
{WebSite.AliExpress, aliExpressCodes}
}
var items = new Dictionary<WebSite, List<Item>>
{
{WebSite.Amazon, null},
{WebSite.AliExpress, null}
}
var webScraper = new WebScraper();
foreach(var webSite in websitesItemCodes.Keys)
{
items[webSite] = await webScraper.Execute(webSite, codes[webSite]);
}
1 Answer 1
IWebScraper
- Based on the implementations it seems like you would gain more if you would define this interface as an abstract class to have a common place for the shared logic
public abstract class WebScraperBase
{
private readonly HttpClient client;
public WebScraperBase(Uri domain)
=> client = new HttpClient() { BaseAddress = domain };
public async Task<List<Item>> RunScrapingAsync(List<string> itemCodes, CancellationToken token = default)
{
ConcurrentBag<Item> itemsConcurrentBag = new();
await ScrapeAsync(itemCodes, itemsConcurrentBag, token);
return itemsConcurrentBag.ToList();
}
protected abstract Task ScrapeAsync(List<string> itemCodes, ConcurrentBag<Item> items, CancellationToken token);
}
- I think the whole handler object is unnecessary:
new HttpClientHandler() { Proxy = null }
- I changed your
RunScrapingAsync
to be a template method- I've added a
CancellationToken
as a parameter to allow user cancellation/timeout - I also defined a
ScrapeAsync
method as a step method
- I've added a
- I'm not sure why do have a class level
List<string>
property and aList<string>
parameter as well- I've removed the former because based on the shared code fragment it was not in use
- If it was in use inside the not shown part of the code under the
RunScrapingAsync
then please be aware that this property could be modified while the async method is running!!!!
XYZWebScraper
- After the above changes the implementation of a given scraper could look like this
public class AliExpressWebScraper : WebScraperBase
{
public AliExpressWebScraper() : base(new Uri("https://aliexpress.com"))
{}
protected async override Task ScrapeAsync(List<string> itemCodes, ConcurrentBag<Item> items, CancellationToken token)
{
//for simplicity this logic is not important no need to go in
}
}
WebScraperFactory
- If you could take advantage of switch expression then you can rewrite your
Create
like this
public static WebScraperBase Create(WebSite website)
=> website switch
{
WebSite.Amazon => new AmazonWebScraper(),
WebSite.EBay => new EBayWebScraper(),
WebSite.AliExpress => new AliExpressWebScraper(),
_ => throw new NotImplementedException($"Not implemented create method in scraper factory for website {webSite}"),
};
WebScraper
- I do believe this whole class is unnecessary if you define the
WebScraperFactory
and itsCreate
method asstatic
(like above)
UPDATE #1
I completely overlooked that the client
was defined as static
inside the WebScraperBase
. That implementation does not work so here are two alternatives:
- Store the static
HttpClient
instances inside theWebScraperFactory
and pass them to the ctor of the concrete scrapper instances
WebScraperFactory
private static Dictionary<WebSite, HttpClient> clients = new Dictionary<WebSite, HttpClient>
{
{ WebSite.Amazon, new HttpClient() { BaseAddress = new Uri("https://amazon.com") } },
{ WebSite.EBay, new HttpClient() { BaseAddress = new Uri("https://ebay.com") } },
{ WebSite.AliExpress, new HttpClient() { BaseAddress = new Uri("https://aliexpress.com") } }
};
public static WebScraperBase Create(WebSite website)
{
var client = clients.ContainsKey(website) ? clients[website] : throw new InvalidOperationException("...");
switch(website)
{
case WebSite.Amazon: return new AmazonWebScraper(client);
case WebSite.EBay: return new EBayWebScraper(client);
case WebSite.AliExpress: return new AliExpressWebScraper(client);
default: throw new NotImplementedException("...");
};
}
WebScraperBase
private readonly HttpClient client;
public WebScraperBase(HttpClient client)
=> this.client = client;
- Use
IHttpClientFactory
to delegate the life cycle management of the underlyingHttpClientHandler
s to a dedicated component- Because you are using .NET Framework, not .NET Core you need to do the following workaround to be able to use
IHttpClientFactory
- Because you are using .NET Framework, not .NET Core you need to do the following workaround to be able to use
WebScraperFactory
private static readonly IServiceProvider serviceProvider;
static WebScraperFactory()
{
var serviceCollection = new ServiceCollection().AddHttpClient();
serviceCollection.AddHttpClient(WebSite.Amazon.ToString(), c => c.BaseAddress = new Uri("https://amazon.com"));
serviceCollection.AddHttpClient(WebSite.EBay.ToString(), c => c.BaseAddress = new Uri("https://ebay.com"));
serviceCollection.AddHttpClient(WebSite.AliExpress.ToString(), c => c.BaseAddress = new Uri("https://aliexpress.com"));
serviceProvider = serviceCollection.BuildServiceProvider();
}
public static WebScraperBase Create(WebSite website)
{
var factory = serviceProvider.GetRequiredService<IHttpClientFactory>();
var client = factory.CreateClient(website.ToString());
if(client.BaseAddress == null) throw new InvalidOperationException("...");
switch(website)
{
case WebSite.Amazon: return new AmazonWebScraper(client);
case WebSite.EBay: return new EBayWebScraper(client);
case WebSite.AliExpress: return new AliExpressWebScraper(client);
default: throw new NotImplementedException("...");
};
}
-
\$\begingroup\$ Application is written in .Net framework so I can't use pattern matching switch. Would use it if I could. \$\endgroup\$Xyloto– Xyloto2023年01月17日 18:59:45 +00:00Commented Jan 17, 2023 at 18:59
-
\$\begingroup\$ @Xyloto Even thought I read WinForms, I did not realisize it means dotNet Fw in this case... \$\endgroup\$Peter Csala– Peter Csala2023年01月17日 19:37:42 +00:00Commented Jan 17, 2023 at 19:37
-
\$\begingroup\$
abstract
classes cannot havestatic
members. I suppose that is because an abstract class is never instantiated. \$\endgroup\$radarbob– radarbob2023年01月18日 02:14:06 +00:00Commented Jan 18, 2023 at 2:14 -
\$\begingroup\$ @radarbob Quite frankly I overlooked that part. Let me update my post. \$\endgroup\$Peter Csala– Peter Csala2023年01月18日 07:11:15 +00:00Commented Jan 18, 2023 at 7:11
-
1\$\begingroup\$ @Xyloto I've updated my post please revisit it \$\endgroup\$Peter Csala– Peter Csala2023年01月18日 07:39:48 +00:00Commented Jan 18, 2023 at 7:39
static
HttpClient
but assigning it in a non-static constructor is, for all intents and purposes, wrong. Either remove thestatic
, or inject the entireHttpClient
into your object and have it truly static elsewhere. \$\endgroup\$