Creating a Custom Web Crawler with .NET Core using Entity Framework Core and C#
There are many web scraping and web crawler frameworks available on different platforms, but in the .NET ecosystem you do not have many options, and it is hard to find a tool that accommodates your custom requirements.
In this article, we will implement a custom web crawler and use it on the eBay e-commerce web site, scraping eBay iPhone pages and inserting the records into our SQL Server database using Entity Framework Core. The example database schema comes from the Microsoft eShopOnWeb application; we will insert the eBay records into its Catalog table.
You can find the GitHub repository here: DotnetCrawler
Before starting to develop a new crawler, I searched for existing tools written in C#; the ones I looked at are listed below.
Abot is a good crawler, but it has no free support if you need to implement something custom, and the documentation is not sufficient.
DotnetSpider has a really good design; its architecture follows the most widely used crawlers such as Scrapy and WebMagic. However, the documentation is in Chinese, so even after translating it to English with Google Translate it is hard to learn how to implement custom scenarios. I also wanted to insert the crawler output into a SQL Server database, but it was not working properly; I opened an issue on GitHub but it has not been answered yet.
With limited time I had no other way to solve my problem, and I did not want to spend more time investigating other crawler infrastructures, so I decided to write my own tool.
Basics of a Crawler
Searching through a lot of repositories gave me an idea for a new one. The main modules of a crawler architecture are almost the same in all of them, and they all fit into the big picture of a crawler's life, which you can see below.
This figure shows the main modules that should be included in a common crawler project. Therefore, I added each of these modules as a separate project under the Visual Studio solution.
Basic explanations of these modules are listed below; a rough sketch of their contracts follows the list.
Downloader; responsible for downloading the given url into a local folder or temp location and returning HtmlNode objects to the processor.
Processors; responsible for processing the given html nodes, extracting and finding the intended specific nodes, and loading entities with the processed data. The result is passed on to the pipeline.
Pipelines; responsible for exporting the entity to the different databases used by the application.
Scheduler; responsible for scheduling crawler commands in order to provide polite crawl operations.
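To make the picture more concrete, here is a rough C# sketch of what these module contracts could look like. The interface names are illustrative assumptions; only the Download, Process and Run method names match the Crawle() loop shown later in the article.

using System.Threading.Tasks;
using HtmlAgilityPack;

// Illustrative contracts; the actual library types may differ.
public interface IDownloader
{
    // Fetch the page behind the url and return its parsed HTML document.
    Task<HtmlDocument> Download(string url);
}

public interface IProcessor<TEntity> where TEntity : class
{
    // Extract the intended fields from the HTML into an entity instance.
    Task<TEntity> Process(HtmlDocument document);
}

public interface IPipeline<TEntity> where TEntity : class
{
    // Persist the entity, e.g. to SQL Server via Entity Framework Core.
    Task Run(TEntity entity);
}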
Developing DotnetCrawler Step by Step
Based on the modules above, let's first imagine how the crawler classes could be used, and then implement them together.
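A usage sketch along these lines shows the intent. The request and downloader property names (Url, Regex, TimeOut, DownloaderType, DownloadPath) are assumptions based on the configuration methods described below, and the eBay url is only illustrative.

var crawler = new DotnetCrawler<Catalog>()
    .AddRequest(new DotnetCrawlerRequest
    {
        Url = "https://www.ebay.com/b/apple-iphone", // illustrative listing page
        Regex = @".*itm/.+",                         // only follow item detail links
        TimeOut = 5000
    })
    .AddDownloader(new DotnetCrawlerDownloader
    {
        DownloaderType = DotnetCrawlerDownloaderType.FromMemory,
        DownloadPath = @"C:\DotnetCrawler\"          // only used for the FromFile option
    })
    .AddProcessor(new DotnetCrawlerProcessor<Catalog>())
    .AddPipeline(new DotnetCrawlerPipeline<Catalog>());

await crawler.Crawle();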
As you can see, the DotnetCrawler<TEntity> class has a generic entity type which is used as a DTO object and saved to the database. Catalog is the generic type of DotnetCrawler here; it is generated by the EF Core scaffolding command in the DotnetCrawler.Data project. We will see this later.
The DotnetCrawler object is configured with the Builder design pattern in order to load its configuration. This technique is also known as a fluent interface.
DotnetCrawler is configured with the following methods;
AddRequest; includes the main url of the crawl target. We can also define a regex filter for the target urls in order to focus on the intended parts.
AddDownloader; includes the downloader type. If the download type is "FromFile", meaning download to a local folder, it also requires the path of the download folder. The other options are "FromMemory" and "FromWeb"; both of them download the target url but do not save it.
AddProcessor; loads a new default processor which basically extracts the html page and locates specific html tags. You can create your own processor thanks to the extensible design.
AddPipeline; loads a new default pipeline which basically saves the entity into a database. The current pipeline connects to SQL Server using Entity Framework Core. You can create your own pipeline thanks to the extensible design.
All of these configurations are stored in the main class, DotnetCrawler.cs.
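A condensed sketch of what this class might look like is below; the member names follow the article, the interface types reuse the illustrative contracts sketched earlier, and the bodies are simplified.

using System.Threading.Tasks;

public class DotnetCrawler<TEntity> where TEntity : class, IEntity, new()
{
    public DotnetCrawlerRequest Request { get; private set; }
    public IDownloader Downloader { get; private set; }
    public IProcessor<TEntity> Processor { get; private set; }
    public IPipeline<TEntity> Pipeline { get; private set; }

    // Each Add* method stores its configuration and returns the crawler
    // itself, which is what enables the fluent (builder-style) chaining.
    public DotnetCrawler<TEntity> AddRequest(DotnetCrawlerRequest request) { Request = request; return this; }
    public DotnetCrawler<TEntity> AddDownloader(IDownloader downloader) { Downloader = downloader; return this; }
    public DotnetCrawler<TEntity> AddProcessor(IProcessor<TEntity> processor) { Processor = processor; return this; }
    public DotnetCrawler<TEntity> AddPipeline(IPipeline<TEntity> pipeline) { Pipeline = pipeline; return this; }

    public async Task Crawle()
    {
        // The full body is shown later in the article.
    }
}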
Accordingly, after the necessary configuration is done, the crawler.Crawle() method is triggered asynchronously. This method completes its work by passing through the modules above in order.
Example Usage with Microsoft's eShopOnWeb Project
The library also includes a sample project named DotnetCrawler.Sample. Basically, this sample targets Microsoft's eShopOnWeb repository; you can find that repo here. It implements an e-commerce project which has a "Catalog" table when generated with the EF Core code-first approach. So before using the crawler, you should download and run this project with a real database; please refer to its documentation to perform this step. (If you already have an existing database you can continue with your own.)
We pass the "Catalog" table as the generic type of the DotnetCrawler class.
var crawler = new DotnetCrawler<Catalog>()
Catalog is the generic type of DotnetCrawler; it is generated by the EF Core scaffolding command in the DotnetCrawler.Data project, which has the EF Core NuGet packages installed. Before running this command, the .Data project should install the NuGet packages below.
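For example, the scaffolding tooling typically needs the following EF Core packages, installed from the Package Manager Console:

Install-Package Microsoft.EntityFrameworkCore.SqlServer
Install-Package Microsoft.EntityFrameworkCore.Tools
Install-Package Microsoft.EntityFrameworkCore.Design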
Now the packages are ready and we are able to run the EF command in the Package Manager Console with the DotnetCrawler.Data project selected.
Scaffold-DbContext "Server=(localdb)\mssqllocaldb;Database=Microsoft.eShopOnWeb.CatalogDb;Trusted_Connection=True;" Microsoft.EntityFrameworkCore.SqlServer -OutputDir Models
This command creates a Models folder in the DotnetCrawler.Data project. The folder contains all the entities and the context object generated from Microsoft's eShopOnWeb example.
After that, you need to configure your entity class with the custom crawler attributes so that the crawler spiders know how to load the entity fields from the eBay iPhone web pages (the annotated Catalog entity is shown in the DotnetCrawler.Data section below). With these attributes, the crawler basically requests the given url and tries to find the nodes whose XPath addresses are defined for the target web page.
After these definitions, we can finally run the crawler.Crawle() method asynchronously. This method performs the following operations in order.
- It visits the url in the request object given to it and finds the links in it. If the Regex property is set, it applies filtering accordingly.
- It fetches these urls from the web and downloads them using the configured method.
- The downloaded web pages are processed to produce the desired data.
- Finally, this data is saved to the database with EF Core.
public async Task Crawle()
{
    // Read the links from the request url, applying the configured filter and depth.
    var linkReader = new DotnetCrawlerPageLinkReader(Request);
    var links = await linkReader.GetLinks(Request.Url, 0);

    foreach (var url in links)
    {
        // Download the page, extract the entity and persist it.
        var document = await Downloader.Download(url);
        var entity = await Processor.Process(document);
        await Pipeline.Run(entity);
    }
}
Project Structure of Visual Studio Solution
You can start by creating a new project in Visual Studio with a Blank Solution, and then add .NET Core Class Library projects as shown in the image below;
Only the sample project is a .NET Core Console Application. I will explain all the projects in this solution one by one.
DotnetCrawler.Core
This project includes the main classes of the crawler. It has only one interface, which exposes the Crawle method, and an implementation of this interface. You can create your custom crawler in this project.
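In essence, the contract boils down to a single asynchronous method. The interface name below is an assumption; the Crawle method name comes from the article.

using System.Threading.Tasks;

public interface IDotnetCrawler
{
    // Runs the whole download -> process -> pipeline loop.
    Task Crawle();
}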
DotnetCrawler.Data
This project includes the Attributes, Models and Repository folders. Let's examine these folders in more detail.
Models folder; should include the entity classes generated by Entity Framework Core. You should put your database table entities in this folder, and it should also contain the Entity Framework Core context object. Currently this folder contains Microsoft's eShopOnWeb database example.
Attributes folder; includes the crawler attributes which carry the XPath information about the crawled web pages. There are 2 classes: DotnetCrawlerEntityAttribute.cs for the entity attribute and DotnetCrawlerFieldAttribute.cs for the property attribute. These attributes should be placed on the EF Core entity classes. You can see an example usage of the attributes in the code block below;
[DotnetCrawlerEntity(XPath = "//*[@id='LeftSummaryPanel']/div[1]")]
public partial class Catalog : IEntity
{
    public int Id { get; set; }

    [DotnetCrawlerField(Expression = "//*[@id='itemTitle']/text()", SelectorType = SelectorType.XPath)]
    public string Name { get; set; }
}
The first XPath is used to locate the html node where the crawl starts. The second one is used to get the actual data from a particular html node; in this example, this path retrieves the iPhone names from eBay.
Repository folder; includes a Repository design pattern implementation over the EF Core entities and database context. I used this resource for the repository pattern. In order to use the repository pattern, we have to apply the IEntity interface to all EF Core entities; you can see above that the Catalog class implements the IEntity interface. So the crawler's generic type should implement IEntity.
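A minimal sketch of that abstraction could look like the following; only IEntity and the EF Core usage come from the article, while the repository member names are illustrative.

using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public interface IEntity
{
    int Id { get; set; }
}

public interface ICrawlerRepository<TEntity> where TEntity : class, IEntity
{
    // Used by the pipeline to persist a crawled record.
    Task AddAsync(TEntity entity);
}

public class CrawlerRepository<TEntity> : ICrawlerRepository<TEntity> where TEntity : class, IEntity
{
    private readonly DbContext _context;

    public CrawlerRepository(DbContext context) => _context = context;

    public async Task AddAsync(TEntity entity)
    {
        _context.Set<TEntity>().Add(entity);
        await _context.SaveChangesAsync();
    }
}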
DotnetCrawler.Downloader
This project includes the download algorithm among the main classes of the crawler. Different download methods can be applied according to the DownloadType of the downloader. You can also develop your own custom downloader here in order to implement your requirements. To provide these download functions, this project should reference the HtmlAgilityPack and HtmlAgilityPack.CssSelectors.NetCore packages;
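For example, via the Package Manager Console (the CSS selector package is published on NuGet as HtmlAgilityPack.CssSelectors.NetCore):

Install-Package HtmlAgilityPack
Install-Package HtmlAgilityPack.CssSelectors.NetCore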
The main download logic lives in DotnetCrawlerDownloader.cs, in the DownloadInternal() method;
This method downloads the target url according to the downloader type: download to a local file, download to a temp file, or do not save at all and read directly from the web.
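A simplified sketch of such a type-based downloader is shown below, assuming HtmlAgilityPack for parsing; the enum, class and property names here are illustrative, and the library's actual code may differ.

using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public enum DownloaderType { FromFile, FromMemory, FromWeb }

public class SimpleDownloader
{
    public DownloaderType DownloadType { get; set; }
    public string DownloadPath { get; set; }

    public async Task<HtmlDocument> DownloadInternal(string url)
    {
        switch (DownloadType)
        {
            case DownloaderType.FromWeb:
                // Read the page directly from the web without saving it anywhere.
                return await new HtmlWeb().LoadFromWebAsync(url);

            case DownloaderType.FromMemory:
            {
                // Download the page into memory only.
                using (var client = new HttpClient())
                {
                    var html = await client.GetStringAsync(url);
                    var document = new HtmlDocument();
                    document.LoadHtml(html);
                    return document;
                }
            }

            case DownloaderType.FromFile:
            default:
            {
                // Save the page under DownloadPath first, then load it from disk.
                using (var client = new HttpClient())
                {
                    var html = await client.GetStringAsync(url);
                    var filePath = Path.Combine(DownloadPath, "page.html");
                    File.WriteAllText(filePath, html);

                    var document = new HtmlDocument();
                    document.Load(filePath);
                    return document;
                }
            }
        }
    }
}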
One of the main functions of a crawler is its page-visit algorithm. In this project, the DotnetCrawlerPageLinkReader.cs class applies the page-visit algorithm with recursive methods, and you can control it by passing a depth parameter. I used this resource to solve this issue.
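The idea behind the recursive, depth-limited traversal can be sketched roughly as follows; the class and member names here are illustrative and not the library's exact code.

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class PageLinkReader
{
    private readonly int _maxDepth;

    public PageLinkReader(int maxDepth = 1) => _maxDepth = maxDepth;

    public async Task<IEnumerable<string>> GetLinks(string url, int level = 0)
    {
        // Stop when the requested depth has been exceeded.
        if (level > _maxDepth)
            return Enumerable.Empty<string>();

        var pageLinks = await GetPageLinks(url);
        if (level == _maxDepth)
            return pageLinks;

        // Otherwise visit every discovered link one level deeper.
        var all = new List<string>(pageLinks);
        foreach (var link in pageLinks)
            all.AddRange(await GetLinks(link, level + 1));

        return all.Distinct();
    }

    private static async Task<IEnumerable<string>> GetPageLinks(string url)
    {
        // Collect the href values of all anchor tags on the page.
        var document = await new HtmlWeb().LoadFromWebAsync(url);
        return document.DocumentNode
            .Descendants("a")
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => !string.IsNullOrEmpty(href) && href.StartsWith("http"))
            .Distinct()
            .ToList();
    }
}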
DotnetCrawler.Processor
This project converts the downloaded web data into EF Core entities. This is solved by using reflection to get and set members of the generic type. The DotnetCrawlerProcessor.cs class implements the crawler's current processor. You can also develop your own custom processor here in order to implement your requirements.
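The reflection idea can be sketched as follows; it reuses the DotnetCrawlerFieldAttribute shown earlier, while the class name and the single-node handling are simplifications rather than the library's exact implementation.

using System;
using System.Reflection;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class SimpleProcessor<TEntity> where TEntity : class, new()
{
    public Task<TEntity> Process(HtmlDocument document)
    {
        var entity = new TEntity();

        foreach (var property in typeof(TEntity).GetProperties())
        {
            // Only properties decorated with the crawler field attribute are mapped.
            var field = property.GetCustomAttribute<DotnetCrawlerFieldAttribute>();
            if (field == null)
                continue;

            // Locate the node via the attribute's XPath expression and copy its text.
            var node = document.DocumentNode.SelectSingleNode(field.Expression);
            if (node != null)
                property.SetValue(entity, Convert.ChangeType(node.InnerText.Trim(), property.PropertyType));
        }

        return Task.FromResult(entity);
    }
}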
DotnetCrawler.Pipeline
This project inserts the entity object produced by the processor module into the database, using EF Core as the object-relational mapping framework. The DotnetCrawlerPipeline.cs class implements the crawler's current pipeline. You can also develop your own custom pipeline here in order to implement your requirements (persistence to different database types).
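A minimal sketch of such a pipeline, persisting the entity through the repository abstraction sketched in the DotnetCrawler.Data section above, could look like this; the class name is illustrative.

using System.Threading.Tasks;

public class SimplePipeline<TEntity> where TEntity : class, IEntity
{
    private readonly ICrawlerRepository<TEntity> _repository;

    public SimplePipeline(ICrawlerRepository<TEntity> repository) => _repository = repository;

    public Task Run(TEntity entity)
    {
        // Insert the crawled record into the database through EF Core.
        return _repository.AddAsync(entity);
    }
}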
DotnetCrawler.Scheduler
This project schedules jobs for the crawler's crawl action. Unlike the other modules, this requirement has no default implementation, so you can develop your own custom scheduler here in order to implement your requirements. You can use Quartz or Hangfire for background jobs.
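As an example, a recurring crawl could be scheduled with Quartz along these lines; CrawlJob and CrawlScheduler are hypothetical wrappers around the crawler configured earlier, not something the library ships.

using System.Threading.Tasks;
using Quartz;
using Quartz.Impl;

// Hypothetical job that simply runs the crawler configured as shown earlier.
public class CrawlJob : IJob
{
    public async Task Execute(IJobExecutionContext context)
    {
        var crawler = new DotnetCrawler<Catalog>(); // add request/downloader/processor/pipeline here
        await crawler.Crawle();
    }
}

public static class CrawlScheduler
{
    public static async Task Start()
    {
        // Create and start a Quartz scheduler.
        var factory = new StdSchedulerFactory();
        var scheduler = await factory.GetScheduler();
        await scheduler.Start();

        // Run the crawl job now and repeat it every 6 hours.
        var job = JobBuilder.Create<CrawlJob>().WithIdentity("crawl-job").Build();
        var trigger = TriggerBuilder.Create()
            .StartNow()
            .WithSimpleSchedule(s => s.WithIntervalInHours(6).RepeatForever())
            .Build();

        await scheduler.ScheduleJob(job, trigger);
    }
}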
DotnetCrawler.Sample
This project demonstrates that new iPhones can be inserted into the Catalog table from the eBay e-commerce web site using DotnetCrawler. You can set DotnetCrawler.Sample as the startup project and debug the modules we explained in the earlier parts of the article.
Conclusion
This library is designed like other strong crawler libraries such as WebMagic and Scrapy, but built on an architecture that focuses on being easy to extend and scale, applying best practices like Domain Driven Design and object-oriented principles. You can easily implement your custom requirements and use the default features of this straightforward, lightweight web crawling/scraping library, which produces Entity Framework Core output and runs on .NET Core.
GitHub : Source Code
I have just published a course — .NET 8 Microservices: C# 12, DDD, CQRS, Vertical/Clean Architecture.
It is a step-by-step development of a reference microservices architecture on the .NET platform, using ASP.NET Web API, Docker, RabbitMQ, MassTransit, gRPC, YARP API Gateway, PostgreSQL, Redis, SQLite, SQL Server, Marten, Entity Framework Core, CQRS, MediatR, DDD, and Vertical and Clean Architecture, with the latest features of .NET 8 and C# 12.