Creating a Custom Web Crawler with .NET Core using Entity Framework Core and C#

Basics of a Crawler

Searching through a lot of crawler repositories gave me the idea to create a new one. The main modules of a crawler architecture are almost the same across all of them, and they all come together in the big picture of a crawler's life cycle, which you can see below.

Main modules of Crawler

Developing DotnetCrawler Step by Step

Given the modules above, let's first imagine how the crawler classes could be used, and after that implement them together.

Example Usage with the Microsoft eShopOnWeb Project

This library also includes an example project named DotnetCrawler.Sample. Basically, this sample project targets the Microsoft eShopOnWeb repository; you can find this repo here. This repository implements an e-commerce project, and it has a “Catalog” table when you generate the database with the EF Core code-first approach. So before using the crawler, you should download and run this project with a real database; to perform this action, please refer to this information. (If you already have an existing database, you can continue with your database.)

Once the database is ready and the Catalog entity has been generated (see the scaffolding command below), the crawler is configured fluently, starting from a DotnetCrawler typed with the target entity:

var crawler = new DotnetCrawler<Catalog>()
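    // A sketch of the rest of the fluent setup: the builder methods and the
    // request/downloader property names are assumptions pieced together from
    // the modules described below, so check them against the actual source.
    .AddRequest(new DotnetCrawlerRequest { Url = "https://www.ebay.com/b/Apple-iPhone/9355/bn_319682", Regex = @".*itm/.+" }) // example eBay iPhone listing URL
    .AddDownloader(new DotnetCrawlerDownloader { DownloadType = DotnetCrawlerDownloaderType.FromMemory })
    .AddProcessor(new DotnetCrawlerProcessor<Catalog>())
    .AddPipeline(new DotnetCrawlerPipeline<Catalog>());

await crawler.Crawle();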
To generate the entity classes from the existing database, add the packages required to run EF commands (Microsoft.EntityFrameworkCore.SqlServer and Microsoft.EntityFrameworkCore.Tools) and run the scaffolding command below:

Scaffold-DbContext "Server=(localdb)\mssqllocaldb;Database=Microsoft.eShopOnWeb.CatalogDb;Trusted_Connection=True;" Microsoft.EntityFrameworkCore.SqlServer -OutputDir Models

This generates the eShopOnWeb entities under the Models folder. When the crawler runs, it performs the following steps:
  • It visits the URL in the request object given to it and finds the links on that page. If the Regex property is set, it filters the links accordingly.
  • It downloads the pages at those URLs, applying the configured download method.
  • The downloaded web pages are processed to produce the desired data.
  • Finally, the data is saved to the database with EF Core.

These steps map directly onto the Crawle method:
public async Task Crawle()
{
    var linkReader = new DotnetCrawlerPageLinkReader(Request);
    var links = await linkReader.GetLinks(Request.Url, 0);

    foreach (var url in links)
    {
        var document = await Downloader.Download(url);
        var entity = await Processor.Process(document);
        await Pipeline.Run(entity);
    }
}

Project Structure of the Visual Studio Solution

You can start by creating a new project in Visual Studio with a Blank Solution. After that, you can add .NET Core Class Library projects as in the image below:

DotnetCrawler.Core

This project includes the main classes of the crawler. It has only one interface, which exposes the Crawle method, and an implementation of that interface. So you can create your own custom crawler in this project.
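As a rough sketch, that interface could look like the following (the interface name here is an assumption; only the Crawle method is confirmed above):

public interface IDotnetCrawler
{
    // The single entry point that runs the crawl loop shown earlier.
    Task Crawle();
}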

DotnetCrawler.Data

This project includes the Attributes, Models, and Repository folders, which are worth examining more closely. For example, the Catalog entity below is decorated with the crawler's attributes, which map XPath expressions on the crawled page to the entity and its properties:

[DotnetCrawlerEntity(XPath = "//*[@id='LeftSummaryPanel']/div[1]")]
public partial class Catalog : IEntity
{
    public int Id { get; set; }

    [DotnetCrawlerField(Expression = "//*[@id='itemTitle']/text()", SelectorType = SelectorType.XPath)]
    public string Name { get; set; }
}
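For orientation, the attribute types used above could be declared roughly like this; the names follow the usage above, but the exact members are an assumption:

[AttributeUsage(AttributeTargets.Class)]
public class DotnetCrawlerEntityAttribute : Attribute
{
    public string XPath { get; set; } // root node of the entity on the page
}

[AttributeUsage(AttributeTargets.Property)]
public class DotnetCrawlerFieldAttribute : Attribute
{
    public string Expression { get; set; }         // selector expression for this field
    public SelectorType SelectorType { get; set; } // e.g. XPath or CSS selector
}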

DotnetCrawler.Downloader

This project includes the download algorithm among the main classes of the crawler. Different download methods can be applied according to the DownloadType of the downloader. You can also develop your own custom downloader here in order to implement your requirements. To provide these download functions, this project loads the HtmlAgilityPack and HtmlAgilityPack.CssSelectors.NetCore packages.
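As an illustration, a minimal downloader built on HtmlAgilityPack could look like this (SimpleDownloader is a simplified stand-in, not the library's actual type):

using System.Threading.Tasks;
using HtmlAgilityPack;

public class SimpleDownloader
{
    // Loads the page over HTTP and parses it into an HtmlDocument in memory.
    public Task<HtmlDocument> Download(string url)
    {
        var web = new HtmlWeb();
        return web.LoadFromWebAsync(url);
    }
}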

DotnetCrawler.Processor

This project converts the downloaded web data into an EF Core entity. This requirement is solved by using reflection to get and set members of the generic entity type. The DotnetCrawlerProcessor.cs class implements the crawler's default processor. You can also develop your own custom processor here in order to implement your requirements.
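The core idea can be sketched as follows: for each property carrying a DotnetCrawlerField attribute, evaluate its expression against the downloaded document and assign the result via reflection. This is a simplified sketch assuming string properties and XPath selectors, not the library's exact code:

using System.Linq;
using HtmlAgilityPack;

public static class SimpleProcessor
{
    public static TEntity Process<TEntity>(HtmlDocument document) where TEntity : new()
    {
        var entity = new TEntity();
        foreach (var property in typeof(TEntity).GetProperties())
        {
            // Read the crawler attribute that maps this property to a selector.
            var field = property.GetCustomAttributes(typeof(DotnetCrawlerFieldAttribute), false)
                                .Cast<DotnetCrawlerFieldAttribute>()
                                .FirstOrDefault();
            if (field == null) continue;

            var node = document.DocumentNode.SelectSingleNode(field.Expression);
            if (node != null)
                property.SetValue(entity, node.InnerText.Trim()); // assumes a string property
        }
        return entity;
    }
}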

DotnetCrawler.Pipeline

This project inserts a given entity object, produced by the processor module, into the database, using EF Core as the object-relational mapping framework. The DotnetCrawlerPipeline.cs class implements the crawler's default pipeline. You can also develop your own custom pipeline here in order to implement your requirements (for example, persistence to different database types).
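A minimal sketch of such a pipeline step, assuming a scaffolded context named CatalogDbContext (your generated context name may differ):

using System.Threading.Tasks;

public class SimplePipeline
{
    // Persists one crawled entity to the database with EF Core.
    public async Task Run(Catalog entity)
    {
        using (var context = new CatalogDbContext())
        {
            context.Set<Catalog>().Add(entity);
            await context.SaveChangesAsync();
        }
    }
}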

DotnetCrawler.Scheduler

This project schedules jobs for the crawler's crawl action. Unlike the other modules, this requirement has no default implementation, so you should develop your own custom scheduler here in order to implement your requirements. You can use Quartz or Hangfire for background jobs.
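For example, with Hangfire a recurring crawl could be registered like this (assuming Hangfire and its storage are already configured, and RunCrawlerAsync is a hypothetical wrapper around the crawler setup shown earlier):

RecurringJob.AddOrUpdate(
    "dotnet-crawler",        // job id
    () => RunCrawlerAsync(), // hypothetical wrapper around crawler.Crawle()
    Cron.Daily());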

DotnetCrawler.Sample

This project demonstrates inserting new iPhone records into the Catalog table from the eBay e-commerce web site using DotnetCrawler. So you can set DotnetCrawler.Sample as the startup project and debug the modules explained in the above parts of the article.

The Catalog table, showing the last 10 records inserted by the crawler

Conclusion

This library is designed like other strong crawler libraries such as WebMagic and Scrapy, but it is built on an architecture that focuses on easy extension and scaling, applying best practices like Domain-Driven Design and object-oriented principles. Thus you can easily implement your custom requirements and use the default features of this straightforward, lightweight web crawling/scraping library, which produces Entity Framework Core output and runs on .NET Core.

Mehmet Ozkaya

Software/Solutions Architect, Udemy Instructor, Working on Cloud-Native and Serverless Event-driven Microservices Architectures https://github.com/mehmetozkaya