
Web Scraping
with HTML Agility Pack


by Alex Ulici


I'm Alex, a .NET software developer at Wayfare. I'm part of the .NET department, and I would like to share what web scraping is and how you can implement it with HTML Agility Pack.

Web Scraping

"Web scraping (web harvesting or web data extraction) is data scraping
used for extracting data from websites".

The gathered data can be used to populate databases or to produce statistics for further use.
This is especially useful when the website you need specific information from
does not offer an API or any other way to share it.

A good example is websites that aggregate sales or rental listings.
They extract products (name, description and price, for example) from certain
websites and display all of them on a single page.

HTML Agility Pack

HTML Agility Pack is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT. It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).

What I did

In this article I'm going to show you how easily you can extract useful information from websites using HTML Agility Pack. I collected the names and descriptions of all my workmates from our website, www.wayfare.ro, and displayed them on my own website. Keep in mind that this is just a basic example. You can do much more!

How I did it

I created an empty ASP.NET Web API project in which I installed HTML Agility Pack using NuGet.


The first step after that was creating my model. I named it Workmate, and it has two properties: Name and Description.
public class Workmate
{
     public string Name { get; set; }
     public string Description { get; set; }
}
After that I moved on to the API side. You have to create a controller that extracts the information from the website and returns a list of Workmate.

I called my controller "TeamController", and it has one method named "GetTeam". It takes no parameters and returns a list of Workmate. Inside it I instantiated a new list of Workmate (teamList), which will hold the data to return.
For the next step we need to download the HTML document and load it. You can do that with the Load method of an HtmlWeb instance. Load requires a URL, so we will define a constant for it (it can be defined globally).
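// Class-level constant: the page to scrape.
private const string TeamUrl = "https://www.wayfare.ro/team/";
// Inside GetTeam: download the page and parse it into an HtmlDocument.
var web = new HtmlWeb();
var htmlDoc = web.Load(TeamUrl);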


If the document cannot be parsed we stop.
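
The check itself is not shown above, so here is a minimal sketch of what such a guard could look like. TryLoad is a hypothetical helper, not part of HTML Agility Pack, and the broad catch is an assumption: the exact exception thrown on a network failure depends on the library version and the runtime.

```csharp
using System;
using HtmlAgilityPack;

public static class TeamPageLoader
{
    // Hypothetical helper: returns the parsed page, or null when the page
    // could not be downloaded or has no root node to query.
    public static HtmlDocument TryLoad(string url)
    {
        try
        {
            var web = new HtmlWeb();
            var htmlDoc = web.Load(url);

            // Treat a missing document or root node as "cannot be parsed".
            return htmlDoc?.DocumentNode == null ? null : htmlDoc;
        }
        catch (Exception)
        {
            // Assumption: any failure while loading means we should stop.
            return null;
        }
    }
}
```

With a helper like this, GetTeam could return the empty list as soon as TryLoad gives back null.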


Now here comes the hard part. We have to get the XPath to the value we want. You can do that with a web browser, since most browsers have this feature. For this example, I will be using Google Chrome. Right-click on the element you want to extract and select "Inspect".


You will notice that the developer tools panel opened on the right side of your screen. The desired element should already be selected in it.

Right-click on the element, go to Copy and select "Copy XPath".
In this case your XPath should look like this:

//*[@id="awsm-member-519-390"]/figure/figcaption/div/div[1]/h3

This works, but the id predicate pins the expression to a single team member, so using it as-is would return the same person every time. What we want is a list of everyone on the page, so the id-specific part has to be dropped.

Going back to coding, let's get the names and descriptions using our XPath.
var names = htmlDoc.DocumentNode.SelectNodes("//*//figure/figcaption/div/div//h3");
var descriptions = htmlDoc.DocumentNode.SelectNodes("//*//figure/figcaption/div/div//span");
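
To see the difference between the copied XPath and the generalized one, here is a small self-contained sketch that runs against inline HTML instead of the live site. The markup and the ids (member-1, member-2) are made up for illustration and only loosely mimic the structure implied by the XPath; the real page will differ.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathDemo
{
    public static void Main()
    {
        // Made-up markup for illustration only.
        const string html = @"
<div id='member-1'><figure><figcaption><div><div><h3>Ana</h3><span>Developer</span></div></div></figcaption></figure></div>
<div id='member-2'><figure><figcaption><div><div><h3>Dan</h3><span>Tester</span></div></div></figcaption></figure></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pinned to one id: only that member's heading can match.
        var one = doc.DocumentNode.SelectNodes(
            "//*[@id='member-1']/figure/figcaption/div/div[1]/h3");

        // No id predicate: every member's heading can match.
        var all = doc.DocumentNode.SelectNodes(
            "//*//figure/figcaption/div/div//h3");

        // SelectNodes returns null (not an empty collection) when nothing matches,
        // so guard before reading Count.
        Console.WriteLine("With id predicate: " + (one?.Count ?? 0));
        Console.WriteLine("Without id predicate: " + (all?.Count ?? 0));
    }
}
```

Because SelectNodes returns null when nothing matches, it is worth checking names and descriptions for null before using them.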

Now that we have them, let's translate them to Workmate and add them to the list.
foreach (var workmate in names.Zip(descriptions,
    (n, d) => new Workmate { Name = n.InnerText, Description = d.InnerText }))
{
    teamList.Add(workmate);
}

This foreach walks through names and descriptions in parallel: Zip pairs each name with the description at the same position, creates a new Workmate from the pair, and the loop adds it to teamList.
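
One detail of Zip is worth knowing when scraping: it stops at the end of the shorter sequence. The sketch below uses plain strings instead of HTML nodes (the sample names are made up) to show what happens when one description is missing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Workmate
{
    public string Name { get; set; }
    public string Description { get; set; }
}

public static class ZipDemo
{
    public static void Main()
    {
        var names = new[] { "Ana", "Dan", "Ioana" };
        var descriptions = new[] { "Developer", "Tester" }; // one short on purpose

        // Pair items position by position, exactly like the foreach above.
        List<Workmate> team = names
            .Zip(descriptions, (n, d) => new Workmate { Name = n, Description = d })
            .ToList();

        // Zip stops at the shorter sequence, so "Ioana" is silently dropped.
        Console.WriteLine(team.Count); // prints 2
    }
}
```

If the page has a person without a description, the pairs after that point would also be misaligned, so comparing the two counts before zipping is a cheap sanity check.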
Finally we are ready to return the Workmate list.

return teamList;
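
Putting the steps together, the controller could look roughly like the sketch below. It assumes ASP.NET Web API 2 (ApiController from System.Web.Http). The null checks, the HtmlEntity.DeEntitize call and the Trim are my additions for robustness, not part of the snippets above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Web.Http;
using HtmlAgilityPack;

public class Workmate
{
    public string Name { get; set; }
    public string Description { get; set; }
}

public class TeamController : ApiController
{
    private const string TeamUrl = "https://www.wayfare.ro/team/";

    [HttpGet]
    public List<Workmate> GetTeam()
    {
        var teamList = new List<Workmate>();

        var web = new HtmlWeb();
        var htmlDoc = web.Load(TeamUrl);
        if (htmlDoc?.DocumentNode == null)
        {
            return teamList; // nothing to parse
        }

        var names = htmlDoc.DocumentNode.SelectNodes("//*//figure/figcaption/div/div//h3");
        var descriptions = htmlDoc.DocumentNode.SelectNodes("//*//figure/figcaption/div/div//span");

        // SelectNodes returns null when the XPath matches nothing.
        if (names == null || descriptions == null)
        {
            return teamList;
        }

        // Addition: decode HTML entities and trim surrounding whitespace.
        teamList.AddRange(names.Zip(descriptions, (n, d) => new Workmate
        {
            Name = HtmlEntity.DeEntitize(n.InnerText).Trim(),
            Description = HtmlEntity.DeEntitize(d.InnerText).Trim()
        }));

        return teamList;
    }
}
```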

The API is done, so let's display the data. Using ASP.NET MVC, I created a PageController.

Basically, it calls the API and asks for the team list. To create the web page I used Razor. In this example I also wanted the people ordered by description length.
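
The ordering step is not shown, so here is a minimal sketch of one way to do it with LINQ. OrderByDescriptionLength is a hypothetical helper, the sample data is made up, and ascending order is an assumption; OrderByDescending would put the longest description first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Workmate
{
    public string Name { get; set; }
    public string Description { get; set; }
}

public static class OrderingDemo
{
    // Hypothetical helper: shortest description first; a null description counts as length 0.
    public static List<Workmate> OrderByDescriptionLength(IEnumerable<Workmate> team)
    {
        return team
            .OrderBy(w => (w.Description ?? string.Empty).Length)
            .ToList();
    }

    public static void Main()
    {
        var team = new List<Workmate>
        {
            new Workmate { Name = "Ana", Description = "Senior .NET developer" },
            new Workmate { Name = "Dan", Description = "Tester" }
        };

        foreach (var w in OrderByDescriptionLength(team))
        {
            Console.WriteLine(w.Name + ": " + w.Description);
        }
    }
}
```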

Run it and voila!

(Screenshot does not include all results)

Conclusion

I warmly recommend HTML Agility Pack for its flexibility whenever you need to do some web scraping.

Credits

Boeing, G.; Waddell, P. (2016). "New Insights into Rental Housing Markets across the United States: Web Scraping and Analyzing Craigslist Rental Listings". Journal of Planning Education and Research.

HTML Agility Pack GitHub - https://github.com/zzzprojects/html-agility-pack

How to Scrape HTML Data with C# - https://www.youtube.com/watch?v=4cPPD-MFadQ
