How to parse HTML in C# with HTML Agility Pack

In this post we will see how to parse an HTML web page with C# and read its content comfortably thanks to the HTML Agility Pack library.

Today, a large amount of information reaches users through web pages. Giving our programs the ability to read those pages is very useful when automating processes.

There are many cases where we may need to read HTML. For example, we can automatically check the status of an order, track a shipment, or monitor the price of a product, among countless other applications.

Interpreting HTML with C# is not too difficult, but we can do it even more easily thanks to the HTML Agility Pack library, available at https://html-agility-pack.net/.

For now, the library is open source and the code is hosted at https://github.com/zzzprojects/html-agility-pack. We say "for now" because the author has previously turned some of his open source libraries into commercial ones.

With HTML Agility Pack, we can parse the HTML into a tree of nodes. The library includes methods to locate child nodes or nodes that match given criteria, and we can even use LINQ for searches.

We can load a web page with HTML Agility Pack from a text file, from a string in memory, or directly from the Internet.
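To make the idea of a node tree a bit more concrete, here is a minimal sketch that parses a small in-memory string and uses LINQ over `Descendants()` to collect link targets. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `LinqSketch`, the helper `GetLinks`, and the sample markup are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class LinqSketch
{
    // Returns the href of every <a> element that has one, in document order.
    public static string[] GetLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("a")                              // every <a> node in the tree
            .Select(a => a.GetAttributeValue("href", ""))  // "" when href is missing
            .Where(href => href.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        // Hypothetical sample markup, not taken from any real page.
        var html = "<html><body><p><a href='/one'>One</a></p>"
                 + "<a>No link</a><a href='/two'>Two</a></body></html>";

        foreach (var link in GetLinks(html))
            Console.WriteLine(link);
    }
}
```

Because `Descendants()` returns a lazy `IEnumerable<HtmlNode>`, any LINQ operator (`Where`, `Select`, `GroupBy`, and so on) can be chained onto it.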

// From File
var doc = new HtmlDocument();
doc.Load(filePath);

// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);

// From Web
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);

Once the document is loaded, we can retrieve one or more nodes using XPath expressions.

var node = doc.DocumentNode.SelectSingleNode("//head/title");
var nodes = doc.DocumentNode.SelectNodes("//article");
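One detail worth keeping in mind: by default, `SelectNodes` returns `null` rather than an empty collection when nothing matches, and `SelectSingleNode` also returns `null` in that case. The following is a small sketch of a defensive wrapper; `XPathSketch` and `SelectTexts` are hypothetical names, and the markup is invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class XPathSketch
{
    // Returns the trimmed text of every node matching the XPath expression,
    // or an empty list when there are no matches.
    public static List<string> SelectTexts(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // With default options, SelectNodes yields null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
            return new List<string>();

        return nodes.Select(n => n.InnerText.Trim()).ToList();
    }

    public static void Main()
    {
        // Hypothetical sample markup.
        var html = "<html><body><article> First </article><article>Second</article></body></html>";

        foreach (var text in SelectTexts(html, "//article"))
            Console.WriteLine(text);

        // No <section> elements exist, so this is an empty list instead of a crash.
        Console.WriteLine(SelectTexts(html, "//section").Count);
    }
}
```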

Once we have a node, we can use HTML Agility Pack to find its child nodes or read its content, including attributes, name, class, text, etc.

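var price = node.Descendants("a").First().Attributes["data-price"].Value; // attribute value; First() needs "using System.Linq;"
var name = node.Name;           // tag name of the node
var outerHtml = node.OuterHtml; // HTML of the node, including its own tag
var text = node.InnerText;      // text content of the node and its descendants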
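The attribute lookup above throws if the node has no `<a>` descendant or if the attribute is absent. As a hedged alternative, this sketch uses `FirstOrDefault()` together with `GetAttributeValue`, which takes a fallback value. The names `AttributeSketch` and `ReadPrice`, and the sample markup, are assumptions made for the example.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class AttributeSketch
{
    // Returns the data-price attribute of the first <a> under the node,
    // or an empty string when there is no <a> or no such attribute.
    public static string ReadPrice(HtmlNode node)
    {
        var link = node.Descendants("a").FirstOrDefault();
        if (link == null)
            return "";

        return link.GetAttributeValue("data-price", "");
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical sample markup.
        doc.LoadHtml("<article><a data-price='9.99'>Item</a></article>");

        var article = doc.DocumentNode.SelectSingleNode("//article");
        Console.WriteLine(ReadPrice(article));
    }
}
```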

If the content of the node contains HTML entities such as &amp; (which is normal), we can "clean it" to convert it into plain text with the help of the 'HtmlDecode' method of the 'System.Net.WebUtility' class.

System.Net.WebUtility.HtmlDecode(node.Descendants("a").First().InnerText);
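As a quick illustration of what decoding does, this sketch runs `WebUtility.HtmlDecode` on a hard-coded string standing in for a node's `InnerText`. The `DecodeSketch` class, the `Clean` helper, and the sample text are hypothetical; trimming the result is an extra convenience, not something `HtmlDecode` does by itself.

```csharp
using System;
using System.Net;

public static class DecodeSketch
{
    // Decodes HTML entities and trims surrounding whitespace.
    public static string Clean(string innerText)
    {
        return WebUtility.HtmlDecode(innerText).Trim();
    }

    public static void Main()
    {
        // Stand-in for node.InnerText taken from a parsed page.
        var raw = "  Fish &amp; chips &lt;3  ";
        Console.WriteLine(Clean(raw));
    }
}
```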

However, HTML Agility Pack only reads the HTML code of the page; it does not execute the associated JavaScript. This is a problem with modern dynamic pages, where the initially loaded HTML (which is sometimes practically empty) is later modified by JavaScript.

One possible solution is to load the web page into an embedded browser control (for example, WebBrowser or WebView2), which does execute the page's scripts, and then parse the resulting content with HTML Agility Pack.

In any case, it is a useful tool for a multitude of applications and well worth keeping in our collection of favorite libraries.