PHP HTML parser: web page processing tools in PHP

Developers increasingly need to process or save information from third-party sites. The reasons vary: filling a site with content, updating technical specifications in an online store, downloading assorted data, and so on.

At such moments, the developer has a question: which library is best for writing a PHP HTML parser?

I faced this question myself and tried to find the answer. Now I’ll tell you which library I think is the best for parsing.

PHP HTML parser standard modules: XPath and DOM

These are standard modules that have been built into PHP since version 5. The big plus is that there is no need to install any additional libraries. Many believe this is what makes them among the best parsing tools.

It may seem that XPath and DOM are complicated to learn, but I assure you that they are not. Once you understand the basics, XPath can become your number one tool.

Another plus is that the power of XPath lies in its axes (the foundation of the language), which make it possible to reach any node in the document.
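To make the idea of axes concrete, here is a minimal sketch that uses the following-sibling and parent axes. The markup and the class name "first" are made up for illustration only.

```php
<?php
// Hypothetical markup used only for this illustration.
$html = '<ul><li class="first">One</li><li>Two</li><li>Three</li></ul>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// following-sibling axis: every li that comes after the li with class "first".
$siblings = [];
foreach ($xpath->query('//li[@class="first"]/following-sibling::li') as $li) {
    $siblings[] = $li->textContent;
}

// parent axis: the element that directly contains that li.
$parentName = $xpath->query('//li[@class="first"]/parent::*')->item(0)->nodeName;

print_r($siblings);
echo $parentName, PHP_EOL;
```

Other axes such as ancestor, descendant, and preceding-sibling work the same way: they describe the direction in which to move from the current node.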

A typical task for XPath and DOM: find all img tags in the markup and extract their src attributes.
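A minimal sketch of such a script might look like this. The sample markup is hypothetical, and the choice of img tags is an assumption based on the src attribute; a real script would load a fetched page instead.

```php
<?php
// Hypothetical sample markup; replace with the HTML of a real page.
$html = '<html><body><img src="a.png" alt="A"><p>text</p><img src="b.jpg"></body></html>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$sources = [];
// "//img/@src" selects the src attribute node of every img element.
foreach ($xpath->query('//img/@src') as $attr) {
    $sources[] = $attr->nodeValue;
}

print_r($sources);
```

Only classes bundled with PHP are used here (DOMDocument and DOMXPath), so nothing has to be installed.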

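Of course, this tool is not without its disadvantages. Since the engine is designed mainly to work with XML, there are some nuances. For example, when markup is loaded as XML, all tags must be closed; loading it as HTML is more forgiving, but malformed markup still produces parser warnings.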

PHP Simple HTML DOM Parser

This is a PHP library that allows you to parse an HTML page with jQuery-like selectors.

A distinctive feature of the library is its ability to work with invalid HTML code.

Encoding is rarely a problem, as automatic conversion is performed.

PHP Simple HTML DOM Parser can find and filter nested elements, access their attributes, and select individual logical parts of the markup.

A typical example: find all the links on the page, then do whatever you want with each one inside a foreach loop.
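A sketch of that loop could look as follows. It assumes the library file simple_html_dom.php sits next to the script (the path is an assumption), and the markup is made up for illustration.

```php
<?php
// Assumption: simple_html_dom.php from the library's distribution is in the same directory.
require_once 'simple_html_dom.php';

// Hypothetical markup; file_get_html() could load a URL or a file instead.
$html = str_get_html('<div><a href="/one">One</a> <a href="/two">Two</a></div>');

$hrefs = [];
foreach ($html->find('a') as $link) {
    // Attributes are exposed as properties on each element.
    $hrefs[] = $link->href;
}

print_r($hrefs);

// Free the DOM tree when done to release memory.
$html->clear();
```

The selector passed to find() follows the jQuery-like syntax mentioned above, so narrower queries such as 'div a' work the same way.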

This library is not without drawbacks, either. The main one is that it has the lowest processing speed among its main competitors, although this is noticeable only with large amounts of data.

In general, the Simple HTML DOM library is very simple. Its documentation is clear and convenient, which is important for a beginner. So it may be the best option for someone writing their first parser.

phpQuery

Like Simple HTML DOM, phpQuery is a PHP library modeled on jQuery.

This library works with the DOM. phpQuery is one of the fastest libraries, and it can work with attributes, selectors, events, and even Ajax.

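phpQuery code reads much like jQuery: you load a document, select elements with CSS selectors, and read or change their attributes.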
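As a rough sketch, collecting link addresses with phpQuery might look like this. It assumes the original phpQuery distribution is unpacked in a phpQuery directory next to the script and that your PHP version can still run it; the markup is hypothetical.

```php
<?php
// Assumption: the phpQuery distribution is unpacked in ./phpQuery.
require_once 'phpQuery/phpQuery.php';

// Hypothetical markup used only for this illustration.
$doc = phpQuery::newDocumentHTML('<div><a href="/one">One</a> <a href="/two">Two</a></div>');

$hrefs = [];
foreach ($doc->find('a') as $a) {
    // Iteration yields plain DOMElement nodes; pq() wraps one for jQuery-style calls.
    $hrefs[] = pq($a)->attr('href');
}

print_r($hrefs);
```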

I found almost nothing about the library's downsides. The only thing I can highlight is that phpQuery was last updated back in 2009, although you can still find PHP HTML parser examples that use it.

PHP getopt

PHP getopt is a useful tool for parsing arguments and options passed to a script from the command line. PHP ships with a built-in getopt() function for this.
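Here is a minimal sketch using the built-in getopt() function. The option names (-u/--url and -v/--verbose) are hypothetical and chosen only to show a value option and a flag.

```php
<?php
// Hypothetical options: -u/--url requires a value, -v/--verbose is a flag.
$options = getopt('u:v', ['url:', 'verbose']);

// A value option holds the string that was passed; fall back to null if absent.
$url = $options['u'] ?? $options['url'] ?? null;

// A flag appears as a key (with the value false) when present, so isset() detects it.
$verbose = isset($options['v']) || isset($options['verbose']);

if ($verbose) {
    echo 'Target: ', $url ?? '(none given)', PHP_EOL;
}
```

The script could then be called as php script.php -u https://example.com -v, where script.php is a placeholder name.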

Conclusion

I’ve come to the conclusion that it’s better to use the phpQuery library.

If you have very simple tasks, it is more logical to use standard PHP modules or regular expressions.

There are dozens of other parsing libraries and related tools (for example, cURL-based fetchers and HTML5 parsers), but of everything I found, the ones on this list seemed the most interesting to me.