Extracting information from a web page by machine learning


First, your task fits into the information extraction area of research. There are mainly 2 levels of complexity for this task:

  • extract from a given HTML page or a website with a fixed template (like Amazon). In this case the best way is to look at the HTML code of the pages and craft the corresponding XPath or DOM selectors to get to the right info. The disadvantage of this approach is that it does not generalize to new websites, since you have to do it for each website one by one.
  • create a model that extracts the same information from many websites within one domain (assuming that there is some inherent regularity in the way web designers present the corresponding attribute, like a zip code or a phone number). In this case you should create some features (to use an ML approach and let the IE algorithm "understand the content of pages"). The most common features are: DOM path, the format of the value (attribute) to be extracted, layout (bold, italic, etc.), and surrounding context words. You label some values (you need at least 100-300 pages, depending on the domain, to get reasonable quality). Then you train a model on the labelled pages. There is also an alternative: do IE in an unsupervised manner (leveraging the idea of information regularity across pages). In this case your algorithm tries to find repetitive patterns across pages (without labelling) and treats the most frequent ones as valid.
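
To make the first option concrete, here is a minimal sketch of hand-written selectors for one fixed template. The page fragment, the class names and the `extract` helper are all made up for illustration. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, so a real page would most likely need a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a fixed-template page.
PAGE = """
<html>
  <body>
    <div id="contact">
      <span class="phone">555-0100</span>
      <span class="zip">90210</span>
    </div>
  </body>
</html>
"""

# One hand-written selector per attribute; valid for this template only.
SELECTORS = {
    "phone": ".//div[@id='contact']/span[@class='phone']",
    "zip": ".//div[@id='contact']/span[@class='zip']",
}

def extract(page_source, selectors):
    """Apply each selector and return the text of the first match (or None)."""
    root = ET.fromstring(page_source)
    result = {}
    for name, path in selectors.items():
        node = root.find(path)
        result[name] = node.text.strip() if node is not None and node.text else None
    return result

record = extract(PAGE, SELECTORS)
print(record)
```

A second website would need its own `SELECTORS` table, which is exactly the generalization problem described above.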

The most challenging part overall will be working with the DOM tree and generating the right features. Labelling the data correctly is also a tedious task. For ML models, have a look at CRF, 2D CRF, and semi-Markov CRF.
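
As a rough illustration of what "generating features from the DOM tree" can look like, the sketch below walks a page with the standard library's `html.parser` and emits one feature dict per text node: DOM path, a bold flag, the character "shape" of the value, and a few preceding context words. The class name, the feature names and the three-word context window are my own assumptions, not a standard feature set, and the resulting dicts would still have to be labelled and fed to a model such as a CRF.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag
BOLD_TAGS = {"b", "strong"}

class FeatureCollector(HTMLParser):
    """Collect one feature dict per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open tags from the root down to the current node
        self.rows = []        # collected feature dicts
        self.prev_words = []  # words seen so far, used as left context

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.rows.append({
            "text": text,
            "dom_path": "/".join(self.stack),
            "is_bold": any(t in BOLD_TAGS for t in self.stack),
            # Value format: digits become 9, letters become a.
            "shape": re.sub(r"[A-Za-z]", "a", re.sub(r"\d", "9", text)),
            "context": list(self.prev_words[-3:]),
        })
        self.prev_words.extend(text.lower().split())

def page_features(html):
    collector = FeatureCollector()
    collector.feed(html)
    collector.close()
    return collector.rows

rows = page_features("<html><body><div><p>Zip code: <b>90210</b></p></div></body></html>")
for row in rows:
    print(row)
```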

And finally, in the general case this is the cutting edge of IE research and not a hack that you can do in a few evenings.

P.S. I also think NLTK will not be very helpful - it is an NLP library, not a Web-IE library.


tl;dr: The problem might be solvable using ML, but it's not straightforward if you're new to the topic.


There are a lot of machine learning libraries for Python:

  • Scikit-learn is a very popular general-purpose library, suitable for beginners and great for simple problems with smallish datasets.
  • Natural Language Toolkit (NLTK) has implementations of lots of algorithms, many of which are language-agnostic (say, n-grams).
  • Gensim is great for text topic modelling.
  • OpenCV implements some common algorithms (but is usually used for images).
  • spaCy and Transformers implement modern (state-of-the-art, as of 2020) text NLU (Natural Language Understanding) techniques, but require more familiarity with the complex techniques.

Usually you pick a library that suits your problem and the technique you want to use.

Machine learning is a vast area. Just for the supervised-learning classification subproblem, and considering only "simple" classifiers, there's Naive Bayes, KNN, Decision Trees, Support Vector Machines, feed-forward neural networks... The list goes on and on. This is why, as you say, there are no "quickstarts" or tutorials for machine learning in general. My advice here is, firstly, to understand the basic ML terminology; secondly, to understand a subproblem (I'd advise classification within supervised learning); and thirdly, to study a simple algorithm that solves this subproblem (KNN relies on high-school-level math).
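
To show how little math KNN needs, here is a minimal sketch in plain Python (3.8+ for `math.dist`). The two features (fraction of digits, length divided by 10) and the six training tokens are invented for illustration; a real classifier would need far more labelled examples and better features.

```python
import math
from collections import Counter

def featurize(token):
    """Map a token to a small feature vector: (digit fraction, length / 10)."""
    digits = sum(ch.isdigit() for ch in token)
    return (digits / len(token), len(token) / 10)

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label). Majority vote among the k nearest."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy labelled tokens (made up for this example).
LABELLED = [
    ("10001", "postal code"),
    ("12345-6789", "postal code"),
    ("606011", "postal code"),
    ("page", "other"),
    ("contact", "other"),
    ("street42", "other"),
]
TRAIN = [(featurize(token), label) for token, label in LABELLED]

print(knn_predict(TRAIN, featurize("90210")))
print(knn_predict(TRAIN, featurize("hello")))
```

The whole algorithm is a distance computation, a sort and a vote, which makes it a reasonable first classifier to study.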

About your problem in particular: it seems you want to detect the existence of a piece of data (a postal code) inside a huge dataset (text). A classic classification algorithm expects a relatively small feature vector. To obtain that, you will need to do what's called dimensionality reduction: here this means isolating the parts that look like potential postal codes. Only then does the classification algorithm classify each of them (as "postal code" or "not postal code", for example).

Thus, you need to find a way to isolate potential matches before you even think about using ML to approach this problem. This will most certainly entail natural language processing, as you said, if you don't or can't use regex or parsing.
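
One possible way to isolate candidates is sketched below. It assumes that a deliberately loose regular expression is acceptable for the candidate step only, with the final decision left to a classifier. The pattern (any run of 4 to 6 digits), the cue-word list and the feature names are assumptions for illustration and do not match any particular country's postal code format.

```python
import re

# Deliberately loose: any standalone run of 4 to 6 digits is a candidate.
CANDIDATE = re.compile(r"\b\d{4,6}\b")
CUE_WORDS = {"zip", "postal", "postcode", "code"}  # assumed cue vocabulary

def candidates_with_features(text, window=3):
    """Return one small feature dict per candidate found in the text."""
    rows = []
    for match in CANDIDATE.finditer(text):
        # The last few words before the candidate serve as context.
        before = re.findall(r"[a-z]+", text[:match.start()].lower())[-window:]
        rows.append({
            "value": match.group(),
            "length": len(match.group()),
            "cue_word_nearby": int(any(word in CUE_WORDS for word in before)),
        })
    return rows

sample = "Call 5550100 before 1700. Our postal code is 90210."
found = candidates_with_features(sample)
for row in found:
    print(row)
```

Each dict is now small enough to hand to a classic classifier, which decides between "postal code" and "not postal code".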

More advanced models in NLU could potentially parse your whole text, but they might require very large amounts of pre-classified data, and explaining them is outside of the scope of this question. The libraries I've mentioned earlier are a good start.


As far as I know, there are two ways to do this task using a machine learning approach.

1. Using computer vision to train the model and then extract the content based on your use case. This has already been implemented by diffbot.com, and they have not open-sourced their solution.

2. The other way to approach this problem is to use supervised machine learning to train a binary classifier that separates content from boilerplate, and then extract the content. This approach is used in dragnet and in other research in this area. You can have a look at a benchmark comparison of different content extraction techniques.
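
To give a feel for the second approach, the sketch below splits a page into blocks and computes two simple per-block features, word count and link density, that a content-vs-boilerplate classifier could be trained on. This is not dragnet's actual feature set or API. The block tags, feature names and the hand-set thresholds in `looks_like_content` are assumptions that stand in for a trained binary classifier.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}

class BlockStats(HTMLParser):
    """Split a page into blocks and record word count and link density."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.in_link = 0
        self.current = {"words": [], "link_words": 0}

    def flush_block(self):
        words = self.current["words"]
        if words:
            self.blocks.append({
                "text": " ".join(words),
                "n_words": len(words),
                "link_density": self.current["link_words"] / len(words),
            })
        self.current = {"words": [], "link_words": 0}

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        words = data.split()
        self.current["words"].extend(words)
        if self.in_link:
            self.current["link_words"] += len(words)

def block_features(html):
    parser = BlockStats()
    parser.feed(html)
    parser.close()
    parser.flush_block()  # emit any trailing text
    return parser.blocks

def looks_like_content(block):
    # Hand-set thresholds standing in for a trained classifier.
    return block["n_words"] >= 5 and block["link_density"] < 0.5

html = (
    '<div><a href="/">Home</a> <a href="/about">About</a></div>'
    "<p>This article explains how postal codes are extracted from pages.</p>"
)
blocks = block_features(html)
for block in blocks:
    print(looks_like_content(block), block)
```

Navigation bars tend to be short and almost entirely links, while article text is longer with few links, which is why these two features are a common starting point.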