Extracting information from web pages by machine learning

Question:

I would like to extract a specific type of information from web pages in Python. Let’s say postal addresses. They come in thousands of forms, but are still somehow recognizable. As there is such a large number of forms, it would probably be very difficult to write regular expressions, or even something like a grammar, and use a parser generator to parse them out.

So I think machine learning is the way to go. If I understand it correctly, I should be able to prepare a sample of data in which I point out what the result should be, and then have something that learns from it how to recognize the result by itself. This is all I know about machine learning. Maybe I could use some natural language processing, but probably not much, as most libraries work with English and I need this for Czech.

Questions:

  1. Can I solve this problem easily with machine learning? Is it a good way to go?
  2. Are there any simple examples that would allow me to start? I am a machine learning noob and I need something practical to start with; the closer to my problem the better, and the simpler the better.
  3. There are plenty of Python libraries for machine learning. Which one would suit my problem best?
  4. Many such libraries have docs that are not easy to use, as they come from a scientific environment. Are there any good sources (books, articles, quickstarts) bridging the gap, i.e. focused on newbies who know nothing at all about machine learning? Every doc I open starts with terms I don’t understand, such as network, classification, datasets, etc.

Update:

As you all mentioned, I should show a sample of the data I am trying to get out of the web, so here is an example. I am interested in cinema showtimes. They can look like this (here are three of them):

<div class="Datum" rel="d_0">27. června – středa, 20.00
</div><input class="Datum_cas" id="2012-06-27" readonly=""><a href="index.php?den=0" rel="0" class="Nazev">Zahájení letního kina 
</a><div style="display: block;" class="ajax_box d-0">
<span class="ajax_box Orig_nazev">zábava • hudba • film • letní bar
</span>
<span class="Tech_info">Svět podle Fagi
</span>
<span class="Popis">Facebooková  komiksová Fagi v podání divadla DNO. Divoké písně, co nezařadíte, ale slušně si na ně zařádíte. Slovní smyčky, co se na nich jde oběsit. Kabaret, improvizace, písně, humor, zběsilost i v srdci.<br>Koncert Tres Quatros Kvintet. Instrumentální muzika s pevným funkovým groovem, jazzovými standardy a neodmyslitelnými improvizacemi.
</span>
<input class="Datum_cas" id="ajax_0" type="text">
</div>

<div class="Datum" rel="d_1">27. června – středa, 21.30
</div><input class="Datum_cas" id="2012-06-27" readonly=""><a href="index.php?den=1" rel="1" class="Nazev">Soul Kitchen
</a><div style="display: block;" class="ajax_box d-1">
<span class="ajax_box Orig_nazev">Soul Kitchen
</span>
<span class="Tech_info">Komedie, Německo, 2009, 99 min., čes. a angl. tit.
</span>
<span class="Rezie">REŽIE: Fatih Akin 
</span>
<span class="Hraji">HRAJÍ: Adam Bousdoukos, Moritz Bleibtreu, Birol Ünel, Wotan Wilke Möhring
</span>
<span class="Popis">Poslední film miláčka publika Fatiho Akina, je turbulentním vyznáním lásky multikulturnímu Hamburku. S humorem zde Akin vykresluje příběh Řeka žijícího v Německu, který z malého bufetu vytvoří originální restauraci, jež se brzy stane oblíbenou hudební scénou. "Soul Kitchen" je skvělá komedie o přátelství, lásce, rozchodu a boji o domov, který je třeba v dnešním nevypočitatelném světě chránit víc než kdykoliv předtím. Zvláštní cena poroty na festivalu v Benátkách
</span>
<input class="Datum_cas" id="ajax_1" type="text">
</div>

<div class="Datum" rel="d_2">28. června – čtvrtek, 21:30
</div><input class="Datum_cas" id="2012-06-28" readonly=""><a href="index.php?den=2" rel="2" class="Nazev">Rodina je základ státu
</a><div style="display: block;" class="ajax_box d-2">
<span class="Tech_info">Drama, Česko, 2011, 103 min.
</span>
<span class="Rezie">REŽIE: Robert Sedláček
</span>
<span class="Hraji">HRAJÍ: Igor Chmela, Eva Vrbková, Martin Finger, Monika A. Fingerová, Simona Babčáková, Jiří Vyorálek, Jan Fišar, Jan Budař, Marek Taclík, Marek Daniel
</span>
<span class="Popis">Když vám hoří půda pod nohama, není nad rodinný výlet. Bývalý učitel dějepisu, který dosáhl vysokého manažerského postu ve významném finančním ústavu, si řadu let spokojeně žije společně se svou rodinou v luxusní vile na okraji Prahy. Bezstarostný život ale netrvá věčně a na povrch začnou vyplouvat machinace s penězi klientů týkající se celého vedení banky. Libor se následně ocitá pod dohledem policejních vyšetřovatelů, kteří mu začnou tvrdě šlapat na paty. Snaží se uniknout před hrozícím vězením a oddálit osvětlení celé situace své nic netušící manželce. Rozhodne se tak pro netradiční útěk, kdy pod záminkou společné dovolené odveze celou rodinu na jižní Moravu…  Rodinný výlet nebo zoufalý úprk před spravedlností? Igor Chmela, Eva Vrbková a Simona Babčáková v rodinném dramatu a neobyčejné road-movie inspirované skutečností.
</span>

Or like this:

<strong>POSEL&nbsp;&nbsp; 18.10.-22.10 v 18:30 </strong><br>Drama. ČR/90´. Režie: Vladimír Michálek Hrají: Matěj Hádek, Eva Leinbergerová, Jiří Vyorávek<br>Třicátník Petr miluje kolo a své vášni podřizuje celý svůj život. Neplánuje, neplatí účty, neřeší nic, co může<br>počkat  do zítra. Budování společného života s přételkyní je mu proti srsti  stejně jako dělat kariéru. Aby mohl jezdit na kole, raději pracuje jako  poslíček. Jeho život je neřízená střela, ve které neplatí žádná  pravidla. Ale problémy se na sebe na kupí a je stále těžší před nimi  ujet …<br> <br>

<strong>VE STÍNU&nbsp; 18.10.-24.10. ve 20:30 a 20.10.-22.10. též v 16:15</strong><br>Krimi. ČR/98´. Režie: D.Vondříček Hrají: I.Trojan, S.Koch, S.Norisová, J.Štěpnička, M.Taclík<br>Kapitán  Hakl (Ivan Trojan) vyšetřuje krádež v klenotnictví. Z běžné vloupačky  se ale vlivem zákulisních intrik tajné policie začíná stávat politická  kauza. Z nařízení Státní bezpečnosti přebírá Haklovo vyšetřování major  Zenke (Sebastian Koch), policejní specialista z NDR, pod jehož vedením  se vyšetřování ubírá jiným směrem, než Haklovi napovídá instinkt  zkušeného kriminalisty. Na vlastní pěst pokračuje ve vyšetřování. Může  jediný spravedlivý obstát v boji s dobře propojenou sítí komunistické  policie?&nbsp; Protivník je silný a Hakl se brzy přesvědčuje, že věřit nelze  nikomu a ničemu. Každý má svůj stín minulosti, své slabé místo, které  dokáže z obětí udělat viníky a z viníků hrdiny. <br><br>

<strong>ASTERIX A OBELIX VE SLUŽBÁCH JEJÍHO VELIČENSTVA&nbsp; ve 3D&nbsp;&nbsp;&nbsp; 20.10.-21.10. ve 13:45 </strong><br>Dobrodružná fantazy. Fr./124´. ČESKÝ DABING. Režie: Laurent Tirard<br>Hrají: Gérard Depardieu, Edouard Baer, Fabrice Luchini<br>Pod  vedením Julia Caesara napadly proslulé římské legie Británii. Jedné  malé vesničce se však daří statečně odolávat, ale každým dnem je slabší a  slabší. Britská královna proto vyslala svého věrného důstojníka  Anticlimaxe, aby vyhledal pomoc u Galů v druhé malinké vesničce ve  Francii vyhlášené svým důmyslným bojem proti Římanům… Když Anticlimax  popsal zoufalou situaci svých lidí, Galové mu darovali barel svého  kouzelného lektvaru a Astérix a Obélix jsou pověřeni doprovodit ho domů.  Jakmile dorazí do Británie, Anticlimax jim představí místní zvyky ve  vší parádě a všichni to pořádně roztočí! Vytočený Caesar se však  rozhodne naverbovat Normanďany, hrůzu nahánějící bojovníky Severu, aby  jednou provždy skoncovali s Brity. <br><br>

Or it can look like almost anything similar to these. There are no special rules in the HTML markup, no special rules in the ordering, etc.

Asked By: Honza Javorek


Answers:

tl;dr: The problem might be solvable using ML, but it’s not straightforward if you’re new to the topic.


There are a lot of machine learning libraries for Python:

  • Scikit-learn is a very popular general-purpose library, good for beginners and great for simple problems with smallish datasets.
  • Natural Language Toolkit (NLTK) has implementations of lots of algorithms, many of which are language agnostic (say, n-grams).
  • Gensim is great for topic modelling on text.
  • OpenCV implements some common algorithms (but is usually used for images).
  • spaCy and Transformers implement modern (state-of-the-art, as of 2020) NLU (Natural Language Understanding) techniques, but require familiarity with more complex techniques.

Usually you pick a library that suits your problem and the technique you want to use.
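To give a feel for the last of those, here is what named-entity extraction looks like in spaCy. This is only a sketch with a pretrained English pipeline (en_core_web_sm, which you would need to download first); there is no official Czech model, so treat it as showing the shape of the API rather than a solution for Czech text:

    import spacy

    # assumes the small English model was installed via:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("The cinema at 10 Downing Street opens June 27 at 20:00.")
    for ent in doc.ents:
        print(ent.text, ent.label_)  # entities with their predicted types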

Machine learning is a very vast area. Just for the supervised-learning classification subproblem, and considering only “simple” classifiers, there’s Naive Bayes, KNN, decision trees, support vector machines, feed-forward neural networks… The list goes on and on. This is why, as you say, there are no “quickstarts” or tutorials for machine learning in general. My advice here is, first, to understand the basic ML terminology; second, to understand a subproblem (I’d advise classification within supervised learning); and third, to study a simple algorithm that solves this subproblem (KNN relies only on high-school-level math).
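To make that concrete, here is a minimal scikit-learn sketch of supervised classification with KNN. The feature vectors and labels are made-up stand-ins, just to show the fit/predict workflow:

    from sklearn.neighbors import KNeighborsClassifier

    # toy feature vectors, e.g. [digit_count, letter_count] per text snippet
    X_train = [[5, 1], [6, 0], [1, 7], [0, 9]]
    y_train = ["postal_code", "postal_code", "word", "word"]

    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)

    print(clf.predict([[5, 0]]))  # -> ['postal_code']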

About your problem in particular: it seems you want to detect the existence of a piece of data (a postal code) inside a huge dataset (text). A classic classification algorithm expects a relatively small feature vector. To obtain that, you will need to do what’s called a dimensionality reduction: this means isolating the parts that look like potential postal codes. Only then does the classification algorithm classify them (as “postal code” or “not postal code”, for example).

Thus, you need to find a way to isolate potential matches before you even think about using ML to approach this problem. This will most certainly entail natural language processing, as you said, if you don’t or can’t use regex or parsing.
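One way to picture the “isolate candidates first” step: a cheap regular expression proposes spans that might be Czech postal codes (five digits, conventionally written “123 45”), and only those short spans, turned into small feature vectors, would ever reach a classifier. The features below are illustrative guesses, not a recommended set:

    import re

    # five digits, usually written "123 45" in Czech addresses
    CANDIDATE = re.compile(r"\b\d{3}\s?\d{2}\b")

    def candidate_features(text, match):
        """A tiny feature vector for one candidate span."""
        after = text[match.end():match.end() + 20].lower()
        return {
            "has_space": " " in match.group(),
            "followed_by_city": any(c in after for c in ("praha", "brno", "ostrava")),
        }

    text = "Kino Aero, Biskupcova 31, 130 00 Praha 3"
    for m in CANDIDATE.finditer(text):
        print(m.group(), candidate_features(text, m))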

More advanced NLU models could potentially parse your whole text, but they might require very large amounts of pre-classified data, and explaining them is outside the scope of this question. The libraries I mentioned earlier are a good start.

Answered By: loopbackbee

I would suggest you look at the field of information extraction. A lot of people have researched how to do exactly what you’re asking. Some information extraction techniques are machine learning based, and some are not.

It is hard to comment further without looking at examples representative of the problem you want to solve (how does a postal address look in Czech?).

Answered By: carlosdc

The approach needs to be a supervised learning algorithm (typically, these yield much better results than unsupervised or semi-supervised methods). Also, notice that you basically need to extract chunks of text. Intuitively, your algorithm needs to say something like, “from this character onward, for the next three lines, is a postal address”.

I feel that a natural way to approach this is a combination of word-level and character-level n-gram language models. The modelling itself can be insanely sophisticated. As pointed out by mcstar, Cross Validated is a better place to get into those details.
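As a small illustration of the character-level side, scikit-learn’s CountVectorizer can turn strings into character trigram counts; a date line and a film title produce very different distributions, which is what a downstream model would exploit:

    from sklearn.feature_extraction.text import CountVectorizer

    # character trigrams as features (a building block, not a full model)
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
    samples = ["27. června – středa, 20.00", "Soul Kitchen"]
    X = vectorizer.fit_transform(samples)

    print(vectorizer.get_feature_names_out()[:10])
    print(X.toarray())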

Answered By: Chthonic Project

First, your task fits into the information extraction (IE) area of research. There are mainly two levels of complexity for this task:

  • extract from a given HTML page or from a website with a fixed template (like Amazon). In this case the best way is to look at the HTML code of the pages and craft the corresponding XPath or DOM selectors to get to the right info (see the lxml/XPath sketch after this list). The disadvantage of this approach is that it does not generalize to new websites, since you have to do it for each website one by one.
  • create a model that extracts the same information from many websites within one domain (assuming that there is some inherent regularity in the way web designers present the corresponding attribute, like a zip code or a phone number or whatever else). In this case you should create some features (to use the ML approach and let the IE algorithm “understand the content of pages”). The most common features are: the DOM path, the format of the value (attribute) to be extracted, the layout (bold, italic, etc.), and the surrounding context words. You label some values (you need at least 100–300 pages, depending on the domain, to do it with reasonable quality), then train a model on the labelled pages. There is also an alternative: doing IE in an unsupervised manner, leveraging the idea of information regularity across pages. In this case you (or your algorithm) try to find repetitive patterns across pages (without labelling) and consider as valid those that are the most frequent.

Overall, the most challenging part will be working with the DOM tree and generating the right features. Labelling data in the right way is also a tedious task. For ML models, have a look at CRFs, 2D CRFs, and semi-Markov CRFs.
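For a sense of what “the right features” means for a CRF, here is a hedged sketch using the sklearn-crfsuite library (my choice of implementation, not one named above): each token becomes a dictionary of features, and each token gets a tag. The tiny hand-labelled example is for illustration only:

    import sklearn_crfsuite

    def token_features(tokens, i):
        # per-token features: surface form, shape, and left context
        return {
            "lower": tokens[i].lower(),
            "is_digit": tokens[i].isdigit(),
            "is_title": tokens[i].istitle(),
            "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        }

    tokens = ["REŽIE", ":", "Fatih", "Akin"]
    X = [[token_features(tokens, i) for i in range(len(tokens))]]
    y = [["O", "O", "B-DIRECTOR", "I-DIRECTOR"]]  # hand-labelled tags

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=20)
    crf.fit(X, y)
    print(crf.predict(X))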

And finally, in the general case this is the cutting edge of IE research, not a hack that you can do in a few evenings.

P.S. I also think NLTK will not be very helpful here; it is an NLP library, not a Web-IE library.

Answered By: Nik

I built a solution for exactly this. My goal was to extract all the information related to competitions available on the internet, and I used a trick: I detected the pattern in which the information is listed on a website. In my case, the entries were listed one below the other, so I detected that using the HTML table tags and extracted the information related to the competitions.

While it is a good solution, it works for some sites, and for others the same code won’t work. But you only have to change a few parameters in the same code to make it work.
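A hedged reconstruction of that idea in a few lines: find the repeating HTML unit (table rows, in this case) and pull the same fields out of each repetition. The file name, and the choice of tr/td as the repeating unit, are exactly the “parameters” that change from site to site:

    from bs4 import BeautifulSoup

    # "listing.html" stands in for a saved page that lists items in a table
    soup = BeautifulSoup(open("listing.html", encoding="utf-8").read(), "html.parser")

    for row in soup.find_all("tr"):  # the repeating unit on such a site
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            print(cells)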

Answered By: Harsh Gupta

Firstly, machine learning is not magic. These algorithms perform specific tasks, even if they can sometimes be a bit complex.

The basic approach to any such task is to generate some reasonably representative labelled data, so that you can evaluate how well you are doing. BIO tags could work here: for each word, you assign the label “O” (outside) if it is not something you’re looking for, “B” (beginning) if it is the start of an address, and “I” (inside) for all subsequent words (or numbers or whatever) in the address.
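Concretely, a BIO-labelled line might look like this (the address is illustrative):

    # one token per line, with its BIO label
    tokens = ["Kino", "Aero", ",", "Biskupcova", "31", ",", "Praha", "3"]
    labels = ["O",    "O",    "O", "B",          "I",  "I", "I",     "I"]

    for token, label in zip(tokens, labels):
        print(f"{label}\t{token}")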

The second step is to think about how you want to evaluate your success. Is it important that you discover most of an address, or do you also need to know exactly what each piece is (postcode, street, city, etc.)? That changes what you count as an error.

If you want your named-entity recogniser to work well, you have to know your data well and decide on the best tool for the job. This may very well be a series of regular expressions with some rules on how to combine the results. I expect you’ll be able to find most of the data with relatively simple programmes. Once you have something simple that works, check out the false positives (things that turned out not to be what you were looking for) and the false negatives (things you missed), and look for patterns. If you see something that you can fix easily, try it out. A huge advantage of regexes is that it is much easier not only to recognise something as part of an address, but also to detect which part it is.
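For instance, a deliberately simple pattern for the Czech “123 45” postal format makes its own failures easy to study; the third line below produces a false positive (a phone-number fragment) that you would then write a rule against:

    import re

    zip_re = re.compile(r"\b\d{3}\s?\d{2}\b")

    lines = [
        "130 00 Praha 3",                 # true positive
        "Drama, Česko, 2011, 103 min.",   # correctly no match
        "tel. 224 93 56",                 # false positive: phone fragment
    ]
    for line in lines:
        print(line, "->", zip_re.findall(line))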

If you want to move beyond that, you may find that many NLP methods don’t perform well on your data, since “Natural Language Processing” usually needs something that looks like (you guessed it) Natural Language to recognise what something is.

Alternatively, since you can view it as a chunking problem, you might use maximum-entropy Markov models (MEMMs). These use the probabilities of transitioning from one type of word to another to chunk text, in this case into “part of an address” and “not part of an address”.
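A toy version of that idea, hedged: a maximum-entropy classifier whose features include the previous label, decoded greedily over the tokens. This uses NLTK’s MaxentClassifier with four hand-made training examples purely for illustration:

    from nltk.classify import MaxentClassifier

    def feats(word, prev_label):
        # the previous label as a feature is what makes this MEMM-like
        return {"word": word.lower(), "is_digit": word.isdigit(), "prev": prev_label}

    # tiny hand-labelled training set: ADDR = part of an address
    train = [
        (feats("Biskupcova", "O"), "ADDR"),
        (feats("31", "ADDR"), "ADDR"),
        (feats("Drama", "O"), "O"),
        (feats("2011", "O"), "O"),
    ]
    clf = MaxentClassifier.train(train, algorithm="iis", max_iter=5, trace=0)

    # greedy decoding: feed each predicted label into the next step
    prev = "O"
    for w in ["Biskupcova", "31"]:
        prev = clf.classify(feats(w, prev))
        print(w, prev)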

Good luck!

Answered By: marianne

As far as I know, there are two ways to do this task using a machine learning approach:

1. Using computer vision to train a model and then extract the content based on your use case. This has already been implemented by diffbot.com, and they have not open-sourced their solution.

2. The other way to approach this problem is to use supervised machine learning to train a binary classifier that separates content from boilerplate, and then extract the content. This approach is used in dragnet and in other research in this area. You can have a look at benchmark comparisons of different content extraction techniques.
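If dragnet fits your case, usage is roughly this simple; the URL is a placeholder, and you should check the project’s README for the current API and any model files it needs:

    import requests
    from dragnet import extract_content

    # placeholder URL; dragnet strips boilerplate and returns the main content
    html = requests.get("http://example.com/cinema-program").text
    print(extract_content(html))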

Answered By: Nitin Panwar