========================
Named entity recognition
========================
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify elements of text into pre-defined categories such as the names of persons, organizations and locations.
The `estnltk` package comes with pre-trained NER models for Python 2.7 and Python 3.4.
In this tutorial you will learn how to use `estnltk`'s out-of-the-box NER utilities and how to build your own NER models from scratch.
Getting started with NER
========================
The pre-trained models that ship with `estnltk` distinguish three types of entities: person names, organizations and locations.
A quick example below demonstrates how to extract named entities from the raw text::
from estnltk import Text
from pprint import pprint

# Build a Text instance from the raw text to be analysed
text = Text('<raw Estonian text>')

# Extract named entities
pprint(text.named_entities)
::
['Eesti vabariik',
'põhi',
'Euroopa',
...
'Toomas Hendrik Ilves']
When accessing the property :py:attr:`~estnltk.text.Text.named_entities` of a :py:class:`~estnltk.text.Text` instance, `estnltk` executes the whole text processing pipeline in the background, including tokenization, morphological analysis and named entity extraction.
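If you need more control over the individual steps, the same pipeline can also be run explicitly. The snippet below is only a sketch: the method names `tag_analysis` and `tag_named_entities` are assumptions about the `estnltk` API and may differ between versions, while `tokenize_words` is the same method used later in this tutorial::

text2 = Text('<raw Estonian text>')

# the named_entities property triggers these steps implicitly:
text2.tokenize_words()         # tokenization
text2.tag_analysis()           # morphological analysis (assumed method name)
text2.tag_named_entities()     # named entity recognition (assumed method name)

pprint(text2.named_entities)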
The class :class:`~estnltk.text.Text` additionally provides a number of useful methods to get more information on the extracted entities::
pprint(list(zip(text.named_entities, text.named_entity_labels, text.named_entity_spans)))
::
[('Eesti vabariik', 'LOC', (0, 14)),
('Põhi Euroopa', 'LOC', (25, 37)),
('Eesti', 'LOC', (46, 51)),
...
('Toomas Hendrik Ilves', 'PER', (320, 340))]
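Note that :py:attr:`~estnltk.text.Text.named_entities` returns normalized (lemma-based) entity names such as `Eesti vabariik`, while the spans point into the original string. Assuming the raw string is available through the `text.text` attribute, the surface forms can be recovered from the spans::

# slice the original string to get the surface form of each entity
for label, (start, end) in zip(text.named_entity_labels, text.named_entity_spans):
    print(text.text[start:end], label)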
Tagging scheme
--------------
The default models use the tags PER, ORG and LOC to denote person names, organizations and locations respectively. Entity tags are encoded using the widely used BIO annotation scheme, where the label of each word that belongs to an entity is prefixed with B- (beginning of the entity) or I- (inside the entity), and all other words are labelled O. The prefixes make it possible to detect multiword entities, as shown in the example above.
The raw labels are accessible via the property :py:attr:`~estnltk.text.Text.labels` of the :py:class:`~estnltk.text.Text` instance::
pprint(list(zip(text.word_texts, text.labels)))
::
[('Eesti', 'B-LOC'),
('Vabariik', 'I-LOC'),
('on', 'O'),
...
('.', 'O')]
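To make the BIO prefixes concrete, here is a small plain-Python sketch (not part of `estnltk`) that groups the word/label pairs above back into multiword entities. It computes roughly what the :py:attr:`~estnltk.text.Text.named_entities` property returns, except that the property additionally normalizes the entity names::

def group_entities(words, labels):
    # collect consecutive B-/I- labelled words into (entity, category) pairs
    entities = []
    for word, label in zip(words, labels):
        if label.startswith('B-'):
            # B- starts a new entity of the given category
            entities.append(([word], label[2:]))
        elif label.startswith('I-') and entities:
            # I- continues the most recently started entity
            entities[-1][0].append(word)
        # O words do not belong to any entity and are skipped
    return [(' '.join(parts), category) for parts, category in entities]

pprint(group_entities(text.word_texts, text.labels))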
Advanced NER
============
Training custom models
----------------------
Default models that come with `estnltk` are good enough for basic tasks. However, for more specific tasks a custom NER model might be needed. To train your own model, you need to provide a training corpus and configuration settings; training itself is done with the class :py:class:`~estnltk.ner.NerTrainer`. The following example demonstrates how to train a model using the default training dataset and settings::
from estnltk.corpus import read_json_corpus
from estnltk.ner import NerTrainer
# Read the default training corpus
corpus = read_json_corpus('/home/projects/estnltk/estnltk/corpora/estner.json')
# Read the default settings
from estnltk.estner import settings as ner_settings
# Directory to save the model
model_dir = '<output model directory>'
# Train and save the model
trainer = NerTrainer(ner_settings)
trainer.train(corpus, model_dir)
The specified output directory will contain the resulting model file `model.bin` and a copy of the settings module used for training. Now we can load the model and tag some text using :py:class:`~estnltk.ner.NerTagger`::
from estnltk.ner import NerTagger
# Load the model from the directory it was saved to
tagger = NerTagger(model_dir)

# Build a document from the raw text to be tagged
document = Text('<raw text to tag>')
tagger.tag_document(document)
pprint(list(zip(document.word_texts, document.labels)))
::
[('Eesti', 'B-ORG'),
('koeraspordiliidu', 'I-ORG'),
('(', 'O'),
...
Training dataset
----------------
To train a model with `estnltk`, you need to provide your training data in a specific format (see the default dataset `estnltk/estnltk/corpora/estner.json` for an example). The training file contains one document per line, together with the NE labels. Let's create a simple document::
text = Text('''Eesti Vabariik on riik Põhja-Euroopas.''')
text.tokenize_words()
pprint(text)
::
{'paragraphs': [{'end': 38, 'start': 0}],
'sentences': [{'end': 38, 'start': 0}],
'text': 'Eesti Vabariik on riik Põhja-Euroopas.',
'words': [{'end': 5, 'start': 0, 'text': 'Eesti'},
{'end': 14, 'start': 6, 'text': 'Vabariik'},
{'end': 17, 'start': 15, 'text': 'on'},
{'end': 22, 'start': 18, 'text': 'riik'},
{'end': 28, 'start': 23, 'text': 'Põhja'},
{'end': 29, 'start': 28, 'text': '-'},
{'end': 37, 'start': 29, 'text': 'Euroopas'},
{'end': 38, 'start': 37, 'text': '.'}]}
Next, let's add named entity tags to each word in the document::
words = text.words
# label each word as "other":
for word in words:
word['label'] = 'O'
# label words "Eesti Vabariik" as a location
words[0]['label'] = 'B-LOC'
words[1]['label'] = 'I-LOC'
# label words "Põhja-Euroopas" as a location
words[4]['label'] = 'B-LOC'
words[5]['label'] = 'I-LOC'
words[6]['label'] = 'I-LOC'
pprint(text.words)
::
[{'end': 5, 'label': 'B-LOC', 'start': 0, 'text': 'Eesti'},
{'end': 14, 'label': 'I-LOC', 'start': 6, 'text': 'Vabariik'},
{'end': 17, 'label': 'O', 'start': 15, 'text': 'on'},
{'end': 22, 'label': 'O', 'start': 18, 'text': 'riik'},
{'end': 28, 'label': 'B-LOC', 'start': 23, 'text': 'Põhja'},
{'end': 29, 'label': 'I-LOC', 'start': 28, 'text': '-'},
{'end': 37, 'label': 'I-LOC', 'start': 29, 'text': 'Euroopas'},
{'end': 38, 'label': 'O', 'start': 37, 'text': '.'}]
Once we have a collection of labelled documents, we can save it to disk using the function :py:func:`estnltk.corpus.write_json_corpus`::
from estnltk.corpus import write_json_corpus
documents = [text]
write_json_corpus(documents, '<output file name>')
This serializes each document object into a JSON string and writes it to the specified file, one document per line. The resulting training file can be used with :py:class:`~estnltk.ner.NerTrainer` as shown above.
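As a quick sanity check, the written file can be read back with :py:func:`estnltk.corpus.read_json_corpus`; the sketch below assumes the function returns the documents as a list::

from estnltk.corpus import read_json_corpus

# reload the corpus we just wrote and inspect the labels of the first document
documents = read_json_corpus('<output file name>')
pprint(list(zip(documents[0].word_texts, documents[0].labels)))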
Ner settings
-------------
By default, `estnltk` uses the configuration module :mod:`estnltk.estner.settings`. A settings module defines training algorithm parameters, entity categories, feature extractors and feature templates. The simplest way to create a custom configuration is to make a new settings module, e.g. `custom_settings.py`, import the default settings and override the necessary parts. For example, a minimalistic custom configuration module could look like this::
from estnltk.estner.settings import *
# Override the set of feature extractors (other overridden settings are omitted here)
FEATURE_EXTRACTORS = (
"estnltk.estner.featureextraction.MorphFeatureExtractor",
)
Now, the :class:`~estnltk.ner.NerTrainer` instance can be initialized using the `custom_settings` module (make sure `custom_settings.py` is on your Python path)::
trainer = NerTrainer(custom_settings)
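Putting it all together, a complete training run with the custom configuration could look like the following sketch (the corpus and output paths are placeholders)::

import custom_settings

from estnltk.corpus import read_json_corpus
from estnltk.ner import NerTrainer

# read a training corpus in the JSON format described above
corpus = read_json_corpus('<training corpus file>')

# train with the custom configuration and save the resulting model
trainer = NerTrainer(custom_settings)
trainer.train(corpus, '<output model directory>')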