Commit e1bb0888 authored by Timo Petmanson

Updated the database tutorial

parent 4ea97843
...@@ -4,16 +4,21 @@ ...@@ -4,16 +4,21 @@
Handling large text collections with Elastic database Handling large text collections with Elastic database
===================================================== =====================================================
.. content .. Estnltk has a database module that simplifies working with large corpora.
Check out :ref:`wikipedia_tutorial`, :ref:`tei_tutorial` for more information
about getting started with larger text document collections.
The activate Elastic (formerly Elasticsearch) carry out the guide from the Elastic team Estnltk database integrates with `Elastic`_, which is a distributed RESTful schema-free
at webpage `https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html`_. JSON database, based on `Apache Lucene`_.
See this `guide`_ for installation.
.. _https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html: https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html/ .. _Elastic: https://www.elastic.co/downloads/elasticsearch
.. _Apache Lucene: https://lucene.apache.org/
.. _guide: https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html
When the installation is complete you can run Elastic (from Elastic folder) with the command:: When the installation is complete you can run Elastic (from Elastic folder) with the command::
./elasticsearch ./bin/elasticsearch
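Once the server has been started, a quick reachability check can confirm that it answers on the default address. The following is only a minimal sketch using the Python standard library; the helper name ``elastic_is_up`` is hypothetical and not part of Estnltk or Elastic, and it assumes the server listens on the default http://127.0.0.1:9200.

```python
import http.client
import urllib.request


def elastic_is_up(url="http://127.0.0.1:9200", timeout=2.0):
    """Return True if an HTTP GET to `url` answers with status 200.

    Hypothetical helper for illustration; not part of Estnltk.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts and HTTP error statuses all end up here.
        return False
```

Calling ``elastic_is_up()`` before creating a database object gives an early, readable failure instead of a connection error later on.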
Elastic has a visualization plugin that can be accessed through a browser of your choosing. Elastic has a visualization plugin that can be accessed through a browser of your choosing.
To do this you need to write `http://localhost:9200/_plugin/head/`_ to the URL bar in your browser. To do this you need to write `http://localhost:9200/_plugin/head/`_ to the URL bar in your browser.
...@@ -23,13 +28,77 @@ To do this you need to write `http://localhost:9200/_plugin/head/`_ to the URL ...@@ -23,13 +28,77 @@ To do this you need to write `http://localhost:9200/_plugin/head/`_ to the URL
For simple testing purposes one can increase the memory by using --ES_MAX_MEM switch. For simple testing purposes one can increase the memory by using --ES_MAX_MEM switch.
Example of using the memory switch:: Example of using the memory switch::
./elasticsearch --ES_MAX_MEM=4g ./bin/elasticsearch --ES_MAX_MEM=4g
.. hint::
If you have trouble running Elastic, please refer to `Elastic guide`_.
Do your research before asking us. Estnltk has only a very thin wrapper around the `Elastic Python API`_ .
.. _Elastic guide: https://www.elastic.co/guide/index.html
.. _Elastic Python API: https://elasticsearch-py.readthedocs.org/en/master/
Estnltk Elastic wrapper
=======================
Estnltk has a :py:class:`~estnltk.database.database.Database` class that represents a single Elastic index.
The most important thing to know is the constructor signature::
def __init__(self, index, doc_type='document', **kwargs):
"""
Parameters
----------
index: str
The name of the Elastic index.
doc_type:
The document type to use (default: 'document')
keyword_argument:
All keyword arguments will be passed to Python Elasticsearch constructor.
"""
So, for instance, if you want to connect to http://myserver.com:12345
instead of the default http://127.0.0.1:9200, you need to use::
from estnltk import Database
hosts = [{
    'host': 'myserver.com',  # host name only, without the URL scheme
    'port': 12345            # must match the port the server listens on
}]
db = Database('test', hosts=hosts)
Check the `Elastic Python docs`_ for more details.
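When the server address is only available as a URL, the ``hosts`` list can be derived from it. This is a minimal sketch using the standard library; ``hosts_from_url`` is a hypothetical helper, not an Estnltk or Elastic function, and the default port of 9200 is an assumption taken from the default address mentioned above.

```python
from urllib.parse import urlsplit


def hosts_from_url(url, default_port=9200):
    """Build a one-element `hosts` list from a URL such as http://myserver.com:12345.

    Hypothetical helper for illustration; not part of Estnltk.
    """
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError("URL must include a scheme and a host: %r" % url)
    # Fall back to the default Elastic port when the URL does not name one.
    return [{"host": parts.hostname, "port": parts.port or default_port}]


hosts = hosts_from_url("http://myserver.com:12345")
print(hosts)  # [{'host': 'myserver.com', 'port': 12345}]
```

The resulting list has the same shape as the ``hosts`` argument passed to ``Database`` in this tutorial.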
If the Elastic server runs on a machine that you can access over SSH, you might also want to read about
`SSH tunneling`_ .
Another important property is the actual ElasticSearch instance that Estnltk wraps,
which can be retrieved via :py:meth:`~estnltk.database.database.Database.es` property.
Use this for **complete control over the connection**.
.. _`Elastic Python docs`: https://elasticsearch-py.readthedocs.org/en/master/api.html#elasticsearch.Elasticsearch
.. _`SSH tunneling`: http://blog.trackets.com/2014/05/17/ssh-tunnel-local-and-remote-port-forwarding-explained-with-examples.html
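The way keyword arguments travel from the wrapper to the underlying client can be illustrated with stand-in classes. This is a sketch only: ``FakeClient`` and ``DatabaseSketch`` are hypothetical names that mimic the constructor signature and the ``es`` property described above; they are not Estnltk's implementation and do not talk to a real server.

```python
class FakeClient(object):
    """Hypothetical stand-in for the Elasticsearch client; it only records its kwargs."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class DatabaseSketch(object):
    """Illustrative thin wrapper with the same constructor signature as Database."""

    def __init__(self, index, doc_type='document', **kwargs):
        self._index = index
        self._doc_type = doc_type
        # Every extra keyword argument is forwarded untouched to the client.
        self._es = FakeClient(**kwargs)

    @property
    def index(self):
        return self._index

    @property
    def es(self):
        """The wrapped client instance, exposed for direct use."""
        return self._es


db = DatabaseSketch('test', hosts=[{'host': 'myserver.com', 'port': 12345}])
print(db.es.kwargs)  # {'hosts': [{'host': 'myserver.com', 'port': 12345}]}
```

Because the wrapper keeps no connection logic of its own, anything the client constructor accepts can be configured through the wrapper's keyword arguments.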
Inserting Text objects to database
==================================
Estnltk provides a method for inserting Text objects into an Elastic database for further analysis.
You need to create a database object before inserting.
The example below creates a database named 'test'
and then builds a Text object from a single sentence.
Finally, it calls the :py:meth:`~estnltk.database.database.Database.insert` method.
Example of inserting a text::
from estnltk import Database, Text
db = Database('test')
text = Text('Mees, keda seal kohtasime, oli tuttav ja ta teretas meid.')
db.insert(text)
Connecting to the server
========================
By default, the Database class connects to Elastic at 127.0.0.1:9200.
Bulk importing data Bulk importing data
=================== ===================
...@@ -57,26 +126,8 @@ Eesti Koondkorpus, you can insert them using commands:: ...@@ -57,26 +126,8 @@ Eesti Koondkorpus, you can insert them using commands::
python3 -m estnltk.database.importer eesti corpora/eesti python3 -m estnltk.database.importer eesti corpora/eesti
Insert Text object to database Check out :ref:`wikipedia_tutorial`, :ref:`tei_tutorial` for more information
============================== if you want to download some large and useful datasets to work with.
Estnltk has a python function for inserting Text objects to Elastic database for further analysis.
It is important that you create a database before inserting. In the example there is a database created named 'test'.
After that the Text object is created with a sentence. Then the insert() function is being called.
Example for using the text insert::
from ..database import Database
from ...text import Text
db = Database('test')
text = Text('Mees, keda seal kohtasime, oli tuttav ja ta teretas meid.')
db.insert(text)
Searching the database for keywords Searching the database for keywords
=================================== ===================================
...@@ -93,3 +144,4 @@ The example search is from the 'test' database and the search word is 'aegna':: ...@@ -93,3 +144,4 @@ The example search is from the 'test' database and the search word is 'aegna'::
search = Database.query_documents(db, "aegna") search = Database.query_documents(db, "aegna")
The search will return a json format query with the full text of the successful search result. The search will return a json format query with the full text of the successful search result.
...@@ -110,4 +110,4 @@ You can access the annotated layers as you would access typical layers:: ...@@ -110,4 +110,4 @@ You can access the annotated layers as you would access typical layers::
{'end': 80, 'start': 75, 'text': 'soola'}] {'end': 80, 'start': 75, 'text': 'soola'}]
See package ``estnltk.grammar.examples`` for more examples. See package ``estnltk.grammar.examples`` for more examples.
\ No newline at end of file
.. _tei_tutorial:
================================= =================================
Working with Estonian Koondkorpus Working with Estonian Koondkorpus
================================= =================================
......
...@@ -177,7 +177,7 @@ Only difference is that by using :py:attr:`~estnltk.text.Text.word_texts` proper ...@@ -177,7 +177,7 @@ Only difference is that by using :py:attr:`~estnltk.text.Text.word_texts` proper
Second call would use the ``start`` and ``end`` attributes already stored in the :py:class:`~estnltk.text.Text` instance. Second call would use the ``start`` and ``end`` attributes already stored in the :py:class:`~estnltk.text.Text` instance.
The default word tokenizer is NLTK-s `WordPunctTokenizer`_:: The default word tokenizer is a modification of `WordPunctTokenizer`_ ::
from nltk.tokenize.regexp import WordPunctTokenizer from nltk.tokenize.regexp import WordPunctTokenizer
tok = WordPunctTokenizer() tok = WordPunctTokenizer()
......
.. _wikipedia_tutorial:
======================================== ========================================
Working with Estonian and Võru wikipedia Working with Estonian and Võru wikipedia
======================================== ========================================
......
...@@ -12,3 +12,4 @@ from .disambiguator import Disambiguator ...@@ -12,3 +12,4 @@ from .disambiguator import Disambiguator
from .prettyprinter import PrettyPrinter from .prettyprinter import PrettyPrinter
from .database import Database from .database import Database
from .grammar import * from .grammar import *
from .tokenizers.word_tokenizer import EstWordTokenizer
...@@ -51,6 +51,15 @@ def prepare_text(text): ...@@ -51,6 +51,15 @@ def prepare_text(text):
class Database(object): class Database(object):
"""Database class represents a single index in Elastic """Database class represents a single index in Elastic
and helps with inserting and querying Estnltk documents. and helps with inserting and querying Estnltk documents.
Parameters
----------
index: str
The name of the Elastic index.
doc_type:
The document type to use (default: 'document')
keyword_argument:
All keyword arguments will be passed to Python Elasticsearch constructor.
""" """
def __init__(self, index, doc_type='document', **kwargs): def __init__(self, index, doc_type='document', **kwargs):
...@@ -60,14 +69,17 @@ class Database(object): ...@@ -60,14 +69,17 @@ class Database(object):
@property @property
def index(self): def index(self):
"""The name of the index."""
return self.__index return self.__index
@property @property
def doc_type(self): def doc_type(self):
"""The doc_type property"""
return self.__doc_type return self.__doc_type
@property @property
def es(self): def es(self):
"""The ElasticSearch instance."""
return self.__es return self.__es
def insert(self, text, id=None): def insert(self, text, id=None):
...@@ -117,6 +129,7 @@ class Database(object): ...@@ -117,6 +129,7 @@ class Database(object):
result = self.es.bulk(index=self.index, doc_type=self.doc_type, body=insert_data, refresh=refresh) result = self.es.bulk(index=self.index, doc_type=self.doc_type, body=insert_data, refresh=refresh)
def get(self, id): def get(self, id):
"""Retrieve a document with given id."""
return self.es.get(index=self.index, doc_type=self.doc_type, id=id, ignore=[400, 404])['_source']['text'] return self.es.get(index=self.index, doc_type=self.doc_type, id=id, ignore=[400, 404])['_source']['text']
def refresh(self): def refresh(self):
...@@ -124,6 +137,7 @@ class Database(object): ...@@ -124,6 +137,7 @@ class Database(object):
self.es.indices.refresh(index=self.index, ignore=[400, 404]) self.es.indices.refresh(index=self.index, ignore=[400, 404])
def delete_index(self): def delete_index(self):
"""Delete the index."""
self.es.indices.delete(index=self.index, ignore=[400, 404]) self.es.indices.delete(index=self.index, ignore=[400, 404])
def delete(self, index, id): def delete(self, index, id):
......