Commit d555d3e5 authored by Timo Petmanson

Merge branch 'devel' of github.com:estnltk/estnltk into devel

* 'devel' of github.com:estnltk/estnltk:
  Documentation finished and minor fixes in the css
  The tutorial for elasticsearch
  Documentation draft 1.1
  Documentation draft 1.1
  Added details to TIMEX tagger's documentation.
parents 35f3ae5c 6a77586a
......@@ -20,6 +20,7 @@ import os
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, '/home/timo/projects/estnltk')
sys.path.insert(0, '/home/keeletehnoloogia/estnltk')
# -- General configuration ------------------------------------------------
......
.. _database_tutorial:
===========================================================
Handling large text collections with ElasticSearch database
===========================================================
=====================================================
Handling large text collections with Elastic database
=====================================================
.. content ..
Mention elasticsearch visualization plugin ES view
To set up Elastic (formerly Elasticsearch), follow the installation guide from the Elastic team
at `https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html`_.
Mention that simply for testing purposes one can increase memory using --ES_MAX_MEM switch
./elasticsearch --ES_MAX_MEM=4g
.. _https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html: https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html/
When the installation is complete, you can run Elastic from its installation folder with the command::
./elasticsearch
Elastic has a visualization plugin that can be accessed through a browser of your choosing.
To open it, enter `http://localhost:9200/_plugin/head/`_ in your browser's address bar.
.. _http://localhost:9200/_plugin/head/: http://localhost:9200/_plugin/head/
For testing purposes, you can increase the memory available to Elastic by using the --ES_MAX_MEM switch.
Example of using the memory switch::
./elasticsearch --ES_MAX_MEM=4g
Bulk importing data
===================
......@@ -36,3 +50,40 @@ Eesti Koondkorpus, you can insert them using commands::
python3 -m estnltk.database.importer koond corpora/koond
python3 -m estnltk.database.importer eesti corpora/eesti
Insert Text object to database
==============================
Estnltk provides a Python function for inserting Text objects into an Elastic database for further analysis.
The database must be created before inserting. The example first creates a database named 'test',
then creates a Text object from a sentence, and finally calls the insert() function.
Example of inserting a Text object::
from ..database import Database
from ...text import Text
db = Database('test')
text = Text('Mees, keda seal kohtasime, oli tuttav ja ta teretas meid.')
db.insert(text)
Searching the database for keywords
===================================
To search the Elastic database, you need to specify the name of the database and the keywords
to search for. The search is performed with the query_documents() function.
The example searches the 'test' database for the keyword 'aegna'::
from ..database import Database
db = Database('test')
search = Database.query_documents(db, "aegna")
The search returns its result in JSON format, including the full text of every matching document.
......@@ -9,10 +9,20 @@ Estnltk comes with HTML PrettyPrinter that can help building Web applications an
text processing.
PrettyPrinter is capable of very different types of visualization. From visualization of simple given word to multiple
and overlapping word types and even parts of whole sentences.
and overlapping word types and even parts of whole sentences. Here is a list of properties that can be modified with the
help of PrettyPrinter and the matching name of the value that the module is expecting:
Change font color - 'color'
Change background color - 'background'
Change font family - 'font'
Change font weight - 'weight'
Change font style - 'italics'
Add underline - 'underline'
Change font size - 'size'
Change letter spacing - 'tracking'
Example #1 formatting specific word
Example #1 Formatting every occurrence of a specific word in the text with a different visual format.
from ...text import Text
from ..prettyprinter import PrettyPrinter
......@@ -34,27 +44,259 @@ The result of this short program will be:
</head>
<style>
mark.background_0 {
mark.background{
background-color: rgb(102, 204, 255);
}
</style>
<body>
<p>
This must be formated <mark class="background_0">here</mark> and <mark class="background_0">here</mark>
This must be formated <mark class="background">here</mark> and <mark class="background">here</mark>
</p>
</body>
</html>
</embed>
Class Text('...') is what does all the analysis. If we are looking to mark a specific word as in this case is the word
"here" then we must bind the annotation to the word "here" with the help of a function of Text('...') called
'here' then we must bind the annotation to the word 'here' with the help of a function of Text('...') called
tag_with_regex('annotations', 'here'), which tags every occurrence of the word 'here' as 'annotations'. These tags are later used
to find the exact indices where the selected formatting starts and ends.
When we create a new class PrettyPrinter variable by "pp = PrettyPrinter(background='annotations')", we add arguments
When we create a new class PrettyPrinter variable by 'pp = PrettyPrinter(background='annotations')', we add arguments
describing which property is applied to which tag. In our case, everything that is tagged as 'annotations' gets a
different background color. The value rgb(102, 204, 255) is the default background color, used when no other color
is specified when the PrettyPrinter object is created.
... content ...
Keep in mind that if we call render() with the second argument 'False' instead of 'True', then the result
will not be the full HTML document, but only the formatted text that would appear inside the HTML body.
Example #2 Formatting the same property with a different visual format depending on the specific word
text = Text('Nimisõnad värvitakse').tag_analysis()
rules =[
('Nimisõnad', 'green'),
('värvitakse', 'blue')
]
pp = PrettyPrinter(background='words', background_value=rules)
html = pp.render(text, True)
The result of this program will be:
<embed>
<!DOCTYPE html>
<html>
<head>
<link rel="stylesheet" type="text/css" href="prettyprinter.css">
<meta charset="utf-8">
<title>PrettyPrinter</title>
</head>
<style>
mark.background_0 {
background-color: green;
}
mark.background_1 {
background-color: blue;
}
</style>
<body>
<mark class="background_0">Nimisõnad</mark> <mark class="background_1">värvitakse</mark>
</body>
</html>
</embed>
This time we gave the PrettyPrinter object two arguments: background='words' and background_value=rules. The background
value 'words' means that we do not add any custom tags as in the previous example, but instead use the original
'words' tag that every word already has. PrettyPrinter itself checks which words match the rules specified in the list
'rules'. The second argument, background_value=rules, tells PrettyPrinter which value to apply for each match.
In effect, our 'rules' tell PrettyPrinter to give the word 'Nimisõnad' a green background
color and the word 'värvitakse' a blue background color. Because different words can have different values
of the same property (e.g. background color, font color, font size), the css mark classes are numbered by the
index of the matching rule.
Example #3 Using word type tags as rule parameters
text = Text('Suured kollased kõrvad ja').tag_analysis()
rules =[
('A', 'blue'),
('S', 'green')
]
pp = PrettyPrinter(background='words', background_value=rules)
html = pp.render(text, True)
This time the defining parameters are 'A' and 'S' which stand for different word types. The list of different tags can
be found below:
A - adjective
C - comparative adjective
D - adverb
G - indeclinable adjective
H - proper noun
I - interjection
J - conjunction
K - adposition
N - cardinal numeral
O - ordinal numeral
P - pronoun
S - noun
U - superlative adjective
V - verb
X -
Y - abbreviation
Z - punctuation mark
PrettyPrinter will sort everything else out by itself. The result of this will be:
<embed>
<!DOCTYPE html>
<html>
OK
<head>
<link rel="stylesheet" type="text/css" href="prettyprinter.css">
<meta charset="utf-8">
<title>PrettyPrinter</title>
</head>
<style>
mark.background_0 {
background-color: blue;
}
mark.background_1 {
background-color: green;
}
</style>
<body>
<mark class="background_0">Suured</mark> <mark class="background_0">kollased</mark> <mark class="background_1">kõrvad</mark> ja
</body>
</html>
</embed>
As we can see from the results, all adjectives have been marked with a css background mark tag for color blue and the
noun in the sentence has been marked with a css background mark tag for color green. In this way it is possible to
visually separate all words that are of a specific type simply and effectively.
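The rule matching described above can be illustrated with a minimal, self-contained sketch. It does not use Estnltk: the function `render_with_rules` is hypothetical, the part-of-speech tags in `words` are assigned by hand rather than produced by `tag_analysis()`, and the assumption that a rule key is compared against both the word text and its tag is ours, not something the library confirms.

```python
from html import escape


def render_with_rules(words, rules, aes_name='background'):
    """Wrap each word in a numbered <mark> according to the first matching rule.

    words: list of (word_text, postag) pairs (hand-made here, an assumption).
    rules: list of (key, css_value) pairs; key is a word text or a POS tag.
    """
    parts = []
    for word, postag in words:
        # Index of the first rule whose key equals the word text or its POS tag.
        index = next(
            (i for i, (key, _value) in enumerate(rules) if key in (word, postag)),
            None,
        )
        if index is None:
            parts.append(escape(word))
        else:
            parts.append('<mark class="{}_{}">{}</mark>'.format(
                aes_name, index, escape(word)))
    return ' '.join(parts)


words = [('Suured', 'A'), ('kollased', 'A'), ('kõrvad', 'S'), ('ja', 'J')]
rules = [('A', 'blue'), ('S', 'green')]
html = render_with_rules(words, rules)
print(html)
```

Both adjectives share the class `background_0` because they match the same rule, which is why the class suffix follows the rule index rather than the word position.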
Example #4 Using different visual properties for different parts of the text
text = Text('Esimene ja teine märgend')
text.tag_with_regex('A', 'Esimene ja')
text.tag_with_regex('B', 'ja teine')
pp = PrettyPrinter(color='A', background='B')
html = pp.render(text, False)
This time we want to highlight two different parts of the text with different properties: font color and background color. To
do this, we pass both layers as PrettyPrinter parameters and tie each of them to a property. With
text.tag_with_regex('A', 'Esimene ja') we tag the 'Esimene ja' part of the text as 'A', so that the PrettyPrinter parameter
color='A' applies to it. As a result we have two different css formats, each changing a different
property. Here we can also see that the formatting works with overlapping layers, because the word 'ja' is in both 'A' and
'B'. With 'False' as the second parameter of render(), the output is the following:
<mark class="color">Esimene </mark><mark class="background color">ja</mark><mark class="background"> teine</mark> märgend
Here we can see that the word 'ja' has two classes, 'background' and 'color'.
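One way to picture how overlapping layers end up as combined classes is to cut the text at every span boundary and collect the classes that cover each piece. The sketch below is only an illustration under that assumption; `render_layers` is a hypothetical helper, not PrettyPrinter's actual implementation, and the character offsets are written out by hand.

```python
from html import escape


def render_layers(text, layers):
    """layers maps a css class name to a list of (start, end) character spans."""
    # Every span boundary is a point where the set of active classes may change.
    cuts = {0, len(text)}
    for spans in layers.values():
        for start, end in spans:
            cuts.update((start, end))
    points = sorted(cuts)

    parts = []
    for seg_start, seg_end in zip(points, points[1:]):
        # Classes whose span fully covers this segment, in alphabetical order.
        classes = sorted(
            name for name, spans in layers.items()
            if any(s <= seg_start and seg_end <= e for s, e in spans)
        )
        chunk = escape(text[seg_start:seg_end])
        if classes:
            parts.append('<mark class="{}">{}</mark>'.format(
                ' '.join(classes), chunk))
        else:
            parts.append(chunk)
    return ''.join(parts)


text = 'Esimene ja teine märgend'
layers = {
    'color': [(0, 10)],       # 'Esimene ja'
    'background': [(8, 16)],  # 'ja teine'
}
html = render_layers(text, layers)
print(html)
```

The segment 'ja' is covered by both spans, so it receives both class names in a single mark element.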
Generating just the css
It is possible to use PrettyPrinter to generate just the css formatting, without the HTML or the actual text content. In
this case we just supply the PrettyPrinter object with the necessary parameters and additional rules (if needed),
and the object will generate the required css mark classes.
Example #5 generating one layer css
pp = PrettyPrinter(color='layer')
css_format = pp.css
This is the simplest form and the result will be:
<embed>
mark.color {
color: rgb(0, 0, 102);
}
</embed>
Example #6 generating css with user defined color value
pp = PrettyPrinter(color='layer', color_value='color_you_have_never_seen')
css_format = pp.css
Similar to the last one, the result is a simple color marking, but with the user-defined value.
<embed>
mark.color {
color: color_you_have_never_seen;
}
</embed>
Example #7 generating css with rules
rules = [
('Nimisõnad', 'green'),
('värvitakse', 'blue')
]
pp = PrettyPrinter(color='layer', color_value=rules)
css_format = pp.css
This simple program generates two mark classes, each defining a different font color.
<embed>
mark.color_0 {
color: green;
}
mark.color_1 {
color: blue;
}
</embed>
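The difference between a single value and a list of rules can be sketched with the two templates that appear in this commit. The helper `sketch_mark_css` and the `CSS_PROPS` mapping are hypothetical stand-ins written for this illustration (the body of the real `get_mark_css` is not shown in the diff), and the template whitespace is approximate.

```python
# Templates modeled on MARK_SIMPLE_CSS and MARK_RULE_CSS from the diff.
MARK_SIMPLE_CSS = 'mark.{aes_name} {{\n\t{css_prop}: {css_value};\n}}'
MARK_RULE_CSS = 'mark.{aes_name}_{rule_index} {{\n\t{css_prop}: {css_value};\n}}'

# Assumed mapping from aesthetic name to css property (only two entries shown).
CSS_PROPS = {'color': 'color', 'background': 'background-color'}


def sketch_mark_css(aes_name, css_value):
    """Return a list of css class definitions for one aesthetic (hypothetical)."""
    css_prop = CSS_PROPS[aes_name]
    if isinstance(css_value, list):
        # A list of (pattern, value) rules yields one numbered class per rule.
        return [
            MARK_RULE_CSS.format(aes_name=aes_name, rule_index=index,
                                 css_prop=css_prop, css_value=value)
            for index, (_pattern, value) in enumerate(css_value)
        ]
    # A single value yields one unnumbered class.
    return [MARK_SIMPLE_CSS.format(aes_name=aes_name, css_prop=css_prop,
                                   css_value=css_value)]


rules = [('Nimisõnad', 'green'), ('värvitakse', 'blue')]
css = '\n'.join(sketch_mark_css('color', rules))
print(css)
```

Calling the helper with a plain string such as `'rgb(0, 0, 102)'` instead of `rules` produces the single unnumbered `mark.color` class seen in Example #5.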
Example #8 generating full css without rules
AESTHETICS = {
'color': 'layer1',
'background': 'layer2',
'font': 'layer3',
'weight': 'layer4',
'italics': 'layer5',
'underline': 'layer6',
'size': 'layer7',
'tracking': 'layer8'
}
pp = PrettyPrinter(**AESTHETICS)
css_format = pp.css
This program returns the css default formatting for all the properties in AESTHETICS.
<embed>
mark.background {
background-color: rgb(102, 204, 255);
}
mark.size {
font-size: 120%;
}
mark.color {
color: rgb(0, 0, 102);
}
mark.tracking {
letter-spacing: 0.03em;
}
mark.weight {
font-weight: bold;
}
mark.underline {
font-decoration: underline;
}
mark.font {
font-family: sans-serif;
}
mark.italics {
font-style: italic;
}
</embed>
......@@ -1178,12 +1178,17 @@ There are a number of mandatory attributes present in the dictionaries:
* **start, end** - the expression start and end positions in the text.
* **tid** - TimeML format *id* of the expression.
* **id** - the zero-based *id* of the expressions, matches the position of the respective dictionary in the resulting list.
* **temporal_function** - *true*, if the expression is relative and exact date has to be computed from anchor points.
* **type** - according to TimeML, four types of temporal expressions are distinguished:
* **type** - following the TimeML specification, four types of temporal expressions are distinguished:
* *DATE expressions*, e.g. *järgmisel kolmapäeval* (*on next Wednesday*)
* *TIME expressions*, e.g. *kell 18.00* (*at 18 o’clock*)
* *DURATIONs*, e.g. *viis päeva* (*five days*)
* *SETs of times*, e.g. *igal aastal* (*on every year*)
* **temporal_function** - boolean value indicating whether the semantics of the expression are relative to the context.
* For DATE and TIME expressions:
* *True* indicates that the expression is relative and semantics have been computed by heuristics;
* *False* indicates that the expression is absolute and semantics haven't been computed by heuristics;
* For DURATION expressions, *temporal_function* is mostly *False*, except for vague durations;
* For SET expressions, *temporal_function* is always *True*;
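As a small illustration of how these attributes might be used together, the sketch below filters a list of timex dictionaries for DATE and TIME expressions whose semantics were computed by heuristics. The dictionaries are invented sample data that only mimic the attribute names listed above; they are not real tagger output, and `heuristically_resolved` is a hypothetical helper.

```python
# Invented sample data using the attribute names described above.
timexes = [
    {'start': 0, 'end': 4, 'id': 0, 'tid': 't1',
     'type': 'DATE', 'temporal_function': True},
    {'start': 10, 'end': 19, 'id': 1, 'tid': 't2',
     'type': 'DURATION', 'temporal_function': False},
    {'start': 25, 'end': 36, 'id': 2, 'tid': 't3',
     'type': 'SET', 'temporal_function': True},
]


def heuristically_resolved(timexes):
    """Keep DATE/TIME expressions that are relative to the context."""
    return [
        timex for timex in timexes
        if timex['type'] in ('DATE', 'TIME') and timex['temporal_function']
    ]


relative_tids = [timex['tid'] for timex in heuristically_resolved(timexes)]
print(relative_tids)
```

The SET expression is excluded even though its *temporal_function* is *True*, because for SET expressions that value is always set and says nothing about heuristics.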
The **value** is a mandatory attribute containing the semantics and has four possible formats:
......@@ -1254,7 +1259,7 @@ However, when passing ``creation_date=datetime.datetime(1986, 12, 21)``::
import datetime
Text('Täna on ilus ilm', creation_date=datetime.datetime(1986, 12, 21)).timexes
We see that word "today" (*täna*) refers to December 21 1986::
We see that word "today" (*täna*) refers to December 21, 1986::
[{'end': 4,
'id': 0,
......
......@@ -31,13 +31,14 @@ class InsertTest(unittest.TestCase):
def test_insert_default_ids(self):
# this is not a suitable way to remove the warning, because the warning is simply hidden. Better to leave it visible.
# TODO: delete me: warnings.simplefilter("ignore")
self.db.refresh()
db = self.db
# insert the documents
id_first = db.insert(first())
print(id_first)
id_second = db.insert(second())
print(id_second)
# check the count
self.assertEqual(2, db.count())
......@@ -51,19 +52,12 @@ class BulkInsertTest(unittest.TestCase):
def setUp(self):
self.db = Database(TEST_INDEX)
self.db.delete()
#self.db.delete()
def test_bulk_insert(self):
db = self.db
db.refresh()
# it is better to simply move first and second out of InsertTest (I have already done this).
# creating a new instance is unnecessary.
# TODO: delete me.
# insert many (bulk) into db bulk_test
# it = InsertTest()
# text_lists = [it.first, it.second]
text_lists = [first(), second()]
id_bulk = db.bulk_insert(text_lists)
......@@ -74,7 +68,6 @@ class BulkInsertTest(unittest.TestCase):
class SearchTest(unittest.TestCase):
def test_search_keyword_documents(self):
# TODO: move Database setup and initialization to def setUp() method
self.db = Database(TEST_INDEX)
keywords = ["aegna"]
search = Database.query_documents(self.db, query=keywords)
......
......@@ -111,6 +111,7 @@ class PrettyPrinter(object):
css_list = []
for aes in self.aesthetics:
css_list.extend(get_mark_css(aes, self.values[aes]))
print('\n'.join(css_list))
return '\n'.join(css_list)
def render(self, text, add_header=False):
......@@ -118,6 +119,7 @@ class PrettyPrinter(object):
html = html.replace('\n', '<br/>')
if add_header:
html = '\n'.join([HEADER, self.css, MIDDLE, html, FOOTER])
print('\n'.join((HEADER, self.css, MIDDLE, html, FOOTER)))
return html
......@@ -28,13 +28,13 @@ MIDDLE = '''
FOOTER = '\t</body>\n</html>'
MARK_SIMPLE_CSS = '''mark.{aes_name} {{
{css_prop}: {css_value};
}}'''
MARK_SIMPLE_CSS = '''\t\tmark.{aes_name} {{
\t\t\t{css_prop}: {css_value};
\t\t}}'''
MARK_RULE_CSS = '''mark.{aes_name}_{rule_index} {{
{css_prop}: {css_value};
}}'''
MARK_RULE_CSS = '''\t\tmark.{aes_name}_{rule_index} {{
\t\t\t{css_prop}: {css_value};
\t\t}}'''
def get_mark_css(aes_name, css_value):
......@@ -73,4 +73,4 @@ CLOSING_MARK = '</mark>'
def get_opening_mark(classes):
return OPENING_MARK.format(classes=htmlescape(classes))
return OPENING_MARK.format(classes=htmlescape(classes))
\ No newline at end of file
......@@ -14,7 +14,7 @@ class TestRender(unittest.TestCase):
def test_no_aesthetics(self):
text = Text('See tekst on lihtsalt tühi')
pp = PrettyPrinter()
html = pp.render(text)
html = pp.render(text, False)
self.assertEqual(text.text, html)
def test_simple_annotations(self):
......@@ -22,7 +22,7 @@ class TestRender(unittest.TestCase):
text.tag_with_regex('annotations', 'siin')
pp = PrettyPrinter(background='annotations')
html = pp.render(text)
html = pp.render(text, False)
expected = 'Siin tekstis on märgend <mark class="background">siin</mark> ja teine on <mark class="background">siin</mark>'
self.assertEqual(expected, html)
......@@ -33,7 +33,7 @@ class TestRender(unittest.TestCase):
text['annotations'] = [text[CLAUSES][0]] # use the first clause only
pp = PrettyPrinter(background='annotations')
html = pp.render(text)
html = pp.render(text, False)
expected = '<mark class="background">Mees</mark>, kes oli tuttav, <mark class="background">teretas meid.</mark>'
self.assertEqual(expected, html)
......@@ -44,7 +44,7 @@ class TestRender(unittest.TestCase):
text.tag_with_regex('B', 'teine')
pp = PrettyPrinter(color='A', background='B')
html = pp.render(text)
html = pp.render(text, False)
expected = '<mark class="color">Esimene</mark> ja <mark class="background">teine</mark> märgend'
self.assertEqual(expected, html)
......@@ -55,7 +55,8 @@ class TestRender(unittest.TestCase):
text.tag_with_regex('B', 'ja teine')
pp = PrettyPrinter(color='A', background='B')
html = pp.render(text)
html = pp.render(text, False)
print(html)
expected = '<mark class="color">Esimene </mark><mark class="background color">ja</mark><mark class="background"> teine</mark> märgend'
self.assertEqual(expected, html)
......@@ -75,7 +76,7 @@ class TestRender(unittest.TestCase):
size='size',
tracking='tracking',
italics='italics')
html = pp.render(text)
html = pp.render(text, False)
expected = ['a ',
'<mark class="background color">b </mark>',
......@@ -86,4 +87,4 @@ class TestRender(unittest.TestCase):
'<mark class="tracking"> i</mark>',
' j ',
'<mark class="italics">k</mark>']
self.assertEqual(''.join(expected), html)
self.assertEqual(''.join(expected), html)
\ No newline at end of file
......@@ -42,7 +42,7 @@ class SimpleTest(unittest.TestCase):
text = self.text
pp = PrettyPrinter(background='words', background_value=self.rules)
html = pp.render(text)
html = pp.render(text, False)
self.assertEqual(self.expected, html)
......