# <Put_here_name> experiment

This project contains a data transformation experiment of ...

The transformation includes:

* <step 1>
* <step 2>
* <etc>

## Source

* The text corpus is ...

## Tools

* Perl
* Parallel
* Bash
* Python 2.7

* transform.sh - Bash script for running transformation.
* validate.sh - Bash script for running validation of output.

## Conf

* setup.sh - Bash script for installing tools.
* setenv.sh - Bash script for setting the environment variables used by the other scripts.
* version - File with a given version number in it.

## Var

* ...  
* The first version of files in _var_ is tagged as "v1.0.0".

## Output

* ...

# Experiment template

The main principles are as follows:

* The experiment includes a source dataset in directory _source_.
* The experiment includes the transformation and validation tools in directory _tools_.
* The experiment includes a setup script and a definition of the transformation process in directory _conf_.
* Directories _source_, _tools_ and _conf_ must remain unchanged during the transformations.
* The experiment includes the transformation parameters in directory _var_.
* All files are under version control.
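
The directory layout these principles describe can be checked with a few lines of Bash. This is a minimal sketch: `check_layout` is a hypothetical helper, not a script shipped with the template, and it only tests that the four directories named above exist.

```shell
#!/usr/bin/env bash
# Sketch: verify that an experiment directory contains the directories
# named in the principles (source, tools, conf, var).
# check_layout is a hypothetical helper, not part of the template.
check_layout() {
    local root="${1:-.}" missing=0 dir
    for dir in source tools conf var; do
        if [ ! -d "$root/$dir" ]; then
            # Report every missing directory, then fail at the end.
            echo "missing directory: $dir" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_layout /path/to/experiment` returns 0 when all four directories are present and 1 otherwise, so it can be used as a guard at the top of other scripts.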

Description of the workflow:

* The working directory must be _tools_.
* A user can start a new experiment with data from any previous experiment
(script: *startchange.sh* [from_tag]; if 'from_tag' is given and is not 'HEAD', a new branch is created).
* The user changes the transformation parameters (in directory _var_) by any means necessary.
* After the transformation parameters change, the experiment environment must run the transformation process
and save the output in directory _output_ (script: *transform.sh*).
* If the transformation process produces log files, they must be saved in directory _log_.
* After the results are produced, the environment must run the validation process and save a report in directory _result_
(script: *validate.sh* [previous_tag (current_tag|LOCAL)]; LOCAL means that the current, not yet committed files are compared against the 'previous' set).
* The validation process compares the current files with the files of the previous experiment. The comparison can also be made between any tagged stages in git.
* The user can review the report (file: result/diff.html) and then either resume changing the transformation parameters or finish (see the next step).
* All changes in the var, output, log and result files must then be committed and tagged with a new version number
(script: *stopchange.sh* [new_tag]; if 'new_tag' is missing, the value is calculated from the previous version number).
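
One change cycle from the list above can be sketched as a sequence of script calls. The script names and argument order come from the workflow description; `run_step`, `run_cycle` and the `DRY_RUN` variable are hypothetical, and passing 'from_tag' to *validate.sh* as the previous tag is an assumption. By default the sketch only prints the commands, because the parameter editing between *startchange.sh* and *transform.sh* is a manual step.

```shell
#!/usr/bin/env bash
# Sketch of one change cycle, meant to be run from the tools directory.
# run_step, run_cycle and DRY_RUN are hypothetical names for illustration.

run_step() {
    # Always print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

run_cycle() {
    local from_tag="$1" new_tag="${2:-}"
    run_step ./startchange.sh "$from_tag" || return 1
    # ... the files in ../var are edited by hand at this point ...
    run_step ./transform.sh || return 1
    # Assumption: compare the uncommitted files against the starting tag.
    run_step ./validate.sh "$from_tag" LOCAL || return 1
    if [ -n "$new_tag" ]; then
        run_step ./stopchange.sh "$new_tag"
    else
        # Without a tag, stopchange.sh derives one from the previous version.
        run_step ./stopchange.sh
    fi
}
```

For example, `run_cycle v1.0.0 v1.0.1` prints the four commands in order without running them. In practice the steps would usually be run one at a time, with a review of result/diff.html before *stopchange.sh*.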

For a totally new experiment, a user must:

* create a copy of this template project;
* replace files in _source_ directory;
* define a transformation process;
* prepare a setup script which installs all necessary transformation and validation tools;
* prepare the files with transformation parameters (usually transformation rules);
* start an experiment environment with the created project.
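
The first two steps of this list (copying the template and replacing the files in _source_) could be scripted roughly as follows. This is only a sketch: `new_experiment` and its three arguments are hypothetical, and the copy takes everything in the template directory as-is, including any git metadata, which may or may not be wanted.

```shell
#!/usr/bin/env bash
# Sketch: create a new experiment from a copy of the template and
# replace its source files with a new dataset.
# new_experiment and its arguments are hypothetical, not part of the template.
new_experiment() {
    local template="$1" target="$2" data="$3"
    if [ -e "$target" ]; then
        echo "target already exists: $target" >&2
        return 1
    fi
    # Copy the whole template project to the new location.
    cp -R "$template" "$target" || return 1
    # Drop the template's source files and copy in the new dataset.
    rm -rf "${target:?}/source"
    mkdir -p "$target/source" || return 1
    cp -R "$data"/. "$target/source"/
}
```

The remaining steps (defining the transformation process, preparing the setup script and the parameter files) depend on the experiment and are left as manual work.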

NB! In the first stage of development we assume that experiments run under Linux (Debian 8) and that the user has sudo rights.