@@ -15,5 +15,15 @@ cat sonaveeb_2020.prevert |sed 's/%/\\%/g'| gawk -v ORS="" '/<s /{print $0}
/<[/]s/{print $0; print "\n"}' > lausereal.prevert
```

The file contains 27987754 sentences in total.

```
ubuntu@idu:/data2/sonaveeb/sonaveeb_2020$ cat sonaveeb_2020.prevert |grep -c '<s'
27987754
ubuntu@idu:/data2/sonaveeb/sonaveeb_2020$ cat sonaveeb_2020.prevert |grep -c '</s'
27987754
ubuntu@idu:/data2/sonaveeb/sonaveeb_2020$ cat sonaveeb_2020.prevert |grep -c '^[^<]'
```
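
The counts above can also be compared automatically. The following is only a minimal sketch: the function name `check_sentence_tags` and the sample file are hypothetical, and the patterns `<s[ >]` and `</s>` are an assumption that is slightly stricter than the `<s` and `</s` used above, so that other tags beginning with `s` would not be counted. Note that `grep -c` counts matching lines, not occurrences, so the result equals the tag count only when each tag sits on its own line.

```shell
# Hypothetical helper: succeed only if the file has as many
# opening <s ...> lines as closing </s> lines.
check_sentence_tags() {
    local file="$1"
    local opened closed
    # grep -c counts matching LINES; this assumes one tag per line.
    opened=$(grep -c '<s[ >]' "$file")
    closed=$(grep -c '</s>' "$file")
    echo "opened=$opened closed=$closed"
    [ "$opened" -eq "$closed" ]
}

# Tiny made-up sample in the assumed one-tag-per-line layout.
sample=$(mktemp)
printf '%s\n' '<doc id="1">' '<s id="1">' 'Tere' '</s>' \
              '<s id="2">' 'maailm' '</s>' '</doc>' > "$sample"

check_sentence_tags "$sample"
```

On a real file the call would be `check_sentence_tags sonaveeb_2020.prevert`; a non-zero exit status signals that the opening and closing counts differ.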

The file is first split into pieces by the `corpus` attribute.