Edinburgh Speech Tools  2.1-release
doc/esttilt.md
1 The Tilt Intonation Model {#esttilt}
2 ===========================
3 
4 *Tilt* is a phonetic model of intonation that
5 represents intonation as a sequence of continuously parameterised
6 events.
7 
8 The tilt library is a set of functions that analyse, synthesize and
9 manipulate tilt representations.
10 
11 # Theoretical Overview {#tilt-overview}
12 
13 The basic unit in the tilt model is the *intonational event*.
14 Events occur as instants with nothing between them,
15 as opposed to segmental based phenomena where units occur in a
16 contiguous sequence. The basic types of intonational event are
17 *pitch accents* and (following the popular
18 terminology) *boundary tones*. Pitch accents
19 (denoted by the letter a) are F0 excursions associated with
20 syllables which are used by the speaker to give some degree of
21 emphasis to a particular word or syllable. In the tilt model, boundary
22 tones (b) are rising F0 excursions which occur at the edges of
23 intonational phrases and, as well as giving the hearer a cue as to the
24 end of the phrase, can also signal effects such as continuation and
25 questioning. A combination event ab occurs when a pitch accent
26 and boundary tone occur so close to one another that only a single
27 pitch movement is observed. There are different kinds of pitch accents
28 and boundary tones: the choice of pitch accent and boundary tone
29 allows the speaker to produce different global intonational tunes
30 which can indicate questions, statements, moods etc to the hearer.
31 
32 \anchor tilt-f0-representation
33 \image html tilt-f0-representation.svg "Schematic F0 representation"
34 \image latex tilt-f0-representation.eps "Schematic F0 representation" width=7cm
35 
36 
37 
38 \ref tilt-f0-representation shows a schematic representation of F0,
39 intonational event relation and segment relation in the Tilt
40 model. The linguistically relevant parts of the F0 contour, which
41 correspond to intonational events, are circled. The events, labelled a
42 for pitch accent and b for boundary, are linked to the syllable nuclei
43 of the syllable relation. Note that every event is linked to a
44 syllable, but some syllables do not have events.
45 
46 Unlike traditional intonational phonology schemes \cite{ph:thesis},
47 \cite{tobi} which impose a categorical classification on events, Tilt
48 uses a set of continuous parameters. These parameters, collectively
49 known as *tilt parameters*, are determined from
50 examination of the local shape of the event's F0 contour.
51 
52 The tilt model is built on a simpler model, the rise/fall/connection (RFC) model.
53 
54 In the RFC model, each event is modelled by a rise part followed by a
55 fall part. Each part has an amplitude and duration, and two parameters
56 are used to give the time position of the event in the utterance and
57 the F0 height of the event. \ref figure-typical-pitch-accent shows a typical
58 pitch accent with these parameters marked.
59 
60 \anchor figure-typical-pitch-accent
61 \image html typical-pitch-accent.svg "Typical pitch accent"
62 \image latex typical-pitch-accent.eps "Typical pitch accent" width=7cm
63 
64 
65 The RFC parameters for each event are therefore:
66 
67  - rise amplitude (Hz)
68  - rise duration (seconds)
69  - fall amplitude (Hz)
70  - fall duration (seconds)
71  - position (seconds)
72  - F0 height (Hz)
73 
74 Sometimes events don't have rise or fall parts, and in these cases the
75 amplitude and duration of the missing part are set to 0. The position
76 parameter can be specified in two ways: either as the distance from
77 the start of the utterance, or the distance from the start of the
78 vowel of the associated syllable. The latter is more linguistically
79 meaningful, but as vowel boundaries are not always available, the
80 former is often used.
81 
82 While the RFC model can accurately describe F0 contours, the mechanism
83 is not ideal in that the RFC parameters for each contour are not as
84 easy to interpret and manipulate as one might like. For instance there
85 are two amplitude parameters for each event, when it would make sense
86 to have only one.
87 
88 The *Tilt* representation helps solve these
89 problems by transforming the four amplitude and duration RFC
90 parameters into three Tilt parameters:
91 
92  - amplitude (Hz): the sum of the magnitudes of the rise and fall amplitudes.
93  - duration (seconds): the sum of the rise and fall durations.
94  - tilt: a dimensionless number which expresses the overall *shape*
95  of the event, independent of its amplitude or duration.
96 
97 The position and F0 height parameters are the same as before.
98 
99 The tilt representation is superior to the RFC representation in that
100 it has fewer parameters without significant loss of
101 accuracy. Importantly, it can be argued that the tilt parameters are
102 more linguistically meaningful.
103 
104 In describing the tilt model, we use the term
105 *analysis* to describe the process of producing a
106 tilt representation from an F0 contour, and
107 *synthesis* to describe the process of producing an F0 contour from a
108 tilt representation.
109 
110 ## RFC Analysis {#esttilt-overview-rfcanalysis}
111 
112 ### Locating Events in the F0 contour {#esttilt-overview-rfcanalysis-locating}
113 
114 The first stage in analysis is to find the intonational events in an
115 F0 contour. EST does not directly provide a means for doing this. In
116 practice this is either done by hand by a human labeller, or
117 automatically by the HMM auto event labeller. The current HMM event
118 labeller is based on the HTK system and hence can't be part of EST,
119 but an outline of the system follows:
120 
121 The automatic event detector uses continuous density hidden Markov
122 models to perform a segmentation of the input utterance. A number of
123 units are defined and an HMM is trained for each unit on examples of that
124 kind from a pre-labelled training corpus using the Baum-Welch algorithm
125 \cite{baum:72}. Each utterance in the corpus is acoustically processed
126 so that it can be represented by a sequence of evenly spaced
127 frames. Each frame is a multi-component vector representing the
128 acoustic information for the time interval centred around the frame.
129 
130 Recognition is performed by forming a network comprising the HMMs for
131 each unit in conjunction with an n-gram language model which gives the
132 prior probability of a sequence of n units occurring. To perform
133 recognition on an utterance, the network is searched using the
134 standard Viterbi algorithm to find the most likely path through the
135 network given the input sequence of acoustic vectors.
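
The search step can be illustrated with a small, self-contained Viterbi decoder. This is only a sketch under simplifying assumptions: each unit is collapsed to a single state, the language model is a bigram, and the names `viterbi`, `obs_loglik`, `log_trans` and `log_init` are hypothetical. It is not the HTK-based labeller and is not part of EST.

```python
def viterbi(obs_loglik, log_trans, log_init):
    """Return the most likely unit index for each frame.

    obs_loglik[t][u]: log-likelihood of frame t under unit u (from the HMMs).
    log_trans[p][u]:  log prior of unit u following unit p (bigram model).
    log_init[u]:      log prior of starting in unit u.
    """
    n_units = len(log_init)
    # best[u] is the score of the best path that ends in unit u so far.
    best = [log_init[u] + obs_loglik[0][u] for u in range(n_units)]
    back = []  # back[t - 1][u] is the best predecessor of unit u at frame t
    for frame in obs_loglik[1:]:
        prev, best, ptr = best, [], []
        for u in range(n_units):
            p = max(range(n_units), key=lambda q: prev[q] + log_trans[q][u])
            best.append(prev[p] + log_trans[p][u] + frame[u])
            ptr.append(p)
        back.append(ptr)
    # Trace back from the best final unit to recover the whole path.
    path = [max(range(n_units), key=lambda u: best[u])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Runs of identical unit indices in the returned path correspond to labelled regions; converting those runs to start and end times only requires the frame spacing.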
136 
137 It is our intention to put a complete event labeller in EST in the future.
138 
139 ### Producing an RFC representation from an utterance's events and F0 contour {#ov-rfc-analysis}
140 
141 An utterance's events are represented in a relation. Initially, events
142 are stored as regions with start and stop times as this is the most
143 common output format of labellers (both human and automatic).
144 
145 For example, for utterance kdt_016, a set of basic event labels is as
146 follows (in xlabel format):
147 
148  0.290 146 sil
149  0.480 146 c
150  0.620 146 a
151  0.760 146 c
152  0.960 146 a
153  1.480 146 c
154  1.680 146 a
155  1.790 146 sil
156 
157 Events are labelled "a", and silences "sil". The use of the "c" label
158 is to allow start times which differ from the end of the previous
159 event. Conceptually, this can also be represented as follows:
160 
161  name:sil start:0.0 end:0.290
162  name:a start:0.480 end:0.620
163  name:a start:0.760 end:0.960
164  name:a start:1.480 end:1.680
165  name:sil start:1.680 end:1.790
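
The conversion from end-time labels to regions can be sketched as follows. This is an illustration only, assuming that any file header has already been stripped, that each line holds an end time, a colour field and a name, and that every label starts where the previous one ends. The function name `labels_to_events` and the `filler` argument are hypothetical, not EST functions.

```python
def labels_to_events(xlabel_lines, filler="c"):
    """Convert end-time labels into (name, start, end) regions."""
    events, prev_end = [], 0.0
    for line in xlabel_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        end, name = float(fields[0]), fields[2]
        if name != filler:
            # A region starts where the previous label (filler or not) ended.
            events.append((name, prev_end, end))
        prev_end = end
    return events


labels = """
 0.290 146 sil
 0.480 146 c
 0.620 146 a
 0.760 146 c
 0.960 146 a
 1.480 146 c
 1.680 146 a
 1.790 146 sil
""".splitlines()

for name, start, end in labels_to_events(labels):
    print(f"name:{name} start:{start:.3f} end:{end:.3f}")
```

The filler label contributes only its end time, which becomes the start of the next event.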
166 
167 The other component for analysis is the utterance's F0 contour, which
168 is stored in a track. The contour must be continuous (i.e. have no
169 breaks), and its frames must be specified at fixed intervals. For best
170 performance the contour should have been smoothed.
171 
172 The RFC analysis component takes the approximate labels and the
173 smoothed F0 contour, fits rise and fall shapes, and hence determines
174 an optimal set of RFC parameters for the utterance.
175 
176 For each event, a peak picking algorithm decides if the event has a
177 rise part only, a fall part only or a rise part followed by a fall
178 part.
179 
180 \anchor tilt-search-region
181 \image html tilt-search-region.svg "Tilt search region"
182 \image latex tilt-search-region.eps "Tilt search region" width=7cm
183 
184 
185 For each part, a search region, shown in \ref tilt-search-region,
186 is defined around the approximate start and end boundaries as defined
187 in the input label file. The search region is controlled by a number
188 of parameters:
189 
190  - start_limit: the distance in seconds before each input start
191 boundary that the start search region should begin.
192  - end_limit: the distance in seconds after each input end
193 boundary that the end search region should end.
194  - range: the end and beginnings of the start and end regions
195 respectively, specified as a fraction of the overall label duration.
196 
197 For example, suppose a pitch accent starts at 1.45 seconds and ends at 1.75
198 seconds. If the start and end limits are both 0.1 seconds and the range
199 is 0.4 (40% of the 0.3 second label, i.e. 0.12 seconds), then the start
200 region runs from 1.35 to 1.57 seconds and the end region from 1.63 to 1.85
201 seconds. The matching algorithm will synthesize every possible shape
202 lying within these regions, measure the distance between each and the
203 actual contour, and pick the one with the lowest distance.
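
The search can be sketched for a single rise or fall part as below. This is a minimal illustration rather than the EST implementation. It assumes that `range` is measured inwards from each boundary as a fraction of the label duration, that the contour is a list of F0 values with frame `i` at time `i * shift`, that each candidate shape is a pair of quadratics joining the contour values at its two end frames, and that "distance" means RMS error. All function names are hypothetical.

```python
import math


def search_regions(start, end, start_limit, end_limit, range_frac):
    """Return ((start_lo, start_hi), (end_lo, end_hi)) in seconds."""
    span = range_frac * (end - start)
    return (start - start_limit, start + span), (end - span, end + end_limit)


def shape(y0, y1, n):
    """n + 1 samples of a two-quadratic curve running from y0 to y1."""
    amp = y0 - y1
    samples = []
    for i in range(n + 1):
        x = i / n  # normalised time t/D
        if x < 0.5:
            samples.append(y1 + amp - 2 * amp * x * x)
        else:
            samples.append(y1 + 2 * amp * (1 - x) ** 2)
    return samples


def best_fit(f0, shift, start_region, end_region):
    """Try every (start, end) frame pair; return (rms_error, start, end)."""
    first_s, last_s = (round(t / shift) for t in start_region)
    first_e, last_e = (round(t / shift) for t in end_region)
    best = None
    for s in range(max(first_s, 0), last_s + 1):
        for e in range(max(first_e, s + 1), min(last_e, len(f0) - 1) + 1):
            model = shape(f0[s], f0[e], e - s)
            target = f0[s:e + 1]
            err = math.sqrt(sum((m - y) ** 2 for m, y in zip(model, target))
                            / len(model))
            if best is None or err < best[0]:
                best = (err, s * shift, e * shift)
    return best


regions = search_regions(1.45, 1.75, start_limit=0.1, end_limit=0.1,
                         range_frac=0.4)
```

Passing the two regions and a smoothed contour to `best_fit` returns the boundary pair whose synthesized shape lies closest to the contour. The cost grows with the product of the two region sizes, which is one reason to keep the limits small.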
204 
205 The final result of the matching process is a relation of events,
206 each with the 6 RFC parameters described above.
207 
208 The program \ref tilt_analysis will perform RFC matching
209 given a label file and F0 contour. The function
210 \ref rfc_analysis takes a F0 contour, a relation and a
211 set of options and returns the RFC parameters in the features of each
212 item in the relation.
213 
214 
215 ## RFC to Tilt Conversion {#rfc2tilt}
216 
217 The rise and fall RFC parameters can be converted to Tilt parameters
218 using the following equations.
219 
220 *Amplitude* is the sum of the magnitudes of the rise
221 and fall amplitudes:
222 
223 \f[ A_{event} = \left | A_{rise} \right | + \left | A_{fall} \right | \f]
224 
225 *Duration* is the sum of the rise and fall durations:
226 
227 \f[ D_{event} = D_{rise} + D_{fall} \f]
228 
229 *Tilt* can be measured with respect to amplitude:
230 
231 \f[ tilt_{amp} = \frac{ \left | A_{rise} \right | -
232  \left | A_{fall}\right |}{
233  \left | A_{rise} \right | +
234  \left | A_{fall}\right |} \f]
235 
236 or duration:
237 
238 \f[ tilt_{dur} = \frac{ D_{rise} - D_{fall}}{ D_{rise} + D_{fall}} \f]
239 
240 The tilt model assumes that these are strongly correlated so that an
241 average of the two is representative of the shape of the event:
242 
243 \f[ tilt = \frac{ \left | A_{rise} \right | -
244  \left | A_{fall}\right |}{
245  2 \left (\left | A_{rise} \right | +
246  \left | A_{fall}\right | \right )} +
247  \frac{ D_{rise} - D_{fall}}{ 2 ( D_{rise} + D_{fall})}
248  \f]
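
As a worked illustration, the conversion for a single event can be written in a few lines. This is a standalone sketch, not the EST `rfc_to_tilt` function: amplitude and duration are the sums described above, and tilt is the average of the amplitude ratio and the duration ratio. Returning 0 for a ratio whose denominator is zero is an assumption made here to keep the example total.

```python
def rfc_to_tilt(rise_amp, rise_dur, fall_amp, fall_dur):
    """Return (amplitude, duration, tilt) for one event."""
    a_rise, a_fall = abs(rise_amp), abs(fall_amp)
    amp = a_rise + a_fall
    dur = rise_dur + fall_dur
    # A ratio with a zero denominator is undefined; 0.0 is an arbitrary
    # choice made for this sketch.
    tilt_amp = (a_rise - a_fall) / amp if amp else 0.0
    tilt_dur = (rise_dur - fall_dur) / dur if dur else 0.0
    return amp, dur, 0.5 * (tilt_amp + tilt_dur)
```

With these definitions a pure rise has a tilt of +1, a pure fall has a tilt of -1, and an event whose rise and fall are equal in size and length has a tilt of 0.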
249 
250 
251 There is no stand alone program to do this conversion, but the
252 \ref tilt_analysis program can do this conversion in addition to
253 performing the RFC matching as described above.
254 
255 
256 The function \ref rfc_to_tilt takes a relation
257 containing RFC parameterised items and converts it to a relation
258 containing Tilt parameterised items.
259 
260 
261 Another function, also called \ref rfc_to_tilt, takes a
262 Features object containing the 4 rise fall parameters and writes the 3
263 tilt parameters into another features object. This function can be
264 used to do rfc_to_tilt conversion for a single event.
265 
266 ## Tilt to RFC Conversion {#tilt2rfc}
267 
268 The Tilt parameters can be converted to RFC parameters using the
269 following equations:
270 
271 \important Rise amplitude:
272 \anchor tilt-rise-amplitude
273 \f[
274 A_{rise} = \frac{A_{event} (1 + tilt)}{2}
275 \f]
276 
277 \important Fall amplitude:
278 \anchor tilt-fall-amplitude
279 \f[
280 A_{fall} = \frac{A_{event} (1 - tilt)}{2}
281 \f]
282 
283 
284 \important Rise duration:
285 \anchor tilt-rise-duration
286 \f[
287 D_{rise} = \frac{D_{event} (1 + tilt)}{2}
288 \f]
289 
290 \important Fall duration:
291 \anchor tilt-fall-duration
292 \f[
293 D_{fall} = \frac{D_{event} (1 - tilt)}{2}
294 \f]
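
The inverse conversion for a single event can be sketched in the same way. This is a standalone illustration, not the EST `tilt_to_rfc` function. It reads the four equations as giving, in order, the rise amplitude, fall amplitude, rise duration and fall duration, and it returns the fall amplitude as a magnitude because no sign convention is stated here.

```python
def tilt_to_rfc(amp, dur, tilt):
    """Return (rise_amp, rise_dur, fall_amp, fall_dur) for one event."""
    rise_amp = amp * (1 + tilt) / 2
    fall_amp = amp * (1 - tilt) / 2  # magnitude of the fall
    rise_dur = dur * (1 + tilt) / 2
    fall_dur = dur * (1 - tilt) / 2
    return rise_amp, rise_dur, fall_amp, fall_dur
```

Because a single tilt value stands in for both the amplitude ratio and the duration ratio, converting RFC parameters to Tilt and back reproduces the original values exactly only when those two ratios were equal.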
295 
296 
297 
298 
299 
300 
301 There is no stand alone program to do this conversion, but the
302 \ref tilt_synthesis program can do this conversion in addition to
303 generating an F0 contour.
304 
305 
306 The function \ref tilt_to_rfc takes a relation
307 containing Tilt parameterised items and converts it to a relation
308 containing RFC parameterised items.
309 
310 
311 Another function, also called \ref tilt_to_rfc, takes a
312 Features object containing the 3 Tilt parameters and writes the 4 rise
313 fall RFC parameters into another features object. This function can be
314 used to do tilt_to_rfc conversion for a single event.
315 
316 ## RFC to F0 Synthesis {#ov-rfc-to-tilt}
317 
318 An F0 contour can be generated from a set of RFC parameters using the
319 following equations.
320 
321 
322 Events are generated as piecewise combinations of quadratic functions:
323 
324 \f{eqnarray*}{
325 f_0(t) = A_{abs} + A - 2 A \cdot (t/D)^2 & 0 < t < D/2 \\
326 f_0(t) = A_{abs} + 2 A \cdot (1-t/D)^2 & D/2 < t < D
327 \f}
328 
329 Between events, straight lines are used:
330 
331 \f[
332 f_0(t) = A_{abs} + A \cdot (t/D) ~~ 0 < t < D
333 \f]
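
The two kinds of segment can be sampled directly from these equations. The sketch below is illustrative only and is not the EST `rfc_synthesis` or `synthesize_rf_event` code. The names `synth_part` and `synth_connection` and the frame spacing argument `shift` are hypothetical. Taken as written, the event equations describe a curve that starts at A_abs + A and ends at A_abs, while the connection starts at A_abs and ends at A_abs + A.

```python
def synth_part(a_abs, a, dur, shift):
    """Sample one rise or fall part: from a_abs + a down (or up) to a_abs."""
    n = max(1, round(dur / shift))
    samples = []
    for i in range(n + 1):
        x = i / n  # normalised time t/D
        if x < 0.5:
            samples.append(a_abs + a - 2 * a * x ** 2)
        else:
            samples.append(a_abs + 2 * a * (1 - x) ** 2)
    return samples


def synth_connection(a_abs, a, dur, shift):
    """Sample a straight line from a_abs to a_abs + a."""
    n = max(1, round(dur / shift))
    return [a_abs + a * i / n for i in range(n + 1)]


# A 30 Hz rise lasting 0.2 s that peaks at 130 Hz, sampled every 10 ms.
rise = synth_part(130.0, -30.0, 0.2, 0.01)
```

The two quadratics meet at the midpoint, where each has covered half of the amplitude, so a part is smooth at both ends and steepest in the middle.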
334 
335  The stand alone program
336 \ref tilt_synthesis can do this conversion. It takes an
337 RFC label file as input and produces an F0 file. This program can also
338 generate an F0 file directly from a Tilt label file.
339 
340 The function \ref rfc_synthesis takes a relation
341 containing RFC parameterised items and produces an F0 contour in a
342 Track.
343 
344 The function \ref synthesize_rf_event takes a Features
345 object containing the 4 rise fall RFC parameters and generates the F0
346 contour for a single event.
347 
348 # Executable Programs
349 
350  - \ref tilt-analysis_manual: Produces a Tilt or RFC analysis of an
351  F0 contour, given a label file containing a set of approximate
352  intonational event boundaries.
353  - \ref tilt-synthesis_manual: Generates an F0 contour,
354  given a label file containing parameterised Tilt or RFC events.
355  - \ref pda_manual: Generates F0 contours.
356 
357 # Functions
358 
359  - \ref tiltfunctions