D. Yarrington, S. Hoskins, H. Bunnell
May 1, 1997
Journal of the Acoustical Society of America
A method has been developed for automatically extracting diphone speech segments with context‐dependent boundaries. When compared with speech synthesized from manually extracted diphone speech segments, it was found that speech synthesized from the automatically extracted segments was, overall, slightly less intelligible but slightly more natural sounding [Yarrington et al., “Robust automatic extraction of diphones with variable boundaries,” in EUROSPEECH ’95, 4th European Conference on Speech Communication and Technology, Vol. 3, pp. 1845–1848, Madrid, Spain (1995)]. The lower intelligibility appeared to be due to a small number of very poor diphone segments. While it is feasible to correct this problem by manually replacing misleading diphones, several changes have been made to the extraction procedure to eliminate or at least reduce the frequency of occurrence of incorrect diphones. In particular, a different spectral measure is being used for estimates of spectral similarity, and F0 plus a spectral ...