
Computer-aided animated gesture-driven facial model with speech synthesis

  • Arumugam Rathinavelu and Kesavamurthi Saranya
Published/Copyright: February 24, 2014

Abstract

The aim of this study was to develop a gesture-driven facial model with speech synthesis capability. A two-dimensional facial model was developed and animated based on the Facial Action Coding System. The emotions “happy”, “sad”, “anger”, and “fear” were simulated and visualized through combinations of eight action units. A speech synthesizer for the Tamil language was built using a syllable-based concatenation approach. The results indicated that the synthetic speech was rated, on average, 85%–90% as natural as human speech. Moreover, 75%–85% of the words were articulated well and correctly identified by the children. The ultimate goal of the system is to assist children with vocal and hearing disabilities in their language learning process.
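As a concrete illustration of the animation technique named above, here is a minimal sketch (in Python, which the paper does not specify) of driving a 2D facial model by blending Facial Action Coding System (FACS) action units. The paper combines eight action units; for brevity this toy example uses four, and every landmark, displacement value, and emotion recipe below is an illustrative assumption rather than the authors’ parameters.

```python
import numpy as np

# Hypothetical per-AU displacements (dx, dy) for a 4-point toy face:
# [left brow, right brow, left mouth corner, right mouth corner].
# Keys are FACS action-unit numbers; values are assumed, not from the paper.
AU_DISPLACEMENTS = {
    1:  np.array([[0.0, -2.0], [0.0, -2.0], [0.0, 0.0], [0.0, 0.0]]),   # inner brow raiser
    4:  np.array([[0.0,  1.5], [0.0,  1.5], [0.0, 0.0], [0.0, 0.0]]),   # brow lowerer
    12: np.array([[0.0,  0.0], [0.0,  0.0], [-1.0, -1.5], [1.0, -1.5]]),  # lip corner puller
    15: np.array([[0.0,  0.0], [0.0,  0.0], [-0.5,  1.5], [0.5,  1.5]]),  # lip corner depressor
}

# Hypothetical emotion "recipes": emotion -> {AU number: weight}.
EMOTIONS = {
    "happy": {12: 1.0},
    "sad":   {1: 0.8, 15: 1.0},
    "anger": {4: 1.0},
    "fear":  {1: 1.0, 4: 0.5},
}

# Neutral 2D landmark positions for the toy face.
NEUTRAL = np.array([[-1.0, 1.0], [1.0, 1.0], [-0.5, -1.0], [0.5, -1.0]])

def pose_for(emotion: str, intensity: float = 1.0) -> np.ndarray:
    """Return landmark positions for an emotion as a weighted AU blend."""
    points = NEUTRAL.copy()
    for au, weight in EMOTIONS[emotion].items():
        points += intensity * weight * AU_DISPLACEMENTS[au]
    return points

for name in EMOTIONS:
    print(name, pose_for(name).round(2).tolist())
```

A second, equally schematic sketch shows the idea behind syllable-based concatenative synthesis: prerecorded syllable waveforms are looked up in a unit inventory and joined, with a short crossfade at each boundary to smooth the joins. The syllable names, unit lengths, and overlap size are hypothetical stand-ins, not the paper’s Tamil inventory.

```python
import numpy as np

def concatenate_syllables(syllables, units, overlap=160):
    """Join syllable waveforms with a linear crossfade of `overlap` samples."""
    out = units[syllables[0]].astype(float)
    fade = np.linspace(0.0, 1.0, overlap)
    for name in syllables[1:]:
        nxt = units[name].astype(float)
        # Crossfade the tail of the running signal into the head of the next unit.
        out[-overlap:] = out[-overlap:] * (1.0 - fade) + nxt[:overlap] * fade
        out = np.concatenate([out, nxt[overlap:]])
    return out

# Toy usage with synthetic "recordings" standing in for a syllable inventory.
rng = np.random.default_rng(0)
units = {s: rng.standard_normal(1600) for s in ("ta", "mil")}
wave = concatenate_syllables(["ta", "mil"], units)
print(wave.shape)  # (3040,): 1600 + 1600 samples minus the 160-sample overlap
```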


Corresponding author: Arumugam Rathinavelu, Department of Computer Science and Engineering, Dr. Mahalingam College of Engineering and Technology, Pollachi, India, E-mail:

Acknowledgments

This research work was financially supported by the Dr. Mahalingam College of Engineering and Technology, Pollachi, South India. We also wish to thank the members of the college management.

Conflict of interest statement

Authors’ conflict of interest disclosure: The authors stated that there are no conflicts of interest regarding the publication of this article. Research support played no role in the study design; in the collection, analysis, and interpretation of data; in the writing of the report; or in the decision to submit the report for publication.

Research funding: None declared.

Employment or leadership: None declared.

Honorarium: None declared.


Received: 2013-06-07
Accepted: 2013-08-01
Published Online: 2014-02-24
Published in Print: 2015-02-01

©2015 by De Gruyter
