[Foreign Press] Speech recognition systems are biased, too

April 17, 2021

As reported by VentureBeat.

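Even state-of-the-art automatic speech recognition (ASR) algorithms struggle to recognize the accents of people from certain parts of the world. A new study published by researchers at the University of Amsterdam, the Netherlands Cancer Institute, and Delft University of Technology found that an ASR system for Dutch recognizes speakers of certain age groups, genders, and countries of origin better than others.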

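Speech recognition has come a long way since IBM’s Shoebox machine and Worlds of Wonder’s Julie doll. Yet despite the progress made possible by AI, today's speech recognition systems are imperfect at best and, in some respects, discriminatory. In a study commissioned by the Washington Post, popular smart speakers made by Google and Amazon were about 30% less likely to understand non-American accents than those of native-born users. More recently, the Algorithmic Justice League's Voice Erasure project found that speech recognition systems from Apple, Amazon, Google, IBM, and Microsoft show a word error rate of 35% for African American voices, compared with 19% for white voices.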

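The coauthors of this latest study set out to investigate how well an ASR system for Dutch recognizes different groups of speakers. In a series of experiments, they examined whether the system could handle diversity in speech along the dimensions of gender, age, and accent.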

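The researchers began by having an ASR system ingest sample data from CGN, an annotated corpus used to train AI language models to recognize Dutch. CGN contains recordings of speakers aged 18 to 65 from the Netherlands and the Flanders region of Belgium, covering speaking styles that include broadcast news and telephone conversations.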

CGN contains 483 hours of speech from 1,185 women and 1,678 men. To make the system more robust, the coauthors also applied data augmentation techniques, increasing the total hours of training data “ninefold.”

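When the researchers ran the trained ASR system on a test set derived from CGN, they found that it recognized female speech more reliably than male speech, regardless of speaking style. The system also had more difficulty recognizing older speakers than younger ones. The gap between native and non-native speakers was larger still: even the worst-recognized native speech, that of Dutch children, had a word error rate about 20% better than that of the best-recognized non-native age group.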

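Overall, teenagers' speech was interpreted most accurately by the system, followed by that of seniors (over the age of 65) and children. This held even for non-native speakers who were highly proficient in Dutch vocabulary and grammar.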

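To address bias in ASR systems, the researchers recommend anticipating affective prejudice, checking for it proactively, and building mitigation strategies into the development process from the outset. As a direct strategy, they point to diversifying the dataset and aiming for balanced representation. As an indirect strategy, they point to diverse team composition, since variety in age, region, gender, and more provides additional lenses for spotting potential bias in design. They also note that while the bias that creeps into datasets cannot be removed completely, it can be mitigated at the algorithmic level.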

Even state-of-the-art automatic speech recognition (ASR) algorithms struggle to recognize the accents of people from certain regions of the world. That’s the top-line finding of a new study published by researchers at the University of Amsterdam, the Netherlands Cancer Institute, and the Delft University of Technology, which found that an ASR system for the Dutch language recognized speakers of specific age groups, genders, and countries of origin better than others.

Speech recognition has come a long way since IBM’s Shoebox machine and Worlds of Wonder’s Julie doll. But despite progress made possible by AI, voice recognition systems today are at best imperfect and at worst discriminatory. In a study commissioned by the Washington Post, popular smart speakers made by Google and Amazon were 30% less likely to understand non-American accents than those of native-born users. More recently, the Algorithmic Justice League’s Voice Erasure project found that speech recognition systems from Apple, Amazon, Google, IBM, and Microsoft collectively achieve word error rates of 35% for African American voices versus 19% for white voices.

The coauthors of this latest research set out to investigate how well an ASR system for Dutch recognizes speech from different groups of speakers. In a series of experiments, they observed whether the ASR system could contend with diversity in speech along the dimensions of gender, age, and accent.

The researchers began by having an ASR system ingest sample data from CGN, an annotated corpus used to train AI language models to recognize the Dutch language. CGN contains recordings spoken by people ranging in age from 18 to 65 years old from the Netherlands and the Flanders region of Belgium, covering speaking styles including broadcast news and telephone conversations.

CGN has a whopping 483 hours of speech spoken by 1,185 women and 1,678 men. But to make the system even more robust, the coauthors applied data augmentation techniques to increase the total hours of training data “ninefold.”

When the researchers ran the trained ASR system on a test set derived from the CGN, they found that it recognized female speech more reliably than male speech regardless of speaking style. Moreover, the system struggled to recognize speech from older people compared with younger people, potentially because older speakers' speech was less clearly articulated. And it had an easier time detecting speech from native speakers versus non-native speakers. Indeed, the worst-recognized native speech, that of Dutch children, had a word error rate around 20% better than that of the best non-native age group.
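For readers unfamiliar with the metric, word error rate (WER) is conventionally the number of word-level substitutions, insertions, and deletions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of how a per-group comparison like the one above could be computed. It is not the researchers' evaluation code: the function names, group labels, and English toy sentences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Return {group: WER} for (group, reference, hypothesis) triples.

    Errors and reference words are pooled per group before dividing,
    so long utterances weigh more than short ones. Each group is
    assumed to contain at least one reference word.
    """
    errors = defaultdict(int)
    ref_counts = defaultdict(int)
    for group, reference, hypothesis in samples:
        ref_words = reference.split()
        errors[group] += word_edit_distance(ref_words, hypothesis.split())
        ref_counts[group] += len(ref_words)
    return {group: errors[group] / ref_counts[group] for group in ref_counts}


# Hypothetical toy data, not drawn from CGN or the study.
samples = [
    ("native", "the cat sat on the mat", "the cat sat on the mat"),
    ("native", "turn on the light", "turn on the lights"),
    ("non_native", "the cat sat on the mat", "the cat sat in mat"),
    ("non_native", "turn on the light", "turn the light"),
]

rates = wer_by_group(samples)
for group, rate in rates.items():
    print(f"{group}: WER = {rate:.0%}")
```

Comparing the resulting per-group rates is what exposes a gap between speaker groups. Note that a statement such as "20% better" can mean either an absolute difference in percentage points or a relative reduction, so it is worth stating which one is being reported.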

In general, the results suggest that teenagers’ speech was most accurately interpreted by the system, followed by seniors’ (over the age of 65) and children’s. This held even for non-native speakers who were highly proficient in Dutch vocabulary and grammar.

As the researchers point out, while it’s impossible to fully remove the bias that creeps into datasets, one solution is to mitigate this bias at the algorithmic level.

“[We recommend] framing the problem, developing the team composition and the implementation process from a point of anticipating, proactively spotting, and developing mitigation strategies for affective prejudice [to address bias in ASR systems],” the researchers wrote in a paper detailing their work. “A direct bias mitigation strategy concerns diversifying and aiming for a balanced representation in the dataset. An indirect bias mitigation strategy deals with diverse team composition: the variety in age, regions, gender, and more provides additional lenses of spotting potential bias in design. Together, they can help ensure a more inclusive developmental environment for ASR.”
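As a rough illustration of what checking for "balanced representation" might involve, the sketch below computes each group's share of a dataset from the speaker counts reported for CGN earlier in the article (1,185 women and 1,678 men). The helper name `group_shares` is hypothetical, and the calculation counts speakers rather than hours of audio, which the article does not break down by gender.

```python
def group_shares(counts):
    """Return each group's fraction of the total, given {group: count}."""
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# Speaker counts as reported in the article; hours per group are not given.
cgn_speakers = {"women": 1185, "men": 1678}

shares = group_shares(cgn_speakers)
for group, share in shares.items():
    print(f"{group}: {share:.1%} of speakers")
```

A similar tally over age bands or native versus non-native speakers would be a natural next step, assuming the corpus metadata records those attributes.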