NECTEC Technical Journal

A Statistical Grammar Acquisition Method Based on
Clustering Analysis using a Bracketed Corpus

Thanaruk Theeramunkong
Department of Electrical Engineering
Sirindhorn International Institute of Technology
Thammasat University, Pathumthani 12121, Thailand
ping@siit.tu.ac.th

ABSTRACT -- This paper proposes a new method for learning a context-free grammar with context-sensitive conditional probabilities from an unlabeled bracketed corpus by means of clustering analysis. It also describes a natural language parsing model that uses a probability-based scoring function of the grammar to rank the parses of a sentence. Brackets in the corpus are grouped into a number of clusters of similar brackets according to their local contextual information; the corpus is thereby labeled automatically with nonterminal labels, and a grammar with conditional probabilities is acquired from it. The statistical parsing model provides a framework for finding the most likely parse of a sentence based on these conditional probabilities. Experiments using Wall Street Journal data show that the approach achieves relatively high accuracy: 88% recall, 72% precision, and 0.7 crossing brackets per sentence for sentences shorter than 10 words, and 71% recall, 51% precision, and 3.4 crossing brackets per sentence for sentences of 10 to 19 words. These results support the assumption that local contextual statistics obtained from an unlabeled bracketed corpus are effective both for learning a useful grammar and for parsing.
Keywords -- Statistical Parsing, Grammar Acquisition, Clustering Analysis, Local Contextual Information

ABSTRACT (translated from the Thai) -- This paper presents a method for learning grammar rules whose probabilities of use are conditioned on the surrounding context. Learning uses a corpus of sentences that are annotated with sentence structure but carry no labels saying what each structure is. The paper also proposes a parsing model that uses the grammar rules obtained in this way to rank the candidate analyses of a sentence. In experiments on a large corpus of English sentences from the Wall Street Journal, the proposed method parses sentences with high accuracy: 88% recall, 72% precision, and 0.7 crossing brackets per sentence for sentences shorter than 10 words, and 71% recall, 51% precision, and 3.4 crossing brackets per sentence for sentences of 10 to 19 words. These results support the hypothesis that local context is useful and effective for learning grammar rules, which in turn benefits parsing.
Keywords (translated from the Thai) -- Statistical Parsing, Grammar Acquisition, Clustering Analysis, Local Contextual Information
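
The three accuracy figures quoted in the abstract (recall, precision, and crossing brackets) can be illustrated with a small sketch. It assumes the commonly used unlabeled bracket-matching definitions, since this section does not spell out the exact scoring procedure; the function names, the span representation, and the example brackets are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# A bracket is a (start, end) pair of word positions, end exclusive.
Span = Tuple[int, int]


def crosses(a: Span, b: Span) -> bool:
    """True if the two spans overlap but neither contains the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def bracket_scores(predicted: Set[Span], gold: Set[Span]):
    """Return (recall, precision, crossing) for one sentence.

    recall    = matched brackets / brackets in the gold parse
    precision = matched brackets / brackets in the predicted parse
    crossing  = predicted brackets that cross at least one gold bracket
    """
    matched = len(predicted & gold)
    recall = matched / len(gold) if gold else 0.0
    precision = matched / len(predicted) if predicted else 0.0
    crossing = sum(1 for p in predicted if any(crosses(p, g) for g in gold))
    return recall, precision, crossing


# Hypothetical five-word sentence: the predicted bracket (1, 3) is absent
# from the gold parse and crosses the gold bracket (0, 2).
gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
predicted = {(0, 5), (0, 2), (1, 3), (3, 5)}
print(bracket_scores(predicted, gold))
```

Corpus-level figures such as those in the abstract would then be obtained by aggregating over all test sentences; whether the paper micro-averages bracket counts or averages per sentence is not stated here, so the sketch stops at the single-sentence level.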

	National Electronics and Computer Technology Center (NECTEC)
	Copyright © 2001 by Information System Service Section. All rights reserved.