

An Analysis of Chinese Word Segmentation Algorithms

Introduction

When processing natural language text, one of the fundamental tasks is to segment the text into meaningful units, such as words or phrases. This process is known as text segmentation or tokenization. In Chinese, text segmentation, also called Chinese word segmentation, is particularly important because there are no spaces between Chinese words. Segmenting a Chinese sentence into its constituent words is therefore essential for many natural language processing tasks, such as machine translation, information retrieval, and text classification. This paper discusses various Chinese word segmentation algorithms.

Basic Principles of Chinese Word Segmentation

The main principle of Chinese word segmentation is to identify the boundaries between words in a sentence. Since there are no spaces between Chinese words, this process is more challenging than in languages that delimit words with spaces. The two main approaches to Chinese word segmentation are rule-based methods and statistical methods.

Rule-based methods rely on dictionaries composed of words and their corresponding part-of-speech (POS) tags to segment a sentence. The main idea is to match the character sequences in the sentence against the entries in the dictionary. The approach identifies the longest matching sequence while taking word frequency and context into account. When the segmenter does not find a full matching word, it can use rules to split the characters into smaller units.

Statistical methods, on the other hand, use machine learning algorithms to segment Chinese text. They rely on training a statistical model on a large corpus of text labeled with word boundaries. The model can then apply the patterns it has learned to segment new, unseen text. Statistical methods are generally more accurate than rule-based methods and require less hand-crafted linguistic knowledge, although they depend on an annotated training corpus.

Chinese Word Segmentation Algorithms

1. Maximum Match Algorithm (MMA)

Among the rule-based methods, the maximum match algorithm is one of the simplest and most widely used. It starts by fixing the maximum word length to be considered for a given sentence. The algorithm then looks up a candidate string of that length in the dictionary and, if the candidate is found, removes the matched word from the sentence. If no match is found, the candidate is shortened by one character and the lookup is retried. This process is repeated on the remaining text until the whole sentence has been segmented.

For instance, consider the sentence
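
The maximum match procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `forward_max_match`, the toy dictionary, and the fallback that emits a single character when no dictionary entry matches are all choices made for this example and do not come from the text.

```python
def forward_max_match(sentence, dictionary, max_len=None):
    """Greedy left-to-right maximum matching (illustrative sketch).

    `dictionary` is any container of known words; `max_len` defaults to
    the length of the longest dictionary entry (an assumption of this
    sketch).
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)

    words = []
    start = 0
    while start < len(sentence):
        # Try the longest candidate first, then shorten it one character
        # at a time until a dictionary entry is found.
        longest = min(max_len, len(sentence) - start)
        for size in range(longest, 0, -1):
            candidate = sentence[start:start + size]
            # Assumed fallback: an unmatched single character is emitted
            # as a word of its own so the scan always advances.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                start += size
                break
    return words


# Hypothetical toy dictionary, used only for this example.
toy_dictionary = {"中文", "分词", "算法"}
print(forward_max_match("中文分词算法", toy_dictionary))
# → ['中文', '分词', '算法']
```

Because the match is greedy, the result depends entirely on the dictionary: a longer entry always wins over a shorter one that starts at the same position, even when the shorter one would give the correct segmentation.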
