python - NLTK - when to normalize the text?


I've finished gathering my data and plan to use it as a corpus, but I'm a bit confused about whether I should normalize the text. I plan to tag and chunk the corpus in the future. Some of NLTK's corpora are all lower case and others aren't.

Can anyone shed some light on this subject, please?

By "normalize", do you mean making everything lowercase?

The decision about whether to lowercase depends on what you plan to do. For some purposes, lowercasing is better because it lowers the sparsity of the data (uppercase words are rarer and might confuse the system unless you have a corpus massive enough that the statistics on capitalized words are decent). In other tasks, case information might be valuable.
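To make the sparsity point concrete, here is a minimal sketch that counts distinct token types with and without case folding. The token list and the `vocab_counts` helper are made up for illustration; they are not part of NLTK.

```python
from collections import Counter

# Toy token list, invented purely for illustration.
tokens = ["The", "cat", "saw", "the", "dog", ".",
          "Dogs", "chase", "cats", ".", "THE", "END"]


def vocab_counts(tokens, lowercase=False):
    """Count token types, optionally folding case first (hypothetical helper)."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return Counter(tokens)


cased = vocab_counts(tokens)
folded = vocab_counts(tokens, lowercase=True)

# Fewer distinct types, and more observations per type, after folding.
print("types with case:   ", len(cased))
print("types without case:", len(folded))
print("count of 'the':    ", cased["the"], "->", folded["the"])
```

With case kept, "The", "the" and "THE" are three separate types with one observation each; after folding they pool into a single type. The cost is that a case-sensitive task can no longer tell them apart.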

Additionally, there are other, similar considerations you'll have to make. For example, should "can't" be treated as ["can't"], ["can", "'t"], or ["ca", "n't"]? (I've seen all three in different corpora.) What about "7-year-old"? Is it one long word, or three words that should be separated?
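The three conventions for "can't", and the two for "7-year-old", can be sketched with plain regular expressions. The function names below are hypothetical and only handle the simple cases discussed here; a real tokenizer covers far more.

```python
import re


def keep_whole(word):
    """Leave the word as a single token: "can't" -> ["can't"]."""
    return [word]


def split_at_apostrophe(word):
    """Split before the apostrophe: "can't" -> ["can", "'t"]."""
    m = re.fullmatch(r"(\w+)('\w+)", word)
    return [m.group(1), m.group(2)] if m else [word]


def split_negation(word):
    """Split off the n't clitic: "can't" -> ["ca", "n't"]."""
    m = re.fullmatch(r"(\w+)(n't)", word, flags=re.IGNORECASE)
    return [m.group(1), m.group(2)] if m else [word]


def split_hyphens(word):
    """Treat a hyphenated compound as separate words."""
    return word.split("-")


for tokenize in (keep_whole, split_at_apostrophe, split_negation):
    print(tokenize.__name__, tokenize("can't"))

print(keep_whole("7-year-old"), split_hyphens("7-year-old"))
```

Whichever convention you pick, it should match the one used by the tagger or chunker you apply later, since a model trained on ["ca", "n't"] will not have seen "can't" as a single token.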

That said, there's no reason to reformat the corpus itself. You can just have your code make these changes on the fly. That way the original information is still around later if you ever need it.
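One way to do this on the fly is to wrap the corpus in a generator that yields normalized copies and leaves the stored data untouched. This is only a sketch: `normalized` and `raw_corpus` are hypothetical names, and the corpus is assumed to be a list of already tokenized sentences.

```python
def normalized(sentences, lowercase=True):
    """Yield normalized copies of tokenized sentences.

    The input sentences are never modified, so the original casing
    stays available for tasks that need it.
    """
    for sent in sentences:
        yield [tok.lower() if lowercase else tok for tok in sent]


# Assumed shape: a list of sentences, each a list of token strings.
raw_corpus = [["The", "Cat", "sat", "."], ["NLTK", "is", "handy", "."]]

lowered = list(normalized(raw_corpus))
print(lowered)
print(raw_corpus)  # still in its original casing
```

Because normalization is just a parameter of the reader, you can run one experiment with `lowercase=True` and another with `lowercase=False` against the same files.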

