python - NLTK - when to normalize the text?

- April 15, 2013

I've finished gathering my data and plan to use it as a corpus, but I'm a bit confused about whether I should normalize the text. I plan to tag and chunk the corpus in the future. Some of NLTK's corpora are lower case and others aren't.

Can anyone shed some light on this subject, please?

By "normalize", do you mean making everything lowercase?

The decision whether to lowercase depends on what you plan to do. For some purposes, lowercasing is better because it lowers the sparsity of the data (uppercase words are rarer and might confuse the system unless you have a corpus so massive that the statistics on capitalized words are decent). In other tasks, case information might be valuable.
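
To make the sparsity point concrete, here is a minimal sketch that counts word types with and without lowercasing. The toy sentence and the variable names are made up for illustration; a real corpus would show the same effect on a larger scale.

```python
from collections import Counter

# Toy token list (an assumption for illustration, not real corpus data).
tokens = "The cat saw the dog . The dog saw The Cat .".split()

# Case-sensitive counts: "The" and "the" are separate types.
cased = Counter(tokens)

# Case-insensitive counts: both forms collapse into "the".
lowered = Counter(t.lower() for t in tokens)

print(len(cased), len(lowered))          # number of distinct types
print(cased["the"], lowered["the"])      # evidence available for "the"
```

With lowercasing there are fewer distinct types and more occurrences behind each one, which is what "lower sparsity" means here. The cost is that "Cat" as a name and "cat" as a noun can no longer be told apart.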

Additionally, there are other, similar considerations you'll have to make. For example, should "can't" be treated as ["can't"], ["can", "'t"], or ["ca", "n't"]? (I've seen all three in different corpora.) What about 7-year-old? Is it one long word, or three words that should be separated?
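
The three treatments of "can't" and the hyphen question can be sketched with plain regular expressions. The function names below are hypothetical helpers written for this example, not NLTK functions; the ["ca", "n't"] split is the convention used by Penn Treebank-style tokenizers.

```python
import re

def keep_whole(word):
    # Convention 1: leave the word as a single token.
    return [word]

def split_at_apostrophe(word):
    # Convention 2: break right before the apostrophe, "can't" -> "can" + "'t".
    m = re.match(r"^(.+?)('\w+)$", word)
    return [m.group(1), m.group(2)] if m else [word]

def split_nt_clitic(word):
    # Convention 3: treat "n't" as its own token, "can't" -> "ca" + "n't".
    m = re.match(r"^(.+)(n't)$", word, flags=re.IGNORECASE)
    return [m.group(1), m.group(2)] if m else [word]

def split_hyphens(word):
    # Treat a hyphenated compound as separate words.
    return word.split("-")

print(keep_whole("can't"))
print(split_at_apostrophe("can't"))
print(split_nt_clitic("can't"))
print(split_hyphens("7-year-old"))
```

Whichever convention you pick, it should match the one used by the tagger or chunker you train or apply later, since a model trained on "ca" + "n't" has never seen "can't" as one token.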

That said, there's no reason to reformat the corpus itself. You can have your code make these changes on the fly. That way the original information is still around later if you ever need it.
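
One way to normalize on the fly is a small generator that wraps whatever token stream you already have. This is a sketch under assumptions: `normalized` and `corpus_tokens` are hypothetical names, and the hard-coded list stands in for tokens read from your own files or from an NLTK corpus reader's `words()`.

```python
def normalized(tokens, lowercase=True):
    # Yield tokens one at a time, lowercasing only in the output stream.
    # The underlying token source is never modified.
    for tok in tokens:
        yield tok.lower() if lowercase else tok

# Stand-in for tokens read from the stored, unmodified corpus.
corpus_tokens = ["NLTK", "ships", "Several", "Corpora", "."]

view = list(normalized(corpus_tokens))
print(view)            # lowercased view for a case-insensitive task
print(corpus_tokens)   # original casing is still available
```

Because the normalization lives in code rather than in the stored files, a later task that needs capitalization can pass `lowercase=False` and read the same corpus.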
