

I created the following script to clean text that I scraped. The clean text should ideally consist of lowercase words only, without digits, and with at most a comma or a dot at the end of a sentence. Words should be separated by single spaces, and all "\n" characters should be removed. In particular, I'm interested in feedback on the following code:

    import re
    import string

    # assumed definition: exclude is used below but its definition did not
    # survive in the scraped copy of this post
    exclude = set(string.punctuation)

    def cleaning(text):
        # remove new lines and digits with regular expressions
        text = re.sub(r'\n', '', text)
        text = re.sub(r'\d', '', text)
        # remove patterns matching the url format (the tail of this pattern
        # was garbled when the post was scraped and is reconstructed here)
        url_pattern = (r'((http|ftp|https)://)?[\w\-_]+(\.[\w\-_]+)+'
                       r'([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?')
        text = re.sub(url_pattern, ' ', text)
        # remove non-ascii characters
        text = ''.join(character for character in text if ord(character) < 128)
        # remove punctuation
        text = ''.join(character for character in text if character not in exclude)
        # collapse duplicated whitespace, trim the ends and lower-case
        text = re.sub(r'\s+', ' ', text)
        return text.strip().lower()

The script is applied via:

    import numpy as np

    cleaner = lambda x: cleaning(x)
    df = df.apply(cleaner)          # df holds the scraped text
    df = df.replace('', np.nan)

So far, the script does the job, which is great. However, how could the script above be improved, or be written more cleanly? What is also unclear to me is the difference between text = re.sub(r'\n', '', text) and text = re.sub(r'\s+', ' ', text).
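As a quick aside, the difference the question asks about can be seen directly; the sample string below is invented for illustration. re.sub(r'\n', '', text) only deletes the newline characters, gluing the surrounding words together, while re.sub(r'\s+', ' ', text) collapses every run of whitespace, newlines included, into a single space:

    import re

    sample = 'first line\nsecond  line'
    print(re.sub(r'\n', '', sample))    # -> 'first linesecond  line'
    print(re.sub(r'\s+', ' ', sample))  # -> 'first line second line'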
Actually, your approach consists of removing, or replacing with a space, everything that isn't a word (urls and characters that are not ascii letters). Then you finish the job by removing duplicated spaces and the spaces at the beginning or the end of the string, and by converting everything to lower case.

But concretely, what is the result of this script? It returns all the words in lower case, separated by single spaces. Described like that, you can easily see that you could simply extract the words and join them with a space. To do that, a simple re.findall(r'[a-z]+', text) on the lower-cased text suffices, but you have to remove the urls first if you don't want to catch letter sequences contained in them; a sketch of this idea follows.
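A minimal sketch of that extract-and-join approach, assuming urls have already been stripped (url removal is discussed next); the helper name clean_words is mine:

    import re

    def clean_words(text):
        # lower-case first, then keep only runs of ascii letters:
        # findall + join replaces all the removal steps in one pass
        return ' '.join(re.findall(r'[a-z]+', text.lower()))

    print(clean_words('Some text with 42 numbers\nand a few\ttabs.'))
    # -> 'some text with numbers and a few tabs'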

If you read your url pattern, you can see that the only part that isn't optional is in fact [\w-]+(?:\.[\w-]+)+ (written [\w\-_]+(\.[\w\-_]+)+ in your script: _ is already inside \w, you can put - at the end of a character class without having to escape it, and the capture group is useless). All that comes after this part of the pattern doesn't require a precise description and can be replaced with \S* (zero or more non-whitespace characters). Even if it catches a closing parenthesis or a comma, that isn't important for what you want to do (we will see how to handle commas and dots later).

One of the weaknesses of the url pattern is that it starts with an alternation in an optional group: ((http|ftp|https)://)? means that at each failing position of the string, the regex engine has to test the three alternatives and the whole group for nothing.
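A sketch of the pattern rewritten along those lines; folding http and https into https? to shave one alternative is my own tweak, not necessarily what the original answer did:

    import re

    # mandatory core: a dotted host; optional scheme; rough \S* tail
    url_pattern = r'(?:(?:https?|ftp)://)?[\w-]+(?:\.[\w-]+)+\S*'

    text = 'see https://example.com/some/page (and ftp://files.example.org) now'
    print(re.sub(url_pattern, ' ', text))
    # the closing parenthesis after the second url is swallowed by \S*,
    # which is harmless for this use case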

Few changes are needed to preserve sentence punctuation: you only have to be sure that the \S* in the url pattern doesn't eat a comma or a dot at the end of the url, which a negative lookbehind (something like (?<![.,]) placed after the \S*) takes care of: the quantifier backtracks until the match no longer ends with one of those characters. Note that the URL syntax can be particularly complex, and that it isn't always possible to extract an URL from a non-formatted string.
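Putting the pieces together, a possible final version could look like the sketch below. This is my assembly of the advice above, not code from the original answer; in particular the [^a-z\s.,] class and the function name are assumptions:

    import re

    # url: optional scheme, mandatory dotted host, rough tail that must
    # not end with a comma or a dot (negative lookbehind)
    URL = re.compile(r'(?:(?:https?|ftp)://)?[\w-]+(?:\.[\w-]+)+\S*(?<![.,])')

    def cleaning(text):
        # drop urls but leave a trailing comma or dot in the text
        text = URL.sub(' ', text.lower())
        # keep only ascii letters, whitespace, commas and dots
        text = re.sub(r'[^a-z\s.,]', ' ', text)
        # collapse whitespace runs and trim the ends
        return re.sub(r'\s+', ' ', text).strip()

    print(cleaning('Visit https://example.com/page, then\nsend 2 mails.'))
    # -> 'visit , then send mails.'  (the comma survives, though detached)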
