Data Cleaning and Preprocessing in Data Science and Machine Learning

The importance of data cleaning can be understood from the following image.
In industry, around 60% of the time goes into data cleaning. It is almost impossible to work with unclean or raw data, and for a model to process its input and perform well on test data, data cleaning is a crucial task.
Rather than just talking about it, we are going to look at some of the operations performed on raw data to clean it before sending it to the machine for processing.

The major operations we are going to see are:
- Find/get emails from a string or document
- Delete all the tags like < >
- Remove newlines (‘\n’), tabs (‘\t’), “-”, and “\”
- Remove the word that comes before a given character in a string
- Expand contractions (decontraction)
- Chunk the text
- Delete all the digits
- Replace unwanted characters with a space
- Convert all text to lowercase
Let's import a bunch of libraries first:
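The sketches in this post rely only on Python's standard library, so a minimal import block, assuming no third-party packages are needed, could look like this:

```python
# Regular expressions from the standard library cover most of the cleaning steps.
import re
```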
1. Find/get emails from a string or document
code:
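A minimal sketch of this step, assuming a deliberately simple pattern is good enough for the data at hand. The name `find_emails` and the regular expression are illustrative choices, and the pattern does not cover every address format allowed by the email standards.

```python
import re

# Simplified pattern: local part, "@", domain, and a top-level domain of 2+ letters.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(text):
    """Return all email-like substrings in the order they appear."""
    return EMAIL_PATTERN.findall(text)

sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
print(find_emails(sample))
# ['alice@example.com', 'bob.smith@mail.example.org']
```

To read from a document instead of a string, the file's text can be passed to the same function.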
2. Delete all the tags like < > and the data in between the tags
code:
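One way to sketch this, assuming "data in between" means everything from an opening `<` to the next `>`. The helper name `remove_tags` is hypothetical. For real HTML, a proper parser is usually a safer choice than a regular expression.

```python
import re

def remove_tags(text):
    """Remove every '<...>' span, including whatever sits between the brackets."""
    return re.sub(r"<[^>]*>", "", text)

sample = "Hello <b>world</b>, this is <any tag> text."
print(remove_tags(sample))
```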
3. Remove newlines (‘\n’), tabs (‘\t’), “-”, and “\”
code:
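A minimal sketch, assuming these characters should become a single space rather than vanish, so that neighbouring words do not get glued together. `remove_separators` is an illustrative name.

```python
import re

def remove_separators(text):
    """Replace newlines, tabs, hyphens and backslashes with a space,
    then collapse repeated spaces and trim the ends."""
    text = re.sub(r"[\n\t\\-]", " ", text)
    return re.sub(r" +", " ", text).strip()

sample = "first line\nsecond\tline - with\\backslash"
print(remove_separators(sample))
# first line second line with backslash
```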
4. Remove the word that comes before a given character in a string
code:
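The post does not say which character is meant, so the sketch below assumes a colon by default, as in header-like words such as "Subject:". The function `remove_word_before` and its `char` parameter are hypothetical.

```python
import re

def remove_word_before(text, char=":"):
    """Drop every run of non-space characters that ends with `char`
    (the character itself included), then tidy up the whitespace."""
    pattern = r"\S*" + re.escape(char)
    cleaned = re.sub(pattern, "", text)
    return re.sub(r"\s+", " ", cleaned).strip()

sample = "Subject: meeting notes From: office"
print(remove_word_before(sample))
# meeting notes office
```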
5. Expand contractions (decontraction)
code:
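A small sketch of decontraction using a hand-written replacement table. The table is an assumption and far from complete: it handles lowercase forms only, treats "'d" as "would" even though it can also mean "had", and skips "'s" because it cannot be told apart from a possessive without more context.

```python
# Specific cases come first so the general "n't" rule does not mangle them.
CONTRACTIONS = [
    ("won't", "will not"),
    ("can't", "can not"),
    ("n't", " not"),
    ("'re", " are"),
    ("'ll", " will"),
    ("'ve", " have"),
    ("'m", " am"),
    ("'d", " would"),  # ambiguous: could also be "had"
]

def decontract(text):
    """Expand common English contractions using the table above."""
    text = text.replace("\u2019", "'")  # normalise curly apostrophes first
    for contraction, expansion in CONTRACTIONS:
        text = text.replace(contraction, expansion)
    return text

sample = "I can't go, they won't mind and we're late."
print(decontract(sample))
# I can not go, they will not mind and we are late.
```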
6. Chunk the text
code:
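Chunking groups tagged words into phrases. In practice a library such as NLTK or spaCy would supply the part-of-speech tags and the chunker. To stay dependency-free, the sketch below assumes the tokens are already tagged by hand with Penn Treebank style tags, and extracts simple noun phrases of the form "optional determiner, adjectives, nouns". `chunk_noun_phrases` is an illustrative name.

```python
def chunk_noun_phrases(tagged_tokens):
    """Group (word, tag) pairs into noun-phrase chunks: DT? JJ* NN+.

    A chunk is kept only if it contains at least one noun (tag starts with 'NN').
    """
    chunks, current = [], []

    def flush():
        # Emit the pending chunk if it holds a noun, then reset it.
        if any(tag.startswith("NN") for _, tag in current):
            chunks.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged_tokens:
        is_noun = tag.startswith("NN")
        is_modifier = tag == "DT" or tag.startswith("JJ")
        if not (is_noun or is_modifier):
            flush()  # any other tag ends the current chunk
            continue
        has_noun = any(t.startswith("NN") for _, t in current)
        # A modifier after a noun, or a determiner mid-chunk, starts a new chunk.
        if is_modifier and (has_noun or (tag == "DT" and len(current) > 0)):
            flush()
        current.append((word, tag))
    flush()
    return chunks

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"),
          ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(chunk_noun_phrases(tagged))
# ['The quick brown fox', 'the lazy dog']
```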
7. Delete all the digits
code:
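A minimal sketch using the `\d` character class. The name `remove_digits` is illustrative. Note that the spaces around a removed number stay in place, so a whitespace clean-up may be wanted afterwards.

```python
import re

def remove_digits(text):
    """Delete every digit character from the text."""
    return re.sub(r"\d", "", text)

sample = "abc123def 2024"
print(repr(remove_digits(sample)))
# 'abcdef '
```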
8. Replace unwanted characters with a space
code:
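What counts as "unwanted" depends on the dataset. The sketch below assumes it means anything that is not an ASCII letter, a digit, or whitespace. Both that definition and the name `replace_unwanted` are assumptions to adjust as needed.

```python
import re

def replace_unwanted(text):
    """Replace each character that is not a letter, digit or whitespace with a space."""
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)

sample = "hello, world!"
print(repr(replace_unwanted(sample)))
# 'hello  world '
```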
9. Convert all text to lowercase
code:
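This step needs only the built-in `str.lower()` method. The wrapper `to_lowercase` is an illustrative name.

```python
def to_lowercase(text):
    """Return a lowercase copy of the text."""
    return text.lower()

sample = "Data Cleaning IS Crucial"
print(to_lowercase(sample))
# data cleaning is crucial
```

For caseless comparison across languages, `str.casefold()` is a more aggressive alternative.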
Blog By:
Akshay Bhor: Deep Learning Engineer
and Data Scientist
