Data Cleaning and Preprocessing in Data Science and Machine Learning

The importance of data cleaning can be understood from the following image.

In industry, around 60% of the time goes into data cleaning. It is almost impossible to work with unclean or raw data. For a machine to process the data and perform well on test data, data cleaning is a crucial task.

Rather than just talking, we are going to look at some of the operations performed on raw data to clean it before sending it to the machine for processing.

The major operations we are going to see are:

  1. Find/get emails from strings/documents
  2. Delete all the tags like < > and the data in between the tags
  3. Remove newlines (‘\n’), tabs (‘\t’), “-”, and “\”
  4. Remove the word before a given character in a string
  5. Decontraction of words
  6. Chunking the text
  7. Delete all the digits
  8. Replace unwanted characters with a space
  9. Convert all text to lower case

Let's import a bunch of libraries first:

1. Find/get emails from strings/documents

code:
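
As a minimal sketch of this step, assuming a simplified regular expression is good enough for the data at hand, emails can be pulled out of a string with the standard `re` module. The names `EMAIL_PATTERN` and `find_emails` are hypothetical, and the pattern does not cover every valid address format.

```python
import re

# Simplified pattern (an assumption): a local part, "@", then a domain
# containing at least one dot. It is not a full address validator.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text):
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
print(find_emails(sample))
# ['alice@example.com', 'bob.smith@mail.example.org']
```

For whole documents, the same function can be applied to the file contents after reading them into a string.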

2. Delete all the tags like < > and the data in between the tags

code:
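
One possible reading of this step is that every `<...>` span is deleted together with whatever sits between the angle brackets. Under that assumption, a short sketch with `re.sub` looks like this (the function name `remove_tags` is hypothetical):

```python
import re


def remove_tags(text):
    """Delete every '<...>' span, including the text between the brackets."""
    # [^<>]* keeps each match inside a single pair of brackets.
    return re.sub(r"<[^<>]*>", "", text)


print(remove_tags("Hello <b>world</b>!"))
# Hello world!
```

For real HTML with scripts, comments, or nested markup, a proper parser is usually a safer choice than a regular expression.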

3. Remove newlines (‘\n’), tabs (‘\t’), “-”, and “\”

code:
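
A rough sketch of this step follows. It assumes the characters should be replaced by a space rather than deleted outright, so that neighbouring words do not get glued together, and it collapses the extra spaces afterwards. The name `remove_separators` is hypothetical.

```python
import re


def remove_separators(text):
    """Replace newlines, tabs, hyphens and backslashes with single spaces."""
    # Inside the character class: \n newline, \t tab, \\ backslash, - hyphen.
    text = re.sub(r"[\n\t\\-]", " ", text)
    # Collapse the runs of spaces left behind and trim the ends.
    return re.sub(r" +", " ", text).strip()


print(remove_separators("line one\nline\ttwo - three\\four"))
# line one line two three four
```

If hyphenated words such as "well-known" should stay intact, the hyphen can simply be dropped from the character class.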

4. Remove the word before a given character in a string

code:
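
This step can be read in more than one way. The sketch below assumes it means removing any word that is immediately followed by a marker character (for example header-like words such as "Subject:"), along with the marker itself. The function name `remove_word_before` and its `char` parameter are hypothetical.

```python
import re


def remove_word_before(text, char=":"):
    """Remove each run of non-space characters that ends with `char`."""
    # \S* grabs the word, re.escape makes the marker safe to use in a pattern.
    pattern = r"\S*" + re.escape(char)
    cleaned = re.sub(pattern, "", text)
    # Tidy up the spaces left where the words were removed.
    return re.sub(r" +", " ", cleaned).strip()


print(remove_word_before("Subject: meeting notes From: office"))
# meeting notes office
```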

5. Decontraction of words

code:
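
Decontraction expands short forms such as "can't" into "cannot". A minimal sketch, assuming lower-case contractions written with a straight apostrophe, handles the irregular cases first and then the general suffixes. The name `decontract` is hypothetical, and "'s" is left alone on purpose because it can mean "is", "has", or a possessive.

```python
import re


def decontract(text):
    """Expand common English contractions (lower-case, straight apostrophe)."""
    # Irregular forms must be handled before the general "n't" rule.
    text = re.sub(r"\bwon't\b", "will not", text)
    text = re.sub(r"\bcan't\b", "cannot", text)
    # General suffix rules.
    text = re.sub(r"n't\b", " not", text)
    text = re.sub(r"'re\b", " are", text)
    text = re.sub(r"'ll\b", " will", text)
    text = re.sub(r"'ve\b", " have", text)
    text = re.sub(r"'m\b", " am", text)
    return text


print(decontract("I can't go, they won't mind and we're late"))
# I cannot go, they will not mind and we are late
```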

6. Chunking the text

code:
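
Chunking groups neighbouring words into a single unit, for example treating "New York" as one token. NLP libraries such as NLTK offer trained chunkers for this (for example `nltk.ne_chunk`), which need extra model downloads. The sketch below is not that; it is a rough standard-library stand-in that assumes a run of two or more capitalized words is one chunk and joins it with underscores. The name `chunk_capitalized` is hypothetical.

```python
import re

# A run of two or more capitalized words separated by single spaces.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def chunk_capitalized(text):
    """Join each run of capitalized words into one underscore-joined token."""
    return CAPITALIZED_RUN.sub(lambda m: m.group(0).replace(" ", "_"), text)


print(chunk_capitalized("she moved to San Francisco last year"))
# she moved to San_Francisco last year
```

This heuristic also merges a capitalized sentence-opening word with a name that follows it, so it is only a starting point.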

7. Delete all the digits

code:
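
Deleting digits is a one-line substitution. This sketch assumes digits are removed without leaving a replacement character behind; `remove_digits` is a hypothetical name.

```python
import re


def remove_digits(text):
    """Delete every digit character from `text`."""
    return re.sub(r"\d", "", text)


print(remove_digits("floor 3b"))
# floor b
```

Note that removing a standalone number leaves its surrounding spaces in place, so a whitespace clean-up pass may be wanted afterwards.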

8. Replace unwanted characters with a space

code:
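
Which characters count as unwanted depends on the dataset. The sketch below assumes that anything other than ASCII letters, digits, and whitespace is unwanted, replaces each such character with a space, and then collapses repeated whitespace. The name `replace_unwanted` is hypothetical.

```python
import re


def replace_unwanted(text):
    """Replace every character that is not a letter, digit or space with a space."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse the whitespace runs created by the replacement.
    return re.sub(r"\s+", " ", text).strip()


print(replace_unwanted("hello, world! #data_science"))
# hello world data science
```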

9. Convert all text to lower case

code:
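
Lower-casing needs no extra library, since Python strings have a built-in `lower()` method. A minimal sketch, with `to_lower` as a hypothetical wrapper name:

```python
def to_lower(text):
    """Return `text` with every cased character converted to lower case."""
    return text.lower()


print(to_lower("Data Cleaning IS Crucial"))
# data cleaning is crucial
```

This is often done last, because earlier steps (such as spotting capitalized names) may rely on the original casing.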

Blog By:

Akshay Bhor: Deep Learning Engineer and Data Scientist