🧑‍💻 Getting Started with Data Cleaning Project ✈️

We are going to use fake news data.

In this project, we will use the LIAR dataset, which is based on fake news data. We are going to apply different techniques to clean the data through data analysis, data preparation, and data visualization. The main goal is to prepare the data according to the label column, which contains the type of each news item.


Our Main Approach:
  • Importing Python Libraries
  • Data Loading
  • Exploratory data analysis
  • Working on categorical data columns
  • Working on numeric data columns

Here is what we are going to do in this project:
  • Deep data analysis in different ways
  • How the fake news data is distributed
  • Fake news label analysis with plots and word clouds
  • Analysis of the fake news statements
  • Cleaning the fake news statements so that they properly support the labels above
  • Analysis of subject(s) with graphs, and preparation of subject groups that better distinguish the fake news labels
  • Analysis of speakers with graphs
  • Analysis of the speaker's job title with graphs, and preparation of job-title groups that better distinguish the fake news labels
  • Analysis of state info with graphs
  • Analysis of party affiliation with graphs, and preparation of party-affiliation groups that better distinguish the fake news labels
  • Analysis of venue with graphs, and preparation of venue groups that better distinguish the fake news labels
  • Analysis of numeric data type features with plots
  • Conclusion of this project

Source of Data:
https://www.cs.ucsb.edu/~william/data/liar_dataset

Importing Python Libraries 📕 📗 📘 📙

In [1]:
import warnings
warnings.filterwarnings("ignore")

import re
import string
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib import rcParams
import seaborn as sns

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer, sent_tokenize, word_tokenize

from prettytable import PrettyTable
from sklearn import preprocessing
from wordcloud import WordCloud
%matplotlib inline

Loading the data 📁 📂

In [2]:
df = pd.read_csv(r"Liar_Dataset.csv") 

Exploratory data analysis 🔎

    In this section we are going to do the following things:

  • We will look at the first five records of the data.

  • We will look at the last five records of the data.

  • We will delete the [ID].json column because it is only a file name.

  • We will look at the column names, the length of the data, its shape, its information, and the data types.

  • We will check for missing/null values, because it is essential to find them and handle them by deleting or filling them.

  • We will check the row and column counts, and delete duplicate rows if any exist.

  • We will separate the numeric and categorical data columns.

Top 5 records of data

In [3]:
df.head(5)
Out[3]:
  | [ID].json | label | statement | subject(s) | speaker | speaker's job title | state info | party affiliation | barely true counts | false counts | half true counts | mostly true counts | pants on fire counts | venue
0 | 11972.json | TRUE | Building a wall on the U.S.-Mexico border will... | immigration | rick-perry | Governor | Texas | republican | 30 | 30 | 42 | 23 | 18 | Radio interview
1 | 11685.json | FALSE | Wisconsin is on pace to double the number of l... | jobs | katrina-shankland | State representative | Wisconsin | democrat | 2 | 1 | 0 | 0 | 0 | a news conference
2 | 11096.json | FALSE | Says John McCain has done nothing to help the ... | military,veterans,voting-record | donald-trump | President-Elect | New York | republican | 63 | 114 | 51 | 37 | 61 | comments on ABC's This Week.
3 | 5209.json | half-true | Suzanne Bonamici supports a plan that will cut... | medicare,message-machine-2012,campaign-adverti... | rob-cornilles | consultant | Oregon | republican | 1 | 1 | 3 | 1 | 1 | a radio show
4 | 9524.json | pants-fire | When asked by a reporter whether hes at the ce... | campaign-finance,legal-issues,campaign-adverti... | state-democratic-party-wisconsin | NaN | Wisconsin | democrat | 5 | 7 | 2 | 2 | 7 | a web video

Last 5 records of data

In [4]:
df.tail(5)
Out[4]:
      | [ID].json | label | statement | subject(s) | speaker | speaker's job title | state info | party affiliation | barely true counts | false counts | half true counts | mostly true counts | pants on fire counts | venue
12782 | 3419.json | half-true | For the first time in more than a decade, impo... | energy,oil-spill,trade | barack-obama | President | Illinois | democrat | 70 | 71 | 160 | 163 | 9 | a press conference
12783 | 12548.json | mostly-true | Says Donald Trump has bankrupted his companies... | candidates-biography | hillary-clinton | Presidential candidate | New York | democrat | 40 | 29 | 69 | 76 | 7 | a speech on the economy
12784 | 401.json | TRUE | John McCain and George Bush have "absolutely n... | health-care | campaign-defend-america | NaN | Washington, D.C. | none | 0 | 1 | 0 | 2 | 0 | a television ad
12785 | 1055.json | FALSE | A new poll shows 62 percent support the presid... | health-care | americans-united-change | NaN | NaN | none | 1 | 4 | 4 | 1 | 0 | an Internet ad.
12786 | 9117.json | barely-true | No one claims the report vindicating New Jerse... | candidates-biography,infrastructure | rudy-giuliani | Attorney | New York | republican | 9 | 11 | 10 | 7 | 3 | comments on NBC's "Meet the Press"

Deleting the [ID].json column because it is just a file name, so not useful

In [5]:
df.drop(['[ID].json'], axis=1, inplace=True)

Columns/features in data

In [6]:
df.columns
Out[6]:
Index(['label', 'statement', 'subject(s)', 'speaker', 'speaker's job title',
       'state info', 'party affiliation', 'barely true counts', 'false counts',
       'half true counts', 'mostly true counts', 'pants on fire counts',
       'venue'],
      dtype='object')

Length of data

In [7]:
print('length of data is', len(df))
length of data is 12787

Shape of data

In [8]:
df.shape
Out[8]:
(12787, 13)

Data information

In [9]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12787 entries, 0 to 12786
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   label                 12787 non-null  object
 1   statement             12787 non-null  object
 2   subject(s)            12787 non-null  object
 3   speaker               12787 non-null  object
 4   speaker's job title   9222 non-null   object
 5   state info            10040 non-null  object
 6   party affiliation     12787 non-null  object
 7   barely true counts    12787 non-null  int64 
 8   false counts          12787 non-null  int64 
 9   half true counts      12787 non-null  int64 
 10  mostly true counts    12787 non-null  int64 
 11  pants on fire counts  12787 non-null  int64 
 12  venue                 12658 non-null  object
dtypes: int64(5), object(8)
memory usage: 1.3+ MB

Data types of all columns

In [10]:
df.dtypes
Out[10]:
label                   object
statement               object
subject(s)              object
speaker                 object
speaker's job title     object
state info              object
party affiliation       object
barely true counts       int64
false counts             int64
half true counts         int64
mostly true counts       int64
pants on fire counts     int64
venue                   object
dtype: object

Counting rows with at least one null value

In [11]:
np.sum(df.isnull().any(axis=1))
Out[11]:
4351

Rows and columns in the dataset

In [12]:
print('Count of columns in the data is:  ', len(df.columns))
Count of columns in the data is:   13
In [13]:
print('Count of rows in the data is:  ', len(df))
Count of rows in the data is:   12787

Checking duplicate data

In [14]:
current=len(df)
print('Rows of data before deleting duplicates: ', current)
Rows of data before deleting duplicates:  12787
In [15]:
df=df.drop_duplicates()
In [16]:
now=len(df)
print('Rows of data after deleting duplicates: ', now)
Rows of data after deleting duplicates:  12786
In [17]:
diff=current-now
print('Duplicated rows are ', diff)
Duplicated rows are  1

Checking Null values

In [18]:
df.isnull().sum()
Out[18]:
label                      0
statement                  0
subject(s)                 0
speaker                    0
speaker's job title     3565
state info              2747
party affiliation          0
barely true counts         0
false counts               0
half true counts           0
mostly true counts         0
pants on fire counts       0
venue                    129
dtype: int64

Replacing empty strings with NaN

In [19]:
df.replace('', np.nan, inplace=True)

Replacing missing values with 'Unknown'

In [20]:
df['venue']= df['venue'].replace(np.nan, 'Unknown')
df["speaker's job title"]= df["speaker's job title"].replace(np.nan, 'Unknown')
df["state info"]= df["state info"].replace(np.nan, 'Unknown')
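The three `replace` calls above can be collapsed into a single `fillna` over the affected columns, since `fillna` only touches NaN cells. A minimal self-contained sketch on a toy frame standing in for the Liar data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Liar data: fillna fills only the NaN cells,
# leaving existing values untouched.
toy = pd.DataFrame({
    "venue": ["a radio show", np.nan],
    "speaker's job title": [np.nan, "Governor"],
    "state info": ["Texas", np.nan],
})
fill_cols = ["venue", "speaker's job title", "state info"]
toy[fill_cols] = toy[fill_cols].fillna("Unknown")
```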

Checking Null values again

In [21]:
df.isnull().sum()
Out[21]:
label                   0
statement               0
subject(s)              0
speaker                 0
speaker's job title     0
state info              0
party affiliation       0
barely true counts      0
false counts            0
half true counts        0
mostly true counts      0
pants on fire counts    0
venue                   0
dtype: int64

Preparing the list of the numeric columns (the news-type counts)

In [22]:
num_cols = ['barely true counts', 'false counts', 'half true counts', 'mostly true counts', 'pants on fire counts']
num_cols
Out[22]:
['barely true counts',
 'false counts',
 'half true counts',
 'mostly true counts',
 'pants on fire counts']

Preparing list of the categorical columns

In [23]:
cate_cols = df.columns.drop('label').drop(num_cols)
cate_cols
Out[23]:
Index(['statement', 'subject(s)', 'speaker', 'speaker's job title',
       'state info', 'party affiliation', 'venue'],
      dtype='object')
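An alternative to keeping a hand-written list is to let pandas derive the split from the dtypes with `select_dtypes`. A small sketch on a toy frame standing in for the Liar data:

```python
import pandas as pd

# Toy stand-in for the Liar data: select_dtypes infers the
# numeric/categorical split from the column dtypes.
toy = pd.DataFrame({
    "label": ["TRUE"],
    "statement": ["some claim"],
    "false counts": [2],
    "half true counts": [3],
})
numeric = toy.select_dtypes(include="number").columns.tolist()
categorical = toy.select_dtypes(exclude="number").columns.drop("label").tolist()
```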

Checking the number of unique values of each categorical column

In [24]:
df[cate_cols].apply(lambda x: x.nunique(), axis=0)
Out[24]:
statement              12761
subject(s)              4534
speaker                 3308
speaker's job title     1355
state info                85
party affiliation         24
venue                   5142
dtype: int64
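Columns such as venue (5142 unique values) and subject(s) (4534) are far too fine-grained to distinguish the labels directly, which is why the plan above calls for grouping them. One hedged way to collapse rare categories into an 'other' bucket, shown on toy data:

```python
import pandas as pd

# Toy stand-in: categories appearing fewer than 2 times collapse to 'other'.
venue = pd.Series(["a radio show", "a radio show", "a web video",
                   "a television ad", "a radio show"])
counts = venue.value_counts()
grouped = venue.where(venue.map(counts) >= 2, "other")
```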

Label 💡

  • We will plot the labels to check their counts

  • We will look at the labels and their percentage share

  • We will also look at the most common words in each label's news statements

Distribution of fake news labels

In [25]:
sns.countplot(data= df, x = "label")
plt.show()
In [26]:
df["label"].value_counts().head(7).plot(kind = 'pie', autopct='%1.1f%%', figsize=(8, 8)).legend(bbox_to_anchor=(1, 1))
Out[26]:
<matplotlib.legend.Legend at 0x22dd9fc9940>
In [27]:
df["label"].value_counts()
Out[27]:
half-true      2626
FALSE          2504
mostly-true    2454
barely-true    2102
TRUE           2053
pants-fire     1047
Name: label, dtype: int64
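The percentage shares shown on the pie chart can also be read off directly with `value_counts(normalize=True)`, which returns relative frequencies instead of raw counts. A toy sketch:

```python
import pandas as pd

# Toy labels standing in for df["label"]: normalize=True yields
# relative frequencies, here scaled to percentages.
labels = pd.Series(["half-true", "FALSE", "half-true", "TRUE"])
shares = labels.value_counts(normalize=True).mul(100).round(1)
```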

Looking at the words which are in the barely-true news

In [28]:
data1 = df[df['label'] == 'barely-true']
# Join the statements into one string; passing str(pd.Series(...)) would
# hand the word cloud a truncated repr that includes index numbers.
text = " ".join(data1['statement'].astype(str))
wordcloud = WordCloud(width=1500, height=700, max_font_size=250, background_color='white').generate(text)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Looking at the words which are in the half-true news

In [29]:
data1 = df[df['label'] == 'half-true']
# Join the statements into one string instead of a truncated Series repr.
text = " ".join(data1['statement'].astype(str))
wordcloud = WordCloud(width=1500, height=700, max_font_size=250, background_color='black').generate(text)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Looking at the words which are in the mostly-true news

In [30]:
data1 = df[df['label'] == 'mostly-true']
# Join the statements into one string instead of a truncated Series repr.
text = " ".join(data1['statement'].astype(str))
wordcloud = WordCloud(width=1500, height=700, max_font_size=250, background_color='yellow').generate(text)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Looking at the words which are in the TRUE news

In [31]:
data1 = df[df['label'] == 'TRUE']
# Join the statements into one string instead of a truncated Series repr.
text = " ".join(data1['statement'].astype(str))
wordcloud = WordCloud(width=1500, height=700, max_font_size=250, background_color='purple').generate(text)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

Looking at the words which are in the False news

In [32]:
data1 = df[df['label'] == 'FALSE']
# Join the statements into one string instead of a truncated Series repr.
text = " ".join(data1['statement'].astype(str))
wordcloud = WordCloud(width=1500, height=700, max_font_size=250, background_color='pink').generate(text)
plt.figure(figsize=(12,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
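The five word-cloud cells above repeat the same logic with only the label and background colour changing. One hedged way to fold them into a helper (the function names here are my own):

```python
import pandas as pd

def label_statement_text(df, label):
    # Join every statement carrying the given label into one string;
    # astype(str) guards against any non-string entries.
    return " ".join(df.loc[df["label"] == label, "statement"].astype(str))

def plot_label_wordcloud(df, label, background="white"):
    # Imported lazily so the text helper stays usable without wordcloud.
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    cloud = WordCloud(width=1500, height=700, max_font_size=250,
                      background_color=background).generate(
        label_statement_text(df, label))
    plt.figure(figsize=(12, 10))
    plt.imshow(cloud)
    plt.axis("off")
    plt.show()
```

With this in place, each cell reduces to a single call such as `plot_label_wordcloud(df, 'barely-true', background='white')`.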