Pandas

Pandas is a Python library for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

  1. Install pandas
                         
                            pip install pandas
                            
    
  2. Import Pandas
                         
                            import pandas
                            
    
         
            import pandas as pd
            
    
  3. Checking Pandas Version
                         
                            import pandas as pd
                            print(pd.__version__)
                            
    
  4. Pandas Series
    What is a Series?
    A Pandas Series is like a column in a table.
    It is a one-dimensional array holding data of any type.


    Labels
    If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

    This label can be used to access a specified value.
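    As a minimal sketch of label access, the snippet below builds a Series from a made-up list and reads values back through their default labels. The variable names are illustrative only.

```python
import pandas as pd

# A Series built from a plain list gets the default labels 0, 1, 2.
myvar = pd.Series([1, 7, 2])

# Access a value through its label.
first = myvar[0]
second = myvar.loc[1]  # .loc makes the label-based lookup explicit

print(first)
print(second)
```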


    Create Labels
    With the index argument, you can name your own labels.


    Key/Value Objects as Series
    You can also use a key/value object, like a dictionary, when creating a Series.
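    A small sketch of this, using a hypothetical dictionary of calories per day: the dictionary keys become the labels, and the index argument can select a subset of them.

```python
import pandas as pd

# Hypothetical sample data: calories burned on three days.
calories = {"day1": 420, "day2": 380, "day3": 390}

# The dictionary keys become the labels of the Series.
myvar = pd.Series(calories)

# Passing index keeps only the listed keys.
subset = pd.Series(calories, index=["day1", "day2"])

print(myvar)
print(subset)
```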


    DataFrames
    Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

    A Series is like a column; a DataFrame is the whole table.

                                     
                                        a = [1, 7, 2]
                                        myvar = pd.Series(a)
                                        print(myvar)
                                

                                     
                                        a = [1, 7, 2]
                                        myvar = pd.Series(a, index = ["x", "y", "z"])
                                        print(myvar)
                                        
                                
  5. Pandas DataFrames
    What is a DataFrame?
    A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array, or a table with rows and columns.

    Locate Row
    A printed DataFrame looks like a table with rows and columns.

    Pandas uses the loc attribute to return one or more specified rows.

    Note: When you pass a list of labels inside [], such as df.loc[[0, 1]], the result is a Pandas DataFrame. A single label, such as df.loc[0], returns a Pandas Series.

    Named Indexes
    With the index argument, you can name your own indexes.

    Locate Named Indexes
    Use the named index in the loc attribute to return the specified row(s).

    Load Files Into a DataFrame
    If your data sets are stored in a file, Pandas can load them into a DataFrame.

                                     
                                        data = {
                                            "calories": [420, 380, 390],
                                            "duration": [50, 40, 45]
                                        }

                                        # Named df so that the df.loc[0] example can use it
                                        df = pd.DataFrame(data)

                                        print(df)
                                        
                                

                                            
    print(df.loc[0])
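
    The difference between a single label and a list of labels in loc can be sketched as follows. The data repeats the calories/duration example; the names row and rows are illustrative.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data)

row = df.loc[0]        # single label: a Series holding row 0
rows = df.loc[[0, 1]]  # list of labels: a DataFrame with rows 0 and 1

print(type(row).__name__)
print(type(rows).__name__)
```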
                                        
                                        

                                        
                                            data = {
                                                "calories": [420, 380, 390],
                                                "duration": [50, 40, 45]
                                            }

                                            df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

                                            print(df)
                                    
                                    

                                            
                                                print(df.loc["day2"])  
                                        
                                        

                                            
                                                df = pd.read_csv('data.csv')
                                                print(df)
                                        
                                        
  6. Pandas Read CSV
    Read CSV Files
    A simple way to store big data sets is to use CSV files (comma-separated values files).

    CSV files contain plain text in a well-known format that can be read by almost any tool, including Pandas.

    In our examples we will be using a CSV file called 'data.csv'.

    max_rows
    The number of rows shown when a DataFrame is printed is defined in the Pandas option settings.

    You can check your system's maximum rows with the pd.options.display.max_rows statement.

    Tip: use to_string() to print the entire DataFrame.
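
    To try this without a file on disk, the sketch below reads made-up CSV text through io.StringIO, which stands in for 'data.csv'. The column names and values are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up CSV content standing in for 'data.csv'.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
60,103,135,340.0
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Printing df directly may abbreviate a long DataFrame;
# to_string() renders every row.
full_text = df.to_string()
print(full_text)

# The display limit is an ordinary option that can be read.
print(pd.options.display.max_rows)
```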

                                            
                                                df = pd.read_csv('data.csv')
                                                print(df.to_string())
                                        
                                        

                                            
                                                df = pd.read_csv('data.csv')
                                                print(df)
                                        
                                        

                                        
                                    print(pd.options.display.max_rows)
                                    
                                    
  7. Pandas Read JSON
    Read JSON
    Big data sets are often stored or extracted as JSON.

    JSON is plain text, but has the format of an object, and is well known in the world of programming, including Pandas.

    In our examples we will be using a JSON file called 'data.json'.

    Dictionary as JSON
    JSON = Python Dictionary

    JSON objects have the same format as Python dictionaries.

    If your JSON code is not in a file, but in a Python dictionary, you can load it into a DataFrame directly.

                                        
                                            df = pd.read_json('data.json')
                                            print(df.to_string())
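
    A self-contained variant, assuming no 'data.json' is at hand: the JSON text below is made up and is passed to read_json through io.StringIO, which behaves like an open file.

```python
import io
import pandas as pd

# Made-up JSON content standing in for 'data.json'.
json_text = (
    '{"Duration": {"0": 60, "1": 60, "2": 45},'
    ' "Pulse": {"0": 110, "1": 117, "2": 103}}'
)

# read_json accepts a path or a file-like object.
df = pd.read_json(io.StringIO(json_text))

print(df.to_string())
```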
                                        

                                        
                                            
    data = {
        "Duration": {
            "0": 60,
            "1": 60,
            "2": 60
        },
        "Pulse": {
            "0": 110,
            "1": 117,
            "2": 103
        },
        "Maxpulse": {
            "0": 130,
            "1": 145,
            "2": 135
        },
        "Calories": {
            "0": 409,
            "1": 479,
            "2": 340
        }
    }

    df = pd.DataFrame(data)

    print(df)
                                        
                                        
  8. Pandas - Analyzing DataFrames
    Viewing the Data
    One of the most used methods for getting a quick overview of a DataFrame is the head() method.


    The head() method returns the headers and a specified number of rows, starting from the top. If no number is given, it returns the first 5 rows.


    There is also a tail() method for viewing the last rows of the DataFrame. The tail() method returns the headers and a specified number of rows, starting from the bottom.
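
    A short sketch of both methods on a made-up ten-row DataFrame, showing the default and an explicit row count:

```python
import pandas as pd

# Made-up data: ten rows with the values 10, 20, ..., 100.
df = pd.DataFrame({"Duration": range(10, 110, 10)})

top = df.head()       # first 5 rows by default
top3 = df.head(3)     # first 3 rows
bottom2 = df.tail(2)  # last 2 rows

print(top3)
print(bottom2)
```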


    Info About the Data
    The DataFrame object has a method called info() that gives you more information about the data set.


    Null Values
    The info() method also tells us how many non-null values are present in each column. In our data set, there are 164 non-null values out of 169 in the "Calories" column.
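
    The same counts can be obtained as numbers rather than printed text. The sketch below uses a small made-up DataFrame, not the 169-row data set mentioned above.

```python
import pandas as pd

# Made-up data with two empty cells in "Calories".
df = pd.DataFrame({
    "Duration": [60, 60, 45, 30],
    "Calories": [409.1, None, 282.4, None],
})

df.info()  # prints the summary itself and returns None

non_null = df["Calories"].count()  # number of non-null cells
missing = df.isna().sum()          # empty cells per column

print(non_null)
print(missing)
```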

                                                
                                                    print(df.head())
                                                
                                                

                                                
                                                    print(df.tail()) 
                                                
                                                

                                                
                                                    df.info()
                                                
                                                
  9. Cleaning Data
    Data Cleaning
    Data cleaning means fixing bad data in your data set.

    Bad data could be:
    1. Empty cells
    2. Data in wrong format
    3. Wrong data
    4. Duplicates

  10. Cleaning Empty Cells
    Empty Cells
    Empty cells can potentially give you a wrong result when you analyze data.
    1. Remove Rows
      One way to deal with empty cells is to remove rows that contain empty cells. This is usually acceptable when the data set is large, because removing a few rows will not have a big impact on the result.

      Note: By default, the dropna() method returns a new DataFrame, and will not change the original.

      If you want to change the original DataFrame, use the inplace = True argument.
    2. Replace Empty Values
      Another way of dealing with empty cells is to insert a new value instead.

      This way you do not have to delete entire rows just because of some empty cells.

      The fillna() method allows us to replace empty cells with a value.
    3. Replace Only For Specified Columns
      Calling fillna() on the whole DataFrame replaces the empty cells in every column.

      To replace empty values in only one column, specify the column name for the DataFrame.
    4. Replace Using Mean, Median, or Mode
      A common way to replace empty cells is to calculate the mean, median, or mode value of the column.

      Pandas uses the mean(), median(), and mode() methods to calculate the respective values for a specified column.
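
    As a sketch of the column-specific replacement described in item 3, the snippet below passes a dictionary to fillna() so that only the named column is filled. The data and the fill value 130 are made up for illustration.

```python
import pandas as pd

# Made-up data with one empty cell per column.
df = pd.DataFrame({
    "Pulse": [110.0, None, 103.0],
    "Calories": [409.1, 479.0, None],
})

# Fill only the "Calories" column; "Pulse" keeps its empty cell.
filled = df.fillna({"Calories": 130})

# Remove every row that still has an empty cell.
cleaned = filled.dropna()

print(filled)
print(cleaned)
```

    Neither call uses inplace = True, so df itself is left unchanged.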


                                                
                                                    new_df = df.dropna()
                                                
                                                

                                                
                                                    df.dropna(inplace = True)
                                                
                                                

                                                
                                                    df.fillna(130, inplace = True)
                                                
                                                

    Mean = the average value (the sum of all values divided by number of values).

                                                  
                                                    x = df["Calories"].mean()
                                                    # Fill through the DataFrame so the change is applied to df itself
                                                    df.fillna({"Calories": x}, inplace = True)
                                                  
                                                  

    Median = the value in the middle, after you have sorted all values ascending.

                                                  
                                                    x = df["Calories"].median()
                                                    # Fill through the DataFrame so the change is applied to df itself
                                                    df.fillna({"Calories": x}, inplace = True)
                                                  
                                                  

    Mode = the value that appears most frequently.

                                                  
                                                    x = df["Calories"].mode()[0]
                                                    # Fill through the DataFrame so the change is applied to df itself
                                                    df.fillna({"Calories": x}, inplace = True)
                                                  
                                                  
  11. Cleaning Data of Wrong Format
    Data of Wrong Format
    Cells with data of wrong format can make it difficult, or even impossible, to analyze data.

    To fix it, you have two options: remove the rows, or convert all cells in the columns into the same format.
    1. Convert Into a Correct Format
    2. Removing Rows

                                                
                                                    df['Date'] = pd.to_datetime(df['Date'])
                                                
                                                

                                                                                        
                                                    df.dropna(subset=['Date'], inplace = True)
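
    The two steps can be combined, as in this sketch on made-up data. It assumes that unparseable cells should be discarded: errors="coerce" turns them into NaT (Not a Time), which dropna() then removes.

```python
import pandas as pd

# Made-up data: one empty cell and one cell that is not a date.
df = pd.DataFrame({
    "Date": ["2020/12/01", "2020/12/02", None, "not a date"],
    "Duration": [60, 45, 30, 60],
})

# Convert the column; values that cannot be parsed become NaT.
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
unparsed = df["Date"].isna().sum()

# Remove the rows whose date could not be converted.
df.dropna(subset=["Date"], inplace=True)

print(unparsed)
print(df)
```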
       
                                                
                                                
  12. Fixing Wrong Data
    Wrong Data
    "Wrong data" does not have to be "empty cells" or "wrong format"; it can just be wrong, for example if someone registered "199" instead of "1.99".

    Sometimes you can spot wrong data by looking at the data set, because you have an expectation of what it should be.
    1. Replacing Values
    2. Removing Rows

                                                
                                                    df.loc[7, 'Duration'] = 45
                                                
                                                

                                                
                                                    for x in df.index:
                                                        if df.loc[x, "Duration"] > 120:
                                                            df.drop(x, inplace = True)
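
    For larger data sets, the same rule can be applied without a loop. This is an alternative sketch using a Boolean condition, with the limit of 120 taken from the loop above and made-up data.

```python
import pandas as pd

# Made-up data: two durations are above the limit of 120.
df = pd.DataFrame({"Duration": [60, 450, 45, 130]})

# Option 1: cap values above the limit (replacing values).
capped = df.copy()
capped.loc[capped["Duration"] > 120, "Duration"] = 120

# Option 2: keep only the rows within the limit (removing rows).
kept = df[df["Duration"] <= 120]

print(capped)
print(kept)
```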
                                                
                                                
  14. Removing Duplicates
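    Discovering Duplicates
    Duplicate rows are rows that have been registered more than one time.

    To discover duplicates, we can use the duplicated() method.

    The duplicated() method returns a Boolean value for each row: True for every row that repeats an earlier row, otherwise False.

    Removing Duplicates
    To remove duplicates, use the drop_duplicates() method.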

                                    
                                        print(df.duplicated())
                                    
                                    

                                    
                                        df.drop_duplicates(inplace = True)
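
    A small sketch of both methods on made-up data in which one row repeats an earlier one:

```python
import pandas as pd

# Made-up data: the row at index 2 repeats the row at index 0.
df = pd.DataFrame({
    "Duration": [60, 45, 60, 60],
    "Pulse": [110, 117, 110, 103],
})

flags = df.duplicated()             # True only for the repeated row
unique_rows = df.drop_duplicates()  # new DataFrame without the repeat

print(flags)
print(unique_rows)
```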
                                    
                                    
  15. Data Correlations
    Finding Relationships
    A great aspect of the Pandas module is the corr() method.

    The corr() method calculates the relationship between each column in your data set.

    The result of the corr() method is a table of numbers that represent how strong the relationship is between each pair of columns.

    The number varies from -1 to 1.

    1 means that there is a 1 to 1 relationship (a perfect correlation): each time a value goes up in the first column, the value in the other column goes up as well.

    0.9 is also a good relationship, and if you increase one value, the other will probably increase as well.

    -0.9 is just as good a relationship as 0.9, but if you increase one value, the other will probably go down.

    0.2 means NOT a good relationship: if one value goes up, that does not mean the other will.

    What is a good correlation? It depends on the use, but as a rule of thumb the value should be at least 0.6 (or at most -0.6) to call it a good correlation.


    Perfect Correlation: We can see that "Duration" and "Duration" got the number 1.000000, which makes sense, each column always has a perfect relationship with itself.


    Good Correlation: "Duration" and "Calories" got a 0.922721 correlation, which is a very good correlation, and we can predict that the longer you work out, the more calories you burn, and the other way around: if you burned a lot of calories, you probably had a long work out.


    Bad Correlation: "Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad correlation, meaning that we can not predict the max pulse by just looking at the duration of the work out, and vice versa.

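    Note: In older versions of Pandas, the corr() method ignores "not numeric" columns. Since pandas 2.0, corr() raises an error if the DataFrame contains non-numeric columns, unless you pass the numeric_only = True argument.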
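
    The range from -1 to 1 can be seen in a sketch with made-up columns that are exactly linearly related. It assumes pandas 1.5 or later, where corr() accepts the numeric_only argument.

```python
import pandas as pd

# Made-up data: "Calories" rises with "Duration", "Rest" falls with it.
df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [200, 300, 400, 600],
    "Rest": [90, 75, 60, 30],
    "Label": ["a", "b", "c", "d"],  # non-numeric column
})

# numeric_only=True leaves the "Label" column out of the calculation.
corr = df.corr(numeric_only=True)

print(corr)
```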

                                    
                                        df.corr()
                                    
                                    

                                    
                                    
                                    
  16. DataFrame Reference
    Property/Method Description
    abs() Return a DataFrame with the absolute value of each value
    add() Adds the values of a DataFrame with the specified value(s)
    add_prefix() Prefix all labels
    add_suffix() Suffix all labels
    agg() Apply a function or a function name to one of the axes of the DataFrame
    aggregate() Apply a function or a function name to one of the axes of the DataFrame
    align() Aligns two DataFrames with a specified join method
    all() Return True if all values in the DataFrame are True, otherwise False
    any() Returns True if any of the values in the DataFrame are True, otherwise False
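    append() Append the rows of another object to the end of the DataFrame (removed in pandas 2.0; use pd.concat() instead)
    applymap() Execute a function for each element in the DataFrame (deprecated since pandas 2.1 in favor of map())
    apply() Apply a function to one of the axes of the DataFrame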
    assign() Assign new columns
    astype() Convert the DataFrame into a specified dtype
    at Get or set the value of the item with the specified label
    axes Returns the labels of the rows and the columns of the DataFrame
    bfill() Replaces NULL values with the value from the next row
    bool() Returns the Boolean value of the DataFrame
    columns Returns the column labels of the DataFrame
    combine() Compare the values in two DataFrames, and let a function decide which values to keep
    combine_first() Compare two DataFrames, and if the first DataFrame has a NULL value, it will be filled with the respective value from the second DataFrame
    compare() Compare two DataFrames and return the differences
    convert_dtypes() Converts the columns in the DataFrame into new dtypes
    corr() Find the correlation (relationship) between each column
    count() Returns the number of not empty cells for each column/row
    cov() Find the covariance of the columns
    copy() Returns a copy of the DataFrame
    cummax() Calculate the cumulative maximum values of the DataFrame
    cummin() Calculate the cumulative minimum values of the DataFrame
    cumprod() Calculate the cumulative product over the DataFrame
    cumsum() Calculate the cumulative sum over the DataFrame
    describe() Returns a description summary for each column in the DataFrame
    diff() Calculate the difference between a value and the value of the same column in the previous row
    div() Divides the values of a DataFrame with the specified value(s)
    dot() Multiplies the values of a DataFrame with values from another array-like object, and add the result
    drop() Drops the specified rows/columns from the DataFrame
    drop_duplicates() Drops duplicate values from the DataFrame
    droplevel() Drops the specified index/column(s)
    dropna() Drops all rows that contain NULL values
    dtypes Returns the dtypes of the columns of the DataFrame
    duplicated() Returns True for duplicated rows, otherwise False
    empty Returns True if the DataFrame is empty, otherwise False
    eq() Returns True for values that are equal to the specified value(s), otherwise False
    equals() Returns True if two DataFrames are equal, otherwise False
    eval() Evaluate a specified string
    explode() Converts each element into a row
    ffill() Replaces NULL values with the value from the previous row
    fillna() Replaces NULL values with the specified value
    filter() Filter the DataFrame according to the specified filter
    first() Returns the first rows of a specified date selection
    floordiv() Divides the values of a DataFrame with the specified value(s), and floor the values
    ge() Returns True for values greater than, or equal to the specified value(s), otherwise False
    get() Returns the item of the specified key
    groupby() Groups the rows/columns into specified groups
    gt() Returns True for values greater than the specified value(s), otherwise False
    head() Returns the header row and the first 5 rows, or the specified number of rows
    iat Get or set the value of the item in the specified position
    idxmax() Returns the label of the max value in the specified axis
    idxmin() Returns the label of the min value in the specified axis
    iloc Get or set the values of a group of elements in the specified positions
    index Returns the row labels of the DataFrame
    infer_objects() Change the dtype of the columns in the DataFrame
    info() Prints information about the DataFrame
    insert() Insert a column in the DataFrame
    interpolate() Replaces not-a-number values using the specified interpolation method
    isin() Returns True for each element in the DataFrame that is in the specified values, otherwise False
    isna() Finds not-a-number values
    isnull() Finds NULL values
    items() Iterate over the columns of the DataFrame
    iteritems() Iterate over the columns of the DataFrame (removed in pandas 2.0; use items() instead)
    iterrows() Iterate over the rows of the DataFrame
    itertuples() Iterate over the rows as named tuples
    join() Join columns of another DataFrame
    last() Returns the last rows of a specified date selection
    le() Returns True for values less than, or equal to the specified value(s), otherwise False
    loc Get or set the value of a group of elements specified using their labels
    lt() Returns True for values less than the specified value(s), otherwise False
    keys() Returns the keys of the info axis
    kurtosis() Returns the kurtosis of the values in the specified axis
    mask() Replace all values where the specified condition is True
    max() Return the max of the values in the specified axis
    mean() Return the mean of the values in the specified axis
    median() Return the median of the values in the specified axis
    melt() Reshape the DataFrame from a wide table to a long table
    memory_usage() Returns the memory usage of each column
    merge() Merge DataFrame objects
    min() Returns the min of the values in the specified axis
    mod() Modulo (find the remainder) of the values of a DataFrame
    mode() Returns the mode of the values in the specified axis
    mul() Multiplies the values of a DataFrame with the specified value(s)
    ndim Returns the number of dimensions of the DataFrame
    ne() Returns True for values that are not equal to the specified value(s), otherwise False
    nlargest() Sort the DataFrame by the specified columns, descending, and return the specified number of rows
    notna() Finds values that are not not-a-number
    notnull() Finds values that are not NULL
    nsmallest() Sort the DataFrame by the specified columns, ascending, and return the specified number of rows
    nunique() Returns the number of unique values in the specified axis
    pct_change() Returns the percentage change between the previous and the current value
    pipe() Apply a function to the DataFrame
    pivot() Re-shape the DataFrame
    pivot_table() Create a spreadsheet pivot table as a DataFrame
    pop() Removes an element from the DataFrame
    pow() Raise the values of one DataFrame to the values of another DataFrame
    prod() Returns the product of all values in the specified axis
    product() Returns the product of the values in the specified axis
    quantile() Returns the values at the specified quantile of the specified axis
    query() Query the DataFrame
    radd() Reverse-adds the values of one DataFrame with the values of another DataFrame
    rdiv() Reverse-divides the values of one DataFrame with the values of another DataFrame
    reindex() Change the labels of the DataFrame
    reindex_like() Returns a DataFrame whose indexes match those of another object
    rename() Change the labels of the axes
    rename_axis() Change the name of the axis
    reorder_levels() Re-order the index levels
    replace() Replace the specified values
    reset_index() Reset the index
    rfloordiv() Reverse-divides the values of one DataFrame with the values of another DataFrame, and floors the result
    rmod() Reverse-modulo (find the remainder) of the values of one DataFrame with the values of another DataFrame
    rmul() Reverse-multiplies the values of one DataFrame with the values of another DataFrame
    round() Returns a DataFrame with all values rounded into the specified format
    rpow() Reverse-raises the values of one DataFrame up to the values of another DataFrame
    rsub() Reverse-subtracts the values of one DataFrame to the values of another DataFrame
    rtruediv() Reverse-divides the values of one DataFrame with the values of another DataFrame
    sample() Returns a random selection of elements
    sem() Returns the standard error of the mean in the specified axis
    select_dtypes() Returns a DataFrame with columns of selected data types
    shape Returns the number of rows and columns of the DataFrame
    set_axis() Sets the index of the specified axis
    set_flags() Returns a new DataFrame with the specified flags
    set_index() Set the Index of the DataFrame
    size Returns the number of elements in the DataFrame
    skew() Returns the skew of the values in the specified axis
    sort_index() Sorts the DataFrame according to the labels
    sort_values() Sorts the DataFrame according to the values
    squeeze() Converts a single column DataFrame into a Series
    stack() Reshape the DataFrame from a wide table to a long table
    std() Returns the standard deviation of the values in the specified axis
    sum() Returns the sum of the values in the specified axis
    sub() Subtracts the values of a DataFrame with the specified value(s)
    swaplevel() Swaps the two specified levels
    T Turns rows into columns and columns into rows
    tail() Returns the headers and the last rows
    take() Returns the specified elements
    to_xarray() Returns an xarray object
    transform() Execute a function for each value in the DataFrame
    transpose() Turns rows into columns and columns into rows
    truediv() Divides the values of a DataFrame with the specified value(s)
    truncate() Removes elements outside of a specified set of values
    update() Update one DataFrame with the values from another DataFrame
    value_counts() Returns how many times each unique row occurs
    values Returns the DataFrame as a NumPy array
    var() Returns the variance of the values in the specified axis
    where() Replace all values where the specified condition is False
    xs() Returns the cross-section of the DataFrame
    __iter__() Returns an iterator of the info axes