Source of Dataframes Reganv.info


  [TOC] 

Sample Data
==================
Data that will be used as an example in the following sections
```python
data = {'Food': ['Taco', 'Sushi', 'Hamburger', 'Hotdog', 'Stew', 'Ribs', 'Spaghetti'],
        'Calories': [27, 24, 22, 32, 99, 123, 231]
        }
df = pd.DataFrame(data)
```
|index | Food | Calories |
| ---- | --- | -------- |
| 0 | Taco | 27 |
| 1 |  Sushi | 24 |
| 2 | Hamburger | 22 |
| 3 | Hotdog | 32 |
| 4 | Stew | 99 |
| 5 | Ribs | 123 |
| 6 | Spaghetti | 231 |

Accessing
==================
Ultimately it seems like the only method you need here is loc

By Row - loc
------------------
Access a group of rows and columns by label(s) or a boolean array. [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)

Expects an index or an index and column.
```python
print(df.loc['Food'])  # Key Error
print(df.loc['Taco'])  # Key Error
print(df.loc[0])  # Returns pandas series
    Food        Taco
    Calories      27
print(df.loc[[0]])  # Returns dataframe
       Food  Calories
    0  Taco        27
```

Can access an individual value like so
```python
print(df.loc[0, "Food"]  # Returns str
     Taco
```
You can also set a value in the same way
```python
df.loc[0, "Food"] = "Roast"
```
Useful if you have reindexed the dataframe to a non integer index
```python
df = df.set_index('Food')
df.loc['Taco']  # Return series
    Calories    27
```
By Row - iloc
------------------
Purely integer-location based indexing for selection by position. Performs mostly the same as loc if you haven't changed the index. Deprecated since 2.2.0 [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html#)
```python
print(df.iloc[0, "Food"])  # ValueError, can't have a string, only ints
print(df.iloc[0, 0])  # Returns str
    Taco
print(df.iloc[0])  # Returns series
    Food        Taco
    Calories      27
print(df.iloc[[0]])  # Returns dataframe
```
By Row - at
------------------
Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a single value in a DataFrame or Series [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html)
```python
df.at[0]  # TypeError missing argument 'col'
df.at[1, "Food"]  # Returns str
    Sushi
df.at[1, "Food"] = "Roast Beef"  # Set value
```
Masking
==================
A way to filter down a dataframe to only rows of interest

Basic Masking
------------------
```python
df.loc[df['Calories'] > 100]
        Food  Calories
5       Ribs       123
6  Spaghetti       231
```

Can apply multiple filters, just make sure your parenthesis and boolean operators are in the right place
```python
df.loc[df['Food'].str.contains('i') & (df['Calories'] > 200)]
        Food  Calories
6  Spaghetti       231

df.loc[df['Food'].str.contains('i') | (df['Food'].str.contains('o'))]
        Food  Calories
0       Taco        27
1      Sushi        24
3     Hotdog        32
5       Ribs       123
6  Spaghetti       231
```

Compare to another list/dataframe
------------------
Lets say we have another dataframe with street foods in it.

street_food_df = 

| Name | Type |
| --- | --- |
| Taco | Mexican |
| Hamburger | American |
| Hotdog | American |

We can filter down our food dataframe like so
```python
final_df = df.loc[df['Food'].isin(street_food_df['Name'])]
print(final_df)
        Food  Calories
0       Taco        27
2  Hamburger        22
3     Hotdog        32
```

Grouping
==================
Aggregating multiple columns
------------------
Here's how to specify what to do with each column in a group by operation

```
consolidated_df = df.groupby('cpe')[['ip', 'latest_version', 'eol_date', 'host_count']].agg({
    'ip': lambda x: ','.join(x),
    'latest_version': 'max',
    'eol_date': 'max',
    'host_count': 'count'
    }).reset_index()
```