Documentation: DataScience_Utils (dsx)
- class dsx.ds_utils.dsx(pandas_obj)
The dsx module (which shares its name with the package but should not be confused with it) contains a collection of wrapper functions that simplify common operations in data analytics tasks. The core module ds_utils (data science utilities) is designed to work with pandas DataFrames.
- classmethod activate_lolviz()
Import the lolviz package as lz and add the graphviz directory to os.environ["path"].
- Parameters
lolviz_dir (str, optional) –
- Return type
lolviz instance
- classmethod backup(df, name: str = 'last')
Back up a DataFrame or list (or any object with a .copy() method).
- Parameters
df – DataFrame or list (or any object with a .copy() method)
name – Name of the backup, used later to retrieve the data.
- Return type
None
- bk(bk_name: Optional[str] = None)
Back up the DataFrame.
- Parameters
bk_name –
- Return type
None
- ci(col, n=1000, func=<function mean>, p=0.05)
Generate n bootstrap samples, evaluating func on each resample. This method returns a function, which can be called to obtain the confidence intervals of interest.
- Parameters
n – Sample size for the sampling distribution, that is, the number of bootstrap samples (default = 1,000)
func (function, optional) – The statistic function whose sampling distribution is to be bootstrapped (default = np.mean)
p (float, optional) – p-value specifying the two-sided symmetric confidence interval
- Returns
Function to be called to obtain the confidence intervals of interest. It returns the two-sided symmetric confidence interval specified.
- Return type
function
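The "returns a function" pattern described for ci can be illustrated with a minimal percentile bootstrap in plain NumPy. This is a sketch, not the dsx implementation: the helper name bootstrap_ci, the seed argument, and the use of the percentile method are assumptions made for this example.

```python
import numpy as np

def bootstrap_ci(values, n=1000, func=np.mean, seed=0):
    """Hypothetical helper: return a function mapping p to a two-sided CI."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # Resample with replacement n times and evaluate func on each resample.
    stats = np.array([
        func(rng.choice(values, size=values.size, replace=True))
        for _ in range(n)
    ])

    def interval(p=0.05):
        # Percentile interval: cut p/2 from each tail of the bootstrap distribution.
        lower, upper = np.percentile(stats, [100 * p / 2, 100 * (1 - p / 2)])
        return lower, upper

    return interval

get_ci = bootstrap_ci([2.1, 2.5, 2.8, 3.0, 3.3, 3.9, 4.2], n=2000)
low, high = get_ci(0.05)    # 95% interval
low50, high50 = get_ci(0.5)  # 50% interval, reusing the same resamples
```

Because the resampling happens once, inside the outer call, intervals at several levels can be requested from the returned function without repeating the bootstrap.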
- cols_shift(col_names: Union[str, list], direction: Union[str, int] = 'right')
Shift a list of columns to the left-most or right-most position of the DataFrame. Note: this method has no “inplace” option.
- Parameters
col_names (str or list) –
direction (str or int) – As str: ‘left’ or ‘right’. As int: 0 or 1.
inplace –
- Returns
df with reordered columns
- Return type
pd.core.frame.DataFrame
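For readers who want to see what such a column shift amounts to, the following plain-pandas sketch reorders columns without modifying the original DataFrame. The function name shift_columns is hypothetical, and the mapping of 0 to ‘left’ and 1 to ‘right’ is an assumption; the reference above does not state which integer corresponds to which side.

```python
import pandas as pd

def shift_columns(df, col_names, direction="right"):
    """Hypothetical helper: return a new frame with col_names moved to one edge."""
    if isinstance(col_names, str):
        col_names = [col_names]
    others = [c for c in df.columns if c not in col_names]
    # Assumption: 0 means 'left' and 1 means 'right'.
    to_left = direction in ("left", 0)
    ordered = col_names + others if to_left else others + col_names
    return df[ordered]

df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
shifted = shift_columns(df, ["c"], direction="left")
```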
- cols_std(inplace=True, camel=False)
Standardize the names of all columns so that they are compatible with IPython. This method removes spaces and special characters from the column names. Once standardized, the column names can be used as attributes of the DataFrame (with autocomplete) in IPython.
- Parameters
inplace (bool) –
camel (bool) –
- Returns
Returned only when the inplace parameter is set to False
- Return type
pandas.core.frame.DataFrame
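One way to approximate this kind of column-name standardization in plain pandas is sketched below. The helper standardize_columns is hypothetical, and joining words with underscores (or capitalizing them when camel is True) is an assumption; the exact rules dsx applies are not documented here.

```python
import re
import pandas as pd

def standardize_columns(df, camel=False):
    """Hypothetical helper: return a copy of df with identifier-friendly names."""
    def clean(name):
        # Split on any run of characters that are not letters, digits, or underscores.
        words = [w for w in re.split(r"\W+", str(name)) if w]
        if camel:
            return "".join(w[:1].upper() + w[1:] for w in words)
        return "_".join(words)
    return df.rename(columns=clean)

df = pd.DataFrame({"Unit Price ($)": [9.5], "order-id": [1]})
std = standardize_columns(df)
price = std.Unit_Price  # attribute access now works
```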
- convert_dtypes()
Convert dtypes to pandas 1.0 dtypes and stringify object columns.
- Return type
pd.core.frame.DataFrame
- cumsum(col_name: str) → pandas.core.frame.DataFrame
Generate the following from the unique values of a variable:
Count (raw count of records)
Percentage of the values over the total data
Accumulated percentage of the values
- Parameters
col_name (str) –
- Return type
pd.core.frame.DataFrame
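The three quantities listed for cumsum can be reproduced with value_counts in plain pandas, as in the sketch below. The function name value_summary and the output column names are assumptions for illustration, not the names dsx uses.

```python
import pandas as pd

def value_summary(df, col_name):
    """Hypothetical helper: count, percentage, and cumulative percentage per value."""
    counts = df[col_name].value_counts()  # sorted by count, missing values dropped
    out = pd.DataFrame({"count": counts})
    out["percent"] = out["count"] / out["count"].sum() * 100
    out["cum_percent"] = out["percent"].cumsum()
    return out

df = pd.DataFrame({"grade": ["A", "B", "A", "C", "A", "B", "A", "A"]})
summary = value_summary(df, "grade")
```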
- classmethod del_tempfiles(tempdata=False)
Delete the temporary files of the project.
- Parameters
tempdata (bool, optional) – Default is False. Set to True to delete the temporary data in the ‘data/temp’ directory.
- Return type
None
- static delta_todate(num_yyyyddd)
Convert a timedelta to a date.
- Parameters
num_yyyyddd –
- Return type
datetime.datetime
- dump(path: str, compression_level: int = 7)
Dump the DataFrame to the project’s data/temp directory.
- Parameters
path (str) –
dir (str, optional) – Default = data/temp
compression_level (int, optional) –
- Return type
None
- duplicated(colname_list: Union[str, list], return_dups=False, keep: bool = False) → int
Count the duplicated rows, given a list of columns that together contain the unique key.
- Parameters
colname_list (Union[str, list]) –
return_dups (bool, optional) – Default = False. Set to True to return a tuple containing (count, df_duplicates).
keep (bool, optional) –
- Returns
Number of Duplicated Rows
- Return type
int
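A rough plain-pandas equivalent of this duplicate count is sketched below, built on DataFrame.duplicated. The helper count_duplicates is hypothetical; in particular, passing keep straight through to pandas is an assumption about how dsx interprets that parameter.

```python
import pandas as pd

def count_duplicates(df, colname_list, return_dups=False, keep=False):
    """Hypothetical helper: count rows whose key columns repeat."""
    if isinstance(colname_list, str):
        colname_list = [colname_list]
    # keep=False flags every member of a duplicated group, not just the repeats.
    mask = df.duplicated(subset=colname_list, keep=keep)
    count = int(mask.sum())
    return (count, df[mask]) if return_dups else count

df = pd.DataFrame({"id": [1, 1, 2, 3, 3, 3], "v": list("abcdef")})
n_all = count_duplicates(df, "id")                  # all rows in duplicated groups
n_extra = count_duplicates(df, "id", keep="first")  # only the repeats
n_pair, dup_rows = count_duplicates(df, ["id"], return_dups=True)
```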
- get_dfname(set=True)
Get the name of the variable.
Only works in IPython.
- Parameters
var –
- Returns
variable_name
- Return type
str
- static get_varname(var: object)
Get the name of the variable.
Only works in IPython.
- Parameters
var –
- Returns
variable_name
- Return type
str
- info()
Generate the metadata of the DataFrame. The metadata includes the following:
Column Names
Missing Count
Missing Percentage
Unique Value Count (nunique)
Unique Value Percentage
- Return type
pandas.core.frame.DataFrame
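A metadata table of this shape can be assembled from isnull and nunique in plain pandas. The sketch below uses a hypothetical helper, describe_columns, with assumed column labels, and expresses percentages on a 0 to 100 scale; dsx may label and scale its output differently.

```python
import pandas as pd

def describe_columns(df):
    """Hypothetical helper: per-column missing and unique statistics."""
    n_rows = len(df)
    meta = pd.DataFrame({
        "missing_count": df.isnull().sum(),
        "unique_count": df.nunique(),  # missing values are not counted as unique
    })
    meta["missing_pct"] = meta["missing_count"] / n_rows * 100
    meta["unique_pct"] = meta["unique_count"] / n_rows * 100
    meta.index.name = "column"
    return meta

df = pd.DataFrame({"x": [1, 2, None, 2], "y": ["a", "a", "a", "a"]})
meta = describe_columns(df)
```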
- static interactive()
Set InteractiveShell.ast_node_interactivity = "all"
Set mpl.use("module://backend_interagg")
Set plt.ion()
- isnull(colname: str) → tuple
Count the rows (and the percentage of rows) with missing values in the specified column.
- Parameters
colname (str) – Single column name
- Returns
(Count of Missing Rows, Percentage of Missing Rows)
- Return type
tuple
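The (count, percentage) tuple can be computed in plain pandas as follows. The helper name missing_summary is hypothetical, and reporting the percentage on a 0 to 100 scale is an assumption.

```python
import pandas as pd

def missing_summary(df, colname):
    """Hypothetical helper: (missing row count, missing row percentage)."""
    count = int(df[colname].isnull().sum())
    return count, count / len(df) * 100

df = pd.DataFrame({"x": [1.0, None, 3.0, None]})
n_missing, pct_missing = missing_summary(df, "x")
```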
- isnull_list(col_names_list=None) → pandas.core.frame.DataFrame
Generate a report of cases with missing values
- Parameters
col_names_list (list, optional) – List of columns to be included in the report. If not specified, all columns will be used.
- Return type
pandas.core.frame.DataFrame
- len_compare(df_to_compare, overwrite_df1=None) → tuple
Compare the lengths of two DataFrames (or any other enumerable objects).
- Parameters
df_to_compare –
overwrite_df1 (bool, optional) – Ignore this instance of the DataFrame and use the DataFrame passed as the parameter as the copy to be compared.
- static matplotlib_config()
Print the matplotlib configuration.
- Returns
Lines of text
- Return type
str
- merge(right, how='left', on=None, left_on=None, right_on=None, isnull=None) → pandas.core.frame.DataFrame
Merge with another DataFrame. This is a wrapper around ‘merge’ in pandas, with additional checking mechanisms. The method also creates a backup of the original DataFrame under the key ‘last’ in dsx.backup_repo (a dictionary).
- Parameters
right (pd.core.frame.DataFrame) –
isnull (str) –
- Return type
pd.core.frame.DataFrame
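The checks performed by this wrapper are not listed here. As one example of the kind of check that is often useful after a left join, the sketch below uses the indicator option of pandas merge to count left rows that found no match and confirms that the row count did not change. The frames and column names are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 3], "y": [10, 30]})

# indicator=True adds a '_merge' column recording where each row came from.
merged = left.merge(right, how="left", on="id", indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())
rows_preserved = len(merged) == len(left)
```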
- nunique(col_names_list=None) → pandas.core.frame.DataFrame
To generate:
the number of unique values
the percentage of unique values over the total records (rows)
- Parameters
col_names_list (list) – If not specified, all column names will be used
- Return type
pd.core.frame.DataFrame
- static plt_labels(percent=False, fontsize=None, color=None, denominator=None)
Insert a label for each element in the current axes (the last chart created).
- Parameters
percent (bool) –
fontsize (float) –
color (str) –
denominator (float) –
- Return type
None
- static progress(iterable: collections.abc.Iterable, counter: int) → str
Return a string template for the progress of a loop operation.
- Parameters
iterable (Iterable) –
counter (int) –
- Return type
str
- rename(col_index_or_name: Union[str, int], col_name_new, inplace: bool = True)
Rename a single column.
- Parameters
col_index_or_name –
col_name_new –
inplace –
- Returns
renamed_DataFrame – Only if inplace is set to False.
- Return type
pd.core.frame.DataFrame
- reset_index(index_label: str = 'RID', inplace: bool = True)
Reset the index and immediately rename the old ‘index’ to the index_label provided.
- Parameters
index_label (str, optional) –
inplace (bool, optional) –
- Returns
Returned only when inplace is False
- Return type
pd.core.frame.DataFrame
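In plain pandas, the reset-then-rename step looks roughly like this. The sketch assumes an unnamed index, which pandas turns into a column called ‘index’ on reset.

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20]}, index=["r1", "r2"])
# Reset the index, then rename the resulting 'index' column to the chosen label.
out = df.reset_index().rename(columns={"index": "RID"})
```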
- classmethod restore(name: str = 'last')
Restore a DataFrame or list (or any object with a .copy() method).
- Parameters
df – DataFrame or list (or any object with a .copy() method)
name – Name of the backup, used to retrieve the data.
- Return type
Object
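The backup and restore pair can be pictured as a dictionary of copies keyed by name. The sketch below is a simplified stand-in, not the dsx code: the module-level backup_repo and both function bodies are assumptions based on the descriptions above.

```python
import pandas as pd

backup_repo = {}  # hypothetical stand-in for the backup dictionary

def backup(obj, name="last"):
    # Store a copy so later changes to obj do not affect the backup.
    backup_repo[name] = obj.copy()

def restore(name="last"):
    # Hand back a copy so the stored backup stays intact.
    return backup_repo[name].copy()

df = pd.DataFrame({"a": [1, 2]})
backup(df)
df["a"] = 0           # modify the working frame
restored = restore()  # original values come back
```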
- rs(bk_name: Optional[str] = None, inplace=True)
Restore the DataFrame.
- Parameters
bk_name –
inplace –
- Return type
pandas.core.frame.DataFrame
- classmethod set_dirs(root=False)
Set the project root folder.
- Parameters
root (bool, optional) – Indicates whether the current active directory is the root or a sub-directory of the project
- Return type
None
- static set_ipython(node_interactivity: str = 'last')
Set ast_node_interactivity on the IPython InteractiveShell.
- Parameters
node_interactivity (str, optional) – Default is ‘last’. DSX uses ‘all’ if a kernel is detected.
- Return type
None
- classmethod setup_project(root=True, get_xfiles=False, xfiles_url=None, git_files=False)
Set up the project directories for a new project. Directories that already exist are not overwritten.
- Parameters
root (bool, optional) –
get_xfiles (bool, optional) –
git_files (bool, optional) –
- Return type
None
- split(col: str, sep: str, index_label: str = 'RID', drop_innerindex: bool = True, reset_index_inplace: bool = True)
Generate a DataFrame by splitting the values in a string column, where the values are separated by a separator character.
This method improves on the original split method in pandas: where a row contains no separator, its value is still carried over to the newly generated DataFrame.
- Parameters
col (str) –
sep (str) –
index_label (str) –
drop_innerindex (bool) –
reset_index_inplace –
- Return type
pd.core.frame.DataFrame
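A comparable result can be obtained in plain pandas with str.split followed by explode, as sketched below. The helper split_rows is hypothetical and assumes an unnamed index; the long output layout is also an assumption, since the exact shape dsx returns is not described here. Rows without a separator still appear in the output.

```python
import pandas as pd

def split_rows(df, col, sep, index_label="RID"):
    """Hypothetical helper: one output row per separated value."""
    # Keep the original row position in a column named index_label.
    out = df.reset_index().rename(columns={"index": index_label})
    out[col] = out[col].str.split(sep)
    # explode turns each list element into its own row.
    return out.explode(col, ignore_index=True)

df = pd.DataFrame({"tags": ["a;b", "c", "d;e;f"]})
long = split_rows(df, "tags", ";")
```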
- to_dict(key_col: str, val_col: str) → pandas.core.frame.DataFrame
Generate a dictionary from two columns.
- Parameters
key_col (str) –
val_col (str) –
- Return type
pd.core.frame.DataFrame
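Building a key-to-value mapping from two columns is a one-liner in plain pandas, shown here with invented data. If a key appears more than once, the last value wins in this sketch; how dsx handles repeated keys is not stated.

```python
import pandas as pd

df = pd.DataFrame({"code": ["SG", "MY"], "name": ["Singapore", "Malaysia"]})
# Pair each key with its value row by row.
mapping = dict(zip(df["code"], df["name"]))
```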
- to_excel_stringify(dir=None, strings_to_urls_bool=False)
A faster option for exporting an Excel file, with the option to stringify all hyperlinks in the table.
- Parameters
dir –
strings_to_urls_bool –
- static to_numeric(inputString)
Convert a string to a numeric value.
- Parameters
inputString –
- xv(title=None, convert_time=True, width='100%', height='1200', dirhtml='../_temp', dirbase='_temp', **kwargs)
- Parameters
title (str) – Title for the new viewer file.
convert_time (bool) – Convert datetime dtype to str for display.