Documentation: DataScience_Utils (dsx)

class dsx.ds_utils.dsx(pandas_obj)

dsx (same name as the package, but not to be confused with it) contains a collection of wrapper functions that simplify common operations in data analytics tasks. The core module ds_utils (data science utilities) is designed to work with pandas DataFrames.

classmethod activate_lolviz()

Import the lolviz package as lz and add the graphviz directory to os.environ["path"].

Parameters: lolviz_dir (str, optional) –
Return type: lolviz instance

classmethod backup(df, name: str = 'last')

Back up a DataFrame or list (or any object with a .copy() method).

Parameters

df – DataFrame or list (or any object with a .copy() method)
name – Name of the backup, used later to retrieve the data.

Return type

None
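A minimal sketch of the named-backup idea, assuming the backups live in a plain dictionary keyed by name (the merge description in this reference mentions dsx.backup_repo as a dictionary). The functions below are hypothetical stand-ins, not the dsx implementation.

```python
# Hypothetical stand-in for a name-keyed backup store; not the dsx source code.
backup_repo = {}


def backup(obj, name="last"):
    """Store a copy of obj (anything with a .copy() method) under the given name."""
    backup_repo[name] = obj.copy()


def restore(name="last"):
    """Return a copy of the stored object so the backup itself stays untouched."""
    return backup_repo[name].copy()


data = [1, 2, 3]
backup(data)            # saved under the default key 'last'
data.append(4)          # later changes do not affect the backup
restored = restore()
```

Copying on both store and retrieve keeps the saved object isolated from later mutations of either the original or the restored copy.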

bk(bk_name: Optional[str] = None)

Back up the DataFrame.

Parameters: bk_name –
Return type: None

ci(col, n=1000, func=<function mean>, p=0.05)

Generate n bootstrap samples, evaluating func on each resample. This method returns a function, which can be called to obtain the confidence intervals of interest.

Parameters

n (int, optional) – Sample size for the sampling distribution (default = 1,000)
func (function, optional) – The statistic whose sampling distribution is bootstrapped (default = np.mean)
p (float, optional) – p-value specifying the 2-sided symmetric confidence interval

Returns

A function that, when called, returns the specified 2-sided symmetric confidence interval.

Return type

function
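The resampling idea can be illustrated with a percentile bootstrap in NumPy. This is a sketch under assumptions: the helper name bootstrap_ci and the seed argument are hypothetical, and unlike ci() it returns the interval directly rather than a function.

```python
import numpy as np


def bootstrap_ci(values, n=1000, func=np.mean, p=0.05, seed=0):
    """Percentile bootstrap: resample with replacement n times, apply func,
    and take the p/2 and 1 - p/2 quantiles of the resulting statistics."""
    values = np.asarray(values)
    rng = np.random.default_rng(seed)
    stats = np.array([
        func(rng.choice(values, size=values.size, replace=True))
        for _ in range(n)
    ])
    lower, upper = np.percentile(stats, [100 * p / 2, 100 * (1 - p / 2)])
    return lower, upper


sample = np.arange(1, 101)          # illustrative data: 1..100
low, high = bootstrap_ci(sample)    # 95% interval for the mean
```

With p=0.05 the interval spans the 2.5th to 97.5th percentiles of the bootstrapped statistic.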

cols_shift(col_names: Union[str, list], direction: Union[str, int] = 'right')

Shift a list of columns to the left-most or right-most position of the DataFrame. Note: this method has no “inplace” option.

Parameters

col_names (str or list) –
direction (str or int) – As str: ‘left’ or ‘right’. As int: 0 or 1.

Returns

df with reordered columns

Return type

pd.core.frame.DataFrame
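Reordering columns this way can be sketched in plain pandas. The helper shift_columns is hypothetical, and the mapping of 0 to left and 1 to right is an assumption, since the reference does not say which integer means which side.

```python
import pandas as pd


def shift_columns(df, col_names, direction="right"):
    """Return a new DataFrame with col_names moved to one edge."""
    if isinstance(col_names, str):
        col_names = [col_names]
    others = [c for c in df.columns if c not in col_names]
    # Assumption: 0 means left, anything else (including 1) means right.
    if direction in ("left", 0):
        ordered = list(col_names) + others
    else:
        ordered = others + list(col_names)
    return df[ordered]


df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
left = shift_columns(df, "c", "left")      # columns: c, a, b
right = shift_columns(df, ["a"], "right")  # columns: b, c, a
```

The original frame is left unchanged, which matches the note that there is no in-place variant.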

cols_std(inplace=True, camel=False)

Standardize the names of all columns so that they are compatible with IPython. This method removes spaces and special characters from the column names. Once standardized, the column names can be used as attributes of the DataFrame (with autocomplete) in IPython.

Parameters

inplace (bool) –
camel (bool) –

Returns

Returned only when the inplace parameter is set to False

Return type

pandas.core.frame.DataFrame

convert_dtypes()

Convert dtypes to pandas 1.0 dtypes and stringify object columns.

Return type: pd.core.frame.DataFrame

cumsum(col_name: str) → pandas.core.frame.DataFrame

Generate the following from the unique values of a variable:

- Count (raw count of records)
- Percentage of the values over the total data
- Accumulated percentage of the values

Parameters: col_name (str) –
Return type: pd.core.frame.DataFrame
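The three outputs can be reproduced with standard pandas calls. This is a sketch only: the helper name value_cumsum and the output column names are assumptions, not the names dsx uses.

```python
import pandas as pd


def value_cumsum(df, col_name):
    """Count, percentage, and cumulative percentage per unique value."""
    out = df[col_name].value_counts().rename("count").to_frame()
    out["percent"] = out["count"] / out["count"].sum() * 100
    out["cum_percent"] = out["percent"].cumsum()
    return out


df = pd.DataFrame({"grade": ["x", "x", "x", "y", "y", "z"]})
report = value_cumsum(df, "grade")
```

Because value_counts sorts by frequency, the cumulative percentage reads as a Pareto-style running total that ends at 100.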

classmethod del_tempfiles(tempdata=False)

Delete the temporary files of the project.

Parameters: tempdata (bool, optional) – Default is ‘False’. Set to ‘True’ to delete temporary data in the ‘data/temp’ directory.
Return type: None

static delta_todate(num_yyyyddd)

Convert a timedelta to a date.

Parameters: num_yyyyddd –

Return type: datetime.datetime

dump(path: str, compression_level: int = 7)

Dump the DataFrame to the project’s data/temp directory.

Parameters

path (str) –
dir (str, optional) – Default = data/temp
compression_level (int, optional) –

Return type

None

duplicated(colname_list: Union[str, list], return_dups=False, keep: bool = False) → int

Count the duplicated rows, given a list of columns that make up the unique key.

Parameters

colname_list (Union[str, list]) –
return_dups (bool, optional) – Default = False. Set to True to return a tuple containing (count, df_duplicates).
keep (bool, optional) –

Returns

Number of Duplicated Rows

Return type

int
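Counting duplicates on a key can be sketched with pandas' own DataFrame.duplicated. The helper count_duplicates is hypothetical and always returns the (count, duplicates) pair, whereas the method above returns only the count unless return_dups is set.

```python
import pandas as pd


def count_duplicates(df, colname_list, keep=False):
    """Return (number of duplicated rows, the duplicated rows themselves)."""
    if isinstance(colname_list, str):
        colname_list = [colname_list]
    # keep=False flags every member of a duplicate group;
    # keep='first' leaves the first occurrence unflagged.
    mask = df.duplicated(subset=colname_list, keep=keep)
    return int(mask.sum()), df[mask]


df = pd.DataFrame({"id": [1, 1, 2, 3, 3, 3]})
n_all, dups_all = count_duplicates(df, "id")            # all members flagged
n_extra, _ = count_duplicates(df, ["id"], keep="first")  # only the repeats
```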

get_dfname(set=True)

Get the name of the variable.

Only works in IPython.

Parameters: var –
Returns: variable_name
Return type: str

static get_varname(var: object)

Get the name of the variable.

Only works in IPython.

Parameters: var –
Returns: variable_name
Return type: str

info()

Generate the metadata of the DataFrame, which includes the following:

- Column names
- Missing count
- Missing percentage
- Unique value count (nunique)
- Unique value percentage

Return type: pandas.core.frame.DataFrame
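A per-column metadata table of this kind can be assembled from isnull() and nunique(). The helper describe_columns and its column labels are assumptions made for illustration, not the layout dsx produces.

```python
import pandas as pd


def describe_columns(df):
    """One row per column: missing count/percentage and unique count/percentage."""
    total = len(df)
    meta = pd.DataFrame({
        "missing_count": df.isnull().sum(),
        "nunique": df.nunique(),   # NaN is not counted as a value
    })
    meta["missing_pct"] = meta["missing_count"] / total * 100
    meta["nunique_pct"] = meta["nunique"] / total * 100
    return meta


df = pd.DataFrame({"a": [1, None, 3, 3], "b": ["x", "y", "y", None]})
meta = describe_columns(df)
```

The column names of the input become the index of the result, so a single column's figures can be looked up with meta.loc["a"].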

static interactive()

Apply the following settings:

- Set InteractiveShell.ast_node_interactivity = "all"
- Set mpl.use("module://backend_interagg")
- Set plt.ion()

isnull(colname: str) → tuple

Count the rows with missing values in the specified column, and their percentage.

Parameters: colname (str) – Single column name
Returns: (Count of Missing Rows, Percentage of Missing Rows)
Return type: tuple

isnull_list(col_names_list=None) → pandas.core.frame.DataFrame

Generate a report of cases with missing values.

Parameters: col_names_list (list, optional) – List of columns to be included in the report. If not specified, all columns will be used.
Return type: pandas.core.frame.DataFrame

len_compare(df_to_compare, overwrite_df1=None) → tuple

Compare the lengths of two DataFrames (or any other enumerable objects).

Parameters

df_to_compare –
overwrite_df1 (bool, optional) – To ignore this instance of DataFrame and use the DataFrame in parameter as the copy to be compared.

static matplotlib_config()

Print the matplotlib configuration.

Returns: lines of text
Return type: str

merge(right, how='left', on=None, left_on=None, right_on=None, isnull=None) → pandas.core.frame.DataFrame

Merge with another DataFrame. A wrapper around the pandas ‘merge’ method, with additional checking mechanisms. The method also creates a backup of the original DataFrame under the key ‘last’ in dsx.backup_repo (a dictionary).

Parameters

right (pd.core.frame.DataFrame) –
isnull (str) –

Return type

pd.core.frame.DataFrame
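The reference does not say which checks the wrapper performs. As one plausible illustration only, the sketch below uses the indicator option of pandas merge to count left rows that found no match; the helper checked_merge is hypothetical and is not the dsx implementation.

```python
import pandas as pd


def checked_merge(left, right, how="left", on=None):
    """Merge two frames and report how many left rows had no match on the right."""
    merged = left.merge(right, how=how, on=on, indicator=True)
    unmatched = int((merged["_merge"] == "left_only").sum())
    # Drop the helper column so the result looks like an ordinary merge.
    return merged.drop(columns="_merge"), unmatched


left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 2], "y": [10, 20]})
merged, unmatched = checked_merge(left, right, on="id")
```

Comparing len(merged) with len(left) after a left merge is another cheap check, since a larger result signals duplicate keys on the right side.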

nunique(col_names_list=None) → pandas.core.frame.DataFrame

Generate:

- the number of unique values
- the percentage of unique values over the total records (or rows)

Parameters: col_names_list (list) – If not specified, all column names will be used
Return type: pd.core.frame.DataFrame

static plt_labels(percent=False, fontsize=None, color=None, denominator=None)

Insert a label for each element in the current axes (the last chart created).

Parameters

percent (bool) –
fontsize (float) –
color (str) –
denominator (float) –

Return type: None

static progress(iterable: collections.abc.Iterable, counter: int) → str

Return a string template for the progress of a loop operation.

Parameters

iterable (Iterable) –
counter (int) –

Return type

str

rename(col_index_or_name: Union[str, int], col_name_new, inplace: bool = True)

Rename a single column.

Parameters

col_index_or_name (str or int) –
col_name_new –
inplace (bool) –

Returns: renamed_DataFrame – Only if inplace is set to False.
Return type: pd.core.frame.DataFrame

reset_index(index_label: str = 'RID', inplace: bool = True)

Reset the index and immediately rename the old ‘index’ column to the index_label provided.

Parameters

index_label (str, optional) –
inplace (bool, optional) –

Returns

Returned only when inplace is False

Return type

pd.core.frame.DataFrame

classmethod restore(name: str = 'last')

Restore a DataFrame or list (or any object with a .copy() method).

Parameters

df – DataFrame or List (or any object with .copy() method)
name – Name of the backup. To be used to retrieve the data.

Return type

Object

rs(bk_name: Optional[str] = None, inplace=True)

Restore the DataFrame.

Parameters

bk_name –
inplace –

Return type

pandas.core.frame.DataFrame

classmethod set_dirs(root=False)

Set the project root folder.

Parameters: root (bool, optional) – Indicates whether the current active directory is the root or a sub-directory of the project
Return type: None

static set_ipython(node_interactivity: str = 'last')

Set ast_node_interactivity on IPython’s InteractiveShell.

Parameters: node_interactivity (str, optional) – Default is ‘last’. DSX uses ‘all’ if a kernel is detected.
Return type: None

classmethod setup_project(root=True, get_xfiles=False, xfiles_url=None, git_files=False)

Set up the project directories for a new project. Directories that already exist are not overwritten.

Parameters

root (bool, optional) –
get_xfiles (bool, optional) –
git_files (bool, optional) –

Return type

None

split(col: str, sep: str, index_label: str = 'RID', drop_innerindex: bool = True, reset_index_inplace: bool = True)

Generate a DataFrame by splitting the values in a string column, where the values are separated by a separator character.

This method improves upon the original split method in pandas: where a row contains no separator, its value is still carried over to the newly generated DataFrame.

Parameters

col (str) –
sep (str) –
index_label (str) –
drop_innerindex (bool) –
reset_index_inplace –

Return type

pd.core.frame.DataFrame
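The exact shape of the output is not described here. One plausible reading, sketched below with plain pandas, is a long-format table with one row per split value, in which rows without a separator are kept as they are. The helper split_rows is hypothetical and the long-format layout is an assumption.

```python
import pandas as pd


def split_rows(df, col, sep):
    """Split a delimited string column and emit one row per value."""
    out = df.copy()
    out[col] = out[col].str.split(sep)   # each cell becomes a list of parts
    # explode() repeats the other columns for every list element;
    # a cell with no separator is a one-element list and yields one row.
    return out.explode(col).reset_index(drop=True)


df = pd.DataFrame({"RID": [1, 2], "tags": ["a;b", "c"]})
long_df = split_rows(df, "tags", ";")
```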

to_dict(key_col: str, val_col: str) → pandas.core.frame.DataFrame

Generate a dictionary from two columns.

Parameters

key_col (str) –
val_col (str) –

Return type: pd.core.frame.DataFrame

to_excel_stringify(dir=None, strings_to_urls_bool=False)

Faster option for exporting an Excel file, with the option to stringify all hyperlinks in the table.

Parameters

dir –
strings_to_urls_bool –

static to_numeric(inputString)

Convert a string to a numeric value.

Parameters: inputString –

xv(title=None, convert_time=True, width='100%', height='1200', dirhtml='../_temp', dirbase='_temp', **kwargs)

Parameters

title (str) – Title for the new viewer file.
convert_time (bool) – Convert datetime dtype to str for display.
