pandas.DataFrame Operation: Difference between revisions

Latest revision as of 15:36, 24 July 2023

A pandas.DataFrame Operation is a Python-based tabular data structure operation for a pandas.DataFrame.

Context:
- It can (typically) involve a pandas.DataFrame Method.
- It can range from being a pandas.DataFrame Management Operation to being a pandas.DataFrame Query.
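As a rough sketch of the two ends of that range (the classification in the comments is an illustrative reading of the terms, and the column names are hypothetical): a management-style operation changes the frame's structure or contents, while a query only reads from it.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"col1": ["a", "b", "a"], "col2": [1, 2, 3]})

# Management-style operation: changes the frame's structure (here, a column name).
df = df.rename(columns={"col2": "value"})

# Query-style operation: reads from the frame and leaves it unchanged.
selected = df[df["col1"] == "a"]

print(list(df.columns))
print(len(selected))
```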
Example(s)
- Create an empty array:
  - df = pd.DataFrame(columns=['col1','col2']).
  - df = pd.DataFrame(np.zeros(0, dtype=[('col1', 'i8'),('col2', 'S50')])).
- Create a prepopulated array:
  - df = pd.DataFrame([2,7,9]), with auto-gen column name and index names.
  - df = pd.DataFrame({'col1' : [1,2,3,4], 'col2' : pd.Series([1., 2., 3., 4.]) }), with auto-gen index keys.
  - df = pd.DataFrame({'col1' : pd.Series([1., 2., 3.], index=['row3', 'row2', 'row1']), 'col2' : pd.Series([1., 2., 3., 4.], index=['row1', 'row2', 'row3', 'row4']) }), with explicit index keys.
  - df = pd.DataFrame(np.random.randn(10, 2), columns=['colA', 'colB'])
  - mydata = [ {'col0':'A', 'col1':'EB', 'col2':1.1}, {'col0':'B', 'col1':'EB', }, {'col0':'C', 'col1':'PG', 'col2':2.4}, {'col0':'D', 'col1':'PG', 'col2':'7.0'}, ] ; df = pd.DataFrame(mydata) ; df.set_index('col0', inplace=True)
  - mydata = [["str 1"], ["str 2"],] ; pd.DataFrame(mydata, columns=['colA'])
- Update array rows (or add array column):
  - df['colX'] = df['colX'].str.replace("\n","< BR >")
  - df.loc[0] = ["val_0_A", 5.7] ; df.loc[1] = ["val_1_A", 8.8] ; df.loc[2] = ["val_2_A", -0.2]
  - df_tmp = df.ColX.str.extract('us-(....)[-]?(.*)', expand=False) ; df_tmp.columns = ['ColX_A','ColX_B'] ; df.loc[:,'ColX_A'] = df_tmp['ColX_A'] ; df.loc[:,'ColX_B'] = df_tmp['ColX_B'] ;
  - df_yyyymm = df_dates['dtimeCol'].map(lambda x: 100*x.year + x.month).astype('int') ; df_yyyywoy = df_dates['dtimeCol'].map(lambda x: 100*x.year + x.weekofyear).astype('int') ; df3.loc[:,'yyyywoy'] = pd.Series(df_yyyywoy, index=df3.index) ;
- Query an array:
  - dataValue = df.loc[indexValue, 'col2']
  - seriesRecord = df.iloc[7]
  - df.loc[(df.colI=='valX') & (df.colJ=='valY'), ['colK','colL']]
  - df[df['colX'].str.contains("strS")]
  - mask = df['count'] > 2 ; df[mask]
  - df.textItem.apply(lambda s: s.split(' ')).str.len() # token count
  - df.groupby(["colX","colY"]).count(), Group By Query.
  - pd.DataFrame({'count' : df.groupby(["colX","colY"]).size()}).reset_index()
  - pd.DataFrame({'count' : df.groupby(["colX","colY"]).size()}).reset_index().query('(count>4321)')
  - g = df.groupby(['col1']) ; g.count().sort_values('col2', ascending=False) ; g.filter(lambda x: x['col1'].count() > minCount) # Roll-Up Query
  - srs_tokenCount = df.col2.apply(lambda x: pd.Series(x.lower().split(" ")).value_counts()).sum(axis=0)
- Add a column to an array:
  - df['col4'] = df['col3'].str.len() # characters count
  - df['col5'] = df.col3.apply(lambda s: s.split(' > ')) # array with tokenized string
- Delete array rows:
  - g = df.groupby(['col1']) ; df = g.filter(lambda x: x['col2'].count() >= 1) ; df.index = range(0, len(df))
- Query an array's structure:
  - rows, cols = df.shape
  - rows = len(df.index)
- Query an array's metadata:
  - df.columns
  - df.dtypes
- Modify an array's structure:
  - df['c1'] = df['c1'].astype(float)
  - df['c2'] = df['c2'].astype(object)
  - df['dtimeCol'] = df['c2'].astype('datetime64[ns]')
  - df.index = np.random.permutation(range(0, len(df))) ; df.sort_index(inplace=True) # randomly reorder array.
  - if not df.empty: df = df[0:0] # delete all records
  - if 'colX' in df.columns: df = df.drop(columns='colX') # remove a single column
  - df.rename(columns={'oldColName':'newColName'}, inplace=True)
- Iterate over an array:
  - for index, row in df.iterrows(): print(row['colY'], row['colX'])
- Delete an array:
  - import gc ; del df ; gc.collect()
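The one-line examples above can be combined into a small end-to-end sketch. The data and column names here are hypothetical, and the snippet only illustrates one possible way to chain a creation, a masked query, a group-by count, and a row deletion; it is not a canonical recipe.

```python
import pandas as pd

# Create a prepopulated frame from a list of records (hypothetical data).
records = [
    {"col1": "EB", "col2": 1.1},
    {"col1": "EB", "col2": 2.0},
    {"col1": "PG", "col2": 2.4},
    {"col1": "XX", "col2": None},
]
df = pd.DataFrame(records)

# Query with a boolean mask; conditions are combined with & (not `and`).
mask = (df["col1"] == "EB") & (df["col2"] > 1.5)
matches = df.loc[mask, ["col1", "col2"]]

# Group-by query: number of rows per col1 value, as a regular frame.
counts = df.groupby("col1").size().reset_index(name="count")

# Delete rows: keep only groups with at least one non-null col2, then renumber.
kept = df.groupby("col1").filter(lambda g: g["col2"].count() >= 1)
kept = kept.reset_index(drop=True)

# Structure query on the result.
rows, cols = kept.shape
```

Using `.loc` with a mask and a column list in a single step avoids chained indexing, which matters mostly when the selection is later assigned to.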
Counter-Example(s)
- a pandas.Series Operation, on a pandas.Series.
- a numpy.ndarray Operation, on a numpy.ndarray.
- a SciPy Sparse Array Operation.
- a Python Array Operation.
- a Python List Operation.
- a Perl Associative Array Operation.
- a Perl Array Operation.
- a SQL Table Operation.
- an R Array Operation (R DataFrame Operation).
See: pandas.DataFrame Attribute.
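
To illustrate the distinction behind that cross-reference, here is a small sketch with hypothetical column names: attributes such as shape, columns, and empty are read without call parentheses, whereas an operation such as drop is a method call.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"c1": [1, 2, 3], "c2": ["x", "y", "z"]})

# Attributes: plain property access, no parentheses.
n_rows, n_cols = df.shape
column_names = list(df.columns)
is_empty = df.empty

# Operation: a method call that returns a new frame and leaves df unchanged.
without_c2 = df.drop(columns="c2")
```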

References

2014

http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html
- abs() Return an object with absolute value taken.
- add(other[, axis, level, fill_value]) Binary operator add with support to substitute a fill_value for missing data in
- …
- describe([percentile_width]) Generate various summary statistics of each column, excluding
- …
- from_csv(path[, header, sep, index_col, ...]) Read delimited file into DataFrame
- …
- get_value(index, col) Quickly retrieve single value at passed column and index
- get_values() same as values (but handles sparseness conversions)
- groupby([by, axis, level, as_index, sort, ...]) Group series using mapper (dict or key function, apply given function
- …
- isnull() Return a boolean same-sized object indicating if the values are null
- …
- median([axis, skipna, level, numeric_only]) Return the median of the values for the requested axis
- …
- rename_axis(mapper[, axis, copy, inplace]) Alter index and / or columns using input function or functions.
- …
- select(crit[, axis]) Return data corresponding to axis labels matching criteria
- …
- tail([n]) Returns last n row
- …
- to_excel(excel_writer[, sheet_name, na_rep, ...]) Write DataFrame to a excel sheet
- …
- transpose() Transpose index and columns
- …
- sort([columns, column, axis, ascending, inplace]) Sort DataFrame either by labels (along either axis) or by the values in
- sort_index([axis, by, ascending, inplace, kind]) Sort DataFrame either by labels (along either axis) or by the values in
- sortlevel([level, axis, ascending, inplace]) Sort multilevel index by chosen axis and primary level.
- squeeze() squeeze length 1 dimensions
- stack([level, dropna]) Pivot a level of the (possibly hierarchical) column labels, returning a
- …
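
Several methods in this 2014 listing (for example sort, get_value, and from_csv) are no longer available in current pandas releases. The following is a minimal sketch of commonly used replacements, assuming pandas 1.0 or later; the CSV text and column names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content, read from memory instead of a file.
csv_text = "key,val\nb,2\na,1\nc,3\n"

# from_csv(...) -> pd.read_csv(...)
df = pd.read_csv(io.StringIO(csv_text), index_col="key")

# sort(columns=...) -> sort_values(by=...)
by_val = df.sort_values(by="val", ascending=False)

# get_value(index, col) -> .at[index, col]
single = df.at["a", "val"]

# sort_index() remains available for sorting by labels.
by_label = df.sort_index()
```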
