Return to Question

Commonmark migration

edited Jun 10, 2020 at 13:24

Exercise 1: Parsing out variables

Any of the ten variables might appear in a string (src, adid, kw, kwid, mt, dist, qs, adpos, device, placement).
Any of the ten variables might appear in a string (src, adid, kw, kwid, mt, dist, qs, adpos, device, placement).
The value for that parameter will always be after the equal sign, i.e. src=bing (bing is the value being parsed).
At this point, the length is inconsistent for any given parameter.
Not every parameter will appear in every string.

Not every parameter will appear in the same position in every string.
The parameters will always be delimited with an underscore.

The value for that parameter will always be after the equal sign, i.e. src=bing (bing is the value being parsed).
At this point, the length is inconsistent for any given parameter.
Not every parameter will appear in every string.
Not every parameter will appear in the same position in every string.
The parameters will always be delimited with an underscore.

Exercise 2: Absence of parameters (catch all)

If none of that ten parameters appear, then it should all be captured into a catch-all column (i.e., Site Link).

Exercise 1: Parsing out variables

Any of the ten variables might appear in a string (src, adid, kw, kwid, mt, dist, qs, adpos, device, placement).

The value for that parameter will always be after the equal sign, i.e. src=bing (bing is the value being parsed).
At this point, the length is inconsistent for any given parameter.
Not every parameter will appear in every string.
Not every parameter will appear in the same position in every string.
The parameters will always be delimited with an underscore.

Exercise 2: Absence of parameters (catch all)

If none of that ten parameters appear, then it should all be captured into a catch-all column (i.e., Site Link).

Exercise 1: Parsing out variables

Any of the ten variables might appear in a string (src, adid, kw, kwid, mt, dist, qs, adpos, device, placement).
The value for that parameter will always be after the equal sign, i.e. src=bing (bing is the value being parsed).
At this point, the length is inconsistent for any given parameter.
Not every parameter will appear in every string.

Not every parameter will appear in the same position in every string.
The parameters will always be delimited with an underscore.

Exercise 2: Absence of parameters (catch all)

If none of that ten parameters appear, then it should all be captured into a catch-all column (i.e., Site Link).

deleted 39 characters in body; edited title

Source Link

edited Nov 7, 2015 at 3:40

Jamal

edited Nov 7, 2015 at 3:40

Jamal

35.2k
13
134
238

Using Pandas to parse adwords export, Google Analytics

Thanks

edit: Just want to add that comb_str is just re-importing the string itself and then adding it back to the DF in case it's needed to be used in an index or something later.

Using Pandas to parse adwords export, Google Analytics

Thanks

edit: Just want to add that comb_str is just re-importing the string itself and then adding it back to the DF in case it's needed to be used in an index or something later.

Using Pandas to parse adwords export

comb_str is just re-importing the string itself and then adding it back to the DF in case it's needed to be used in an index or something later.

Source Link

asked Oct 28, 2015 at 21:50

mburke05

asked Oct 28, 2015 at 21:50

mburke05

Using Pandas to parse adwords export, Google Analytics

I did this exercise yesterday mostly as practice, but it has some utility as well in day to day. I was basically attempting to take a string that looked like the following:

src=bing_adid=8312488564_kw=swiftcapital com_kwid=35030235383_mt=e_qs=swiftcapital.com_device=c'

And create a series of columns based on the parameters within the string. Here you can see src, kw, kwid, mt, etc.

I was given a set of rules for the string, and used those in my assumptions; as follows:

Exercise 1: Parsing out variables

Any of the ten variables might appear in a string (src, adid, kw, kwid, mt, dist, qs, adpos, device, placement).

The value for that parameter will always be after the equal sign, i.e. src=bing (bing is the value being parsed).

At this point, the length is inconsistent for any given parameter.

Not every parameter will appear in every string.

Not every parameter will appear in the same position in every string.

The parameters will always be delimited with an underscore.

Exercise 2: Absence of parameters (catch all)

If none of that ten parameters appear, then it should all be captured into a catch-all column (i.e., Site Link).

In addition, I used the assumption that an _ and = character both could occur in a string which is why I elected to use regular expressions (this was as much for practical utility as it was just to teach myself some more regex.)

Here's my solution using pandas, csv, and re, please let me know if you can think of an easier/smarter way in python. (I have a feeling I could have hard-coded this using things like FIND() in excel but I want to avoid Excel whenever possible.

# we use pandas and regular expression libraries
import pandas as pd
import re
# dropping all but the 'input' column; dropping all rows in that column that are NaN
df = pd.read_csv('Parsing.csv')
df = df[df['input'].notnull() == True]
df = df[['input']]
# defining params to search for in our keywords
parameters = ['src', 'adid', 'kw', 'kwid', 'mt', 'dist', 'qs', 'adpos', 'device', 'placement']
new_params = [params+'=' for params in parameters]
#split apart parsed data will be stored here in a mock_dataframe which we will reinject to pandas 
mock_df = {'comb_str=': [],
 'src=': [], 
 'adid=': [], 
 'kw=': [], 
 'kwid=': [], 
 'mt=': [], 
 'dist=': [], 
 'qs=' : [], 
 'adpos=' : [], 
 'device=' : [],
 'placement=' : [],
 'catchall' : []}
# nested for loop that looks for the conditions using regular expressions and appends to our mock_df
a = [x for x in df['input']]
for entry in a:
 mock_df['comb_str='].append(entry)
 if any(params in entry for params in new_params):
 mock_df['catchall'].append('')
 for params in new_params:
 if params in entry:
 mock_df[params].append(re.search(params+'(.*?)(?=_[^_=\n]+=|$)',entry).group(1))
 else:
 mock_df[params].append('')
 else:
 for params in new_params:
 mock_df[params].append('')
 mock_df['catchall'].append(entry)
# to get an idea what the new df looks like
df_upd = pd.DataFrame(data=mock_df)
df_upd
# port new DF to a csv for excel use. 
df_upd.to_csv(path_or_buf='parsed.csv')

I'm hoping to get help on removing complexity in my code, and if possible making the code more efficient. I would imagine an input might be 1mil or so of the above strings, with 1mil*#of params defined columns as an output. So it can get messy in a hurry.

Thanks

edit: Just want to add that comb_str is just re-importing the string itself and then adding it back to the DF in case it's needed to be used in an index or something later.

lang-py