Solved: Re: Newbie data processing question - frequency across two columns?

Turn on suggestions

Showing results for

Search instead for

Did you mean:

Options

RSS Feed
Mark Topic as New
Mark Topic as Read
Float this Topic for Current User
Bookmark
Subscribe
Mute
Printer Friendly Page

BookmarkSubscribeRSS Feed

☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.

Sam_SAS

SAS Employee

Go to Solution

Mark as New
Bookmark
Subscribe
Mute
RSS Feed
Permalink
Print
Report Inappropriate Content

Newbie data processing question - frequency across two columns?

Posted Friday (363 views)

Hello,

I work with Visual Analytics (probably not a VA question though) and don't have much expertise at data processing, and I'm trying to figure out how to do the following:

Have:

Lead Writer Assistant Writer Project

Joe Mac, Maggie Earwax for Dummies

Joe Clock Repair

Sam Joe Desk Fans: A Celebration

All Writers

Joe

Mac

Maggie

Sam

Want:

Writer Total Projects Lead Assistant

Joe 3 2 1

Mac 1 0 1

Maggie 1 0 1

Sam 1 1 0

That is, I have a list of writers in one table who might be represented in either of two columns in the second. For Assistant writer it is a "contains" relationship.

I'm sure this is a fairly basic thing to do in Base SAS programming but I don't know which proc/statement you would use. I am reasonably sure this cannot be done directly in VA, although I can do some unpleasant workarounds with parameters.

I am working in a Viya 4 environment.

Any pointers?

Thanks,

Sam

0 Likes

1 ACCEPTED SOLUTION

Accepted Solutions

Patrick

Opal | Level 21

Mark as New
Bookmark
Subscribe
Mute
RSS Feed
Permalink
Print
Report Inappropriate Content

Re: Newbie data processing question - frequency across two columns?

Posted Friday (322 views) | In reply to Sam_SAS

Are your Have tables already loaded into CAS or are these just normal SAS files and data prep could also happen under Compute?
What volumes are you dealing with?

Is your WANT a table or a report?

For creating a table something like below could work. I can't test it but I believe if all your tables are in CAS then the whole process will execute within CAS.

data writer;
 infile datalines truncover dsd dlm='|';
 input writer:40ドル.;
 datalines;
Joe
Mac
Maggie
Sam
;
data lead_assistant_project;
 infile datalines truncover dsd dlm='|';
 input Lead_Writer:40ドル. Assistant_Writer:40ドル. Project:40ドル.;
 datalines;
Joe|Mac,Maggie|Earwax for Dummies
Joe||Clock Repair
Sam|Joe|Desk Fans: A Celebration
;
/* restructure your data into a form that's easier to work with */
data writer_role_project;
 length writer 40ドル role 10ドル;
 set lead_assistant_project;
 keep writer role project;
 
 role='Lead';
 do i=0 to countc(Lead_Writer,',');
 writer=scan(Lead_Writer,i+1,',');
 if not missing(writer) then output;
 end;
 
 role='Assistant';
 do i=0 to countc(Assistant_Writer,',');
 writer=scan(Assistant_Writer,i+1,',');
 if not missing(writer) then output;
 end;
 
run;
proc fedsql;
/* create table want as */
 select 
 coalesce(t1.writer,t2.writer) as writer
 ,sum(case when t1.project='' then 0 else 1 end) as Total_Projects
 ,sum(case when t1.role='Lead' then 1 else 0 end) as Lead
 ,sum(case when t1.role='Assistant' then 1 else 0 end) as Assistant
 from writer_role_project t1
 full join writer t2 
 on t1.writer=t2.writer
 group by coalesce(t1.writer,t2.writer)
 ;
quit;

View solution in original post

0 Likes

9 REPLIES 9

Patrick

Opal | Level 21

Mark as New
Bookmark
Subscribe
Mute
RSS Feed
Permalink
Print
Report Inappropriate Content

Re: Newbie data processing question - frequency across two columns?

Posted Friday (323 views) | In reply to Sam_SAS

Are your Have tables already loaded into CAS or are these just normal SAS files and data prep could also happen under Compute?
What volumes are you dealing with?

Is your WANT a table or a report?

For creating a table something like below could work. I can't test it but I believe if all your tables are in CAS then the whole process will execute within CAS.

data writer;
 infile datalines truncover dsd dlm='|';
 input writer:40ドル.;
 datalines;
Joe
Mac
Maggie
Sam
;
data lead_assistant_project;
 infile datalines truncover dsd dlm='|';
 input Lead_Writer:40ドル. Assistant_Writer:40ドル. Project:40ドル.;
 datalines;
Joe|Mac,Maggie|Earwax for Dummies
Joe||Clock Repair
Sam|Joe|Desk Fans: A Celebration
;
/* restructure your data into a form that's easier to work with */
data writer_role_project;
 length writer 40ドル role 10ドル;
 set lead_assistant_project;
 keep writer role project;
 
 role='Lead';
 do i=0 to countc(Lead_Writer,',');
 writer=scan(Lead_Writer,i+1,',');
 if not missing(writer) then output;
 end;
 
 role='Assistant';
 do i=0 to countc(Assistant_Writer,',');
 writer=scan(Assistant_Writer,i+1,',');
 if not missing(writer) then output;
 end;
 
run;
proc fedsql;
/* create table want as */
 select 
 coalesce(t1.writer,t2.writer) as writer
 ,sum(case when t1.project='' then 0 else 1 end) as Total_Projects
 ,sum(case when t1.role='Lead' then 1 else 0 end) as Lead
 ,sum(case when t1.role='Assistant' then 1 else 0 end) as Assistant
 from writer_role_project t1
 full join writer t2 
 on t1.writer=t2.writer
 group by coalesce(t1.writer,t2.writer)
 ;
quit;

0 Likes

quickbluefish

Barite | Level 11

Mark as New
Bookmark
Subscribe
Mute
RSS Feed
Permalink
Print
Report Inappropriate Content

Re: Newbie data processing question - frequency across two columns?

Posted Friday (316 views) | In reply to Sam_SAS

You could do this, though seems very complicated for what you're ultimately doing (mostly because of the data structure, where variables contain, potentially, lists of delimited names). This assumes one row per project and none with missing lead writer.


data have;
infile cards dsd truncover firstobs=1 dlm="|";
length _lead _assistant project 200ドル;
input _lead _assistant project;
cards;
Joe|Mac, Maggie|A
Joe||B
Sam|Joe|C
;
run;
proc sql noprint;
select count(_assistant) + countc(_assistant, ",") into :nassist trimmed from have;
select count(_lead) + countc(_lead, ",") into :nlead trimmed from have;
quit;
%put &=nassist;
%put &=nlead;
data want;
set have end=last;
array l {&nlead} 50ドル _temporary_;
array a {&nassist} 50ドル _temporary_;
array names {%eval(&nlead+&nassist)} 50ドル _temporary_;
_lead=compress(_lead);
_assistant=compress(_assistant);
nl=countW(_lead, ",");
na=countW(_assistant, ",");
retain tot_l 0 tot_a 0 namenum 0;
do i=1 to nl;
	tot_l+1;
	nm=scan(_lead, i, ',');
	if nm not in names then do;
		namenum+1;
		names[namenum]=nm;
	end;
	l[tot_l]=nm;
end;
if missing(_assistant) then goto next;
do i=1 to na;
	tot_a+1;
	nm=scan(_assistant, i, ',');
	if nm not in names then do;
		namenum+1;
		names[namenum]=nm;
	end;
	a[tot_a]=nm;
end;
next:
if last then do;
	length writer 50ドル total_proj lead assistant 4;
	do nn=1 to namenum;
		lead=0; assistant=0;
		writer=names[nn];
		do i=1 to tot_l;
			if l[i]=writer then lead+1;
		end;
		do i=1 to tot_a;
			if a[i]=writer then assistant+1;
		end;
		total_proj=sum(lead, assistant);
		output;
	end;
end;
keep writer total_proj lead assistant;
run;
proc print data=want; run;

quickbluefish_0-1763165232184.png

0 Likes

Tom

Super User

Mark as New
Bookmark
Subscribe
Mute
RSS Feed
Permalink
Print
Report Inappropriate Content

Re: Newbie data processing question - frequency across two columns?

Posted Friday (294 views) | In reply to Sam_SAS

It will easier if you first put the data into a better form.

First let's convert your listing/report into a dataset so we have something to code with. Let's give the variables actual names instead of just labels so the coding will be easier.

data have;
 infile cards dsd dlm='|' truncover;
 input lead :20ドル. assist :100ドル. Project :30ドル.;
datalines;
Joe|Mac, Maggie|Earwax for Dummies
Joe| |Clock Repair
Sam|Joe|Desk Fans: A Celebration
;

Now let's convert it into a "tall" or "long" structure so we do not have multiple data points stored in one instance of a variable.

data fixed;
 projno+1;
 length name 20ドル role 8ドル ;
 set have;
 role='lead';
 do _n_=1 by 1;
 name=left(scan(lead,_n_,','));
 if name=' ' then leave;
 output;
 end;
 role='assist';
 do _n_=1 by 1;
 name=left(scan(assist,_n_,','));
 if name=' ' then leave;
 output;
 end;
 drop lead assist ;
run;

Now we have data like this:

 projno name role Project
 1 Joe lead Earwax for Dummies
 1 Mac assist Earwax for Dummies
 1 Maggie assist Earwax for Dummies
 2 Joe lead Clock Repair
 3 Sam lead Desk Fans: A Celebration
 3 Joe assist Desk Fans: A Celebration

From which it will be very easy to get counts.

For example you could just plain old PROC FREQ.

proc freq data=fixed;
 tables name*role ;
run;

Results

Tom_0-1763174677662.png

2 Likes

Ksharp

Super User

Mark as New
Bookmark
Subscribe
Mute
RSS Feed
Permalink
Print
Report Inappropriate Content

Re: Newbie data processing question - frequency across two columns?

Posted Friday (280 views) | In reply to Sam_SAS

data writer;
 infile datalines truncover dsd dlm='|';
 input writer:40ドル.;
 datalines;
Joe
Mac
Maggie
Sam
;
data lead_assistant_project;
 infile datalines truncover dsd dlm='|';
 input Lead_Writer:40ドル. Assistant_Writer:40ドル. Project:40ドル.;
 datalines;
Joe|Mac,Maggie|Earwax for Dummies
Joe||Clock Repair
Sam|Joe|Desk Fans: A Celebration
;
data have;
set lead_assistant_project(rename=(Assistant_Writer=_Assistant_Writer));
length Assistant_Writer $ 80;
if not missing(_Assistant_Writer) then do;
do i=1 to countw(_Assistant_Writer,',');
 id+1;
 Assistant_Writer=scan(_Assistant_Writer,i,',');
 output;
end;
end;
else do;id+1;output;end;
drop _Assistant_Writer i;
run;
proc transpose data=have out=temp;
by id Project;
var Lead_Writer Assistant_Writer ;
run;
proc sql;
create table want as
select a.writer,coalesce(Total_Projects,0) as Total_Projects,
 coalesce(Lead,0) as Lead,coalesce(Assistant,0) as Assistant
 from writer as a left join (
select col1,count(distinct Project) as Total_Projects,
 (select count(distinct Project) from temp where col1=a.col1 and _NAME_='Lead_Writer') as Lead,
 (select count(distinct Project) from temp where col1=a.col1 and _NAME_='Assistant_Writer') as Assistant
 from temp as a
 where col1 is not missing
 group by col1 
) as b on a.writer=b.col1;
quit;

Ksharp_0-1763179486641.png

0 Likes

Kurt_Bremser

Super User

Mark as New
Bookmark
Subscribe
Mute
RSS Feed
Permalink
Print
Report Inappropriate Content

Re: Newbie data processing question - frequency across two columns?

Posted Sunday (193 views) | In reply to Sam_SAS

You can do it in one DATA step by using a hash object:

data lead_assistant_project;
 infile datalines truncover dsd dlm='|';
 input Lead_Writer:8ドル. Assistant_Writer:40ドル. Project:40ドル.;
 datalines;
Joe|Mac,Maggie|Earwax for Dummies
Joe||Clock Repair
Sam|Joe|Desk Fans: A Celebration
;
data _null_;
set lead_assistant_project end=done;
if _n_ = 1
then do;
 length
 writer 8ドル
 total lead assistant 8
 ;
 declare hash w (ordered:"yes");
 w.definekey("writer");
 w.definedata("writer","total","lead","assistant");
 w.definedone();
end;
writer = lead_writer;
if w.find() = 0
then do;
 total + 1;
 lead + 1;
 rc = w.replace();
end;
else do;
 total = 1;
 lead = 1;
 assistant = 0;
 rc = w.add();
end;
if assistant_writer ne "" then do i = 1 to countw(assistant_writer,",");
 writer = scan(assistant_writer,i,",");
 if w.find() = 0
 then do;
 total + 1;
 assistant + 1;
 rc = w.replace();
 end;
 else do;
 total = 1;
 lead = 0;
 assistant = 1;
 rc = w.add();
 end;
end;
if done then rc = w.output(dataset:"want");
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

0 Likes

PaigeMiller

Diamond | Level 26

Mark as New
Bookmark
Subscribe
Mute
RSS Feed
Permalink
Print
Report Inappropriate Content

Re: Newbie data processing question - frequency across two columns?

Posted Sunday (175 views) | In reply to Sam_SAS

Perhaps an overview of the process for our self-admitted newbie @Sam_SAS might be helpful.

Arranging the data properly so you can actually perform the next analysis is a very important step, and one that beginners sometimes do not grasp. The best arrangement of the data is the "long" or "tall" data set as described by @Tom . Why is that the best? Because most SAS data analysis PROCs work best on data in this "long" or "tall" arrangement. So once you have this "long" or "tall" data set, a simple call to PROC FREQ (or other relevant PROC) gets the job done.

Now, the actual process of re-arranging such data isn't particularly simple, especially for newbies, and you have received several examples of working code. But this process of re-arranging the data into "long" or "tall" data sets is where you must start.

Note: there are of course examples where a "wide" dataset (rather than "long" or "tall") is required, but these are rare, and unless you have a solid justification for "wide" data sets, it is best to avoid them. See Maxim 19. Also note that when I talk about a "solid justification for wide data sets", this does not include situations where you want a wide output such as a wide report or wide Excel file; this is not a "solid justification for wide data sets". Wide outputs such as a wide report are wide Excel file are most easily produced from "tall" or "long" data sets via PROC REPORT or PROC TABULATE.

--
Paige Miller

0 Likes

Sam_SAS

SAS Employee

Mark as New
Bookmark
Subscribe
Mute
RSS Feed
Permalink
Print
Report Inappropriate Content

Re: Newbie data processing question - frequency across two columns?

Posted yesterday (81 views) | In reply to PaigeMiller

Thanks for all the replies!

This data is in xlsx files that are being loaded into CAS by a Viya job. The tables are not "big data" but I think the All Writers table would be about 75 rows and the other one might be ~1,000 rows. Some individuals in the All Writers table might have 0 projects in a given month.

I am not well versed in how to play with CAS data in Studio although I understand it is possible. I want this processing to be able to run as a scheduled job, so I'm not sure if it would be preferable to do this on the tables before they are loaded into CAS, or if it's fine to do it in CAS. I *think* I have the necessary permissions to do either.

The end goal is to get a table that is easy to read for a 'dashboard' report in VA.

0 Likes

Patrick

Opal | Level 21

Mark as New
Bookmark
Subscribe
Mute
RSS Feed
Permalink
Print
Report Inappropriate Content

Re: Newbie data processing question - frequency across two columns?

Posted 9 hours ago (48 views) | In reply to Sam_SAS

In my opinion CAS is not really the place for data prep and though if it was me, I'd use Compute to read the Excel files and prepare a table that's suitable for VA. It's then this table that I'd load into CAS for VA reporting.

You can create such a process using SAS Studio flow and then schedule the flow.

You also need to look up/skill up how to define your CAS based reporting data so the VA report autoloads the data into CAS after server restart (which will wipe-out all CAS memory based tables).

Did you already try to adopt any of the proposed solutions to your environment? If not then that's what you eventually should be doing next so you can ask targeted follow-up questions if you get stuck.

0 Likes

Sam_SAS

SAS Employee

Mark as New
Bookmark
Subscribe
Mute
RSS Feed
Permalink
Print
Report Inappropriate Content

Re: Newbie data processing question - frequency across two columns?

Posted 4 hours ago (19 views) | In reply to Patrick

I think all of the responses could be marked as the solution. Thanks everyone!

Ultimately what made the most sense was just to have my data admin create the table with a different query. So I have what I wanted out of the box now.

0 Likes

sasinnovate.png

April 27 – 30 | Gaylord Texan | Grapevine, Texas

Registration is open

Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just 495ドル!

Develop Code with SAS Studio

Get started using SAS Studio to write, run and debug your SAS programs.

Watch Develop Code with SAS Studio on YouTube

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats

9 replies
Friday
364 views
2 likes
7 in conversation
- Tom
- PaigeMiller
- Kurt_Bremser
- Patrick
- Sam_SAS
- Ksharp
- quickbluefish

[フレーム]