Showing posts with label queries. Show all posts
Showing posts with label queries. Show all posts
Wednesday, June 14, 2017
Running SQL Queries Against Baby Names from Social Security Card Applications
I decided to look around to see if I can find some interesting data sets
I came across the Baby Names from Social Security Card Applications-National Level Data data set and decided it would be a fun exercise to import this data into SQL Server and run some queries
You can download the zip file from the link above, you will see 137 files with a txt extension in the zip file
Here is what the description is for these files
National Data on the relative frequency of given names in the population of U.S. births where the individual has a Social Security Number
(Tabulated based on Social Security records as of March 5, 2017)
For each year of birth YYYY after 1879, we created a comma-delimited file called yobYYYY.txt. Each record in the individual annual files has the format "name,sex,number," where name is 2 to 15 characters, sex is M (male) or F (female) and "number" is the number of occurrences of the name. Each file is sorted first on sex and then on number of occurrences in descending order. When there is a tie on the number of occurrences, names are listed in alphabetical order. This sorting makes it easy to determine a name's rank. The first record for each sex has rank 1, the second record for each sex has rank 2, and so forth.
To safeguard privacy, we restrict our list of names to those with at least 5 occurrences.
The format of the files looks like this
Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
So these files are simple comma separated values files
Download the zip file from this url: https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data
Unzip the files in a folder on your hard drive, in my case I unzipped it to C:\Names
Let's get started with the fun part
First I created a new DB
CREATE DATABASE Scratch GO USE Scratch GO
Now that we have our database, let's create this very simple table, it will store all the data from the file, I also added a year column to this table
CREATE TABLE Names (FirstName varchar(500), Gender char(1), YearInFile date) GO
Because the file doesn't have a year column, I created this view so that I can bulk insert into the view instead of having to mess around with SSIS or format files
CREATE VIEW myView AS SELECT FirstName,Gender,NameCount FROM Names
I extracted all the files into a folder with the name Names, I placed this on the C drive
Now that you have the table and view created, try running the following
BULK INSERT myView FROM'c:\Names\yob1932.txt'WITH (FIELDTERMINATOR =',')
It should insert 9382 rows
If that is working, it is time to import all 137 files
I am presenting you a solution here so that you can run this all in SSMS, of course you could have as easy used PowerShell, Python or any other language to import these files. Nothing prevents you from doing that :-)
Attention/warning!!
We will be using xp_cmdshell
It is not a best practice to have xp_cmdshell enabled. As a matter of fact beginning with SQL Server 2005, the product ships with xp_cmdshell disabled. If you try to run xp_cmdshell you will get the following message if it is not enabled
Server: Msg 15281, Level 16, State 1, Procedure xp_cmdshell, Line 1
SQL Server blocked access to procedure ‘sys.xp_cmdshell’ of component ‘xp_cmdshell’ because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of ‘xp_cmdshell’ by using sp_configure. For more information about enabling ‘xp_cmdshell’, see “Surface Area Configuration” in SQL Server Books Online.
To enable xp_cmdshell execute the following code
EXECUTE SP_CONFIGURE 'show advanced options', 1 RECONFIGURE WITH OVERRIDE GO EXECUTE SP_CONFIGURE 'xp_cmdshell', '1' RECONFIGURE WITH OVERRIDE GO EXECUTE SP_CONFIGURE 'show advanced options', 0 RECONFIGURE WITH OVERRIDE GO
Now it is time to import all the files. First we want to store all the files that we want to import in a table
We are creating a temp table, we then call a DIR command with the B switch. we delete everything that doesn't end in txt. Finally we add an identity column, we will use this to loop over this table
IF OBJECT_ID('tempdb..#tempList') IS NOT NULL DROP TABLE #tempList CREATE TABLE #tempList ([FileName] VARCHAR(500)) --plain vanilla dos dir command with /B switch (bare format) INSERT INTO #tempList EXEC MASTER..XP_CMDSHELL 'dir c:\Names /B' --delete the null values DELETE #tempList WHERE [FileName] IS NULL -- Delete all the files that don't have txt extension DELETE #tempList WHERE [FileName] NOT LIKE '%.txt' --this will be used to loop over the table ALTER TABLE #tempList ADD id int IDENTITY go
Let's see what we have in the table
SELECT * FROM #tempList
You will see something like this
FileNameid
yob1880.txt1
yob1881.txt2
yob1882.txt3
yob1883.txt4
yob1884.txt5
yob1885.txt6
....................
Now it is time to import all the files, here is the code that will import all 137 files, I put comments in the code so you know what it does
DECLARE @LoopID int, @MaxID int
SELECT @LoopID = min(id),@MaxID = max(ID)
FROM #tempList
SET NOCOUNT ON
-- if you want to run this multiple time, wipe out the data
TRUNCATE TABLE Names
-- Change this to your directory
DECLARE @Directory varchar(50) = 'c:\Names\'
DECLARE @FileName varchar(500),@FullFileName varchar(500)
DECLARE @SQL NVARCHAR(1000)
Declare @year varchar(10)
-- loop over all the file
WHILE @LoopID <= @MaxID
BEGIN
SELECT @FileNAme = filename
FROM #tempList
WHERE id = @LoopID
-- concatenate the directory and filename
SELECT @FullFileName = @Directory + @FileName
BEGIN
SELECT @SQL = N'BULK INSERT myView
FROM ''' + @FullFileName +''' WITH (FIELDTERMINATOR ='','')'
-- you can remove this..this basically prints what will be executed
PRINT @SQL
-- run the command EXEC SP_EXECUTESQL @SQL -- this grabs the year part from the file name SELECT @year = right(parsename(@filename,2),4) -- This will update the rows we just inserted to have the year value UPDATE Names SET YearInFile = @year WHERE YearInFile is null END --Get the next id, instead of +1 we grab the next value in case of skipped id values SELECT @LoopID = min(id) FROM #tempList WHERE id > @LoopID END
And now that we have the data we can run some queries
Let's see if we have all the data
SELECT COUNT(*) FROM names --1,891,894That should return 1,891,894 rows
let's see..did the baby boomers parents really get it on after WW II?
SELECT SUM(nameCount) as SumCount,YearInFile FROM names group by YearInFile ORDER BY SumCount DESC
SumCountYearInFile
-----------
42000971957
41565761959
41548091960
41401601961
41318021958
41211301956
40354991962
40129331955
39932062007
39804071954
Looks like it... but what happened in 2007, top of the housing bubble was 2006/2007 so people started buying houses and needed to populate those rooms with humans :-)
What are the top 10 most popular names over time?
SELECT TOP 10 FirstName, gender,SUM(Namecount) as Alltime FROM names GROUP BY FirstName,gender ORDER BY 3 DESC
Here are the results, looks like males don't have as many unique names
FirstName gender Alltime
James M 5136026
John M 5105834
Robert M 4809176
Michael M 4338144
Mary F 4120692
William M 4087525
David M 3601032
Joseph M 2592856
Richard M 2560771
Charles M 2378806
What were the most popular names in 1880?
SELECT TOP 10 FirstName, gender,SUM(Namecount) as [1880] FROM names where YearInFile = 1880 GROUP BY FirstName,gender ORDER BY 3 DESC
Here are those results
FirstName gender 1880
John M 9655
William M 9532
Mary F 7065
James M 5927
Charles M 5348
George M 5126
Frank M 3242
Joseph M 2632
Anna F 2604
Thomas M 2534
For 2016
SELECT TOP 10 FirstName, gender,SUM(Namecount) as [2016] FROM names where YearInFile = 2016 GROUP BY FirstName,gender ORDER BY 3 DESC
Now we have some names which I recognize since my kids friends are named like hte ones on this list
FirstName gender 2016
Emma F 19414
Olivia F 19246
Noah M 19015
Liam M 18138
Ava F 16237
Sophia F 16070
William M 15668
Mason M 15192
James M 14776
Isabella F 14722
I guess nobody likes February
SELECT FirstName , count(*) FROM names WHERE FirstName in('January','February','March','April','May','June', 'July','August','September','October','November','December') GROUP BY FirstName
FirstName Count
April 135
August 206
December 56
January 59
July 73
June 273
March 41
May 213
November 37
October 34
September 62
I wonder how many of these names are the result of someone putting the month in the first name field :-)
How many names are both male and female?
;WITH cte as( SELECT FirstName FROM names GROUP BY FirstName HAVING (COUNT( DISTINCT Gender )>1)) SELECT FirstName, gender, count(*) FROM Names n WHERE EXISTS (SELECT * FROM cte c where c.FirstName = n.FirstName) GROUP BY FirstName, Gender ORDER BY 1
As you can see, there are a lot
Here are some where number of males and females doesn't differ that much
FirstName gender Count
Courtney F 123
Courtney M 126
Cruz F 121
Cruz M 114
Tom F 62
Tom M 137
One more query and that is it for this post..
How many people are named after weekdays?
SELECT FirstName , count(*) FROM names WHERE FirstName in('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday') GROUP BY FirstNameHere are the results
FirstNameCount
------------- -----------------
Friday 4
Monday 5
Sunday 89
Tuesday 59
Wednesday 51
Last Monday I saw Tuesday who told me that she was going to see Friday on Sunday....
Yep, crystal clear...
That is it for this post.... feel free to run your own queries and if you find some interesting things, feel free to leave me a comment with the results or the query
One more query for you, I added this at the bottom since there are 52 rows in the result
What is the most popular letter a name starts with?
Here is that query, there are a couple of counts here:, the total count, the occurance and the distinct count of a name starting with the letter across all years
SELECT LEFT(FirstName,1) AS Initial, gender, SUM(Namecount) AS SumCount, COUNT(NameCount) AS TotalCount , COUNT(DISTINCT firstname) AS DistinctCount FROM names GROUP BY left(FirstName,1),gender ORDER BY 3 DESCHere are the results
Initial gender SumCount TotalCount DistinctCount
J M 29,396,710 63,543 3,995
M F 19,855,157 105,326 5,583
A F 16,744,520 117,270 7,501
R M 16,329,781 46,772 2,135
J F 14,887,120 79,019 4,622
D M 14,375,824 62,369 3,380
S F 13,587,753 90,382 6,059
M M 12,717,485 51,078 2,643
C M 12,652,140 51,820 2,255
C F 12,640,899 80,401 4,159
L F 12,099,095 86,872 4,473
A M 11,611,302 66,139 3,594
K F 11,065,934 74,797 4,967
D F 9,723,142 66,592 3,916
E F 9,644,797 49,929 2,477
B F 8,485,706 39,199 2,251
T M 8,331,252 42,525 2,592
B M 8,038,739 37,772 1,701
S M 7,586,177 42,524 2,524
R F 7,216,686 51,892 2,837
E M 7,122,968 36,781 1,542
W M 7,073,357 17,052 577
L M 6,594,743 42,492 1,799
G M 6,587,004 26,930 1,095
T F 5,913,655 61,915 4,277
K M 5,744,651 44,110 3,096
P F 5,024,379 19,759 1,090
N F 4,315,987 36,667 2,507
H F 4,203,255 19,717 1,159
P M 4,181,853 15,150 666
G F 4,083,933 28,566 1,398
H M 3,962,710 26,661 1,128
V F 3,546,832 23,367 1,083
N M 3,354,035 20,738 1,180
F M 3,048,688 15,174 592
F F 2,084,083 17,670 842
I F 1,966,942 17,719 1,066
I M 1,350,255 11,261 573
W F 1,235,983 9,462 569
V M 1,183,415 11,609 559
O M 1,112,184 15,521 724
O F 951,873 12,914 718
Z M 926,015 9,284 833
Y F 656,093 11,343 1,072
Z F 525,515 13,263 1,139
Q M 183,280 3,684 296
X M 172,430 1,175 127
Y M 163,545 7,031 579
U M 93,783 2,875 152
Q F 75,612 2,663 274
X F 54,170 1,466 150
U F 50,450 1,654 169
Not that many names that start with a letter at the end of the alphabet
Subscribe to:
Posts (Atom)