Assignment - Topical Databasing
CS 5800 - Database Systems
Utah State University
Due date: Monday, Dec. 4 (23:59). Late submissions will have 20% deducted.


There are two parts to this assignment. One part is to write some queries using a NoSQL database. The second part is to perform some data mining.


Turn in the files queries.txt and mining.pdf using Canvas. You may turn in your assignment as many times as you like.


The assignment will be graded for functionality.


You may work in groups of at most two. I will assume that the groups are the same as for the SQL, part 2 assignment. E-mail me if your group has dissolved.


Your task is to install a NoSQL database using CouchDB and write some queries. Download version 1.7.1; on a Mac, use brew install to download and install this version. If that version does not work (on some Windows 10 boxes it does not), then use version 1.6.1, which can be found at https://www.apache.org/dist/couchdb/binary/win/1.6.1/.

Using CouchDB

Download and configure the software for your system. Then create a database, and within that database a document. To that document add the data in the file data.txt.
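
The two creation steps can also be done over CouchDB's HTTP API instead of through the web interface. The following is a minimal sketch, not a required method. It assumes CouchDB is listening on its default address http://127.0.0.1:5984, that data.txt holds a JSON array of records, and that the records belong under a "data" field (because the sample map function reads doc.data). The names "topical" and "people" and the helper functions are hypothetical.

```javascript
// Sketch: describe the two HTTP requests that create a database and a document.
// Assumptions: default CouchDB address; "topical" and "people" are placeholder names.
function buildSetupRequests(baseUrl, dbName, docId, records) {
  const dbUrl = `${baseUrl}/${encodeURIComponent(dbName)}`;
  return [
    // PUT /{db} creates the database.
    { method: "PUT", url: dbUrl, body: null },
    // PUT /{db}/{docid} creates the document with the records under "data".
    {
      method: "PUT",
      url: `${dbUrl}/${encodeURIComponent(docId)}`,
      body: JSON.stringify({ data: records }),
    },
  ];
}

// Sends the requests in order (needs a runtime with fetch, e.g. Node 18+).
async function runSetup(requests) {
  for (const r of requests) {
    const res = await fetch(r.url, {
      method: r.method,
      headers: { "Content-Type": "application/json" },
      body: r.body === null ? undefined : r.body,
    });
    console.log(r.method, r.url, res.status);
  }
}

// One made-up record stands in for the parsed contents of data.txt.
const requests = buildSetupRequests("http://127.0.0.1:5984", "topical", "people", [
  { name: "Jane Doe", age: 30 },
]);
```

Calling runSetup(requests) against a running server would send both requests; the sketch only builds them, so it can be inspected without a server.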


Turn in a text file, queries.txt, containing solutions in JSON to the following queries. Format the file as follows (the values of the fields will vary!).
----View 1: Count the randomArrayItems
{
   "_id": "_design/countRandomArrayItems",
   "_rev": "2-b4e0bf693aea17edffd8a05ea80b9989",
   "language": "javascript",
   "views": {
       "countRandomArrayItems": {
           "map": "function(doc) {\n  for (var i in doc.data) {\n    var person = doc.data[i];\n    emit(person.name, 1);\n  }\n}",
           "reduce": "function(keys, values, rereduce) {\n  return sum(values);\n}"
       }
   }
}

----View 2: Names of people with max age by gender and isActive ...
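
The sample design document above pairs a map function with a reduce function. The following sketch simulates, in plain JavaScript outside CouchDB, how such a pair might be evaluated, including the rereduce step in which CouchDB combines partial results. The records, the chunk size, and the runReduce helper are invented for illustration; this is not CouchDB's actual implementation.

```javascript
// Simulation only: runs a map/reduce pair over an in-memory document.
// The document shape ({ data: [...] }) follows the sample map function;
// the records themselves are made up.
const doc = { data: [{ name: "Ann" }, { name: "Bob" }, { name: "Cy" }, { name: "Di" }] };

// Map: emit one (key, value) row per person.
function map(doc, emit) {
  for (const person of doc.data) {
    emit(person.name, 1);
  }
}

// Reduce: adding the values works for raw rows and for partial results alike.
function reduce(keys, values, rereduce) {
  return values.reduce((a, b) => a + b, 0);
}

// Collect the emitted rows.
const rows = [];
map(doc, (key, value) => rows.push({ key, value }));

// Reduce the rows in chunks, then rereduce the partial results.
function runReduce(rows, chunkSize) {
  const partials = [];
  for (let i = 0; i < rows.length; i += chunkSize) {
    const chunk = rows.slice(i, i + chunkSize);
    // On a first pass, keys are [key, docid] pairs; the docid is omitted here.
    partials.push(reduce(chunk.map(r => [r.key, null]), chunk.map(r => r.value), false));
  }
  // On rereduce, keys is null and values holds earlier reduce results.
  return reduce(null, partials, true);
}

const total = runReduce(rows, 3);
console.log(total);
```

Because the reduce adds its inputs, the total does not depend on how the rows are split into chunks.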

  1. Count the randomArrayItems.


  2. Name(s) of the person(s) with the maximum age, and their age(s) by gender and whether they are active or not.


  3. A count of the people by tags, that is, count all the people with the given value in the tags array.


  4. The average age of people by company.


  5. The JSON of the latitude, longitude, and address of each employee that has a latitude of more than 80.


  6. The names of people, together with the names of their friends that start with the letter "J", for people who have at least one friend whose name starts with the letter "J".


Association Rule Mining

The task is to become a data scientist and analyze the data in the following dataset: mushroom.arff (if you are using Orange, use the CSV file mushroom.arff.csv instead). Write up your investigation and turn it in using the turnin page as mining.pdf. The data is described in the header of the mushroom.arff file as well as online at the UCI Machine Learning Repository: mushroom data.


I recommend using Weka. Documentation on the tool can be found at http://www.cs.waikato.ac.nz/ml/weka/documentation.html.

Another good tool is Orange. If you use Orange, be sure to download mushroom.arff.csv as your dataset.

You may also use another tool if you prefer; there are many.


The task is to do association rule mining. For each of the following questions explain how you figured it out and give the rule(s) that support your conclusion.
  1. What color mushrooms should you avoid eating?
  2. What are the properties or characteristics of edible mushrooms?
  3. Are there any interesting observations that relate the odor, color, and/or habitat of a mushroom?
Describe how you used the association rule builder, why you chose the builder you did, and why the rule is interesting.

You may get good results by limiting the number of attributes in the data set (using the Preprocess step). Note that association rule miners work best on "nominal" rather than "numeric" data.
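
Association rule miners generally rank rules by measures such as support and confidence. As a rough illustration of what those two numbers mean, the sketch below computes them for one rule over a handful of nominal records. The records and attribute names are invented and are not taken from mushroom.arff; the helper functions are hypothetical and do not reproduce any particular tool's algorithm.

```javascript
// Toy nominal records, invented for illustration.
const records = [
  { outlook: "sunny", play: "no" },
  { outlook: "sunny", play: "no" },
  { outlook: "rainy", play: "yes" },
  { outlook: "overcast", play: "yes" },
  { outlook: "sunny", play: "yes" },
];

// True if the record has every attribute=value pair in items.
function matches(record, items) {
  return Object.entries(items).every(([attr, value]) => record[attr] === value);
}

// Support: fraction of records that contain all the items.
function support(records, items) {
  return records.filter(r => matches(r, items)).length / records.length;
}

// Confidence of lhs => rhs: support(lhs and rhs) / support(lhs).
function confidence(records, lhs, rhs) {
  return support(records, { ...lhs, ...rhs }) / support(records, lhs);
}

// Rule: outlook=sunny => play=no
const lhs = { outlook: "sunny" };
const rhs = { play: "no" };
const ruleSupport = support(records, { ...lhs, ...rhs });   // 2 of 5 records
const ruleConfidence = confidence(records, lhs, rhs);       // 2 of the 3 sunny records
console.log(ruleSupport, ruleConfidence);
```

A rule with high confidence but very low support covers few records, which is one reason to report both numbers when explaining why a rule is interesting.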

  Copyright © 2017 by Curtis Dyreson. All rights reserved.