problem17.pdf

School

University of Colorado, Denver *

Course

6040

Subject

Computer Science

Date

Apr 28, 2024

Type

pdf

4/26/24, 11:09 PM problem17 file:///Users/mingjingyang/Desktop/OMSA/6040/cse6040-pfx/problem17/problem17.html 1/25

Problem 17: Spectral graph partitioning

Version 1.3

Changelog:
1.3: Added another more informative error message. [Dec 5, 2019]
1.2: Added more hints and more informative error messages. [Dec 3, 2019]
1.1: Added more examples; no changes to code or test cells. [Dec 2, 2019]
1.0: Initial version

In this problem, you'll consider the data mining task of "clustering" a graph. That is, given a graph or network of relationships, can you identify distinct communities within it?

This problem assesses your Pandas and Numpy skills.

Exercises. There are six exercises, numbered 0-5, worth a total of ten (10) points. However, only Exercises 0-4 require you to write code. If you have done them correctly, then Exercise 5, which consists only of a hidden test cell, should pass when submitted to the autograder.

Regarding dependencies and partial credit:
Exercise 1 (1 point) depends on a correct Exercise 0 (2 points).
Exercises 2 (2 points) and 3 (2 points) are independent. Neither depends on Exercises 0 or 1.
Exercise 4 (2 points) relies on all earlier exercises.
Exercise 5 (1 point) depends on Exercise 4.

As always, it is possible that you will pass Exercises 0-4 but not pass Exercise 5 if there is a subtle bug that the test cells happen not to catch, so be prepared for that possibility!

Setup

The main modules you'll need are pandas, numpy, and scipy. The rest are auxiliary functions for loading data, plotting results, and test code support. Start by running the following cell.

In [1]:
    # Main modules you'll need:
    import numpy as np
    import scipy as sp
    import pandas as pd
    from pandas import DataFrame

    # Support code for data loading, code testing, and visualization:
    import sys
    sys.path.insert(0, 'resource/asnlib/public')
    from cse6040utils import tibbles_are_equivalent, pandas_df_to_markdown_table, hidden_cell_template_msg, cspy

    from matplotlib.pyplot import figure, subplots
    %matplotlib inline

    from networkx.convert_matrix import from_pandas_edgelist
    from networkx.drawing.nx_pylab import draw
    from networkx import DiGraph
    from networkx.drawing import circular_layout

    # Location of input data:
    def dataset_path(base_filename):
        return f"resource/asnlib/publicdata/{base_filename}"

Background: Relationship networks and partitioning

Suppose we have data on the following five people, stored in the nodes data frame (run the next cell):

In [2]:
    nodes = DataFrame({'name': ['alice', 'bob', 'carol', 'dave', 'edith'],
                       'age': [35, 18, 27, 57, 41]})
    nodes

Out[2]:
        name  age
    0  alice   35
    1    bob   18
    2  carol   27
    3   dave   57
    4  edith   41

Also suppose we have some information on their relationships, as might be captured in a social network or database of person-to-person transactions. In particular, if some person a follows another person b, we say a is the source and b is the target. For the people listed above, suppose these relationships are stored in a data frame named edges (run this code cell):

In [3]:
    edges = DataFrame({'source': ['alice', 'alice', 'dave', 'dave', 'dave', 'bob'],
                       'target': ['dave', 'edith', 'alice', 'edith', 'carol', 'carol']})
    edges

Out[3]:
      source target
    0  alice   dave
    1  alice  edith
    2   dave  alice
    3   dave  edith
    4   dave  carol
    5    bob  carol

We can visualize these relationships as a directed graph or directed network, where the people are shown as nodes (circles) and the follows-relationships are shown as edges from source to target. Run the next code cell to see the relationships in our example.

In [4]:
    G = from_pandas_edgelist(edges, source='source', target='target', create_using=DiGraph())
    figure(figsize=(4, 4))
    draw(G, arrows=True, with_labels=True, pos=circular_layout(G),
         node_size=1200, node_color=None, font_color='w',
         width=2, arrowsize=20)
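To make the source/target convention concrete, here is a minimal sketch that reads a few facts about the directed graph straight from the edges table, without plotting. It rebuilds the same `edges` data frame so that it runs on its own; the names `out_degree`, `in_degree`, and `mutual` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Same example relationships as the `edges` data frame above.
edges = pd.DataFrame({
    'source': ['alice', 'alice', 'dave', 'dave', 'dave', 'bob'],
    'target': ['dave', 'edith', 'alice', 'edith', 'carol', 'carol'],
})

# Out-degree: how many people each person follows (arrows leaving a node).
out_degree = edges['source'].value_counts()

# In-degree: how many followers each person has (arrows entering a node).
in_degree = edges['target'].value_counts()

# Mutual follows: edges whose reverse also appears in the table.
# Swapping the column labels reverses every edge; an inner merge keeps
# only the edges present in both directions.
reversed_edges = edges.rename(columns={'source': 'target', 'target': 'source'})
mutual = edges.merge(reversed_edges, on=['source', 'target'])

print(out_degree)
print(in_degree)
print(mutual)
```

In the plot, a row of `mutual` corresponds to one direction of a double-headed arrow.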

Observation 0: From the edges data frame, recall that alice follows edith; therefore, there is an arrow (or edge) pointing from alice to edith. Since alice and dave both follow one another, there is a double-headed arrow between them.

Observation 1: One can arguably say there are two distinct "groups" in this picture. One group consists of alice, dave, and edith, where every pair is connected by at least one one-way follow-relationship. Similarly, bob and carol have a one-way relationship between them. However, there is only one relationship between someone from the first group and someone from the second group.

In this problem, we might ask a data mining question, namely, whether we can automatically identify these clusters or groups, given only the known relationships. The method you will implement is known as spectral graph partitioning or spectral graph clustering, which is formulated as a linear algebra problem.

Exercises

Exercise 0 (2 points -- 0.5 exposed, 1.5 hidden). For our analysis, we won't care whether a follows b or b follows a, only that there is some interaction between them. To do so, let's write some code to "symmetrize" the edges. That is, if there is a directed edge from node a to node b, then symmetrize will ensure there is also a directed edge from b to a, unless one already exists. (Recall Notebook 10!)

For example, a symmetrized version of the edges data frame from above would look like the following:

    source  target
    alice   dave
    alice   edith
    bob     carol
    carol   bob
    carol   dave
    dave    alice
    dave    carol
    dave    edith
    edith   alice
    edith   dave

Complete the function, symmetrize_df(edges), below, so that it symmetrizes edges. Assume that edges is a pandas DataFrame with source and target columns. Your function should return a new pandas DataFrame with the edges symmetrized. Your function should also reset the index, so that the output is a proper "tibble."
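Returning to Observation 1, the claim that only one relationship crosses the two groups can be checked directly from the edges table. The sketch below assumes the group labels are assigned by hand from the observation (the `group` mapping is hypothetical and is not produced by any algorithm in this notebook); it only counts edges, it does not find the groups.

```python
import pandas as pd

# Same example relationships as the `edges` data frame above.
edges = pd.DataFrame({
    'source': ['alice', 'alice', 'dave', 'dave', 'dave', 'bob'],
    'target': ['dave', 'edith', 'alice', 'edith', 'carol', 'carol'],
})

# Hand-assigned labels taken from Observation 1 (an assumption, not computed).
group = {'alice': 0, 'dave': 0, 'edith': 0, 'bob': 1, 'carol': 1}

# Look up the group of each endpoint, then flag edges whose endpoints differ.
src_group = edges['source'].map(group)
tgt_group = edges['target'].map(group)
crosses = src_group != tgt_group

num_cross = int(crosses.sum())       # edges between the two groups
num_within = int((~crosses).sum())   # edges inside a single group
cross_edges = edges[crosses]

print(num_cross, num_within)
print(cross_edges)
```

A partitioning method is doing well, informally, when few edges cross between groups relative to the number that stay inside them.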

Note 0: The order of the edges in the output does not matter.

Note 1: You may assume the input data frame has columns named 'source' and 'target'.

Note 2: You should drop any duplicate edges. In the example, the edges 'dave' -> 'alice' and 'alice' -> 'dave' already exist in the input. Therefore, observe that they appear in the output, but only once each.

Note 3: Your function should work even if there is a "self-edge," i.e., an edge from a node to itself. The example above does not contain such a case, but the hidden test might check it.

In [5]:
    def symmetrize_df(edges):
        assert 'source' in edges.columns
        assert 'target' in edges.columns
        ### BEGIN SOLUTION
        from pandas import concat
        # Swapping the column labels reverses every edge.
        edges_transpose = edges.rename(columns={'source': 'target', 'target': 'source'})
        # Stack the original and reversed edges, then drop repeated rows.
        edges_all = concat([edges, edges_transpose], sort=False) \
                        .drop_duplicates() \
                        .reset_index(drop=True)
        return edges_all
        ### END SOLUTION

    # Demo of your function:
    symmetrize_df(edges)

Out[5]:
      source target
    0  alice   dave
    1  alice  edith
    2   dave  alice
    3   dave  edith
    4   dave  carol
    5    bob  carol
    6  edith  alice
    7  edith   dave
    8  carol   dave
    9  carol    bob
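Notes 2 and 3 can be exercised on a tiny input that contains both a self-edge and an edge pair that is already symmetric. The sketch below restates the concat-and-drop-duplicates approach from the solution cell so that it runs on its own; the `tiny` data frame and its node names are made up for illustration and are not the hidden test's data.

```python
import pandas as pd

def symmetrize_df(edges):
    # Same idea as the solution cell: stack the edges with their reverses,
    # then drop repeated rows and rebuild a clean 0..n-1 index.
    flipped = edges.rename(columns={'source': 'target', 'target': 'source'})
    return pd.concat([edges, flipped], sort=False) \
             .drop_duplicates() \
             .reset_index(drop=True)

# Hypothetical input: a self-edge x -> x, plus x -> y and y -> x.
tiny = pd.DataFrame({'source': ['x', 'x', 'y'],
                     'target': ['x', 'y', 'x']})

result = symmetrize_df(tiny)
pairs = set(result.itertuples(index=False, name=None))

# Every edge should have its reverse, and no row should repeat.
is_symmetric = all((b, a) in pairs for (a, b) in pairs)
has_no_duplicates = len(pairs) == len(result)

print(result)
print(is_symmetric, has_no_duplicates)
```

The self-edge reverses to itself, so `drop_duplicates` leaves exactly one copy of it.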

In [6]:
    # Test cell: `ex0_symmetrize_df__visible` (1 point)
    edges_input = DataFrame({'source': ['alice', 'alice', 'dave', 'dave', 'dave', 'bob'],
                             'target': ['dave', 'edith', 'alice', 'edith', 'carol', 'carol']})
    edges_output = symmetrize_df(edges_input)

    # The following comment block suggests there is hidden content in this cell, but there really isn't.
    ### BEGIN HIDDEN TESTS
    from os.path import isfile
    if not isfile(dataset_path('symmetrize_soln.csv')):
        symmetrize_df_soln0__ = edges_output.sample(frac=1) \
                                            .sort_values(by=['source', 'target'])
        print(pandas_df_to_markdown_table(symmetrize_df_soln0__))
        symmetrize_df_soln0__.to_csv(dataset_path('symmetrize_soln.csv'), index=False)
    ### END HIDDEN TESTS

    edges_output_soln = pd.read_csv(dataset_path('symmetrize_soln.csv'))
    assert tibbles_are_equivalent(edges_output, edges_output_soln), \
           "Your solution does not produce the expected output."
    print("\n(Passed.)")

(Passed.)
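The test cell relies on `tibbles_are_equivalent` from the course's `cse6040utils` module, whose implementation is not shown here. Because Note 0 says edge order does not matter, such a check has to ignore row order. The sketch below is one way an order-insensitive comparison could be written; `frames_match_ignoring_order` is a hypothetical stand-in, not the course's function, and it may differ from it in details such as dtype handling.

```python
import pandas as pd

def frames_match_ignoring_order(A, B):
    """Hypothetical helper: True if A and B hold the same columns and the
    same rows, regardless of row order or column order."""
    if set(A.columns) != set(B.columns):
        return False
    cols = sorted(A.columns)
    # Put both frames into a canonical form: fixed column order, rows sorted
    # on every column, and a fresh 0..n-1 index.
    A_canon = A[cols].sort_values(by=cols).reset_index(drop=True)
    B_canon = B[cols].sort_values(by=cols).reset_index(drop=True)
    return A_canon.equals(B_canon)

# Same two edges, listed in a different row order and column order.
A = pd.DataFrame({'source': ['alice', 'dave'], 'target': ['dave', 'alice']})
B = pd.DataFrame({'target': ['alice', 'dave'], 'source': ['dave', 'alice']})
# One edge missing.
C = pd.DataFrame({'source': ['alice'], 'target': ['dave']})

same_AB = frames_match_ignoring_order(A, B)
same_AC = frames_match_ignoring_order(A, C)
print(same_AB, same_AC)
```

Sorting on all columns gives each frame a canonical row order, so two symmetrized outputs that differ only in ordering compare as equal.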
