Community Detection Proseminar - Elementary Data Mining Techniques by Simon Grätzer

Community Detection
         Proseminar - Elementary Data Mining Techniques
         by Simon Grätzer
Friday, 1 February 2013
Content
                     What is Community Detection?

                     Motivation

                     Defining a community

                     Methods to find communities

                         Overlapping communities

                           Clique percolation method

                         Finding a community with query nodes

                     Conclusion

What is Community Detection?
            Different from traditional clustering

            Algorithms exploit the graph structure, i.e. the
            edges between nodes

            Graphs with a "natural" origin have a
            structure that is not random

            We try to find these structures by
            analyzing the graph

            A "perfect" solution has yet to be
            found

Motivation
                     Communities can represent parts of a larger system
                     (like organs in the human body)
                     Communities can be considered as a summary of
                     the graph
                     Communities make it easy to visualize and
                     understand complex systems
                     Communities on the web might represent pages of
                     related topics
                     Communities can reveal properties of the graph
                     without disclosing private information about individuals
Defining a Community
                     There is no exact definition of a community in a
                     graph

                     It depends on the application

                     A general definition:

                         Separation between nodes in different
                         communities

                         Cohesion between nodes in a community

                     The differences between algorithms come down to
                     the precise definition

Basics
                     For a graph G = (V, E) and a subgraph C ⊆ G with
                     |G| = |V| = n and |C| = n_c

                     The internal density φ_int(C) should be noticeably
                     higher than the average link density of the whole
                     graph, and the external density φ_ext(C) should be
                     much lower

                     Local definitions see a community as an
                     autonomous entity within a larger system

                     Global definitions see the communities as
                     essential parts of a larger system

                     Vertex similarity: compare individual nodes and
                     group them based on a similarity measure
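
The slide text does not spell out the two density measures. The following is a minimal sketch, assuming the usual intra-cluster and inter-cluster density definitions: internal edges divided by n_c(n_c-1)/2, and boundary edges divided by n_c(n-n_c). The function name `densities` and the example graph are hypothetical and only serve as illustration.

```python
def densities(edges, nodes, community):
    """Return (phi_int, phi_ext, phi_graph) for the node set `community`.

    Assumes an undirected simple graph, 2 <= |community| < |nodes|.
    """
    community = set(community)
    n, n_c = len(nodes), len(community)
    # Edges with both endpoints inside the community.
    internal = sum(1 for u, v in edges if u in community and v in community)
    # Edges with exactly one endpoint inside the community.
    external = sum(1 for u, v in edges if (u in community) != (v in community))
    phi_int = internal / (n_c * (n_c - 1) / 2)
    phi_ext = external / (n_c * (n - n_c))
    phi_graph = len(edges) / (n * (n - 1) / 2)  # average link density of G
    return phi_int, phi_ext, phi_graph


# Hypothetical example: two triangles joined by a single edge.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
phi_int, phi_ext, phi_graph = densities(edges, nodes, {1, 2, 3})
print(phi_int, phi_ext, phi_graph)
```

For the triangle {1, 2, 3} the internal density is higher than the density of the whole graph and the external density is lower, which is the pattern expected of a community.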
Methods

                     Finding overlapping communities

                         Clique percolation method (CPM)

                     Finding communities with query nodes

Clique Percolation Method
                     CPM is based on the idea that communities are
                     likely to consist of cliques

                     Assumption: Every node in the same community is
                     connected to nearly every other node

                     A community is built from a chain of adjacent
                     k-cliques

                     Two k-cliques are adjacent if they share k-1 nodes

                         The largest possible chain is defined as a community

                     This is a local definition
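
The definition above can be followed literally on a small graph. The sketch below enumerates all k-cliques by brute force and merges those sharing k-1 nodes with a union-find structure. The name `cpm_communities` and the example graph are hypothetical, and the brute-force enumeration is only viable for tiny inputs.

```python
from itertools import combinations


def cpm_communities(adj, k):
    """Merge adjacent k-cliques (sharing k-1 nodes) into communities."""
    nodes = sorted(adj)
    # Brute force: every k-subset whose node pairs are all connected.
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(v in adj[u] for u, v in combinations(c, 2))]
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Two k-cliques are adjacent if they share exactly k-1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical example graph, given as an edge list.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (4, 5), (4, 6), (5, 6), (6, 7)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

communities = cpm_communities(adj, 3)
print(communities)
```

In this example node 4 belongs to two communities, which illustrates the overlap CPM allows, while node 7 lies in no triangle and is assigned to none.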
Implementation of CPM

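                     The number of k-cliques in a graph can be very
                     large, so enumerating them directly is expensive

                     Implementations search for maximal cliques
                     instead (still an NP-hard problem)

                     We build a clique-clique overlap matrix O, where
                     O_ij is the number of nodes shared by cliques i and j

                     All off-diagonal entries smaller than k-1 and all
                     diagonal entries smaller than k are removed; the
                     connected components that remain are the communities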
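
The matrix-based procedure can be sketched as follows. Maximal cliques are found with a basic Bron-Kerbosch recursion without pivoting, which is a choice made for this sketch and not necessarily what CFinder uses. The overlap matrix is then thresholded and its connected components are read off. All function names and the example graph are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


def overlap_communities(adj, k):
    # Dropping cliques smaller than k corresponds to removing
    # diagonal entries of O that are smaller than k.
    cliques = sorted((c for c in maximal_cliques(adj) if len(c) >= k),
                     key=sorted)
    m = len(cliques)
    # O[i][j] = number of nodes shared by cliques i and j.
    O = [[len(cliques[i] & cliques[j]) for j in range(m)] for i in range(m)]
    # Keep only off-diagonal entries of at least k-1.
    linked = [[i != j and O[i][j] >= k - 1 for j in range(m)]
              for i in range(m)]
    seen, result = set(), []
    for start in range(m):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:  # depth-first search over linked cliques
            i = stack.pop()
            members |= cliques[i]
            for j in range(m):
                if linked[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        result.append(sorted(members))
    return O, sorted(result)


# Hypothetical example graph, given as an edge list.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4),
         (4, 5), (4, 6), (5, 6), (6, 7)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

O, communities = overlap_communities(adj, 3)
print(O)
print(communities)
```

Working on maximal cliques keeps the matrix small: a maximal clique of size s stands in for all the k-cliques it contains.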

[Figure: results of processing the example graph with the CFinder software, for parameter k = 3 and k = 4]

Drawbacks
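                     Although the underlying problem is NP-hard, the
                     algorithm is reasonably fast on large sparse
                     graphs

                     Some cases lead to useless results:

                         It looks for cliques, not for dense subgraphs
                         in general

                         It requires a large number of cliques, but not too
                         many: with too few, most nodes belong to no
                         community; with too many, the result can be one
                         trivial community covering the whole graph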

Finding a community with query nodes
                     The goal is to find a connected subgraph H that
                     contains a given set Q of query nodes and is
                     densely connected

                     A goodness function f(H) is maximized over all
                     possible choices of H

                         In this case we choose the minimum degree for f

                     Additionally, a distance constraint d limits how far
                     the nodes of H may be from the query nodes

Without size restriction: Greedy algorithm
            Choose f = f(H) = minimum degree of a node in H

            We set G_0 = G, then repeat the following steps:

                   Obtain G_{t+1} from G_t by removing a node that violates
                   the distance constraint or has the minimum degree

                   Terminate when a query node has minimum degree or
                   the query nodes are no longer connected

            Among all graphs G_t, we choose the component containing the
            query nodes for which the minimum degree f(H) is maximized

            This can be implemented in O(n + m) time for n nodes and m edges
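
The greedy peeling loop can be sketched as below. This is a simplified illustration, not the linear-time implementation: it recomputes degrees and components in every round and leaves out the distance constraint. The names `greedy_community` and `component_of` and the example graph are hypothetical, and the query set is assumed to be non-empty.

```python
def component_of(adj, alive, start):
    """Nodes reachable from `start` using only nodes in `alive`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in alive and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def greedy_community(adj, query):
    query = set(query)
    alive = set(adj)                 # node set of the current graph G_t
    best, best_score = None, -1
    anchor = next(iter(query))
    while True:
        comp = component_of(adj, alive, anchor)
        if not query <= comp:
            break                    # query nodes are no longer connected
        degree = {u: len(adj[u] & alive) for u in alive}
        score = min(degree[u] for u in comp)   # f(H), H = component with Q
        if score > best_score:
            best, best_score = comp, score
        d_min = min(degree.values())
        if any(degree[q] == d_min for q in query):
            break                    # a query node has minimum degree
        # Remove one node of minimum degree (ties broken by node id).
        victim = min(u for u in alive if degree[u] == d_min)
        alive.remove(victim)
    return best, best_score


# Hypothetical example: a 4-clique {1, 2, 3, 4} with a sparse tail 5, 6, 7.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

community, min_degree = greedy_community(adj, {1, 2, 3})
print(community, min_degree)
```

In the example the tail nodes 6, 7 and 5 are peeled off one after another, after which every remaining node has degree 3 and the loop stops because a query node now has minimum degree.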
[Figure: the greedy algorithm, without size constraint, applied to the example graph with Q = {1, 2, 3}]

Communities with size restriction
          A size constraint k makes the problem NP-hard (this can be
          shown via a reduction from the Steiner tree problem)

          However, the size of the result can be assumed to correlate
          with the distance constraint: a tighter d tends to give a
          smaller community

          Sozio and Gionis (KDD 2010) propose two heuristics:
                GreedyDist repeatedly executes Greedy and decreases d until the size k' of the
                resulting graph is small enough

                GreedyFast first restricts the graph to the k' nodes closest to the query nodes,
                then invokes Greedy on the restricted graph
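
The restriction step of GreedyFast can be sketched as follows. Ranking nodes by the sum of squared hop distances to the query nodes is an assumption of this sketch, as are the names `bfs_distances` and `restrict_to_closest` and the example graph. Greedy would then be run on the returned subgraph.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def restrict_to_closest(adj, query, k_prime):
    """Induced subgraph on the k' nodes closest to the query nodes."""
    per_query = [bfs_distances(adj, q) for q in query]
    # Only nodes reachable from every query node are candidates.
    reachable = [v for v in adj if all(v in d for d in per_query)]

    def score(v):
        # Assumed closeness measure: sum of squared hop distances.
        return sum(d[v] ** 2 for d in per_query)

    keep = set(sorted(reachable, key=lambda v: (score(v), v))[:k_prime])
    return {u: adj[u] & keep for u in keep}


# Hypothetical example: a 4-clique {1, 2, 3, 4} with a sparse tail 5, 6, 7.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
         (4, 5), (5, 6), (5, 7), (6, 7)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

sub = restrict_to_closest(adj, [1, 2, 3], 5)
print(sorted(sub))
```

Because the restriction happens once, before Greedy runs, this heuristic avoids the repeated executions that GreedyDist needs.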

Evaluation with the DBLP dataset
    The goal was to find a network of scientific collaboration around Christos Papadimitriou

Conclusion

                     A really broad topic with lots of applications

                     Each algorithm is built with a different problem in
                     mind

                     Algorithms are difficult to compare because there
                     is no standard way of testing them
