Project.TODO
Project Week 1:
✔ Write up IO routine for test matrices and vectors @done (14-10-24 16:44)
✔ Write up "MatrixVectorIO.scala" as an object that provides utility methods @done (14-10-24 16:44)
✔ Write up map reduce routine for multiplying COO matrix and local vector @done (14-10-26 15:49)
✔ Write up SparseUtility.scala, an object that provides multiplication routines (can we make that a member of CoordinateMatrix?) @done (14-10-26 15:49)
1. Map (i, j, a(i,j)) to (i, a(i, j) * input(j))
2. ReduceByKey: Key-type = Long, Value-type = Double, Reduce Function = Addition
3. Collect and transform into a local vector // This requires an iteration over the entries
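The three steps above can be sketched locally with plain Scala collections, where `groupBy` plus a sum stands in for Spark's `reduceByKey`; the `entries` and `input` values here are made-up test data, not the project's:

```scala
// COO entries (i, j, a(i,j)) of a 2x2 matrix, and a local dense vector.
val entries = Seq((0L, 0L, 2.0), (0L, 1L, 3.0), (1L, 1L, 4.0))
val input   = Array(1.0, 2.0)

val result: Map[Long, Double] =
  entries
    .map { case (i, j, v) => (i, v * input(j.toInt)) }    // step 1: map
    .groupBy(_._1)                                        // step 2: reduce by key...
    .map { case (i, terms) => (i, terms.map(_._2).sum) }  // ...with addition
// step 3 (collect into a local vector) corresponds to the final map here
```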
Questions:
✔ Can we make multiplication a member of Coordinate Matrix? Or overload the operator *? @done (14-10-30 14:23)
Later merge into the class
✔ Can we generalize the matrix to adapt to more data types like Int, Boolean (for a pattern matrix)? @done (14-10-30 14:25)
✔ If "vectors" have Int size, what if the size of the matrix is too large for Int? (Although not likely.) Then why do we make the matrix Long? @done (14-10-30 14:31)
✔ Any other suggestions for testing? Performance analysis? @done (14-10-30 14:37)
Use Breeze linalg library in unit tests
Check out Aydin Buluc
Project Week 2:
✔ Write up left multiplication of vectors: the same routine, just with i and j swapped @done (14-10-30 13:08)
✔ Write up Scala unit tests using Breeze linear algebra library locally @done (14-11-02 16:28)
✔ Learn how to use Git and test results locally using the Breeze library @done (14-11-02 21:32)
✔ Set up a new Git repository for the project and upload all the code @done (14-11-02 21:32)
Possible Test Programs:
1. Matrix Market
2. Random Matrix Java Applet
Project Week 3:
✔ Test suite for multiply does not pass: breeze/storage/DefaultArrayValue class def not found @Debug @done (14-11-03 15:45)
No need to rewrite the Breeze matrix IO code; just transform the Coordinate matrix into a Breeze dense matrix
✔ Write up new classes for Coordinate matrices and MatrixEntry so that it caters to more data types @Important @done (14-11-03 15:45)
✔ Write up IO routines to handle symmetric matrix forms as well @done (14-11-06 12:21)
✔ Repartition the entries so that each partition contains a row @done (14-11-08 22:31)
✔ Do a "reduce" operation on each partition, and then a "collect" operation over all the partitions. @done (14-11-08 22:31)
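Assuming each partition holds the (row, product) pairs of exactly one row, the per-partition reduce followed by a collect can be sketched locally; here a `Seq` of `Seq`s stands in for the repartitioned RDD, and the data is made up:

```scala
// Each inner Seq plays the role of one partition, holding one row's terms.
val parts: Seq[Seq[(Long, Double)]] = Seq(
  Seq((0L, 2.0), (0L, 6.0)),  // row 0
  Seq((1L, 8.0))              // row 1
)

// Reduce within each partition, then "collect" the per-row sums.
val rowSums: Map[Long, Double] =
  parts.map(_.reduce { case ((i, a), (_, b)) => (i, a + b) }).toMap
```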
✔ Optimize the routines for symmetric multiplication, and a partition so that each processor has a subset of the rows. @done (14-11-13 11:15)
✔ Test with medium-size matrices (nnz = 10000) and, possibly, large matrices (nnz = 100000) @done (14-11-08 23:36)
☐ Test on Edison by remote connection @Important
Problem:
✔ It seems that the lower the num of partitions, the faster the program @done (14-11-09 17:22)
✔ Multiplying matrices as dense ones locally is much faster (100x) than using the Spark routine @done (14-11-09 17:22)
It seems that the most expensive operation is the "map", because each coordinate incurs a "match" operation
Performance is expected to improve if the type is known beforehand, i.e. by using implicit numeric ops
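One way to avoid the per-entry "match" is an implicit `Numeric` context, which resolves the value type once at the call site; a minimal sketch (`scaleAndSum` is an illustrative name, not a project function):

```scala
// The arithmetic is dispatched through the implicit Numeric instance,
// resolved once per call rather than matched once per entry.
def scaleAndSum[T](values: Seq[T], factor: T)(implicit num: Numeric[T]): T = {
  import num._
  values.map(_ * factor).foldLeft(num.zero)(_ + _)
}
```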
--- ✄ -------------------------- ✄ -------------------------- ✄ -------------------------- ✄ -------------------------- ✄ -------------------------- ✄ ---------------------
Project Week 4:
✔ Optimize the code, repartition should happen on input data, that is Entries: RDD[MatrixEntry] when reading the matrix in @done (14-11-11 17:48)
✔ Define a "rowForm" and a "colForm" members of the class, each is a repartition of data according to row/col respectively @done (14-11-11 20:51)
However, the level of parallelism is not configured yet, it depends on the number of processors in the cluster
✔ Try to do the local multiplication using Breeze sparse routines @done (14-11-15 22:19)
The performance is improved a lot by doing this
✔ Try to handle the "pattern" format @done (14-11-24 17:01)
✔ Implement CSR format, there are two ways to go about it, local and global. @done (14-11-18 00:11)
1. Partition as subsets of rows
2. Map each partition to a local breeze.CSCMatrix @Difficult
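Step 2 above (one local sparse matrix per partition) can be sketched without Breeze by assembling raw CSR arrays; `toLocalCSR` and its layout assumptions (entries pre-sorted by row, partition holding a contiguous row block) are illustrative, not the project's code:

```scala
// Build CSR arrays (rowPtr, colIdx, vals) for a partition holding rows
// [rowStart, rowStart + numRows); entries assumed sorted by (row, col).
def toLocalCSR(entries: Seq[(Int, Int, Double)], rowStart: Int, numRows: Int)
    : (Array[Int], Array[Int], Array[Double]) = {
  val rowPtr = new Array[Int](numRows + 1)
  entries.foreach { case (i, _, _) => rowPtr(i - rowStart + 1) += 1 } // row counts
  for (r <- 1 to numRows) rowPtr(r) += rowPtr(r - 1)                  // prefix sums
  (rowPtr, entries.map(_._2).toArray, entries.map(_._3).toArray)
}
```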
☐ Try to implement "Power Method" on sparse matrices, or even eigenvalue solvers. @Optional
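For reference, power iteration itself is simple once a multiply exists; here is a dense local sketch, with a plain matrix-vector product standing in for the distributed routine (this estimates the dominant eigenvalue's magnitude for well-behaved symmetric matrices):

```scala
// Plain dense mat-vec, standing in for the Spark SpMV routine.
def matVec(m: Array[Array[Double]], v: Array[Double]): Array[Double] =
  m.map(row => row.zip(v).map { case (a, x) => a * x }.sum)

// Power method: repeatedly multiply and renormalize; the norm of A*v
// (with ||v|| = 1) estimates the dominant eigenvalue's magnitude.
def powerMethod(m: Array[Array[Double]], iters: Int): Double = {
  var v = Array.fill(m.length)(1.0 / math.sqrt(m.length))
  var lambda = 0.0
  for (_ <- 0 until iters) {
    val w = matVec(m, v)
    lambda = math.sqrt(w.map(x => x * x).sum)
    v = w.map(_ / lambda)
  }
  lambda
}
```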
Project Week 5:
✔ Implement the multiplication routine on CSR matrix @done (14-11-18 16:41)
Current problem is that the matrix does not seem to capture the last column of input data
☐ Partition the input matrix into even smaller blocks and store them (2D partitioning)
The problem is that this could be inefficient, since we are then multiplying with only part of the vector as well
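The 2D scheme can be made concrete by mapping each entry to a block index; `blockOf` and the block sizes are illustrative, and each block would then need only the slice of the vector covering its columns:

```scala
// Entry (i, j) lands in block (i / blockRows, j / blockCols); the block in
// column-block c only ever touches input(c * blockCols until (c + 1) * blockCols).
def blockOf(i: Long, j: Long, blockRows: Long, blockCols: Long): (Long, Long) =
  (i / blockRows, j / blockCols)
```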
✔ A potential problem now is that numRows of the matrix is of type Long, but the output is obtained by "collect", which yields an (Int-indexed) array. @done (14-11-24 17:02)
Solved with a new class Vector, that supports collect and saveAsTextFile
✔ Try to use GraphX's mapReduceTriplets function to do the multiplication @done (14-11-24 16:12)
Current partition is not as efficient
✔ Test basic cases on current CSR implementation @done (14-11-23 17:41)
☐ A possible optimization is to partition the matrix so that each partition has roughly the same number of nonzeros
This also falls into the family of partitioning problems.
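A simple greedy take on this (contiguous row chunks of roughly equal nonzero count) might look like the following; `balancedSplits` is an illustrative helper, not the project's code:

```scala
// Split rows into up to numParts contiguous chunks of ~equal nnz.
// rowNnz(r) = nonzeros in row r; returns the starting row of each chunk.
def balancedSplits(rowNnz: Array[Int], numParts: Int): Array[Int] = {
  val quota = math.ceil(rowNnz.sum.toDouble / numParts)
  val starts = scala.collection.mutable.ArrayBuffer(0)
  var acc = 0.0
  for (r <- rowNnz.indices) {
    if (acc >= quota && starts.length < numParts) { starts += r; acc = 0.0 }
    acc += rowNnz(r)
  }
  starts.toArray
}
```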
Project Week 7:
✔ Figure out how to test codes on Edison system @done (14-12-01 23:04)
☐ Read the three papers on partitioning
1. Hypergraph-partitioning based decomposition
2. On two dimensional sparse matrix partitioning
3. A nested dissection approach to sparse matrix
✔ Also need a new class to capture result (large vectors) @done (14-12-03 23:12)
1. This class has to support element-wise operations
2. This class has to also support dot operations
✔ But the problem is that if we use this longVector class, the multiplication is very slow, in both the map and the collect @done (14-12-04 15:01)
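The class described above, index/value pairs with element-wise and dot operations, can be sketched locally like this (the `LongVector` name and `Map`-backed storage are stand-ins; the real class would wrap an RDD of pairs):

```scala
// Long-indexed sparse vector; a Map stands in for RDD[(Long, Double)].
case class LongVector(entries: Map[Long, Double]) {
  // Element-wise addition over the union of indices.
  def +(other: LongVector): LongVector =
    LongVector((entries.keySet ++ other.entries.keySet).map { i =>
      i -> (entries.getOrElse(i, 0.0) + other.entries.getOrElse(i, 0.0))
    }.toMap)

  // Dot product: only indices present in both vectors contribute.
  def dot(other: LongVector): Double =
    entries.iterator.map { case (i, v) => v * other.entries.getOrElse(i, 0.0) }.sum
}
```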
Spark launch script on Edison:
1. qsub -I -q ccm_int -l walltime=00:30:00 -l mppwidth=48
2. module load spark
   ccmlogin -V
   (now you are on a CCM node)
3. start-all.sh (starts the master and worker nodes)
Now you can run spark-shell or spark-submit