PyFlink-G05

Team Members

Team Member Name	WorkSpace Link
1. JayaShankar Mangina	https://github.com/jyshnkr/PyFlink-G05/tree/main/JayaShankar
2. Pariveshita Thota	https://github.com/jyshnkr/PyFlink-G05/tree/main/Pariveshita
3. Abhilash Ramavaram	https://github.com/jyshnkr/PyFlink-G05/tree/main/Abhilash
4. Madhu A	https://github.com/jyshnkr/PyFlink-G05/tree/main/Madhu
5. Sai Naga Anu Teja Gunda	https://github.com/jyshnkr/PyFlink-G05/tree/main/Anutej
6. Nandini Kandi	https://github.com/jyshnkr/PyFlink-G05/tree/main/Nandini

Description

A big data project that implements Google's PageRank algorithm using Apache Flink with Python (PyFlink).

Apache Flink

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink's pipelined runtime system enables the execution of both bulk/batch and stream processing programs. Flink provides a high-throughput, low-latency streaming engine as well as support for event-time processing and state management. Flink applications are fault-tolerant in the event of machine failure and support exactly-once semantics. Programs can be written in Java, Scala, Python, and SQL and are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment.

1. Page rank

PageRank (PR) is an algorithm used by Google Search to rank websites in their search engine results. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

2. Algorithm Explanation:

The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes, called “iterations”, through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.
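
The idea of a probability distribution refined over several passes can be sketched in a few lines of Python. This is a minimal illustration, not this project's Flink implementation: it assumes the normalized variant of the formula, in which the constant term is (1 - d) / N so that the ranks sum to 1, and it assumes every page has at least one outbound link. The function name `pagerank_distribution` and the three-page graph are hypothetical.

```python
def pagerank_distribution(links, d=0.85, iterations=20):
    """Run a fixed number of PageRank passes over a small link graph.

    links maps each page to the list of pages it links to.
    Assumes every page has at least one outbound link.
    """
    pages = sorted(links)
    n = len(pages)
    # Start with the distribution divided evenly among all pages.
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a base share of (1 - d) / n.
        new_ranks = {page: (1.0 - d) / n for page in pages}
        for source, targets in links.items():
            # A page splits its damped rank evenly among its outbound links.
            share = d * ranks[source] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph used only for illustration.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank_distribution(links)
total = sum(ranks.values())
print(ranks, total)
```

Because each pass only redistributes the existing rank, the values keep summing to 1, which is what lets them be read as the likelihood of a random surfer landing on each page.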

3. Algorithm Explanation with Example:

PageRank (PR) is an algorithm used by Google Search to rank websites in its search engine results. It measures the importance of a page in order to estimate how good a website is.

It is not the only algorithm used by Google to order search engine results.

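This section explains what PageRank is, presents a simplified version of the algorithm, and works through one iteration on a small example.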

What is PageRank?

PageRank is a vote, cast by all other pages on the web, about how important a particular page is.

A link to a page counts as a vote of support.
Each time a page is referred to by a forward link from another page, that adds to the page's value.
Each time the page is the target of a link from a preceding page, that also adds to its value.

Simplified algorithm of PageRank:

Equation:

PR(A) = (1-d) + d[PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)]
Where:

PR(A) = Page Rank of a page (page A)

PR(Ti) = Page Rank of pages Ti which link to page A

C(Ti) = Number of outbound links on page Ti

d = Damping factor which can be set between 0 and 1.
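
As a rough sketch, the equation can be written as a small Python function. The name `page_rank_of`, its parameters, and the sample pages `T1` and `T2` are hypothetical and chosen only to mirror the symbols defined above.

```python
def page_rank_of(inbound, ranks, out_counts, d=0.85):
    """Evaluate PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)).

    inbound    -- the pages Ti that link to page A
    ranks      -- current PageRank PR(Ti) of each page
    out_counts -- number of outbound links C(Ti) of each page
    d          -- damping factor between 0 and 1
    """
    return (1.0 - d) + d * sum(ranks[t] / out_counts[t] for t in inbound)


# Hypothetical input: two pages link to A, with 2 and 4 outbound links.
value = page_rank_of(
    inbound=["T1", "T2"],
    ranks={"T1": 1.0, "T2": 1.0},
    out_counts={"T1": 2, "T2": 4},
)
print(round(value, 4))  # 0.15 + 0.85 * (0.5 + 0.25) = 0.7875
```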

Suppose we have three pages, A, B and C, where:

A links to B and C

B links to C

C links to A
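
One possible way to hold this three-page graph in Python is a dictionary of outbound links, from which the outbound link counts C(Ti) and the inbound links of each page can be derived. The variable names are illustrative assumptions, not part of this repository.

```python
# Outbound links of each page, as described above.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

# C(Ti): number of outbound links on each page.
out_counts = {page: len(targets) for page, targets in links.items()}

# Invert the graph to find which pages link to each page.
inbound = {page: [] for page in links}
for source, targets in links.items():
    for target in targets:
        inbound[target].append(source)

print(out_counts)
print(inbound)
```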

Calculate Page Rank:

The final PageRank of a page is determined only after many iterations. What happens in each iteration?

Note: We use the standard damping factor d = 0.85 and assume that the initial PageRank of every page is 1.

Iteration 1:

Page Rank of page A:

PR(A) = (1-d) + d[PR(C)/C(C)]   # Only page C links to page A
           = (1-0.85) + 0.85[1/1] # Number of outbound links of page C = 1 (only to A)
           = 0.15 + 0.85
           = 1

Page Rank of page B:

PR(B) = (1-d) + d[PR(A)/C(A)]                        # Only page A links to page B
           = (1-0.85) + 0.85[1/2]                    # Number of outbound links of page A = 2 (B and C)
           = 0.15 + 0.425                            # and the PageRank of A is 1 (calculated in the previous
           = 0.575                                   # step)

Page Rank of page C:

Both page A and page B link to page C.

Number of outbound links of page A [C(A)] = 2 (i.e. page C and page B)
Number of outbound links of page B [C(B)] = 1 (i.e. page C)

PR(A) = 1  (result from the previous step, not the initial PageRank)
PR(B) = 0.575 (result from the previous step)
PR(C) = (1-d) + d[PR(A)/C(A) + PR(B)/C(B)]
           = (1-0.85) + 0.85[(1/2) + (0.575/1)]
           = 0.15 + 0.85[0.5 + 0.575]
           = 1.06375
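
The arithmetic of this first iteration can be checked with a short Python sketch. It assumes, as the hand calculation does, that pages are updated in the order A, B, C and that each page immediately uses the newest values of the pages computed before it. The variable names are illustrative.

```python
d = 0.85  # standard damping factor

# Graph from the example: A links to B and C, B links to C, C links to A.
inbound = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}
out_counts = {"A": 2, "B": 1, "C": 1}

# Initial PageRank of every page is 1.
ranks = {"A": 1.0, "B": 1.0, "C": 1.0}

# One iteration, updating in place so later pages see earlier results.
for page in ["A", "B", "C"]:
    ranks[page] = (1 - d) + d * sum(
        ranks[source] / out_counts[source] for source in inbound[page]
    )

for page, value in ranks.items():
    print(f"PR({page}) = {value:.5f}")
# PR(A) = 1.00000
# PR(B) = 0.57500
# PR(C) = 1.06375
```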

This is how PageRank is calculated in each iteration. In practice, it can take 100, 1000 or even more iterations to arrive at the final PageRank score.
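
Repeating the iteration until the values stop changing can be sketched as follows. The stopping rule (largest change below a tolerance of 1e-9), the cap of 1000 iterations, and the function name `iterate_pagerank` are assumptions made for this illustration. The source only states that many iterations are needed.

```python
def iterate_pagerank(inbound, out_counts, d=0.85, tol=1e-9, max_iterations=1000):
    """Repeat in-place PageRank updates until the largest change is below tol."""
    ranks = {page: 1.0 for page in inbound}  # initial PageRank of 1 per page
    for iteration in range(1, max_iterations + 1):
        largest_change = 0.0
        for page in sorted(inbound):
            updated = (1 - d) + d * sum(
                ranks[source] / out_counts[source] for source in inbound[page]
            )
            largest_change = max(largest_change, abs(updated - ranks[page]))
            ranks[page] = updated
        if largest_change < tol:
            break
    return ranks, iteration


# Same three-page graph as in the worked example.
inbound = {"A": ["C"], "B": ["A"], "C": ["A", "B"]}
out_counts = {"A": 2, "B": 1, "C": 1}
final_ranks, iterations_used = iterate_pagerank(inbound, out_counts)
print(final_ranks, iterations_used)
```

For this small graph the loop stops long before the cap, and page C ends with the highest score because it receives links from both A and B.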

Reference

Contributor's Space

Abhilash Ramavaram

I have added a detailed PageRank explanation with an example.

Anu Teja Gunda

I have added the group image and the reference links for the content and examples we used in this repository.

JayaShankar Mangina

I have added a table to the README and allocated space for the contributors' workspace links.
Added the Contributor's Space to the README, where contributors can add their work and contributions.
Moderated the content of the README and made the necessary changes.

Madhu A

I researched PageRank and the PageRank algorithm and posted my findings.

Pariveshita Thota

I did some groundwork to find interesting information about Apache Flink and gathered the important content. I then created a new file, added it to the wiki page on GitHub, and provided a reference for my content.

Nandini Kandi

I have initiated the wiki for our repository and created sub-pages namely contributions, work, issues, and also allocated sub-pages for every team member.

References

Apache Flink - https://en.wikipedia.org/wiki/Apache_Flink
Page Rank Definition and Algorithm Explanation - https://www.geeksforgeeks.org/page-rank-algorithm-implementation/
PageRank Example - https://thinkinfi.com/page-rank-algorithm-and-implementation-in-python/
