python3 get_commit_hash.py repos.txt
It will traverse all repos listed in repos.txt.
steps:
1. traverse the repo names
2. git clone the repos
3. through the GitHub API, obtain closed pull requests related to bug issues
4. filter the PRs by keyword
   keywords = ["fix", "defect", "error", "bug", "issue", "mistake", "incorrect", "fault", "flaw"]
   reference: Joshua Garcia, Yang Feng, Junjie Shen, Sumaya Almanee, Yuan Xia, and Qi Alfred Chen. 2020. A comprehensive study of autonomous vehicle bugs. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE '20). Association for Computing Machinery, New York, NY, USA, 385–396. https://doi.org/10.1145/3377811.3380397
5. get the corresponding commits of each PR
6. write the commit hashes
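The keyword filtering step above could be sketched as follows. This is a minimal sketch, not the code of get_commit_hash.py: it assumes each pull request is a dict with `title` and `body` fields (as in GitHub REST API responses), and the function names `is_bug_fix_pr` and `filter_prs` as well as the rule "a keyword must start a word" are illustrative assumptions.

```python
import re

# Keyword list from the step above (Garcia et al., ICSE '20).
KEYWORDS = ["fix", "defect", "error", "bug", "issue", "mistake",
            "incorrect", "fault", "flaw"]

# Assumption: a keyword counts only at the start of a word, so "fixes"
# matches but "prefix" and "default" do not.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, KEYWORDS)) + ")", re.IGNORECASE
)


def is_bug_fix_pr(pr):
    """Return True if the PR title or body mentions one of the keywords."""
    text = f"{pr.get('title') or ''}\n{pr.get('body') or ''}"
    return _PATTERN.search(text) is not None


def filter_prs(prs):
    """Keep only the PRs that look bug-related."""
    return [pr for pr in prs if is_bug_fix_pr(pr)]


# Hypothetical sample data, shaped like GitHub pull request objects.
prs = [
    {"number": 1, "title": "Fix panic on empty input", "body": None},
    {"number": 2, "title": "Add prefix option", "body": "new feature"},
    {"number": 3, "title": "Refactor parser", "body": "resolves an incorrect offset"},
]
print([pr["number"] for pr in filter_prs(prs)])  # [1, 3]
```

Matching at word starts is a design choice that trades a little recall for fewer false positives; a plain substring test would also keep PR 2 because "prefix" contains "fix".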
root_dir: root directory of the commit hash files.
pydriller: get commits from a repo, and resolve the commits to get the changed methods
- read the repo names
- traverse the commits
- search for changed .rs files (ignore files' add/delete)
- store the changed code at the levels: repo / commit / file / method_before, method__after
- each commit's changes are limited to within 6 statements
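The traversal described in the list above might look like the sketch below. It is written under assumptions: the helper names `is_modified_rust_file` and `collect_changed_methods` are hypothetical, the result layout is only one possible shape of the repo / commit / file / method hierarchy, and method detection for Rust depends on the source analyzer that PyDriller relies on. The 6-statement limit is not implemented here.

```python
def is_modified_rust_file(change_type_name, filename):
    """Keep only .rs files modified in place (file additions/deletions are ignored)."""
    return change_type_name == "MODIFY" and filename.endswith(".rs")


def collect_changed_methods(repo_path, commit_hashes):
    """Map commit hash -> file name -> names of the methods changed in that file.

    `repo_path` is a local clone; `commit_hashes` is the list written by the
    previous stage.
    """
    # Imported lazily so the filter above works without PyDriller installed.
    from pydriller import Repository

    results = {}
    repo = Repository(repo_path, only_commits=commit_hashes)
    for commit in repo.traverse_commits():
        for mf in commit.modified_files:
            if not is_modified_rust_file(mf.change_type.name, mf.filename):
                continue
            names = [m.name for m in mf.changed_methods]
            if names:
                results.setdefault(commit.hash, {})[mf.filename] = names
    return results
```

The before/after source of each file is available as `mf.source_code_before` and `mf.source_code`, which is what a method_before / method__after split would be cut from.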
- code/difftastic
  - code diff
  - transfer diff results to vectors: one vector per node
    [repo, commit_hash, change type, parent node type, grandparent type]

python3 process.py vector.csv
- preprocess the results of process.py and generate the feature vectors of commits
  - one commit maps to multiple nodes
    [change type, parent node type, grandparent type]
  - dimensions: $$ \mathrm{len}(\text{Change Type}) \times \mathrm{len}(\text{Parent Node Type}) \times \mathrm{len}(\text{Grandparent Node Type}) $$
- clustering
  - HAC
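One way to turn the per-node triples of a commit into a single vector of the stated dimensionality, and then cluster the commits, is sketched below. This is an illustration under assumptions, not the project's code: the function names are hypothetical, each vector entry is assumed to be a count of nodes with that triple (a 0/1 indicator would work the same way), and average linkage from SciPy is only one possible HAC configuration.

```python
from collections import Counter
from itertools import product


def build_index(change_types, parent_types, grandparent_types):
    """Assign one vector position to every (change, parent, grandparent) triple."""
    combos = product(change_types, parent_types, grandparent_types)
    return {combo: i for i, combo in enumerate(combos)}


def commit_vector(nodes, index):
    """Count how often each triple occurs among the changed nodes of one commit."""
    vec = [0] * len(index)
    for combo, n in Counter(nodes).items():
        vec[index[combo]] = n
    return vec


def cluster(vectors, n_clusters):
    """Hierarchical agglomerative clustering (assumes SciPy is installed)."""
    from scipy.cluster.hierarchy import fcluster, linkage

    tree = linkage(vectors, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust").tolist()


# Tiny hypothetical vocabulary: 2 x 2 x 2 = 8 dimensions.
index = build_index(
    ["insert", "delete"],
    ["block", "arguments"],
    ["function_item", "call_expression"],
)
nodes = [
    ("insert", "block", "function_item"),
    ("insert", "block", "function_item"),
    ("delete", "arguments", "call_expression"),
]
vec = commit_vector(nodes, index)
print(len(vec), vec)  # 8 [2, 0, 0, 0, 0, 0, 0, 1]
```

With real data the vocabulary is the set of node types observed across all commits, so most entries of each vector are zero.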
Feature vectors
- change type:
- insert
- delete
- context:
- use TreeCursor to traverse the tree_sitter::Tree
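The context of a node (its parent and grandparent node types) can be collected in one depth-first walk. The sketch below shows the idea on a stand-in `Node` class, which is an assumption made to keep the example self-contained; with tree-sitter the same walk would be driven by a TreeCursor (goto_first_child, goto_next_sibling, goto_parent) over the parsed tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a syntax tree node: a kind plus children."""
    kind: str
    children: list = field(default_factory=list)


def contexts(root):
    """Return (node kind, parent kind, grandparent kind) in pre-order."""
    out = []
    stack = [(root, None, None)]
    while stack:
        node, parent, grandparent = stack.pop()
        out.append((node.kind, parent, grandparent))
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, node.kind, parent))
    return out


tree = Node("function_item", [
    Node("block", [
        Node("let_declaration", [Node("identifier")]),
    ]),
])
for row in contexts(tree):
    print(row)
# ('function_item', None, None)
# ('block', 'function_item', None)
# ('let_declaration', 'block', 'function_item')
# ('identifier', 'let_declaration', 'block')
```

Carrying the parent and grandparent kinds on the stack avoids walking back up the tree for every node.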
common nodes
- field_expression
- arguments
- token_tree
- scoped_identifier
- let_declaration
- block
- non_special_punctuation
- type_arguments
    fn set_outer_position(&self, pos: PhysicalPosition<i32>) -> Result<(), MatchAccountOwnerError>
- tuple_struct_pattern: if let/match
    let v = Some(5); if let Some(n) = v { println!("{}", n); }
- match_arm
- macro_invocation
- binary_expression
- expression_statement
- function_item
- reference_item
- meta_item:
    #[derive(Debug, Display)]
- parameters
- parameter
- meta_arguments
- call_expression
- closure_parameters
    grid.clear(|c| c.reset(&template));
- tuple_pattern
- MatchedPos
    pub struct MatchedPos {
        pub kind: MatchKind,
        pub pos: SingleLineSpan,
    }

    pub enum MatchKind {
        UnchangedToken {
            highlight: TokenKind,
            self_pos: Vec<SingleLineSpan>,
            opposite_pos: Vec<SingleLineSpan>,
        },
        Novel {
            highlight: TokenKind,
        },
        NovelLinePart {
            highlight: TokenKind,
            self_pos: SingleLineSpan,
            opposite_pos: Vec<SingleLineSpan>,
        },
        NovelWord {
            highlight: TokenKind,
        },
        Ignored {
            highlight: TokenKind,
        },
    }

    pub struct SingleLineSpan {
        /// All zero-indexed.
        pub line: LineNumber,
        pub start_col: u32,
        pub end_col: u32,
    }
- 3 data structures in Difftastic:
  - Tree node
  - Syntax node (Enum)
      pub enum Syntax<'a> {
          List {
              info: SyntaxInfo<'a>,
              open_position: Vec<SingleLineSpan>,
              open_content: String,
              children: Vec<&'a Syntax<'a>>,
              close_position: Vec<SingleLineSpan>,
              close_content: String,
              num_descendants: u32,
          },
          Atom {
              info: SyntaxInfo<'a>,
              position: Vec<SingleLineSpan>,
              content: String,
              kind: AtomKind,
          },
      }
  - MatchedPos
-